Regression Model Quiz

This quiz is part of the Algoritma Academy assessment process. Congratulations on completing the Regression Model course! We will conduct an assessment quiz to test the practical regression model techniques that you have learned in the course. The quiz is expected to be taken in the classroom. Please contact our team of instructors if you missed the chance to take it in class.

Data Exploration

In this quiz, you will use a criminology dataset (crime). Run the following chunk in your RMarkdown to make sure you are using the same dataset:

crime <- read.csv("crime.csv")

To make sure you have loaded the data correctly, do a simple inspection of the data. Try to peek at it using head or tail and check that the columns have been stored with their appropriate data types.
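As a rough illustration of this kind of inspection, the sketch below uses R's built-in mtcars data as a stand-in, since crime.csv is not bundled here. The object names cars and col_types are hypothetical and not part of the quiz.

```r
# Illustrative only: the built-in mtcars data stands in for crime.csv.
cars <- mtcars

head(cars, 3)   # first three rows
tail(cars, 3)   # last three rows
str(cars)       # column names, data types, and a preview of the values

# Column classes as a named character vector, handy for spotting
# columns that were read in with an unexpected type
col_types <- sapply(cars, class)
col_types
```

The same calls should work on crime once it has been read in with read.csv.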

# your code here

Among the variables in our crime dataset is crime_rate, which measures the crime rate of each state in the United States in 1960. Imagine you are working as a government analyst and would like to see how socio-economic conditions relate to the crime rate of a state. Recall that we can inspect the correlation between variables using cor or ggcorr from the GGally package. Try it out on your own and see which variables are possible predictors for our crime_rate variable.
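A minimal sketch of a correlation check, again assuming mtcars as a stand-in dataset and mpg as a stand-in target (neither is part of the quiz data):

```r
# Illustrative only: mtcars stands in for crime, mpg for crime_rate.
cor_matrix <- cor(mtcars)

# Correlation of every variable with the target, strongest positive first
cor_with_mpg <- sort(cor_matrix[, "mpg"], decreasing = TRUE)
round(cor_with_mpg, 2)

# Variables whose absolute correlation with the target is weak are
# candidates to drop; the 0.3 cutoff is an arbitrary choice for this sketch
weak <- names(cor_with_mpg)[abs(cor_with_mpg) < 0.3]
weak

# If the GGally package is installed, a heatmap version is available:
# GGally::ggcorr(mtcars, label = TRUE)
```

Values near 0 suggest little linear relationship with the target, while values near 1 or -1 suggest a strong one.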

# your code here

Based on the result, you will know how each variable correlates with the crime_rate variable. Referring to that result, we can perform a preliminary variable selection to select suitable predictor variables.


  1. Which variable has little to no correlation with our crime_rate variable and might not be suitable as a predictor?
    • [ ] crime_rate
    • [ ] police_exp59
    • [ ] unemploy_m39
    • [ ] nonwhites_per1000

Building Linear Regression

From the data exploration process, we know that not all variables show a strong correlation with the crime_rate variable. Let's try to build a simple linear model using one of the highly correlated variables: police_exp59. Create a regression model using the lm() function to predict crime_rate using police_exp59 from our dataset and assign it to an object named model_crime. Check the summary of that model.
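For reference, fitting and reading a one-predictor model could look like the sketch below. It uses mtcars with hypothetical names (model_cars, slope, slope_pvalue, r_squared), so the numbers it prints say nothing about the crime data.

```r
# Illustrative only: predict mpg from wt in the built-in mtcars data.
model_cars <- lm(mpg ~ wt, data = mtcars)
summary(model_cars)

# Pull out the pieces that a summary is usually read for
slope        <- coef(model_cars)[["wt"]]
slope_pvalue <- summary(model_cars)$coefficients["wt", "Pr(>|t|)"]
r_squared    <- summary(model_cars)$r.squared

slope          # sign gives the direction of the relationship
slope_pvalue   # compare against 0.05 for significance
r_squared      # share of variance explained, between 0 and 1
```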

# your code here

  1. Which of the following best describes the slope?

    • [ ] It's a negative slope, and is statistically insignificant (P-value higher than 0.05)
    • [ ] It's a positive slope, and is statistically significant (P-value lower than 0.05)
    • [ ] It's a positive slope, and is statistically insignificant (P-value higher than 0.05)
    • [ ] It's a negative slope, and is statistically significant (P-value lower than 0.05)
  2. What is the most fitting conclusion from the regression model above?

    • [ ] The R-squared does not tell us about the quality of our model fit, we should use p-value instead
    • [ ] The R-squared approximates 0.44, indicating a reasonable fit (the closer to 0 the better)
    • [ ] The R-squared approximates 0.44, indicating a poor fit (the closer to 1 the better)

Feature Selection using Stepwise Regression

The R-squared of model_crime is approximately 0.44, which ideally should be improved upon, for example by adding more predictor variables. One technique for selecting predictor variables is the stepwise regression algorithm. Use the step() function with direction="backward", starting from a model that includes all predictors, and store the best model under the model_step object.
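A hedged sketch of backward elimination, assuming mtcars as a stand-in and the hypothetical names model_all and model_back. Backward selection starts from the full model and drops predictors while doing so lowers the AIC.

```r
# Illustrative only: backward stepwise selection on mtcars.
# Start from a model that uses every other column as a predictor.
model_all <- lm(mpg ~ ., data = mtcars)

# trace = 0 suppresses the step-by-step log
model_back <- step(model_all, direction = "backward", trace = 0)

formula(model_back)                 # predictors that survived
summary(model_back)$adj.r.squared   # adjusted R-squared of the reduced model
```

Adjusted R-squared is the more useful figure when comparing models with different numbers of predictors, because plain R-squared never decreases as predictors are added.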

# your code here

  1. Based on the summary of your final model, which statement is incorrect?
    • [ ] An increase of 1 of police_exp60 causes the value of crime_rate to increase by 10,265
    • [ ] An increase of 1 of unemploy_m24 causes the crime_rate to decrease by 6,087
    • [ ] An increase of 1 of mean_education causes the value of crime_rate to decrease by 18.01
    • [ ] Adjusted R-squared is a better metric for evaluating our model compared to Multiple R-squared

Shapiro-Wilk Test for Normality

One of the assumptions of linear regression states that the errors of the model must be normally distributed around a mean of 0. You will need to validate the normality assumption for model_step using the shapiro.test() function. This function requires us to pass in the residuals of our model.

# your code here

For your reference, the Shapiro-Wilk test uses the following hypotheses:

$H_0$ : Error is distributed normally

$H_1$ : Error is not distributed normally
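The decision rule implied by these hypotheses can be sketched as follows, using a stand-in model on mtcars (model_cars, normality, and decision are hypothetical names):

```r
# Illustrative only: normality check on the residuals of a stand-in model.
model_cars <- lm(mpg ~ wt + hp, data = mtcars)

normality <- shapiro.test(residuals(model_cars))
normality

# Reject H0 (normal errors) only when the p-value falls below 0.05
decision <- if (normality$p.value > 0.05) {
  "fail to reject H0: no evidence against normally distributed errors"
} else {
  "reject H0: errors do not appear normally distributed"
}
decision
```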


  1. Based on the Shapiro test you have performed, what conclusion can be drawn from the result?
    • [ ] Error is distributed normally (P-value higher than 0.05)
    • [ ] Error is distributed normally (P-value lower than 0.05)
    • [ ] Error is not distributed normally (P-value higher than 0.05)
    • [ ] Error is not distributed normally (P-value lower than 0.05)

Breusch-Pagan Test for Heteroscedasticity

Another assumption you need to test is whether the error of our model is homoscedastic. Homoscedastic means the error has equal variance across different ranges of the data. To test this behavior, you can use the bptest function from the lmtest package and pass in our model.

For your reference, the Breusch-Pagan test uses the following hypotheses:

$H_0$: Error is considered Homoscedastic

$H_1$: Error is considered Heteroscedastic
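To show what the test measures, the sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand in base R. It runs on a stand-in mtcars model, and names such as aux_model and bp_stat are hypothetical. The call to lmtest::bptest runs only if that package happens to be installed.

```r
# Illustrative only: Breusch-Pagan statistic computed by hand.
model_cars <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors.
# If the predictors explain the squared residuals well, the error
# variance changes with the predictors (heteroscedasticity).
aux_data  <- data.frame(sq_resid = residuals(model_cars)^2,
                        mtcars[, c("wt", "hp")])
aux_model <- lm(sq_resid ~ wt + hp, data = aux_data)

bp_stat   <- nrow(aux_data) * summary(aux_model)$r.squared  # n * R^2
bp_df     <- length(coef(aux_model)) - 1                    # number of predictors
bp_pvalue <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p.value = bp_pvalue)

# Optional cross-check against the packaged test
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(model_cars))
}
```

A p-value above 0.05 means H0 is not rejected, so there is no evidence of heteroscedasticity.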

# your code here

  1. Based on Breusch-Pagan test you have performed, what conclusion can be drawn from the result?
    • [ ] Heteroscedasticity is not present
    • [ ] Heteroscedasticity is present
    • [ ] The data spreads normally
    • [ ] There is no correlation between residuals and target variable

Variance Inflation Factor

Using the VIF value, we can determine whether there is multicollinearity between predictor variables. A high VIF value for a predictor indicates that it is highly correlated with the other predictors. You can use the vif function from the car package. Pass our model_step object into the function and see if there is multicollinearity in the model.
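As a sketch of what the VIF measures, each predictor can be regressed on the remaining predictors and its VIF computed as 1 / (1 - R²). The example assumes a stand-in mtcars model with only numeric predictors, and vif_manual is a hypothetical name.

```r
# Illustrative only: VIF computed by hand for a stand-in model.
predictors <- c("wt", "hp", "disp")
model_cars <- lm(reformulate(predictors, response = "mpg"), data = mtcars)

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Regress this predictor on all the other predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})
round(vif_manual, 2)

# Optional cross-check if the car package is installed
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(model_cars))
}
```

A VIF of 1 means the predictor is uncorrelated with the others. Values above 10 are a commonly used warning threshold.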

# your code here

  1. Based on the VIF value, which interpretation is correct?
    • [ ] inequality does not significantly affect crime_rate
    • [ ] An increase of 1 value on mean_education causes the value of crime_rate to increase by 4.1
    • [ ] Multicollinearity is not present in our model because the VIF values for all variables are below 10
    • [ ] Variables with multicollinearity should not be removed from the model

Predicting Unseen Data

You have performed statistical tests to check whether the model satisfies the assumptions of a linear regression model. Now imagine you were given a new dataset that records the same socio-economic variables for different observations. The data is stored in crime_test.csv. Read the data and store it in an object named crime_test. Next, predict the crime rate for that new data using model_step. You can store your prediction values in a new column of the crime_test data.
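The prediction step might look like the sketch below. Here mtcars is split into two parts to imitate a training file and a separate test file, and cars_train, cars_test, and pred_mpg are hypothetical names.

```r
# Illustrative only: imitate "train" and "unseen" data by splitting mtcars.
cars_train <- mtcars[1:24, ]
cars_test  <- mtcars[25:32, ]

model_cars <- lm(mpg ~ wt + hp, data = cars_train)

# newdata must contain every predictor column used by the model
cars_test$pred_mpg <- predict(model_cars, newdata = cars_test)

head(cars_test[, c("mpg", "pred_mpg")])
```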

# your code here

Now pay attention to the crime_test data. Among the variables you should see a crime_rate column describing the real crime rate for each observation. In the workshop you learned some metrics to evaluate model performance. Try to calculate the Mean Squared Error (MSE) of our model_step prediction. You can use the MSE function from the MLmetrics package by passing in the y_true and y_pred parameters.
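MSE is the mean of the squared differences between actual and predicted values, so it can also be computed directly in base R. The sketch below uses the same kind of stand-in mtcars split and hypothetical names (pred, mse_manual). The MLmetrics cross-check runs only if that package is installed.

```r
# Illustrative only: MSE of a stand-in model on held-out mtcars rows.
cars_train <- mtcars[1:24, ]
cars_test  <- mtcars[25:32, ]

model_cars <- lm(mpg ~ wt + hp, data = cars_train)
pred       <- predict(model_cars, newdata = cars_test)

# Mean of squared errors between actual and predicted values
mse_manual <- mean((cars_test$mpg - pred)^2)
round(mse_manual, 2)

# Optional cross-check with the packaged function
if (requireNamespace("MLmetrics", quietly = TRUE)) {
  print(MLmetrics::MSE(y_pred = pred, y_true = cars_test$mpg))
}
```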

# your code here

  1. What is the MSE value of the crime_test prediction? (round to two decimal places)
    • [ ] 55027.7
    • [ ] 46447.42
    • [ ] 45269.15

Quiz
You need to score 6 out of a possible 8 to earn a badge.
You have 1 attempt. Only your highest score will be taken into account.