Preview Quiz.md

# Classification 1 Quiz

This quiz is part of Algoritma Academy assessment process. Congratulations on completing the first Classification in Machine Learning course! We will conduct an assessment quiz to test practical classification model techniques you have learned on the course. The quiz is expected to be taken in the classroom, please contact our team of instructors if you missed the chance to take it in class.

To complete this assignment, you are required to build your classification model to classify the characteristics of employees who have resigned and have not. Use Logistic Regression and k-Nearest Neighbor algorithms by following these steps:

# Data Exploration

Let us start by preparing and exploring the data first. In this quiz, you will be using the turnover of employee data (`turnover`). The data is stored as a .csv format in this repository as `turnover_balance.csv` file. Import your data using `read.csv` or `read_csv` and save as `turnover` object. Before building your classification model, you will need to perform an exploratory analysis to understand the data. Glimpse the structure of our `turnover` data! You can choose either `str()` or `glimpse()` function.

``````# your code here
``````

Turnover data consists of 10 variables and 7.142 rows. This dataset is a human resource data that shows historical data of employee characteristics who will resign or not. Below is more information about the variable in the dataset:

• `satisfaction_level`: the level of employee satisfaction working in a company
• `last_evaluation`: employee satisfaction level at the last evaluation
• `number_project`: the number of projects the employee has received
• `average_monthly_hours`: average hours worked per month
• `time_spend_company`: length of time in the company (years)
• `work_accident`: presence or absence of work accident, 0 = none, 1 = there
• `promotion_last_5years`: ever got a promotion in the last 5 years, 0 = no, 1 = yes
• `division`: name of department or division
• `salary`: income level, divided into low, medium and high
• `left`: employee history data resigned, 0 = no, 1 = yes

In this quiz, we will try to predict whether or not the employee has a resignation tendency using the `left` column as our target variable. Please change the class of `Work_accident`, `left`, and `promotion_last_5years` column to be in factor class as it should be.

``````# your code here
``````

For example, as HR, we are instructed to investigate the division that has a long history of an employee resigning based on average monthly hours. Let's do some aggregation of `average_monthly_hours` for each division. Because you only focused at the employee who left, you should filter the historical data with the condition needed. You can use `filter` then `group_by()` function by `division` variable and `summarise()` the mean of `average_monthly_hours` variable and arrange it by the highest of the mean value of `average_monthly_hours` using `arrange()` function.

``````# your code here
``````

1. Based on the aggregation data that you have analyzed, which division has the highest average of monthly hours?
• [ ] Marketing division
• [ ] Technical division
• [ ] Sales division
• [ ] Accounting division

# Data Preprocessing

After conducting the data exploratory, we will go ahead and perform preprocessing steps before building the classification model. Before we build the model, let us take a look at the proportion of our target variable in the `left` column using `prop.table(table(data))` function.

``````# your code here
``````

It seems like our target variable has a balance proportion between both classes. Before we build the model, we should split the dataset into train and test data in order to perform model validation. Split `turnover` dataset into 80% train and 20% test proportion using `sample()` function and use `set.seed()` with the seed 100. Store it as a `train` and `test` object.

Notes: Make sure you use `RNGkind()` before splitting

``````RNGkind(sample.kind = "Rounding")
set.seed(100)
``````

Let's take a look distribution of proportion in `train` and `test` data using `prop.table(table(data))` function to make sure in train and test data has balance or not distribution of each class target. Please round the proportion using two decimal numbers using the `round()` function.

``````# your code here
``````

1. Based on the proportions of `train` and `test`, can the distribution of each class be considered balanced? Why do we need to ensure that each class has a balanced proportion especially in the training data set?
• [ ] No, it is not.
• [ ] Yes, it is, but it is not necessary to balance the class proportion.
• [ ] No, it is not. The distribution of each class needs to be balanced to prevent any misclassified observation.
• [ ] Yes, it is. The distribution of each class in training set data needs to be balanced so when doing model fitting, the algorithm can learn the characteristics for each class equally.

# Logistic Regression Model Fitting

After we have split our dataset in train and test set, let's try to model our `left` variable using all of the predictor variables to build a logistic regression. Please use the `glm(formula, data, family = "binomial")` to do that and store your model under the `model_logistic` object. Remember, we are not using `turnover` dataset any longer, and we will be using `train` dataset instead.

``````# model_logistic <- glm()
``````

Based on the `model_logictic` you have made above, take a look at the summary of your model using `summary()` function.

``````# your code here
``````

1. Logistic regression is one of interpretable model. We can explain how likely each variable are predicted to the class we observed. Based on the model summary above, what can be interpreted from the `Work_accident` coeficient?
• [ ] The probability of an employee that had a work accident not resigning is 0.21.
• [ ] Employee who had a work accident is about 0.21 more likely to resign than the employee who has not.
• [ ] Employee who had a work accident is about 1.57 less likely to resign than the employee who has not.

# K-Nearest Neighbor Model Fitting

Now let's try to explore the classification model using the k-Nearest Neighbor algorithm. In the k-Nearest Neighbor algorithm, we need to perform one more step of data preprocessing. For both our `train` and `test` set, drop the categorical variable from each column except our `left` variable. Separate the predictor and target in-out `train` and `test` set.

``````# predictor variables in `train`
train_x <-

# predictor variables in `test`
test_x <-

# target variable in `train`
train_y <-

# target variable in `test`
test_y <-
``````

Recall that the distance calculation for kNN is heavily dependent upon the measurement scale of the input features. If any variable that have high different range of value could potentially cause problems for our classifier, so let's apply normalization to rescale the features to a standard range of values.

To normalize the features in `train_x`, please using `scale()` function. Meanwhile, in testing set data, please normalize each features using the attribute center and scale of `train_x` set data.

Please look up to the following code as an example to normalize `test_x` data:

``````scale(data_test, center = attr(data_train, "scaled:center"),
scale = attr(data_train, "scaled: scale"))
``````

Now it's your turn to try it in the code below:

``````# your code here

# scale train_x data
train_x <- scale()

# scale test_x data
test_x <- scale()
``````

After we have done performing data normalizing, we need to find the right K to use for our K-NN model. In practice, choosing k depends on the difficulty of the concept to be learned and the number of records in the training set data.

1. The method for getting K value, does not guarantee you to get the best result. But, there is one common practice for determining the number of K. What method can we use to choose the number of k?
• [ ] square root by number of row
• [ ] number of row
• [ ] use k = 1

After answering the questions above, please find the number of k in the following code:

Hint: If you have got a decimal number, do not forget to round it and make sure you end up with an odd number to prevent voting tie break.

``````# your code here
``````

Using K value, we have calculated in the section before, try to predict `test_y` using `train_x` dan `train_y` dataset. To make the k-nn model, please use the `knn()` function and store the model under the `model_knn` object.

Next, please look up at the following code:

``````library(class)
model_knn <- knn(train = ______, test = ________, cl = _______, k = _____)
``````

1. Fill the missing code here based on the picture above and choose the right code for building the knn model!
• [ ] model_knn <- knn(train = train_y, test = test_y, cl = test_y, k = 75)
• [ ] model_knn <- knn(train = train_x, test = test_y, cl = test_x, k = 89)
• [ ] model_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 75)
• [ ] model_knn <- knn(train = train_x, test = train_y, cl = train_x, k = 89)

# Prediction

Now let's get back to our `model_logistic`. In this section, try to predict `test` data using `model_logistic` return the probability value using `predict()` function with `type = "response"` in the parameter function and store it under `prob_value` object.

``````prob_value <-
``````

Because the prediction results in the logistic model are probabilities, we have to change them to categorical / class according to the target class we have. Now, given a threshold of 0.45, try to classify whether or not an employee can be predicted to resign. Please use `ifelse()` function and store the prediction result under the `pred_value` object.

``````pred_value <-
``````

Based on the prediction value above, try to answer the following question.

1. In the prescriptive analytics stage, the prediction results from the model will be considered for business decision making. So, please take your time to check the prediction results. How many predictions do our `model_logistic` generate for each class?
• [ ] class 0 = 714, class 1 = 715
• [ ] class 0 = 524, class 1 = 905
• [ ] class 0 = 590, class 1 = 839

# Model Evaluation

In the previous sections, we have performed a prediction using both Logistic Regression and K-NN algorithm. However, we need to validate whether or not our model did an excellent job of predicting unseen data. In this step, try to make the confusion matrix of model performance in the logistic regression model based on `test` data and `pred_value` and use the positive class is "1".

Note: do not forget to do the explicit coercion `as.factor()`.

``````# your code here
``````

Make the same confusion matrix for `model_knn` prediction result of `test_y`.

``````# your code here
``````

Let's say that we are working as an HR staff in a company and are utilizing this model to predict the probability of an employee resigning. As an HR, we would want to know which employee has a high potential of resigning so that we can take a precautionary approach as soon as possible. Now try to answer the following questions.

1. Which one is the right metric for us to evaluate the numbers of resigning employees that we can detect?
• [ ] Recall
• [ ] Specificity
• [ ] Accuracy
• [ ] Precision

1. Using the metrics of your answer in the previous question, which of the two models has a better performance in detecting resigning employees?
• [ ] Logistic Regression
• [ ] K-Nearest Neighbor
• [ ] Both has more or less similar performance

1. Now, recall what we have learned the advantage of each model. Which one is more suitable to use if we aimed for model interpretability?
• [ ] K-NN, because it tends to have a higher performance than logistic regression
• [ ] Logistic regression, because it has a lower performance than K-nn
• [ ] Logistic regression, because each coefficient can be transformed into an odds ratio
• [ ] K-NN, because it results in a better precision score for the positive class

Quiz
You need to score 7 out of a possible 9 to earn a badge.
You have 1 attempt. Only your highest score will be taken into account.
• ###### Quiz 1

• Based on the aggregation data that you have analyzed, which division has the highest average of monthly hours?
• Question worth 1 point

• ###### Quiz 2

• Based on the proportions of `train` and `test`, can the distribution of each class be considered balanced? Why do we need to ensure that each class has a balanced proportion especially in the training data set?
• Question worth 1 point

• ###### Quiz 3

• Logistic regression is one of interpretable model. We can explain how likely each variable are predicted to the class we observed. Based on the model summary above, what can be interpreted from the `Work_accident` coeficient?
• Question worth 1 point

• ###### Quiz 4

• The method for getting K value, does not guarantee you to get the best result. But, there is one common practice for determining the number of K. What method can we use to choose the number of k?
• Question worth 1 point

• ###### Quiz 5

• Fill the missing code here based on the picture above and choose the right code for building the knn model!
• Question worth 1 point

• ###### Quiz 6

• In the prescriptive analytics stage, the prediction results from the model will be considered for business decision making. So, please take your time to check the prediction results. How many predictions do our `model_logistic` generate for each class?
• Question worth 1 point

• ###### Quiz 7

• Which one is the right metric for us to evaluate the numbers of resigning employees that we can detect?
• Question worth 1 point

• ###### Quiz 8

• Using the metrics of your answer in the previous question, which of the two models has a better performance in detecting resigning employees?
• Question worth 1 point

• ###### Quiz 9

• Now, recall what we have learned the advantage of each model. Which one is more suitable to use if we aimed for model interpretability?
• Question worth 1 point

Recipients 219