# Classification 2 Quiz

This quiz is part of the Algoritma Academy assessment process. Congratulations on completing the second Classification in Machine Learning course! This quiz assesses the practical classification techniques you learned in the course. It is expected to be taken in the classroom; please contact our team of instructors if you missed the chance to take it in class.

In this quiz, you will use the loan dataset, stored in this repository in CSV format as `loan.csv`. To complete this assignment, you will build classification models using the naive Bayes, decision tree, and random forest algorithms by following these steps:

# Data Exploration

Before we jump into modelling, we start with data exploration. Load the data from `loan.csv`, assign it to the `loan` object, and inspect its structure using the `str()` or `glimpse()` function.

```r
# your code here
```

Based on our investigation above, the loan data consists of 1000 observations and 17 variables. It contains historical data on a bank's customers and whether or not they defaulted. The features are described below:

• `checking_balance` and `savings_balance`: Status of existing checking/savings account
• `credit_history`: Between critical, good, perfect, poor and very good
• `purpose`: Between business, car(new), car(used), education, furniture, and renovations
• `employment_duration`: Present employment since
• `percent_of_income`: Installment rate in percentage of disposable income
• `years_at_residence`: Present residence since
• `other_credit`: Other installment plans (bank/store)
• `housing`: Between rent, own, or for free
• `job`: Between management, skilled, unskilled and unemployed
• `dependents`: Number of people being liable to provide maintenance for
• `phone`: Between none and yes (registered under customer name)

As a data scientist, you will develop a model that aids management in their decision-making process. The first thing we need to know is what kind of business question we would like to answer. Loans are risky, but they are also a product that generates profit for the institution through the difference between borrowing and lending rates. Identifying risky customers is therefore one way to minimize lender losses. From there, we will try to predict the `default` variable using the given set of predictors.

Before going through the modeling section, take your time to explore the data. Let's say we are asked to investigate the historical number of defaulting customers for each loan purpose. Please do some data aggregation to get the answer.

Hint: Because we focus only on the debtors/borrowers who defaulted, first filter the historical data with the required condition (`default == "yes"`).

```r
# your code here
```
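As a minimal sketch of the filter-then-count pattern in the hint, the snippet below uses a small made-up data frame and base R only. The object `toy_loan` and all of its values are hypothetical and are not taken from `loan.csv`; a dplyr pipeline with `filter()` and `count()` would work equally well.

```r
# Hypothetical toy data: the values are invented for illustration only
toy_loan <- data.frame(
  purpose = c("car", "car", "education", "business", "education", "car"),
  default = c("yes", "no", "yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Keep only the rows where the customer defaulted
toy_defaulted <- toy_loan[toy_loan$default == "yes", ]

# Count defaults per purpose and sort from most to least frequent
default_counts <- sort(table(toy_defaulted$purpose), decreasing = TRUE)
print(default_counts)
```

The same two steps (filter on the condition, then count by group) apply to the real `loan` object.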

1. Based on the exploration above, which loan purpose defaults most often?
• [ ] furniture/appliances
• [ ] car
• [ ] education

# Cross-Validation

Before we build our model, we should split the dataset into training and test data. In this step, please split the data into 80% training and 20% test data using the `sample()` function with `set.seed(100)`, and store the results as `data_train` and `data_test`.

Note: Make sure you call `RNGkind()` before splitting.

```r
RNGkind(sample.kind = "Rounding")
set.seed(100)
```
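One common way to do such a split with `sample()` is to draw row indices for the training set and use the remaining rows as the test set. The sketch below does this on the built-in `iris` data rather than on `loan`; the names `idx`, `toy_train`, and `toy_test` are placeholders chosen for this example.

```r
# Use the older sampling behaviour, as in the quiz.
# R warns that this sampler is non-uniform, so the warning is suppressed here.
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(100)

# Draw 80% of the row indices at random, without replacement
n <- nrow(iris)
idx <- sample(n, size = 0.8 * n)

toy_train <- iris[idx, ]   # rows selected for training
toy_test  <- iris[-idx, ]  # all remaining rows

nrow(toy_train)  # 120
nrow(toy_test)   # 30
```

Negative indexing (`-idx`) guarantees that no row appears in both sets.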

Let's take a look at the class proportions in the training and test data using `prop.table(table(object$target))` to check whether the target classes are balanced in each set.

```r
# your code here
```

Based on the proportions of the target variable above, we can conclude that our target variable is imbalanced; hence we have to balance it before using it in our models. One of the most important things to keep in mind is that all subsampling operations must be applied only to the training dataset. So please apply it to `data_train` using the `downSample()` function from the caret package.

Note: set the argument `yname = "default"`.

```r
# your code here
```
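To illustrate what downsampling does, here is a base R sketch that is not caret's implementation: every class is randomly reduced to the size of the smallest class. The data frame `toy` and the helper `down_sample_sketch()` are hypothetical names made up for this example; in the quiz itself you would call `downSample()` as instructed.

```r
set.seed(100)

# Hypothetical imbalanced toy data: 8 "no" versus 4 "yes"
toy <- data.frame(
  amount  = 1:12,
  default = factor(rep(c("no", "yes"), times = c(8, 4)))
)
prop.table(table(toy$default))  # class proportions before balancing

# Randomly keep as many rows per class as the smallest class has
down_sample_sketch <- function(data, target) {
  n_min <- min(table(data[[target]]))
  rows_by_class <- split(seq_len(nrow(data)), data[[target]])
  keep <- unlist(lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), n_min)]
  }))
  data[sort(keep), ]
}

toy_balanced <- down_sample_sketch(toy, "default")
prop.table(table(toy_balanced$default))  # both classes now have an equal share
```

Running `prop.table(table(...))` before and after is a quick way to confirm that the balancing step did what was intended.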

# Decision Tree

After splitting our data into `data_train` and `data_test`, let us build our first model, a decision tree, using the `ctree()` function and store it under the `model_dt` object. To tune our model, let us set the argument `mincriterion = 0.10`.

```r
# your code here
```

2. In our decision tree model, the goal of setting `mincriterion = 0.10` is ...
• [ ] To prune our model: a node is allowed to split only if its p-value is at most 0.10
• [ ] To prune our model: a node is allowed to split only if its p-value is at most 0.90
• [ ] To prune our model: each terminal node may hold at most 10% of the data

To get a better grasp of our model, please plot it using the `type = "simple"` argument.

```r
# your code here
```

# Predict on data train and test

The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen. Two terms describe how well a machine learning model learns and generalizes to new data: overfitting and underfitting. To validate whether our model fits well, please predict on `data_train` and `data_test` with `model_dt` using the `predict()` function and choose `type = "response"`.

```r
# pred_train_dt <-
# pred_test_dt <-
```

# Model Evaluation

The last part of building the model is model evaluation. To check the performance, we can use `confusionMatrix()`. Please create two confusion matrices, one using our in-sample data (`data_train`) and one using our out-of-sample data (`data_test`), and compare the two performances.
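
The comparison itself can be illustrated without any model. The base R sketch below builds a confusion table with `table()` and computes accuracy for two made-up sets of predictions; all vectors and the helper `accuracy()` are hypothetical, and `confusionMatrix()` from caret reports these and further metrics in one call.

```r
# Hypothetical actual labels and predictions, invented for illustration
lv <- c("no", "yes")
actual_train <- factor(c("yes", "yes", "no", "no", "yes", "no"), levels = lv)
pred_train   <- factor(c("yes", "yes", "no", "no", "yes", "yes"), levels = lv)
actual_test  <- factor(c("yes", "no", "no", "yes"), levels = lv)
pred_test    <- factor(c("no", "no", "yes", "yes"), levels = lv)

# Confusion table: rows are predictions, columns are actual labels
table(prediction = pred_train, actual = actual_train)

# Accuracy is the share of predictions that match the actual label
accuracy <- function(pred, actual) mean(pred == actual)

acc_train <- accuracy(pred_train, actual_train)  # 5 of 6 correct
acc_test  <- accuracy(pred_test, actual_test)    # 2 of 4 correct

# Put the in-sample and out-of-sample values side by side
c(train = acc_train, test = acc_test)
```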

3. From the decision tree performance above, we can conclude that our decision tree model ...
• [ ] Is balanced overall
• [ ] Tends to overfit
• [ ] Tends to underfit

# Random Forest

The second model that we want to build is a random forest. Let us explore the random forest model we have prepared in `model_rf.RDS`. It was built with the following settings:

• `set.seed(100)` # the seed number
• `number = 5` # the number of folds (k) in k-fold cross-validation
• `repeats = 3` # the number of times the cross-validation is repeated

In your environment, please load the random forest model (`model_rf.RDS`) and save it under the `model_rf` object using the `readRDS()` function.

```r
# your code here
```
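`readRDS()` restores a single R object that was written with `saveRDS()`. The sketch below shows the round trip with a small linear model and a temporary file; `toy_model`, `tmp_path`, and `restored_model` are placeholder names, and in the quiz the path would simply point at `model_rf.RDS`.

```r
# Fit a small model on a built-in dataset, only to have an object to save
toy_model <- lm(mpg ~ wt, data = mtcars)

# Write the object to a temporary .RDS file
tmp_path <- tempfile(fileext = ".RDS")
saveRDS(toy_model, tmp_path)

# Read it back; the result can be assigned to any name
restored_model <- readRDS(tmp_path)

# The restored model has the same coefficients as the original
all.equal(coef(toy_model), coef(restored_model))
```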

4. Which of the following is NOT among the advantages of the random forest model?
• [ ] Automatic feature selection
• [ ] It is a relatively fast model compared to a decision tree
• [ ] It reduces bias in a model because it aggregates multiple decision trees
• [ ] It generates an unbiased estimate of the out-of-bag error

Now check the summary of the final model we built using `model_rf$finalModel`.

```r
# your code here
```

In practice, a random forest already provides an out-of-bag (OOB) estimate, which represents an unbiased estimate of its accuracy on unseen data.

5. Based on the `model_rf$finalModel` summary above, the out-of-bag error rate of our model is 33.61%. What does this mean?
• [ ] We have a 33.61% error on our unseen data
• [ ] We have a 33.61% error on our train data
• [ ] We have a 33.61% error on our loan data
• [ ] We have a 33.61% error on our in-sample data

# Predicting the test data

After building the model, we can now predict on `data_train` and `data_test` using `model_rf`. Please choose `type = "raw"` to obtain class predictions.

```r
# your code here
```

# Model evaluation

Next, let us evaluate the performance of the random forest model with the `confusionMatrix()` function. How should you evaluate the model performance?

```r
# your code here
```

We can also use variable importance to get a list of the most important variables used in our random forest. Many would argue that a random forest, being a black-box model, offers no real information beyond its accuracy; in fact, paying attention to attributes such as variable importance often helps us gain valuable information about our data.

Please take your time to check which variables are most important to the prediction. We can check this using the `varImp()` function and pass the result to `plot()` to get the visualization.

6. From the plot you have created, which variable has the most influence on default?
• [ ] checking_balance
• [ ] months_loan_duration
• [ ] amount
• [ ] purpose

# Naive Bayes

The last model we want to compare is naive Bayes. There are several advantages to using this model, for example:

• The model is relatively fast to train
• It produces probabilistic predictions
• It can handle irrelevant features

7. Below are the characteristics of naive Bayes, EXCEPT
• [ ] It assumes that the predictor variables are independent of one another
• [ ] It assumes that the target and predictor variables are independent
• [ ] Skewness due to data scarcity

# Naive Bayes model fitting

Now let us build a naive Bayes model using the `naiveBayes()` function from the `e1071` package, setting the `laplace` parameter to 1. Store the model under `model_naive` before moving on to the next section.

```r
# your code here
```
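To see what a Laplace value of 1 does, the base R sketch below applies additive smoothing by hand to a made-up categorical predictor: 1 is added to every class-by-level count before the conditional probabilities are computed, so a combination never observed in training no longer gets probability zero. The data and object names are hypothetical, and this is a hand calculation of the general technique rather than a call to `naiveBayes()`.

```r
# Hypothetical toy data: no defaulting customer owns a house in this sample
default <- factor(c("no", "no", "no", "yes", "yes"))
housing <- factor(c("own", "own", "rent", "rent", "rent"))

counts <- table(default, housing)

# Without smoothing, P(housing = own | default = yes) is exactly 0
prob_raw <- counts / rowSums(counts)

# With Laplace smoothing of 1: add 1 to each count,
# and add the number of housing levels to each class total
laplace <- 1
prob_smooth <- (counts + laplace) / (rowSums(counts) + laplace * nlevels(housing))

prob_raw["yes", "own"]     # 0
prob_smooth["yes", "own"]  # 0.25
```

Each row of `prob_smooth` still sums to 1, so the smoothed values remain valid conditional probabilities.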

# Predict the naive bayes model, using data train and data test

Using the test dataset we created earlier, predict with `model_naive` using `type = "raw"` and store the result as `pred_naive`. The prediction returns, for each observation in the test dataset, the probability of each class, including the positive class.

```r
# your code here
```

Now, let's take a look at the ROC and AUC performance. Use the `prediction()` function from the `ROCR` package to compare the positive-class probabilities in `pred_naive` (`pred_naive[,2]`) with the actual data (`data_test$default`) and store the result as the `pred_roc` object.

```r
# your code here
```

Next, please use the `performance()` function from the ROCR package to define the axes and assign the result to a `perf` object. To use the `performance()` function, please define the arguments as below:

• `prediction.obj = pred_roc`
• `measure = "tpr"`
• `x.measure = "fpr"`

```r
perf <-
```

After you have created the `perf` object, plot the performance by passing it to the `plot()` function.

```r
# your code here
```

Try to evaluate the ROC curve; see if there are any undesirable results from our model. Next, take a look at the AUC using the `performance()` function with the arguments `prediction.obj = pred_roc` and `measure = "auc"`, then save it under the `auc` object.

```r
# your code here
```
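As a cross-check on what the AUC measures, it can be computed by hand as the share of (positive, negative) pairs in which the positive observation receives the higher score. The base R sketch below does this with the rank-sum formula on made-up scores; `scores`, `labels`, and `auc_by_hand` are hypothetical names, and ROCR is not used here.

```r
# Hypothetical predicted probabilities and true labels (1 = positive class)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)

# Rank-sum (Mann-Whitney) formula: rank all scores, sum the ranks of the positives
ranks <- rank(scores)
auc_by_hand <- (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

auc_by_hand  # 0.8125: in 13 of the 16 positive-negative pairs the positive scores higher
```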

8. From our naive Bayes model above, how do you interpret the AUC value?
• [ ] 72.22%, meaning the model performance is good, because the closer to 1 the better
• [ ] 72.22%, meaning the model performance is weak, because the closer to 0 the better
• [ ] 72.22%; the area under the ROC curve does not give any information about model performance

# Model Evaluation Performance

Lastly, check the performance of the naive Bayes model using `confusionMatrix()`, comparing the predictions with the actual labels.

```r
# your code here
```

# Models Comparison

9. As a data scientist in a financial institution, we are required to generate a rule-based model that can be implemented in the existing system. What is the best model for us to pick?

• [ ] Naive Bayes because all the conditional probabilities are well calculated
• [ ] Decision Tree because a decision tree model is easily translatable to a set of rules
• [ ] Random Forest because it is possible to trace back the rules using variable importance information

10. Between all of the models, which model performs best at identifying all high-risk customers?

• [ ] Naive Bayes
• [ ] Decision Tree
• [ ] Random Forest

# Quiz

You need to score 7 out of a possible 10 to earn a badge. Each question is worth 1 point.
You have 1 attempt. Only your highest score will be taken into account.