Classification2 Download

5 upvotes
Visit on GitHub
Preview Quiz.md

Classification 2 Quiz

This quiz is part of Algoritma Academy assessment process. Congratulations on completing the second Classification in Machine Learning course! We will conduct an assessment quiz to test the practical classification model techniques that you have learned on the course. The quiz is expected to be taken in the classroom, please contact our team of instructors if you missed the chance to take it in class.

In this quiz, you will analyze the loan dataset which shows historical data of customers who are likely to default or not in a bank. The data stored in this repository as loan.csv. To complete this assignment, you will need to build classification models using Naive Bayes, Decision Tree, and Random Forest algorithms by following these steps:

Data Exploration

Before we jump into modeling, we will try to explore the data. Load the data given (loan.csv) and assign it to an object named loan, followed by investigating the data using str() or glimpse() function.

# your code here

Based on our investigation above, the loan data consists of 1000 observations and 17 variables. The description of each feature explained below:

  • checking_balance and savings_balance: Status of existing checking/savings account
  • months_loan_duration: Duration of the loan period in months
  • credit_history: Between critical, good, perfect, poor, and very good
  • purpose: Between business, car(new), car(used), education, furniture, and renovations
  • amount: Loan amount in DM (Deutsche Mark)
  • employment_duration: Length of time at current job
  • percent_of_income: Installment rate in percentage of disposable income
  • years_at_residence: Number of years at current residence
  • age: Customer's age
  • other_credit: Other installment plans (bank/store)
  • housing: Between rent, own, or for free
  • existing_loans_count: Number of ongoing loans
  • job: Between management, skilled, unskilled and unemployed
  • dependents: Number of people being liable to provide maintenance for
  • phone: Either no or yes (registered under customer name)
  • default: Either no or yes. A loan's default is considered as yes when it is defaulted, charged off, or past due date

You should also make sure that each column store the right data types. You can do data wrangling below if you need to.

Tips: You can also use parameter stringsAsFactors = TRUE from read.csv() so that all character column will automatically stored as factors.

# your code

As a data scientist, you will develop a model that aids management with their decision-making process. The first thing we need to know is what kind of business question we would like to solve. Loans are risky but at the same time it is also a product that generates profits for the institution through differential borrowing/lending rates. So identifying risky customers is one way to minimize lender losses. From there, we will try to predict using the given set of predictors and how we model the default variable.

Before we go through the modeling section, take your time to do the exploration step. Try to investigate the historical number of defaulted customers for each loan purpose. Please do some data aggregation to get the answer.

Hint: Because we only focused of the customers who defaulted, filter the data based on the condition needed (default == "yes")

# your code here

  1. Based on the exploration above, which purpose is most often to default?
    • [ ] Furniture/appliances
    • [ ] Car
    • [ ] Business
    • [ ] Education

Cross-Validation

Before we build our model, we should split the dataset into training and test data. Please split the data into 80% training and 20% test using sample() function, set.seed(100), and store it as data_train and data_test.

Notes: Make sure you use RNGkind() and set.seed() before splitting and run them together with your sample() code

RNGkind(sample.kind = "Rounding")
set.seed(100)
# your code here

Let's look at the proportion of our target classes in train data using prop.table(table(object$target)) to make sure we have a balanced proportion in train data.

# your code here

Based on the proportion above, we can conclude that our target variable can be considered imbalanced; hence we will have to balance the data before using it for our models. One important thing to be kept in mind is that all sub-sampling operations have to be applied only to training dataset. So please do it on data_train using the downSample() function from the caret package, and then store the downsampled data in data_train_down object. You also need to make sure that the target variable already stored in factor data type.

Notes: set the argument yname = "default"

library(caret)
set.seed(100)
# your code here
data_train_down <-

In the following step, please use data_train_down to build Naive Bayes, Decision Tree, and Random Forest model below.

Naive Bayes

After splitting our data into train and test set and downsample our train data, let us build our first model of Naive Bayes. There are several advantages in using this model, for example:

  • The model is relatively fast to train
  • It is estimating a probabilistic prediction
  • It can handle irrelevant features

  1. Below are the characteristics of Naive Bayes, EXCEPT
    • [ ] Assume that among the predictor variables are independent
    • [ ] Assume that between target and predictor variables are independent
    • [ ] Skewness due to data scarcity

Build a Naive Bayes model using naiveBayes() function from the e1071 package, then set the laplace parameter as 1. Store the model under model_naive before moving on to the next section.

library(e1071)
# your code here
model_naive <-

Naive Bayes Model Prediction

Try to predict our test data using model_naive and use type = "class" to obtain class prediction. Store the prediction under pred_naive object.

# your code here
pred_naive <-

Naive Bayes Model Evaluation

The last part of model building would be the model evaluation. You can check the model performance for the Naive Bayes model using confusionMatrix() and compare the predicted class (pred_naive) with the actual label in data_test. Make sure that you're using defaulted customer as the positive class (positive = "yes").

# your code here

Decision Tree

The next model we're trying to build is Decision Tree. Use ctree() function to build the model and store it under the model_dt object. To tune our model, let's set the parameter mincriterion = 0.90.

library(partykit)
set.seed(100)
# your code here
model_dt <-

  1. In our decision tree model, the goal of setting the mincriterion = 0.90 is ...
    • [ ] To prune our model, we let the tree that has p-value <= 0.90 to split the node
    • [ ] To prune our model, we let the tree that has p-value <= 0.10 to split the node
    • [ ] To prune our model, we let the tree that has a maximum of 10% of the data in the terminal nodes

To have a better grasp of our model, please try to plot the model and set type = "simple".

# your code here

  1. Based on the plot, which of the following interpretation is TRUE?
    • [ ] a customer who has checking_balance > 200 DM, with credit_history labelled "perfect", and saving_balance that is "unknown" is expected to default
    • [ ] a customer who has checking_balance 1-200 DM, with months_loan_duration < 21 is expected to default
    • [ ] a customer who has checking_balance that is "unknown", with other_credit consist of "store" is expected to default

Decision Tree Model Prediction

Now that we have the model, please predict towards the test data based on model_dt using predict() function and set the parameter type = "response" to obtain class prediction.

# your code here
pred_dt <-

Decision Tree Model Evaluation

We can use the confusionMatrix() to get our model performance. Make sure that you're using defaulted customer as the positive class (positive = "yes").

# your code here

Random Forest

The last model that we want to build is Random Forest. he following are among the advantages of the random forest model:

  • Reduce bias in a model as it aggregates multiple decision trees
  • Automatic feature selection
  • It generates an unbiased estimate of the out-of-box error

Now, let's explore the random forest model we have prepared in model_rf.RDS. The model_rf.RDS was built with the following hyperparameter:

  • set.seed(100) # the seed number
  • number = 5 # the number of k-fold cross-validation
  • repeats = 3 # the number of the iteration

In your environment, please load the random forest model (model_rf.RDS) and save it under the model_rf object using the readRDS() function.

# your code here
model_rf <-

Now check the summary of the final model we built using model_rf$finalModel.

library(randomForest)
# your code here

In practice, the random forest already have an out-of-bag estimates (OOB) that represent its accuracy on out-of-bag data (the data that is not sampled/used for building random forest).


  1. Based on the model_rf$finalModel summary above, how can we interpret the out-of-bag error rate from our model?
    • [ ] We have 33.61% error of our unseen data
    • [ ] We have 33.61% error of our train data
    • [ ] We have 33.61% error of our loan data
    • [ ] We have 33.61% error of our in-sample data

We could also use Variable Importance, to get a list of the most important variables used in our random forest. Many would argue that random forest, being a black box model, can offer no true information beyond its job in accuracy; actually paying special attention to attributes like variable importance for example often do help us gain valuable information about our data.

Please take your time to check which variable has a high influence to the prediction. You can use varImp() function and pass it to the plot() function to get the visualization.

# your code here

  1. From the plot you have created, which variable has the most influence to the prediction?
    • [ ] checking_balance
    • [ ] months_loan_duration
    • [ ] amount
    • [ ] purpose

Random Forest Model Prediction

After building the model, we can now predict the test data based on model_rf. You can use predict() function and set the parameter type = "raw" to obtain class prediction.

# your code here
pred_rf <-

Random Forest Model Evaluation

Next, let us evaluate the random forest model we built using confusionMatrix(). How should you evaluate the model performance?

# your code here

Another way of evaluating model performance is through its ROC and AUC value. To calculate it, we need the probability of a positive class for each observation. Let's try focusing on the ROC and AUC value from our random forest prediction. First, predict the test data using model_rf but now using the parameter type = "prob". The prediction will results in the probability values for each class. You can store the prediction in prob_test object.

# your code here
prob_test <-

Now, use the prediction() function from the ROCR package to compare the probability of positive class in prob_test[,"yes"] with the actual data data_test$default and store it as pred_roc object.

# your code here
library(ROCR)
pred_roc <-

Next, please use the performance() function from the ROCR package, define the axes, and assign it to a perf object. To use the performance() function, please define the arguments as below:

  • prediction.obj = pred_roc
  • measure = "tpr"
  • x.measure = "fpr"
# your code here
perf <-

After you created a perf object, plot the performance by passing it in the plot() function.

# your code here

Try to evaluate the ROC Curve; see if there is any undesirable results from our model. Next, take a look at the AUC value using performance() function by setting the arguments prediction.obj = pred_roc and measure = "auc" then save it under auc object.

# your code here
auc <-

  1. From the result above, how do you interpret the AUC value?
    • [ ] 90.51% means that the model performance is good because the closer to 1 the better
    • [ ] 90.51% means that the model performance is good in classifying positive classes
    • [ ] 95.11% means that the model performance is good in classifying both positive and negative class
    • [ ] 95.11% as Area under ROC Curve represent the accuracy of the model

Models Comparison


  1. As a data scientist in a financial institution, we are required to generate a rule-based model that can be easily implemented in the existing system. What is the best model for us to pick?

    • [ ] Naive Bayes because all the conditional probabilities are well calculated
    • [ ] Decision Tree because the model can be easily translated into a set of rules
    • [ ] Random Forest because it is possible to traceback the rule using variable importance information
  2. Between all the models we have made, which model has better performance in terms of identifying all high-risk customers?

    • [ ] Naive Bayes
    • [ ] Decision Tree
    • [ ] Random Forest

Last but not least, The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on the data the model has never seen. There is a terminology used in machine learning when we talk about how well a machine learning model learns and generalizes to new data, namely overfitting and underfitting.

To validate whether our model is fit enough, we can predict the train and test data and then evaluate model performance in both data. You can check whether the performance is well balanced based on the threshold you have set.


  1. Based on your knowledge about the characteristic of a machine learning model, which statement below is FALSE?
    • [ ] Overfitting is a condition where a model performs well on the training data but performs very poorly in test data.
    • [ ] Underfitting is a condition where a model performs poor in the training data but performs well on the test data.
    • [ ] Machine Learning model that fit just right may have a slightly lower performance in its test data than in its training data.

Quiz
You need to score 8 out of a possible 10 to earn a badge.
You have 1 attempt. Only your highest score will be taken into account.
  • Quiz 1

  • Based on the exploration above, which purpose is most often to default?
  • Question worth 1 point

  • Quiz 2

  • Below are the characteristics of Naive Bayes, EXCEPT
  • Question worth 1 point

  • Quiz 3

  • In our decision tree model, the goal of setting the mincriterion = 0.90 is ...
  • Question worth 1 point

  • Quiz 4

  • Based on the plot, which of the following interpretation is TRUE?
  • Question worth 1 point

  • Quiz 5

  • Based on the model_rf$finalModel summary above, how can we interpret the out-of-bag error rate from our model?
  • Question worth 1 point

  • Quiz 6

  • From the plot you have created, which variable has the most influence to the prediction?
  • Question worth 1 point

  • Quiz 7

  • From the result above, how do you interpret the AUC value?
  • Question worth 1 point

  • Quiz 8

    As a data scientist in a financial institution, we are required to generate a rule-based model that can be easily implemented in the existing system. What is the best model for us to pick?

    Question worth 1 point

  • Quiz 9

    Between all the models we have made, which model has better performance in terms of identifying all high-risk customers?

    Question worth 1 point

  • Quiz 10

  • Based on your knowledge about the characteristic of a machine learning model, which statement below is FALSE?
  • Question worth 1 point

Recipients 446

Users with passing score on Classification2