First, we will split the data into training and test sets.
Then, in the second step, we will center and scale the data. For numerical features, centering and scaling prevents differences in units of measurement from dominating the distance calculation.
For categorical features, we will apply one-hot encoding so that there is one dummy variable for each level of a categorical variable.
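To see why scaling matters, here is a minimal base-R sketch with made-up income/age values (the numbers are hypothetical, not from the lab data): before scaling, the large-unit feature dominates the Manhattan distance almost entirely.

```r
# Two hypothetical observations: income in dollars, age in years
a <- c(income = 50000, age = 25)
b <- c(income = 52000, age = 60)
print(dist(rbind(a, b), method = "manhattan"))  # 2035: driven almost entirely by income

# After centering and scaling each column, both features contribute comparably
s <- scale(rbind(a, b))
print(dist(s, method = "manhattan"))
```

Without scaling, the age difference of 35 years is invisible next to the $2000 income gap; after scaling, both features carry equal weight.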
In this lab assignment, we will use the recipe() function in the recipes package to pre-process the data. The step_center() and step_scale() functions scale the data, and the step_dummy() function performs one-hot encoding, as shown in the next code chunk.
# Step 1: data splitting with caret package. Hold 30% for testing.
# some cleaning with the recipe function; do the pre-processing on the train_set
features_train <- recipe(creditability ~ ., data = creditdata) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  prep(training = train_set, retain = TRUE) %>%
  juice() %>%
  select(-creditability)
# some cleaning with the recipe function; do the pre-processing on the test_set
features_test <- recipe(creditability ~ ., data = creditdata) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  prep(training = test_set, retain = TRUE) %>%
  juice() %>%
  select(-creditability)
# In the original data, we had two numerical and two categorical variables:
# payment_status with 5 classes and sex_marital with 4 classes.
# With one-hot encoding, payment_status is split into five dummy variables
# and sex_marital is split into four dummy variables.
# In total, we end up with 11 features:
# 'credit_amount', 'age', 'payment_status_X0', 'payment_status_X1', 'payment_status_X2',
# 'payment_status_X3', 'payment_status_X4', 'sex_marital_X1', 'sex_marital_X2',
# 'sex_marital_X3', 'sex_marital_X4'
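As a quick illustration of what one-hot encoding produces, here is a toy base-R equivalent using model.matrix() on a hypothetical four-observation factor (dropping the intercept keeps one dummy column per level, matching the behavior of step_dummy(one_hot = TRUE)):

```r
# Hypothetical categorical variable with three levels
g <- factor(c("A", "B", "C", "A"))
m <- model.matrix(~ g - 1)  # one column per level: gA, gB, gC
print(m)
```

Each row has exactly one 1 (the observation's level) and 0s elsewhere, which is why a 5-level and a 4-level factor expand into 5 and 4 dummy columns above.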
In [9]:
# check the dimension of the features
# With one-hot encoding, instead of four, we now have eleven predictors
print(dim(features_train))
# check the summary statistics of features
print(summary(features_train))
[1] 700 11
credit_amount age payment_status_X0 payment_status_X1
Min. :-1.0458 Min. :-1.4593 Min. :0.00000 Min. :0.00000
1st Qu.:-0.6637 1st Qu.:-0.7477 1st Qu.:0.00000 1st Qu.:0.00000
Median :-0.3569 Median :-0.2140 Median :0.00000 Median :0.00000
Mean : 0.0000 Mean : 0.0000 Mean :0.03857 Mean :0.04857
3rd Qu.: 0.2382 3rd Qu.: 0.4976 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. : 5.1357 Max. : 3.5219 Max. :1.00000 Max. :1.00000
payment_status_X2 payment_status_X3 payment_status_X4 sex_marital_X1
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
Median :1.0000 Median :0.00000 Median :0.0000 Median :0.00000
Mean :0.5386 Mean :0.07857 Mean :0.2957 Mean :0.05714
3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.00000
sex_marital_X2 sex_marital_X3 sex_marital_X4
Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :1.0000 Median :0.00000
Mean :0.3057 Mean :0.5557 Mean :0.08143
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.00000
kNN Approach
kNN does not handle missing values well. Luckily, there are no missing values in our dataset, which can be verified with the following line of code: sapply(creditdata, function(x) sum(is.na(x)))
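As a toy demonstration of that check on a hypothetical data frame (not the credit data), the pattern counts the NA values in each column:

```r
# Hypothetical data frame with one NA in each column
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
print(sapply(df, function(x) sum(is.na(x))))  # a: 1, b: 1
```

A result of all zeros, as in our dataset, means no imputation or row removal is needed before running kNN.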
We will use the knn() function in the class package to get the label predictions.
First, for illustration purposes, we will select one data point from the test set, calculate its distance to each observation in the training set, and look at the closest observations and their corresponding labels. Distance is a measure of dissimilarity: the smaller the distance between two observations, the greater their similarity. This is just a practice to help you understand the algorithm behind kNN.
We will use the dist() function in R to calculate the Manhattan distance between the first observation in the test set and each observation in the training set. Since we need all numerical variables to be scaled and all categorical variables to be one-hot encoded, we will use features_train and features_test for the distance calculation.
We take the first observation in the test data (named test1 below), calculate the pairwise distance to each observation in the training set, and collect the results in a data frame called collect along with the training data labels.
In collect, the first column, train_label, shows the label in the training set. The second column, distance, shows the Manhattan distance between the first observation in the test set and each observation in the training set.
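The stack-and-dist pattern used below can be sketched on made-up numbers, with row 1 playing the role of the test point and the remaining rows the training points:

```r
# Row 1 is the "test" point; rows 2-3 are "training" points (hypothetical values)
pts <- rbind(test_pt = c(0, 0), train1 = c(1, 2), train2 = c(3, 1))
d <- as.matrix(dist(pts, method = "manhattan"))
print(d[2:3, 1])  # Manhattan distances to the test point: 3 and 4
```

Converting the dist object to a matrix lets us read off column 1, the distances from every other row to the first row, which is exactly what the lab code does with distanceALL[2:701, 1].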
In [10]:
test1 <- features_test[1,]  # this is the first observation in the test set
combine <- rbind(test1, features_train)  # stack test1 and the train data features, test1 being the first row
distanceALL <- as.matrix(dist(combine, method = "manhattan"))  # distance from each row in combine to all other rows, saved as a matrix
collect <- cbind(as.data.frame(train_set$creditability), distanceALL[2:701, 1])  # combine train_set labels with the distances to test1
names(collect) <- c("train_label", "distance")
head(collect)  # list the first few observations in collect
A data.frame: 6 × 2
  train_label  distance
2           1 3.9301694
3           1 5.8230155
4           1 0.4738137
5           1 1.3395091
6           1 0.4699036
7           1 0.9272197
Exercise 1
Sort collect based on distance in ascending order (from lowest to highest), keep only the first 9 rows, and call the result Nine_neighbors.
In [ ]:
# Exercise #1: List the Nine_neighbors
# your code here
In [ ]:
# Test your code in here
### BEGIN HIDDEN TEST
test_that("Check distance measures", {
expect_equal( min(Nine_neighbors$distance),0.0574817558633449)})
test_that("Check distance measures", {
expect_equal( max(Nine_neighbors$distance),0.29251146576787)})
test_that("Check distance measures", {
expect_equal( mean(Nine_neighbors$distance),0.185849316164428)})
print("Good job! Your code passed the test!")
### END HIDDEN TEST
Exercise 2
Use the knn() function in the class package to predict the labels in the test data with k=5. Use set.seed(4230) and name the predicted test data labels knn_five. (knn_five is a vector of length 300.)
Please note that the knn() function in the class package requires predictors and labels to be entered separately. More specifically, the predictors need to be a matrix and the labels a vector. Hence, you need to feed the knn() function with features_train and train_set$creditability, where the former is in matrix format and the latter is a vector.
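As an illustration of that calling convention, here is a toy two-cluster dataset (hypothetical, not the credit data) fed to knn() with the predictors as a matrix and the labels as a separate factor vector:

```r
library(class)  # knn() lives in the class package

# Hypothetical training data: two well-separated clusters labeled 0 and 1
train_x <- matrix(c(0, 0,  0, 1,  1, 0,
                    5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c(0, 0, 0, 1, 1, 1))  # labels passed separately via cl
test_x  <- matrix(c(0.5, 0.5,
                    5.5, 5.5), ncol = 2, byrow = TRUE)

set.seed(4230)
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
print(pred)  # one predicted label per test row: 0 then 1
```

Note that knn() returns a factor of predicted labels, one per test row, rather than a fitted model object; there is no separate predict() step.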
In [ ]:
# we will need this function in the following code chunks.
#Run this code before proceeding to the next one
class_error = function(actual, predicted) {
mean(actual != predicted)
}
In [ ]:
# Exercise #2: knn results when k=5
# your code here
In [ ]:
# Test your code in here
### BEGIN HIDDEN TEST
class_error = function(actual, predicted) {
mean(actual != predicted)
}
test_that("Check the classification error", {
expect_equal( class_error(test_set$creditability,knn_five),0.353333333333333)})
print("Your code passed the test!")
### END HIDDEN TEST
Exercise 3
Use the knn() function in the class package to predict the labels in the test data with k=10. Use set.seed(4230) and name the predicted test data labels knn_ten. (knn_ten is a vector of length 300.)
Please note that the knn() function in the class package requires predictors and labels to be entered separately. More specifically, the predictors need to be a matrix and the labels a vector. Hence, you need to feed the knn() function with features_train and train_set$creditability, where the former is in matrix format and the latter is a vector.
In [ ]:
# Exercise #3: knn results when k=10
# your code here
In [ ]:
# Test your code in here
### BEGIN HIDDEN TEST
class_error = function(actual, predicted) {
mean(actual != predicted)
}
test_that("Check the classification error", {
expect_equal( class_error(test_set$creditability,knn_ten),0.323333333333333)})
print("Your code passed the test!")
### END HIDDEN TEST
Exercise 4: Performance Measure
This time, we will take the predictions in knn_ten above and, using the confusion matrix, calculate several misclassification measures. Note that if someone has good credit (creditability = 1), we define it as a positive event. For instance, True Negative (TN) refers to cases where the true label (creditability) is 0 and our model predicts 0. Likewise, False Negative (FN) refers to cases where the true label is 1 (good credit) but the model predicts 0.
Your task is to calculate "Accuracy" and store it as Accuracy, and to calculate "Specificity" and store it as Specificity.
IMPORTANT: You need to enter your findings rounded to two decimal places for the test to pass.
For instance, if your accuracy measure is 0.37777777777777, Accuracy should store it as 0.38.
For instance, if your Specificity measure is 0.5839123412342, Specificity should store it as 0.58.
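On hypothetical labels (not the lab's predictions), the confusion-matrix arithmetic and the required rounding look like this; the acc and spec names below are placeholders for your own Accuracy and Specificity:

```r
# Toy example: positive event is 1 (good credit), negative is 0
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 0, 1, 0, 1, 0, 0, 1)
cm <- table(Predicted = predicted, Actual = actual)
TP <- cm["1", "1"]; TN <- cm["0", "0"]  # correct predictions
FP <- cm["1", "0"]; FN <- cm["0", "1"]  # errors

acc  <- round((TP + TN) / sum(cm), 2)  # (3 + 3) / 8 = 0.75
spec <- round(TN / (TN + FP), 2)       # 3 / (3 + 1) = 0.75
print(c(accuracy = acc, specificity = spec))
```

Accuracy is the share of all predictions that are correct, while specificity is the share of true negatives among actual negatives; round(x, 2) gives the two-decimal values the test expects.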
In [ ]:
# Exercise #4: Performance measure
# your code here
In [ ]:
# Test your code in here
### BEGIN HIDDEN TEST
test_that("Check the results", {
expect_equal(0.68*(Accuracy^2)^2 ,0.1453933568
)})
test_that("Check the results", {
expect_equal(0.68*(Specificity^2)^2 ,0.0041796608
)})
print("Your code passed the test!")
### END HIDDEN TEST
Parameter Tuning
In kNN, k is the only parameter we need to tune. Next, we try k values from 1 to 50, calculate the classification error for each k, and save the errors as k_class_error.
In [ ]:
set.seed(4230)
k_values = 1:50
k_class_error = numeric(length(k_values))  # preallocate the error vector
for (i in seq_along(k_values)) {
predicted_labels = knn(train = features_train,
test = features_test,
cl = train_set$creditability,
k = k_values[i])
k_class_error[i] = class_error(test_set$creditability, predicted_labels)
}
print(k_class_error)
Exercise 5:
Find the optimal k (the value of k that minimizes the classification error) and call it optimal_k. In other words, at which value of k does k_class_error take its minimum?
In [ ]:
# Exercise #5: find the optimal k
# your code here
In [ ]:
# Test your code in here
### BEGIN HIDDEN TEST
test_that("Check the optimal k", {
expect_equal( optimal_k/4,8)})
print("Your code passed the test!")
### END HIDDEN TEST
Cross validation
A better approach to finding the optimal k is cross-validation. One can use the caret package to conduct k-fold cross-validation very easily. Since the Coursera platform does not support certain packages in the Lab Manager, we skip the cross-validation part here.
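For reference, here is a minimal k-fold cross-validation sketch written in base R plus the class package (since caret is unavailable on the platform). The data are simulated stand-ins for features_train and the training labels; with the real data you would substitute those objects directly.

```r
library(class)

set.seed(4230)
# Simulated stand-in data: 100 observations, 2 features, binary labels
x <- matrix(rnorm(200), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(100, sd = 0.5) > 0, 1, 0))

n_folds  <- 5
fold_id  <- sample(rep(1:n_folds, length.out = nrow(x)))  # random fold assignment
k_values <- 1:15
cv_error <- numeric(length(k_values))

for (i in seq_along(k_values)) {
  fold_err <- numeric(n_folds)
  for (f in 1:n_folds) {
    hold <- fold_id == f  # fold f is held out for validation
    pred <- knn(train = x[!hold, ], test = x[hold, ],
                cl = y[!hold], k = k_values[i])
    fold_err[f] <- mean(pred != y[hold])
  }
  cv_error[i] <- mean(fold_err)  # average validation error across the folds
}
print(k_values[which.min(cv_error)])  # the k with the lowest cross-validated error
```

Unlike the single train/test split above, each k is evaluated on five different held-out folds, so the chosen k is less sensitive to one particular split of the data.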