IE 332 - Homework #3
Due: Nov 17th, 11:59pm

Read Carefully. Important!

As outlined in the course syllabus, this homework is worth 7% of your final grade. The maximum attainable mark on this homework is 140 + 20 (bonus). As was also outlined in the syllabus, there is a zero-tolerance policy for any form of academic misconduct. The assignment can be done individually or in pairs. By electronically uploading this assignment to Gradescope/Brightspace you acknowledge these statements and accept any repercussions if in violation of ANY Purdue Academic Misconduct policies.

You must upload your homework on time for it to be graded. No late assignments will be accepted. Only the last uploaded version of your assignment will be graded. NOTE: You should aim to submit no later than 30 minutes before the deadline, as last-minute network traffic could cause your assignment to be late, resulting in a grade of zero. You must use the provided LaTeX template on Brightspace to submit your assignment.

1. In this question you will use a machine learning model to predict whether a passenger on the Titanic would have survived its sinking, given a set of observed features. The dataset ('Titanic.csv', available on Brightspace) includes information about 891 passengers (each row represents one person), with the following features for each:

      Pclass  Sex  Age  SibSp  Parch  Fare  Embarked  Survived

   where:

   • Pclass: ticket class (1 = 1st class (upper), 2 = 2nd class (middle), 3 = 3rd class (lower))
   • Sex: passenger's sex
   • Age: passenger's age in years
   • SibSp: # of siblings/spouses aboard
   • Parch: # of parents/children aboard
   • Fare: ticket fare
   • Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
   • Survived: whether the passenger survived (1 = survived, 0 = did not survive)

   Specifically, you will compare the performance of a Naive Bayes classifier (using the e1071 package) and a Decision Tree classifier (using the tree package) on this task by reporting the confusion matrix and the ROC curve (using the ROCR package). Before you start, make sure you have installed the e1071, tree, and ROCR packages.

   STEP 1: Preprocessing the dataset.

   (a) (5 points) Load the provided data into R using the read.csv function. Ensure that the columns Pclass, Sex and Embarked are of class factor (HINT: lapply). Assuming your data frame is named df, show the output of executing the command str(df). Only two lines of R code.

   (b) (6 points) Find the total number of NAs in each column. Then replace each NA in the Age column (only) with the median of the non-NA values in that column. No more than 2 lines of code.

   STEP 2: Partition the dataset into training and testing data.

   (c) (5 points) Create a training set composed of 75% of the rows selected randomly, with a testing set composed of the remaining 25% (use the createDataPartition function in the caret package, setting the attribute p to the appropriate proportion, and retrieve the column Resample1 from the result to get the randomly selected indices). No more than 3 lines of code.

   STEP 3: Learn the models using the training data.

   (d) (5 points) Use the naiveBayes function in the e1071 package to learn a classifier that determines whether a passenger's survival is "1" or "0". Only one line of R code.

   (e) (5 points) Use the tree function in the tree package to learn a decision tree classifier to determine a passenger's survival. Only one line of R code.

   STEP 4: Evaluate model performance.
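   For part (a), a minimal sketch, assuming 'Titanic.csv' sits in the working directory (str(df) is the display command the part already asks for):

      df <- read.csv("Titanic.csv")
      df[c("Pclass", "Sex", "Embarked")] <- lapply(df[c("Pclass", "Sex", "Embarked")], factor)
      str(df)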
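   A sketch for part (b): count the NAs per column, then impute Age with its median.

      colSums(is.na(df))                                     # NA count for every column
      df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)  # median imputation, Age only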
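   A sketch for part (c); with list = FALSE, createDataPartition returns a matrix whose Resample1 column holds the sampled row indices:

      library(caret)
      idx <- createDataPartition(df$Survived, p = 0.75, list = FALSE)[, "Resample1"]
      train <- df[idx, ]
      test  <- df[-idx, ]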
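   Sketches for parts (d) and (e); converting Survived to a factor first (an assumption, not stated in the question) makes both functions fit classifiers rather than regressions:

      library(e1071)
      library(tree)
      train$Survived <- as.factor(train$Survived)   # classification, not regression
      test$Survived  <- as.factor(test$Survived)
      nb.model   <- naiveBayes(Survived ~ ., data = train)
      tree.model <- tree(Survived ~ ., data = train)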
   (f) (8 points) Report the confusion matrix for both trained models. What percentage of the test data was correctly classified for each model? No more than 7 lines of code.

   (g) (8 points) Use the ROCR package to create a single ROC plot showing the two classifiers (Naive Bayes in blue and decision tree in red). Make sure that the curves are properly colored and labeled. Based on the Area Under the Curve measure, which of the two classifiers works better for the given data? No more than 10 lines of code.

   In the previous steps, some of the decisions we made were arbitrary. That raises some questions:

   • Is using the median age to replace missing values in the Age column, as we did in step 1, an appropriate choice?
   • Why not split the data into 80% for training and the remaining 20% for testing, instead of the 75%/25% combination used in step 2?

   In the following items, we will consider how to approach these questions by performing some comparison analyses. For that, we will use the Decision Tree method, but the same analyses could be performed for Naive Bayes or any other supervised learning method.

   (h) (8 points) Compare the performance of the classifiers by varying the training/testing data proportion (vs. the 75%/25% split used above). No more than 6 lines of code.

      1. random selection of 25% of the rows for training, and the other 75% for testing
      2. random selection of 50% of the rows for training, and the other 50% for testing

   (i) (5 points) For each of the two new data partitions, train a separate decision tree. Only two lines of R code.

   (j) (10 points) Report the confusion matrix and the ROC plot for the model trained with each of the different partitions: 25%/75% vs. 50%/50% vs. 75%/25% (original partition). What percentage of the test data was correctly classified for each partition level? Which one performed best for each of the tested metrics? Given this analysis, what partition level of the dataset would you pick in this case?
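   A sketch for part (f), assuming the objects from parts (c)-(e) are in scope:

      nb.pred   <- predict(nb.model, test)
      tree.pred <- predict(tree.model, test, type = "class")
      table(predicted = nb.pred,   actual = test$Survived)    # confusion matrix, Naive Bayes
      table(predicted = tree.pred, actual = test$Survived)    # confusion matrix, decision tree
      mean(nb.pred == test$Survived)                           # accuracy, Naive Bayes
      mean(tree.pred == test$Survived)                         # accuracy, decision tree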
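   A sketch for part (g); it assumes "1" is the positive class and uses the class-"1" probability column that both predict methods return:

      library(ROCR)
      nb.prob   <- predict(nb.model, test, type = "raw")[, "1"]
      tree.prob <- predict(tree.model, test)[, "1"]
      nb.perf   <- performance(prediction(nb.prob,   test$Survived), "tpr", "fpr")
      tree.perf <- performance(prediction(tree.prob, test$Survived), "tpr", "fpr")
      plot(nb.perf, col = "blue", main = "ROC curves")
      plot(tree.perf, col = "red", add = TRUE)
      legend("bottomright", c("Naive Bayes", "Decision Tree"), col = c("blue", "red"), lty = 1)
      performance(prediction(nb.prob,   test$Survived), "auc")@y.values   # AUC, Naive Bayes
      performance(prediction(tree.prob, test$Survived), "auc")@y.values   # AUC, decision tree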
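   A sketch for parts (h) and (i), reusing the createDataPartition pattern from part (c):

      idx25 <- createDataPartition(df$Survived, p = 0.25, list = FALSE)[, "Resample1"]
      idx50 <- createDataPartition(df$Survived, p = 0.50, list = FALSE)[, "Resample1"]
      train25 <- df[idx25, ]; test25 <- df[-idx25, ]
      train50 <- df[idx50, ]; test50 <- df[-idx50, ]
      tree25 <- tree(as.factor(Survived) ~ ., data = train25)   # part (i), line 1
      tree50 <- tree(as.factor(Survived) ~ ., data = train50)   # part (i), line 2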
2. Using the same Titanic dataset from above, you will conduct a clustering analysis on a mix of nominal and interval data types and investigate different distance measures. Use the preprocessed data with the following adjustments:

   • Convert the Pclass column back to a numeric type, retaining each value's corresponding level:

      df[, c("Pclass")] = as.numeric(as.character(df[, c("Pclass")]))

   • Replace the first level (the empty string) in column Embarked with "U":

      levels(df[, c("Embarked")])[1] = "U"

   • Be sure you have installed the required packages:

      library(cluster)    # for computing clustering: pam, gower
      library(factoextra) # for elegant ggplot2-based data visualization
      library(magrittr)   # for piping: %>%
      library(Rtsne)      # for t-SNE plots
      library(dplyr)      # for data cleaning
      library(caret)      # for one-hot encoding

   (a) (12 points) Using a statistical method (stacked bar chart, correlation, linear regression, or ANOVA), investigate the effect of the categorical column Sex on the dependent column Survived. Is there a significant association between these two columns? Why? No more than 5 lines of code and a maximum of 2 sentences to explain your reasoning.

   (b) (6 points) A categorical variable consists of discrete values that don't have an ordered relationship. One-hot encoding is the process of converting a categorical variable into multiple variables, each with a value of 1 or 0. Read this reference (https://datatricks.co.uk/one-hot-encoding-in-r-three-simple-methods) and perform a one-hot encoding on the Sex and Embarked columns. Maximum 4 lines of code.

   (c) (6 points) Using the get_dist() function from package factoextra, compute the Jaccard dissimilarity for the converted nominal columns Sex and Embarked. The formula for the Jaccard coefficient can be found at https://www.ims.uni-stuttgart.de/documents/team/schulte/theses/phd/algorithm.pdf. Note that the one-hot encoding from above will result in a 6-column data frame where the first 2 columns are the results for column Sex and the last 4 columns are the results for column Embarked. Compute the Jaccard dissimilarity for columns Sex and Embarked separately. Maximum 2 lines of code.

   (d) (6 points) Using the get_dist() function from package factoextra, compute the Euclidean and Kendall distances for the numeric columns (remember to exclude column Survived). Maximum 3 lines of code.

   (e) (10 points) Choose the optimal number of clusters.

      1. Read this article on k-medoids (https://www.datanovia.com/en/lessons/k-medoids-in-r-algorithm-and-practical-examples/) and write down one drawback of k-means clustering that it mentions.

      2. Fill in the missing parts of the following code, which uses the fviz_nbclust function in package factoextra to find the optimal number of clusters for a k-medoid clustering from the total within sum-of-squares (Elbow method):

         ## compute the weighted sum of the three distance matrices; the weights are
         ## equal in the sense that each original column carries a 1/8 weight
         my.d = 0.75*d.interval.kd + 0.125*d.sex + 0.125*d.eb
         ## recombine the numeric and categorical data
         my_data = cbind.data.frame(interval.data, nominal.onehot)
         fviz_nbclust(x = missing1, FUNcluster = missing2, method = missing3, diss = missing4)

      3. Use one sentence to explain how the Elbow method works.

   (f) (5 points) Conduct k-medoid clustering using the function pam from package cluster with k = 2, where cluster 1 corresponds to those who did not survive and cluster 2 to those who survived. Calculate the percentage of correctly assigned people according to your clustering result.
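   A sketch for part (a), using a stacked bar chart plus a chi-squared test of independence (one of several acceptable methods):

      counts <- table(df$Survived, df$Sex)    # rows: Survived, columns: Sex
      barplot(counts, legend.text = TRUE, xlab = "Sex", ylab = "Count")
      chisq.test(counts)                      # a tiny p-value => significant association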
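   A sketch for part (b), using caret's dummyVars (one of the three methods in the linked reference); the name nominal.onehot matches the variable reused by the code in part (e):

      dmy <- dummyVars(~ Sex + Embarked, data = df)
      nominal.onehot <- data.frame(predict(dmy, newdata = df))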
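   A sketch for part (c); get_dist with method = "binary" computes the asymmetric binary (Jaccard) distance, and the column split follows the 2 + 4 layout described above:

      d.sex <- get_dist(nominal.onehot[, 1:2], method = "binary")
      d.eb  <- get_dist(nominal.onehot[, 3:6], method = "binary")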
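   A sketch for part (d); the column list is an assumption based on the preprocessing above, and the name d.interval.kd feeds the weighted sum in part (e):

      interval.data <- df[, c("Pclass", "Age", "SibSp", "Parch", "Fare")]
      d.interval.ed <- get_dist(interval.data, method = "euclidean")
      d.interval.kd <- get_dist(interval.data, method = "kendall")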
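   One plausible completion of the blanks in part (e)2 (a sketch only, since filling them is the exercise): x takes the recombined data, FUNcluster the k-medoid routine, method the Elbow criterion, and diss the precomputed weighted distance:

      fviz_nbclust(x = my_data, FUNcluster = cluster::pam, method = "wss", diss = my.d)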
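   A sketch for part (f); pam accepts the precomputed dissimilarity directly, and mapping cluster 1 -> 0, cluster 2 -> 1 follows the question's labeling (assuming df$Survived is still numeric 0/1 here; in practice pam's cluster numbering is arbitrary and may need to be flipped):

      pam.res <- pam(my.d, k = 2, diss = TRUE)
      mean((pam.res$clustering - 1) == df$Survived)   # fraction correctly assigned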
3. (8 points, bonus) Figure 1 shows a t-SNE plot (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) of the clustering result. The two bigger black dots are the corresponding medoids for each cluster. Evaluate your answer from above using this visualization method. Does your answer align with this visualization? If not, what could be an alternative optimal number of clusters?

   [Figure 1: t-SNE plot for the clustering result.]

4. In this question you will double encrypt/decrypt your first and last name (separated by a ' ') using public-key and symmetric-key approaches (if you are working in pairs, use the first name of each member). Specifically, first encrypt your name using the One-Time Pad algorithm, resulting in ciphertext L1. Then, using the RSA algorithm, encrypt L1 into ciphertext L2, which would be the transmitted message. You will then decrypt L2 to yield L3, and then decrypt L3 to obtain L4, which will be a list of integers corresponding to the letters of your first and last name, as expected. That is, L4 should be identical to the corresponding list of integers from the original message. Use the following table to translate between characters and their integer representation. Be sure to show each step of the encryption and decryption processes.

      (space)  A  B  C  D  E  F  G  H  I  J  K  L  M
        00    01 02 03 04 05 06 07 08 09 10 11 12 13

         N  O  P  Q  R  S  T  U  V  W  X  Y  Z
        14 15 16 17 18 19 20 21 22 23 24 25 26

   Use the RSA algorithm shown in class, with p = 1153 and q = 997 (noting that key generation only needs to be performed once for the entire message). Given the large number of digits, it is suggested to use a powerful calculator such as www.wolframalpha.com.

   The One-Time Pad: Believe it or not, "perfect" encryption techniques are theoretically possible! In this context, "perfect" means that there…
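   Returning to the RSA step, a worked illustration of the key-generation arithmetic only (the public exponent e = 5 and the two-letters-per-block packing are illustrative assumptions, not requirements of the assignment): n = pq = 1,149,541, phi(n) = 1152 x 996 = 1,147,392, and d = 458,957 satisfies 5d = 2,294,785 = 2 x 1,147,392 + 1, i.e. ed ≡ 1 (mod phi(n)). A minimal R sketch:

      p <- 1153; q <- 997
      n <- p * q                  # 1149541
      phi <- (p - 1) * (q - 1)    # 1147392
      e <- 5                      # an illustrative public exponent with gcd(e, phi) = 1
      d <- 458957                 # the private exponent: (e * d) %% phi == 1
      # Modular exponentiation by repeated squaring; every intermediate product
      # stays below n^2 < 2^53, so double-precision arithmetic remains exact.
      modpow <- function(base, exp, mod) {
        result <- 1
        base <- base %% mod
        while (exp > 0) {
          if (exp %% 2 == 1) result <- (result * base) %% mod
          base <- (base * base) %% mod
          exp <- exp %/% 2
        }
        result
      }
      m <- 805                          # e.g. the pair "08 05" = "HE" under the table above
      c <- modpow(m, e, n)              # RSA encryption: c = m^e mod n
      stopifnot(modpow(c, d, n) == m)   # RSA decryption recovers the block

   For the one-time pad step, a common convention over this 27-symbol alphabet is c_i = (m_i + k_i) mod 27 with decryption m_i = (c_i - k_i) mod 27; confirm against the version shown in class.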