I have attached assignment files for my coursework. Last semester has the same coursework, but dataset file has been changed slightly for my semester and the dataset has more values, hence i am unable...

I have attached the assignment files for my coursework. Last semester had the same coursework, but the dataset file has been changed slightly for my semester and now has more values, so I am unable to run the code and also need to change values. From last semester, I have the R code file and PDF files for all the questions other than question 1 of my assignment. Last semester's files have the questions in a different order. I can provide the files to help you run the code and answer the questions correctly.


SAI PRAJIT GHANTA – 662225892
SUJANA PRUDHIVI – 652693058
PRASHANT ARCOT – 650985079

IDS 572 Assignment 1 – Loan default prediction and investment strategies in online lending
Due date: Feb 21 (Phase B)

5. Develop decision tree models to predict default.
(a) Split the data into training and validation sets. What proportions do you consider, why?
(b) Train decision tree models (use both rpart, c50) [If something looks too good, it may be due to leakage – make sure you address this] What parameters do you experiment with, and what performance do you obtain (on training and validation sets)? Clearly tabulate your results and briefly describe your findings. How do you evaluate performance – which measure do you consider, and why?
(c) Identify the best tree model. Why do you consider it best? Describe this model – in terms of complexity (size). Examine variable importance. Briefly describe how variable importance is obtained in your best model.

a) We split the data into training and testing sets in a 70% : 30% ratio. The standard 80% : 20% ratio seemed less suitable, given the low proportion of Charged Off data points. We used caret’s “createDataPartition” method to achieve this.

b) As required, we built two decision tree models, one with the rpart algorithm and one with the C5.0 algorithm. Of the two, the rpart model performed better in terms of accuracy. The first rpart model used a complexity parameter (cp) of 0.0001 and a minsplit of 50. This tree was then pruned, resulting in a complexity parameter of 0.0002. The C5.0 model was built on data consisting of 81023 observations in total. Due to memory constraints, we used a limited portion of the original dataset, but it seemed unwise to downsize the dataset any further.
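The 70% : 30% split in (a) relies on caret’s createDataPartition, which samples within each level of the outcome so that both partitions keep roughly the same class mix. The report's R code is not shown, so the following is only a minimal Python sketch of the same idea. The function name `stratified_split`, the seed, and the toy labels are all hypothetical and are not taken from the assignment data.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.7, seed=42):
    """Split row indices so each class is divided at roughly train_frac."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # random order within one class
        cut = round(train_frac * len(idx))    # class-wise 70% cut point
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels, chosen only to show an imbalanced outcome.
labels = ["Charged Off"] * 140 + ["Fully Paid"] * 860
train_idx, test_idx = stratified_split(labels)

# Share of the minority class in each partition.
train_rate = sum(labels[i] == "Charged Off" for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] == "Charged Off" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Because the cut is made per class, the Charged Off share is the same in both partitions, which is the property that matters when the minority class is small.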
c) After pruning, the rpart decision tree is the best model, with 15,769 nodes, an AUC value of 0.6458, and a prediction accuracy of 99.23%. On the training set, the first model achieved an accuracy of 99.23%, while the second model achieved an accuracy of 99.11%. We also took the AUC and lift curves into consideration, along with the confusion matrix, in determining the best model. The confusion matrix provides a better understanding of how the model behaves. Below are the confusion matrices for the training and testing datasets:

              Actual
Predicted     Charged Off  Fully Paid
Charged Off          7954          71
Fully Paid            363       48327

              Actual
Predicted     Charged Off  Fully Paid
Charged Off          3337          41
Fully Paid            173       20756

The top 10 variables by descending order of variable importance as given by the model are as follows:
● collection_recovery_fee (57)
● installment (17)
● loan_amt (15)
● total_pymnt (10)
● Sub_grade
● int_rate
● Annual_inc
● Grade
● Total_bc_limit
● Total_rev_hi_lim

Consider the following lift curve and AUC curve graphs for the test dataset.

[Figure: lift curve and AUC curve for the test dataset]

6. Develop a random forest model. (Note the ‘ranger’ library can give faster computations) What parameters do you experiment with, and does this affect performance? Describe the best model in terms of number of trees, performance, variable importance. Compare the random forest and best decision tree model from Q 4 above. Do you find the importance of variables to be different? Which model would you prefer, and why?

Using ranger’s implementation, we built two random forests. After testing num.trees values ranging from 50 to 100, we adopted 50 as the optimal value for both models. We preferred a lower value over a higher one, since with a higher value fewer observations were classified as Charged Off.
At the same time, with num.trees set to 100, the forest classifies applications as Fully Paid. Hence, we settled on 50 as the final value for num.trees.

The main difference between the two models is the methodology used in determining the split. In the first model, we used permutation with information gain, while in the second model we used the Gini index with impurity as the parameter. We conclude that the first model, using information gain, performs better on the testing dataset than the second model. Compared to the earlier decision tree model, the AUC is higher, albeit marginally, for both of these models.

The first random forest model predicted 0 applications as “Charged Off”, and the second random forest model also predicted 0 applications as “Charged Off” on testing data. Both of these are significantly fewer than for the pruned decision tree on the testing data. Judging by the confusion matrices on the testing data, the pruned decision tree therefore does a remarkably better job of predicting “Charged Off” applications.

For the first random forest model, the confusion matrices give an accuracy of 99.99% on the training dataset and 97.94% on the testing dataset, with a prediction error of 0.0218.

Training
Predicted     Charged Off  Fully Paid
Charged Off          8245           0
Fully Paid              1       48469

Testing
Predicted     Charged Off  Fully Paid
Charged Off          3390           0
Fully Paid            496       20726

For the second random forest model, which uses the Gini index with impurity as the parameter, the confusion matrix gives an accuracy of 99.99% with a prediction error of 0.0223. Below are the confusion matrices to be considered.
Training
Predicted     Charged Off  Fully Paid
Charged Off          8245           0
Fully Paid              1       48469

Testing
Predicted     Charged Off  Fully Paid
Charged Off          3071           0
Fully Paid            510       20726

The top 10 variables by descending order of variable importance as given by both the models are as follows:
● Total_pymt
● Collection_recovery_fee
● installment
● loan_amnt
● tot_hi_cred_lim
● total_bc_limt
● avg_cur_bal
● bc_open_to_buy
● total_rev_hi_lin
● annual_inc

The top 10 variables by descending order of variable importance as given by both the models are as follows:
● Collection_recovery_fee
● Total_pymt
● Loan_amnt
● Installment
● Int_rate
● Sub_grade
● Dti
● X1
● Bc_util
● Total_bc_util

Comparing the above two models with the variables given by the pruned decision tree, total_pymt, avg_cur_bal, and tot_hi_cred_lim are the most important variables that determine the status of a loan (“Fully Paid”, “Charged Off”) according to the models.

7. (a) Compare the performance of your models from Qs 4 and 5 above based on this. Note that the confusion matrix depends on the classification threshold/cutoff you use. Evaluate different thresholds and analyze performance. Which model do you think will be best, and why.
(b) Another approach is to directly consider how the model will be used – you can order the loans in descending order of prob(fully-paid). Then, you can consider starting with the loans which are most likely to be fully-paid and go down this list till the point where overall profits begin to decline (as discussed in class). Conduct an analysis to determine what threshold/cutoff value of prob(fully-paid) you will use and what is the total profit from different models. Also compare the total profits from using a model to that from investing in the safe CDs. Explain your analyses and calculations. Which model do you find to be best and why. And how does this compare with what you found to be best in part (a).
To build an investment model, we set the profit value to 3 * average interest rate (11.56) = $34.68 and the loss value to 3 * average return rate on defaulted loans (4.9) = $14.7. We used the pruned decision tree as our model for this analysis on account of its ability to predict “Fully Paid” loans accurately. The graph for cumulative profit is as follows:

[Figure: cumulative profit curve]

From the graph, we conclude that the total cumulative profit increases steadily up to around the 20,000th observation. At around the 20,000th index, we notice a sharp decrease in the chart. This decline starts after the prediction probability falls below 90%. This can be our threshold for deciding whether or not to invest in a given loan application, given its probability of being “Fully Paid”.
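The cutoff analysis described above (rank loans by prob(fully-paid), accumulate profit, stop where the curve peaks) can be sketched in a few lines. This is a hedged Python illustration, not the report's R code. It uses the $34.68 profit and $14.7 loss figures stated in the text and assumes the loss enters as a negative payoff per loan. The function name `best_cutoff` and the scored toy loans are hypothetical.

```python
from itertools import accumulate

PROFIT_FULLY_PAID = 34.68    # 3 * 11.56, as stated in the text
LOSS_CHARGED_OFF = -14.70    # 3 * 4.9, assumed here to count as a negative payoff


def best_cutoff(scored_loans):
    """scored_loans: list of (prob_fully_paid, actual_status) pairs.

    Returns (cutoff probability, number of loans funded, peak profit,
    full cumulative profit curve).
    """
    # Most likely to be fully paid first.
    ranked = sorted(scored_loans, key=lambda row: row[0], reverse=True)
    payoffs = [
        PROFIT_FULLY_PAID if status == "Fully Paid" else LOSS_CHARGED_OFF
        for _, status in ranked
    ]
    cumulative = list(accumulate(payoffs))
    # Index of the highest point on the cumulative profit curve.
    peak = max(range(len(cumulative)), key=cumulative.__getitem__)
    return ranked[peak][0], peak + 1, cumulative[peak], cumulative


# Hypothetical scored loans, for illustration only.
scored = [
    (0.99, "Fully Paid"), (0.97, "Fully Paid"), (0.95, "Fully Paid"),
    (0.92, "Charged Off"), (0.91, "Fully Paid"), (0.85, "Charged Off"),
    (0.80, "Charged Off"), (0.70, "Charged Off"), (0.60, "Fully Paid"),
]
cutoff, n_loans, peak_profit, curve = best_cutoff(scored)
print(cutoff, n_loans, round(peak_profit, 2))
```

Taking the global maximum of the curve, rather than stopping at the first dip, avoids cutting off too early when a single defaulted loan is followed by several fully paid ones.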
Answered 2 days after Nov 01, 2022

Answer To: I have attached assignment files for my coursework. Last semester has the same coursework, but...

Banasree answered on Nov 03 2022
Ans.
1.a) Peer-to-peer lending is safe as long as the data analytics are good and the lending goes through a certified platform. In this case study, Lending Club is the platform. It makes its determinations based on a variety of information, including credit rating, credit history, and income.
A few things that will help the client make a better decision are:
1. Data reliability.
2. Simple data analytics.
3. Prediction and forecasting of the loan data model.
4. Risk Analysis.
5. Detailed loan data.
6. Analyses of returns from loan.
Objectives:...