a. For the continuous variables, create a descriptive statistics table, and for the categorical variables, create a frequency table. What is the average loan/price ratio? What proportion of the sample is white? What is the bankruptcy rate in the sample?
b. Randomly create three subsamples: training, validation, and test sets (where the training set contains roughly 70 percent of all data points and the validation and test sets each contain 15 percent of the data points). Describe the procedure you employ to create the subsamples.
Creating the three subsamples required some initial setup before the cases could be assigned to the training, validation, and test sets.
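The assignment itself is done in SPSS, but the logic of a 70/15/15 random split can be sketched in Python as a cross-check. This is a minimal sketch, not the procedure actually used: the function name `split_indices`, the seed, and the sample size of 1000 are all illustrative assumptions.

```python
import random

def split_indices(n_rows, train_frac=0.70, valid_frac=0.15, seed=42):
    """Shuffle the row indices once, then cut them into three groups."""
    indices = list(range(n_rows))
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    rng.shuffle(indices)
    n_train = round(n_rows * train_frac)
    n_valid = round(n_rows * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remaining rows, about 15 percent
    return train, valid, test

# n_rows=1000 is a placeholder; use the actual number of cases in the data.
train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Shuffling once and slicing guarantees that every case lands in exactly one subsample, which a per-row random draw in SPSS only achieves approximately.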
In the remaining sections, use the training set for the analysis, unless stated otherwise.
c. To test for discrimination in the mortgage loan market, a logistic regression model can be used:
$\operatorname{logit}(\text{approve}) = \beta_0 + \beta_1\,\text{white} + \text{other factors}$
In the equation above, if there is discrimination against minorities, and the appropriate factors have been controlled for, what is the sign of $\beta_1$?
If there were discrimination against minorities, minority applicants would be less likely to be approved than otherwise comparable white applicants. Because white equals 1 for white applicants, we would expect $\beta_1$ to have a positive sign.
d. Regress approve on white using logistic regression and report the coefficient table. Interpret the coefficient on white. Is it statistically significant? Is it practically large?
The odds ratio tells us more about the relationship between white and approve. Holding everything else constant, being white increases the odds of being approved by XXXXXXXXXX. Therefore, we can conclude that the magnitude is practically large. The coefficient is also statistically significant (p < .001).
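For reference, the odds ratio that SPSS reports as Exp(B) is the exponential of the logit coefficient. The sketch below shows the conversion; the coefficient value is a placeholder chosen only for illustration and is not the estimate from this sample.

```python
import math

def odds_ratio(coef):
    """Convert a logit coefficient B into an odds ratio, Exp(B)."""
    return math.exp(coef)

def percent_change_in_odds(coef):
    """Percent change in the odds for a one-unit increase in the variable."""
    return (math.exp(coef) - 1.0) * 100.0

# PLACEHOLDER coefficient, not the estimated value for white.
b_white = 1.0
print(odds_ratio(b_white), percent_change_in_odds(b_white))
```

An odds ratio above 1 corresponds to a positive coefficient, and an odds ratio of exactly 1 means the variable has no effect on the odds.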
e. Find the estimated probability of loan approval for both whites and nonwhites. (Explain the process of your calculation in SPSS)
f. As controls, add the variables hrat, obrat, loanprc, unem, male, married, dep, sch, cosign, chist, pubrec, mortlat1, and mortlat2. What happens to the coefficient of white? Is there still evidence of discrimination against nonwhites? Save the predicted probability values for the whole sample.
g. Interpret the coefficients of white, bankruptcy, and loanprc. (Use odds ratio)
h. Use the estimated logistic regression equation to compute the probability of loan approval for an individual with the following characteristics (Make sure to explain the process in SPSS (or Excel) and formulas you use for the computation).
Hrat = 0.25 Obrat = 0.33 Loanprc = 0.8 Unem = 4 Male = 1 Married = 1 Dep = 2
Sch = 1 Cosign = 1 Chist = 1 pubrec = 0 mortlat1 = 1 mortlat2 = 0 white = 0
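The calculation asked for here applies the logistic function to the linear index built from the estimated coefficients. A minimal Python sketch of that formula follows. The coefficient values are deliberately set to zero as placeholders (so the printed probability is not a result); they would be replaced with the B column of the SPSS coefficient table, and the units of each characteristic should match how the variable is coded in the data set.

```python
import math

def approval_probability(intercept, coefs, person):
    """P(approve = 1) = 1 / (1 + exp(-z)), with z = b0 + sum of b_k * x_k."""
    z = intercept + sum(coefs[name] * value for name, value in person.items())
    return 1.0 / (1.0 + math.exp(-z))

# Characteristics of the individual, taken from the question.
person = {
    "hrat": 0.25, "obrat": 0.33, "loanprc": 0.8, "unem": 4,
    "male": 1, "married": 1, "dep": 2, "sch": 1, "cosign": 1,
    "chist": 1, "pubrec": 0, "mortlat1": 1, "mortlat2": 0, "white": 0,
}

# PLACEHOLDERS: replace with the estimated intercept and coefficients.
placeholder_intercept = 0.0
placeholder_coefs = {name: 0.0 for name in person}

print(approval_probability(placeholder_intercept, placeholder_coefs, person))
```

The same two steps (compute z, then apply the logistic function) can be reproduced in Excel with SUMPRODUCT and EXP.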
i. For the individual with the above characteristics, how do the odds of approval change if loanprc decreases by 10 percentage points?
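As a worked formula, and assuming loanprc is coded as a ratio (so that 10 percentage points is a change of 0.10, consistent with Loanprc = 0.8 above), the change in the odds depends only on the loanprc coefficient, written here as $\beta_{\text{loanprc}}$ (a label introduced for illustration):

```latex
\frac{\text{odds}_{\text{new}}}{\text{odds}_{\text{old}}}
  = \frac{\exp\bigl(z - 0.10\,\beta_{\text{loanprc}}\bigr)}{\exp(z)}
  = \exp\bigl(-0.10\,\beta_{\text{loanprc}}\bigr)
```

Here $z$ is the linear index evaluated at the individual's original characteristics. Because $z$ cancels, the proportional change in the odds is the same for every individual, although the change in the approval probability is not.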
j. Using the validation set, compute the values of sensitivity, specificity, precision, and F1 score corresponding to the confusion matrix created using the cutoff value of 0.6 (for the model in part d). To achieve a specificity of at least 0.50, how much Class 1 error rate must be tolerated?
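The four measures follow directly from the counts in the confusion matrix. The sketch below classifies each case as approved when its predicted probability is at least the cutoff and then computes the measures; the six observations are toy data made up for illustration, not values from the validation set.

```python
def confusion_counts(y_true, y_prob, cutoff=0.6):
    """Count TP, FP, TN, FN when predicting class 1 if prob >= cutoff."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= cutoff else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision, and F1 from the four counts."""
    sensitivity = tp / (tp + fn)   # share of actual 1s predicted as 1
    specificity = tn / (tn + fp)   # share of actual 0s predicted as 0
    precision = tp / (tp + fp)     # share of predicted 1s that are actual 1s
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# TOY DATA for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.7, 0.5, 0.65, 0.2, 0.8]
metrics = classification_metrics(*confusion_counts(y_true, y_prob, cutoff=0.6))
print(metrics)
```

Rerunning the same functions over a range of cutoffs shows how much sensitivity has to be given up to reach a target specificity.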
k. Create the ROC and find the value of AUC using the validation set.
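One way to check the AUC reported by SPSS is its rank interpretation: the AUC equals the share of (approved, not approved) pairs in which the approved case receives the higher predicted probability, with ties counted as one half. The sketch below uses that pairwise definition rather than plotting the ROC curve, and the inputs are toy data for illustration only.

```python
def auc(y_true, y_prob):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# TOY DATA for illustration only.
toy_true = [1, 1, 1, 0, 0, 1]
toy_prob = [0.9, 0.7, 0.5, 0.65, 0.2, 0.8]
print(auc(toy_true, toy_prob))
```

The double loop is quadratic in the sample size, which is acceptable for a validation set of a few hundred cases.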
l. Now, use all the variables in the sample as independent variables. Using the forward selection method, what variables will remain in the model? What variable has the highest explanatory power to predict approval? Save the predicted probabilities for the whole sample.
m. Evaluate the candidate logistic regression models (part k and part g) based on their predictive performance on the validation set. Recommend a final model.
n. Compare the false positive and false negative rates on the validation and test sets for the recommended model. Explain the role of the test set and the implication of the results.