Please follow all the instructions; I need it in RStudio, with a PDF file knitted by RStudio, please...

Question

Please follow all the instructions. I need the work done in RStudio, with a PDF file knitted by RStudio. Please submit it on time.

Modern Computational Statistical Methods
Assignment 2: Due Week 9, 2022
STAT 8178/7178

Instructions: This assignment covers weeks 1 to 8.

1. Due on 5th May 2023.
2. For all the questions, please provide the relevant mathematical derivations, the computer programs (using R software only) and the plots.
3. Please submit on iLearn a single PDF file containing all your work (code, computations, plots, etc.). Other file formats (e.g. Word, html) will NOT be accepted.
4. Try to use Rmarkdown through Rstudio. Rmarkdown is not compulsory, even though it makes results easier to reproduce. Only upload the PDF file.

Question 1: 10 marks

Consider m samples $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.

(a) [2 marks] Show that the following function g(w) is convex: $g(w) = \frac{1}{2} \sum_{i=1}^{m} (y_i - w^T x_i)^2$, where $w \in \mathbb{R}^d$.
(b) [2 marks] Show that the following function h(w) is convex: $h(w) = \frac{1}{2} \gamma \|w\|_2^2$, where $\gamma > 0$.
(c) [2 marks] Using the results from (a) and (b), show that $f(w) = g(w) + h(w)$ is convex.
(d) [4 marks] Solve the optimization problem $\min_{w \in \mathbb{R}^d} f(w)$ by expressing the minimizer w in terms of the data matrix of x and the vector y.

Question 2: 21 marks

In this question, we consider breast cancer prediction where the label, or outcome variable, diagnosis has been coded as "M" in case of malignant lumps and "B" in case of benign lumps. A popular dataset in this context is called the Wisconsin Breast Cancer Dataset and is based on clinical data released in the early 1990's. The feature vector x is composed of continuous variables such as radius mean, texture mean, etc., each potentially affecting the probability of malignancy. We use the 80-20 splitting strategy, where the data are split randomly between training data and testing data. We want to build a prediction model to predict malignant lumps based on the most important features.

1. [1 mark] Write down the logistic model for this task.
2. [1 mark] Load the file Breast.Rdata using load("Breast.Rdata"). This code provides two data frames: train and test. For both data sets, the first column (named "diagnosis") is the categorical outcome. How many attributes are available to predict the outcome? How many samples are included in the two data frames? What is the distribution of the outcome variable in the two data sets?
3. [1 mark] Run the logistic model (previously defined) using all attributes (Hint: use the glm function and specify the family argument). Estimate your model using the train dataset.
4. [2 marks] Provide the confusion matrix for the train set of your classifier using a threshold of 0.5, and provide the accuracy of the model for the train set.
5. [2 marks] Provide the confusion matrix for the test set of your classifier using a threshold of 0.5, and provide the accuracy of the model for the test set.
6. [2 marks] Why is the accuracy for the test set lower than the one for the training set? Which accuracy should be reported to assess the performance of your classifier?
7. [2 marks] We want a parsimonious model, meaning that we want to keep only the most relevant features. One way to tackle this challenge is to run a penalized regression model. One scientist is struggling to choose between a ridge and a lasso regression model. Give some justification for choosing between the two strategies.
8. [2 marks] We want to run a penalized logistic regression using a lasso penalty. To do so, use the glmnet R package with the cv.glmnet and glmnet functions (Hint: do not forget to use "family=binomial"). Choose the best tuning parameter lambda using a K-fold cross-validation strategy (Hint: use the argument type.measure="class" for choosing lambda to get the smallest misclassification error). Plot the cross-validation error against the log of lambda.
9. [2 marks] Run the penalized logistic regression for the lambda you chose at the previous step. How many attributes are still in the model?
10. [2 marks] For this model and a threshold of 0.5, define your classifier (Hint: use the function with argument type="response"). Report the confusion matrix on the test set and the accuracy of the model.
11. [2 marks] Present on the same plot the ROC curves for the logistic model and the penalized logistic model.
12. [2 marks] Report the two AUCs (Area Under the ROC Curve). What is your preference between the two models?

Question 3: 7 marks

Let $X_1, \ldots, X_n$ be a random sample from a population with the following Bernoulli distribution: $P(X = x) = \theta^x (1 - \theta)^{1 - x}$, $x = 0$ or $1$, $0 \le \theta \le 0.5$. We know that the maximum likelihood estimator (MLE) for θ is $\hat{\theta} = \min\{\bar{X}, 0.5\}$, and the method of moments estimator (MOM) for θ is $\tilde{\theta} = \bar{X}$.

1. [1 mark] Generate a random sample with 500 observations from the above distribution when θ = 0.45.
2. [2 marks] Compute bootstrap estimates of θ using the MLE and MOM estimates. Assume 1000 replications.
3. [2 marks] Compare the performance of $\hat{\theta}$ and $\tilde{\theta}$ using your bootstrap samples by computing the bias and standard error.
4. [2 marks] Find bootstrap percentile intervals for $\hat{\theta}$ and $\tilde{\theta}$ using your bootstrap samples. Compare the results. Use the significance level α = 0.05.

Question 4: 7 marks

The remiss.csv data set contains the remission times, in weeks, for 42 leukemia patients. Some of the patients were treated with the drug 6-mercaptopurine (group = 0), and the rest were part of the control group (group = 1).

1. [1 mark] Create a box plot of the remission times for the treatment and control groups. Compare the remission times of the two groups.
2. [1 mark] Use a normal probability plot to check whether the distribution of the remission times for each group is Normal.
3. [1 mark] Write the null hypothesis and alternative hypothesis to test for the equality of means between the two groups.
4. [4 marks] Perform the above hypothesis test using Monte Carlo simulation to get the critical values. Estimate the p-value. Do you reject H0 or H1? You can assume the distribution of the remission times for each group is Normal and their population variances are equal. Use the significance level α = 0.05.
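
The bootstrap steps in Question 3 can be sketched in base R as follows. This is only a minimal illustration, not the accepted answer: the sample size, θ, the number of replications and α come from the question, while the seed, the object names and the choice of a nonparametric bootstrap (resampling the observed data with replacement) are assumptions made here.

```r
# Sketch for Question 3. The seed and all object names are illustrative
# assumptions; n, theta, B and alpha follow the question text.
set.seed(8178)

n     <- 500    # number of observations
theta <- 0.45   # true parameter
B     <- 1000   # bootstrap replications
alpha <- 0.05   # significance level

# Estimators as defined in the question
mle <- function(x) min(mean(x), 0.5)  # constrained MLE
mom <- function(x) mean(x)            # method of moments

# Step 1: one sample from Bernoulli(theta)
x <- rbinom(n, size = 1, prob = theta)

# Step 2: nonparametric bootstrap, resampling x with replacement B times.
# Both estimators are functions of the resample mean, so the same
# resamples are reused for both.
boot_means <- replicate(B, mean(sample(x, size = n, replace = TRUE)))
boot_mle   <- pmin(boot_means, 0.5)
boot_mom   <- boot_means

# Step 3: bootstrap bias (mean of replicates minus the estimate on x)
# and bootstrap standard error (sd of replicates)
summarise_boot <- function(boot, est) {
  c(estimate = est, bias = mean(boot) - est, se = sd(boot))
}
results <- rbind(MLE = summarise_boot(boot_mle, mle(x)),
                 MOM = summarise_boot(boot_mom, mom(x)))

# Step 4: percentile intervals at level 1 - alpha
probs <- c(alpha / 2, 1 - alpha / 2)
ci <- rbind(MLE = quantile(boot_mle, probs),
            MOM = quantile(boot_mom, probs))

print(results)
print(ci)
```

Because the MLE truncates the sample mean at 0.5, its bootstrap replicates can never exceed 0.5, so its standard error cannot be larger than that of the MOM estimator and its percentile interval is capped at 0.5.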

breast-p3qlxyeg.gz remiss-gtfybu5y.csv assignment2stat817871782023-othmi0be.pdf

Mohd · Accepted Answer

2023-04-26 
library(magrittr) 
library(dplyr) 
load("breast.Rdata") 
1. 1 mark Write down the logistic model for this task. 
response variable = diagnosis 
family = binomial 
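
The two lines above name only the response and the family. Written out in full, the logistic model for this task might look like the following. The notation is an assumption introduced here and not taken from the original answer: $Y_i = 1$ denotes a malignant lump, $x_i \in \mathbb{R}^p$ is the feature vector of sample $i$, and $\beta_0, \beta$ are the coefficients to estimate.

```latex
% Assumed notation: Y_i = 1 for "M" (malignant), Y_i = 0 for "B" (benign)
Y_i \mid x_i \sim \mathrm{Bernoulli}(\pi_i), \qquad
\pi_i = P(Y_i = 1 \mid x_i)
      = \frac{\exp(\beta_0 + x_i^T \beta)}{1 + \exp(\beta_0 + x_i^T \beta)},
\qquad \text{equivalently} \qquad
\log \frac{\pi_i}{1 - \pi_i} = \beta_0 + x_i^T \beta .
```

When the response passed to glm is a factor, R treats the first level as failure, so with levels "B" and "M" in that order the fitted probabilities are for "M".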
mod <- ...
##                          Estimate Std. Error z value Pr(>|z|) 
## (Intercept)             -1.947e+03  1.834e+06  -0.001    0.999 
## radius_mean             -6.292e+01  5.539e+05   0.000    1.000 
## texture_mean            -4.053e+00  7.211e+03  -0.001    1.000 
## perimeter_mean           1.410e+01  8.570e+04   0.000    1.000 
## area_mean               -4.277e-01  1.794e+03   0.000    1.000 
## smoothness_mean          2.603e+03  2.667e+06   0.001    0.999 
## compactness_mean        -1.881e+03  3.620e+06  -0.001    1.000 
## concavity_mean           2.672e+03  1.039e+06   0.003    0.998 
## concave.points_mean     -1.770e+03  2.392e+06  -0.001    0.999 
## symmetry_mean           -8.507e+01  1.
