R programming: PROG8430 – Data Analysis, Modeling and Algorithms Assignment 3 Multivariate Linear...

Question

PROG8430 – Data Analysis, Modeling and Algorithms
Assignment 3
Multivariate Linear Regression
DUE BEFORE MAR 28, 2021; 10PM

1. Submission Guidelines

All assignments must be submitted via the econestoga course website before the due date into the assignment folder.

You may make multiple submissions, but only the most current submission will be graded.

SUBMISSIONS
In the Assignment 3 Folder submit:
1. Your R Code
2. Your report in Word, following the template from our MLR lecture and in the Assignment folder.

DO NOT PUT THE DOCUMENTS INTO A ZIP FILE!

All variables in your code must abide by the naming convention [variable_name]_[initials]. For example, my variable for State would be State_DM.

You may only use base R (i.e. no additional packages may be used).

THIS IS AN INDIVIDUAL ASSIGNMENT. UNAUTHORIZED COLLABORATION IS AN ACADEMIC OFFENSE. Please see the Conestoga College Academic Integrity Policy for details.

2. Grading

This assignment is worth 12.5% of your total grade in the course and you can expect it to take five to eight hours. It is out of 25 marks overall.

Assignments submitted after 10pm will be reduced 20%. Assignments received after 8:00am the morning after the due date will receive a mark of 0%.

Assignments which do not follow the submission instructions may have marks deducted.

3. Data

Each student will be using the study dataset:

STUDY DATASET: PROG8430_Assign_MLR.Rdata

Appendix one contains a data dictionary for the study file.

4. Background

A survey of 2120 residents of Canada was conducted to determine the key factors associated with political engagement. A variety of variables were measured and recorded, including some tests the respondents were asked to complete. Appendix 1 contains the data dictionary for the data set. One group of respondents (“Treat”) was given additional education on political matters while the other (“Control”) was not.

Your task will be to use multiple linear regression to determine the factors which contribute to Political Awareness (variable: Pol).

All of the tasks have been completed using the examples presented in class. A careful review of your notes from the lectures should give you everything you need to complete these tasks.

5. Assignment Tasks

Nbr, Description, Marks

1 Data Transformation (2 marks)
As demonstrated in class, transform any variables that are required to conduct the regression analysis.

2 Reduce Dimensionality (3 marks)
1. Apply the Missing Value Filter to remove appropriate columns of data.
2. Apply the Low Variance Filter to remove appropriate columns of data.
3. Apply the High Correlation Filter to remove appropriate columns of data.

3 Outliers (2 marks)
1. Create boxplots of all relevant variables (i.e. non-binary) to determine outliers.
2. Comment on any outliers you see and deal with them appropriately.

4 Exploratory Analysis (2 marks)
1. Correlations: Create both numeric and graphical correlations (as demonstrated in class) and comment on noteworthy correlations you observe. Are these surprising? Do they make sense?

5 Simple Linear Regression (4 marks)
1. Create a simple linear regression model using Pol as the dependent variable and Score as the independent. Create a scatter plot of the two variables and overlay the regression line.
2. Create a simple linear regression model using Pol as the dependent variable and income as the independent. Create a scatter plot of the two variables and overlay the regression line.
3. Compare the models. Which model is superior? Why?

6 Model Development (4 marks)
As demonstrated in class, create two models using two automatic variable selection techniques discussed in class (Full, Backward). For each model interpret and comment on the five main measures we discussed in class:
1. F-Stat
2. R-Squared value
3. Residuals
4. Significant variables
5. Variable Co-Efficients

7 Model Evaluation – Verifying Assumptions (4 marks)
1. For all three models (as discussed and demonstrated in class) evaluate the main assumptions of regression: Error terms mean of zero, constant variance and normally distributed.

Final Recommendation (1 mark)
1. Based on your preceding analysis, recommend which of the three models should be used.
NOTE – Even if none of the models meet all the assumptions of regression, choose the best of the three. In subsequent classes we will learn how to deal with these issues.

Professionalism, Clarity and Proper Citations (3 marks)

APPENDIX ONE: STUDY FILE DATA

Variable    Description
id          UserID (unique to each respondent)
group       Treatment or Control group
hs.grad     Graduated High School (Y or N)
nation      Nationality (Region)
gender      M/F
age         Age in Years
m.status    Marital Status
political   Political Affiliation
n.child     Number of Children
income      Annual Household Income
food        Pct of Income to Food
housing     Pct of Income to Housing
other       Pct of Income to Other Expenses
score       Score on Political Awareness Test
scr         Standardized Score Test
time1       Pct of Time Taken on Test
time2       Time Taken on Section 1 (Standardized)
time3       Time Taken on Section 2 (Standardized)
Pol         Measure of Political Involvement

PROG8430 – Data Analysis, Modeling and Algorithms
LECTURE 8 – REGRESSION ANALYSIS

Introduction to Simple Linear Regression (SLR)

From inference to prediction.

[Flowchart: types of data analysis question, with one region marked "FOCUS FOR THIS LECTURE"]
• Did you summarize the data? NO: STOP! Not a data analysis. YES: continue.
• Did you report the summaries without interpretation? YES: Descriptive. NO: continue.
• Did you quantify whether your discoveries will hold in a new sample? NO: Exploratory. YES: continue.
• Are you trying to determine how changing the average of one measurement affects another?
  • NO: Are you trying to predict measurements for individuals? NO: Inferential. YES: Predictive.
  • YES: Is the effect you are seeking average or deterministic? Average: Causal. Deterministic: Mechanistic.

Prediction vs. Explanation vs. Anomaly Detection

Prediction
• Predictive modeling is the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations (e.g. the output value (Y) for new observations given their input values (X)).

Explanation
• Causal or non-causal explanation and explanatory modeling is the use of statistical models for testing causal explanations.

Anomaly Detection
• Identifies unusual or atypical patterns (outliers). E.g.
  • Fraud detection in various operating environments
  • Intrusion detection (unusual patterns in network traffic – potential hack?)
  • Identifying tumors in health imaging (e.g. MRI scans)

Adapted from Shmueli, Galit, To Explain or Predict?, Statistical Science, 2010, Vol. 25, No. 3

Simple Linear Regression

Models the relationship between the magnitude of one variable and another.

Correlation
• Measures the strength of the relationship.
• Y and X are interchangeable.

Regression
• Quantifies the relationship.
• Y is predicted using the value of X.

$Y = \beta_0 + \beta_1 X + \varepsilon$

Examples in R

# Read "comma separated value" files (".csv")
# Systolic Blood Pressure Dataset
Systolic

• As always, let’s look at statistical measures as well as graphical representations.

> format(TdrSum,digits=2)
              Fwn   Adt    Prc   Sev
nbr.val       8.00  8.00   8.00  8.00
nbr.null      0.00  0.00   0.00  0.00
nbr.na        0.00  0.00   0.00  0.00
min           1.90  6.80  10.60  1.00
max           3.40  9.70  14.10  5.00
range         1.50  2.90   3.50  4.00
sum          20.20 67.60  96.30 23.00
median        2.35  8.60  11.90  3.00
mean          2.53  8.45  12.04  2.88
SE.mean       0.20  0.38   0.43  0.44
CI.mean.0.95  0.48  0.90   1.03  1.04
var           0.33  1.16   1.51  1.55
std.dev       0.57  1.08   1.23  1.25
coef.var      0.23  0.13   0.10  0.43

> SysSum
> format(SysSum,digits=2)
            BP   Age   Wgt
nbr.val  11.00 11.00 11.00
nbr
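
As a minimal sketch of the simple linear regression model from the lecture notes above, $Y = \beta_0 + \beta_1 X + \varepsilon$, the base R code below fits a line with lm() and overlays it on a scatter plot. The data are simulated, and the names x, y and fit are placeholders chosen for illustration; nothing here comes from the study dataset.

```r
# Simulated data (an assumption for illustration, not the study file)
set.seed(42)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)   # true beta0 = 2, true beta1 = 0.5

# Fit Y = beta0 + beta1 * X + error by ordinary least squares
fit <- lm(y ~ x)

# Estimated intercept and slope, and the share of variance explained
print(coef(fit))
print(summary(fit)$r.squared)

# Scatter plot with the fitted regression line overlaid
plot(x, y, main = "Simulated data with fitted regression line")
abline(fit, col = "red")
```

For the assignment itself, the same pattern would presumably apply with Pol as the response and Score or income as the predictor, passing the data frame through the data argument of lm().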

Suraj · Accepted Answer

Assignment
Topic: Multiple regression analysis using R
Submitted To:
Submitted BY:
Date:
1.
Data transformation: The first step is to load the data file in RStudio and store the data set in a variable. The data set contains many categorical variables, so we convert all of them to numeric values.
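
The answer does not show how the conversion was done. One possible approach in base R is sketched below, assuming a small invented data frame (survey) that borrows a few column names from the data dictionary; the values and the helper name is_cat are hypothetical.

```r
# Toy data standing in for the study file (values invented for illustration)
survey <- data.frame(
  group   = c("Treat", "Control", "Treat", "Control"),
  hs.grad = c("Y", "N", "Y", "Y"),
  gender  = c("M", "F", "F", "M"),
  age     = c(34, 51, 27, 45),
  stringsAsFactors = FALSE
)

# Flag the character or factor columns
is_cat <- sapply(survey, function(col) is.character(col) || is.factor(col))

# Replace each categorical column with integer codes
# (as.factor() sorts the levels alphabetically, so "Control" = 1, "Treat" = 2)
survey[is_cat] <- lapply(survey[is_cat], function(col) as.numeric(as.factor(col)))

str(survey)
```

Integer codes are harmless for two-level variables such as group or hs.grad, but for nominal variables with more than two levels (nation, m.status) they impose an artificial ordering. Leaving such columns as factors lets lm() create dummy variables automatically.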
2.
Reduce dimensionality: We checked for NA values; only one variable has them. Two variables are removed from the data set at this step: the variable with NA values and the serial number, which is unimportant for the regression analysis.
There is no need to check for low variance.
We then ran a correlation analysis. One or two pairs of independent variables are highly correlated with each other, and many other variables have very low correlation with the dependent variable, so we removed all of those variables.
The variables removed are id, time1, time2, time3, m.status, other, housing, and scr.
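
The three filters named in the assignment (missing value, low variance, high correlation) can be sketched in base R as follows. This is an illustration on simulated data, not the code used for this answer: the data frame toy, its column names, and the cut-offs (50% missing, variance near zero, absolute correlation above 0.9) are all assumptions.

```r
# Simulated data with one mostly-missing, one constant and one redundant column
set.seed(1)
n <- 50
toy <- data.frame(
  a         = rnorm(n),
  const     = rep(1, n),                 # zero variance
  mostly_na = c(rep(NA, 45), rnorm(5))   # 90% missing
)
toy$b <- 2 * toy$a + rnorm(n, sd = 0.01) # almost a copy of column a
toy$c <- rnorm(n)

# 1. Missing value filter: drop columns with more than 50% NA (assumed cut-off)
na_share <- colMeans(is.na(toy))
toy <- toy[, na_share <= 0.5, drop = FALSE]

# 2. Low variance filter: drop columns whose variance is (near) zero
col_var <- sapply(toy, var, na.rm = TRUE)
toy <- toy[, col_var > 1e-8, drop = FALSE]

# 3. High correlation filter: for each highly correlated pair,
#    drop the column that appears later in the data frame
cors <- abs(cor(toy, use = "pairwise.complete.obs"))
cors[upper.tri(cors, diag = TRUE)] <- 0
drop_cols <- names(which(apply(cors, 1, max) > 0.9))
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(toy)
```

Which member of a correlated pair to drop is a judgment call; keeping the earlier column is only a convention of this sketch.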
3.
Outlier detection: Next, we check for outliers.
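
A minimal base R sketch of boxplot-based outlier detection is given below. The vector age and its values are invented for illustration, and removing the flagged points is only one of several ways to deal with outliers; it is not necessarily what was done in this answer.

```r
# Hypothetical numeric column with one implausible value
age <- c(23, 35, 41, 29, 52, 38, 47, 31, 44, 460)

# boxplot() draws the plot; boxplot.stats() returns the points that fall
# beyond 1.5 times the interquartile range from the hinges
boxplot(age, horizontal = TRUE, main = "Age")
out_vals <- boxplot.stats(age)$out

# One option: drop the flagged observations
age_clean <- age[!(age %in% out_vals)]

print(out_vals)
print(length(age_clean))
```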
