PROG8430 – Data Analysis, Modeling and Algorithms Assignment 3 Multivariate Linear Regression DUE BEFORE MAR 28, 2021; 10PM 1. Submission Guidelines All assignments must be submitted via the...

R programming


PROG8430 – Data Analysis, Modeling and Algorithms
Assignment 3: Multivariate Linear Regression
DUE BEFORE MAR 28, 2021; 10PM

1. Submission Guidelines

All assignments must be submitted via the eConestoga course website before the due date into the assignment folder. You may make multiple submissions, but only the most current submission will be graded.

SUBMISSIONS
In the Assignment 3 Folder submit:
1. Your R Code
2. Your report in Word, following the template from our MLR lecture and in the Assignment folder.

DO NOT PUT THE DOCUMENTS INTO A ZIP FILE!

All variables in your code must abide by the naming convention [variable_name]_[initials]. For example, my variable for State would be State_DM.

You may only use base R (i.e. no additional packages may be used).

THIS IS AN INDIVIDUAL ASSIGNMENT. UNAUTHORIZED COLLABORATION IS AN ACADEMIC OFFENSE. Please see the Conestoga College Academic Integrity Policy for details.

2. Grading

This assignment is worth 12.5% of your total grade in the course and you can expect it to take five to eight hours. It is out of 25 marks overall.

Assignments submitted after 10pm will be reduced by 20%. Assignments received after 8:00am the morning after the due date will receive a mark of 0%. Assignments which do not follow the submission instructions may have marks deducted.

3. Data

Each student will be using the study dataset:
STUDY DATASET: PROG8430_Assign_MLR.Rdata
Appendix One contains a data dictionary for the study file.

4. Background

A survey of 2120 residents of Canada was conducted to determine the key factors associated with political engagement. A variety of variables were measured and recorded, including some tests the respondents were asked to complete. Appendix One contains the data dictionary for the data set. One group of respondents ("Treat") were given additional education on political matters while the other ("Control") were not.

Your task will be to use multiple linear regression to determine the factors which contribute to Political Awareness (variable: Pol). All of the tasks have been completed using the examples presented in class. A careful review of your notes from the lectures should give you everything you need to complete these tasks.

5. Assignment Tasks

Task 1: Data Transformation (2 marks)
As demonstrated in class, transform any variables that are required to conduct the regression analysis.

Task 2: Reduce Dimensionality (3 marks)
1. Apply the Missing Value Filter to remove appropriate columns of data.
2. Apply the Low Variance Filter to remove appropriate columns of data.
3. Apply the High Correlation Filter to remove appropriate columns of data.

Task 3: Outliers (2 marks)
1. Create boxplots of all relevant variables (i.e. non-binary) to determine outliers.
2. Comment on any outliers you see and deal with them appropriately.

Task 4: Exploratory Analysis (2 marks)
1. Correlations: Create both numeric and graphical correlations (as demonstrated in class) and comment on noteworthy correlations you observe. Are these surprising? Do they make sense?

Task 5: Simple Linear Regression (4 marks)
1. Create a simple linear regression model using Pol as the dependent variable and Score as the independent. Create a scatter plot of the two variables and overlay the regression line.
2. Create a simple linear regression model using Pol as the dependent variable and income as the independent. Create a scatter plot of the two variables and overlay the regression line.
3. Compare the models. Which model is superior? Why?

Task 6: Model Development (4 marks)
As demonstrated in class, create two models using two automatic variable selection techniques discussed in class (Full, Backward). For each model interpret and comment on the five main measures we discussed in class:
1. F-Stat
2. R-Squared value
3. Residuals
4. Significant variables
5. Variable Coefficients

Task 7: Model Evaluation – Verifying Assumptions (4 marks)
1. For all three models (as discussed and demonstrated in class) evaluate the main assumptions of regression: error terms have a mean of zero, constant variance, and are normally distributed.

Final Recommendation (1 mark)
1. Based on your preceding analysis, recommend which of the three models should be used.
NOTE – Even if none of the models meet all the assumptions of regression, choose the best of the three. In subsequent classes we will learn how to deal with these issues.

Professionalism, Clarity and Proper Citations (3 marks)

APPENDIX ONE: STUDY FILE DATA

Variable    Description
id          UserID (unique to each respondent)
group       Treatment or Control group
hs.grad     Graduated High School (Y or N)
nation      Nationality (Region)
gender      M/F
age         Age in Years
m.status    Marital Status
political   Political Affiliation
n.child     Number of Children
income      Annual Household Income
food        Pct of Income to Food
housing     Pct of Income to Housing
other       Pct of Income to Other Expenses
score       Score on Political Awareness Test
scr         Standardized Score Test
time1       Pct of Time Taken on Test
time2       Time Taken on Section 1 (Standardized)
time3       Time Taken on Section 2 (Standardized)
Pol         Measure of Political Involvement

PROG8430 – Data Analysis, Modeling and Algorithms
LECTURE 8 – REGRESSION ANALYSIS

Introduction to Simple Linear Regression (SLR)
From inference to prediction.

[Flowchart: types of data analysis. The slide marks one region as the focus for this lecture.]
1. Did you summarize the data? If no: STOP! Not a data analysis.
2. Did you report the summaries without interpretation? If yes: Descriptive.
3. Did you quantify whether your discoveries will hold in a new sample? If no: Exploratory.
4. Are you trying to determine how changing the average of one measurement affects another? If no, go to 5; if yes, go to 6.
5. Are you trying to predict measurements for individuals? If no: Inferential. If yes: Predictive.
6. Is the effect you are seeking average or deterministic? Average: Causal. Deterministic: Mechanistic.

Prediction vs. Explanation vs. Anomaly Detection

Prediction
• Predictive modeling is the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations (e.g. the output value (Y) for new observations given their input values (X)).

Explanation
• Causal or non-causal explanation; explanatory modeling is the use of statistical models for testing causal explanations.

Anomaly Detection
• Identifies unusual or atypical patterns (outliers). E.g.
• Fraud detection in various operating environments
• Intrusion detection (unusual patterns in network traffic – potential hack?)
• Identifying tumors in health imaging (e.g. MRI scans)

Adapted from Shmueli, Galit, "To Explain or Predict?", Statistical Science, 2010, Vol. 25, No. 3

Simple Linear Regression
Models the relationship between the magnitude of one variable and another.

Correlation
• Measures the strength of the relationship.
• Y and X are interchangeable

Regression
• Quantifies the relationship
• Y is predicted using the value of X

$Y = \beta_0 + \beta_1 X + \varepsilon$
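The contrast between correlation and regression above can be sketched in base R. This is only an illustration on simulated data: the true intercept (10), slope (2), noise level, and all variable names are assumptions made for the example, not values from the lecture datasets.

```r
# Simulated data: Y = 10 + 2 * X + noise (all values are arbitrary assumptions)
set.seed(8430)
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 10 + 2 * x + rnorm(n, mean = 0, sd = 1)

# Correlation measures strength only and is symmetric in X and Y
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Regression is directional: Y is predicted from X
fit <- lm(y ~ x)
coef_fit <- coef(fit)   # estimated intercept (beta0) and slope (beta1)
print(coef_fit)

# With an intercept in the model, least-squares residuals average to zero
resid_mean <- mean(residuals(fit))

# Predict Y for new X values
new_x <- data.frame(x = c(2.5, 7.5))
pred <- predict(fit, newdata = new_x)
print(pred)
```

Swapping the arguments of `cor()` changes nothing, whereas `lm(x ~ y)` would fit a different line than `lm(y ~ x)`, which is the sense in which only regression treats Y as the predicted variable.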
Examples in R

# Read "comma separated value" files (".csv")
# Systolic Blood Pressure Dataset
Systolic <- read.csv("c:/users/david/documents/data/systolic.csv", header=TRUE, sep=",")

# Read "comma separated value" files (".csv")
# Thunder Basin Dataset
Thunder <- read.csv("c:/users/david/documents/data/thunderbasin.csv", header=TRUE, sep=",")

• prog8430-slr_demo.r is attached at the website and we will be using it for this lecture.
• Also, download thunderbasin1.csv and systolic1.csv files

Systolic Blood Pressure Data
The data (X1, X2, X3) are for each patient.
X1 = systolic blood pressure
X2 = age in years
X3 = weight in pounds

Thunder Basin Antelope Study
The data (X1, X2, X3, X4) are for each year.
X1 = spring fawn count/100
X2 = size of adult antelope population/100
X3 = annual precipitation (inches)
X4 = winter severity index (1 = mild, 5 = severe)

NOTE – These are very small datasets used simply for demonstration purposes!

Rename variables to make them more convenient

# Rename variables to something meaningful
names(Systolic) <- c("BP", "Age", "Wgt")
str(Systolic)
'data.frame': 11 obs. of 3 variables:
 $ BP : int 132 143 153 162 154 168 137 149 159 128 ...
 $ Age: int 52 59 67 73 64 74 54 61 65 46 ...
 $ Wgt: int 173 184 194 211 196 220 188 188 207 167 ...

names(Thunder) <- c("Fwn", "Adt", "Prc", "Sev")
str(Thunder)
'data.frame': 8 obs. of 4 variables:
 $ Fwn: num 2.9 2.4 2 2.3 3.2 ...
 $ Adt: num 9.2 8.7 7.2 8.5 9.6 ...
 $ Prc: num 13.2 11.5 10.8 12.3 12.6 ...
 $ Sev: int 2 3 4 2 3 5 1 3

Two research questions for each dataset
1.1 Is there a relationship between age and blood pressure?
1.2 Can we quantify it?
2.1 Is there a relationship between spring fawn count and adult population?
2.2 Can we quantify it?

Examine summary statistics
• As always, let's look at statistical measures as well as graphical representations.

> TdrSum <- stat.desc(Thunder)
> format(TdrSum, digits=2)
               Fwn    Adt    Prc    Sev
nbr.val       8.00   8.00   8.00   8.00
nbr.null      0.00   0.00   0.00   0.00
nbr.na        0.00   0.00   0.00   0.00
min           1.90   6.80  10.60   1.00
max           3.40   9.70  14.10   5.00
range         1.50   2.90   3.50   4.00
sum          20.20  67.60  96.30  23.00
median        2.35   8.60  11.90   3.00
mean          2.53   8.45  12.04   2.88
SE.mean       0.20   0.38   0.43   0.44
CI.mean.0.95  0.48   0.90   1.03   1.04
var           0.33   1.16   1.51   1.55
std.dev       0.57   1.08   1.23   1.25
coef.var      0.23   0.13   0.10   0.43

> SysSum <- stat.desc(Systolic)
> format(SysSum, digits=2)
               BP    Age    Wgt
nbr.val     11.00  11.00  11.00
nbr...
Answered Same Day, Mar 27, 2021

Suraj answered on Mar 28 2021
Assignment
Topic: Multiple regression analysis using R
Submitted To:
Submitted By:
Date:
1. Data transformation: The first step is to load the data file into RStudio and store the data set in a variable. The data set contains many categorical variables, so we transform all of them into numeric values.
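A minimal base R sketch of this kind of transformation follows. The data frame is a small made-up stand-in for the study file (the real file would be loaded with `load()`), the column names are borrowed from the data dictionary, and the `_XX` suffix is a placeholder for the required initials.

```r
# Hypothetical stand-in for the study data; the values are invented for illustration
survey_XX <- data.frame(
  group   = c("Treat", "Control", "Treat", "Control"),
  hs.grad = c("Y", "N", "Y", "Y"),
  gender  = c("M", "F", "F", "M"),
  age     = c(34, 51, 27, 45),
  stringsAsFactors = FALSE
)

# Identify the categorical (character or factor) columns
is_cat_XX <- sapply(survey_XX, function(col) is.character(col) || is.factor(col))

# Replace each categorical column with the integer codes of its factor levels
# (levels are sorted alphabetically by default, e.g. Control = 1, Treat = 2)
survey_num_XX <- survey_XX
survey_num_XX[is_cat_XX] <- lapply(survey_XX[is_cat_XX],
                                   function(col) as.integer(as.factor(col)))

str(survey_num_XX)
```

Integer codes are unproblematic for two-level variables, but for a variable with several unordered categories (such as nation) they impose an arbitrary ordering, so leaving such columns as factors and letting `lm()` create dummy variables may be the safer choice.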
2. Reduce dimensionality: We checked for NA values and found them in only one variable. Two variables are removed at this step: the variable containing NA values, and the serial number (id), which carries no information for the regression analysis.
There is no need to check for low variance.
Next we run the correlation analysis. One or two independent variables are highly correlated with one another, and many other variables have very low correlation with the dependent variable, so we removed all of those variables.
The removed variables are id, time1, time2, time3, m.status, other, housing, and scr.
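For reference, the three filters named in the assignment (missing value, low variance, high correlation) could be sketched in base R as follows. The data are simulated, and the thresholds (50% missing, variance near zero, |r| above 0.9) are assumptions chosen for the example rather than values from the course; the `_XX` suffix stands in for the required initials.

```r
# Simulated numeric data with one problem column per filter (illustration only)
set.seed(1)
n <- 100
base_XX <- rnorm(n)
dat_XX <- data.frame(
  a = base_XX,
  b = base_XX * 2 + rnorm(n, sd = 0.01),  # nearly collinear with a
  c = rnorm(n),
  d = rep(5, n),                          # constant column
  e = c(rep(NA, 60), rnorm(40))           # 60% missing
)

# 1. Missing value filter: drop columns with more than 50% NA
na_share_XX <- colMeans(is.na(dat_XX))
dat_XX <- dat_XX[, na_share_XX <= 0.5, drop = FALSE]

# 2. Low variance filter: drop columns that are (nearly) constant
col_var_XX <- sapply(dat_XX, var, na.rm = TRUE)
dat_XX <- dat_XX[, col_var_XX > 1e-8, drop = FALSE]

# 3. High correlation filter: for each pair with |r| > 0.9, drop the later column
cor_XX <- abs(cor(dat_XX, use = "pairwise.complete.obs"))
cor_XX[lower.tri(cor_XX, diag = TRUE)] <- 0   # keep each pair once
drop_XX <- names(which(apply(cor_XX, 2, max) > 0.9))
dat_XX <- dat_XX[, !(names(dat_XX) %in% drop_XX), drop = FALSE]

names(dat_XX)
```

Running the variance filter before the correlation filter avoids the warning `cor()` gives for a constant column. In a real analysis the dependent variable would be kept out of the high correlation filter, since a strong correlation with Pol is desirable rather than redundant.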
3. Outlier detection: Next we check for outliers, where only two...
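One base R way to flag boxplot outliers is sketched below. The income values are simulated with a single planted extreme value, and removing flagged rows is only one possible treatment, so this is an illustration rather than the method used in the answer; the `_XX` suffix stands in for the required initials.

```r
# Simulated incomes with one planted extreme value (illustration only)
set.seed(42)
income_XX <- c(rnorm(50, mean = 60000, sd = 8000), 250000)

# boxplot.stats() returns the points that boxplot() would draw beyond the
# whiskers (more than 1.5 box lengths from the box)
out_vals_XX <- boxplot.stats(income_XX)$out
is_out_XX <- income_XX %in% out_vals_XX

# Visual check
boxplot(income_XX, horizontal = TRUE, main = "Income (simulated)")

# One possible treatment: drop the flagged observations
income_clean_XX <- income_XX[!is_out_XX]
length(income_XX) - length(income_clean_XX)   # number of points removed
```

Whether a flagged point should be removed, capped, or kept depends on whether it looks like a data-entry error or a genuine but unusual respondent, which is why the assignment asks for a comment on each outlier.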