OVERVIEW: This is a multiple linear regression analysis assignment.
The number of new daily confirmed cases is the response variable y; the predictors are daily average temperature (x1), dew point temperature (x2), sea-level pressure (x3), wind speed (x4), and humidity (x5).
I have pooled the data for 5 countries into CSV files. The screenshot shows the 3 research questions that need to be answered for each country (15 models overall).
Please find the appropriate linear model using R, performing diagnosis and remedy if there are any violations of the linear model assumptions. *** Include the preliminary linear model as well as the modified linear model (if applicable) ***
I included a sample project PowerPoint for reference.
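
As a rough illustration of the preliminary model described above (y regressed on x1 through x5), here is a minimal least-squares sketch. It is written in plain Python only so that it is self-contained; the assignment itself calls for R. The data are synthetic, and the coefficient values, value ranges, and helper names (solve_linear_system, fit_ols) are assumptions made for illustration, not taken from the assignment files.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by the normal equations.

    rows holds the predictor values per observation (no intercept column).
    Returns (coefficients, residuals, R squared).
    """
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    sse = sum(e ** 2 for e in residuals)
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)
    return beta, residuals, 1.0 - sse / sst

# Synthetic stand-in data (hypothetical): temperature, dew point,
# pressure anomaly, wind speed, humidity.
random.seed(0)
true_beta = [50.0, -1.5, 0.8, 0.2, 2.0, -0.3]  # assumed, for illustration only
rows = [[random.uniform(0, 30), random.uniform(-5, 20), random.uniform(-10, 30),
         random.uniform(0, 15), random.uniform(20, 100)] for _ in range(200)]
y = [true_beta[0] + sum(b * v for b, v in zip(true_beta[1:], r)) + random.gauss(0, 5)
     for r in rows]

beta, residuals, r2 = fit_ols(rows, y)
print("estimated coefficients:", [round(b, 3) for b in beta])
print("R squared:", round(r2, 4))
```

The residuals returned here are what the diagnosis step would examine (for example, plotted against fitted values to look for non-constant variance) before deciding whether a remedy such as a transformation is needed.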


COVID Behavior in a Racially and Politically Diverse State
An Insight into the COVID-19 Pandemic for the State of Texas

Introduction and Research Questions

Research Question
- Seeks to determine how each of the 254 counties has fared with COVID to date
- General research question: Are there any explanatory variables relating to politics or race that can help explain the cases/population of each county?
- 3 research questions will be statistically researched and analyzed, pertaining to the following explanatory variables and their corresponding point estimates:
  β1: Linear impact of the percentage of county population that has limited English-speaking ability
  β2: Linear impact of the percentage of county population below the poverty line
  β3: Linear impact of the percentage of county population that lacks a high school diploma
  β4: Linear impact of the percentage of county population that attended a 2-year college (minimum)
  β5: Linear impact of the average high school graduation rate of the high schools in the county
  β6: Linear impact of the political party receiving the majority vote in the county during the 2020 election

Research Questions

Question 1
Do characteristics commonly associated with Texan Hispanics (lack of English-speaking ability and living below the poverty line) have a significant linear effect on the cases/population ratio of each county?
Statistical Hypothesis
H0: β1 = 0 and β2 = 0 (There is no linear impact from the county's lack of English-speaking ability or population that lives in poverty on the average cases/population of the county for the State of Texas)
Ha: At least one of β1 and β2 is not 0 (At least one of the variables stated above has a linear impact on the average cases/population of the county)

Question 2
Does lacking a basic high school education have a larger or smaller linear effect on the cases/population ratio of the county than living below the poverty line, or are they relatively the same?
Statistical Hypothesis
H0: β2 = β3 (There is no difference between the linear impact that lacking a high school education and living in poverty have on the average cases/population of the county for the State of Texas)
Ha: β2 ≠ β3 (There is a difference between the linear impact that lacking a high school education and living in poverty have on the average cases/population of the county)

Question 3
Does the voting result for the county influence the cases/population of the county, regardless of the high school and college education makeup of the county population?
Statistical Hypothesis
H0: β6 = 0 (Voting for either candidate in the 2020 election has the same effect on the average cases/population of the county, regardless of the high school graduation rate or the college attendance rate of the county population)
Ha: β6 ≠ 0 (Voting for a specific candidate in the 2020 election has a different effect on the average cases/population of the county than voting for the other candidate)

Research Question 1
Do characteristics commonly associated with Texan Hispanics (lack of English-speaking ability and living below the poverty line) have a significant linear effect on the cases/population ratio of each county?
Dataset Characteristics

Variable Name | Definition | Unit | Type | Range
ratio | The number of total cases that each county has reported since March 17th | # of cases / population of county | Quantitative (Y) | [1, 2157]
badEnglish | The percentage of the county population that lacks English-speaking ability | Dimensionless (percentage) | Quantitative (X1) | [0, 30.37]
belowPoverty | The percentage of the county population that lives below the poverty line | Dimensionless (percentage) | Quantitative (X2) | [1.8, 37.9]

- Sample size: 254
- Y data was extracted from the Texas DSHS cases documentation
- X data was extracted from a Kaggle database
- All data was combined into one csv file for analysis in R

Preliminary Analysis
- Initial regression equation: Y = -11.5537 + 6.9368 X1 - 0.1259 X2
- The plot of the variables against each other showed two problems:
  - Multicollinearity between explanatory variables (check for violations)
  - 2 potentially influential Y cases (162 and 155)
- Residual plot shows non-constant variance
- Shapiro plot highlights potential normal distribution violation

Model Building and Diagnosis
Checking Influential Cases
- Remedial measure used: Studentized Deleted Residuals (SDR)
- SDR of the influential cases compared with the t-critical value of 3.7749
- If SDR > t → reject the null hypothesis that the case is not an outlier with respect to Y
- Case 162 (McLennan County): SDR score 77.404 → reject → Case 162 is a Y outlier
- Case 155 (Maverick County): SDR score 4.02 → reject → Case 155 is a Y outlier
- Executive decision was made to remove the data points from the analysis (reason analyzed in conclusion)
Checking Cleaned Data for Violations
- New regression equation: Y = 6.5516 + 0.3423 X1 + 0.2105 X2
- Violations:
  - Shapiro plot still shows normal distribution violation
  - Residual plot improved, but still shows a lack of consistency
- Box-Cox transformation (λ = 0.108)

Final Model Report and Application
- Transformed, cleaned regression equation: Y' = 1.2275 + 0.003681 X1 + 0.002294 X2, where Y' is the Box-Cox-transformed response
- R2 improved from 0.06741 (original) to 0.07403
- Root MSE improved from 132.9 (original) to 8.869
- Removing outliers significantly improved the model
- To make further conclusions on Y, the model must be back-transformed
- ANOVA extra sum of squares used to evaluate the research question:
  - Determining whether β1 = 0: F value 6.67649, P-value 0.009853
  - Determining whether β2 = 0: F value 12.803, P-value 0.004158
- Research question conclusion: The explanatory variables have a significant (albeit small) positive linear impact on the cases/population for each county in the state of Texas. With an increase of 1 percent in each explanatory variable, the cases/population will rise. Possible contributing factors:
  - Did not understand safety guidelines
  - Government officials did not take the pandemic seriously enough
  - Higher risk due to other health conditions

Research Question 2
Does lacking a basic high school education have a larger or smaller linear effect on the cases/population ratio of the county than living below the poverty line, or are they relatively the same?

Dataset Characteristics

Variable Name | Definition | Unit | Type | Range
ratio | The number of total cases that each county has reported since March 17th | # of cases / population of county | Quantitative (Y) | [1, 2157]
highSchool | The percentage of the county population that lacks a high school diploma | Dimensionless (percentage) | Quantitative (X3) | [6.1, 55.5]
belowPoverty | The percentage of the county population that lives below the poverty line | Dimensionless (percentage) | Quantitative (X2) | [1.8, 37.9]

- Sample size: 252
- Y data was extracted from the Texas DSHS cases documentation
- X data was extracted from a Kaggle database
- All data was combined into one csv file for analysis in R

Preliminary Analysis
- The plot of the variables against each other showed a MASSIVE multicollinearity problem
- Shapiro plot also shows potential normal distribution violation
- Hard to draw conclusions about violations when excessive multicollinearity exists
- Remedial action: focus on reducing multicollinearity, and assess other violations afterward
- Residual plot shows an okay residual pattern
- Shapiro plot highlights potential normal distribution violation

Model Building and Diagnosis
Finding Best Model ("Best" Algorithm)
- Used to see whether R suggests dropping one of the explanatory variables to reduce multicollinearity
- Model with both predictors included had the smallest SSE, largest R2, smallest Cp, and smallest PRESS
- Finding the best model did not solve the multicollinearity issue
Ridge Regression
- Used to penalize point estimates by a subjective value (λ)
  - Generalized cross-validation → λ = 82 (does not make sense)
  - Reducing MSE → λ = 0.1
  - Reducing VIF to around 1 → λ = 0.1
- Ridge regression equation: Y = 5.4171 + 0.1477 X2 + 0.2008 X3
- R2 barely improved (0.0505 → 0.0513)
- Residual plot and Shapiro test still indicated violations in the ridge regression
- A quick Box-Cox transformation yielded the same results

Final Model Report and Application
- Because the remedial measures did not fix the multicollinearity, the research question cannot be accurately answered
- K-fold validation is not necessary because the multicollinearity issue was not fixed
- The SLR model of each individual predictor yields similar point estimates with minute R2 values:
  - highSchool: Y = 6.84 + 0.240 X3 (R2 = 0.0412)
  - belowPoverty: Y = 6.26 + 0.33 X2 (R2 = 0.044)
- Conclusions:
  - Neither explanatory variable, even with remedial measures, does a good job of explaining the response
  - SLR results show that the variables are too similar; there is no benefit in combining them
  - Too insignificant to draw a conclusion for the research question
- Next steps: revisit data collection for the explanatory variables; identify sampling biases and abnormalities

Research Question 3
Does the voting result for the county influence the cases/population of the county, regardless of the high school and college education makeup of the county population?
Dataset Characteristics

Variable Name | Definition | Unit | Type | Range
ratio | The number of total cases that each county has reported since March 17th | # of cases / population of county | Quantitative (Y) | [1, 2157]
college | The percentage of the county population that has attended a 2-year college (minimum) | Dimensionless (percentage) | Quantitative (X4) | [15.715, 100]
highSchool | The average high school graduation rate of the high schools in the county | Dimensionless (rate) | Quantitative (X5) | [61.8, 100]
vote | The political party receiving the majority vote in the county during the 2020 election | 1 = Democratic, 0 = Republican | Categorical (X6) | [0, 1]

- Sample size: 252
- Y data was extracted from the Texas DSHS cases documentation
- X data was extracted from a Kaggle database
- All data was combined into one csv file for analysis in R

Preliminary Analysis
- Initial regression equation: Y = 26.84 - 0.09023 X4 - 0.11811 X5 + 8.07119 X6 (R2 = 0.08)
- No major multicollinearity issues between college and highSchool
- Shapiro graph shows potential normal distribution violation
- Remedial action: select the best model to check whether highSchool or college can be eliminated from the model
- Residual plot shows okay variance (no violation)
- Shapiro plot highlights potential normal distribution violation

Model Building and Diagnosis
Finding Best Model (Stepwise Algorithm)
- Used to see whether R suggests dropping one of the explanatory variables
- Stepwise suggested dropping highSchool (dropping the variable reduced AIC)
- New regression equation: Y = 15.93 - 0.0941 X4 + 8.5776 X6 (R2 = 0.1575, Root MSE = 8.864)
- Major improvement in R2, slight increase in Root MSE
K-fold Cross-Validation
- Reduces similarity between training sets; used to validate regression models
- Root MSE = 8.5569
- K-fold validation validates the stepwise algorithm model
Established Model Violations
- Residuals = good
- Normal distribution violated, but expected
Hypothesis testing to check for interaction between college and vote
- H0: βinteraction = 0 (Y = β0 + β4 X4 + β6 X6)
- Ha: βinteraction ≠ 0 (Y = β0 + β4 X4 + β6 X6 + βinteraction X4 X6)
- GLT test results: P-value > α (0.05) → fail to reject → no interaction

Final Model Report and Application
- GLT test used to answer the research question (β6 equal or not equal to 0)
  - F statistic: 17.194
  - P-value: < 0.001
  - Reject the hypothesis that β6 is equal to 0
- Research question conclusion: The cases/population for each county in the state of Texas are not the same when analyzing the 2020 election results. Counties that voted Republican had higher cases/population than counties that voted Democratic.
  - Rhetoric from Washington, Austin
  - Less willing to sacrifice liberties
- Final regression equation: Y = 15.93 - 0.0941 X4 + 8.5776 X6
- R2 improved from 0.08005 (original) to 0.1575
- Root MSE slightly worsened from 8.858 (original) to 8.864; k-fold validation confirmed the correct model with similar Root MSE
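
The general linear test (GLT) used in the sample project compares the error sum of squares of a reduced model against that of a full model: F = [(SSE_R - SSE_F) / (df_R - df_F)] / (SSE_F / df_F). The sketch below works through that arithmetic for an interaction test like the one between college and vote. It is a plain-Python illustration on synthetic data; the generating coefficients, noise level, and helper names (sse_of_fit, general_linear_test) are assumptions, not the project's actual data or code. In R, calling anova() on the two nested lm fits reports the same kind of F statistic together with its p-value.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def sse_of_fit(rows, y):
    """Least-squares fit with intercept; returns (SSE, error degrees of freedom)."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
              for d, yi in zip(design, y))
    return sse, len(y) - p

def general_linear_test(reduced_rows, full_rows, y):
    """Return (F statistic, numerator df, denominator df) for nested models."""
    sse_r, df_r = sse_of_fit(reduced_rows, y)
    sse_f, df_f = sse_of_fit(full_rows, y)
    f_stat = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
    return f_stat, df_r - df_f, df_f

# Synthetic data (hypothetical): no true interaction between college and vote.
random.seed(1)
n = 200
college = [random.uniform(15, 100) for _ in range(n)]
vote = [1.0 if random.random() < 0.3 else 0.0 for _ in range(n)]
y = [16.0 - 0.09 * c + 8.5 * v + random.gauss(0, 9) for c, v in zip(college, vote)]

reduced = [[c, v] for c, v in zip(college, vote)]        # Y = b0 + b4*X4 + b6*X6
full = [[c, v, c * v] for c, v in zip(college, vote)]    # adds the X4*X6 term

f_stat, df_num, df_den = general_linear_test(reduced, full, y)
print("F =", round(f_stat, 3), "on", df_num, "and", df_den, "degrees of freedom")
```

A small F relative to the critical value of the F distribution with those degrees of freedom corresponds to the "fail to reject, no interaction" outcome reported in the slides; the same function applied to models with and without the vote term would mirror the test of whether the vote coefficient is 0.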
Nov 18, 2021