InfluentialVarbsObs.html
Influential variables and observations
Anthon Eff
1 Try downloading this webpage
The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.
2 Resources to learn R
R tutorial on YouTube (21 videos, total time: 1 hour, 7 minutes)
Cheat sheets
R web search. Use this to find documentation for a specific package or function.
3 Miscellaneous international data
Please download worldData.xlsx to your working directory. The xlsx file contains two sheets: one with variable descriptions (labeled variables) and the other with the data (labeled data). Each observation is a country, and the variables are drawn from a variety of sources.
xr<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="data"))
rownames(xr)<-xr$ISO3 # using as rowname the ISO 3-character code for each country
xr$prcpXtemp<-scale(xr$prcp*xr$temp) # values will be high when the climate is warm and wet; low when dry and cold
4 A model of healthy life expectancy at birth
We can extract some variables to make a model explaining the variation across countries in Healthy Life Expectancy at birth. This is the number of healthy years the average newborn is expected to live (the figure is from around 2012).
4.1 Unrestricted model
mhh<-c("pctUrb","simlang","pctYlow40","NetPrimEd","sanTot","watTot","CALORIE97","prcpXtemp","GCGDP","X2009.Overall","Christian")
dpv<-"HLEbirth"
xur<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
summary(q<-lm(xur,xr))
##
## Call:
## lm(formula = xur, data = xr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6870 -3.0476 0.1886 3.0076 9.8266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.707440 5.548952 -0.848 0.39850
## pctUrb 0.132423 0.038673 3.424 0.00093 ***
## simlang 7.104192 2.851683 2.491 0.01456 *
## pctYlow40 0.478732 0.147753 3.240 0.00168 **
## NetPrimEd 0.067901 0.058044 1.170 0.24516
## sanTot 0.114436 0.037552 3.047 0.00303 **
## watTot 0.135329 0.059723 2.266 0.02585 *
## CALORIE97 0.003235 0.001699 1.904 0.06006 .
## prcpXtemp 1.704599 0.742838 2.295 0.02408 *
## GCGDP -0.317173 0.126362 -2.510 0.01386 *
## X2009.Overall 0.213336 0.071512 2.983 0.00367 **
## Christian -1.977590 1.563444 -1.265 0.20918
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.878 on 90 degrees of freedom
## (89 observations deleted due to missingness)
## Multiple R-squared: 0.8523, Adjusted R-squared: 0.8343
## F-statistic: 47.22 on 11 and 90 DF, p-value: < 2.2e-16
The above is our unrestricted model. Next we create a table of descriptive statistics for the variables it uses.
u<-data.frame(psych::describe(xr[,c(dpv,mhh)])) # we will output our descriptive statistics now
u<-u[,c("n","mean","sd","min","max")] # so here we select only the columns that we want
write.csv(u,file="descrip.csv") # this writes the object u to a csv-format file called "descrip.csv"
Descriptive statistics

                     n        mean        sd         min         max
HLEbirth           189     57.5197   11.1386     28.5607     74.9935
pctUrb             190     55.2158   23.6784     11.0000    100.0000
simlang            191      0.7874    0.2160      0.2551      1.0000
pctYlow40          125     16.8480    4.3125      6.0000     25.0000
NetPrimEd          183     85.0164   15.8001     22.0000    100.0000
sanTot             155     67.1935   30.1783      5.0000    100.0000
watTot             159     83.4591   18.3033     22.0000    100.0000
CALORIE97          139  2,686.6475  513.3608  1,685.0000  3,699.0000
prcpXtemp          179      0.0000    1.0000     -1.8454      2.8088
GCGDP              140     15.1893    5.6270      4.6000     32.0000
X2009.Overall      172     59.5581   10.5864     22.7000     87.1000
Christian          191      0.5753    0.3738      0.0000      0.9945
4.2 F-test to identify irrelevant variables
Next, we will identify the coefficients with a p-value above 0.10, and use an F-test to confirm that those variables may be dropped from the model.
j<-summary(q)$coefficients # get the table of regression coefficients, p-values, etc.
drpt<-intersect(rownames(j)[which(j[,4]>.1)],mhh) # get the names of the variables with p-values above 0.10
library(car) # linearHypothesis() comes from the car package
linearHypothesis(q,drpt) # perform the F-test
## Linear hypothesis test
##
## Hypothesis:
## NetPrimEd = 0
## Christian = 0
##
## Model 1: restricted model
## Model 2: HLEbirth ~ pctUrb + simlang + pctYlow40 + NetPrimEd + sanTot +
## watTot + CALORIE97 + prcpXtemp + GCGDP + X2009.Overall +
## Christian
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 92 2208.3
## 2 90 2141.3 2 67.078 1.4097 0.2496
The p-value of the F-test is above 0.05; therefore, we cannot reject the null hypothesis that the true coefficients on these variables are zero. We drop the variables, and the resulting restricted model is our final model.
4.3 Restricted model
mhh<-setdiff(mhh,drpt)
fr<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
summary(q<-lm(fr,xr))
##
## Call:
## lm(formula = fr, data = xr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.023 -2.780 0.112 2.842 11.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.752275 5.145257 -0.535 0.593985
## pctUrb 0.121508 0.037929 3.204 0.001860 **
## simlang 7.690603 2.691687 2.857 0.005274 **
## pctYlow40 0.539948 0.137020 3.941 0.000157 ***
## sanTot 0.128905 0.033981 3.793 0.000264 ***
## watTot 0.157277 0.056965 2.761 0.006944 **
## CALORIE97 0.003503 0.001682 2.083 0.040036 *
## prcpXtemp 1.753336 0.717317 2.444 0.016399 *
## GCGDP -0.349862 0.120966 -2.892 0.004763 **
## X2009.Overall 0.193388 0.069378 2.787 0.006441 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.875 on 93 degrees of freedom
## (88 observations deleted due to missingness)
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8344
## F-statistic: 58.1 on 9 and 93 DF, p-value: < 2.2e-16
Now that we have the final model, we can output our regression results table.
j<-summary(q)$coefficients # get the table of regression coefficients, p-values, etc.
j<-data.frame(j,vif=c(NA,car::vif(q))) # adding the VIF (from the car package) as the last column; NA for the intercept
write.csv(j,file="regres.csv") # this writes the object j to a csv-format file called "regres.csv"
Regression results for final model

                Estimate  Std. Error  t value  Pr(>|t|)     vif
(Intercept)      -2.7523      5.1453  -0.5349    0.5940      NA
pctUrb            0.1215      0.0379   3.2036    0.0019  2.8667
simlang           7.6906      2.6917   2.8572    0.0053  1.4669
pctYlow40         0.5399      0.1370   3.9407    0.0002  1.5231
sanTot            0.1289      0.0340   3.7934    0.0003  4.3245
watTot            0.1573      0.0570   2.7609    0.0069  3.8197
CALORIE97         0.0035      0.0017   2.0826    0.0400  2.8525
prcpXtemp         1.7533      0.7173   2.4443    0.0164  1.8877
GCGDP            -0.3499      0.1210  -2.8922    0.0048  1.6359
X2009.Overall     0.1934      0.0694   2.7875    0.0064  1.7612
5 Influential variables
You have used a t-statistic to judge whether an independent variable is significantly related to the dependent variable. Here we look at a related question: of all the significant independent variables, which exerts the greatest influence on the dependent variable? There are three main ways to answer this question.
The first is the use of standardized coefficients (often called beta coefficients). The standardized coefficients show by how many standard deviations the dependent variable will change, for a one standard deviation change in the independent variable. This is equivalent to the estimated coefficient times the standard deviation of the independent variable divided by the standard deviation of the dependent variable.
The second is the calculation of elasticities. An elasticity is the percentage change in a dependent variable caused by a one percent change in the independent variable (e.g., the percentage change in quantity demanded caused by a one percent change in price). In a model in which all variables are converted to their natural logs (such as the Cobb-Douglas production function), the coefficient estimates can be directly interpreted as elasticities. In a linear regression, one multiplies the estimated coefficient by the mean of the independent variable divided by the mean of the dependent variable.
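As a sketch of this arithmetic (using made-up demand data, not the worldData variables), an elasticity at the means can be computed directly from a fitted lm:

```r
# Hypothetical demand data -- illustrative only, not from worldData.xlsx
set.seed(42)
d <- data.frame(price = runif(200, 1, 10))
d$qty <- 100 - 5 * d$price + rnorm(200, sd = 2)
m <- lm(qty ~ price, data = d)
# elasticity at the means: coefficient * mean(x) / mean(y)
elas <- unname(coef(m)["price"] * mean(d$price) / mean(d$qty))
elas # percent change in qty for a one percent change in price
```

Since quantity falls as price rises, the elasticity here is negative; its magnitude tells us whether demand is elastic (greater than 1) or inelastic (less than 1) at the sample means.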
The third way to determine the relative influences of the independent variables is to decompose the model \(R^2\) into the portion attributable to each independent variable. The \(R^2\) is the percent of the variation in the dependent variable that can be explained by the model. If the independent variables are perfectly orthogonal (that is, not correlated with each other), then each independent variable will explain a unique portion of the variation in the dependent variable. But independent variables are almost never orthogonal—they will share some variation, so that two or more independent variables will account for a portion of the variation in the dependent variables. Decomposition of \(R^2\) into the portion accounted for by each independent variable is thus not straightforward. R contains several algorithms to perform this.
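A minimal base-R illustration of the decomposition idea: enter the regressors one at a time and record the increment to \(R^2\). Because the increments depend on the order of entry, the algorithms in R average them over all orderings (the LMG metric in the relaimpo package, for example); the loop below shows a single ordering, using the built-in mtcars data rather than our model.

```r
# Incremental R^2 for one ordering of regressors (built-in mtcars data).
vars <- c("wt", "hp", "disp")
r2 <- 0
for (k in seq_along(vars)) {
  f <- reformulate(vars[1:k], response = "mpg")  # mpg ~ wt, then mpg ~ wt + hp, ...
  r2k <- summary(lm(f, data = mtcars))$r.squared
  cat(vars[k], "adds", round(r2k - r2, 4), "\n")
  r2 <- r2k
}
# The increments sum to the full model's R^2, but they change if the
# order of entry changes -- hence the need for order-averaging metrics.
```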
5.1 Standardized coefficients
We rescale our coefficients so they can be interpreted as the change in the dependent variable (measured in units of the standard deviation of the dependent variable) for a one standard deviation increase in the independent variable.
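As a sketch with made-up data (any fitted lm works the same way), the standardized coefficients can be computed directly from the definition, and checked against a regression on scaled variables:

```r
# Hypothetical data -- illustrative only, not from worldData.xlsx
set.seed(1)
d <- data.frame(x1 = rnorm(100, 50, 10), x2 = rnorm(100, 5, 2))
d$y <- 2 + 0.3 * d$x1 + 1.5 * d$x2 + rnorm(100)
m <- lm(y ~ x1 + x2, data = d)
b <- coef(m)[-1]                                # drop the intercept
beta <- b * sapply(d[names(b)], sd) / sd(d$y)   # coef * sd(x) / sd(y)
# Check: identical to coefficients from a regression on scaled variables
beta2 <- coef(lm(scale(y) ~ scale(x1) + scale(x2), data = d))[-1]
all.equal(unname(beta), unname(beta2))
```

The two computations agree exactly, since standardizing the variables before fitting is algebraically the same as rescaling the fitted coefficients.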
Nov 08, 2021