Answer To: hello can you please tell to the person that is going to work on this project if he can do this in...
Franciosalgeo answered on Nov 25 2021
Assignment 2
Contents
Introduction
2. Data
2.1. Source of data
2.2 Descriptive statistics.
3. Data Analysis and Fitting Linear Regression Model.
3.1 Full Model
4. Cross validation
5. Multicollinearity
Summary and concluding Remarks
Reference
Appendix 1
Introduction
Life expectancy is the average number of years that an individual is predicted to live, given a variety of circumstances. It is one of the most important indicators of how well a government is responding to the country's healthcare needs. The health-care system's input is measured in medical costs per capita, while its output is measured in life expectancy at birth. An analysis that relates the two may therefore raise questions about healthcare administration. Global leaders can benefit from this knowledge because they are always looking for new ways to improve their people's lives. A long average lifespan can be a strong indicator that public health efforts are being implemented properly; a short life expectancy, on the other hand, can signal a long-term problem that could become dangerous to the broader public.
2. Data
2.1. Source of data
In this section, I explain how I chose, cleaned, and prepared the data for analysis, and how I replaced missing values. For consistency, the majority of the data in this research came from the World Data Bank and refers to 2018. The variables were selected based on my understanding of core health concerns such as food and nutrition security, environmental quality (air and water), and other global difficulties. Because no 2018 data were available for air pollution, I removed that variable. Most variables had missing values; food import had the most (34 missing values), and these appeared to be missing at random. All missing values were replaced using linear interpolation, that is, by imputing each missing value with the average of its nearest observed values.
The variables included in this study are:
Y = Life expectancy at birth
X1 = Health expenditure (% of GDP)
X2 = Average amount of GDP
X3 = Food import
X4 = Basic drinking water (% of population)
X5 = Infant mortality rate
X6 = Alcohol consumption (liters of alcohol)
X7 = Urban population (% of total population)
X8 = Undernourishment (% of population)
X9 = Population (total)
X10 = Employment to population
For i = 1 to 10, I refer to each variable by its assigned label Xi throughout the rest of the study.
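The imputation step described above can be sketched in Python as follows. This is only an illustration, not the code used for the report (the analysis itself was run in R): the function name `interpolate_linear`, the use of `None` for a missing value, and the rule of copying the nearest observed value at the ends of a series are all assumptions made for this sketch.

```python
def interpolate_linear(values):
    """Fill None entries by linear interpolation between the nearest observed neighbours.

    Assumption for this sketch: gaps at the start or end of the series
    are filled with the nearest observed value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            # Weight by position between the two observed neighbours.
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled


# Made-up series with two gaps (not taken from the report's data).
series = [11.0, None, 13.0, None, None, 16.0]
print(interpolate_linear(series))
```

When a single value is missing between two observed ones, the weight is 0.5 and the result is the plain average of the two neighbours, which matches the description in the text.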
The full data may be found in Appendix 1 at the end of the study.
I start by fitting the full model (Hoffmann) before removing any insignificant regressors. To identify which regressors are insignificant, I use hypothesis tests and their p-values. The best-fitting model is then chosen using the AIC and BIC criteria.
2.2 Descriptive statistics.
Table 1. Summary statistics of the variables (N = 227)
| Characteristic | Range | Median (IQR) | Mean (SD) |
|---|---|---|---|
| Life expectancy at birth | 53, 85 | 74 (67, 77) | 72 (7) |
| Health expenditure | 2.14, 16.89 | 6.06 (4.50, 7.96) | 6.51 (2.66) |
| Average amount of GDP | 196,737,896, 86,100,000,000,000 | 79,788,768,969 (15,005,866,762, 937,000,000,000) | 3,134,987,396,047 (9,941,006,969,831) |
| Food import | 4, 48 | 11 (8, 16) | 13 (7) |
| People using at least basic drinking water services (% of population) | 39, 100 | 94 (80, 99) | 87 (15) |
| Infant mortality rate | 2, 83 | 15 (6, 35) | 22 (19) |
| Total alcohol consumption (liters of pure alcohol) | 0.0, 20.5 | 5.8 (2.7, 9.1) | 6.0 (3.9) |
| Urban population (% of total population) | 13, 100 | 59 (41, 77) | 58 (22) |
| Undernourishment (% of population) | 2, 57 | 7 (3, 13) | 10 (10) |
| Population (total) | 17,911, 7,592,475,615 | 15,477,727 (4,157,091, 96,984,780) | 357,649,591 (1,036,083,609) |
| Employment to population | 32, 87 | 58 (52, 64) | 58 (11) |
Table 1 lists each variable in the data set with its summary statistics: the range (minimum and maximum), the median with the interquartile range, and the mean with the standard deviation, based on a sample of N = 227 observations.
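The statistics reported in Table 1 can be computed with a few lines of Python. This is a hedged sketch rather than the code behind the table: the helper name `summarize` and the sample values are made up, and the choice of `method="inclusive"` is an assumption intended to mimic R's default quantile rule.

```python
import statistics


def summarize(values):
    """Return n, range, median, IQR bounds, mean, and sample SD for one variable."""
    data = sorted(values)
    # "inclusive" is assumed here to correspond to R's default quantile rule.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "n": len(data),
        "range": (data[0], data[-1]),
        "median": statistics.median(data),
        "iqr": (q1, q3),
        "mean": statistics.fmean(data),
        "sd": statistics.stdev(data),  # sample SD (n - 1 in the denominator)
    }


# Made-up values for illustration only; not the report's data.
sample = [53, 61, 67, 72, 74, 76, 77, 81, 85]
for name, value in summarize(sample).items():
    print(name, value)
```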
3. Data Analysis and Fitting Linear Regression Model.
3.1 Full Model
Table 2. Full model: linear regression (dependent variable: Y)
| Characteristic | Beta¹ | SE² | 95% CI² | p-value |
|---|---|---|---|---|
| (Intercept) | 70*** | 2.95 | 64, 76 | <0.001 |
| X1 | 0.28*** | 0.078 | 0.13, 0.43 | <0.001 |
| X2 | 0.00 | 0.000 | 0.00, 0.00 | 0.8 |
| X3 | -0.05 | 0.028 | -0.11, 0.00 | 0.055 |
| X4 | 0.05* | 0.023 | 0.01, 0.10 | 0.019 |
| X5 | -0.28*** | 0.017 | -0.32, -0.25 | <0.001 |
| X6 | -0.08 | 0.051 | -0.19, 0.02 | 0.10 |
| X7 | 0.03** | 0.011 | 0.01, 0.05 | 0.007 |
| X8 | 0.01 | 0.024 | -0.04, 0.05 | 0.8 |
| X9 | 0.00 | 0.000 | 0.00, 0.00 | 0.3 |
| X10 | 0.02 | 0.018 | -0.01, 0.06 | 0.2 |
¹ *p<0.05; **p<0.01; ***p<0.001
² SE = Standard Error, CI = Confidence Interval
To begin, I use hypothesis tests to identify insignificant regressors, testing for each coefficient the null hypothesis that beta is equal to 0. The results are given in Table 2. A variable whose p-value in Table 2 is greater than 0.05 is treated as insignificant at the 5% level. By this rule, X2, X3, X6, X8, X9, and X10 are insignificant.
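The 5% screening rule can be written out as a short Python sketch. The p-values are transcribed from Table 2; entries reported as "<0.001" are stored as the upper bound 0.001, which is an assumption of this sketch and does not affect the comparison with 0.05. The helper `t_statistic` is a hypothetical name, and its input is the rounded estimate and standard error from the table, so the result is approximate.

```python
# p-values transcribed from Table 2; "<0.001" is stored as the upper bound 0.001.
p_values = {
    "X1": 0.001, "X2": 0.8, "X3": 0.055, "X4": 0.019, "X5": 0.001,
    "X6": 0.10, "X7": 0.007, "X8": 0.8, "X9": 0.3, "X10": 0.2,
}
ALPHA = 0.05  # significance level used in the report


def t_statistic(beta, se):
    """Test statistic for H0: beta = 0 (estimate divided by its standard error)."""
    return beta / se


insignificant = [name for name, p in p_values.items() if p > ALPHA]
significant = [name for name, p in p_values.items() if p <= ALPHA]

print("insignificant:", insignificant)
print("significant:", significant)
# Approximate t statistic for X1 from the rounded table entries.
t_x1 = t_statistic(0.28, 0.078)
print("t for X1:", round(t_x1, 2))
```

Note that X3 (p = 0.055) sits just above the threshold, so whether it counts as significant is sensitive to the chosen level.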
I select the best-fitting model using the AIC and BIC criteria. The AIC and BIC of the full model are given below.
## r.squared adj.r.squared sigma statistic p.value df logLik AIC
## 1 0.8826 0.8771 2.5477 162.3498 0 10 -528.7522 1081.504
## BIC deviance df.residual nobs
## 1 1122.604 1402.038 216 227
Thus, the full model has an AIC of 1081.504 and a BIC of 1122.604. The adjusted R² of the full model is 0.8771. With ten predictor variables, one approach would be to assess every model that can be built from them and choose the best one by AIC/BIC; this exhaustive procedure is called best subset selection. I used the MASS::stepAIC function in R, which performs stepwise selection: it adds or drops one term at a time and so does not evaluate every possible subset. The stepAIC function calculates AIC/BIC in a somewhat different way than the AIC and BIC functions. The two versions differ only by a constant that is the same for every model fitted to the same data, so this has no bearing on which model is chosen.
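As a check on the figures above, the AIC and BIC of the full model can be recomputed from the reported log-likelihood. The sketch below also evaluates an RSS-based form of AIC, n·log(RSS/n) + 2·(number of coefficients), which to my understanding is the form R's extractAIC (and therefore stepAIC) uses for linear models; treat that attribution as an assumption. The function names are hypothetical, and the parameter count assumes the error variance is counted as one extra parameter in the likelihood-based form.

```python
import math

N_OBS = 227          # nobs from the model summary
LOGLIK = -528.7522   # logLik of the full model
DEVIANCE = 1402.038  # residual sum of squares of the full model
N_COEF = 11          # intercept + 10 slopes


def aic_from_loglik(loglik, n_coef):
    """AIC = 2k - 2*logLik, with k = coefficients + 1 for the error variance."""
    return 2 * (n_coef + 1) - 2 * loglik


def bic_from_loglik(loglik, n_coef, n_obs):
    """BIC = k*log(n) - 2*logLik, with the same k as above."""
    return (n_coef + 1) * math.log(n_obs) - 2 * loglik


def aic_rss_form(rss, n_coef, n_obs):
    """AIC up to an additive constant: n*log(RSS/n) + 2*(number of coefficients)."""
    return n_obs * math.log(rss / n_obs) + 2 * n_coef


aic = aic_from_loglik(LOGLIK, N_COEF)
bic = bic_from_loglik(LOGLIK, N_COEF, N_OBS)
aic_rss = aic_rss_form(DEVIANCE, N_COEF, N_OBS)
# The gap between the two AIC versions depends only on n, not on the model.
offset = N_OBS * (math.log(2 * math.pi) + 1) + 2

print(round(aic, 3), round(bic, 3))
print(round(aic_rss, 3), round(aic - aic_rss, 3), round(offset, 3))
```

Because the offset depends only on the sample size, ranking models by either version of AIC gives the same ordering, as long as all models are fitted to the same 227 observations.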
Table 3. Final model: linear regression (dependent variable: Y)
| Characteristic | Beta¹ | SE² | 95% CI² | p-value |
|---|---|---|---|---|
| (Intercept) | 73*** | 2.21 | 68, 77 | <0.001 |
| X1 | 0.26*** | 0.071 | 0.12, 0.40 | <0.001 |
| X3 | -0.07** | 0.026 | -0.12, -0.02 | 0.006 |
| X4 | 0.05* | 0.021 | 0.01, 0.09 | 0.025 |
| X5 | -0.28*** | 0.017 | -0.32, -0.25 | <0.001 |
| X6 | -0.08 | 0.051 | -0.18, 0.02 | 0.11 |
| X7 | 0.03** | 0.011 | 0.01, 0.05 | 0.009 |
¹ *p<0.05; **p<0.01; ***p<0.001
² SE = Standard Error, CI = Confidence Interval
The AIC and BIC of the final model, on the original scale, are:
## r.squared adj.r.squared sigma statistic p.value df logLik AIC
## 1 0.8804 0.8772 2.5473 270.0158 0 6 -530.798 1077.596
## BIC deviance df.residual nobs
## 1 1104.996 1427.539 220 227
Thus, the AIC of the final model is 1077.596, which is smaller (better) than the full model's 1081.504. The BIC of the final model, 1104.996, is also better than the full model's 1122.604. The adjusted R², however, is essentially unchanged (0.8772 versus 0.8771). The plots of the final model are given below.
4. Cross validation
Cross validation is a technique for assessing how well the final fitted model will predict future observations. Using the cv.lm function from the DAAG package in R, I perform a 10-fold cross validation. The plot of the observed y values against the fitted y values for all observations is shown below. Reading the points from left to right shows a strong upward linear pattern, which indicates a strong positive relationship between the observed and fitted values.
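To make the procedure concrete, here is a minimal Python sketch of 10-fold cross validation. It is not the DAAG code used for the report: it fits a one-predictor least-squares line on synthetic data, and the helper names `fit_simple_ols` and `k_fold_cv`, the seeds, and the data-generating values are all assumptions chosen for illustration.

```python
import random
import statistics


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def k_fold_cv(x, y, k=10, seed=1):
    """Return one out-of-fold prediction per observation."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    preds = [None] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_simple_ols([x[i] for i in train], [y[i] for i in train])
        # Predict only the observations the model has not seen.
        for i in held_out:
            preds[i] = a + b * x[i]
    return preds


# Synthetic data for illustration only; not the report's data.
rng = random.Random(0)
x = [rng.uniform(2, 83) for _ in range(227)]
y = [80 - 0.3 * xi + rng.gauss(0, 2.5) for xi in x]

cv_pred = k_fold_cv(x, y, k=10)
cv_rmse = (sum((p - yi) ** 2 for p, yi in zip(cv_pred, y)) / len(y)) ** 0.5
print("cross-validated RMSE:", round(cv_rmse, 2))
```

Each observation is predicted by a model that was fitted without it, so the cross-validated error is a fairer guide to performance on new data than the in-sample residual error.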
5. Multicollinearity
One of the assumptions of the linear regression model is the absence of strong multicollinearity, that is, the predictors should not be highly correlated with one another. Multicollinearity is measured here using the variance inflation factor (VIF). A VIF below 5 is commonly taken to indicate that multicollinearity is not a concern. The VIF values of the variables in the final model are given below.
## X1 X3 X4 X5 X6 X7
## 1.24 1.15 3.61 3.65 1.34 1.79
The largest VIF is 3.65, for X5, followed by 3.61 for X4. All the variables in the final model have a VIF below 5. Hence, we can conclude that problematic multicollinearity is not observed in the model and that this assumption is satisfied.
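The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors. The short Python sketch below applies the threshold of 5 to the VIF values reported above and inverts the formula to show the implied R² for each variable; the helper name `implied_r_squared` is hypothetical.

```python
# VIF values as reported for the final model.
vif = {"X1": 1.24, "X3": 1.15, "X4": 3.61, "X5": 3.65, "X6": 1.34, "X7": 1.79}
THRESHOLD = 5.0  # cut-off used in the report


def implied_r_squared(v):
    """Invert VIF = 1 / (1 - R^2) to get the R^2 of a predictor on the others."""
    return 1.0 - 1.0 / v


worst = max(vif, key=vif.get)
flagged = [name for name, v in vif.items() if v >= THRESHOLD]

print("largest VIF:", worst, vif[worst])
print("flagged at threshold 5:", flagged)
for name, v in vif.items():
    print(name, round(implied_r_squared(v), 3))
```

A VIF of 5 corresponds to an R² of 0.8 between one predictor and the rest, which gives a sense of how much shared variation the threshold tolerates.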
Summary and concluding Remarks
The final model selected using the AIC and BIC criteria is
Y = 73 + 0.26 X1 - 0.07 X3 + 0.05 X4 - 0.28 X5 - 0.08 X6 + 0.03 X7
The variables in the final model are listed below:
1. Y = Life expectancy at birth
2. X1 = Health expenditure
3. X3 = Food import
4. X4 = People using at least basic drinking water services (% of population)
5. X5 = Infant mortality rate
6. X6 = Total alcohol consumption (liters of pure alcohol)
7. X7 = Urban population (% of total population)
Health expenditure, basic drinking water services, and urban population percentage have positive coefficients, so each is associated with higher life expectancy. Food import, infant mortality rate, and alcohol consumption have negative coefficients, although the alcohol coefficient is not statistically significant (p = 0.11). The R² of the final model is 0.88, which means that 88% of the variance in life expectancy is explained by the final model.
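To show how the fitted equation is used, the Python sketch below plugs predictor values into the final model. The coefficients are the rounded estimates from the final model, so predictions are approximate. The function name `predict_life_expectancy` is hypothetical, and the example inputs are simply the medians from Table 1 rather than the values of any real country.

```python
# Rounded coefficients of the final model.
COEFFICIENTS = {
    "intercept": 73.0,
    "X1": 0.26,   # health expenditure
    "X3": -0.07,  # food import
    "X4": 0.05,   # basic drinking water (% of population)
    "X5": -0.28,  # infant mortality rate
    "X6": -0.08,  # alcohol consumption (liters of pure alcohol)
    "X7": 0.03,   # urban population (% of total population)
}


def predict_life_expectancy(predictors):
    """Return the fitted life expectancy for a dict of predictor values."""
    total = COEFFICIENTS["intercept"]
    for name, value in predictors.items():
        total += COEFFICIENTS[name] * value
    return total


# Illustrative input: the medians of each predictor from Table 1.
median_profile = {"X1": 6.06, "X3": 11, "X4": 94, "X5": 15, "X6": 5.8, "X7": 59}
prediction = predict_life_expectancy(median_profile)
print(round(prediction, 1))
```

Changing one input while holding the others fixed shows the size of each coefficient: for example, lowering the infant mortality rate by 10 raises the prediction by about 2.8 years under this model.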
Reference
Hoffmann, John P. Linear Regression Models: Application in R. CRC press, Taylor and...