
Develop the best multiple regression model you can in order to predict Life Satisfaction using the other fields in the data set as predictors. To find the best model, please follow the “stepwise backward regression” procedure (the principle of parsimony), as detailed in the steps below:

1. Clean the dataset by removing any row that is missing any datapoint.

2. Run the correlation analysis between all variables. Make sure the independent variables are “truly” independent of each other; that is, each pair of independent variables must have a low correlation. If any two independent variables have a high correlation (e.g., 0.7 or above), remove one of them. High correlation between two independent variables is known as “collinearity” and causes a lot of problems for the regression model (please check the recorded supplementary video for details on collinearity).

3. Include all the remaining variables in your regression model. If you do the analysis in Excel, note that Excel has a limit of 16 independent variables. Reduce the number of independent variables by eliminating those that have higher correlations with each other and low correlation with the target variable (Life Satisfaction) until you have no more than 15-16 independent variables, as accepted by Excel. If you are using different software such as R, you do not need to worry about the number of independent variables.

4. Run the multiple regression of Life Satisfaction against all the remaining variables.

5. Check the p-values or t-stat column of the coefficients of the independent variables.

6. If they are all significant (p-value < 0.05 or |t-stat| > ~2), the regression is final. This model is known as the “parsimonious model.”

7. If some of the variables are not significant, remove the one whose significance is lowest, i.e., the variable with the highest p-value or the lowest |t-stat|. Then re-run the regression with the remaining variables.

8. Go to step 5.

Suraj answered on Mar 13 2021
Assignment
Topic: Regression analysis
Step 1:
Introduction: We have a data set consisting of 23 variables and 37 rows, in which one variable is our dependent variable, named “Life Satisfaction,” and the other 22 variables are independent variables.
Cleaning the data: Our data contains many missing values. We use Python to fill the missing values with the mean of each variable; that is, every missing value is replaced by the mean of the column it belongs to. In this way the data is made ready for use.
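A minimal sketch of this cleaning step in pandas, assuming a hypothetical CSV file named life_satisfaction.csv with the dependent variable in a column called "Life Satisfaction" (the file and column names are placeholders, not taken from the original data set):

import pandas as pd

# Load the data set (file name is a placeholder).
df = pd.read_csv("life_satisfaction.csv")
print(df.shape)  # expect (37, 23): 37 rows, 23 variables

# Fill every missing value with the mean of its own column
# (mean imputation on the numeric columns, as described above).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())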
Correlation analysis: This is a simple and effective method for selecting the variables that are useful for further analysis. We often have many variables, and running a regression with all of them is not practical; that is the situation here as well. We therefore set a threshold on the correlation with the target variable, and if a variable’s correlation falls below that threshold we simply delete it from the data set (a sketch of this screen in Python follows the list below).
We ran this correlation analysis in Python; the full correlation matrix is too large to include as an image here. The analysis showed that eight variables in the data set have very low correlation with the dependent variable. These eight variables are listed below:
1. Housing expenditure
2. Household net wealth
3. Education attainment
4. Student skills
5. Years in education
6. Stakeholder engagement for developing regulations
7. Homicide rate
8. Employees working very long hours
All of these variables have a correlation of less than 0.30 with Life Satisfaction, so we delete them. We are now left with 15 variables in total, of which 14 are independent.
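A sketch of this correlation screen, assuming the data frame df from the previous snippet and the placeholder target column "Life Satisfaction"; the 0.30 threshold is the one used above:

# Correlation of every independent variable with the target.
target = "Life Satisfaction"
corr_with_target = df.corr(numeric_only=True)[target].drop(target)

# Keep only variables whose absolute correlation is at least 0.30.
low_corr = corr_with_target[corr_with_target.abs() < 0.30].index
print("Dropping:", list(low_corr))
df = df.drop(columns=low_corr)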
The next task is to check for multicollinearity, which occurs when the independent variables are highly correlated with each other; this violates an assumption of multiple regression, so we have to check for it as well. We again compute the correlations among the remaining independent variables and set a threshold of 0.5: any pairwise correlation above this value indicates multicollinearity, and we simply remove one of the two variables involved. The output matrix is again too large to include here. In total, five independent variables show multicollinearity with other variables, and we remove them.
The variables are listed as follows:
1. Personal earnings
2. Quality of support network
3. Air pollution
4. Water quality
5. Feeling safe walking alone at night
Thus, after removing these variables, 9 of the original 22 independent variables remain for modelling.
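A sketch of the multicollinearity screen, again assuming df and the same target column; it drops one variable from each pair of independent variables whose absolute pairwise correlation exceeds 0.5:

import numpy as np

predictors = df.drop(columns=[target])
corr = predictors.corr(numeric_only=True).abs()

# Look only at the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.5).any()]
print("Dropping for collinearity:", to_drop)
df = df.drop(columns=to_drop)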
Next, to check the linearity assumption, we plot nine scatter plots of the remaining independent variables against the dependent variable and inspect them. The plots are given as follows:
The plots show that all of the variables have an approximately linear relationship with the dependent variable except Dwellings without basic facilities, so we remove that variable from the data. All the other variables satisfy the linearity condition, and we can now proceed with the regression analysis.
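A sketch of this linearity check with matplotlib, plotting each remaining independent variable against Life Satisfaction (assumes the df from the previous steps):

import matplotlib.pyplot as plt

predictors = [c for c in df.columns if c != target]
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
for ax, col in zip(axes.ravel(), predictors):
    ax.scatter(df[col], df[target], s=15)
    ax.set_xlabel(col)
    ax.set_ylabel(target)
plt.tight_layout()
plt.show()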
Regression analysis modelling
We will use the backward elimination method to build the best regression model. In backward...
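A sketch of the backward-elimination loop in Python using statsmodels, following the steps in the question (while any predictor has a p-value above 0.05, drop the least significant one and refit); it assumes the cleaned df and target column from the previous steps:

import statsmodels.api as sm

y = df[target]
X = sm.add_constant(df.drop(columns=[target]))

while True:
    model = sm.OLS(y, X).fit()
    pvalues = model.pvalues.drop("const")      # ignore the intercept
    if pvalues.empty or pvalues.max() <= 0.05:
        break                                  # all remaining predictors significant: parsimonious model
    worst = pvalues.idxmax()                   # least significant predictor
    X = X.drop(columns=[worst])                # remove it and refit

print(model.summary())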