Develop the best multiple regression model you can to predict Life Satisfaction,
using the other fields in the data set as predictors. To find the best model, please follow
“stepwise backward regression” (guided by the principle of parsimony), as detailed in the
steps below:
1. Clean the dataset by removing any row that is missing any datapoint.
2. Run a correlation analysis between all variables. Make sure the independent
variables are “truly” independent of each other. That is, every pair of independent
variables must have a low correlation. If any two independent variables are highly
correlated (e.g., 0.7 or above), remove one of them. High correlation between two
independent variables is known as “collinearity” and makes the regression coefficients
and their p-values unreliable (please check the recorded supplementary video for
details on collinearity).
3. Include all the remaining variables in your regression model. If you do the analysis in
Excel, note that Excel has a limit of 16 independent variables. In that case, reduce the
number of independent variables by eliminating those that have higher correlations
with each other and low correlation with the target variable (Life Satisfaction) until
you have no more than 16 independent variables. If you are using different software,
such as R, you do not need to worry about the number of independent variables.
4. Run the multiple regression of Life Satisfaction against all the remaining variables.
5. Check the p-values or t-stat column of the coefficients of the independent variables.
6. If they are all significant (p-value < 0.05 or |t-stat| > ~2), the regression is final. This
model is known as the “parsimonious model.”
7. If some of the variables are not significant, remove the least significant one, that is,
the variable with the highest p-value or the lowest |t-stat| value. Then, re-run
the regression with the remaining variables.
8. Go to step 5.
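
The steps above can be sketched in Python as follows. This is only an illustration under several assumptions, not a required implementation: it uses NumPy alone, it runs on synthetic data with made-up column names (x1 to x4) rather than the assignment's data set, it decides significance with the |t-stat| > 2 rule of thumb instead of exact p-values, and in step 2 it chooses to drop the member of a collinear pair that is less correlated with the target. All function names are hypothetical.

```python
import numpy as np

def drop_incomplete_rows(data):
    """Step 1: keep only the rows that contain no missing (NaN) values."""
    return data[~np.isnan(data).any(axis=1)]

def drop_collinear(X, y, names, threshold=0.7):
    """Step 2: while some predictor pair has |r| >= threshold, drop the
    member of the most correlated pair that is less correlated with y
    (an assumed tie-breaking rule)."""
    names = list(names)
    while X.shape[1] > 1:
        corr = np.abs(np.corrcoef(X, rowvar=False))
        np.fill_diagonal(corr, 0.0)  # ignore each variable's self-correlation
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] < threshold:
            break
        r_i = abs(np.corrcoef(X[:, i], y)[0, 1])
        r_j = abs(np.corrcoef(X[:, j], y)[0, 1])
        drop = i if r_i < r_j else j
        X = np.delete(X, drop, axis=1)
        del names[drop]
    return X, names

def t_stats(X, y):
    """Steps 4-5: fit OLS with an intercept and return the t-statistics
    of the slope coefficients (the intercept is excluded)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)       # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)      # coefficient covariance
    return beta[1:] / np.sqrt(np.diag(cov)[1:])

def backward_eliminate(X, y, names, t_crit=2.0):
    """Steps 5-8: repeatedly drop the predictor with the lowest |t-stat|
    until every remaining predictor has |t-stat| > t_crit."""
    names = list(names)
    while X.shape[1] > 0:
        t = np.abs(t_stats(X, y))
        worst = int(np.argmin(t))
        if t[worst] > t_crit:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names

# Synthetic example: x3 nearly duplicates x1, and x4 is unrelated to y.
rng = np.random.default_rng(0)
n = 200
x1, x2, x4 = rng.normal(size=(3, n))
x3 = x1 + 0.05 * rng.normal(size=n)
y = 3 * x1 - 2 * x2 + 0.5 * rng.normal(size=n)
data = np.column_stack([x1, x2, x3, x4, y])
data[5, 2] = np.nan  # simulate one missing datapoint

clean = drop_incomplete_rows(data)
X, target = clean[:, :-1], clean[:, -1]
X, names = drop_collinear(X, target, ["x1", "x2", "x3", "x4"])
kept = backward_eliminate(X, target, names)
print("Predictors in the final model:", kept)
```

With a real data set, the same loop would run on the cleaned Life Satisfaction table instead of the synthetic array, and a statistics package that reports exact p-values could replace the |t-stat| threshold.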