Problem 1. The file SpeedTrap.RData is an R data set that contains a data frame called SpeedTrap. This data frame consists of 184 observations (rows) and 7 variables (columns). Each row corresponds to a town in the Chicago area. The variables are as follows:

In each community we want to compare the rate of ticketing outsiders who are stopped for a traffic violation to the rate of ticketing residents who are stopped. To do this we will use the odds ratio, which is defined as follows:

$$\text{OR} = \frac{\pi_{\text{out}} / (1 - \pi_{\text{out}})}{\pi_{\text{res}} / (1 - \pi_{\text{res}})}$$

where $\pi_{\text{out}}$ and $\pi_{\text{res}}$ are the probabilities of being ticketed for outsiders and residents, respectively. The odds ratio is used to compare probabilities between two populations. In statistical modeling it is often preferred to the straight difference $\pi_{\text{out}} - \pi_{\text{res}}$. An odds ratio of 1.0 implies that the two probabilities are equal. An odds ratio greater than 1.0 implies that $\pi_{\text{out}}$ is greater than $\pi_{\text{res}}$. The odds ratio is estimated from the counts of successes and failures in each community by replacing $\pi_{\text{out}}$ and $\pi_{\text{res}}$ with sample estimates.


(a) Begin by calculating the estimated odds ratio for each community. Append this variable to the data frame (call it OddsRatio). The first three values should match the following:

> SpeedTrap[1:3,"OddsRatio"]
[1] 1.146857 1.201661 1.264754
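
As a rough illustration of the plug-in estimate, the sketch below computes a sample odds ratio on a made-up two-row data frame. The count columns of SpeedTrap are not listed in this text, so the column names used here (out_ticketed, out_warned, res_ticketed, res_warned) are hypothetical stand-ins. Replacing each probability with ticketed / stopped reduces each group's odds to ticketed / not ticketed.

```r
# Toy data only: these column names are hypothetical, not those of SpeedTrap.
toy <- data.frame(
  out_ticketed = c(30, 45),  # outsiders stopped and ticketed
  out_warned   = c(20, 15),  # outsiders stopped but not ticketed
  res_ticketed = c(25, 30),  # residents stopped and ticketed
  res_warned   = c(25, 30)   # residents stopped but not ticketed
)

# Sample odds of a ticket in each group: p / (1 - p) = ticketed / not ticketed.
odds_out <- toy$out_ticketed / toy$out_warned
odds_res <- toy$res_ticketed / toy$res_warned

# Append the estimated odds ratio as a new column.
toy$OddsRatio <- odds_out / odds_res
print(toy$OddsRatio)
```

With the real data, the same two divisions would be applied to whichever columns hold the ticketed and not-ticketed counts.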




(b) Fit a regression model using OddsRatio as the outcome variable and Pop, PPSQMI, PPHU, and PCI as predictor variables. Using diagnostic plots, describe how well the regression conforms to the assumptions of the normal, linear regression model.
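
A minimal sketch of the fit and the standard diagnostic panels follows. It runs on simulated data, because SpeedTrap.RData is not available here: the variable names come from the problem, but every value is generated and carries no information about the real towns.

```r
# Simulated stand-in for SpeedTrap (assumption: values are invented).
set.seed(1)
n <- 184
sim <- data.frame(
  Pop    = rlnorm(n, 10, 1),
  PPSQMI = rlnorm(n, 8, 0.5),
  PPHU   = runif(n, 2, 3.5),
  PCI    = rlnorm(n, 10, 0.3)
)
sim$OddsRatio <- exp(0.1 + 0.2 * log(sim$PCI / 22000) + rnorm(n, sd = 0.3))

fit <- lm(OddsRatio ~ Pop + PPSQMI + PPHU + PCI, data = sim)
print(summary(fit))

# plot() on an lm object draws residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage. The panels are written to a
# temporary PDF so the sketch also runs non-interactively.
out_file <- file.path(tempdir(), "diagnostics.pdf")
pdf(out_file)
par(mfrow = c(2, 2))
plot(fit)
dev.off()
```

In an interactive session the pdf() and dev.off() calls can be dropped so the panels appear on screen.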




(c) Identify those communities for which the leverage exceeds three times the average value. Re-run the regression with these communities removed from the data set. Describe how their removal affects the fitted model.
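
One way to apply the leverage rule is sketched below on simulated data (the values are invented; only the variable names come from the problem). Leverages are the diagonal of the hat matrix and sum to the number of coefficients p, so their average is p / n and the cutoff is 3p / n.

```r
# Simulated stand-in for SpeedTrap (assumption: values are invented).
set.seed(1)
n <- 184
sim <- data.frame(
  Pop    = rlnorm(n, 10, 1),
  PPSQMI = rlnorm(n, 8, 0.5),
  PPHU   = runif(n, 2, 3.5),
  PCI    = rlnorm(n, 10, 0.3)
)
sim$OddsRatio <- exp(0.1 + 0.2 * log(sim$PCI / 22000) + rnorm(n, sd = 0.3))

fit <- lm(OddsRatio ~ Pop + PPSQMI + PPHU + PCI, data = sim)

h <- hatvalues(fit)
p <- length(coef(fit))    # number of coefficients, intercept included
threshold <- 3 * p / n    # three times the average leverage
keep <- h <= threshold    # logical mask, safe even if nothing is flagged

print(rownames(sim)[!keep])  # rows with high leverage

# Refit without the flagged rows and compare the coefficients side by side.
fit_trim <- lm(OddsRatio ~ Pop + PPSQMI + PPHU + PCI, data = sim[keep, ])
print(cbind(all = coef(fit), trimmed = coef(fit_trim)))
```

A logical mask is used instead of negative indexing because sim[-which(...), ] returns zero rows when no observation is flagged.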




(d) Re-run the regression in (b), replacing each of the predictors by its logarithm. How does this change affect the presence of observations with high leverage?
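
The comparison can be sketched by counting high-leverage points under both specifications. The data are simulated with right-skewed predictors, which is an assumption about the real SpeedTrap variables, and count_high is a hypothetical helper name.

```r
# Simulated stand-in for SpeedTrap (assumption: values are invented).
set.seed(1)
n <- 184
sim <- data.frame(
  Pop    = rlnorm(n, 10, 1),
  PPSQMI = rlnorm(n, 8, 0.5),
  PPHU   = runif(n, 2, 3.5),
  PCI    = rlnorm(n, 10, 0.3)
)
sim$OddsRatio <- exp(0.1 + 0.2 * log(sim$PCI / 22000) + rnorm(n, sd = 0.3))

# Hypothetical helper: number of points whose leverage exceeds 3x the average.
count_high <- function(fit) {
  h <- hatvalues(fit)
  sum(h > 3 * mean(h))
}

fit_raw <- lm(OddsRatio ~ Pop + PPSQMI + PPHU + PCI, data = sim)
fit_log <- lm(OddsRatio ~ log(Pop) + log(PPSQMI) + log(PPHU) + log(PCI),
              data = sim)

print(c(raw = count_high(fit_raw), logged = count_high(fit_log)))
```

Taking logarithms pulls in the long right tail of a skewed predictor, which tends to reduce the number of extreme points in predictor space.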




(e) Using log-transformed predictors, find a Box-Cox transformation of the outcome variable that maximizes the likelihood. Re-fit the model with the transformed outcome variable. Does it better conform to the assumptions of the normal, linear regression model than the model that you fit originally? In what respects are the diagnostics still troublesome?
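
A sketch of the Box-Cox step is given below, assuming the MASS package (a recommended package distributed with R) and simulated data in place of SpeedTrap. The grid of lambda values and the helper name box_cox are choices made for this illustration.

```r
library(MASS)  # provides boxcox()

# Simulated stand-in for SpeedTrap (assumption: values are invented).
set.seed(1)
n <- 184
sim <- data.frame(
  Pop    = rlnorm(n, 10, 1),
  PPSQMI = rlnorm(n, 8, 0.5),
  PPHU   = runif(n, 2, 3.5),
  PCI    = rlnorm(n, 10, 0.3)
)
sim$OddsRatio <- exp(0.1 + 0.2 * log(sim$PCI / 22000) + rnorm(n, sd = 0.3))

# Profile log-likelihood over a grid of lambda values (no plot drawn).
bc <- boxcox(OddsRatio ~ log(Pop) + log(PPSQMI) + log(PPHU) + log(PCI),
             data = sim, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# Hypothetical helper: Box-Cox transform, with the log as the lambda = 0 case.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

sim$y_bc <- box_cox(sim$OddsRatio, lambda_hat)
fit_bc <- lm(y_bc ~ log(Pop) + log(PPSQMI) + log(PPHU) + log(PCI), data = sim)
print(summary(fit_bc))
```

The refitted model can then be passed through the same diagnostic plots as the original fit to compare them.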




(f) Produce and interpret a set of partial residual plots for the model that you fit in (e). Do the predictor variables appear to be treated appropriately in the model?
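
Partial residuals can be obtained in base R with residuals(fit, type = "partial"), which adds each centered term contribution to the ordinary residuals. The sketch below uses simulated data and a log outcome as a stand-in for the Box-Cox transformed response; both are assumptions made for illustration.

```r
# Simulated stand-in for SpeedTrap (assumption: values are invented).
set.seed(1)
n <- 184
sim <- data.frame(
  Pop    = rlnorm(n, 10, 1),
  PPSQMI = rlnorm(n, 8, 0.5),
  PPHU   = runif(n, 2, 3.5),
  PCI    = rlnorm(n, 10, 0.3)
)
sim$z <- 0.1 + 0.2 * log(sim$PCI / 22000) + rnorm(n, sd = 0.3)

fit <- lm(z ~ log(Pop) + log(PPSQMI) + log(PPHU) + log(PCI), data = sim)

pr <- residuals(fit, type = "partial")  # n x 4 matrix, one column per term
X  <- model.matrix(fit)[, -1]           # predictors as they enter the model

# One panel per term, written to a temporary PDF so the sketch also runs
# non-interactively.
out_file <- file.path(tempdir(), "partial_residuals.pdf")
pdf(out_file)
par(mfrow = c(2, 2))
for (term in colnames(pr)) {
  plot(X[, term], pr[, term], xlab = term, ylab = "Partial residual")
  abline(lm(pr[, term] ~ X[, term]), lty = 2)  # slope equals the fitted coefficient
  lines(lowess(X[, term], pr[, term]))         # smooth, to reveal curvature
}
dev.off()
```

A smooth curve that bends away from the dashed line suggests the predictor may need a different functional form.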




(g) Assuming that all necessary assumptions are met with the model that you fit in (e):


i. Test the null hypothesis that the coefficients on log(PPHU) and log(PPSQMI) are both zero.

ii. Give a 95% confidence interval for the coefficient on log(PCI).


iii. Give a 95% prediction interval for the estimated odds ratio in a community that has a population of 25,000; 4000 persons per square mile; 2.8 persons per housing unit, and a per capita income of $26,000.
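
The three inference tasks map onto anova(), confint(), and predict() in base R. The sketch below uses simulated data and assumes a log-transformed outcome (the lambda = 0 Box-Cox case), so the prediction interval is back-transformed with exp(). For another lambda the inverse would be (lambda * z + 1)^(1 / lambda).

```r
# Simulated stand-in for SpeedTrap (assumption: values are invented).
set.seed(1)
n <- 184
sim <- data.frame(
  Pop    = rlnorm(n, 10, 1),
  PPSQMI = rlnorm(n, 8, 0.5),
  PPHU   = runif(n, 2, 3.5),
  PCI    = rlnorm(n, 10, 0.3)
)
sim$OddsRatio <- exp(0.1 + 0.2 * log(sim$PCI / 22000) + rnorm(n, sd = 0.3))

full    <- lm(log(OddsRatio) ~ log(Pop) + log(PPSQMI) + log(PPHU) + log(PCI),
              data = sim)
reduced <- lm(log(OddsRatio) ~ log(Pop) + log(PCI), data = sim)

# i. Partial F-test that both dropped coefficients are zero.
f_tab <- anova(reduced, full)
print(f_tab)

# ii. 95% confidence interval for one coefficient.
ci_pci <- confint(full, "log(PCI)", level = 0.95)
print(ci_pci)

# iii. 95% prediction interval for a new community, on the original scale.
new_town <- data.frame(Pop = 25000, PPSQMI = 4000, PPHU = 2.8, PCI = 26000)
pi_log <- predict(full, newdata = new_town, interval = "prediction",
                  level = 0.95)
pi_orig <- exp(pi_log)
print(pi_orig)
```

Because the transformation is monotone, back-transforming the interval endpoints gives a valid interval for the odds ratio itself.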




(h) Conduct an outlier analysis on residuals from the regression in (e). Use a family-wide Type I error probability of α = .01. Which communities should be considered for removal from the regression?
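
One common approach, sketched here as an assumption about what the exercise intends, compares externally studentized residuals with a Bonferroni-adjusted t critical value. The data are simulated with one outlier planted in row 10, and a log-scale response z stands in for the transformed outcome.

```r
# Simulated stand-in for SpeedTrap (assumption: values are invented).
set.seed(1)
n <- 184
sim <- data.frame(
  Pop    = rlnorm(n, 10, 1),
  PPSQMI = rlnorm(n, 8, 0.5),
  PPHU   = runif(n, 2, 3.5),
  PCI    = rlnorm(n, 10, 0.3)
)
sim$z <- 0.1 + 0.2 * log(sim$PCI / 22000) + rnorm(n, sd = 0.3)
sim$z[10] <- sim$z[10] + 3   # planted outlier, about ten error SDs

fit <- lm(z ~ log(Pop) + log(PPSQMI) + log(PPHU) + log(PCI), data = sim)

t_i   <- rstudent(fit)       # externally studentized residuals
alpha <- 0.01                # family-wide Type I error probability
p     <- length(coef(fit))

# Two-sided Bonferroni cutoff: alpha is split across n tests and two tails.
crit <- qt(1 - alpha / (2 * n), df = n - p - 1)

flagged <- which(abs(t_i) > crit)
print(crit)
print(flagged)
```

Each studentized residual follows a t distribution with n - p - 1 degrees of freedom under the model, which is why that value is used for df.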





Problem 2. The file Ozone.RData contains a vector named ozone which has length n = 111. This vector was obtained from a regression of air quality measurements (ozone) taken on 111 consecutive days in New York City in 1973. Each entry of ozone is either –1 if the residual is negative or +1 if the residual is positive. We are interested in testing the null hypothesis that the residuals are not serially correlated versus the alternative hypothesis that the residuals are serially correlated. Using the Runs Test, report a p-value and state your conclusion at the .05 test level.
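
A minimal sketch of the runs test, using the large-sample normal approximation without a continuity correction (an assumption, since a course may use the exact distribution instead). The function name runs_test and the example vector are hypothetical, and the Ozone.RData vector is not used here.

```r
# Two-sided runs test for a vector of -1 / +1 signs (normal approximation).
runs_test <- function(s) {
  stopifnot(all(s %in% c(-1, 1)))
  n_pos <- sum(s == 1)
  n_neg <- sum(s == -1)
  n     <- n_pos + n_neg

  runs <- 1 + sum(diff(s) != 0)  # a new run starts at every sign change

  # Mean and variance of the number of runs under independence.
  mu     <- 2 * n_pos * n_neg / n + 1
  sigma2 <- 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n^2 * (n - 1))

  z <- (runs - mu) / sqrt(sigma2)
  list(runs = runs, expected = mu, z = z, p.value = 2 * pnorm(-abs(z)))
}

# Example: a perfectly alternating sequence has far too many runs.
s   <- rep(c(1, -1), length.out = 20)
res <- runs_test(s)
print(res)
```

Too few runs points to positive serial correlation and too many to negative serial correlation. The two-sided p-value covers both.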





Problem 3. For this problem you will use the prostate data that is available in the faraway package. The outcome variable is lcavol; all other variables are predictors. We want to determine if a regression model behaves differently for younger (under age 65) subjects than for older (age 65 and over) subjects.


(a) To do this, introduce a new variable called Young to the data set as a factor that distinguishes younger from older men. Introduce it in a way that separate intercepts and slopes are applied to the two groups of men. Show a summary of your regression. Note: we will accept the validity of all regression assumptions in this exercise.


(b) Using the model in (a) conduct an F-test to see if you reject the null hypothesis that coefficients associated with Young are all equal to zero. Explain in practical terms what your results mean.
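
The separate-intercepts-and-slopes idea can be sketched with an interaction between the factor and the predictors. To stay self-contained, the example simulates a small data frame instead of loading faraway. It borrows the column names age, lweight, lpsa, and lcavol from the prostate data but uses only two predictors and invented values.

```r
# Simulated stand-in for the prostate data (assumption: values are invented).
set.seed(3)
n <- 97
d <- data.frame(
  age     = sample(41:79, n, replace = TRUE),
  lweight = rnorm(n, 3.6, 0.4),
  lpsa    = rnorm(n, 2.5, 1.1)
)
d$lcavol <- -1 + 0.3 * d$lweight + 0.6 * d$lpsa + rnorm(n, sd = 0.7)

# Factor separating men under 65 from men 65 and over.
d$Young <- factor(ifelse(d$age < 65, "yes", "no"))

# Young * (...) expands to main effects plus Young-by-predictor interactions,
# giving each group its own intercept and its own slopes.
full    <- lm(lcavol ~ Young * (lweight + lpsa), data = d)
reduced <- lm(lcavol ~ lweight + lpsa, data = d)
print(summary(full))

# F-test that every coefficient involving Young is zero.
f_tab <- anova(reduced, full)
print(f_tab)
```

A small p-value would suggest that the relationship between lcavol and the predictors differs between the two age groups. A large one would give no evidence that separate models are needed.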

Answered 1 day after Dec 10, 2021

Subhanbasha answered on Dec 11 2021
Problem 1:
b).
Ans:
The diagnostic plots show that the residuals are not normally distributed.
There are some outliers in the data that need to be removed.
The residuals show some relationship with the fitted values.
Taken together, the plots indicate that the data do not satisfy the regression assumptions.
c).
Ans: The outlier communities are WAYNE, HILLSIDE and ITASCA. After removing these three, the model fit improves somewhat, and the assumptions are somewhat better satisfied than in the model above.
d).
Ans:
The plot clearly shows that some outliers remain where they are...