Hi, please complete the following data science assignment using R Markdown; the output should be an Rmd file.
I will submit the Rmd file, the HTML file, and the CSV that may be necessary to read in. Please only complete Question 1 part (a) and Question 3 parts (a) and (b). Do not complete the last question (the second Question 3, worth 1 pt).


HW07
Your Name, Your Uniqname
Due Monday March 9, 2020 at 10pm on Canvas

## Question 1 (3 pts)

Set your working directory using Session -> Set Working Directory -> To Source File Location.

Consider sampling $n$ pairs $(Y_i, X_i)$ from a very large population of size $N$. We will assume that the population is so large that we can treat $n/N \approx 0$, so that all pairs in our sample are effectively independent.

```{r}
xy <- read.csv("xy.csv")
ggplot(xy, aes(x = x, y = y)) + geom_point()
```

For the population, you want to relate $Y$ and $X$ as a linear function:
$$Y_i = \beta_0 + \beta_1 X_i + R_i$$
where
\[
\begin{aligned}
\beta_1 &= \frac{\text{Cov}(X,Y)}{\text{Var}(X)} \\
\beta_0 &= E(Y) - \beta_1 E(X) \\
R_i &= Y_i - \beta_0 - \beta_1 X_i
\end{aligned}
\]
The line described by $\beta_0$ and $\beta_1$ is the "population regression line". We don't get to observe $R_i$ for our sample, but we can estimate $\beta_0$ and $\beta_1$ to get estimates of $R_i$.

### Part (a) (1 pt)

The `lm` function in R can estimate $\beta_0$ and $\beta_1$ using sample means and variances. Since these estimators are based on sample means, we can use the **central limit theorem** to justify confidence intervals for $\beta_0$ and $\beta_1$ (we won't do so rigorously in this setting).

Use the `lm` function to estimate $\beta_0$ and $\beta_1$. Apply the `confint` function to the results to get 95% confidence intervals for the $\beta_1$ parameter.

The estimated residuals ($\hat R_i$) can be found by applying the `resid` function to the result of `lm`. Provide a density plot of these values. Do they give you any reason to be concerned about the validity of the central limit theorem approximation?

### Part (b) (2 pts)

You can use the `coef` function to get just the estimators $\hat \beta_0$ and $\hat \beta_1$. Use the `boot` package to get basic and percentile confidence intervals for just $\beta_1$. You will need to write a custom function to give as the `statistic` argument to `boot`. Use at least 1000 bootstrap samples. You can use `boot.ci` for the confidence intervals. Compare these intervals to part (a) and comment on the assumptions required for the bootstrap intervals.

## Question 3 (6 pts)

Suppose that instead of sampling pairs, we first identified some important values of $x$ that we wanted to investigate. Treating these values as fixed, we sampled a varying number of $Y_i$ for each $x$ value. For these data, we'll attempt to model the conditional distribution of $Y \mid X$ as:
$$Y \mid X = \beta_0 + \beta_1 X + \epsilon$$
x="" +="" \epsilon\]="" where="" \(\epsilon\)="" epsilon="" is="" assumed="" to="" be="" symmetric="" about="" zero="" (therefore,="" \(e(\epsilon)="0\))" and="" the="" variance="" of="" \(\epsilon\)="" does="" not="" depend="" on="" \(x\)="" (a="" property="" called="" “homoskedasticity”).="" these="" assumptions="" are="" very="" similar="" to="" the="" population="" regression="" line="" model="" (as="" \(e(r_i)="0\)" by="" construction),="" but="" cover="" the="" case="" where="" we="" want="" to="" design="" the="" study="" on="" paricular="" values="" (a="" common="" case="" is="" a="" randomized="" trial="" where="" \(x\)="" values="" are="" assigned="" from="" a="" known="" procedure="" and="" \(y\)="" is="" measured="" after).="" part="" (a)="" (3="" pts)="" let’s="" start="" with="" some="" stronger="" assumptions="" and="" then="" relax="" them="" in="" the="" subsequent="" parts="" of="" the="" question.="" the="" assumptions="" that="" underly="" the="" central="" limit="" theorem="" in="" question="" 1="" can="" also="" be="" used="" to="" assume="" that="" \(\epsilon="" \sim="" n(0,="" \sigma^2)\)="" so="" that:="" \[y="" \mid="" x="" \sim="" n(\beta_0="" +="" \beta_1="" x,="" \sigma^2)\]="" we’ve="" noticed="" that="" the="" normal="" distribution="" has="" “light="" tails”="" and="" assumptions="" based="" on="" normality="" can="" be="" sensitive="" to="" outliers.="" instead,="" suppose="" we="" we="" model="" \(\epsilon\)="" with="" scaled="" \(t\)-distribution="" with="" 4="" degrees="" of="" freedom="" (i.e.,="" has="" fatter="" tails="" than="" the="" normal="" distribution):="" \[\epsilon="" \sim="" \frac{\sigma}{\sqrt{2}}="" t(4)="" \rightarrow="" \text{var}(\epsilon)="\sigma^2\]" (the="" \(\sqrt{2}\)="" is="" there="" just="" to="" scale="" the="" \(t\)-distribution="" to="" have="" a="" variance="" of="" 1.="" more="" generally,="" if="" we="" picked="" a="" differed="" degrees="" of="" freemdom="" parameter="" \(v\),="" this="" would="" be="" replaced="" with="" \(\sqrt{v/(v-2)}\).)="" one="" way="" to="" get="" an="" estimate="" of="" the="" distribution="" of="" \(\hat="" \beta_1\)="" is="" the="" following="" algorithm:="" estimate="" \(\beta_0\),="" \(\beta_1\),="" and="" \(\sigma\)="" using="" linear="" regression="" (you="" can="" get="" the="" \(\hat="" \sigma\)="" using="" summary(model)$sigma),="" for="" all="" the="" \(x_i\)="" in="" the="" sample,="" generate="" \(\hat="" y_i="\hat" \beta_0="" +="" \hat="" \beta_1="" x_i\)="" for="" \(b\)="" replications,="" generate="" \(y_i^*="\hat" y_i="" +="" \epsilon_i*\),="" where="" \[\epsilon^*="" \sim="" \frac{\hat="" \sigma}{\sqrt{2}}="" t(4)\]="" for="" each="" replication,="" use="" linear="" regression="" to="" estimate="" \(\hat="" \beta_1^*\).="" use="" the="" \(\alpha/2\)="" and="" \(1="" -="" \alpha/2\)="" quantiles="" of="" the="" bootstrap="" distribution="" to="" get="" the="" confidence="" intervals:="" \[[2="" \hat="" \beta_1="" -="" \hat="" \beta_1^*(1="" -="" \alpha/2),="" 2="" \hat="" \beta_1="" -="" \hat="" \beta_1^*(\alpha/2)]\]="" to="" avoid="" double="" subscripts="" i’ve="" written="" \(\hat="" \beta^*_1(1="" -="" \alpha/2)\)="" as="" the="" upper="" \(1="" -="" \alpha/2\)="" quantile="" of="" the="" bootstrap="" (and="" likewise="" for="" the="" lower="" \(\alpha/2\)="" quantile).="" you="" may="" note="" that="" this="" is="" a="" “basic”="" basic="" bootstrap="" interval.="" in="" fact,="" this="" procedure="" (fitting="" parameters,="" then="" simulating="" from="" a="" model)="" 
is="" known="" as="" a="" parametric="" bootstrap.="" use="" the="" algorithm="" above="" to="" generate="" a="" confidence="" interval="" for="" \(\beta_1\).="" compare="" it="" to="" the="" fully="" parametric="" interval="" produced="" in="" question="" 2(a).="" which="" is="" larger="" or="" smaller?="" note:="" the="" boot="" function="" does="" have="" the="" option="" of="" performing="" a="" parametric="" bootstrap="" using="" a="" user="" supplied="" rand.gen="" function.="" feel="" free="" to="" use="" this="" functionality,="" but="" you="" may="" find="" it="" easier="" to="" implement="" the="" algorithm="" directly.="" part="" (b)="" (3="" pts)="" as="" an="" alternative="" to="" sampling="" from="" an="" assumed="" distribuiton="" for="" \(\epsilon\),="" we="" can="" replace="" step="" (3)="" in="" the="" previous="" algorithm="" with="" draw="" a="" sample="" (with="" replacement)="" from="" \(\hat="" \epsilon_i\)="" and="" make="" \(y_i^*="\hat" y_i="" +="" \epsilon_i^*\)="" implement="" this="" version="" of="" a="" parametic="" bootstrap.="" feel="" free="" to="" use="" the="" boot="" package.="" compare="" the="" results="" to="" part="" (a)="" of="" this="" question.="" question="" 3="" (1="" pts)="" read="" the="" paper="" “the="" risk="" of="" cancer="" associated="" with="" specific="" mutations="" of="" brca1="" and="" brca2="" among="" ashkenazi="" jews.”="" briefly="" summarize="" the="" paper.="" make="" sure="" to="" discuss="" the="" research="" question,="" data="" source,="" methods,="" and="" results.="" how="" did="" the="" authors="" use="" the="" bootstrap="" procedure="" in="" this="" paper?="" ---="" title:="" "hw07"="" author:="" "your="" name,="" your="" uniqname"="" date:="" "due="" monday="" march="" 9,="" 2020="" at="" 10pm="" on="" canvas"="" output:="" html_document="" ---="" ```{r="" setup,="" include="FALSE}" knitr::opts_chunk$set(echo="TRUE)" library(tidyverse)="" library(ggplot2)="" ```="" ##="" question="" 1="" (3="" pts)="" set="" your="" working="" directory="" using="" session="" -=""> Set Working Directory -> To Source File. Consider sampling $n$ pairs $(Y_i, X_i)$ from a very large population of size $N$. We will assume that the population is so large that we can treat $n/N \approx 0$, so that all pairs in our sample are effectively independent. ```{r} xy <- read.csv("xy.csv") ggplot(xy, aes(x = x, y = y)) + geom_point() ``` for the population, you want to relate $y$ and $x$ as a linear function: $$y_i = \beta_0 + \beta_1 x_i + r_i$$ where \[ \begin{aligned} \beta_1 &= \frac{\text{cov}(x,y)}{\text{var}(x)} \\ \beta_0 &= e(y) - \beta_1 e(x) \\ r_i &= y_i - \beta_0 - \beta_1 x_i \end{aligned} \] the the line described by $\beta_0$ and $\beta_1$ is the "population regression line". we don't get to observe $r_i$ for our sample, but we can estimate $\beta_0$ and $\beta_1$ to get estimates of $r_i$. ### part (a) (1 pt) the `lm` function in r can estimate $\beta_0$ and $\beta_1$ using sample means and variances. since these estimators are based on sample means, we can use the **central limit theorem** to justify confidence intervals for $\beta_0$ and $\beta_1$ (we won't do so rigorously in this setting). use the `lm` function to estimate $\beta_0$ and $\beta_1$. apply the `confint` function to the results to get 95% confidence intervals for the $\beta_1$ parameter. the estimated residuals ($\hat r_i$) can be found by applying the `resid` function to the result of `lm`. provide a density plot of these values. 
## Question 3 (1 pt)

Read the paper "The Risk of Cancer Associated with Specific Mutations of BRCA1 and BRCA2 among Ashkenazi Jews." Briefly summarize the paper. Make sure to discuss the research question, data source, methods, and results. How did the authors use the bootstrap procedure in this paper?
Aditya Kumar answered on Mar 07, 2021
---
title: "HW07"
author: "Your Name, Your Uniqname"
date: "Due Monday March 9, 2020 at 10pm on Canvas"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#library(tidyverse)
library(ggplot2)
```
## Question 1 (3 pts)
Set your working directory using Session -> Set Working Directory -> To Source File Location.
Consider sampling $n$ pairs $(Y_i, X_i)$ from a very large population of size $N$. We will assume that the population is so large that we can treat $n/N \approx 0$, so that all pairs in our sample are effectively independent.
```{r}
# A hard-coded setwd() is machine-specific; when knitting, the working
# directory is the Rmd's own folder, so a relative path suffices.
xy <- read.csv("xy-cxgioxhm-2qjnkcmk.csv")
colnames(xy) <- c("index", "x", "y")  # the downloaded CSV has an index column
ggplot(xy, aes(x = x, y = y)) + geom_point()
```
For the population, you want to relate $Y$ and $X$ as a linear function:
$$Y_i = \beta_0 + \beta_1 X_i + R_i$$
where
\[
\begin{aligned}
\beta_1 &= \frac{\text{Cov}(X,Y)}{\text{Var}(X)} \\
\beta_0 &= E(Y) - \beta_1 E(X) \\
R_i &= Y_i - \beta_0 - \beta_1 X_i
\end{aligned}
\]
The line described by $\beta_0$ and $\beta_1$ is the "population regression line". We don't get to observe $R_i$ for our sample, but we can estimate $\beta_0$ and $\beta_1$ to get estimates of $R_i$.
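As a quick check, the sample analogues of these moment definitions can be computed directly and compared with what `lm` reports. This is a minimal sketch, assuming the `xy` data frame loaded above; the names `beta0_hat` and `beta1_hat` are illustrative.

```{r}
# Minimal illustrative sketch: the sample versions of the moment
# definitions above should agree with the coefficients from lm().
beta1_hat <- cov(xy$x, xy$y) / var(xy$x)
beta0_hat <- mean(xy$y) - beta1_hat * mean(xy$x)
c(beta0 = beta0_hat, beta1 = beta1_hat)
```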
### Part (a) (1 pt)
The `lm` function in R can estimate $\beta_0$ and $\beta_1$ using sample means and variances. Since these estimators are based on sample means, we can use the **central limit theorem** to justify confidence intervals for $\beta_0$ and $\beta_1$ (we won't do so rigorously in this setting).
Use the `lm` function to estimate $\beta_0$ and $\beta_1$. Apply the `confint` function to the results to get 95% confidence intervals for the $\beta_1$ parameter.
The estimated residuals ($\hat R_i$) can be found by applying the `resid` function to the result of `lm`. Provide a density plot of these values. Do they give you any reason to be concerned about the validity of the Central Limit Theorem approximation?
```{r}
model1 <- lm(y ~ x, data = xy)
# 95% confidence interval for the slope parameter beta_1
CI.1 <- confint(model1, parm = 2, level = 0.95)
CI.1
# Estimated residuals; the prompt asks for a density plot of these
residuals.model1 <- resid(model1)
plot(density(residuals.model1), main = "Density of estimated residuals")
```
### Part (b) (2 pts)
You can use the `coef` function to get just the estimators $\hat \beta_0$ and $\hat \beta_1$. Use the `boot` package to get basic and percentile confidence intervals for just $\beta_1$. You will need to write a custom function to give as the `statistic` argument to `boot`. Use at least 1000 bootstrap samples. You can use `boot.ci` for the confidence intervals. Compare these intervals to part (a) and comment on the assumptions required for the bootstrap intervals.
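A minimal sketch of the kind of `statistic` function `boot` expects, with basic and percentile intervals from `boot.ci` (illustrative only, not the posted solution; `beta1_stat` and the seed are illustrative choices):

```{r}
library(boot)
# Statistic function: refit the regression on a resampled set of rows
# and return the slope estimate.
beta1_stat <- function(data, indices) {
  coef(lm(y ~ x, data = data[indices, ]))[2]
}
set.seed(123)  # arbitrary seed, for reproducibility
boot_out <- boot(xy, statistic = beta1_stat, R = 1000)
boot.ci(boot_out, type = c("basic", "perc"))
```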