Homework 2
Due: before 12:00 pm (noon) on Tuesday, March 30. Please do not include your name on your write-up, since
these documents will be reviewed by anonymous peer graders.
For probability derivations, show your work and/or explain your reasoning. Do not include your raw R
code in your write-up unless we explicitly ask for it. You will submit your R script as a document separate
from the write-up itself. On Canvas, you will see two assignment pages corresponding to Homework
2: (1) to upload your write-up PDF file and (2) to upload the R script that you used to generate your
write-up. Your write-up is what will be peer graded. The R script will not be graded, but you must submit it
to receive credit on the write-up.
If you use tables or figures, make sure they are formatted professionally. Figures and tables should have
informative captions. Numbers should be rounded to a sensible number of digits (you’re at UT and therefore
a smart cookie; use your judgment for what’s sensible depending on the level of precision that is appropriate
for the problem context).
Problem 1 - NHANES
The American National Health and Nutrition Examination Surveys (NHANES) are collected by the US
National Center for Health Statistics, which has conducted a series of health and nutrition surveys since
the early 1960s. Since 1999, approximately 5,000 individuals of all ages have been interviewed each year. For this
problem you will need to install the NHANES package in RStudio, which provides a built-in data frame called NHANES.
Part A: Create a histogram for the distribution of SleepHrsNight for individuals aged 18-22 (inclusive) via
the bootstrap. Use at least 10000 iterations. Include the plot and report the mean sleep hours for this age
group. Optional: how does your sleep compare?
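One way to read Part A is as a bootstrap of the mean of SleepHrsNight. The sketch below is a minimal base-R version under that assumption. The helper name boot_mean and the toy vector are hypothetical and used only for illustration; they are not part of the assignment or the NHANES package.

```r
# Bootstrap the sample mean: resample with replacement, recompute the mean.
# `boot_mean` is a hypothetical helper name, not an NHANES or base-R function.
boot_mean <- function(x, n_boot = 10000) {
  x <- x[!is.na(x)]  # drop missing values before resampling
  replicate(n_boot, mean(sample(x, size = length(x), replace = TRUE)))
}

set.seed(1)
toy_sleep <- c(6, 7, 7, 8, 5, 9, 6, 7, NA, 8)  # made-up values, illustration only
boot_means <- boot_mean(toy_sleep, n_boot = 10000)

# Histogram of the bootstrap distribution of the mean
hist(boot_means,
     main = "Bootstrap distribution of the mean",
     xlab = "Mean sleep hours per night")
```

With the real data, x would be something like NHANES$SleepHrsNight restricted to rows where Age is between 18 and 22, assuming the Age column records age in years.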
Part B: Now we want to build a confidence interval for the proportion of women we think are pregnant at
any given time. Bootstrap a confidence interval with 10000 iterations. Include in your write-up a histogram
of your simulation results, along with a 95% confidence interval for the proportion. To speed things up, you
can use this code to subset the NHANES data frame to one with only women. Let’s get rid of the NA values
for our variable of interest (PregnantNow) in our filtered data frame:
NHANES_women <- NHANES %>%
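Whatever form the filtering step takes, the bootstrap for Part B can be sketched independently of it. The block below is a minimal percentile-bootstrap interval for a proportion in base R. The helper boot_prop_ci and the toy data are hypothetical; the sketch assumes the variable of interest has already been turned into a logical vector with no missing values.

```r
# Percentile bootstrap confidence interval for a proportion.
# `boot_prop_ci` is a hypothetical helper; `is_yes` must be a logical
# vector with no NAs (TRUE = the outcome of interest).
boot_prop_ci <- function(is_yes, n_boot = 10000, level = 0.95) {
  props <- replicate(
    n_boot,
    mean(sample(is_yes, size = length(is_yes), replace = TRUE))
  )
  alpha <- 1 - level
  list(props = props,
       ci = quantile(props, c(alpha / 2, 1 - alpha / 2)))
}

set.seed(1)
toy_pregnant <- rep(c(TRUE, FALSE), times = c(8, 192))  # made-up data: 4% TRUE
res <- boot_prop_ci(toy_pregnant)
res$ci            # lower and upper ends of the 95% interval
hist(res$props, main = "Bootstrap proportions", xlab = "Proportion")
```

With the filtered data frame, is_yes could be NHANES_women$PregnantNow == "Yes", assuming "Yes" is the level that marks a current pregnancy.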
Problem 2 - Iron Bank
The Securities and Exchange Commission (SEC) is investigating the Iron Bank, where a cluster of employees
have recently been identified in various suspicious patterns of securities trading. Of the last 2021 trades, 70
were flagged by the SEC’s detection algorithm. Trades are flagged periodically even when no illicit market
activity has taken place. For that reason, the SEC often monitors individual and institutional trading but
does not investigate detected incidents that may be consistent with random variability in trading patterns.
SEC data suggest that the overall baseline rate of suspicious securities trades is 2.4%.
Are the observed data (70 flagged trades out of 2021) consistent with the SEC’s null hypothesis that, over the
long run, securities trades from the Iron Bank are flagged at the same baseline rate as that of other traders?
Use Monte Carlo simulation (with at least 100000 simulations) to calculate a p-value under this null hypothesis.
Include the following items in your write-up:
• the null hypothesis that you are testing;
• the test statistic you used to measure evidence against the null hypothesis;
• a plot of the probability distribution of the test statistic, assuming that the null hypothesis is true;
• the p-value itself;
• and a one-sentence conclusion about the extent to which you think the null hypothesis looks plausible
in light of the data. This one is open to interpretation! Make sure to defend your conclusion.
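A minimal sketch of the Monte Carlo calculation, under stated assumptions: the test statistic is taken to be the number of flagged trades, trades are treated as independent, and the p-value is one-sided (counts at least as large as the observed 70). These are choices made for illustration, not the only defensible ones.

```r
# Monte Carlo p-value under the null hypothesis that each of the 2021 trades
# is flagged independently at the 2.4% baseline rate.
set.seed(1)
n_trades       <- 2021
baseline_rate  <- 0.024
observed_flags <- 70
n_sims         <- 100000

# Simulate the number of flagged trades under the null, n_sims times
sim_flags <- rbinom(n_sims, size = n_trades, prob = baseline_rate)

# One-sided p-value: share of simulations at least as extreme as the data
p_value <- mean(sim_flags >= observed_flags)

hist(sim_flags,
     main = "Flagged trades under the null",
     xlab = "Number of flagged trades out of 2021")
abline(v = observed_flags, lty = 2)  # mark the observed count
```

The histogram of sim_flags is the probability distribution of the test statistic under the null, and the dashed line shows where the observed count falls in it.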
Problem 3 - Armfold
A professor at an Australian university ran the following experiment with her students in a data science
class. Everyone in the class stood up, and the professor asked everyone to fold their arms across their chest.
Students then filled out an online survey with two pieces of information: 1) Did they fold their arms with the
left arm on top of right, or with the right arm on top of the left? 2) Did they identify as male or female? The
professor then asked her students to assess whether, in light of the data from the survey, there was support
for the idea that males and females differed in how often they folded their arms with their left arm on top of
the right. The survey data indicated that males folded their arms with their left arms on top more frequently.
But how much more frequently? And was this just a “small-sample” difference? Or did it accurately reflect a
population-level trend? The data from this experiment are in armfold.csv. There are two relevant variables:
• LonR_fold: a binary (0/1) indicator, where 1 indicates left arm on top, and 0 indicates right arm on top.
• Sex: a categorical variable with levels male and female.
(There’s also a third variable indicating which hand the student writes with, but we’re not using that here.)
Your task (quite similar to what we did with the recidivism R walkthrough) is to assess support for any
male/female differences in the population-wide rate of “left arm on top” folding. Make sure to quantify your
uncertainty about how much more often males fold their left arms on top. (That is, it’s not enough to just
report the estimate for this sample; you have to provide a confidence interval that tells us how we can expect
this number to generalize to the wider population. In doing so, you can treat this sample as if it were a
random sample from the relevant population, in this case university students.) Your write-up should include
four sections:
1) Question: What question are you trying to answer?
2) Approach: What modeling approach did you use to answer the question?
3) Results: What evidence/results did your modeling approach provide to answer the question? This
might include numbers, figures, and/or tables as appropriate depending on your approach.
4) Conclusion: What is your conclusion about your question? You will want to provide a short written
interpretation of your confidence interval.
Note: for a relatively simple problem like this, each of these four sections will likely be quite short. Nonetheless,
these sections reflect a good general organization for a data-science write-up. So we’ll start practicing with
this organization on a simple problem, even if it seems a bit overkill at first. (It is certainly possible in this
case for each of them to be only 1 or 2 sentences long. Although you might feel you need more, and although
nobody on our end is breaking out a word counter, it shouldn’t be too much longer than that.)
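One possible approach for Problem 3 is a bootstrap of the male-minus-female difference in the proportion folding with the left arm on top. The sketch below assumes a data frame with the LonR_fold and Sex columns described above. The helper boot_diff_prop and the toy data frame are hypothetical and exist only to make the example runnable.

```r
# `boot_diff_prop` is a hypothetical helper. It assumes a data frame with
# LonR_fold (0/1) and Sex (levels "male" and "female").
boot_diff_prop <- function(df, n_boot = 10000) {
  replicate(n_boot, {
    # Resample whole rows with replacement
    resampled <- df[sample(nrow(df), replace = TRUE), ]
    # Proportion with left arm on top, by sex
    rates <- tapply(resampled$LonR_fold, resampled$Sex, mean)
    unname(rates["male"] - rates["female"])
  })
}

set.seed(1)
toy_armfold <- data.frame(  # made-up data, not from armfold.csv
  LonR_fold = c(rep(1, 30), rep(0, 30), rep(1, 25), rep(0, 35)),
  Sex       = rep(c("male", "female"), each = 60)
)
diffs <- boot_diff_prop(toy_armfold, n_boot = 2000)
quantile(diffs, c(0.025, 0.975))  # percentile 95% interval for the difference
```

With the real data, the data frame would come from something like read.csv("armfold.csv"), and n_boot can be raised as needed.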
Problem 4 - Ebay
In this problem, you’ll analyze data from an experiment run by EBay in order to assess whether the company’s
paid advertising on Google’s search platform was improving EBay’s revenue. (It was certainly improving
Google’s revenue!)
Google Ads, also known as Google AdWords, is Google’s advertising search system, and it’s the primary way
the company made its $162 billion in revenue in fiscal year 2019. The AdWords system has advertisers bid on
certain keywords (e.g., “iPhone” or “toddler shoes”) in order for their clickable ads to appear at the top of
the page in Google’s search results. These links are marked as an “Ad” by Google, and they’re distinct from
the so-called “organic” search results that appear lower down the page.
Nobody pays for the organic search results; pages get featured here if Google’s algorithms determine that
they’re among the most relevant pages for a given search query. But if a customer clicks on one of the
sponsored “Ad” search results, Google makes money. Suppose, for example, that EBay bids $0.10 on the
term “vintage dining table” and wins the bid for that term. If a Google user searches for “vintage dining
table” and ends up clicking on the EBay link from the page of search results, EBay pays Google $0.10 (the
amount of their bid).
For a small company, there’s often little choice but to bid on relevant Google search terms; otherwise their
search results would be buried. But a big site like EBay doesn’t necessarily have to pay in order for their
search results to show up prominently on Google. They always have the option of “going organic,” i.e. not
bidding on any search terms and hoping that their links nonetheless are shown high enough up in the organic
search results to garner a lot of clicks from Google users. So the question for a business like EBay is, roughly,
the following: does the extra traffic brought to our site from paid search results—above and beyond what
we’d see if we “went organic”—justify the cost of the ads themselves?
To try to answer this question, EBay ran an experiment in May of 2013. For one month, they turned off
paid search in a random subset of 70 of the 210 designated market areas (DMAs) in the United States. A
designated market area, according to Wikipedia, is “a region where the population can receive the same or
similar television and radio station offerings, and may also include other types of media including newspapers
and Internet content.” Google allows advertisers to bid on search terms at the DMA level, and it infers the
DMA of a visitor on the basis of that visitor’s browser cookies and IP address. Examples of DMAs include
“New York,” “Miami-Ft. Lauderdale,” and “Beaumont-Port Arthur.” In the experiment, EBay randomly
assigned each of the 210 DMAs to one of two groups:
• the treatment group, where advertising on Google AdWords for the whole DMA was paused for a
month, starting on May 22.
• the control group, where advertising on Google AdWords continued as before.
In ebay.csv you have the results of the experiment. The columns in this data set are:
• DMA: the name of the designated market area, e.g. New York
• rank: the rank of that DMA by population
• tv_homes: the number of homes in that DMA with a television, as measured by the market research
firm Nielsen (who defined the DMAs in the first place)
• adwords_pause: a 0/1 indicator, where 1 means that DMA was in the treatment group, and 0 means
that DMA was in the control group.
• rev_before: EBay’s revenue in dollars from that DMA in the 30 days before May 22, before the
experiment started.
• rev_after: EBay’s revenue in dollars from that DMA in the 30 days beginning on May 22, after the
experiment started.
The outcome of interest is the revenue ratio at the DMA level, i.e. the ratio of revenue after to revenue
before for each DMA. If EBay’s paid search advertising on Google was driving extra revenue, we would expect
this revenue ratio to be systematically lower in the treatment-group DMAs versus the control-group DMAs.
On the other hand, if paid search
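A minimal sketch of the revenue-ratio comparison described above: compute the after-to-before ratio for each DMA, then take the difference in mean ratio between the treatment and control groups. The helper diff_mean_ratio and the toy data frame are hypothetical; only the column names come from the description of ebay.csv.

```r
# `diff_mean_ratio` is a hypothetical helper. It assumes the columns
# rev_before, rev_after, and adwords_pause (1 = treatment, 0 = control).
diff_mean_ratio <- function(df) {
  ratio <- df$rev_after / df$rev_before  # revenue ratio for each DMA
  mean(ratio[df$adwords_pause == 1]) - mean(ratio[df$adwords_pause == 0])
}

toy_ebay <- data.frame(  # made-up numbers, not from ebay.csv
  adwords_pause = c(1, 1, 0, 0),
  rev_before    = c(100, 200, 100, 200),
  rev_after     = c(90, 180, 100, 210)
)
diff_mean_ratio(toy_ebay)  # negative when the treatment group's ratio is lower
```

Resampling the rows of the data frame with replacement and recomputing this difference each time would give a bootstrap distribution for it, in the same spirit as the earlier problems.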
Mar 25, 2021
