DS 111, Prof Spirling, Prof Jones-Rooy Homework 3 This homework is due on Nov. 3 by 8:30p. Please complete this assignment in your own document and then upload it as a single attached PDF to...

I need you to use Python, and only Python, to solve all the questions in the PDF assignment. Post a screenshot of the code, the typed code, and a screenshot of the output. Make sure to label each part with the question you are solving, like 1a or 3b. For questions that don't need Python and just need a one- or two-sentence answer, type those out.
You will need some Excel files for the work. Please make sure to refer to them and use them.





DS 111, Prof Spirling, Prof Jones-Rooy
Homework 3

This homework is due on Nov. 3 by 8:30p. Please complete this assignment in your own document and then upload it as a single attached PDF to Brightspace > Assignments. Email submissions and/or submissions in any format other than PDF will not be accepted.

Throughout your homework you must (a) clearly identify the question you are answering and (b) provide all executed code you used to generate your answers. Failure to do either (a) or (b) will result in no credit for that question. This homework is worth 33 points (one point per sub-question). Late homework will be graded down, no exceptions.

Some of the questions refer to external articles and resources, which you can find attached as PDFs to Brightspace > Assignments > Homework 3 (where you found this assignment). Note that the course academic honesty policy applies to every homework, including this one.

1. This question concerns a sample of 500 observations, nyc_salary_samp.csv, taken from a much larger data set, nyc_salary_full_all.csv. We will treat the latter as the ‘population’ for this question. Our interest will be in the Regular.Gross.Paid variable.
(a) Report the median value of the variable for the sample.
(b) Report the median value of the same variable for the population.
(c) Plot the empirical distribution of the sample as a histogram, and place a blue triangle point on the x-axis where the sample median value is located. Place the population median on the same axis where the population median value is located. Place a legend on the plot (to the top right).
(d) Perform a single bootstrap resample of the sample, and report the median of that resample.
(e) Using a bootstrap with 1000 resamples (so, 1000 replications), report the 95% confidence interval around the sample median. Does it capture the true (population) median?
(f) Draw a histogram of the bootstrapped sampling distribution from Question 1e.
Place the confidence interval along the x-axis in yellow, and plot the true population median as a red dot.
(g) Using the same set of resamples from Question 1e, report the 10%, 90% and 95% confidence intervals, each time producing a histogram (so, three histograms) that includes the confidence interval and the population parameter. Does your confidence interval capture the true parameter every time? What happens as the confidence level increases in terms of the width of the interval?

2. In this question, we will consider a situation in which we are taking many samples from a population, and constructing a confidence interval around the median of each. We will compare those confidence intervals to the true population median (which we would not normally know, but here we will act as if we do). In particular, you must produce a plot that contains the population median as a vertical blue line, with each confidence interval (from the multiple samples) as a horizontal green line.
(a) Simulate 400 confidence intervals (so 400 samples) around the value of the median of the Regular.Gross.Paid variable from the nyc_salary_full_all.csv data. Each sample must be of size 250, and taken without replacement. Each confidence interval should be at the 92% level, and produced via a bootstrap with 300 resamples (replications). Produce a figure of the results, as described above (note the colors!).
(b) What percentage of the time did your confidence interval drawing procedure capture the true parameter value?

3. This question uses the okcupid_data_women.csv data set, which is a (randomly drawn) sample of women from a dating website, with various variables about their profiles.
(a) Report the mean and median height of the women. Plot a histogram of the variable with that median and mean plotted as a red solid line, and black dashed line, respectively.
(b) Report the standard deviation and variance of the heights, using the correct units.
(c) Produce a 95% confidence interval around the sample mean using the bootstrap. The confidence interval should be based on 10000 bootstrap resamples.
(d) A researcher claims that the mean height of women in the population (from which this sample is drawn) is not statistically significantly different from 65 inches. He claims he cannot reject this null at the 0.05 level. Is he correct? Briefly explain.

4. This question also uses the okcupid_data_women.csv data set. Our focus here is on the drinks and drugs variables.
(a) Produce a 95% confidence interval around the proportion of women in the sample who say they drink “not at all”. Use the bootstrap with 2000 resamples.
(b) Produce a 95% confidence interval around the proportion of women in the sample who say they use drugs “never”. Use the bootstrap with 2000 resamples.

5. This question relies on the mean and standard deviation you calculated for the women in Question 3. In what follows, assume that the population from which the women are drawn is normal, and that it has the same mean and standard deviation as the sample you were given.
(a) What is the Z-score for a woman of height 67 inches?
(b) What is the Z-score for a woman of height 60 inches?
(c) What proportion of women are shorter than 60 inches?
(d) What proportion of women are between 60 and 67 inches tall?

6. This question requires you to simulate a game of Texas hold ’em, a fun poker pastime which we will now proceed to suck all the joy from. At the start of this game, each player is dealt two cards (which only they see), and then there are three “community cards” that everyone sees. These three community cards are called the ‘flop’.
(a) Simulate a dealer dealing cards to four players, and then presenting the flop. Report the cards each player received, and the flop itself.
(b) In order to vary the game, a player suggests dealing 7 cards to 7 players, and having a flop of four cards. Try to simulate this game, and explain what error results and why.
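The dealing procedure described in Question 6 can be sketched roughly as below. This is only an illustration under stated assumptions: the helper names `build_deck` and `deal`, the card labels, and the seed are hypothetical and not part of the assignment, and the standard-library `random.sample` is one of several ways to draw without replacement.

```python
import random

# Hypothetical card labels for a standard 52-card deck.
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]


def build_deck():
    """Return a list of 52 distinct card labels."""
    return [f"{rank} of {suit}" for suit in SUITS for rank in RANKS]


def deal(n_players, cards_each, flop_size, seed=None):
    """Deal hands and a flop from one deck, without replacement."""
    rng = random.Random(seed)
    deck = build_deck()
    needed = n_players * cards_each + flop_size
    # random.sample draws without replacement; it raises ValueError
    # when asked for more cards than the deck contains.
    drawn = rng.sample(deck, needed)
    hands = [drawn[i * cards_each:(i + 1) * cards_each] for i in range(n_players)]
    flop = drawn[n_players * cards_each:]
    return hands, flop


hands, flop = deal(n_players=4, cards_each=2, flop_size=3, seed=0)
for number, hand in enumerate(hands, start=1):
    print(f"Player {number}: {hand}")
print("Flop:", flop)
```

Drawing every card in a single `sample` call guarantees that no card appears twice across the hands and the flop. A variant that needs more cards than the deck holds fails at that call rather than silently repeating cards.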

7. A researcher is conducting a study to understand the political engagement of NYU students. To do this, they randomly sample 50 students in this semester’s Data Science for Everyone course and ask them to fill out a short, anonymous survey indicating whether they plan to vote in the upcoming New York City mayoral election on Nov. 2 with the possible answers being “yes”, “no”, or “maybe”.
(a) How is this researcher conceptualizing ‘political engagement’?
(b) How is this researcher operationalizing ‘political engagement’?
(c) What is one possible source of random error in the resulting dataset from this study?
(d) The researcher is concerned that students may misrepresent their true intentions around voting. What do we call this type of bias, why might it occur, and in what direction would it likely bias the results?
(e) What is one possible source of selection bias in this study?
(f) A critic points out that measuring ‘political engagement’ in terms of the likelihood of voting in NY may not be picking up engagement so much as whether a student is from New York, and thus registered to vote here. The critic is identifying the possibility of what kind of error?

8. For this question we are going to use the new_books_per_million.csv dataset, which is provided by an organization called Our World in Data. It provides the number of books published per one million inhabitants per year in “entities” (which are countries, plus additional territories and/or regions) around the world over the past several centuries.
(a) Import the dataset into Python as a Pandas dataframe and display the first ten observations.
(b) When working with data, we usually prefer to work with variables with simple, lowercased names. Rename the first three columns so they are lowercased only (but otherwise the same), and rename the fourth column as ‘titles_per_cap’. Do all of this in one command (but make sure all code is visible when you export), and make sure the change persists.
(c) Using the ‘sort_values()’ command, from what year is the most recent observation in this dataset?
(d) We are going to evaluate book publications per capita around the world in the years 1900 and 2000. Create two subsets of your dataframe, one with observations only from 1900, and one with observations only from 2000. Show the first five rows of these two new dataframes.
(e) Using any command(s) other than ‘sort_values()’, what country published the most titles per capita in 1900? In 2000?
(f) How many observations are in each dataset for 1900 and 2000? What does this suggest about the representativeness of this data in terms of the “global” production of books? Provide your answer in terms of a likely systematic error in this dataset (you may need to inspect the specific observations).

"","Fiscal.Year","Payroll.Number","Agency.Name","Last.Name","First.Name","Mid.Init","Agency.Start.Date","Work.Location.Borough","Title.Description","Leave.Status.as.of.June.30","Base.Salary","Pay.Basis","Regular.Hours","Regular.Gross.Paid","OT.Hours","Total.OT.Paid","Total.Other.Pay"
"16059",2016,NA,"ADMIN FOR CHILDREN'S SVCS ","DUNN","RHONDA","","2/18/2007","MANHATTAN ","PRINCIPAL ADMINISTRATIVE ASSOCIATE","CEASED",49350," per Annum",1466,41299.67,0,0,3214.7
"16060",2016,NA,"ADMIN FOR CHILDREN'S SVCS ","EVANS","PHYLLIS","","6/23/1996","MANHATTAN ","PRINCIPAL ADMINISTRATIVE ASSOCIATE","CEASED",69838," per Annum",930,40086.17,0,0,2125.78
"16061",2016,NA,"ADMIN FOR CHILDREN'S SVCS ","RAY","JEANINE","","6/1/1998","QUEENS ","PRINCIPAL ADMINISTRATIVE ASSOCIATE","ACTIVE",49321," per Annum",1830,48062.77,0,0,1604.55
"16062",2016,NA,"ADMIN FOR CHILDREN'S SVCS ","SMITH","DENISE","M","6/23/1996","MANHATTAN ","PRINCIPAL ADMINISTRATIVE ASSOCIATE","ACTIVE",49389," per Annum",1830,48129.53,0,0,3849.47
"16063",2016,NA,"ADMIN FOR CHILDREN'S SVCS ","TRACEY","DAWN","M","6/23/1996","MANHATTAN ","PRINCIPAL ADMINISTRATIVE ASSOCIATE","ACTIVE",49576," per Annum",1819.5,48034.05,7,0,3802.73
"16064",2016,NA,"ADMIN FOR CHILDREN'S SVCS ","BARRENTINE","AUDREY","","6/23/1996","MANHATTAN ","PRINCIPAL ADMINISTRATIVE ASSOCIATE","ACTIVE",49508," per Annum",1684.75,46886.11,0,0,3717.47
"16065",2016,NA,"ADMIN FOR CHILDREN'S SVCS ","BRATHWAITE","CHERYL","E","10/21/2002","MANHATTAN ","PRINCIPAL ADMINISTRATIVE ASSOCIATE","ACTIVE",55144," per Annum",1830,53758.15,0,0,678.42
"16066",2016,NA,"ADMIN FOR CHILDREN'S SVCS ","BUNCH","CALVIN","","7/26/2004","MANHATTAN ","PRINCIPAL ADMINISTRATIVE ASSOCIATE","ACTIVE",49321," per Annum",1830,48062.77,0,0,1598.42
"16067",2016,NA,"ADMIN FOR CHILDREN'S SVCS ","AARON","TERESA","","3/21/2016","BRONX ","CHILD PROTECTIVE SPECIALIST
Nov 02, 2021
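The bootstrap confidence intervals that Questions 1 through 4 of the assignment ask for can be illustrated with a percentile bootstrap. The sketch below is not a solution to the homework: it runs on synthetic lognormal data standing in for a salary column, and the function name `bootstrap_ci`, its parameters, and the seeds are all assumptions made for illustration. The percentile method is one common way to form a bootstrap interval; the course may intend a different variant.

```python
import numpy as np


def bootstrap_ci(sample, stat=np.median, n_resamples=1000, level=0.95, seed=None):
    """Percentile bootstrap interval for `stat` (hypothetical helper).

    Returns (lower, upper, stats), where `stats` holds the statistic
    computed on each resample.
    """
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Each row of `idx` indexes one resample: n draws with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    stats = stat(sample[idx], axis=1)
    alpha = 1.0 - level
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper, stats


# Synthetic stand-in data (an assumption, not the assignment's CSV files).
rng = np.random.default_rng(0)
population = rng.lognormal(mean=10.5, sigma=0.6, size=100_000)
sample = rng.choice(population, size=500, replace=False)

lo, hi, stats = bootstrap_ci(sample, n_resamples=1000, level=0.95, seed=1)
print(f"sample median:     {np.median(sample):.2f}")
print(f"population median: {np.median(population):.2f}")
print(f"95% interval:      [{lo:.2f}, {hi:.2f}]")
```

Passing `stat=np.mean` gives an interval around the mean, and a proportion can be handled by passing an array of 0/1 indicators with `stat=np.mean`. Because the function returns `stats`, the same set of resamples can be reused to read off intervals at several confidence levels with `np.percentile`.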