
Produce summary statistics (20 pts), histograms (20 pts), and confidence intervals (10 pts) for your dataset (50 points total). Conduct a two-sample hypothesis test for the difference between means (25 pts) and interpret the results (25 pts). Refer to the template in the course content. When finished, think about variables that may explain why the difference exists; these will be useful for the larger project, which we will start after fall break.
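As a rough illustration of the summary statistics and confidence intervals asked for above, here is a minimal Python sketch using only the standard library. The function name `summarize`, the toy data, and the use of a normal critical value (z = 1.96 for 95%) are assumptions made for illustration; the course template uses Gretl or R, and a real submission needs at least 100 observations per sample. Histograms are not covered here.

```python
import math
import statistics


def summarize(values, z=1.96):
    """Return N, mean, median, std. dev., min, max and a z-based CI for the mean.

    z is the normal critical value (1.96 for roughly 95%); this is a
    large-sample approximation, assumed here for simplicity.
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)      # sample std. dev. (n - 1 denominator)
    se = sd / math.sqrt(n)             # standard error of the mean
    return {
        "N": n,
        "mean": mean,
        "median": statistics.median(values),
        "std_dev": sd,
        "min": min(values),
        "max": max(values),
        "ci": (mean - z * se, mean + z * se),
    }


# Hypothetical toy data, far smaller than the 100 observations required.
sample = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 10.0, 13.0]
stats = summarize(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Running `summarize` once per sample gives the two columns of a summary table; passing a different `z` (for example 1.645 or 2.576) gives approximate 90% or 99% intervals.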


Project 1: Hypothesis Test

For a topic of your choice, find data for a variable of interest for two samples of at least 100 observations each (case-by-case basis). Examples: income for a sample of men vs. a sample of women, sports stats for players of type X vs. players of type Y, etc. Provide summary statistics (N, mean, median, std. dev., min, max), histograms, and confidence intervals for the variable of interest for both samples. Test the null hypothesis that the population means are equal for the two groups (e.g., that average income is the same for men and women). An example is provided below.

Income Inequality Hypothesis Test
Carlos Lopes [1]
September 2018

[1] Carlos Lopes, Abraham Baldwin Agricultural College; email: [email protected]

Introduction/Data

This paper examines the differences in annual income of men and women over the period from 1990-2014 in the United States. Data was collected for each of the 25 years from the Annual Social and Economic Supplement to the March Current Population Survey conducted by the Bureau of Labor Statistics. Income data was converted into real 2014 dollars using annual averages from the consumer price index. The table below provides summary statistics for a sample of individuals with earned income (income > 0).

                          Men                      Women
N                         1,114,447                1,056,626
Mean                      $51,995                  $32,086
Median                    $39,788                  $25,311
Minimum                   $1                       $1
Maximum                   $1,789,200               $1,300,000
Std. Dev.                 $59,893                  $35,546

Confidence intervals
for the mean              Lower       Upper        Lower       Upper
90%                       $51,901     $52,089      $32,029     $32,143
95%                       $51,884     $52,106      $32,018     $32,154
99%                       $51,849     $52,141      $31,997     $32,175

[Comment by Carlos Lopes, on N: These sample sizes are huge, and this affects the “power” of the test we’re doing. In other words, we’re going to “find something.” With smaller sample sizes, real differences are more easily missed because it’s harder for us to accurately estimate population means.]

[Comment by Carlos Lopes, on the confidence intervals: Notice that if we construct the confidence intervals we can see that they do not come close to overlapping for the two groups. In the real world there would be no practical reason to continue at this point; you know the population means are different.]

From the table, we can see that mean real income for men is $19,909 higher than for women. To test whether this difference is statistically significant, we will conduct a 2-sample hypothesis test. Our null hypothesis is that the mean incomes of men and women are not different (i.e., the difference we observe in our sample is just due to random chance). Our alternative hypothesis is that the underlying mean incomes of men and women in the U.S. population are different.

[Comment by Carlos Lopes: The mean values for two samples are almost always different. The key question is: could this difference be attributed to randomness, or is this evidence of a “real” underlying difference?]

Note: There are a couple of different methods that can be used to conduct this test. Two are outlined below. It doesn’t matter how you run the test; depending on what your data looks like, you might prefer one strategy over the other. Notice that in each case the t-stat is the same.

Method 1

In Gretl: Tools > Test Statistic Calculator > 2 means
In R: t.test(X, Y, var.equal = TRUE) (pooled variance, which matches the Gretl output below; R's default is the unequal-variance Welch test)

    Null hypothesis: Difference of means = 0

    Sample 1:
     n = 1114447, mean = 51995, s.d. = 59893
     standard error of mean = 56.7344
     95% confidence interval for mean: 51883.8 to 52106.2

    Sample 2:
     n = 1056626, mean = 32086, s.d. = 35546
     standard error of mean = 34.5804
     95% confidence interval for mean: 32018.2 to 32153.8

    Test statistic: t(2171071) = (51995 - 32086)/67.2956 = 295.844
    Two-tailed p-value = 0
    (one-tailed = 0)

Method 2

In Gretl: Model > OLS. Put the variable of interest (income) as the dependent variable and the “dummy variable” as the independent variable.
In R: t.test(X ~ Y, var.equal = TRUE)

    Model 1: OLS, using observations 1-2171073
    Dependent variable: income

                Coefficient   Std. Error   t-ratio     p-value
    const       51994.7       46.7592      1111.966    <0.0001   ***
    female      −19909        67.2956      −295.844    <0.0001   ***

Interpretation

In this sample our t-stat is 295.84, which is greater than the 95% critical value of 1.96, so we reject the null hypothesis at the 95% confidence level. Assuming the null hypothesis is true (an assumption the data do not support in this case), we would see data that produces results at least as extreme as this less than 0.01% of the time.

What if we tested the null hypothesis that men’s income over these 25 years is $20,000 higher than women’s?

    Null hypothesis: Difference of means = 20000

    Sample 1:
     n = 1114447, mean = 51995, s.d. = 59893
     standard error of mean = 56.7344
     95% confidence interval for mean: 51883.8 to 52106.2

    Sample 2:
     n = 1056626, mean = 32086, s.d. = 35546
     standard error of mean = 34.5804
     95% confidence interval for mean: 32018.2 to 32153.8

    Test statistic: t(2171071) = (51995 - 32086 - 20000)/67.2956 = -1.35224
    Two-tailed p-value = 0.1763
    (one-tailed = 0.08815)

Our samples suggest a $19,909 difference, which is lower than $20,000. If the true difference really is $20,000 (i.e., assuming the null hypothesis is true), we would observe a sample difference at least this far from $20,000 about 17.6% of the time. Because that is higher than 5%, this is an example of a case where we would fail to reject the null.
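The test statistics in the Gretl output can be approximately reproduced from the summary statistics alone. The sketch below is written in Python rather than Gretl or R, and it rests on two assumptions not stated in the example: that the test uses a pooled variance (the reported standard error of 67.2956 and the degrees of freedom n1 + n2 - 2 = 2171071 are consistent with that), and that a normal approximation to the t distribution is acceptable for the p-value because the degrees of freedom are so large. The function name `pooled_t_test` is hypothetical.

```python
import math


def pooled_t_test(n1, mean1, sd1, n2, mean2, sd2, null_diff=0.0):
    """Two-sample t-test with pooled variance, from summary statistics.

    Returns (t statistic, std. error of the difference, degrees of freedom,
    two-tailed p-value). The p-value uses a normal approximation, which is
    only reasonable when the degrees of freedom are large.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2 - null_diff) / se_diff
    p_two_tailed = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, se_diff, df, p_two_tailed


# Summary statistics from the example: (N, mean, std. dev.)
men = (1_114_447, 51_995, 59_893)
women = (1_056_626, 32_086, 35_546)

# Null hypothesis 1: the difference of means is 0.
t0, se, df, p0 = pooled_t_test(*men, *women)
# Null hypothesis 2: the difference of means is $20,000.
t20k, _, _, p20k = pooled_t_test(*men, *women, null_diff=20_000)

print(f"H0: diff = 0      t({df}) = {t0:.3f}, two-tailed p = {p0:.4f}")
print(f"H0: diff = 20000  t({df}) = {t20k:.3f}, two-tailed p = {p20k:.4f}")
```

With the figures from the example, this should give a t-statistic near 295.84 for the first null hypothesis and near -1.352, with a two-tailed p-value near 0.176, for the second. For small samples the normal approximation would understate the p-value, and a proper t distribution (as in Gretl or R) should be used instead.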
Oct 18, 2021