It's just the Data Analysis Project at the beginning. Parts 1-3
Econ 378
Data Analysis Project

Overview

This project gives you hands-on experience summarizing and analyzing data of your own interest. You are welcome to use spreadsheet or statistical software such as Excel or Stata. Some major statistical databases are listed on Learning Suite, and numerous data sources are available freely on the internet. It will be easy to find something that fits the parameters of the project, but I encourage you to find something that is important to you personally, to make the project much more meaningful.[1] Feel free to consult with me or with other professors if you need help finding a specific type of data. Examples that you might find interesting include:

• Price data (e.g. wages, interest rates, stock returns, home values, insurance premiums)
• National statistics (e.g. GDP, employment, inflation, crime, tax rates) over time or across countries or states
• Sales data from a business (with permission/confidentiality, as appropriate)
• Health/sports/political statistics, opinion polls, or your own experimental research[2]
• Your own personal finances, time use, grades, etc.

This class prepares you to answer questions such as:
(1) On average, how big is variable X?
(2) How widely does X vary across observations?
(3) Is variable X positively or negatively correlated with variable Y, and how strong is this relationship?
(4) How can I use variable X to predict variable Y?

To answer these questions, you will need at least two variables, but this will not be difficult. Additional variables may make the analysis more interesting, but you will only analyze two at a time. You can also analyze multiple variables using Econometrics (Econ 388), so keep your data.

[1] If you lack research ideas, imagine that you have a magic crystal ball that can answer any one question of your choice. What do you wish to ask? That question is your research topic. Next, suppose that you have to answer that question on your own, but that you can ask the crystal ball for any secondary facts that will aid you in answering your big question for yourself. What more specific questions will lead you eventually to the answers you had wanted? Continue this procedure until you reach a question that is sufficiently specific (albeit several steps removed from your original interest) that it becomes feasible to collect the relevant data and get to work.

[2] If you collect data from human subjects, you must take care to preserve their safety and privacy, and ensure that participation is voluntary. If you wish to publish your data or results beyond this class, you will need advance approval from the BYU Institutional Review Board, which monitors compliance with federal regulations (see http://orca.byu.edu/IRB/ for more details). Start early in that case, to leave time for the approval process, and request additional time if necessary.

Part 1 – Data Collection & Summary (+35)

1. (+15) Collect data of interest

You do not need to submit your data files; just describe the data: If it is not obvious already, what exactly do the variables measure (e.g., what units)?[3] How were they collected? Do you have data for the entire population of interest? Or just a sample?

The first column of data should list the unit of observation (e.g. individual, firm, country, or time period).[4] For each observation, you need at least one quantitative variable (e.g. price, number of sales, age, GDP) and one binary variable (e.g. gender, race, industry, political party, sport position).[5]

While not required, it is often interesting to pull data from multiple sources, or to construct new variables from existing data.[6] In the spreadsheet below, for example, government finance variables come from one source and a binary political variable comes from another.
Per capita variables are then computed simply as ratios; growth variables are computed simply as differences (as a ratio of the original level); and additional binary variables are constructed either by reducing a quantitative variable into “high” and “low” categories (e.g. GDP growth above or below 1.5%) or by comparing two existing variables (e.g. Gov. growth > GDP growth?).

Unit of observation: Year. Original variables: GDP, Population, Gov. Spending, Republican House?. Constructed variables: the remaining six columns.

| Year | GDP ($ bil.) | Population (mil.) | Gov. Spending ($ bil.) | Republican House? | Per capita GDP ($ thous.) | Per capita GDP growth (%) | GDP Growth > 1.5%? | Per capita Gov. spending ($ thous.) | Per capita Gov. growth (%) | Gov. growth > GDP growth? |
|------|--------|-----|-------|---|------|-------|---|------|-------|---|
| 2008 | 14,834 | 304 | 4,665 | 0 | 48.8 | -     | - | 15.3 | -     | - |
| 2009 | 14,418 | 307 | 5,179 | 0 | 47.0 | -3.7% | 0 | 16.9 | 10.1% | 1 |
| 2010 | 14,779 | 309 | 5,057 | 0 | 47.8 | 1.7%  | 1 | 16.3 | -3.2% | 0 |
| 2011 | 15,052 | 312 | 5,116 | 1 | 48.3 | 1.1%  | 0 | 16.4 | 0.4%  | 0 |
| 2012 | 15,471 | 314 | 5,042 | 1 | 49.3 | 2.0%  | 1 | 16.1 | -2.2% | 0 |
| 2013 | 15,759 | 316 | 4,955 | 1 | 49.8 | 1.1%  | 0 | 15.7 | -2.5% | 0 |
| 2014 | 16,077 | 319 | 4,957 | 1 | 50.4 | 1.3%  | 0 | 15.5 | -0.7% | 0 |

[3] For example, a humanitarian agency might rate sovereign governments as “corrupt” or not, and designate individuals as “in poverty” or not, but how are these categories assigned? What exactly do they mean?

[4] You need at least three observations; larger samples increase precision. If you have trouble identifying the unit of observation, it may be that your data are actually a summary of more primitive raw data. If so, this may be unusable, as the number of observations is effectively reduced to one.

[5] You can make a categorical variable binary simply by combining categories. For example, a “race” variable might have several codes for different races, but can be reduced simply to “white” and “minority”. You can also construct binary variables from quantitative variables (see the constructed variables in the spreadsheet example).

[6] When the unit of observation is a time period (e.g. year or week), it can also double as a quantitative variable.

2. (+3) Identify your audience

Identify some audience that might find this data interesting: a policy maker, a business leader, a consumer, etc. In Part 2, you will report your findings to this individual. List any questions (at least two) that this audience might have, that you believe your data can shed (at least partial) light on.

3. (+6) Summarize individual variables

a. Summarize at least one binary variable by reporting the total fraction in each category.
b. Summarize at least one quantitative variable by reporting the minimum, maximum, mean, and standard deviation.
c. Use one binary variable to divide your data into subgroups, and report the conditional minimum, conditional maximum, conditional mean, and conditional standard deviation for this subgroup (e.g. average wages among female workers). Note: for all subsequent analysis of this project, you may use the full sample or this restricted sample, as you wish.
d. Represent at least one quantitative variable graphically, using a histogram.[7]

4. (+6) Correlation and causation

Choose two variables, and do the following:

a. Identify reasons why the variables might be positively or negatively correlated. Might one cause the other to increase or decrease? Is reverse causation possible? Are there outside factors that might cause both variables to move? Predict the sign and magnitude of the correlation coefficient ρ between these variables.
b. For any outside factors that you identify in part a, tell what additional data could be collected and examined, to control for these outside factors.
c. Compute the actual correlation coefficient, and compare it with your prediction above.

5. (+5) Graphical Summary

Compare two variables graphically, using something like the following. Include labels (e.g. color-code, axis labels, legend, etc.) so that your graphic is clear.
• Scatter chart (two quantitative variables)
• Double pie chart (two categorical variables)
• Color-coded scatter chart (two quantitative and one categorical variable)
• Bar or column chart (one categorical and one or more quantitative or categorical variables)
• Line graph (one quantitative variable and time)
• Bubble chart (three quantitative variables)

Briefly describe some facet of the relationship between the two variables that is apparent in the type of graphic you chose.

[7] In MS Excel 2010, load the “Data Analysis” tool pack (File>Options>Add-ins for PC or Tools>Add-ins for Mac), and then select Data>Data Analysis>Histogram. Select the Input Range and Bin Range, and be sure to select the box for “Chart Output”. Note that a bar chart is not the same as a histogram.

Part 2 – Statistical Inference (+28)

Do the following, stating any important assumptions that your answers rely on.[8] You do not need to write out all of your computations, but should make clear how you arrived at your answers.

1. Mean

a. (+2) For at least one quantitative variable, find a point estimate of the underlying population mean μ.[9] Compute a confidence interval for μ, at a confidence level of your choice.[10]
b. (+2) Perform a one- or two-sided test, at the level of your choice, of the hypothesis that μ is equal to a specific value of your choice. State the associated p-value.

2. Standard Deviation (OPTIONAL; must do 2 or 5 or 6)

a. (+2) For at least one quantitative variable, find a point estimate of the underlying population standard deviation σ. Compute a confidence interval for σ, at a