SPSS Exercises for MAT 301

TABLE OF CONTENTS

STAT1S: Exercise Using SPSS to Explore Levels of Measurement
Goals of Exercise
Part I—Introduction to Levels of Measurement
STAT2S: Exercise Using SPSS to Explore Measures of Central Tendency and Dispersion
Goals of Exercise
Part I – Measures of Central Tendency
Part II – Deciding Which Measure of Central Tendency to Use
Part III – Measures of Dispersion or Variation
STAT3S: Exercise Using SPSS to Explore Measures of Skewness and Kurtosis
Goals of Exercise
Part I – Measures of Skewness
Part II – Measures of Kurtosis
STAT4S: Exercise Using SPSS to Explore Graphs and Charts
Goals of Exercise
Part I – Pie Charts
Part II – Bar Charts
Part III – Histograms
Part IV – Box Plots
Part V – Conclusions
STAT5S: Exercise Using SPSS to Explore Hypothesis Testing – One-Sample t Test
Goals of Exercise
Part I – Simple Random Sampling
Part II. Hypothesis Testing – the One-Sample T test
Part III. Now It’s Your Turn
STAT6S: Exercise Using SPSS to Explore Hypothesis Testing – Independent-Samples t Test
Goals of Exercise
Part I – Computing Means
Part II – Now it’s Your Turn
Part III – Hypothesis Testing – Independent-Samples t Test
Part IV – Now it’s Your Turn Again
Part V – What Does Independent Samples Mean?
STAT7S: Exercise Using SPSS to Explore Hypothesis Testing – Paired-Samples t Test
Goals of Exercise
Part I – Populations and Samples
Part II – Now it’s Your Turn
Part III – Hypothesis Testing – Paired-Samples t Test
Part IV – Now it’s Your Turn Again
STAT8S: Exercise Using SPSS to Explore Hypothesis Testing – One-Way Analysis of Variance
Goals of Exercise
Part I – Populations and Samples
Part II – Now it’s Your Turn
Part III – Hypothesis Testing – One-Way Analysis of Variance
Part IV – Now it’s Your Turn Again
STAT9S: Exercise Using SPSS to Explore Crosstabulation
Goals of Exercise
Part I—Relationships between Variables
Part II – Interpreting the Percents
Part III – Now it’s Your Turn
Part IV – Adding another Variable into the Analysis
Part V – Now it’s Your Turn Again
STAT10S: Exercise Using SPSS to Explore Chi Square
Goals of Exercise
Part I—Relationships between Variables
Part II – Interpreting the Percents
Part III – Chi Square
Part IV – Now it’s Your Turn
Part V – Expected Values
Part VI – Now it’s Your Turn Again
STAT13S: Exercise Using SPSS to Explore Correlation
Goals of Exercise
Part I – Scatterplots
Part II – Now it’s Your Turn
Part III - Pearson Correlation Coefficient
Part IV – Now it’s Your Turn Again
Part V – Correlation Matrices
Part VI – The Correlation Ratio or Eta-Squared
Part VII – Your Turn
STAT1S: Exercise Using SPSS to Explore Levels of Measurement
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is
gss14_subset_for_classes_STATISTICS.sav which is a subset of the XXXXXXXXXX General Social Survey.
Some of the variables in the GSS have been recoded to make them easier to use and some new
variables have been created. The data have been weighted according to the instructions from the
National Opinion Research Center. This exercise uses FREQUENCIES in SPSS to introduce the
concept of levels of measurement (nominal, ordinal, interval, and ratio measures). A good
reference on using SPSS is SPSS for Windows Version XXXXXXXXXX A Basic Tutorial by Linda Fiddler,
John Korey, Edward Nelson (Editor), and Elizabeth Nelson. The online version of the book is on
the Social Science Research and Instructional Council's Website. You have permission to use this
exercise and to revise it to fit your needs. Please send a copy of any revision to the author.
Included with this exercise (as separate files) are more detailed notes to the instructors, the SPSS
syntax necessary to carry out the exercise (SPSS syntax file), and the SPSS output for the
exercise (SPSS output file). Please contact the author for additional information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; docx format).
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; docx format)
Goals of Exercise
The goal of this exercise is to explore the concept of levels of measurement (nominal, ordinal,
interval, and ratio measures) which is an important consideration for the use of statistics. The
exercise also gives you practice in using FREQUENCIES in SPSS.
Part I—Introduction to Levels of Measurement
We use concepts all the time. We all know what a book is. But when we use the word “book” we’re
not talking about a particular book that we’re reading. We’re talking about books in general. In other
words, we’re talking about the concept to which we have given the name “book.” There are many
different types of books – paperback, hardback, small, large, short, long, and so on. But they all
have one thing in common – they all belong to the category “book.”
Let’s look at another example. Religiosity is a concept which refers to the degree of attachment
that individuals have to their religious preference. It’s different from religious preference, which
refers to the religion with which they identify. Some people say they are Lutheran; others say they
are Roman Catholic; still others say they are Muslim; and others say they have no religious
preference. Religiosity and religious preference are both concepts.
A concept is an abstract idea. So there are the abstract ideas of book, religiosity, religious
preference, and many others. Since concepts are abstract ideas and not directly observable, we
must select measures or indicants of these concepts. Religiosity can be measured in a number of
different ways – how often people attend church, how often they pray, and how important they say
their religion is to them.
We’re going to use the General Social Survey (GSS) for this exercise. The GSS is a national
probability sample of adults in the United States conducted by the National Opinion Research
Center (NORC). The GSS started in XXXXXXXXXX and has been an annual or biennial survey ever since.
For this exercise we’re going to use a subset of the XXXXXXXXXX GSS. Your instructor will tell you how to
access this data set which is called gss14_subset_for_classes_STATISTICS.sav.
The GSS is an example of a social survey. The investigators selected a sample from the
population of all adults in the United States. This particular survey was conducted in XXXXXXXXXX and is a
relatively large sample of approximately 2,500 adults. In a survey we ask respondents questions
and use their answers as data for our analysis. The answers to these questions are used as
measures of various concepts. In the language of survey research these measures are typically
referred to as variables. Often we want to describe respondents in terms of social characteristics
such as marital status, education, and age. These are all variables in the GSS.
These measures are often classified in terms of their levels of measurement. S. S. Stevens
described measures as falling into one of four categories – nominal, ordinal, interval, or ratio. [1]
Here’s a brief description of each level.
A nominal measure is one in which objects (i.e. in our survey, these would be the respondents)
are sorted into a set of categories which are qualitatively different from each other. For example,
we could classify individuals by their marital status. Individuals could be married or widowed or
divorced or separated or never married. Our categories should be mutually exclusive and
exhaustive. Mutually exclusive means that no individual can be sorted into more than one
category. Exhaustive means that every individual can be sorted into some category. We wouldn’t want
to use single as one of our categories because some people who are single can also be divorced
and therefore could be sorted into more than one category. We wouldn’t want to leave widowed off
our list of categories because then we wouldn’t have any place to sort these individuals.
The categories in a nominal level measure have no inherent order to them. This means that it
wouldn’t matter how we ordered the categories. They could be arranged in any number of different
ways. Run FREQUENCIES in SPSS for the variable d10_marital so you can see the frequency
distribution for a nominal level variable. (See Frequencies in Chapter 4 of the SPSS online book
mentioned on page XXXXXXXXXX.) It wouldn’t matter how we ordered these categories.
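As a rough illustration outside SPSS, a frequency distribution for a nominal variable can be sketched in a few lines of Python. The responses below are hypothetical and are not the GSS data, and the name `marital` is used only for this example. The categories are printed alphabetically, which is an arbitrary choice. Any other order would describe the distribution just as well.

```python
from collections import Counter

# Hypothetical responses for illustration only; NOT the GSS data.
marital = ["married", "never married", "divorced", "married",
           "widowed", "married", "separated", "never married"]

counts = Counter(marital)
total = sum(counts.values())

# One entry per category: (count, percent of all responses).
freq_table = {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}

# Alphabetical order is arbitrary for a nominal variable.
for cat, (n, pct) in sorted(freq_table.items()):
    print(f"{cat:15s} {n:3d} {pct:5.1f}%")
```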
An ordinal measure is a nominal measure in which the categories are ordered from low to high or
from high to low. We could classify individuals in terms of the highest educational degree they
achieved. Some individuals did not complete high school; others graduated from high school but
didn’t go on to college. Other individuals completed a two-year junior college degree but then
stopped college. Still others completed their bachelor’s degree and others went on to graduate
work and completed a master’s degree or their doctorate. These categories are ordered from low to
high.
But notice that while the categories are ordered they lack an equal unit of measurement. That
means, for example, that the differences between categories are not necessarily equal. Run
FREQUENCIES in SPSS for d3_degree. Look at the categories. The GSS assigned values (i.e.,
numbers) to these categories in the following way:
● 0 = less than high school,
● 1 = high school degree,
● 2 = junior college,
● 3 = bachelors, and
● 4 = graduate.
The difference in education between the first two categories is not the same as the difference
between the last two categories. We might think they are because 0 minus 1 is equal to 3 minus 4
but this is misleading. These aren’t really numbers. They’re just symbols that we have used to
represent these categories. We could just as well have labeled them a, b, c, d, and e. They don’t
have the properties of real numbers. They can’t be added, subtracted, multiplied, and divided. All
we can say is that b is greater than a and that c is greater than b and so on.
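One way to see that these codes are only ordered symbols is to replace them with a different set of codes that keeps the same order. The sketch below uses hypothetical respondents (not the GSS data) and a made-up alternative coding called `recode`. The middle category stays the same under both codings, but the arithmetic mean of the codes changes, which suggests the mean of ordinal codes is not a meaningful quantity.

```python
import statistics

labels = {0: "less than high school", 1: "high school degree",
          2: "junior college", 3: "bachelors", 4: "graduate"}

# Hypothetical respondents coded 0-4 as in the list above; NOT real data.
codes = [0, 1, 1, 1, 2, 3, 3, 4, 4]

# A made-up alternative coding that preserves the same ordering.
recode = {0: 1, 1: 2, 2: 5, 3: 20, 4: 100}
alt_labels = {recode[k]: v for k, v in labels.items()}
alt_codes = [recode[c] for c in codes]

# The middle category depends only on order, so it is unchanged.
median_a = labels[statistics.median_low(codes)]
median_b = alt_labels[statistics.median_low(alt_codes)]

# The mean depends on the arbitrary numbers, so it changes.
mean_a = statistics.mean(codes)
mean_b = statistics.mean(alt_codes)

print(median_a, median_b)
print(mean_a, mean_b)
```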
An interval measure is an ordinal measure with equal units of measurement. For example,
consider temperature measured in degrees Fahrenheit. Now we have equal units of measurement
– degrees Fahrenheit. The difference between XXXXXXXXXX degrees and XXXXXXXXXX degrees is the same as the
difference between XXXXXXXXXX degrees and XXXXXXXXXX degrees. Now the numbers have the properties of real
numbers and we can add them and subtract them. But notice one thing about the Fahrenheit scale.
There is no absolute zero point. There can be both positive and negative temperatures. That
means that we can’t compare values by taking their ratios. For example, we can’t divide 80
degrees Fahrenheit by XXXXXXXXXX degrees and conclude that XXXXXXXXXX is twice as hot as XXXXXXXXXX. To do that we would
need a measure with an absolute zero. [2]
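A short Python sketch can make the point about ratios concrete. The two temperatures below are hypothetical values chosen for illustration. The ratio of the two readings changes when the same temperatures are expressed in Celsius or Kelvin, so a ratio of Fahrenheit readings tells us nothing about how much "hotter" one is. Differences, by contrast, stay equal to each other after conversion.

```python
def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

def f_to_k(f):
    """Convert degrees Fahrenheit to kelvins (a scale with an absolute zero)."""
    return f_to_c(f) + 273.15

hot, cool = 80.0, 40.0  # hypothetical readings in degrees Fahrenheit

ratio_f = hot / cool                  # ratio of the Fahrenheit numbers
ratio_c = f_to_c(hot) / f_to_c(cool)  # same temperatures, Celsius numbers
ratio_k = f_to_k(hot) / f_to_k(cool)  # same temperatures, Kelvin numbers

# Equal differences in Fahrenheit remain equal differences in Celsius.
diff_high = f_to_c(80.0) - f_to_c(70.0)
diff_low = f_to_c(50.0) - f_to_c(40.0)

print(ratio_f, ratio_c, round(ratio_k, 3))
print(round(diff_high, 3), round(diff_low, 3))
```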
A ratio measure is an interval measure with an absolute zero point. Run FREQUENCIES for
d9_sibs which is the number of siblings. This variable has an absolute zero point and all the
properties of nominal, ordinal, and interval measures and therefore is a ratio variable.
Notice that level of measurement is itself ordinal since it is ordered from low (nominal) to high
(ratio). It’s what we call a cumulative scale. Each level of measurement adds something to the
previous level.
Why is level of measurement important? One of the things that helps us decide which statistic to
use is the level of measurement of the variable(s) involved. For example, we might want to
describe the central tendency of a distribution. If the variable was nominal, we would use the
mode. If it was ordinal, we could use the mode or the median. If it was interval or ratio, we could
use the mode or median or mean. Central tendency will be the focus of another exercise
(STAT2S).
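The rule in the paragraph above can be written down as a small lookup table. This is only a sketch of the rule as stated here. The names `ALLOWED` and `central_tendency_options` are made up for the example and do not correspond to anything in SPSS.

```python
# Measures of central tendency permitted at each level of measurement,
# following the rule described in the text. Each level keeps everything
# allowed at the level below it (the scale is cumulative).
ALLOWED = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}

def central_tendency_options(level):
    """Return the measures that may be used for a given level of measurement."""
    return ALLOWED[level.lower()]

for level in ("nominal", "ordinal", "interval", "ratio"):
    print(level, central_tendency_options(level))
```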
Run FREQUENCIES for the following variables in the GSS:
● f4_satfin,
● f11_wealth,
● hap2_happy,
● p1_partyid,
● r1_relig,
● r4_denom,
● r8_reliten,
● s1_nummen,
● s2_numwomen,
● s9_premarsx, and
● d1_age.
For each variable, decide which level of measurement it represents and write a sentence or two
indicating why you think it is that level. Keep in mind that we’re only considering what SPSS calls
the valid responses. The missing responses represent missing data (e.g., don’t know or no answer
responses).


[1] Stanley Smith Stevens, 1946, “On the Theory of Scales of Measurement,” Science XXXXXXXXXX),
pp XXXXXXXXXX.
[2] You might wonder why we didn’t use an example from the GSS. There isn’t one. They don’t
occur in social science research very often. There are examples from the field of business. Think
about profit for businesses over a fiscal year. There is no absolute zero. Profit could be positive or
negative.


STAT2S: Exercise Using SPSS to Explore Measures of Central Tendency and Dispersion
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXX General Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses
FREQUENCIES in SPSS to explore measures of central tendency and dispersion. A good reference on
using SPSS is SPSS for Windows Version XXXXXXXXXX A Basic Tutorial by Linda Fiddler, John Korey, Edward
Nelson (Editor), and Elizabeth Nelson. The online version of the book is on the Social Science Research
and Instructional Council's Website. You have permission to use this exercise and to revise it to fit your
needs. Please send a copy of any revision to the author. Included with this exercise (as separate files) are
more detailed notes to the instructors, the SPSS syntax necessary to carry out the exercise (SPSS syntax
file), and the SPSS output for the exercise (SPSS output file). Please contact the author for additional
information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; docx format).
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; docx format).
Goals of Exercise
The goal of this exercise is to explore measures of central tendency (mode, median, and mean) and
dispersion (range, interquartile range, standard deviation, and variance). The exercise also gives you
practice in using FREQUENCIES in SPSS.
Part I – Measures of Central Tendency
Data analysis always starts with describing variables one-at-a-time. Sometimes this is referred to as
univariate (one-variable) analysis. Central tendency refers to the center of the distribution.
There are three commonly used measures of central tendency – the mode, median, and mean of a
distribution. The mode is the most common value or values in a distribution. [1] The median is the middle
value of a distribution. [2] The mean is the sum of all the values divided by the number of values.
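These three definitions can be checked by hand on a small list of numbers. The sketch below uses Python's standard `statistics` module and a short hypothetical list (not the GSS data) so that each result is easy to verify.

```python
import statistics

# Hypothetical number-of-siblings answers; NOT the GSS data.
siblings = [0, 1, 1, 2, 2, 2, 3, 4, 7]

# Mode: the most common value or values (multimode returns a list).
modes = statistics.multimode(siblings)

# Median: the middle value once the values are sorted.
median = statistics.median(siblings)

# Mean: the sum of all values divided by the number of values.
mean = sum(siblings) / len(siblings)

print(modes, median, round(mean, 2))
```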
Run FREQUENCIES in SPSS for the variable d9_sibs. (See Chapter 4, Frequencies in the online SPSS
book mentioned on page XXXXXXXXXX.) Once you have selected this variable click on the “Statistics” button and
check the boxes for mode, median, and mean. Then click on “Continue” and click on the “Charts” button.
Select “Histogram” and check the box for “Show normal curve on histograms.” Then click on “Continue.”
That will take you back to the screen where you selected the variable. Click on “OK” and SPSS will open
the Output window and display the results that you requested.
Your output will display the frequency distribution for d9_sibs and a box showing the mode, median, and
mean with the following values displayed.
● Mode = 2 meaning that two brothers and sisters was the most common answer (XXXXXXXXXX%) from the
2,531 respondents who answered this question. However, not far behind are those with one sibling
(18.6%) and those with three siblings (XXXXXXXXXX%). So while technically two siblings is the mode, what
you really found is that the most common values are one, two, and three siblings. Another part of
your output is the histogram which is a chart or graph of the frequency distribution. The histogram
clearly shows that one, two, and three are the most common values (i.e., the highest bars in the
histogram). So we would want to report that these three categories are the most common
responses.
● Median = 3 which means that three siblings is the middle category in this distribution. The middle
category is the category that contains the 50th percentile which is the value that divides the
distribution into two equal parts. In other words, it’s the value that has 50% of the cases above it
and 50% of the cases below it. The cumulative percent column of the frequency distribution tells
you that 41.4% of the cases have two or fewer siblings and that 59.3% of the cases have three or
fewer siblings. So the middle case (i.e., the 50th percentile) falls somewhere in the category of
three siblings. That is the median category.
● Mean = XXXXXXXXXX which is the sum of all the values in the distribution divided by the number of
responses. If you were to sum all these values that sum would be 9, XXXXXXXXXX. Dividing that by the
number of responses or 2,531 will give you the mean of 3.74.
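The way the median category is read off the cumulative percent column can be sketched as a small Python function. Only the two cumulative percents quoted above (41.4% and 59.3%) are used. The function name `median_category` is made up for this illustration.

```python
def median_category(cumulative):
    """Return the first value whose cumulative percent reaches 50.

    `cumulative` is a list of (value, cumulative percent) pairs sorted
    from the lowest value to the highest.
    """
    for value, cum_pct in cumulative:
        if cum_pct >= 50:
            return value
    raise ValueError("cumulative percents never reach 50")

# The two rows quoted in the text: two or fewer siblings, three or fewer.
cumulative = [(2, 41.4), (3, 59.3)]

print(median_category(cumulative))
```

Because 41.4% is below 50 and 59.3% is above it, the 50th percentile falls in the category of three siblings.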
Part II – Deciding Which Measure of Central Tendency to Use
The first thing to consider is the level of measurement (nominal, ordinal, interval, ratio) of your variable (see
Exercise STAT1S).
● If the variable is nominal, you have only one choice. You must use the mode.
● If the variable is ordinal, you could use the mode or the median. You should report both measures
of central tendency since they tell you different things about the distribution. The mode tells you
the most common value or values while the median tells you where the middle of the distribution
lies.
● If the variable is interval or ratio, you could use the mode or the median or the mean. Now it gets a
little more complicated. There are several things to consider.
○ How skewed is your distribution? [3] Go back and look at the histogram for d9_sibs. Notice
that there is a long tail to the right of the distribution. Most of the values are at the lower
level – one, two, and three siblings. But there are quite a few respondents who report
having four or more siblings and about 5% said they have ten or more siblings. That’s
what we call a positively skewed distribution where there is a long tail towards the right or
the positive direction. Now look at the median and mean. The mean XXXXXXXXXX is larger than
the median XXXXXXXXXX. The respondents with lots of siblings pull the mean up. That’s what
happens in a skewed distribution. The mean is pulled in the direction of the skew. The
opposite would happen in a negatively skewed distribution. The long tail would be towards
the left and the mean would be lower than the median. In a heavily skewed distribution the
mean is distorted and pulled considerably in the direction of the skew. So consider
reporting only the median in a heavily skewed distribution. That’s why you almost always
see median income reported and not mean income. Imagine what would happen if your
sample happened to include Bill Gates. The income distribution would have this very, very
large value which would pull the mean up but barely affect the median.
○ Is there more than one clearly defined peak in your distribution? The number of siblings
has one clearly defined peak – one, two and three siblings. But what if there is more than
one clearly defined peak? For example, consider a hypothetical distribution of XXXXXXXXXX cases
in which there are XXXXXXXXXX cases with a value of two and fifty cases with a value of XXXXXXXXXX. The
median and mean would be five but there are really two centers of this distribution – two
and eight. The median and the mean aren’t telling the correct story about the center.
You’re better off reporting the two clearly defined peaks of this distribution and not
reporting the median and mean.
○ If your distribution is normal in appearance then the mode, median, and mean will all be
about the same. A normal distribution is a perfectly symmetrical distribution with a single
peak in the center. No empirical distribution is perfectly normal but distributions often are
approximately normal. Here we would report all three measures of central tendency. Go
back to your SPSS output and look at the histogram for d9_sibs. When you told SPSS to
give you the histogram you checked the box that said “Show normal curve on histograms.”
SPSS then superimposed the normal curve on the histogram. The normal curve doesn’t fit
the histogram perfectly, particularly at the lower end, but the histogram does roughly
follow the normal curve at the upper end.
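Two of the points above can be tried out on small made-up data sets. The sketch below is an illustration in Python and every number in it is hypothetical. The first part adds one very large income to a short list and compares the mean and median before and after. The second part builds a distribution with two peaks, assuming fifty cases at two and fifty cases at eight, and shows that the mean and median both land on a value that no case actually has.

```python
import statistics

# Part 1: one extreme value pulls the mean far more than the median.
incomes = [30_000, 40_000, 50_000, 60_000, 70_000]  # hypothetical incomes
with_outlier = incomes + [1_000_000_000]            # one hypothetical huge income

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)
print(round(mean_after), median_after)

# Part 2: two clearly defined peaks (assumed 50 cases each).
two_peaks = [2] * 50 + [8] * 50

peaks = statistics.multimode(two_peaks)      # the two most common values
center_mean = statistics.mean(two_peaks)     # falls between the peaks
center_median = statistics.median(two_peaks) # also falls between the peaks

print(peaks, center_mean, center_median)
```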
Run FREQUENCIES for the following variables. Once you have selected the variables click on the
“Statistics” button and check the boxes for mode, median, and mean. Then click on “Continue” and click
on the “Charts” button. Select “Histogram” and check the box for “Show normal curve on histograms.”
Then click on “Continue.” That will take you back to the screen where you selected the variables. Click on
“OK” and SPSS will open the Output window and display the results of what you requested. For each
variable write a sentence or two indicating which measure(s) of central tendency would be appropriate to
use to describe the center of the distribution and what the values of those statistics mean.
● hap2_happy
● p1_partyid
● r8_reliten
● s1_nummen
● s2_numwomen
● d1_age
Part III – Measures of Dispersion or Variation
Dispersion or variation refers to the degree that values in a distribution are spread out or dispersed. The
measures of dispersion that we’re going to discuss are appropriate for interval and ratio level variables (see
Exercise STAT1S). [4] We’re going to discuss four such measures – the range, the inter-quartile range, the
variance, and the standard deviation.
The range is the difference between the highest and the lowest values in the distribution. Run
FREQUENCIES for d1_age and compute the range by looking at the frequency distribution. You can also
ask SPSS to compute it for you. Click on “Statistics” and then click on “Range.” You should get XXXXXXXXXX which
is 89 – XXXXXXXXXX. The range is not a very stable measure since it depends on the two most extreme values – the
highest and lowest values. These are the values most likely to change from sample to sample.
A more stable measure of dispersion is the interquartile range which is the difference between the third
quartile (Q3) and the first quartile (Q1). The third quartile is the same thing as the seventy-fifth percentile
which is the value that has 25% of the cases above it and 75% of the cases below it. The first quartile is
the same as the twenty-fifth percentile which is the value that has 75% of the cases above it and 25% of
the cases below it. SPSS will calculate Q3 and Q1 for you. Click on the “Statistics” button and then click
on “Quartiles” in the “Percentiles” box in the upper left. Once you know Q3 and Q1 you can calculate the
interquartile range by subtracting Q1 from Q3. Since it’s not based on the most extreme values it will be
more stable from sample to sample. Go back to SPSS and calculate Q3 and Q1 for d1_age and then
calculate the interquartile range. Q3 will equal XXXXXXXXXX and Q1 will equal XXXXXXXXXX and the interquartile range will equal
60 – XXXXXXXXXX or 27.
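The range and the interquartile range can also be sketched in Python on a small hypothetical list of ages (not the GSS data). Statistical packages use slightly different conventions for locating quartiles, so the quartiles from `statistics.quantiles` should be treated as an approximation of the idea and may not match SPSS output exactly.

```python
import statistics

# Hypothetical ages for illustration only; NOT the GSS data.
ages = [19, 23, 27, 31, 35, 42, 48, 55, 61, 66, 72, 85]

# Range: highest value minus lowest value.
value_range = max(ages) - min(ages)

# Quartiles: n=4 returns the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = statistics.quantiles(ages, n=4)

# Interquartile range: Q3 minus Q1, so the extremes play no part.
iqr = q3 - q1

print(value_range, q1, q3, iqr)
```

Replacing the oldest age with a much larger number would change the range but leave the interquartile range as it is, which is why the latter is the more stable measure.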
The variance is the sum of the squared deviations from the mean divided by the number of cases minus 1
and the standard deviation is just the square root of the variance. Your instructor may want to go into more
detail on how to calculate the variance by hand. SPSS will also calculate it for you. Click on the “Statistics”
button and then click on “Variance” and on “Standard deviation.” The variance should equal XXXXXXXXXX and the
standard deviation will equal XXXXXXXXXX.
The variance and the standard deviation can never be negative. A value of 0 means that there is no
variation or dispersion at all in the distribution. All the values are the same. The more variation there is, the
larger the variance and standard deviation.
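The definition of the variance given above can be followed step by step in Python. The values are hypothetical. The hand calculation divides by the number of cases minus 1, which is also what `statistics.variance` does, so the two results agree.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

n = len(values)
mean = sum(values) / n

# Squared deviation of each value from the mean.
squared_devs = [(x - mean) ** 2 for x in values]

# Variance: sum of squared deviations divided by (n - 1).
variance = sum(squared_devs) / (n - 1)

# Standard deviation: square root of the variance.
sd = math.sqrt(variance)

print(mean, sum(squared_devs), round(variance, 3), round(sd, 3))

# If every value is the same there is no dispersion at all.
no_spread = statistics.variance([5, 5, 5, 5])
print(no_spread)
```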
So what do the variance XXXXXXXXXX and the standard deviation XXXXXXXXXX of the age distribution mean?
That’s hard to answer because you don’t have anything to compare it to. But if you knew the standard
deviation for both men and women you would be able to determine whether men or women have more
variation. Instead of comparing the standard deviations for men and women you would compute a statistic
called the Coefficient of Relative Variation (CRV). CRV is equal to the standard deviation divided by the
mean of the distribution. A CRV of 2 means that the standard deviation is twice the mean and a CRV of
0.5 means that the standard deviation is one-half of the mean. You would compare the CRVs for men and
women to see whether men or women have more variation relative to their respective means.
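Here is a minimal Python sketch of the Coefficient of Relative Variation as defined above (standard deviation divided by the mean). The two groups are hypothetical and were chosen so that they have the same standard deviation but different means. The function name `crv` is made up for this example.

```python
import statistics

def crv(values):
    """Coefficient of Relative Variation: standard deviation / mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Two hypothetical groups with the same standard deviation (10).
group_a = [10, 20, 30]     # mean 20
group_b = [100, 110, 120]  # mean 110

print(statistics.stdev(group_a), statistics.stdev(group_b))
print(round(crv(group_a), 3), round(crv(group_b), 3))
```

Although the standard deviations are identical, group_a varies much more relative to its mean, which is the comparison the CRV is meant to capture.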
You might also have wondered why you need both the variance and the standard deviation when the
standard deviation is just the square root of the variance. You’ll just have to take my word for it that you will
need both as you go further in statistics.
Run FREQUENCIES for the following variables. Once you have selected the variables click on the
“Statistics” button and check the boxes for quartiles, range, variance, standard deviation, and mean. Then
click on “Continue.” That will take you back to the screen where you selected the variables. Click on “OK”
and SPSS will open the Output window and display the results of what you requested. For each variable
write a sentence or two indicating what the values of these statistics are for each of the variables and what
the values of those statistics mean. Compare the relative variation for the number of male sex partners
since the age of XXXXXXXXXX (s1_nummen) and the number of female sex partners (s2_numwomen) by comparing
the CRVs for each variable.
● s1_nummen
● s2_numwomen
● d9_sibs
[1] Frequency distributions can be grouped or ungrouped. Think of age. We could have a distribution that
lists all the ages in years of the respondents to our survey. One of the variables (d1_age) in our data set
does this. But we could also divide age into a series of categories such as under 30, XXXXXXXXXX to 39, XXXXXXXXXX to 49, 50
to 59, XXXXXXXXXX to 69, and XXXXXXXXXX and older. In a grouped frequency distribution the mode would be the most common
category or categories.
[2] In a grouped frequency distribution the median would be the category that contains the middle value.
[3] See Exercise STAT3S for a more thorough discussion of skewness.
[4] The Index of Qualitative Variation can be used to measure variation for nominal variables.
STAT3S: Exercise Using SPSS to Explore Measures of Skewness and Kurtosis
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXX General Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses
FREQUENCIES in SPSS to explore measures of skewness and kurtosis. A good reference on using
SPSS is SPSS for Windows Version XXXXXXXXXX A Basic Tutorial by Linda Fiddler, John Korey, Edward Nelson
(Editor), and Elizabeth Nelson. The online version of the book is on the Social Science Research and
Instructional Council's Website. You have permission to use this exercise and to revise it to fit your
needs. Please send a copy of any revision to the author. Included with this exercise (as separate files) are
more detailed notes to the instructors, the SPSS syntax necessary to carry out the exercise (SPSS syntax
file), and the SPSS output for the exercise (SPSS output file). Please contact the author for additional
information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; docx format).
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; docx format)
Goals of Exercise
The goal of this exercise is to explore measures of skewness and kurtosis. The exercise also gives you
practice in using FREQUENCIES in SPSS.
Part I – Measures of Skewness
A normal distribution is a unimodal (i.e., single peak) distribution that is perfectly symmetrical. In a normal
distribution the mean, median, and mode are all equal. Here’s a graph showing what a normal distribution
looks like.

The horizontal axis is marked off in standard scores, where a standard score tells us how many
standard deviations a value is from the mean of the normal distribution. So a standard score of XXXXXXXXXX is one
standard deviation above the mean and a standard score of XXXXXXXXXX is one standard deviation below the mean.
The percents tell us the percent of cases that you would expect between the mean and a particular
standard score if the distribution were perfectly normal. You would expect to find approximately 34% of the
cases between the mean and a standard score of XXXXXXXXXX or XXXXXXXXXX. In a normal distribution, the mean, median, and
mode are all equal and are at the center of the distribution. So the mean always has a standard score of
zero.
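If you want to check the 34% figure yourself, here is a minimal Python sketch. It is not part of the SPSS exercise. The function names and the numbers in the example (a mean of 50 and a standard deviation of 15) are made up for illustration.

```python
import math

def standard_score(value, mean, sd):
    """How many standard deviations a value lies from the mean."""
    return (value - mean) / sd

def share_between_mean_and(z):
    """Proportion of a normal distribution between the mean and standard score z."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

# Hypothetical example: a value of 65 when the mean is 50 and the SD is 15
z = standard_score(65, 50, 15)
print(z)                                    # 1.0
print(round(share_between_mean_and(z), 4))  # 0.3413
```

A value one standard deviation above the mean has a standard score of 1, and about 34% of the cases in a normal distribution fall between the mean and that score.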
Skewness measures the deviation of a particular distribution from this symmetrical pattern. In a skewed
distribution one side has a longer or fatter tail than the other side. If the longer tail is to the left, then it is
called a negatively skewed distribution. If the longer tail is to the right, then it is called a positively skewed
distribution. One way to remember this is to recall that any value to the left of zero is negative and any
value to the right of zero is positive. Here are graphs of positively and negatively skewed distributions
compared to a normal distribution.
The best way to determine the skewness of a distribution is to tell SPSS to give you a histogram along with
the mean and median. SPSS will also compute a measure of skewness. Run FREQUENCIES in SPSS for
the variables d1_age and d9_sibs. (See Frequencies in Chapter 4 of the online SPSS book mentioned on
page XXXXXXXXXX.) Click on the “Charts” button and select “Histogram” and “Show normal curve on histogram.” Then
click on “Continue.” Now click on “Statistics” and select mean, median, skewness and kurtosis. Then click
on “Continue” and on “OK.” We’ll talk about kurtosis in a little bit.
Notice that the mean is larger than the median for both variables. This suggests that both distributions are
positively skewed. But also notice that, relative to the median, the mean for d9_sibs exceeds the median by
more than the mean for d1_age does. This suggests that the distribution for d9_sibs is the more
skewed of the two. Look at the histograms and you’ll see the same thing. Both variables are
positively skewed but d9_sibs is the more skewed variable. Now look at the skewness values: XXXXXXXXXX for
d9_sibs and XXXXXXXXXX for d1_age. The larger the absolute skewness value, the more skewed the distribution. Positive
skewness values indicate a positive skew and negative values indicate a negative skew. There are various
rules of thumb suggested for what constitutes a lot of skew but for our purposes we’ll just say that the
larger the absolute value, the greater the skewness, and the sign of the value indicates the direction of the skew.
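To see how the mean, the median, and a skewness value relate to each other, here is a small Python sketch. It uses the adjusted sample skewness formula, which we assume is the one SPSS reports; the function name and the data are hypothetical and are not taken from the GSS.

```python
import statistics

def sample_skewness(values):
    """Adjusted sample skewness (assumed here to match the SPSS statistic)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, with n - 1 in the denominator
    cubed = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

# Hypothetical sibling counts with a long tail to the right
sibs = [0, 1, 1, 2, 2, 2, 3, 3, 4, 12]
print(statistics.mean(sibs), statistics.median(sibs))  # 3 2.0
print(sample_skewness(sibs) > 0)                        # True, a positive skew
```

The one large value pulls the mean above the median and gives a positive skewness value, which is the same pattern described above for d9_sibs.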
Run FREQUENCIES for the following variables. Tell SPSS to give you the histogram and to show the
normal curve on the histogram. Also ask for the mean, median, and skewness. Write a paragraph for each
variable explaining what these statistics tell you about the skewness of the variables.
● d20_hrsrelax
● tv1_tvhours
Part II – Measures of Kurtosis
Kurtosis refers to the flatness or peakedness of a distribution relative to that of a normal distribution.
Distributions that are flatter than a normal distribution are called platykurtic and distributions that are more
peaked are called leptokurtic.
SPSS will compute a kurtosis measure. Negative values indicate a platykurtic distribution and positive
values indicate a leptokurtic distribution. The larger the absolute value of the kurtosis measure, the more
peaked or flat the distribution is.
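Here is a short Python sketch of a kurtosis measure that is zero for a normal distribution, negative for flatter distributions, and positive for more peaked ones. We assume it is the same adjusted sample formula that SPSS uses; the function name and both data sets are hypothetical.

```python
import statistics

def sample_excess_kurtosis(values):
    """Adjusted sample kurtosis, centered so a normal distribution scores about 0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, with n - 1 in the denominator
    fourth = sum(((x - mean) / sd) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

flat = list(range(1, 11))                 # every value occurs once
peaked = [0, 5, 5, 5, 5, 5, 5, 5, 5, 10]  # most cases piled up in the middle
print(sample_excess_kurtosis(flat) < 0)    # True, platykurtic
print(sample_excess_kurtosis(peaked) > 0)  # True, leptokurtic
```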
Look back at the output for d1_age and d9_sibs. For d1_age the kurtosis value was XXXXXXXXXX, indicating a
flatter distribution, and for d9_sibs kurtosis was XXXXXXXXXX, indicating a more peaked distribution. To see this
visually look at your histograms.
Run FREQUENCIES for the following variables. Tell SPSS to give you the histogram and to show the
normal curve on the histogram. Also ask for kurtosis. Write a paragraph for each variable explaining what
these statistics tell you about the kurtosis of the variables.
● d22_maeduc
● d24_paeduc
● s6_sexfreq
STAT4S: Exercise Using SPSS to Explore Graphs and Charts
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXXGeneral Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses
FREQUENCIES and EXPLORE in SPSS to explore different ways of creating graphs and charts. A good
reference on using SPSS is SPSS for Windows Version XXXXXXXXXXA Basic Tutorial by Linda Fiddler, John Korey,
Edward Nelson (Editor), and Elizabeth Nelson. The online version of the book is on the Social Science
Research and Instructional Council's Website. You have permission to use this exercise and to revise
it to fit your needs. Please send a copy of any revision to the author. Included with this exercise (as
separate files) are more detailed notes to the instructors, the SPSS syntax necessary to carry out the
exercise (SPSS syntax file), and the SPSS output for the exercise (SPSS output file). Please contact the
author for additional information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; docx format).
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; docx format).
Goals of Exercise
The goal of this exercise is to explore different ways of graphing frequency distributions. The exercise also
gives you practice in using FREQUENCIES and EXPLORE in SPSS.
Part I – Pie Charts
A pie chart is a chart that shows the frequencies or percents of a variable with a small number of
categories. It is presented as a circle divided into a series of slices. The area of each slice is proportional
to the number of cases or the percent of cases in each category. It is normally used with nominal or ordinal
variables (see Exercise STAT1S) but can be used with interval or ratio variables which have a small
number of categories.
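The idea that each slice is proportional to the number of cases can be sketched in a few lines of Python. The function name and the answers below are hypothetical; they only illustrate the arithmetic behind a pie chart.

```python
from collections import Counter

def pie_slices(responses):
    """Return {category: (count, percent, angle in degrees)} for a pie chart."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total, 360 * n / total) for cat, n in counts.items()}

# Hypothetical party identification answers from ten respondents
answers = ["Democrat"] * 5 + ["Independent"] * 3 + ["Republican"] * 2
for category, (count, percent, angle) in pie_slices(answers).items():
    print(category, count, percent, angle)
```

A category with half of the cases gets half of the circle (180 degrees). With dozens of categories every slice becomes a sliver, which is why pie charts work best with a small number of categories.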
Run FREQUENCIES in SPSS for the variables p1_partyid, p4_polviews, and d12_childs. (See Chapter 4,
Frequencies in the online SPSS book mentioned on page XXXXXXXXXX.) Click on “Charts” and select “Pie charts.”
Notice that there is an option called “Chart Values” that allows you to select whether you want your chart to
display “Percentages” or “Frequencies.” Usually you want to select “Percentages.”
Once SPSS has displayed the pie chart in the output window, you can double click anywhere inside the pie
chart to open the “Chart Editor.” Once you have opened the “Chart Editor” right-click anywhere inside one
of the pie slices in the “Chart Editor” and you will see a list of different ways you can edit your pie chart.
Click on “Show Data Labels” and then click on the “Data Value Labels” tab. If “Percent” is not listed in the
“Displayed” box, move it to that box and click on “Apply” and then “Close.” If it is listed in the “Displayed”
box, just click on “Close.” This will close the “Properties” box. Click anywhere outside the “Chart Editor” and
you will see your edited pie chart. There are lots of other ways you could edit your chart. Explore some of
them if you are curious.
If you are wondering why you shouldn’t use pie charts for variables with a large number of categories,
create a pie chart for d1_age and you’ll see why.
Part II – Bar Charts
A bar chart is a chart that shows the frequencies or percents of a variable and is presented as a series of
vertical bars that do not touch each other. The height of each bar is proportional to the number of cases or
the percent of cases in each category. It is normally used with nominal or ordinal variables.
Run FREQUENCIES for the variables p1_partyid and p4_polviews. This time click on “Charts” and select
“Bar charts.” Select “Percentages” to display percents in the chart.
Part III – Histograms
A histogram is a graph that shows the frequencies or percents of a variable with a larger number of
categories. It is presented as a series of vertical bars that touch each other. The height of each bar is
proportional to the number of cases or the percent of cases in each category. It is used with interval or
ratio variables.
Run FREQUENCIES for the variables d1_age, d4_educ, and d12_childs [1] . Click on “Charts” and select
“Histogram.”
Look at the histogram for d1_age. Let’s say you want to redefine the width of each vertical bar.
Double-click anywhere inside the histogram which will open the “Chart Editor.” Now right click anywhere
inside the rectangles in the “Chart Editor” and click on “Properties Window.” This will open the “Properties”
box. Click on the tab for “Binning.” Click on “Custom” and “Interval width” under “X Axis.” Enter XXXXXXXXXX in the
“Interval width” box indicating that you want each vertical bar to represent an interval width of ten years.
Where do we want the first interval to start? We could let SPSS decide but let’s make the decision
ourselves. Click on “Custom value for anchor” and enter XXXXXXXXXX in the box. Click on “Apply” and look at your
histogram. Does it look how you want it to look? Is there any further editing you want to do? If you are
satisfied, click on “Close” to close the “Properties” box. Click anywhere outside the “Chart Editor” box and
you will see your edited histogram.
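If you are curious how an interval width and an anchor determine the bars, here is a minimal Python sketch. The function name and the ages are hypothetical, and the anchor of 10 is only a placeholder chosen for illustration; it is not necessarily the value used in this exercise.

```python
from collections import Counter

def bin_counts(values, width, anchor):
    """Count values in intervals that start at anchor, anchor + width, and so on."""
    counts = Counter()
    for v in values:
        start = anchor + ((v - anchor) // width) * width  # lower edge of v's interval
        counts[start] += 1
    return dict(sorted(counts.items()))

# Hypothetical ages, interval width of ten years, placeholder anchor of 10
ages = [18, 19, 23, 25, 31, 37, 38, 44, 52, 59, 61, 89]
print(bin_counts(ages, width=10, anchor=10))
# {10: 2, 20: 2, 30: 3, 40: 1, 50: 2, 60: 1, 80: 1}
```

Changing the anchor shifts where every interval begins, so the same data can produce a somewhat different looking histogram.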
Part IV – Box Plots
A box plot is a graph that displays visually a number of characteristics of a frequency distribution:
● the third quartile (Q3),
● the first quartile (Q1),
● the interquartile range (IQR),
● the median,
● the range,
● outliers, and
● extreme values.
Run EXPLORE for d1_age, d4_educ, and d12_childs. (See Chapter 4, Explore in the online SPSS book.)
You can use the default settings for EXPLORE so all you have to do is click “OK” after you have selected
your variables.
The first thing you will see is various descriptive statistics for each variable. You’re probably familiar with
most of these. Then you’ll see the stem-and-leaf display which we’re not going to discuss. The last thing
you’ll see is the box plot. Let’s look at the boxplot for d1_age. The box is bounded at the top by the third
quartile (Q3) and at the bottom by the first quartile (Q1). The height of the box (Q3 – Q1) is the interquartile
range. The horizontal line inside the box represents the median. There are two vertical lines coming out of
the box. One extends upward to the maximum value and the other extends downward to the minimum value. The
difference between the maximum and minimum values is the range.
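The numbers behind a box plot can be sketched in Python as follows. The function name and the data are hypothetical. Note also that there are several rules for computing quartiles, so the values SPSS reports may differ slightly from the ones this sketch produces.

```python
import statistics

def box_summary(values):
    """Quartiles, interquartile range, and range for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "q1": q1,            # bottom of the box
        "median": median,    # line inside the box
        "q3": q3,            # top of the box
        "iqr": q3 - q1,      # height of the box
        "range": max(values) - min(values),
    }

print(box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
# {'q1': 3.0, 'median': 5.0, 'q3': 7.0, 'iqr': 4.0, 'range': 8}
```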
You can also learn about skewness from the box plot. In a non-skewed distribution, the median will be in
the middle of the box halfway between the third and first quartiles. In a skewed distribution the median will
be either higher or lower in the box. Notice that for d1_age and d4_educ the median is in the middle of the
box suggesting that these distributions are not very skewed but for d12_childs the median is in the upper
part of the box suggesting that this is a positively skewed distribution.
Now look at the box plots for d4_educ and d12_childs. Here you’ll see some circles and numbers. The
circles represent outliers which are values that lie between XXXXXXXXXX and XXXXXXXXXX box lengths above the third quartile
or below the first quartile. A box length is just another name for the interquartile range since the height of
the box is the interquartile range. The numbers are the case numbers in SPSS. Extreme values are
values that are more than XXXXXXXXXX box lengths from the first or third quartiles. There aren’t any extreme values
in these distributions.
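Here is a rough Python sketch of how values can be sorted into outliers and extreme values by their distance from the box. The function name and the data are hypothetical. The cutoffs of 1.5 and 3 box lengths are the conventional box plot values and are an assumption here, as is the quartile rule, so treat this as an illustration rather than a copy of what SPSS does.

```python
import statistics

def classify_points(values, inner=1.5, outer=3.0):
    """Split values into outliers and extreme values (cutoffs are assumptions)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box = q3 - q1  # one box length, i.e. the interquartile range
    outliers, extremes = [], []
    for v in values:
        distance = max(q1 - v, v - q3)  # how far v lies beyond the box (negative if inside)
        if distance > outer * box:
            extremes.append(v)
        elif distance > inner * box:
            outliers.append(v)
    return outliers, extremes

print(classify_points([1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 25]))  # ([18], [25])
```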
Sometimes you want to compare box plots for two or more groups of respondents. Let’s look at the box
plot for d1_age and compare the box plots for men and women. Run EXPLORE for d1_age but this time
put d5_sex in the “Factor List” box. Your output should now show the box plots for men and women
side-by-side.
Part V – Conclusions
We have talked about four different types of graphs – pie charts, bar charts, histograms, and box plots.
There are other types of graphs you could use but these are the four most commonly used graphs. There
are other ways to construct graphs in SPSS that your instructor might want to talk about. You can click on
“Graphs” in the menu bar at the top of the SPSS screen and then on “Chart Builder” but we aren’t going to
go into that in this exercise.
[1] There is a small problem with d12_childs. One of the categories is “eight or more” children. That means
we don’t know what these values actually are. They could be 8 or XXXXXXXXXX or XXXXXXXXXX or XXXXXXXXXX or something else. Since
there are so few cases in this category we’re going to ignore this problem.
STAT5S: Exercise Using SPSS to Explore Hypothesis Testing – One-Sample t Test
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXXGeneral Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses COMPARE
MEANS (one-sample t test) and SELECT CASES in SPSS to explore hypothesis testing and the
one-sample t test. A good reference on using SPSS is SPSS for Windows Version XXXXXXXXXXA Basic Tutorial by
Linda Fiddler, John Korey, Edward Nelson (Editor), and Elizabeth Nelson. The online version of the book is
on the Social Science Research and Instructional Center's Website. You have permission to use this
exercise and to revise it to fit your needs. Please send a copy of any revision to the author. Included with
this exercise (as separate files) are more detailed notes to the instructors, the SPSS syntax necessary to
carry out the exercise (SPSS syntax file), and the SPSS output for the exercise (SPSS output file). Please
contact the author for additional information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; docx format).
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; docx format).
Goals of Exercise
The goal of this exercise is to explore hypothesis testing and the one-sample t test. The exercise also gives
you practice in using COMPARE MEANS (one-sample t test) and SELECT CASES in SPSS.
Part I – Simple Random Sampling
Populations are the complete set of objects that we want to study. For example, a population might be all
the individuals that live in the United States at a particular point in time. The U.S. does a complete
enumeration of all individuals living in the United States every ten years (i.e., each year ending in a zero).
We call this a census. Another example of a population is all the students in a particular school or all
college students in your state. Populations are often large and it’s too costly and time consuming to carry
out a complete enumeration. So what we do is to select a sample from the population where a sample is a
subset of the population and then use the sample data to make an inference about the population.
A statistic describes a characteristic of a sample while a parameter describes a characteristic of a
population. The mean age of a sample is a statistic while the mean age of the population is a parameter.
We use statistics to make inferences about parameters. In other words, we use the mean age of the
sample to make an inference about the mean age of the population. Notice that the mean age of the
sample (our statistic) is known while the mean age of the population (our parameter) is usually unknown.
There are many different ways to select samples. Probability samples are samples in which every object in
the population has a known, non-zero, chance of being in the sample (i.e., the probability of selection).
This isn’t the case for non-probability samples. An example of a non-probability sample is an instant poll
which you hear about on radio and television shows. A show might invite you to go to a website and
answer a question such as whether you favor or oppose same-sex marriage. This is a purely volunteer
sample and we have no idea of the probability of selection.
There are many ways of selecting a probability sample but the most basic type of probability sample is a
simple random sample in which everyone in the population has the same chance of being selected for the
sample. SPSS will select a simple random sample for you. We’re going to use the General Social Survey
(GSS) for this exercise. The GSS is a national probability sample of adults in the United States conducted
by the National Opinion Research Center (NORC). The GSS started in XXXXXXXXXX and has been an annual or
biennial survey ever since. For this exercise we’re going to use a subset of the XXXXXXXXXX GSS. Your instructor
will tell you how to access this data set which is called gss14_subset_for_classes_STATISTICS.sav. It’s a
large sample of about 2,500 individuals. To illustrate simple random sampling, we’re going to select a
simple random sample of 30% of all the individuals in the GSS. [1]
Start by getting a frequency distribution for the variable d4_educ which is the last year of school completed
by the respondent. (See Chapter 4, Frequencies in the online SPSS book mentioned on page XXXXXXXXXX.) You’ll
see that there are a total of 2,538 cases. One of those cases said he or she didn’t know. That means
there are 2,537 valid cases that answered the question.
Now click on Data in the menu bar at the top of the screen. (See Chapter 3, Select Cases in the online
SPSS book.) This will open a drop-down box. Click on SELECT CASES. Then click on “Random sample
of cases” and then on “Sample” in the box below. One of the options will already be selected and will say
“Approximately [box] % of all cases.” Fill in XXXXXXXXXX in the box indicating that you want to select a simple random
sample of 30% of all the cases in the GSS. Click on “Continue” and then on “OK.” Now run
FREQUENCIES again for the variable, d4_educ. Your sample will be smaller than before. This is a
random sample of all the cases in the GSS.
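To get a feel for what “approximately 30%” means, here is a small Python sketch that gives every case the same 30% chance of being selected. The function name and case numbers are hypothetical, and how SPSS draws its sample internally is an assumption we are not checking here.

```python
import random

def approximate_sample(cases, percent, seed=None):
    """Keep each case independently with the given percent chance."""
    rng = random.Random(seed)
    p = percent / 100
    return [case for case in cases if rng.random() < p]

case_ids = list(range(1, 2539))  # 2,538 cases, numbered 1 through 2,538
sample = approximate_sample(case_ids, 30, seed=1)
print(len(sample))  # close to 30% of 2,538 (about 761), but rarely exactly that
```

Because every case has the same chance of selection, the result is a simple random sample. The size varies a little from one draw to the next, which is why the option says “approximately.”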
Part II. Hypothesis Testing – the One-Sample T test
Let’s think about our variable, d4_educ. What do we know about education in the United States? One
thing we know is that the average years of school completed has been increasing over the twentieth and
twenty-first centuries. It used to be that many people stopped after completing high school which would be
12 years of education. Now more go on to college. So we would hypothesize that the mean years of
school completed is now greater than XXXXXXXXXX. How could we test that hypothesis? We need a statistical
procedure to do that. The t test is one of a number of statistical tests that we can use to test such
hypotheses.
Notice how we are going about this. We have a sample of adults in the United States (i.e., the XXXXXXXXXX GSS).
We can calculate the mean years of school completed by all the adults in the sample who answered the
question. But we want to test the hypothesis that the mean years of school completed in the population of
all adults is greater than XXXXXXXXXX. We’re going to use our sample data to test a hypothesis about the
population. [2]
What do we know about sampling? We know that no sample is ever a perfect representation of the
population from which the sample is drawn. This is because every sample contains some amount of
sampling error. Sampling error is inevitable. Another thing we know is that the larger the sample size,
the smaller the sampling error tends to be.
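You can watch sampling error shrink as the sample gets larger with a small simulation. This Python sketch builds a made-up population, draws many samples of two different sizes, and compares how much the sample means bounce around. The population, the sample sizes, and the function name are all hypothetical.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, draws=200, seed=0):
    """Standard deviation of the means of many simple random samples."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.sample(population, sample_size)) for _ in range(draws)]
    return statistics.stdev(means)

# Hypothetical population of 10,000 people with 0 to 20 years of schooling
rng = random.Random(42)
population = [rng.randint(0, 20) for _ in range(10000)]

small = spread_of_sample_means(population, 25)
large = spread_of_sample_means(population, 1000)
print(small > large)  # True: means from larger samples vary less
```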
So the hypothesis we want to test is that the mean years of school completed in the population is greater
than XXXXXXXXXX. We’ll call this our research hypothesis. It’s what we expect to be true. But there is no way to
prove the research hypothesis directly. So we’re going to use a method of indirect proof. We’re going to
set up another hypothesis that says that the research hypothesis is not true and call this the null
hypothesis. [3] In our case, the null hypothesis would be that the mean years of school completed in the
population is equal to XXXXXXXXXX. If we can reject the null hypothesis then we have evidence to support the
research hypothesis. If we can’t reject the null hypothesis then we don’t have any evidence in support of
the research hypothesis. You can see why this is called a method of indirect proof. We can’t prove the
research hypothesis directly but if we can reject the null hypothesis then we have indirect evidence that
supports the research hypothesis.
Here are our two hypotheses.
● research hypothesis – the population mean is greater than 12
● null hypothesis – the population mean is equal to 12
It’s the null hypothesis that we are going to test.
Before we carry out the t test, let’s make sure we are using the full GSS sample and not the 30% simple
random sample. Click on “Data” and on “Select Cases.” Select “All cases” and then click on OK. Now you
are using all the cases.
Now click on “Analyze” in the menu bar which will open a drop-down menu. Click on “Compare Means”
which will open another drop-down menu and click on “One-Sample T Test.” Move the variable, d4_educ,
over to the “Test Variable(s)” box on the right. Below the box on the right you will see a box called “Test
Value.” This is where we enter the value specified in the null hypothesis which in our case is XXXXXXXXXX. All you
have to do now is click on OK.
You should see two output boxes. The first box will have four values in it.
● N is the number of cases for which we have valid information [4] (i.e., the number of respondents
who answered the question). In this problem, N equals 2,537.
● Mean is the mean years of school completed by the respondents in the sample who answered the
question (see STAT2S). In this problem, the sample mean equals XXXXXXXXXX.
● Standard Deviation is a measure of dispersion (see STAT2S). In this problem, the standard
deviation equals XXXXXXXXXX.
● Standard Error of the Mean is an estimate of how much sampling error there is. In this problem,
the standard error equals .061.
The second box will have five values in it.
● t is the value of the t test
● df is the number of degrees of freedom
● Significance (2-tailed) value
● Mean Difference
● 95% Confidence Interval of the Difference which we’re going to discuss in a later exercise
There is a formula for calculating the value of t in the t test. Your instructor may or may not want you to
learn how to calculate the value of t. I’m going to leave it to your instructor to do this. In this problem t
equals XXXXXXXXXX.
Degrees of freedom (df) is the number of values that are free to vary. If the sample mean equals XXXXXXXXXX
then how many values are free to vary? The answer is N – 1 which is 2,537 – 1 or 2, XXXXXXXXXX. See if you can
figure out why it’s 2, XXXXXXXXXX. Your instructor will help you if you are having trouble figuring it out.
The significance value is a probability. It’s the probability of getting a sample result at least this far from the
value in the null hypothesis if the null hypothesis were true. Loosely, it’s the risk you run of being wrong if you
reject the null hypothesis. It’s XXXXXXXXXX which you would think is telling you that there is no chance of being wrong if you
rejected the null hypothesis. But it’s actually a rounded value and it means that the probability is less than
XXXXXXXXXX or less than five in ten thousand. So there is a chance of being wrong but it’s really, really small.
The mean difference is the difference between the sample mean XXXXXXXXXX and the value specified in the null
hypothesis XXXXXXXXXX. So it’s XXXXXXXXXX – XXXXXXXXXX or 1.68. [5] That’s the amount that your sample mean differs from the
value in the null hypothesis. If it’s positive, then your sample mean is larger than the value in the null and if
it’s negative, then your sample mean is smaller than the value in the null.
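Your instructor may show you the formula for t. In case it helps, here is a minimal Python sketch of the usual textbook version: the mean difference divided by the standard error of the mean. The function name and the five values are hypothetical, and the sketch ignores the weighting applied to the GSS data, so it will not reproduce the SPSS output for d4_educ.

```python
import math
import statistics

def one_sample_t(values, test_value):
    """Pieces of a one-sample t test, using the standard textbook formula."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation
    se = sd / math.sqrt(n)          # standard error of the mean
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "se": se,
        "mean_difference": mean - test_value,
        "t": (mean - test_value) / se,
        "df": n - 1,                # degrees of freedom
    }

# Hypothetical years of school for five respondents, tested against 12
result = one_sample_t([10, 12, 14, 16, 18], test_value=12)
print(result["mean_difference"], result["df"], round(result["t"], 3))  # 2 4 1.414
```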
Now all we have to do is figure out how to use the t test to decide whether to reject or not reject the null
hypothesis. Look again at the significance value which is less than XXXXXXXXXX. That tells you that a sample result
this far from the null value would turn up less than five times out of ten thousand if the null hypothesis were true. With
odds like that, of course, we’re going to reject the null hypothesis. A common rule is to reject the null
hypothesis if the significance value is less than XXXXXXXXXX or less than five out of one hundred.
But wait a minute. The SPSS output said this was a two-tailed significance value. What does that mean?
Look back at the research hypothesis which was that the population mean was greater than XXXXXXXXXX. We’re
actually predicting the direction of the difference. We’re predicting that the population mean will be greater
than XXXXXXXXXX. That’s called a one-tailed test and we have to use a one-tailed significance value. It’s easy to get
the one-tailed significance value if we know the two-tailed significance value and the difference is in the
predicted direction. If the two-tailed significance
value is less than XXXXXXXXXX then the one-tailed significance value is half that or XXXXXXXXXX divided by two or XXXXXXXXXX.
We still reject the null hypothesis which means that we have evidence to support our research hypothesis.
We haven’t proven the research hypothesis to be true but we have evidence to support it.
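The step from a two-tailed to a one-tailed significance value can be sketched in Python like this. The function name and the numbers are hypothetical. Halving only applies when the sample mean differs from the null value in the direction you predicted; if it differs in the opposite direction, the one-tailed value is large and you would not reject the null hypothesis.

```python
def one_tailed_p(two_tailed_p, t, predicted_direction="greater"):
    """Convert a two-tailed significance value to a one-tailed one."""
    if predicted_direction == "greater":
        in_predicted_direction = t > 0
    else:
        in_predicted_direction = t < 0
    half = two_tailed_p / 2
    return half if in_predicted_direction else 1 - half

# Hypothetical output: two-tailed significance of .04
print(round(one_tailed_p(0.04, t=2.1), 4))   # 0.02, the difference is in the predicted direction
print(round(one_tailed_p(0.04, t=-2.1), 4))  # 0.98, the difference is in the opposite direction
```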
Part III. Now It’s Your Turn
There is another variable in the XXXXXXXXXX GSS called d18_hrs1 which is the number of hours that the
respondent worked last week if he or she was employed. Many people have suggested that Americans
are working longer hours than they used to. Since the traditional work week is XXXXXXXXXX hours, if it’s true that
we’re working more hours our research hypothesis would be that the mean number of hours worked last
week would be greater than XXXXXXXXXX. Do a one-sample t test to test this hypothesis. For each value in the
output, explain what it means. Then decide whether you should reject or not reject the null hypothesis and
what this tells you about the research hypothesis.
I’ll tell you that you should reject the null hypothesis even though the mean difference was less than one
hour. You might wonder why you reject the null hypothesis when the mean difference is so small. Notice
that we have a large sample (N = 1, XXXXXXXXXX). Let’s see what happens when we have a sample that’s only 10%
of that size. Take a simple random sample of 10% of the total sample. (Look back at Part I to see how to
do this.) Now we have a much smaller sample size. Rerun your t test and see what happens with a
smaller sample. For each value in the output, explain what it means. Then decide whether you should
reject or not reject the null hypothesis and what this tells you about the research hypothesis.
Now you probably won’t be able to reject the null hypothesis. [6] Why? Remember that we said the larger
the sample, the less the sampling error. If there is less sampling error, it’s going to be easier to reject the
null hypothesis. You can see this by looking at the standard error of the mean. It will probably be smaller in
the larger sample and bigger in the smaller sample. So when you have a really large sample don’t get too
excited when you reject the null hypothesis even though you have only a small mean difference.
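Here is a short Python sketch of why the same small mean difference can be significant in a large sample and not in a small one. All of the numbers are hypothetical; they are not the values from the d18_hrs1 output.

```python
import math

def t_for(mean_difference, sd, n):
    """t value for a given mean difference, standard deviation, and sample size."""
    standard_error = sd / math.sqrt(n)
    return mean_difference / standard_error

# Hypothetical: a mean difference of 0.9 hours with a standard deviation of 15
big = t_for(0.9, 15, 1500)   # large sample
small = t_for(0.9, 15, 150)  # sample one tenth the size
print(round(big, 2), round(small, 2))  # 2.32 0.73
```

With a one-tailed test at the .05 level you need a t value above roughly 1.65 in samples of this size, so the large sample leads you to reject the null hypothesis and the small one does not, even though the mean difference is identical.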
[1] The GSS is itself not a simple random sample but rather is an example of a multistage cluster sample.
[2] Characteristics of a sample are called statistics while characteristics of a population are called
parameters.
[3] The null hypothesis is often called the hypothesis of no difference. We’re saying that the population
mean is still equal to XXXXXXXXXX. In other words, nothing has changed. There is no difference.
[4] Missing cases would include those who said they didn’t know or refused to answer the question.
[5] By the way, the value of the mean XXXXXXXXXX is a rounded value so that’s why the mean difference isn’t
exactly 1.68.
[6] Why probably? Because by chance you could get a much higher or lower mean which will produce a
larger t value and could mean that your significance value would be low enough to reject the null
hypothesis.
STAT6S: Exercise Using SPSS to Explore Hypothesis Testing – Independent-Samples
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXXGeneral Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses COMPARE
MEANS (means and independent-samples t test) to explore hypothesis testing. A good reference on using
SPSS is SPSS for Windows Version XXXXXXXXXXA Basic Tutorial by Linda Fiddler, John Korey, Edward Nelson
(Editor), and Elizabeth Nelson. The online version of the book is on the Social Science Research and
Instructional Council's Website. You have permission to use this exercise and to revise it to fit your
needs. Please send a copy of any revision to the author. Included with this exercise (as separate files) are
more detailed notes to the instructors, the SPSS syntax necessary to carry out the exercise (SPSS syntax
file), and the SPSS output for the exercise (SPSS output file). Please contact the author for additional
information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; docx format).
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; docx format).
Goals of Exercise
The goal of this exercise is to explore hypothesis testing and the independent-samples t test. The exercise
also gives you practice in using COMPARE MEANS.
Part I – Computing Means
Populations are the complete set of objects that we want to study. For example, a population might be all
the individuals that live in the United States at a particular point in time. The U.S. does a complete
enumeration of all individuals living in the United States every ten years (i.e., each year ending in a zero).
We call this a census. Another example of a population is all the students in a particular school or all
college students in your state. Populations are often large and it’s too costly and time consuming to carry
out a complete enumeration. So what we do is to select a sample from the population where a sample is a
subset of the population and then use the sample data to make an inference about the population.
A statistic describes a characteristic of a sample while a parameter describes a characteristic of a
population. The mean age of a sample is a statistic while the mean age of the population is a parameter.
We use statistics to make inferences about parameters. In other words, we use the mean age of the
sample to make an inference about the mean age of the population. Notice that the mean age of the
sample (our statistic) is known while the mean age of the population (our parameter) is usually unknown.
There are many different ways to select samples. Probability samples are samples in which every object in
the population has a known, non-zero, chance of being in the sample (i.e., the probability of selection).
This isn’t the case for non-probability samples. An example of a non-probability sample is an instant poll
which you hear about on radio and television shows. A show might invite you to go to a website and
answer a question such as whether you favor or oppose same-sex marriage. This is a purely volunteer
sample and we have no idea of the probability of selection.
We’re going to use the General Social Survey (GSS) for this exercise. The GSS is a national probability
sample of adults in the United States conducted by the National Opinion Research Center (NORC). The
GSS started in XXXXXXXXXX and has been an annual or biennial survey ever since. For this exercise we’re going
to use a subset of the XXXXXXXXXXGSS. Your instructor will tell you how to access this data set which is called
gss14_subset_for_classes_STATISTICS.sav.
Let’s start by asking two questions.
● Do men and women differ in the number of years of school they have completed?
● Do men and women differ in the number of hours they worked in the last week?
Click on “Analyze” in the menu bar and then on “Compare Means” and finally on “Means.” (See Chapter 6,
introduction in the online SPSS book mentioned on page XXXXXXXXXX.) Select the variables d4_educ and d18_hrs1
and move them to the “Dependent List” box. These are the variables for which you are going to compute
means. Then select the variable d5_sex and move it to the “Independent List” box. This is the variable
which defines the groups you want to compare. In our case we want to compare men and women. The
output from SPSS will show you the mean, number of cases, and standard deviation for men and women
for these two variables.
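If you want to see what the Means procedure is computing, the same summary can be sketched in Python. This is only an illustration under stated assumptions: the data values are made up, the function name group_summary is hypothetical, and the sketch ignores the case weights that SPSS applies to the GSS data.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_summary(values, groups):
    """Return {group: (mean, n, standard deviation)} using only valid cases."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        if value is not None:  # skip missing cases (don't know, no answer)
            by_group[group].append(value)
    return {g: (mean(v), len(v), stdev(v)) for g, v in by_group.items()}


# Hypothetical data: years of school and sex code (1 = male, 2 = female)
educ = [12, 16, 14, None, 12, 18, 13, 16]
sex = [1, 1, 1, 1, 2, 2, 2, 2]

summary = group_summary(educ, sex)
for g in sorted(summary):
    m, n, sd = summary[g]
    print(f"group {g}: mean={m:.2f} n={n} sd={sd:.2f}")
```

Notice that the case with a missing value is dropped before the mean is computed, just as SPSS excludes missing cases.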
Men and women differ very little in the number of years of school they completed. Men have completed a
little less than one-tenth of a year more than women. But men worked quite a bit more than women in the
last week – a difference of almost six hours. By the way, only respondents who are employed are included
in this calculation but both part-time and full-time employees are included.
Why can’t we just conclude that men and women have about the same education and that men work more
than women? If we were just describing the sample, we could. But what we want to do is to make
inferences about differences between men and women in the population. We have a sample of men and
a sample of women and some amount of sampling error will always be present in both samples. The larger
the sample, the less the sampling error and the smaller the sample, the more the sampling error. Because
of this sampling error we need to make use of hypothesis testing as we did in the previous exercise
(STAT5S).
Part II – Now it’s Your Turn
In this part of the exercise you want to compare men and women to answer these two questions.
● Do men and women differ in the number of hours per day they have to relax? This is variable
d20_hrsrelax in the GSS.
● Do men and women differ in the number of hours per day they watch television? This is variable
tv1_tvhours in the GSS.
Use SPSS to get the sample means and then compare them to begin answering these questions.
Part III – Hypothesis Testing – Independent-Samples t Test
In Part I we compared the mean scores for men and women for the following variables.
● d4_educ
● d18_hrs1
Now we want to determine if that difference is statistically significant by carrying out the
independent-samples t test.
A t test is used when you want to compare two groups. The “grouping variable” defines these two groups.
The variable, d5_sex, is a dichotomy. It has only two categories – male (value XXXXXXXXXX) and female (value XXXXXXXXXX). But
any variable can be made into a dichotomy by establishing a cut point or by recoding. For example, the
variable f4_satfin (satisfaction with financial situation) has three categories – satisfied (value 1), more or
less satisfied (value 2), and not at all satisfied (value XXXXXXXXXX). The cut point is the value that makes this into a
dichotomy. All values less than the cut point are in one category and all values equal to or larger than the
cut point are in the other category. If your cut point is 3, then values 1 and 2 are in one category and value
3 is in the other category.
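The cut point rule can be sketched in a few lines of Python. This is just an illustration: the function name dichotomize, the group labels 1 and 2, and the sample values are assumptions made for the example and are not part of SPSS.

```python
def dichotomize(value, cut_point):
    """Return 1 for values below the cut point, 2 for values at or above it."""
    return 1 if value < cut_point else 2


# Hypothetical f4_satfin responses (1, 2, or 3)
satfin = [1, 2, 3, 2, 1, 3]
groups = [dichotomize(v, cut_point=3) for v in satfin]
print(groups)
```

With a cut point of 3, the responses 1 and 2 fall in the first group and the response 3 falls in the second group.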
Click on “Analyze” and then on “Compare Means” and finally on “Independent-Samples T Test.” (See
Chapter 6, independent-samples t test in the online SPSS book.) Move the two variables listed above into
the “Test Variable(s)” box. These are the variables for which you want to compute the mean scores. Right
below the “Test Variable(s)” box is the “Grouping Variable” box. This is where you indicate which variable
defines the groups you want to compare. In this problem the grouping variable is d5_sex. Once you have
entered the grouping variable, then enter either the values of the two groups or the cut point.
In our case, you would enter 1 for males into Group 1 and 2 for females into Group XXXXXXXXXX. It wouldn’t matter
which was Group 1 and which was Group XXXXXXXXXX. Finally click on “OK.”
You should see two boxes in the output screen. The first box gives you four pieces of information.
● N which is the number of males and females on which the t test is based. This includes only those
cases with valid information. In other words, cases with missing information (e.g., don’t know, no
answer) are excluded.
● Means for males and females.
● Standard deviations for males and females.
● Standard error of the mean for males and females which is an estimate of the amount of sampling
error for the two samples.
The second box has more information in it. The first thing you notice is that there are two t tests for each
variable. One assumes that the two populations (i.e., all males and all females) have equal population
variances and the other doesn’t make this assumption. In our two examples, both t tests give about the
same results. We’ll come back to this in a little bit. The rest of the second box has the following
information. Let’s look at the t test for d4_educ.
● t is the value of the t test which is XXXXXXXXXXfor both t tests. There is a formula for computing t which
your instructor may or may not want to cover in your course.
● Degrees of freedom in the first t test is (N males – 1) + (N females – 1) = N males + N females – 2 = 2,535.
In the second t test the degrees of freedom is estimated and turns out to be a decimal.
● The significance (two-tailed) value which we’ll cover in a little bit.
● The mean difference is the mean for the first group (males) – the mean for the second group
(females) = XXXXXXXXXX – XXXXXXXXXX = XXXXXXXXXX. Instead of using the rounded values, SPSS carries the
computation out to more decimal places which results in a mean difference of XXXXXXXXXX. In other words,
males have XXXXXXXXXX of a year more education than females which is a very small difference.
● The standard error of the difference which is XXXXXXXXXXis an estimate of the amount of sampling error for
the difference score.
● 95% confidence interval of the difference which we’ll talk about in a later exercise.
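To see how the pieces in the second box fit together, here is a minimal Python sketch of the t test that assumes equal population variances. The function name pooled_t and the summary numbers are hypothetical (they are not the GSS results), and the sketch ignores case weights.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (mean difference, standard error of the difference, t, df)."""
    df = (n1 - 1) + (n2 - 1)  # same as n1 + n2 - 2
    # Pool the two sample variances, weighting each by its degrees of freedom
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff, se_diff, diff / se_diff, df


# Hypothetical summary statistics for two groups of 50 cases each
diff, se_diff, t, df = pooled_t(14.0, 3.0, 50, 13.0, 3.0, 50)
print(f"difference={diff:.2f} se={se_diff:.2f} t={t:.2f} df={df}")
```

The t value is simply the mean difference divided by the standard error of the difference, which is why a small difference with a large standard error produces a small t.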
Notice how we are going about this. We have a sample of adults in the United States (i.e., the XXXXXXXXXXGSS).
We calculate the mean years of school completed by men and women in the sample who answered the
question. But we want to test the hypothesis that the mean years of school completed by men and women
in the population are different. We’re going to use our sample data to test a hypothesis about the
population.
The hypothesis we want to test is that the mean years of school completed by men in the population is
different from the mean years of school completed by women in the population. We’ll call this our research
hypothesis. It’s what we expect to be true. But there is no way to prove the research hypothesis directly.
So we’re going to use a method of indirect proof. We’re going to set up another hypothesis that says that
the research hypothesis is not true and call this the null hypothesis. If we can’t reject the null hypothesis
then we don’t have any evidence in support of the research hypothesis. You can see why this is called a
method of indirect proof. We can’t prove the research hypothesis directly but if we can reject the null
hypothesis then we have indirect evidence that supports the research hypothesis. We haven’t proven the
research hypothesis, but we have support for this hypothesis.
Here are our two hypotheses.
● research hypothesis – the population mean for men minus the population mean for women does
not equal XXXXXXXXXX. In other words, they are different from each other.
● null hypothesis – the population mean for men minus the population mean for women equals XXXXXXXXXX. In
other words, they are not different from each other.
It’s the null hypothesis that we are going to test.
Now all we have to do is figure out how to use the t test to decide whether to reject or not reject the null
hypothesis. Look again at the significance value which is XXXXXXXXXX for both t tests. That tells you that, if
the null hypothesis were true, the probability of getting a difference at least this large just by chance
is about XXXXXXXXXX or XXXXXXXXXX times out of one hundred. With odds like that, of course, we’re not going to
reject the null hypothesis. A common rule is to reject the null hypothesis if the significance value is
less than XXXXXXXXXX or less than five out of one hundred.
But wait a minute. The SPSS output said this was a two-tailed significance value. What does that mean?
Look back at the research hypothesis which was that the population mean for men minus the population
mean for women does not equal XXXXXXXXXX. We’re not predicting that one population mean will be larger or smaller
than the other. That’s called a two-tailed test and we have to use a two-tailed significance value. If we had
predicted that one population mean would be larger than the other that would be a one-tailed test. It’s easy
to get the one-tailed significance value if we know the two-tailed significance value. If the two-tailed
significance value is XXXXXXXXXX then the one-tailed significance value is half that or XXXXXXXXXX divided by two or .045.
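The conversion from a two-tailed to a one-tailed significance value can be sketched in Python. The function name one_tailed and the value 0.08 are hypothetical. Note one assumption that matters: halving only applies when the sample difference is in the direction you predicted. If it is in the opposite direction, the one-tailed value is one minus half the two-tailed value.

```python
def one_tailed(two_tailed_p, direction_as_predicted=True):
    """Convert a two-tailed significance value to a one-tailed value."""
    half = two_tailed_p / 2
    # If the sample difference goes against the prediction, the one-tailed
    # value is the rest of the distribution, not the small tail.
    return half if direction_as_predicted else 1 - half


print(one_tailed(0.08))
print(one_tailed(0.08, direction_as_predicted=False))
```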
We still haven’t explained why there are two t tests. As we said earlier, one assumes that the two
populations (i.e., all males and all females) have equal population variances and the other doesn’t make
this assumption. To compute the t value we need to estimate the population variances (see STAT2S). If
the population variances are about the same, we can pool our two samples to estimate the population
variance. If they are not about the same we wouldn’t want to do this. So how do we decide which t test to
use? Here’s where we’ll talk about the Levene’s test for the equality of variances which is in the second
box in your SPSS output. For this test, the null hypothesis is that the two population variances are equal.
The appropriate test would be the F test which we’re not going to discuss until a later exercise (STAT8S).
But we know how to interpret significance values so we can still make use of this test. The significance
value for the variable d4_educ is XXXXXXXXXXwhich is not less than XXXXXXXXXXso we do not reject the null hypothesis
that the population variances are equal. This means that we would use the t test that assumes equal
population variances.
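For comparison, here is a minimal Python sketch of the t test that does not assume equal population variances, commonly known as Welch's t test. It shows why the degrees of freedom in the second t test turn out to be a decimal. The function name welch_t and the summary numbers are hypothetical, and the sketch ignores case weights.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, estimated df) without pooling the two sample variances."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    se_diff = math.sqrt(v1 + v2)
    t = (mean1 - mean2) / se_diff
    # Welch-Satterthwaite estimate of the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical groups with clearly unequal standard deviations
t_unequal, df_unequal = welch_t(14.0, 2.0, 40, 13.0, 4.0, 60)
print(f"t={t_unequal:.2f} df={df_unequal:.2f}")
```

When the two groups have the same size and the same standard deviation, the estimated degrees of freedom equal N1 + N2 – 2 and the two t tests agree.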
Part IV – Now it’s Your Turn Again
In this part of the exercise you want to compare men and women to answer these two questions but this
time you want to test the appropriate null hypotheses.
● Do men and women differ in the number of hours per day they have to relax?
● Do men and women differ in the number of hours per day they watch television?
Use the independent-samples t test to carry out this part of the exercise. What are the research and the null
hypotheses? Do you reject or not reject the null hypotheses? Explain why.
Part V – What Does Independent Samples Mean?
Why do we call this t test the independent-samples t test? Independent samples are samples in which the
composition of one sample does not influence the composition of the other sample. In this exercise we’re
using the XXXXXXXXXXGSS which is a sample of adults in the United States. If we divide this sample into men and
women we would have a sample of men and a sample of women and they would be independent samples.
The individuals in one of the samples would not influence who is in the other sample.
Dependent samples are samples in which the composition of one sample does influence the composition of
the other sample. For example, if we have a sample of married couples and divide that sample into two
samples of men and women, then the men in one of the samples determine who the women are in the
other sample. The composition of the samples is dependent on each other. We’re going to discuss the
paired-samples t test in the next exercise (STAT7S).
STAT7S: Exercise Using SPSS to Explore Hypothesis Testing – Paired-Samples t Test
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXXGeneral Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses COMPARE
MEANS (paired-samples t test) to explore hypothesis testing. A good reference on using SPSS is SPSS
for Windows Version XXXXXXXXXXA Basic Tutorial by Linda Fiddler, John Korey, Edward Nelson (Editor), and
Elizabeth Nelson. The online version of the book is on the Social Science Research and Instructional
Council's Website. You have permission to use this exercise and to revise it to fit your needs. Please
send a copy of any revision to the author. Included with this exercise (as separate files) are more detailed
notes to the instructors, the SPSS syntax necessary to carry out the exercise (SPSS syntax file), and the
SPSS output for the exercise (SPSS output file). Please contact the author for additional information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; docx format).
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; docx format).
Goals of Exercise
The goal of this exercise is to explore hypothesis testing and the paired-samples t test. The exercise also
gives you practice in using COMPARE MEANS.
Part I – Populations and Samples
Populations are the complete set of objects that we want to study. For example, a population might be all
the individuals that live in the United States at a particular point in time. The U.S. does a complete
enumeration of all individuals living in the United States every ten years (i.e., each year ending in a zero).
We call this a census. Another example of a population is all the students in a particular school or all
college students in your state. Populations are often large and it’s too costly and time consuming to carry
out a complete enumeration. So what we do is to select a sample from the population where a sample is a
subset of the population and then use the sample data to make an inference about the population.
A statistic describes a characteristic of a sample while a parameter describes a characteristic of a
population. The mean age of a sample is a statistic while the mean age of the population is a parameter.
We use statistics to make inferences about parameters. In other words, we use the mean age of the
sample to make an inference about the mean age of the population. Notice that the mean age of the
sample (our statistic) is known while the mean age of the population (our parameter) is usually unknown.
There are many different ways to select samples. Probability samples are samples in which every object in
the population has a known, non-zero, chance of being in the sample (i.e., the probability of selection).
This isn’t the case for non-probability samples. An example of a non-probability sample is an instant poll
which you hear about on radio and television shows. A show might invite you to go to a website and
answer a question such as whether you favor or oppose same-sex marriage. This is a purely volunteer
sample and we have no idea of the probability of selection.
We’re going to use the General Social Survey (GSS) for this exercise. The GSS is a national probability
sample of adults in the United States conducted by the National Opinion Research Center (NORC). The
GSS started in XXXXXXXXXX and has been an annual or biennial survey ever since. For this exercise we’re going
to use a subset of the XXXXXXXXXXGSS. Your instructor will tell you how to access this data set which is called
gss14_subset_for_classes_STATISTICS.sav.
In STAT6S we compared means from two independent samples. Independent samples are samples in
which the composition of one sample does not influence the composition of the other sample. In this
exercise we’re using the XXXXXXXXXXGSS which is a sample of adults in the United States. If we divide this
sample into men and women we would have a sample of men and a sample of women and they would be
independent samples. The individuals in one of the samples would not influence who is in the other
sample.
In this exercise we’re going to compare means from two dependent samples. Dependent samples are
samples in which the composition of one sample influences the composition of the other sample. The 2014
GSS includes questions about the years of school completed by the respondent’s parents – d22_maeduc
and d24_paeduc. Let’s assume that we think that respondent’s fathers have more education than
respondent’s mothers. We would compare the mean years of school completed by mothers with the mean
years of school completed by fathers. If the respondent’s mother is in one sample, then the respondent’s
father must be in the other sample. The composition of the samples is therefore dependent on each other.
SPSS calls these paired-samples so we’ll use that term from now on.
Let’s start by asking whether fathers or mothers have more years of school. Click on “Analyze” in the
menu bar and then on “Compare Means” and finally on “Means.” (See Chapter 6, introduction in the online
SPSS book mentioned on page XXXXXXXXXX.) Select the variables d22_maeduc and d24_paeduc and move them to
the “Dependent List” box. These are the variables for which you are going to compute means. The output
from SPSS will show you the mean, number of cases, and standard deviation for fathers and mothers.
Fathers have about two-tenths of a year more education than mothers. Why can’t we just conclude that
fathers have more education than mothers? If we were just describing the sample, we could. But what we
want to do is to make inferences about differences between fathers and mothers in the population. We
have a sample of fathers and a sample of mothers and some amount of sampling error will always be
present in both samples. The larger the sample, the less the sampling error and the smaller the sample,
the more the sampling error. Because of this sampling error we need to make use of hypothesis testing as
we did in the two previous exercises (STAT5S and STAT6S).
Part II – Now it’s Your Turn
In this part of the exercise you want to compare the years of school completed by respondents and their
spouses to determine whether men have more education than their spouses or whether women have more
education than their spouses.
Use SPSS to get the sample means as we did in Part I and then compare them to begin answering this
question. But we need to be careful here. Respondents could be either male or female. We need to
separate respondents into two groups – men and women – and then separately compare male
respondents with their spouses and female respondents with their spouses. We can do this by putting the
variables d4_educ and d29_speduc into the “Dependent List” box and d5_sex into the “Independent List”
box.
Part III – Hypothesis Testing – Paired-Samples t Test
In Part I we compared the mean years of school completed by fathers and mothers. Now we want to
determine if this difference is statistically significant by carrying out the paired-samples t test.
Click on “Analyze” and then on “Compare Means” and finally on “Paired-Samples T Test.” (See Chapter 6,
paired-samples t test in the online SPSS book.) Move the two variables listed above into the “Paired
Variables” box. Do this by selecting d22_maeduc and click on the arrow to move it into the “Variable 1”
box. Then select the other variable, d24_paeduc, and click on the arrow to move it into the “Variable 2”
box. Now click on “OK” and SPSS will carry out the paired-samples t test. It doesn’t matter which variable
you put in the “Variable 1” and “Variable 2” boxes, although the order determines the sign of the difference scores.
You should see three boxes in the output screen. The first box gives you four pieces of information.
● Means for mothers and fathers.
● N which is the number of mothers and fathers on which the t test is based. This includes only
those cases with valid information. In other words, cases with missing information (e.g., don’t
know, no answer) are excluded.
● Standard deviations for mothers and fathers.
● Standard error of the mean for mothers and fathers which is an estimate of the amount of sampling
error for the two samples.
The second box gives you the paired sample correlation which is the correlation between mother’s and
father’s years of school completed for the paired samples. If you haven’t discussed correlation yet don’t
worry about what this means.
The third box has more information in it. With paired samples what we do is subtract the years of school
completed for one parent in each pair from the years of school completed for the other parent in the same
pair. Since we put mother’s years of school completed in variable 1 and father’s education in variable 2,
SPSS will subtract father’s education from mother’s education. So if the father completed XXXXXXXXXX years and the
mother completed XXXXXXXXXX years we would subtract XXXXXXXXXX from XXXXXXXXXX which would give you XXXXXXXXXX. For this pair the father
completed two more years than the mother.
The third box gives you the following information.
● The mean difference score for all the pairs in the sample which is XXXXXXXXXX. This means that fathers
had an average of almost two-tenths of a year more education than the mothers. By the way, in
Part I when we compared the means for d22_maeduc and d24_paeduc the difference was 0.22.
Here the mean difference score is XXXXXXXXXX. Why aren’t they the same? See if you can figure this out.
(Hint: it has something to do with comparing differences for pairs.)
● The standard deviation of the difference scores for all these pairs which is XXXXXXXXXX.
● The standard error of the mean which is an estimate of the amount of sampling error.
● The 95% confidence interval for the mean difference score. If you haven’t talked about confidence
intervals yet, just ignore this. We’ll talk about confidence intervals in a later exercise.
● The value of t for the paired-samples t test which is XXXXXXXXXX. There is a formula for computing t which
your instructor may or may not want to cover in your course.
● The degrees of freedom for the t test which is 1,795 which is the number of pairs minus one or
1,796 – 1 or 1,795. In other words, 1,795 of the difference scores are free to vary. Once these
difference scores are fixed, then the final difference score is fixed or determined.
● The two-tailed significance value which is XXXXXXXXXXwhich we’ll cover next.
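The quantities in the third box can be sketched in Python to show where each one comes from. This is only an illustration: the function name paired_t and the data values are hypothetical, and the sketch ignores the case weights that SPSS applies to the GSS data.

```python
import math
from statistics import mean, stdev


def paired_t(var1, var2):
    """Return (mean difference, sd of differences, standard error, t, df)."""
    # Keep only pairs where both values are valid, then subtract var2 from var1
    diffs = [a - b for a, b in zip(var1, var2)
             if a is not None and b is not None]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)
    se = sd_diff / math.sqrt(n)
    return mean_diff, sd_diff, se, mean_diff / se, n - 1


# Hypothetical years of school for mothers (variable 1) and fathers (variable 2)
maeduc = [12, 10, 16, 12, None, 14]
paeduc = [14, 12, 16, 11, 12, 16]

mean_diff, sd_diff, se, t, df = paired_t(maeduc, paeduc)
print(f"mean difference={mean_diff:.2f} sd={sd_diff:.2f} se={se:.2f} t={t:.2f} df={df}")
```

A pair is used only when both values are valid, so the degrees of freedom are the number of complete pairs minus one.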
Notice how we are going about this. We have a sample of adults in the United States (i.e., the XXXXXXXXXXGSS).
We calculate the mean years of school completed by respondent’s fathers and mothers in the sample who
answered the question. But we want to test the hypothesis that the mean years of school completed by
fathers is greater than the mean for mothers in the population. We’re going to use our sample data to test
a hypothesis about the population.
The hypothesis we want to test is that the mean years of school completed by fathers is greater than the
mean years of school completed by mothers in the population. We’ll call this our research hypothesis. It’s
what we expect to be true. But there is no way to prove the research hypothesis directly. So we’re going
to use a method of indirect proof. We’re going to set up another hypothesis that says that the research
hypothesis is not true and call this the null hypothesis. If we can’t reject the null hypothesis then we don’t
have any evidence in support of the research hypothesis. You can see why this is called a method of
indirect proof. We can’t prove the research hypothesis directly but if we can reject the null hypothesis then
we have indirect evidence that supports the research hypothesis. We haven’t proven the research
hypothesis, but we have support for this hypothesis.
Here are our two hypotheses.
● research hypothesis – the mean difference score in the population is negative. In other words, the
mean years of school completed by fathers is greater than the mean years for mothers for all pairs in the
population.
● null hypothesis – the mean difference score for all pairs in the population is equal to 0.
It’s the null hypothesis that we are going to test.
Now all we have to do is figure out how to use the t test to decide whether to reject or not reject the null
hypothesis. Look again at the significance value which is XXXXXXXXXX. That tells you that, if the null
hypothesis were true, the probability of getting a mean difference at least this large just by chance is
XXXXXXXXXX or 2 times out of one hundred. With odds like that, of course, we’re going to reject the null
hypothesis. A common rule is to reject the null hypothesis if the significance value is less than
XXXXXXXXXX or less than five out of one hundred.
But wait a minute. The SPSS output said this was a two-tailed significance value. What does that mean?
Look back at the research hypothesis which was that the mean difference score for all pairs in the
population was less than XXXXXXXXXX. We’re predicting that the mean difference score for all pairs in the population
will be negative. That’s called a one-tailed test and we have to use a one-tailed significance value. It’s
easy to get the one-tailed significance value if we know the two-tailed significance value. If the two-tailed
significance value is XXXXXXXXXX then the one-tailed significance value is half that or XXXXXXXXXX divided by two or .010.
We still reject the null hypothesis which means that we have evidence to support our research hypothesis.
We haven’t proven the research hypothesis to be true but we have evidence to support it.
Part IV – Now it’s Your Turn Again
In this part of the exercise you want to compare the years of school completed by respondents and their
spouses to determine if women have more education than their spouses but this time you want to test the
appropriate null hypotheses.
Remember from Part II that we have to test this hypothesis first for men and then for women. We’re going
to do this by selecting out all the men and then computing the paired-samples t test. Do this by clicking on
“Data” in the menu bar and then clicking on “Select Cases.” Select “If condition is satisfied” and then click
on “If” in the box below. Select d5_sex and move it to the box on the right by clicking on the arrow pointing
to the right. Now click on the equals sign and then on 1 so the expression in the box reads “d5_sex = 1”.
Click on “Continue” and then on “OK”. To make sure you have selected out the males run a frequency
distribution for d5_sex. You should only see the males (i.e., value XXXXXXXXXX). Now carry out the paired-samples t
test. Repeat this for the females (i.e., value XXXXXXXXXX) by selecting out the females and then running the
paired-samples t test again.
For each paired-samples t test, state the research and the null hypotheses. Do you reject or not reject the
null hypotheses? Explain why.
STAT8S: Exercise Using SPSS to Explore Hypothesis Testing – One-Way Analysis of Variance
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXX General Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses COMPARE
MEANS and one-way analysis of variance to explore hypothesis testing. A good reference on using SPSS
is SPSS for Windows Version XXXXXXXXXX A Basic Tutorial by Linda Fiddler, John Korey, Edward Nelson (Editor),
and Elizabeth Nelson. The online version of the book is on the Social Science Research and
Instructional Council's Website. You have permission to use this exercise and to revise it to fit your
needs. Please send a copy of any revision to the author. Included with this exercise (as separate files) are
more detailed notes to the instructors, the SPSS syntax necessary to carry out the exercise (SPSS syntax
file), and the SPSS output for the exercise (SPSS output file). Please contact the author for additional
information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; docx format).
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; docx format).
Goals of Exercise
The goal of this exercise is to explore hypothesis testing and one-way analysis of variance (sometimes
abbreviated one-way anova). The exercise also gives you practice in using COMPARE MEANS.
Part I – Populations and Samples
Populations are the complete set of objects that we want to study. For example, a population might be all
the individuals that live in the United States at a particular point in time. The U.S. does a complete
enumeration of all individuals living in the United States every ten years (i.e., each year ending in a zero).
We call this a census. Another example of a population is all the students in a particular school or all
college students in your state. Populations are often large and it’s too costly and time consuming to carry
out a complete enumeration. So instead we select a sample, which is a subset of the population, and then
use the sample data to make an inference about the population.
A statistic describes a characteristic of a sample while a parameter describes a characteristic of a
population. The mean age of a sample is a statistic while the mean age of the population is a parameter.
We use statistics to make inferences about parameters. In other words, we use the mean age of the
sample to make an inference about the mean age of the population. Notice that the mean age of the
sample (our statistic) is known while the mean age of the population (our parameter) is usually unknown.
There are many different ways to select samples. Probability samples are samples in which every object in
the population has a known, non-zero, chance of being in the sample (i.e., the probability of selection).
This isn’t the case for non-probability samples. An example of a non-probability sample is an instant poll
which you hear about on radio and television shows. A show might invite you to go to a website and
answer a question such as whether you favor or oppose same-sex marriage. This is a purely volunteer
sample and we have no idea of the probability of selection.
We’re going to use the General Social Survey (GSS) for this exercise. The GSS is a national probability
sample of adults in the United States conducted by the National Opinion Research Center (NORC). The
GSS started in XXXXXXXXXX and has been an annual or biennial survey ever since. For this exercise we’re going
to use a subset of the XXXXXXXXXX GSS. Your instructor will tell you how to access this data set which is called
gss14_subset_for_classes_STATISTICS.sav.
In STAT6S and STAT7S we used the t test to compare means from two samples. In STAT6S the means
were from two independent samples while in STAT7S they were from paired samples. But what if we
wanted to compare means from more than two samples? For that we need to use a statistical test called
analysis of variance. In fact, the t test is a special case of analysis of variance.
The XXXXXXXXXX GSS includes a variable (d3_degree) that describes the highest degree in school that the person
achieved. The categories are less than high school, high school, junior college, bachelor’s degree, and
graduate degree. Another variable is the number of hours per day that respondents say they watch
television (tv1_tvhours). We want to find out if there is any relationship between these two variables. One
way to answer this question would be to see if respondents with different levels of education watch different
amounts of television. For example, you might suspect that the more education respondents have, the less
television they watch.
Let’s start by looking at the mean number of hours that people watch television broken down by highest
educational degree. Click on “Analyze” in the menu bar and then on “Compare Means” and finally on
“Means.” (See Chapter 6, introduction in the online SPSS book mentioned on page XXXXXXXXXX.) Select the variable
tv1_tvhours and move it to the “Dependent List” box. This is the variable for which you are going to
compute means. Then select the variable d3_degree and move it to the “Independent List” box. The
output from SPSS will show you the mean, number of cases, and standard deviation for the different levels
of education.
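To make clear what the “Means” output contains, here is a minimal sketch that computes the same three quantities (mean, number of cases, standard deviation) for each group. The data are invented and use only three of the five degree categories; the names rows, groups, and summary are hypothetical and are not part of SPSS or the GSS file.

```python
import statistics
from collections import defaultdict

# Hypothetical (highest degree, TV hours per day) pairs, invented for illustration.
rows = [
    ("less than high school", 4), ("less than high school", 5), ("less than high school", 3),
    ("high school", 3), ("high school", 4), ("high school", 2),
    ("bachelor's degree", 2), ("bachelor's degree", 1), ("bachelor's degree", 3),
]

# Group the dependent variable (hours) by the independent variable (degree).
groups = defaultdict(list)
for degree, hours in rows:
    groups[degree].append(hours)

# For each group: mean, number of cases, sample standard deviation.
summary = {
    degree: (statistics.mean(h), len(h), statistics.stdev(h))
    for degree, h in groups.items()
}

for degree, (mean, n, sd) in summary.items():
    print(f"{degree}: mean={mean}, n={n}, sd={sd:.2f}")
```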
Respondents with more education watch less television than those with less education. For example,
respondents with a graduate degree watch an average of XXXXXXXXXX hours of television per day while those who
haven’t completed high school watch an average of XXXXXXXXXX hours – a difference of about two hours. Why
can’t we just conclude those with more education watch less television than those with less education? If
we were just describing the sample, we could. But what we want to do is to make inferences about
differences in the population. We have five samples from five different levels of education and some
amount of sampling error will always be present in all these samples. The larger the samples, the less the
sampling error and the smaller the samples, the more the sampling error. Because of this sampling error
we need to make use of hypothesis testing as we did in the three previous exercises (STAT5S, STAT6S,
and STAT7S).
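One common way to put a number on the idea that larger samples have less sampling error is the standard error of the mean, which is the standard deviation divided by the square root of the sample size. The sketch below assumes a simple random sample, which is a simplification of how the GSS is actually drawn and weighted, and the standard deviation of 2.5 hours is an invented value.

```python
import math


def standard_error(sd, n):
    """Standard error of a sample mean under simple random sampling."""
    return sd / math.sqrt(n)


# Same hypothetical standard deviation, two different sample sizes.
se_small = standard_error(2.5, 100)
se_large = standard_error(2.5, 400)
print(se_small, se_large)
```

Under these assumptions, making the sample four times larger cuts the standard error in half.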
Part II – Now it’s Your Turn
In this part of the exercise you want to determine whether people who live in some regions of the country
(d25_region) watch more television (tv1_tvhours) than people in other regions. Use SPSS to get the
sample means as we did in Part I and then compare them to begin answering this question. Write one or
two paragraphs describing the regions in which people watch more and less television.
Part III – Hypothesis Testing – One-Way Analysis of Variance
In Part I we compared the mean hours of television watched per day for different levels of education. Now
we want to determine if these differences are statistically significant by carrying out a one-way analysis of
variance.
Click on “Analyze” in the menu bar and then on “Compare Means” and finally on “Means.” Select the
variable tv1_tvhours and move it to the “Dependent List” box. Then select the variable d3_degree and
move it to the “Independent List” box. Now click on “Options” in the upper-right corner and then check the
“Anova table and eta” box. Finally click on “Continue” and then on “OK.”
You should see four tables in the output screen. The first table tells you how many cases are included in the
analysis and how many cases are excluded. Any case with missing data on either variable will be excluded.
The second table shows you the mean, number of cases, and standard deviation for each of the five levels
of education.
The third table gives you the results of the one-way analysis of variance, which include the statistics listed
below. We’re not going to explain these statistics in this exercise. Your instructor will decide how much to
cover on the calculation and meaning of these statistics.
● Between groups and within groups sum of squares.
● Degrees of freedom for the between groups and within groups sum of squares.
● Mean square for the between groups and within groups sum of squares.
● F statistic.
● Significance value.
The fourth table gives you the values of Eta and Eta squared, which measure the strength of the association
between the two variables. Again, we’ll leave it to your instructor to talk about these measures.
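For readers who want to see where the numbers in the Anova and Eta tables come from, the following is a minimal sketch of the standard one-way analysis of variance arithmetic. The three groups and their values are invented for illustration and the variable names are hypothetical. The significance value is not computed here because it requires the F distribution, which SPSS looks up for you.

```python
import statistics

# Hypothetical TV hours per day for three education groups (invented data).
groups = {
    "less than high school": [4, 5, 3],
    "high school": [3, 4, 2],
    "bachelor's degree": [2, 1, 3],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = statistics.mean(all_values)

# Between groups sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(v) * (statistics.mean(v) - grand_mean) ** 2 for v in groups.values()
)
# Within groups sum of squares: how far each case is from its own group mean.
ss_within = sum(
    (x - statistics.mean(v)) ** 2 for v in groups.values() for x in v
)

# Degrees of freedom.
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# Mean squares, F statistic, and Eta squared.
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within
eta_squared = ss_between / (ss_between + ss_within)

print(ss_between, ss_within, df_between, df_within, f_stat, eta_squared)
```

Eta squared is the share of the total sum of squares that lies between the groups, and Eta is its square root.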
Notice how we are going about this. We have a sample of adults in the United States (i.e., the XXXXXXXXXX GSS).
We calculate the mean number of hours per day that respondents watch television for each level of
education in the sample. But we want to test the hypothesis that the amount respondents watch television
varies by level of education in the population. We’re going to use our sample data to test a hypothesis
about the population.
Our hypothesis is that the mean number of hours watching television is higher for some levels of education
than for other levels in the population. We’ll call this our research hypothesis. It’s what we expect to be
true. But there is no way to prove the research hypothesis directly. So we’re going to use a method of
indirect proof. We’re going to set up another hypothesis that says that the mean number of hours watching
television is the same for all levels of education in the population and call this the null hypothesis. If we
can’t reject the null hypothesis then we don’t have any evidence in support of the research hypothesis. You
can see why this is called a method of indirect proof. We can’t prove the research hypothesis directly but if
we can reject the null hypothesis then we have indirect evidence that supports the research hypothesis.
We haven’t proven the research hypothesis, but we have support for this hypothesis.
Here are our two hypotheses.
● research hypothesis – the population mean number of hours watching television for at least one level of
education is different from the population mean for at least one other level.
● null hypothesis – the mean number of hours watching television is the same for all five levels of
education in the population.
It’s the null hypothesis that we are going to test.
Now all we have to do is figure out how to use the F test to decide whether to reject or not reject the null
hypothesis. Look again at the significance value which is XXXXXXXXXX, which actually means less than XXXXXXXXXX
since XXXXXXXXXX is a rounded value. That tells you that, if the null hypothesis were true, the probability of
getting an F value at least this large just by chance is less than 5 out of ten thousand. With odds like that, of
course, we’re going to reject the null hypothesis. A common rule is to reject the null hypothesis if the
significance value is less than XXXXXXXXXX or less than five out of one hundred.
So what have we learned? We learned that the mean number of hours watching television for at least one
of the populations is different from at least one other population. But which ones? There are statistical
tests for answering this question. But we’re not going to cover that although your instructor might want to
discuss these tests.
Part IV – Now it’s Your Turn Again
In Part II you computed the mean number of hours that respondents watched television for each of the nine
regions of the country. Now we want to determine if these differences are statistically significant by
carrying out a one-way analysis of variance as described in Part III. Indicate what the research and null
hypotheses are and whether you can reject the null hypothesis. What does that tell you about the research
hypothesis?
STAT9S: Exercise Using SPSS to Explore Crosstabulation
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXX General Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses CROSSTABS
in SPSS to explore crosstabulation. A good reference on using SPSS is SPSS for Windows Version XXXXXXXXXX A
Basic Tutorial by Linda Fiddler, John Korey, Edward Nelson (Editor), and Elizabeth Nelson. The online
version of the book is on the Social Science Research and Instructional Council's Website. You have
permission to use this exercise and to revise it to fit your needs. Please send a copy of any revision to the
author. Included with this exercise (as separate files) are more detailed notes to the instructors, the SPSS
syntax necessary to carry out the exercise (SPSS syntax file), and the SPSS output for the exercise (SPSS
output file). Please contact the author for additional information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; docx format).
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; docx format).
Goals of Exercise
The goal of this exercise is to introduce crosstabulation as a statistical tool to explore relationships between
variables. The exercise also gives you practice in using CROSSTABS in SPSS.
Part I – Relationships between Variables
In exercises STAT5S through STAT8S we used sample means to analyze relationships between variables.
For example, we compared men and women to see if they differed in the number of years of school
completed and the number of hours they worked in the previous week and discovered that men and
women had about the same amount of education but that men worked more hours than women. We were
able to compute means because years of school completed and hours worked are both ratio level
variables. The mean assumes interval or ratio level measurement (see STAT2S).
But what if we wanted to explore relationships between variables that weren’t interval or ratio?
Crosstabulation can be used to look at the relationship between nominal and ordinal variables. Let’s
compare men and women (d5_sex) in terms of the following:
● opinion about abortion (a1_abany),
● fear of crime (c1_fear),
● satisfaction with current financial situation (f4_satfin),
● opinion about gun control (g1_gunlaw),
● gun ownership (g2_owngun),
● voting (p5_pres08), and
● religiosity (r8_reliten).
Before we look at the relationship between sex and these other variables, we need to talk about
independent and dependent variables. The dependent variable is whatever you are trying to explain. In
our case, that would be how people feel about abortion, fear of crime, gun control and ownership, voting
and religiosity. The independent variable is some variable that you think might help you explain why some
people think abortion should be legal and others think it shouldn’t be legal or any of the other variables in
our list above. In our case, that would be sex. Normally we put the dependent variable in the row and the
independent variable in the column. We’ll follow that convention in this exercise.
Let’s start with the first two variables in our list. We’re going to use a1_abany as our measure of opinion
about abortion. Respondents were asked if they thought abortion ought to be legal for any reason. And
we’re going to use c1_fear as our measure of fear of crime. Respondents were asked if they were afraid to
walk alone at night in their neighborhood. Run CROSSTABS to produce two tables. (See Chapter 5,
Crosstabs in the online SPSS book.) One will be for the relationship between d5_sex and a1_abany. The
other will be for d5_sex and c1_fear. Put the independent variable in the column and the dependent
variable in the row. If you don’t ask for percents, SPSS will give you only the counts (i.e., frequencies) so
be sure to ask for the percents. SPSS can compute the row percents, column percents, and total
percents. Your instructor will probably talk about how to compute these different percents. But how do you
know which percents to ask for? Here’s a simple rule for computing percents.
● If your independent variable is in the column, then you want to use the column percents.
● If your independent variable is in the row, then you want to use the row percents.
Since you put the independent variable in the column, you want the column percents.
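The rule for column percents can be sketched with a small table of counts. The counts below are invented and are not the GSS results; the names counts, column_totals, and column_percents are hypothetical. Each cell is divided by the total of its own column, so each column sums down to 100%.

```python
# Hypothetical counts: rows are the dependent variable, columns the independent variable.
counts = {
    "yes": {"male": 40, "female": 45},
    "no": {"male": 60, "female": 105},
}

# Column totals: add the counts down each column.
column_totals = {}
for row in counts.values():
    for col, n in row.items():
        column_totals[col] = column_totals.get(col, 0) + n

# Column percents: each cell as a percent of its column total.
column_percents = {
    row_label: {col: 100 * n / column_totals[col] for col, n in row.items()}
    for row_label, row in counts.items()
}

# Compare across the first row: the percent difference between the two columns.
percent_difference = column_percents["yes"]["male"] - column_percents["yes"]["female"]
print(column_percents, percent_difference)
```

Because the column percents take out the effect of unequal group sizes, the two groups can be compared directly even though there are more females than males in this made-up table.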
Part II – Interpreting the Percents
Your first table should look like this.
It’s easy to make sure that you have the correct percents. Your independent variable (d5_sex) should be in
the column and it is. Column percents should sum down to 100% and they do.
How are you going to interpret these percents? Here’s a simple rule for interpreting percents.
● If your percents sum down to 100%, then compare the percents across.
● If your percents sum across to 100%, then compare the percents down.
Since the percents sum down to 100%, you want to compare across.
Look at the first row. Approximately 47% of men think abortion should be legal for any reason compared to
44% of women. There’s a difference of 3.6% which is really small. We never want to make too much of
small differences. Why not? No sample is ever a perfect representation of the population from which the
sample is drawn because every sample contains some amount of sampling error. The larger the
sample size, the less the sampling error and the smaller the sample size, the more the sampling error. So
in this case we would conclude that there probably isn’t any difference in the population between men and
women in their approval of abortion for any reason.
Now let’s look at your second table.
This time the percent difference is quite a bit larger. About 22% of men are afraid to walk alone at night in
their neighborhood compared to 39% of women. This is a difference of 16.8%. This is a much larger
difference and we have reason to think that women are more fearful of being a victim of crime than men.
Part III – Now it’s Your Turn
Choose two of the variables from the following list and compare men and women:
● satisfaction with current financial situation (f4_satfin),
● opinion about gun control (g1_gunlaw),
● voting (p5_pres08), and
● religiosity (r8_reliten).
Make sure that you put the independent variable in the column and the dependent variable in the row. Be
sure to ask for the correct percents. What are the values of the percents that you want to compare? What is
the percent difference? Does it look to you as though there is much of a difference between men and women in
the variables you chose?
Part IV – Adding another Variable into the Analysis
So far we have only looked at variables two at a time. Often we want to add other variables into the
analysis. Let’s focus on the difference between men and women (d5_sex) in terms of gun ownership
(g2_owngun). First let’s get the two-variable table which should look like this.
Men were more likely to own guns by 9.5%. But what if we wanted to include social class in this analysis?
The XXXXXXXXXX GSS asked respondents whether they thought of themselves as lower, working, middle, or upper
class. This is variable d11_class. What we want to do is to hold constant perceived social class. In other
words, we want to divide our sample into four groups with each group consisting of one of these four
classes and then look at the relationship between d5_sex and g2_owngun separately for each of these four
groups.
We can do this by going back to the SPSS dialog box where we requested the crosstabulation and putting
the variable d11_class in the third box down right below the “Column(s)” box. (See Chapter 8, Crosstabs
Revisited in the online SPSS book.) Your table should look like this.
This table is more complicated. Notice that the table is actually divided into four tables with one on top of
the other. At the top we have those who said they were lower class, then working, middle and upper class.
Let’s look at the percent differences for each of these tables – 12.0%, 9.6%, 9.4%, and 0.4%. The first
three tables are similar to the two-variable table – 9.5% compared to 12.0%, 9.6%, and 9.4%. Remember
not to make too much out of small differences because of sampling error. But the last table for upper class
has a much smaller difference – 0.4%. In other words, when we look at only those who see themselves as
upper class, there really isn’t any difference between men and women in terms of gun ownership.
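Holding a third variable constant simply means repeating the two-variable comparison inside each category of that variable. The sketch below does this for two made-up class groups; the counts are invented and do not reproduce the percents reported above, and the names layers, percent, and differences are hypothetical.

```python
# Hypothetical data: class -> sex -> (number who own a gun, number of respondents).
layers = {
    "working": {"male": (45, 100), "female": (35, 100)},
    "upper": {"male": (10, 25), "female": (8, 20)},
}


def percent(owners, total):
    """Percent of a group that owns a gun."""
    return 100 * owners / total


# Compute the male minus female percent difference separately within each class.
differences = {
    layer: percent(*by_sex["male"]) - percent(*by_sex["female"])
    for layer, by_sex in layers.items()
}
print(differences)
```

Notice how few cases sit behind the percents in the smaller group, which is why small layers deserve extra caution.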
But notice something else. There are fewer people who say they are lower and upper class than say they
are working or middle class. There are only XXXXXXXXXX respondents in the lower class table and even fewer, 48
respondents, in the upper class table. We’ll have more to say about this in the next exercise (STAT10S).
Part V – Now it’s Your Turn Again
In Part II we compared men and women (d5_sex) in terms of fear of crime (c1_fear). Run this table again
but this time add social class (d11_class) into the analysis as we did in Part IV. What happens to the
percent difference when you hold constant class? What does this tell you?
STAT10S: Exercise Using SPSS to Explore Chi Square
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXX General Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses CROSSTABS
in SPSS to explore the Chi Square test. A good reference on using SPSS is SPSS for Windows Version
23.0 A Basic Tutorial by Linda Fiddler, John Korey, Edward Nelson (Editor), and Elizabeth Nelson. The
online version of the book is on the Social Science Research and Instructional Council's Website. You
have permission to use this exercise and to revise it to fit your needs. Please send a copy of any revision
to the author. Included with this exercise (as separate files) are more detailed notes to the instructors, the
SPSS syntax necessary to carry out the exercise (SPSS syntax file), and the SPSS output for the exercise
(SPSS output file). Please contact the author for additional information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; .docx format)
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; .docx format)
Goals of Exercise
The goal of this exercise is to introduce Chi Square as a test of significance. The exercise also gives you
practice in using CROSSTABS in SPSS.
Part I – Relationships between Variables
We’re going to use the General Social Survey (GSS) for this exercise. The GSS is a national probability
sample of adults in the United States conducted by the National Opinion Research Center (NORC). The
GSS started in XXXXXXXXXX and has been an annual or biennial survey ever since. For this exercise we’re going
to use a subset of the XXXXXXXXXX GSS. Your instructor will tell you how to access this data set which is called
gss14_subset_for_classes_STATISTICS.sav.
The XXXXXXXXXX GSS is a sample from the population of all adults in the United States at the time the survey was
done. In the previous exercise (STAT9S) we used crosstabulation and percents to describe the
relationship between pairs of variables in the sample. But we want to go beyond just describing the
sample. We want to use the sample data to make inferences about the population from which the sample
was selected. Chi Square is a statistical test of significance that we can use to test hypotheses about the
population. Chi Square is the appropriate test when your variables are nominal or ordinal (see STAT1S).
In STAT9S we started by using crosstabulation to look at the relationship between sex and opinion about
abortion. We’re going to use a1_abany as our measure of opinion about abortion. Respondents were
asked if they thought abortion ought to be legal for any reason. Run CROSSTABS to produce the table.
(See Chapter 5, Crosstabs in the online SPSS book mentioned on page XXXXXXXXXX.) You want to get the
crosstabulation of d5_sex and a1_abany. Put the independent variable in the column and the dependent
variable in the row. Since your independent variable is in the column, you want to use the column
percents.
Part II – Interpreting the Percents
Your table should look like this.
Since your percents sum down to 100% (i.e., column percents), you want to compare the percents across.
Look at the first row. Approximately 47% of men think abortion should be legal for any reason compared to
44% of women. There’s a difference of 3.6% which seems small. We never want to make too much of
small differences. Why not? No sample is ever a perfect representation of the population from which the
sample is drawn because every sample contains some amount of sampling error. The larger the
sample size, the less the sampling error and the smaller the sample size, the more the sampling error.
But what is a small percent difference? Probably you would agree that a one to four percent difference is
small. But what about a five or six or seven percent difference? Is that small? Or is it large enough for us
to conclude that there is a difference between men and women in the population? Here’s where we can
use Chi Square.
Part III – Chi Square
Let’s assume that you think that sex and opinion about abortion are related to each other. We’ll call this our
research hypothesis. It’s what we expect to be true. But there is no way to prove the research hypothesis
directly. So we’re going to use a method of indirect proof. We’re going to set up another hypothesis that
says that the research hypothesis is not true and call this the null hypothesis. In our case, the null
hypothesis would be that the two variables are unrelated to each other. [1] In statistical terms, we often say
that the two variables are independent of each other. If we can reject the null hypothesis then we have
evidence to support the research hypothesis. If we can’t reject the null hypothesis then we don’t have any
evidence in support of the research hypothesis. You can see why this is called a method of indirect proof.
We can’t prove the research hypothesis directly but if we can reject the null hypothesis then we have
indirect evidence that supports the research hypothesis.
Here are our two hypotheses.
● research hypothesis – sex and opinion about abortion are related to each other
● null hypothesis – sex and opinion about abortion are unrelated to each other; in other words, they
are independent of each other
It's the null hypothesis that we are going to test.
SPSS will compute Chi Square for you. Follow the same procedure you used to get the crosstabulation
between d5_sex and a1_abany. Remember to get the column percents. Then click on the “Statistics”
button in the upper right of the dialog box. Check the box for “Chi-Square” and then click on “Continue”
and then on “OK.”
Now you will see another output box below the crosstabulation called “Chi-Square Tests.” We want the test
that is called “Pearson Chi-Square” in the first row of the box. Ignore all the other rows in this box. [2] You
should see three values to the right of “Pearson Chi-Square.”
● The value of Chi Square is XXXXXXXXXX. Your instructor may or may not want to go into the computation
of the Chi Square value but we’re not going to cover the computation in this exercise.
● The degrees of freedom (df) is XXXXXXXXXX. Degrees of freedom is the number of values that are free to vary. In
a table with two columns and two rows only one of the cell frequencies is free to vary assuming the
marginal frequencies are fixed. The marginal frequencies are the values in the margins of the
table. There are XXXXXXXXXX males and XXXXXXXXXX females in this table and there are XXXXXXXXXX who think abortion
should be legal for any reason and XXXXXXXXXX who think abortion should not be legal for any reason. Try
filling in any one of the cell frequencies in the table. The other three cell frequencies are then fixed
assuming we keep the marginal frequencies the same so there is one degree of freedom.
● The two-tailed significance value is XXXXXXXXXX. [3] This tells us that, if the null hypothesis were true, there
is a probability of XXXXXXXXXX of getting a Chi Square value at least this large just by chance. In other words,
that would happen XXXXXXXXXX out of 1,000 times. With odds like that, of course, we’re not going to reject the
null hypothesis. A common rule is to reject the null hypothesis if the significance value is less than
XXXXXXXXXX or less than five out of one hundred. Since XXXXXXXXXX is not smaller than .05, we don’t reject the
null hypothesis.
Since we can’t reject the null hypothesis, we don’t have any support for our research hypothesis.
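The degrees of freedom idea described above can be tried out directly. In this sketch the marginal frequencies and the one chosen cell are invented numbers, and fill_two_by_two is a hypothetical helper name. Once the margins are fixed and one cell is filled in, the other three cells follow by subtraction.

```python
def fill_two_by_two(row_totals, col_totals, top_left):
    """Fill a 2 x 2 table from its fixed margins and a single chosen cell."""
    top_right = row_totals[0] - top_left
    bottom_left = col_totals[0] - top_left
    bottom_right = row_totals[1] - bottom_left
    return [[top_left, top_right], [bottom_left, bottom_right]]


# Hypothetical margins: rows total 70 and 130, columns total 90 and 110.
table = fill_two_by_two(row_totals=(70, 130), col_totals=(90, 110), top_left=40)
print(table)

# General rule for a crosstabulation: (rows - 1) times (columns - 1).
n_rows, n_cols = 2, 2
degrees_of_freedom = (n_rows - 1) * (n_cols - 1)
print(degrees_of_freedom)
```

Only one cell was free to choose, which matches the single degree of freedom for a table with two rows and two columns.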
Part IV – Now it’s Your Turn
Choose any two of the tables from the following list and compare men and women using crosstabulation
and Chi Square:
● satisfaction with current financial situation (f4_satfin),
● opinion about gun control (g1_gunlaw),
● gun ownership (g2_owngun),
● voting (p6_pres12), and
● religiosity (r8_reliten).
Make sure that you put the independent variable in the column and the dependent variable in the row. Be
sure to ask for the correct percents and Chi Square. What are the research hypothesis and the null
hypothesis? Do you reject the null hypothesis? How do you know? What does that tell you about the
research hypothesis?
Part V – Expected Values
We said we weren’t going to talk about how you compute Chi Square but we do have to introduce the idea
of expected values. The computation of Chi Square is based on comparing the observed cell frequencies
(i.e., the cell frequencies that you see in the table that SPSS gives you) and the cell frequencies that you
would expect by chance assuming the null hypothesis was true. SPSS will also compute these expected
frequencies for you. Rerun the crosstabulation for d5_sex and a1_abany remembering to ask for the
column percents and Chi Square. But this time when you click on the “Cells” button to ask for the column
percents look in the upper left of the dialog box where it says “Counts.” “Observed” is selected as the
default. These are the observed cell frequencies. Click on the “Expected” box to get the expected cell
frequencies.
Now you will see both the observed and the expected cell frequencies in your output table. Notice that they
aren’t very different. The closer they are to each other, the smaller Chi Square will be. The more different
they are, the larger Chi Square will be. The larger Chi Square is, the more likely you are to be able to reject
the null hypothesis.
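Here is a small Python sketch of what SPSS is doing behind the scenes. The counts are made up for illustration (they are not from the GSS) and the function names are our own. Each expected frequency is the row total times the column total divided by the total number of cases, and the statistic computed here is the uncorrected Pearson Chi Square.

```python
# Hypothetical observed counts (NOT from the GSS).
# Rows are the dependent variable (yes, no); columns are the independent variable (male, female).
observed = [[40, 60],
            [60, 40]]

def expected_counts(table):
    """Expected cell frequencies if the null hypothesis were true."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

expected = expected_counts(observed)
chi_square = pearson_chi_square(observed)

print("expected:", expected)
print("Chi Square:", round(chi_square, 3))
```

Try changing the observed counts so they are closer to the expected counts and watch Chi Square shrink toward zero.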
Chi Square assumes that all the expected cell frequencies are greater than five. We can see from the table
that this is the case for this table. But we don’t have to get the expected frequencies to see this. Look
back at the “Chi-Square Tests” table in your output. Look at footnote a. It tells you that the smallest
expected cell frequency is XXXXXXXXXX. So clearly all four expected cell frequencies are at least five. If it’s just
a little bit below five, that’s no problem. But if it gets down around three you have a problem. What you’ll
have to do is to combine rows or columns that have small marginal frequencies.
For example, run the crosstabulation of d5_sex and d9_sibs which is the number of brothers and sisters
that the respondent has. [4] The minimum expected frequency is so small that it rounds to XXXXXXXXXX. That’s
because there are only a few respondents with more than XXXXXXXXXX siblings. You will need to recode the number
of siblings into fewer categories making sure that you don’t have any categories with a really small number
of cases.
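The following Python sketch shows the idea of recoding with made-up counts (not the GSS data). The cutoff of "3 or more" siblings is only an assumption for this example, and the helper names are hypothetical. You would choose your own categories after looking at the frequency distribution.

```python
# Hypothetical counts: key = number of siblings, value = (men, women).
siblings_by_sex = {0: (30, 35), 1: (80, 90), 2: (70, 75),
                   3: (40, 45), 4: (3, 2), 5: (1, 1)}

def min_expected(rows):
    """Smallest expected cell frequency for a table given as a list of rows."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    return min(r * c / n for r in row_totals for c in col_totals)

def recode(table, top_category):
    """Combine every category at or above top_category into a single row."""
    combined = {}
    for value, counts in table.items():
        key = min(value, top_category)
        previous = combined.get(key, (0,) * len(counts))
        combined[key] = tuple(p + c for p, c in zip(previous, counts))
    return combined

before = min_expected(list(siblings_by_sex.values()))
recoded = recode(siblings_by_sex, top_category=3)  # 0, 1, 2, "3 or more"
after = min_expected(list(recoded.values()))

print(f"minimum expected frequency before recoding: {before:.2f}")
print(f"minimum expected frequency after recoding:  {after:.2f}")
```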
Part VI – Now it’s Your Turn Again
Look back at the two tables you ran in Part IV and see if any of your expected frequencies were less than
five. What does that tell you?

[1] The null hypothesis is often called the hypothesis of no difference. We’re saying that there is no
relationship between these two variables. In other words, there’s nothing there.
[2] Unfortunately there is no way to tell SPSS to just give us the “Pearson Chi-Square.”
[3] What do we mean by two-tailed? We’re not predicting the direction of the relationship. We’re not
predicting that men are more likely to think abortion should be legal or that women are more likely. So it’s a
two-tailed test.
[4] Number of siblings is a ratio level variable. You can use Chi Square with ratio level variables but usually
there are better tests. We’re just using this as an example.
STAT13S: Exercise Using SPSS to Explore Correlation
Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA XXXXXXXXXX
Email: XXXXXXXXXX
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav
which is a subset of the XXXXXXXXXX General Social Survey. Some of the variables in the GSS have been recoded
to make them easier to use and some new variables have been created. The data have been weighted
according to the instructions from the National Opinion Research Center. This exercise uses CORRELATE
and COMPARE MEANS in SPSS to explore correlation. A good reference on using SPSS is SPSS for
Windows Version XXXXXXXXXX A Basic Tutorial by Linda Fiddler, John Korey, Edward Nelson (Editor), and Elizabeth
Nelson. The online version of the book is on the Social Science Research and Instructional Council's
Website. You have permission to use this exercise and to revise it to fit your needs. Please send a copy of
any revision to the author. Included with this exercise (as separate files) are more detailed notes to the
instructors, the SPSS syntax necessary to carry out the exercise (SPSS syntax file), and the SPSS output
for the exercise (SPSS output file). Please contact the author for additional information.
I’m attaching the following files.
● Data subset (.sav format)
● Extended notes for instructors (MS Word; .docx format)
● Syntax file (.sps format)
● Output file (.spv format)
● This page (MS Word; .docx format)
Goals of Exercise
The goal of this exercise is to introduce measures of correlation. The exercise also gives you practice
using CORRELATE and COMPARE MEANS in SPSS.
Part I – Scatterplots
We’re going to use the General Social Survey (GSS) for this exercise. The GSS is a national probability
sample of adults in the United States conducted by the National Opinion Research Center (NORC). The
GSS started in XXXXXXXXXX and has been an annual or biennial survey ever since. For this exercise we’re going
to use a subset of the XXXXXXXXXX GSS. Your instructor will tell you how to access this data set which is called
gss14_subset_for_classes_STATISTICS.sav.
In a previous exercise (STAT11S) we considered different measures of association that can be used to
determine the strength of the relationship between two variables that have nominal or ordinal level
measurement (see STAT1S). In this exercise we’re going to look at two different measures that are
appropriate for interval and ratio level variables. The terminology also changes in the sense that we’ll refer
to these measures as correlations rather than measures of association.
Before we look at these measures let’s talk about a type of graph that is used to display the relationship
between two variables called a scatterplot. SPSS refers to it as a Scatter/Dot chart. Click on GRAPH in the
menu bar at the top of the SPSS screen. Click on “Chart Builder” in the dropdown menu. A dialog box will
open up that will ask you to define the level of measurement for each variable and to provide labels for the
values. Click on “OK” since that has been done for you. In the bottom half of the dialog box the “Gallery”
tab should be selected by default. On the left you can choose the type of graph you want to build. Look
down the list and click on “Scatter/Dot.” There are eight different scatterplots that SPSS can create. If you
point your mouse at each of them you will see a label for the scatterplots. The one on the upper left is
called a “Simple Scatter.” Click and drag the icon up to the large box in the upper right of the dialog box.
Now all you have to do is to click and drag the variables you want to the X-Axis and Y-Axis. If you want to
treat one of these variables as independent, then put that variable on the X-Axis and the dependent
variable on the Y-Axis. So that all our scatterplots look the same, let’s put d22_maeduc on the X-Axis and
d24_paeduc on the Y-Axis. Click “OK” and SPSS will display your graph.
Now let’s look for the general pattern to our scatterplot. You see more cases in the upper right and lower
left of the plot and fewer cases in the upper left and lower right. In general, as one of the variables
increases, the other variable tends to increase as well. Moreover, you can imagine drawing a straight line
that represents this relationship. The line would start in the lower left and continue towards the upper right
of the plot. That’s what we call a positive linear relationship. [1] But how strong is the relationship and where
exactly would you draw the straight line? The Pearson Correlation Coefficient will tell us the strength of the
linear relationship and linear regression will show us the straight line that best fits the data points. We’ll talk
about the Pearson Correlation Coefficient in Part III of this exercise and linear regression in exercise
STAT14S.
Part II – Now it’s Your Turn
Use GRAPH in SPSS to create the scatterplot for the years of school completed by the respondent
(d4_educ) and the spouse’s years of school completed (d29_speduc). So that all our plots look the same,
put d29_speduc on the X-Axis and d4_educ on the Y-Axis. Look at your scatterplot and decide if the
scatterplot has a pattern to it. What is that pattern? Do you think it is a linear relationship? Is it a positive
linear or a negative linear relationship?
Part III – Pearson Correlation Coefficient
The Pearson Correlation Coefficient (r) is a numerical value that tells us how strongly related two variables
are. It varies between XXXXXXXXXX and XXXXXXXXXX. The sign indicates the direction of the relationship. A positive value means
that as one variable increases, the other variable also increases while a negative value means that as one
variable increases, the other variable decreases. The closer the value is to +1 or -1, the stronger the linear
relationship and the closer it is to 0, the weaker the linear relationship.
The usual way to interpret the Pearson Coefficient is to square its value. In other words, if r equals .5, then
we square XXXXXXXXXX which gives us XXXXXXXXXX. This is often called the Coefficient of Determination. This means that one
of the variables explains 25% of the variation of the other variable. Since the Pearson Correlation is a
symmetric measure in the sense that neither variable is designated as independent or dependent we could
say that 25% of the variation in the first variable is explained by the second variable or reverse this and say
that 25% of the variation in the second variable is explained by the first variable. It’s important not to read
causality into this statement. We’re not saying that one variable causes the other variable. We’re just
saying that 25% of the variation in one of the variables can be accounted for by the other variable.
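To make the arithmetic concrete, here is a minimal Python sketch that computes Pearson r from its definition and then squares it. The five pairs of values are made up for illustration and the function name is our own. SPSS does this calculation for you.

```python
import math

def pearson_r(x, y):
    """Pearson Correlation Coefficient computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (NOT from the GSS).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2  # the Coefficient of Determination

print(f"r = {r:.3f}")
print(f"r squared = {r_squared:.3f}")
# Swapping the variables gives the same r because r is a symmetric measure.
print(pearson_r(x, y) == pearson_r(y, x))
```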
The Pearson Correlation Coefficient assumes that the relationship between the two variables is linear. This
means that the relationship can be represented by a straight line. In geometric terms, this means that the
slope of the line is the same for every point on that line. Here are some examples of a positive and a
negative linear relationship and an example of the lack of any relationship.
Pearson r would be positive and close to 1 in the left-hand example, negative and close to XXXXXXXXXX in the middle
example, and closer to 0 in the right-hand example. You can search for “free images of a positive linear
relationship” to see more examples of linear relationships.
But what if the relationship is not linear? Search for “free images of a curvilinear relationship” and you’ll see
examples that look like this.
Here the relationship can’t be represented by a straight line. We would need a line with a bend in it to
capture this relationship. While there clearly is a relationship between these two variables, Pearson r would
be closer to XXXXXXXXXX. Pearson r does not measure the strength of a curvilinear relationship; it only measures the
strength of linear relationships.
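Here is a short Python sketch that illustrates this point with made-up numbers. In this hypothetical data set Y is completely determined by X (Y is X squared), so the relationship is as strong as it could be, yet the linear correlation does not pick it up. The function name is our own.

```python
import math

def pearson_r(x, y):
    """Pearson Correlation Coefficient computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical U-shaped (curvilinear) data: y falls and then rises as x increases.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

This is one reason to look at the scatterplot before you interpret r.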
Another way to think of correlation is to say that the Pearson Correlation Coefficient measures the fit of the
line to the data points. If r was equal to +1, then all the data points would fit on the line that has a positive
slope (i.e., starts in the lower left and ends in the upper right). If r was equal to -1, then all the data points
would fit on the line that has a negative slope (i.e., starts in the upper left and ends in the lower right). (See
the diagram above.)
Let’s get the Pearson Coefficient for the two variables in our scatterplot in Part XXXXXXXXXX. See Chapter 7,
Correlation, in the online SPSS book mentioned on page XXXXXXXXXX. Click on Analyze in the menu bar and then click
on CORRELATE. In the dropdown box, click on “Bivariate.” Bivariate just means that you want to compute
a correlation for two variables – d22_maeduc and d24_paeduc. Move these two variables into the
“Variable(s)” box. Make sure that the box for the Pearson Correlation Coefficient is checked which it should
be since this is the default. Notice that the circle for “Two-tailed” is filled in for “Test of Significance.” A
two-tailed significance test is used when you don’t make any prediction as to whether the relationship is
positive or negative. In our case, we would expect that the relationship would be greater than zero (i.e.,
positive) so we would want to use a one-tailed test. Click on the circle for one-tailed to change the
selection. Notice also that “Flag significant correlations” is checked. That means that SPSS will tell you
when a relationship is statistically significant. Now click “OK” and SPSS will display your correlation
coefficient.
You should see four correlations. The correlations in the upper left and lower right will be 1 since the
correlation of any variable with itself will always be XXXXXXXXXX. The correlation in the upper right and lower left will
both be XXXXXXXXXX. That’s because the correlation of variable X with variable Y is the same as the correlation of
variable Y with variable X. Pearson r is a symmetric measure (see STAT11S) meaning that we don’t
designate one of the variables as the dependent variable and the other as the independent variable. Notice
that the Pearson r is statistically significant using a one-tailed test at the XXXXXXXXXX level of significance. A Pearson
r of XXXXXXXXXX is really pretty large. You don’t see r’s that big very often. That’s telling us that the linear
regression line that we’re going to talk about in STAT14S fits the data points reasonably well.
Part IV – Now it’s Your Turn Again
Use CORRELATE in SPSS to get the Pearson Correlation Coefficient for the years of school completed by
the respondent (d4_educ) and the spouse’s years of school completed (d29_speduc). What does this
Pearson Correlation Coefficient tell you about the relationship between these two variables?
Part V – Correlation Matrices
What if you wanted to see the values of r for a set of variables? Let’s think of the four variables in Parts I
through 4 as a set. That means that we want to see the values for r for each pair of variables. This time
move all four of the variables into the “Variable(s)” box (i.e., d4_educ, d22_maeduc, d24_paeduc, and
d29_speduc) and click on “OK.” That would mean we would calculate six coefficients. (Make sure you can
list all six.)
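If you want to check your list, this short Python sketch counts the distinct pairs that can be formed from the four variables. With n variables there are n(n - 1)/2 pairs because r is symmetric and we don’t count a variable paired with itself.

```python
from itertools import combinations

# The four education variables used in this exercise.
variables = ["d4_educ", "d22_maeduc", "d24_paeduc", "d29_speduc"]

# Each unordered pair appears once, since r(X, Y) is the same as r(Y, X).
pairs = list(combinations(variables, 2))

for first, second in pairs:
    print(first, "with", second)
print("number of coefficients:", len(pairs))
```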
What did we learn from these correlations? First, the correlation of any variable with itself is XXXXXXXXXX. Second, the
correlations above the 1’s are the same as the correlations below the 1’s. They’re just the mirror image of
each other. That’s because r is a symmetric measure. Third, all the correlations are fairly large. Fourth, the
largest correlations are between father’s and mother’s education and between the respondent’s education
and the spouse’s education.
Part VI – The Correlation Ratio or Eta-Squared
The Pearson Correlation Coefficient assumes that both variables are interval or ratio variables (see
STAT1S). But what if one of the variables was nominal or ordinal and the other variable was interval or
ratio? This leads us back to one-way analysis of variance which we discussed in exercise STAT8S. Click
on “Analyze” in the menu bar and then on “Compare Means” and finally on “Means.” (See Chapter 6,
one-way analysis of variance, in the online SPSS book mentioned on page XXXXXXXXXX.) Select the variable
tv1_tvhours and move it to the “Dependent List” box. This is the variable for which you are going to
compute means. Then select the variable d3_degree and move it to the “Independent List” box. Notice that
we’re using our independent variable to predict our dependent variable. Now click on “Options” in the
upper-right corner and then check the “Anova table and eta” box. Finally click on “Continue” and then on
“OK.”
The F test in the one-way analysis of variance tells us to reject the null hypothesis that all the population
means are equal. So we know that at least one pair of population means is not equal. But that doesn’t tell
us how strongly related these two variables are. The SPSS output tells us that eta is equal to XXXXXXXXXX and
eta-squared is equal to XXXXXXXXXX. This tells us that 5.1% of the variation in the dependent variable, number of
hours the respondent watches television, can be explained or accounted for by the independent variable,
highest education degree. This doesn’t seem like much but it’s not an atypical outcome for many research
findings.
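Here is a minimal Python sketch of how eta-squared is computed: it is the between-groups sum of squares divided by the total sum of squares. The hours and group labels are made up for illustration (they are not the GSS values) and the function name is our own.

```python
# Hypothetical hours of television per day for three made-up education groups.
groups = {
    "less than high school": [4, 5, 6],
    "high school": [2, 3, 4],
    "bachelor's degree": [1, 2, 3],
}

def eta_squared(groups):
    """Between-groups sum of squares divided by the total sum of squares."""
    all_values = [value for values in groups.values() for value in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((value - grand_mean) ** 2 for value in all_values)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total

eta_sq = eta_squared(groups)
eta = eta_sq ** 0.5

print(f"eta = {eta:.3f}")
print(f"eta-squared = {eta_sq:.3f}")
```

The groups in this made-up example differ far more than real survey groups usually do, so eta-squared comes out much larger than the value you will see in your SPSS output.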
Part VII – Your Turn
In Exercise STAT8S you computed the mean number of hours that respondents watched television
(tv1_tvhours) for each of the nine regions of the country (d25_region). Then you determined if these
differences were statistically significant by carrying out a one-way analysis of variance. Repeat the
one-way analysis of variance but this time focus on eta-squared. What percent of the variation in television
viewing can be explained by the region of the country in which the respondent lived?
[1] This assumes that the variables are coded low to high (or high to low) on both the X-Axis and the Y-Axis.