# ST 502 R project 3 For this project you will work in groups of 2. The project involves involves analyzing a data set using the chi-square test for homogeneity, deriving this LRT, and conducting a...

ST 502 R project 3
For this project you will work in groups of 2. The project involves involves analyzing a data set using the
chi-square test for homogeneity, deriving this LRT, and conducting a Monte Carolo simulation to determine
properties of a similar test. You and your partner will create a final report to turn in.
Please make sure that your R file follows the guidelines on moodle. If these guidelines are not met, you will
lose credit. You can submit a .R or a .Rmd file for the code portion You should submit an HTML or PDF
file for the report portion.
Data Example
Consider three different hospitals. Each hospital has patients that end up with infections. Suppose we have
the following data:
Hospital Surgical Site Infections Pneumonia Infections Bloodstream Infections Total
A XXXXXXXXXX
B XXXXXXXXXX
C XXXXXXXXXX
Here we might consider that for a given hospital, we have a random sample with a fixed number of trials.
This implies that we have three separate multinomial distributions, one for each hospital. We might have an
interest in whether or not the multinomials are homogenous across the hospitals.
Use R to conduct a chi-square test for homogeneity using this data. You’ll need to manually create the table
(probably easiest to just use the matrix() function - leave off the total column). I’d like you to manually
calculate the LRT statistic, the Pearson Chi-square statistic, the critical value, and find approximate p-values
for the hypotheses using both test statistics (they will be very small).
Derivation
Ok, we’ve used the likelihood ratio test. Let’s derive it! Consider the generic case of comparing J independent
multinomial distributions, each with I categories.
Multinomial Cat 1 Cat 2 ... Cat I Sample size
1 π11 π21 ... πI1 n1
2 π12 π22 ... πI2 n2
... ... ... ... ... ...
J π1J π2J ... πIJ nJ
We want to test if
H0 : π11 = π12 = ... = π1J , π21 = π22 = ... = π2J , and , πI1 = πI2 = ... = πIJ
vs
HA : At least some probabilities differ
The likelihood here (in the general case) is just the product of the J multinomials.
Under the null hypothesis, π11 = π12 = ... = π1J so we can just replace this with a common π1. Similarly, we
can just consider having π1, ..., πI .
Derive the likelihood ratio test for the homogeneity test. Remember it should come out to be
LRT = 2
J∑
j=1
I∑
i=1
Obsij ln
(
Obsij
Expij
)
with approximate large sample distribution given by a χ2(I−1)(J−1) and expected cell counts given by n•jni•/n
where n is the total sample size.
1
Note: We did some of the details (like looking at the null max in the notes). You need to derive the test form
first and then reproduce other relevant parts.
Simulation
The Pearson statistic can be derived as a Taylor series approximation to the LRT. For the last part of the
project, we’ll investigate the α control of the Pearson chi-square test and its power (so we don’t have to worry
about the ln(0) that can sometimes pop up for the LRT).
Goal of simulation study:
• Determine how well the asymptotic rejection region performs at controlling α
• Determine the power of the asymptotic test when comparing certain alternative situations
Setup:
• Two multinomial case only, where each multinomial has three categorie
• All combinations of four sample sizes for each multinomial (16 total cases)
n1 = 20, 30, 50, 100
n2 = 20, 30, 50, 100
• Three different probabilities that may generate a particular multinomial:
p1 = 1/3, 1/3, 1/3 (equal)
p2 = 1/10, 3/10, 6/10 (mixed 1)
p3 = 1/10, 1/10, 8/10 (mixed 2)
• Use 50000 randomly created tables (but start with a much smaller number until you get your code
working)
• Add 0.5 to any expected counts that end up being 0 so as to avoid the divide by 0 case
To determine α control, you should generate data where both multinomials come from the same p vector.
This should be done for each of the sample size combinations (16 total situations where both multinomials
are generated from p1 for instance).
In total, you’ll have 48 simulated α values. You should create a plot similar to the one below to summarize.
To inspect the power, we’ll use the same sample sizes and probabilities, but we’ll vary the probabilities used
rather than using the same one for each multinomial.
• Compare Equal vs Mixed 1
2
• Compare Equal vs Mixed 2
• Compare Mixed 1 vs Mixed 2
You should summarize your results into something similar to that below.
Some coding hints:
• rmultinom(1, size, prob) can be used to generate one multinomial sample
• Two calls of this (using appropriate size and prob) would create a single ‘table’ to be analyze
• You can combine the two samples using cbind() and then transpose it using t() to get it in a similar
form to how you analyzed the hospital data
• Calculate the Pearson chi-square value and compare it to the appropriate theoretical cut-off (returns a
TRUE/FALSE)
• If you wrap all of the above into a replicate(N, { code to do above }), you can then just take
the mean of the result to find the approximated alpha or power value
• You can then copy and paste this a bunch of times or wrap that process in a function that allows you
to change n1, n2, prob1, prob2 (corresponding to the two multinomials)
Report
You should then write up all of the above into a coherent report with the following pieces:
• Introduce the general idea of testing for homogeneity in an introduction
• Analyze the dataset given and briefly discuss what the results are
• Derive the homogenity test (you should use math type, latex, markdown, etc. to typeset your math
symbols)
• Describe the simulation you will do. Your report should include the R code in the text or in an appendix.
The plots should be within the text with a brief discussion of the results.
That’s it! You’ve then finished ST 502 - woot
3
ST 502 R project 3
Data Example
Derivation
Simulation
Report
May 06, 2021

### Submit New Assignment

Copy and Paste Your Assignment Here