Please read the assignment description. Programming and writing skills are strongly required.


Assignment description

Deadline: Now ~ 10/25/2021 9:00 PM (you have almost four days to finish).

Please do this work yourself; that is very important to me. Thank you! If you have any questions about the assignment, feel free to contact me via [email protected]. Please explain and write down the solution clearly, step by step.

Note: You may use either Python or SAS for this assignment, depending on your proficiency. (I know Python better, so I would prefer Python if possible.) If you choose SAS, please use SAS Studio (I can only access SAS OnDemand for Academics).

Your first project must address the following sections.
1. Import the data using an appropriate Python/SAS function (explain how to import). Extract the data using the Incidence rate and Mortality rate variables (use the age-adjusted Incidence rate and Mortality rate) for the following cancers:
-Female Breast
-Colon and Rectum
-Kidney
-Leukemia
-Liver and Intrahepatic Bile Duct
-Lung and Bronchus
-Hodgkin Lymphoma
-Melanomas of the Skin
-Ovary
-Prostate
You can download the United States Cancer data from the following web site: Download USCS Data Tables | CDC
2. Filter the data to the 51 states, including DC. (Besides delivering the solution, also explain how to filter.)
3. Explain how the data set is imported and cleaned. Please also explain which folder you chose and where the data set was obtained.
4. Rank all the states on the above cancers using appropriate methods.
5. List the steps required to conduct your analysis.
6. Write the summary of your report (between 4 and 5 pages).
7. Include some of your outputs with your summary.
8. Cover sheet: Name / Date / Name of the project.
Answered 5 days after Oct 22, 2021

Karthi answered on Oct 28 2021
A Data Analysis on United States Cancer Statistics (USCS)
Data
-Data Source: https://www.cdc.gov/cancer/uscs/dataviz/download_data.htm
-Description:
1. Data were collected from 1999 to 2015
2. Contains values for the 50 states and individual counties
3. 24 million cancer cases
4. Contains variables such as age, sex, and race
5. Reported per cancer site or for all sites combined
6. Reported from hospitals, physicians and labs across the U.S. to central cancer registries supported by CDC and NCI
Terms
1. Incidence:
“Total number of new cancer cases diagnosed in a specific year in the population category of interest, divided
by the at-risk population for that category and multiplied by 100,000 (cancers by primary site)”
2. Mortality:
“Total number of cancer deaths during a specific year in the population category of interest, divided by the
at-risk population for that category and multiplied by 100,000”
3. Age-Adjusted Rate:
The number of cases (or deaths) per 100,000 people, age-adjusted to the 2000 U.S. standard population (19 age groups – Census P25–1130)
-Age adjustment ensures that differences in incidence or deaths from one year to another, or between one geographic area and another, are not due to differences in the age distribution of the populations being compared
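As a rough illustration of the age adjustment defined above, the sketch below applies direct standardization in Python (the language the assignment prefers): each age-specific rate is weighted by that age group's share of a standard population. The three age bands, the counts and the standard-population weights are made-up numbers for illustration only, not USCS or Census values, and the function names are hypothetical; the real 2000 U.S. standard uses 19 age groups.

```python
def crude_rate(cases, population, per=100_000):
    """Cases per `per` people, ignoring the age structure."""
    return sum(cases) / sum(population) * per


def age_adjusted_rate(cases, population, std_population, per=100_000):
    """Direct standardization: weight each age-specific rate by the
    standard population's share for that age group."""
    std_total = sum(std_population)
    rate = 0.0
    for c, p, s in zip(cases, population, std_population):
        rate += (c / p) * (s / std_total)
    return rate * per


# Hypothetical data for three age bands (young, middle, old).
std_population = [400, 400, 200]  # assumed standard weights, not Census figures
area_a = {"cases": [10, 50, 200], "population": [50_000, 30_000, 20_000]}
area_b = {"cases": [10, 50, 400], "population": [50_000, 30_000, 40_000]}

for name, area in (("A", area_a), ("B", area_b)):
    crude = crude_rate(area["cases"], area["population"])
    adjusted = age_adjusted_rate(area["cases"], area["population"], std_population)
    print(f"Area {name}: crude {crude:.1f}, age-adjusted {adjusted:.1f}")
```

Both made-up areas have identical age-specific rates, so their age-adjusted rates coincide even though area B's larger older population gives it a higher crude rate.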
Importing the data
library(tidyverse)
library(sf)
library(maps)
library(tmap)
byarea <- read_delim("../FP/BYAREA.txt", delim = "|")
byareaCounty <- read_delim("../FP/BYAREA_COUNTY.txt", delim = "|")
bysite <- read_delim("../FP/BYSITE.txt", delim = "|")
byage <- read_delim("../FP/BYAGE.txt", delim = "|")
## Warning: 17214 parsing failures.
## row col expected actual file
## 220448 YEAR no trailing characters -2015 '../FP/BYAGE.txt'
## 220449 YEAR no trailing characters -2015 '../FP/BYAGE.txt'
## 220450 YEAR no trailing characters -2015 '../FP/BYAGE.txt'
## 220451 YEAR no trailing characters -2015 '../FP/BYAGE.txt'
## 220452 YEAR no trailing characters -2015 '../FP/BYAGE.txt'
## ...... .... ...................... ...... .................
## See problems(...) for more details.
# Importing this dataset gives parsing failures on the "2011-2015" YEAR rows,
# yet this is fine since all observations are still present after the import.
read_csv("../FP/rural.csv") %>%
  select(1, 2, 3, 8) %>%
  slice(1:3142) -> # taking out the last rows
  rural
## Warning: Missing column names filled in: 'X9' [9], 'X10' [10], 'X11' [11],
## 'X12' [12], 'X13' [13], 'X14' [14]
names(rural)[4]<-"percent"
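# Defining NA values: "~", "." and "-" are placeholder codes in these files.
# Character columns are recoded one by one, because dplyr 1.1.0 and later
# no longer accept a whole data frame as the first argument of na_if().
mark_na <- function(df, codes) {
  mutate(df, across(where(is.character), ~ replace(.x, .x %in% codes, NA)))
}
byarea <- mark_na(byarea, c("~", "-"))
byareaCounty <- mark_na(byareaCounty, c("~", ".", "-"))
bysite <- mark_na(bysite, c("~", "."))
byage <- mark_na(byage, c("~", "."))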
# Parsing the numeric columns used in our data exploration
byage %>%
  mutate(COUNT = parse_number(COUNT)) %>%
  mutate(RATE = parse_number(RATE)) ->
  byage
bysite %>%
  mutate(AGE_ADJUSTED_RATE = parse_number(AGE_ADJUSTED_RATE)) ->
  bysite
byareaCounty %>%
  mutate(AGE_ADJUSTED_RATE = parse_number(AGE_ADJUSTED_RATE)) ->
  byareaCounty
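Since the assignment prefers Python, the import and cleaning steps above can be sketched with the standard library alone. This is only an assumed equivalent, not the answer's method: the column names come from the R code above, the sample rows are invented rather than real USCS values, and `read_pipe_delimited` is a hypothetical helper name.

```python
import csv
import io

# Placeholder codes that the R code above treats as missing.
MISSING = {"~", ".", "-"}

# Tiny made-up sample in a pipe-delimited layout; not real USCS values.
SAMPLE = """AREA|EVENT_TYPE|SITE|YEAR|AGE_ADJUSTED_RATE
Alabama|Incidence|Lung and Bronchus|2015|65.1
Alaska|Incidence|Lung and Bronchus|2015|~
District of Columbia|Mortality|Lung and Bronchus|2015|33.0
United States|Incidence|Lung and Bronchus|2015|58.3
"""


def read_pipe_delimited(handle):
    """Read '|'-separated rows, map placeholder codes to None and
    convert AGE_ADJUSTED_RATE to float where a value is present."""
    rows = []
    for row in csv.DictReader(handle, delimiter="|"):
        clean = {k: (None if v in MISSING else v) for k, v in row.items()}
        if clean["AGE_ADJUSTED_RATE"] is not None:
            clean["AGE_ADJUSTED_RATE"] = float(clean["AGE_ADJUSTED_RATE"])
        rows.append(clean)
    return rows


# With a downloaded file, an open(..., newline="") handle would replace StringIO.
rows = read_pipe_delimited(io.StringIO(SAMPLE))
missing = [r["AREA"] for r in rows if r["AGE_ADJUSTED_RATE"] is None]
print(len(rows), "rows read; missing rate for:", missing)
```

Unlike `parse_number()`, `float()` does not strip thousands separators or other stray characters, so real files may need extra cleaning before conversion.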
Hypotheses
Our initial hypotheses were as below:
1. Between 2011-2015 the rate of cancer in rural areas should be lower than in urban areas.
2. The mortality rate of cancer is higher among the elderly than in other age groups.
3. There is an association between the death rate of skin cancer and ethnicity.
4. Males are more prone to new cancers than females.
5. The rate of new cancers should increase over 1999-2015.
HYPOTHESIS 1
Between 2011-2015 the rate of cancer in rural areas should be lower than in urban areas.
Related Article: 1. Zahnd, W. E., James, A. S., Jenkins, W. D., Izadi, S. R., Fogleman, A. J., Steward, D. E., ... Brard, L. (2018). Rural-Urban differences in cancer incidence and trends in the United States. Cancer Epidemiology Biomarkers and Prevention. http://doi.org/10.1158/1055-9965.EPI-17-0430
Summary: The article reports that although the combined incidence rates were higher in urban areas, they also declined faster there than in rural populations. Most of the discrepancy was related to tobacco-associated, HPV-associated, lung and bronchus, cervical, and colorectal cancers across the population groups.
Using the byareaCounty dataframe, which carries an area code for each county, we import another data set from the 2015 GEOID U.S. Census that gives the rural percentage of each county. We join the two dataframes on the area code and fit a linear trend line with geom_smooth. Although the fitted line is sensitive to the spread of the data points, it still shows a trend across counties ranging from 0 (urban) to 100 (rural) percent.
byareaCounty %>%
  mutate(areacode = str_extract(AREA, pattern = "\\d+")) %>%
  full_join(rural, by = c("areacode" = "2015 GEOID")) ->
  byareaRural
byareaRural %>%
  select(STATE, AREA, areacode, percent, AGE_ADJUSTED_RATE, SITE, SEX, RACE, EVENT_TYPE,
         YEAR) %>%
  filter(!is.na(AGE_ADJUSTED_RATE), !is.na(percent), SITE == "All Cancer Sites Combined",
         SEX != "Male and Female") %>%
  ggplot(aes(x = percent, y = AGE_ADJUSTED_RATE)) +
  geom_hex() +
  facet_grid(EVENT_TYPE ~ SEX) +
  scale_y_log10() +
  geom_smooth(method = lm, se = FALSE) +
  theme_bw() +
  labs(title = "U.S. Mortality and Incidence Rate of Cancer",
       y = "Log Scaled - Age Adjusted Rate per 100,000 People",
       x = "Urban < 50 - 50>Rural",
       subtitle = "All Cancer Sites Combined - Males vs. Females - 2011-2015 - Urban to Rural") +
  theme(legend.position = 'bottom')
[Figure: U.S. Mortality and Incidence Rate of Cancer. All Cancer Sites Combined - Males vs. Females - 2011-2015 - Urban to Rural. Panels: Incidence and Mortality (rows) by Female and Male (columns). x-axis: Urban < 50 - 50>Rural (0 to 100); y-axis: Log Scaled - Age Adjusted Rate per 100,000 People; fill: count.]
The fitted lines suggest a slight increase in the age-adjusted rate as the rural percentage rises.
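The join on the area code and the straight-line trend used in this section can be sketched in Python as follows. The county labels, codes, rates and rural percentages are invented for illustration, and `area_code` and `ols_slope` are hypothetical helper names. The sketch keeps only matched counties and fits the line on the raw rate, whereas the plot above draws its line on a log-scaled axis, so the numbers are not comparable with the figure.

```python
import re


def area_code(area):
    """Return the first run of digits in an AREA label, or None."""
    match = re.search(r"\d+", area)
    return match.group(0) if match else None


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up county rows and rural percentages, keyed by a 5-digit code.
county_rates = [
    {"AREA": "County A (01001)", "AGE_ADJUSTED_RATE": 450.0},
    {"AREA": "County B (01003)", "AGE_ADJUSTED_RATE": 470.0},
    {"AREA": "County C (01005)", "AGE_ADJUSTED_RATE": 490.0},
    {"AREA": "County D (99999)", "AGE_ADJUSTED_RATE": 500.0},  # no match below
]
rural_percent = {"01001": 20.0, "01003": 60.0, "01005": 100.0}

# Keep only the rows whose code has a rural percentage (an inner join).
joined = []
for row in county_rates:
    code = area_code(row["AREA"])
    if code in rural_percent:
        joined.append((rural_percent[code], row["AGE_ADJUSTED_RATE"]))

xs = [x for x, _ in joined]
ys = [y for _, y in joined]
print("slope per rural percentage point:", ols_slope(xs, ys))
```

A positive slope corresponds to rates rising with the rural percentage; a hypothesis test or confidence interval would be needed before reading much into a small slope.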
We analyze this further by looking at the rates of lung and bronchus cancer and of colon and rectum cancer. We chose these two cancer sites mainly because we further hypothesize that rural populations may have a higher percentage of smokers and thus a higher rate of lung and bronchus cancer.
byareaRural %>%
  filter(!is.na(AGE_ADJUSTED_RATE), !is.na(percent), SITE == "Lung and Bronchus",
         SEX != "Male and Female") %>%
  ggplot(aes(x = percent, y = AGE_ADJUSTED_RATE)) +
  geom_hex() +
  facet_grid(EVENT_TYPE ~ SEX) +
  scale_y_log10() +
  geom_smooth(method = lm, se = FALSE) +
  theme_bw() +
  labs(title = "U.S. Mortality and Incidence Rate of Cancer",
       y = "Log Scaled - Age Adjusted Rate per 100,000 people",
       x = "Urban < 50 - 50>Rural",
       subtitle = "Lung and Bronchus Cancer - Males vs. Females - 2011-2015 - Urban to Rural") +
  theme(legend.position = 'bottom')
[Figure: U.S. Mortality and Incidence Rate of Cancer. Lung and Bronchus Cancer - Males vs. Females - 2011-2015 - Urban to Rural. Panels: Incidence and Mortality (rows) by Female and Male (columns). x-axis: Urban < 50 - 50>Rural (0 to 100); y-axis: Log Scaled - Age Adjusted Rate per 100,000 people; fill: count.]
As the plot shows, the rate trends upward as the rural percentage rises.
byareaRural %>%
  filter(SITE == "Colon and Rectum", !is.na(AGE_ADJUSTED_RATE), !is.na(percent),
         SEX != "Male and Female") %>%
  ggplot(aes(x = percent, y = AGE_ADJUSTED_RATE)) +
  geom_hex() +
  facet_grid(EVENT_TYPE ~ SEX) +
  scale_y_log10() +
  geom_smooth(method = lm, se = FALSE) +
  theme_bw() +
  labs(title = "U.S. Cancer Mortality and Incidence Rate of Cancer",
       y = "Log Scaled - Age Adjusted Rate per 100,000 People",
       x = "Urban < 50 - 50>Rural",
       subtitle = "Colon and Rectum - Males vs. Females - 2011-2015 - Urban to Rural") +
  theme(legend.position = 'bottom')
[Figure: U.S. Cancer Mortality and Incidence Rate of Cancer. Colon and Rectum - Males vs. Females - 2011-2015 - Urban to Rural. Panels: Incidence and Mortality (rows) by Female and Male (columns). x-axis: Urban < 50 - 50>Rural (0 to 100); y-axis: Log Scaled - Age Adjusted Rate per 100,000 People; fill: count.]
For colon and rectum cancer there is also a slight upward trend as the rural percentage rises, yet the slope is smaller than for lung and bronchus cancer.
HYPOTHESIS 2
The mortality rate of cancer is higher among the elderly than in other age groups.
Related Article: 2. White, M. C., Holman, D. M., Boehm, J. E., Peipins, L. A., Grossman, M., & Henley, S. J. (2014). Age and cancer risk: A potentially modifiable relationship. American Journal of Preventive Medicine. http://doi.org/10.1016/j.amepre.2013.10.029
Summary: After midlife, the frequency of several cancer risk factors and the incidence rate begin to increase.
Using the byage dataframe, we create two groups, elderly and non-elderly, filter for mortality across all cancer sites combined, and draw a boxplot showing the difference in the rate.
# Taking out the 2011-2015 grouping, which is duplicate and not useful
byage %>%
  filter(!is.na(YEAR)) ->
  byage
# Creating a group for the elderlies
byage %>%
  filter(AGE %in% c("65-69", "70-74", "75-79", "80-84", "85+")) %>%
  mutate(Group = "Elderlies") ->
  byage1
# Creating a group for the rest or non-elderly
byage %>%
  filter(AGE %in% c("<1", "1-4", "5-9", "10-14", "15-19", "20-24", "25-29",
                    "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64")) %>%
  mutate(Group = "Non-elderly") ->
  byage2
rbind(byage1, byage2) -> byage3
byage3 %>%
  filter(EVENT_TYPE == "Mortality", SEX == "Male and Female", RACE == "All Races",
         !is.na(RATE), SITE == "All Cancer Sites Combined") %>%
  ggplot(mapping = aes(Group, RATE, fill = Group)) +
  geom_boxplot() +
  theme_bw() +
  labs(title = "U.S. Cancer Mortality Rate of Different Age Groups",
       y = "Mortality Rate",
       x = "Age Group",
       subtitle = "All Cancer Sites - Elderlies vs. Non-elderlies - 1999-2015")
[Figure: U.S. Cancer Mortality Rate of Different Age Groups. All Cancer Sites - Elderlies vs. Non-elderlies - 1999-2015. Boxplot of Mortality Rate (y-axis, 0 to 1500) by Age Group (x-axis: Elderlies, Non-elderly); fill: Group.]
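The grouping behind this boxplot can be sketched in Python as below, assuming the same 65+ cut-off used in the R code. The (AGE, RATE) pairs are invented, and `age_group` is a hypothetical helper; note that it labels every band outside the elderly set as non-elderly, whereas the R code lists the non-elderly bands explicitly.

```python
from statistics import median

ELDERLY_BANDS = {"65-69", "70-74", "75-79", "80-84", "85+"}


def age_group(band):
    """Label an AGE band using the 65+ cut-off from the analysis."""
    return "Elderlies" if band in ELDERLY_BANDS else "Non-elderly"


# Made-up (AGE, RATE) pairs; real values would come from the BYAGE file.
records = [
    ("40-44", 30.0),
    ("60-64", 250.0),
    ("65-69", 400.0),
    ("75-79", 900.0),
    ("85+", 1500.0),
]

# Collect the rates of each group, then summarize with the median,
# the centre line that a boxplot draws.
by_group = {}
for band, rate in records:
    by_group.setdefault(age_group(band), []).append(rate)

medians = {group: median(rates) for group, rates in by_group.items()}
print(medians)
```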
The boxplot clearly shows the difference in mortality between the elderly and the rest. We next measure the death-per-incidence rate of each group, defined as Mortality/Incidence*100. Although the deaths counted in a year do not necessarily stem from the incidences of that same year, the value still gives a useful proportion of deaths to new cases for each year.
#Spreading the count value for incidence and mortality
#in order to...
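The rest of the answer's code is cut off here. As a hedged illustration of the Mortality/Incidence*100 value defined above, and not a reconstruction of the missing R code, the Python sketch below spreads made-up counts into one entry per group and year and then takes the ratio. All numbers and variable names are assumptions.

```python
# Made-up long-format rows: one row per (group, year, event type).
rows = [
    {"Group": "Elderlies", "YEAR": "2015", "EVENT_TYPE": "Incidence", "COUNT": 1000},
    {"Group": "Elderlies", "YEAR": "2015", "EVENT_TYPE": "Mortality", "COUNT": 400},
    {"Group": "Non-elderly", "YEAR": "2015", "EVENT_TYPE": "Incidence", "COUNT": 800},
    {"Group": "Non-elderly", "YEAR": "2015", "EVENT_TYPE": "Mortality", "COUNT": 200},
]

# Spread the counts: one entry per (group, year) holding both event types.
wide = {}
for row in rows:
    key = (row["Group"], row["YEAR"])
    wide.setdefault(key, {})[row["EVENT_TYPE"]] = row["COUNT"]

# Mortality / Incidence * 100, skipping entries that lack either count
# or have a zero incidence count.
death_per_incidence = {
    key: counts["Mortality"] / counts["Incidence"] * 100
    for key, counts in wide.items()
    if "Mortality" in counts and counts.get("Incidence")
}
print(death_per_incidence)
```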