# Assignment 5

Use Python to analyze the data sets associated with the two cases below. When you are done, upload this completed worksheet as a Word document (copy and paste your output and comments where indicated). Also save and upload the .py files that contain your Python code. Your code should contain comments about what it does at each step. You do not have to include your Python code on this worksheet.
Refer to the Python examples that were covered in class exercises for guidance. Some of the code can be copied and used for this assignment, but be careful to make changes where needed. Also refer to the slide sets for additional information that can help you answer the questions.
Scoring for this assignment will be based on correctly constructing the needed Python code and providing the correct results, along with the quality of any comments and written responses. Be sure to follow the instructions exactly and provide all information requested to receive full credit.
Case 1: Which students are choosing to study STEM fields in college? There has been continued interest in understanding what type of students (based on demographic, aptitude, and other measures) plan to pursue undergraduate studies in fields related to science, technology, engineering, and mathematics (STEM). The CSV file “SurveyData” contains selected data collected from 240 college-bound students just after they graduated from high school in the United States. The data set is comprised of the following six variables for each student:
· STEM: Does the student intend to pursue a STEM field of study (0 = No, 1 = Yes)
· GPA: student’s high school grade point average (GPA)
· SAT: student’s SAT scores
· White: student is of European descent (Yes or No)
· Female: student is female (Yes or No)
· Asian: student is of Asian descent (Yes or No)
Some respondents are classified as neither White nor Asian (none are classified as both). Respondents that are not classified as female are male.
1A.    Create a logistic regression model using STEM as the response (y) variable and GPA and SAT as the predictor (x) variables. The model will predict whether a student is expected to pursue a STEM field or not. Divide the data into training and testing data, with the testing data size equal to 20% of all the data (use random_state=101). Create the model using the training data, then generate a set of predictions using the test data. Provide the intercept (b0) and coefficients (b1 and b2) for the model below.
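The workflow in Part 1A can be sketched as follows. This is only an illustration under stated assumptions: the SurveyData file is not reproduced here, so the data frame is synthetic, and `max_iter=1000` is an assumed setting rather than one the assignment specifies. With the real file, `df` would presumably come from `pd.read_csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for SurveyData (assumption: the real CSV has
# columns named STEM, GPA and SAT and is loaded with pd.read_csv).
rng = np.random.default_rng(0)
n = 240
df = pd.DataFrame({
    "GPA": rng.uniform(2.0, 4.0, n).round(2),
    "SAT": rng.integers(900, 1600, n),
})
# Made-up rule so that both classes appear in the fake data.
score = 1.5 * (df["GPA"] - 3.0) + 0.004 * (df["SAT"] - 1250)
df["STEM"] = (score + rng.normal(0, 1, n) > 0).astype(int)

X = df[["GPA", "SAT"]]   # predictors
y = df["STEM"]           # response

# Hold out 20% of the rows for testing, as the assignment requires.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# Fit on the training rows only, then predict the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("b0 (intercept):", model.intercept_[0])
print("b1 (GPA), b2 (SAT):", model.coef_[0])
```

The coefficients printed here describe the synthetic data only; the values to report come from running the same steps on SurveyData.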
1B.    Generate a classification report for the model. Describe how well the model seems to work based on the values (percentages) for the different metrics. Provide the classification report output and your interpretation below.
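A minimal sketch of Part 1B, again on synthetic data (an assumption, since the survey file is not included here). It shows how `classification_report` and `confusion_matrix` are called on the test labels and the predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-column predictor matrix standing in for GPA and SAT.
rng = np.random.default_rng(1)
X = rng.normal(size=(240, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, 240) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)
model = LogisticRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# True labels go first, predicted labels second.
report = classification_report(y_test, predictions)
print(report)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, predictions)
print(matrix)
```

In the report, precision is the share of predicted members of a class that are correct, recall is the share of true members of a class that the model found, and the f1-score is their harmonic mean. The support column gives the number of test rows in each class, which is worth checking before reading much into a percentage.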
1C.    Now create a logistic regression model using STEM as the response (y) variable and all of the other five variables as the predictor (x) variables. (Note: you may want to create a separate Python file to run Parts C and D.) The model will again predict whether a student is expected to pursue a STEM field or not, but the decision is based on more variables. You will need to create dummy variables for the last three variables. Divide the data into training and testing data, with the testing data size equal to 20% of all the data (use random_state=101). Create the model using the training data, then generate a set of predictions using the test data. Provide the intercept (b0) and coefficients (b1 through b5) for the model below. Based on the coefficients, what variables seem to affect the decision to pursue a STEM field the most?
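The dummy-variable step in Part 1C might look like the sketch below. The data frame is synthetic and the STEM column is random (both assumptions made only so the code runs); the point is how `pd.get_dummies` turns each Yes/No column into a single 0/1 column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for SurveyData (assumed column names).
rng = np.random.default_rng(2)
n = 240
group = rng.choice(["White", "Asian", "Other"], n)  # never both at once
df = pd.DataFrame({
    "GPA": rng.uniform(2.0, 4.0, n).round(2),
    "SAT": rng.integers(900, 1600, n),
    "White": np.where(group == "White", "Yes", "No"),
    "Female": rng.choice(["Yes", "No"], n),
    "Asian": np.where(group == "Asian", "Yes", "No"),
    "STEM": rng.integers(0, 2, n),  # random labels, illustration only
})

# drop_first=True drops the "No" level, leaving one 0/1 column
# (White_Yes, Female_Yes, Asian_Yes) per Yes/No variable.
X = pd.get_dummies(
    df[["GPA", "SAT", "White", "Female", "Asian"]],
    columns=["White", "Female", "Asian"],
    drop_first=True,
    dtype=int,
)
y = df["STEM"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)

# Pair each coefficient with its column name for easier reading.
coefficients = pd.Series(model.coef_[0], index=X.columns)
print("b0 (intercept):", model.intercept_[0])
print(coefficients)
```

When comparing coefficient sizes, keep in mind that the predictors are on different scales: a one-unit change in SAT is a single point, while a one-unit change in a dummy variable is the full switch from No to Yes.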
1D.    Generate a classification report for the model. Describe how well the model seems to work based on the values (percentages) for the different metrics. Provide the classification report output and your interpretation below. How does this model compare with the first model?
Case 2: Information Technology (IT) use among countries. The use of IT seems to be linked with the economic and societal development of many countries. The file “InfoTech.xlsx” contains data collected by a non-profit organization that measures the use and impact of IT from about 140 countries. The following variables are provided:
· IU: Individual (personal) usage of IT
· BU: Business usage of IT
· GU: Government usage of IT
Each variable contains a score between 1 (lowest) and 7 (highest).
2A.    Run descriptive statistics and seaborn “pairplot” on the data set. What do these tell you about the different variables (e.g., How are they similar and/or different? Do you notice any correlations?). Provide the output and your interpretation below.
2B.    Create a dendrogram of the data using the Ward method. Label the axes and title as appropriate. Provide the chart below. Comment on how well you think a choice of 4 clusters would work for this data, based solely on the dendrogram.
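One way to build the dendrogram in Part 2B is with SciPy, sketched here on the first twelve rows of the Info Tech table. The axis labels, title and file name are illustrative choices, not wording required by the assignment.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the chart is saved to a PNG file
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# First twelve rows of the Info Tech table (columns IU, BU, GU).
X = np.array([
    [3.5, 3.3, 3.6], [2.7, 2.8, 2.6], [4.8, 3.3, 3.2], [4.0, 3.3, 4.3],
    [6.2, 4.7, 4.9], [5.8, 5.5, 4.7], [4.7, 3.6, 4.6], [6.2, 3.9, 5.6],
    [2.0, 3.0, 3.7], [5.9, 5.1, 4.5], [2.1, 3.4, 2.7], [2.8, 3.1, 3.5],
])

# Ward linkage merges, at each step, the two clusters whose union
# gives the smallest increase in within-cluster variance.
Z = linkage(X, method="ward")

fig, ax = plt.subplots(figsize=(8, 5))
dendrogram(Z, ax=ax)
ax.set_title("Dendrogram of IT usage (Ward method)")
ax.set_xlabel("Countries (row index)")
ax.set_ylabel("Ward linkage distance")
fig.savefig("dendrogram.png")  # hypothetical output file name
```

To judge a choice of 4 clusters, imagine a horizontal line that crosses exactly four vertical branches. If that line can move up and down a long way without crossing a merge, the four groups are well separated; if the gap is short, a different number of clusters may fit better.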
2C.    Now use Hierarchical clustering (agglomerative) to model the data. Use 4 clusters and Euclidean distances. How did the clustering group the countries? To answer this, generate two scatterplots, both with BU on the y-axis (since we have 3 variables, we need to think in three dimensions). Provide your charts and answer below.
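A possible sketch of Part 2C with scikit-learn, using the first fifteen rows of the Info Tech table. Ward linkage is assumed here to match Part 2B; in scikit-learn it always uses Euclidean distances, so the distance parameter (named `metric` in recent releases and `affinity` in older ones) is left at its default.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the chart is saved to a PNG file
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# First fifteen rows of the Info Tech table.
rows = [
    (3.5, 3.3, 3.6), (2.7, 2.8, 2.6), (4.8, 3.3, 3.2), (4.0, 3.3, 4.3),
    (6.2, 4.7, 4.9), (5.8, 5.5, 4.7), (4.7, 3.6, 4.6), (6.2, 3.9, 5.6),
    (2.0, 3.0, 3.7), (5.9, 5.1, 4.5), (2.1, 3.4, 2.7), (2.8, 3.1, 3.5),
    (2.9, 2.9, 3.2), (3.9, 3.0, 2.5), (3.1, 3.3, 3.5),
]
df = pd.DataFrame(rows, columns=["IU", "BU", "GU"])

# Ward linkage implies Euclidean distances.
hc = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = hc.fit_predict(df)

# Two views of the same clustering, both with BU on the y-axis.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, x_name in zip(axes, ["IU", "GU"]):
    ax.scatter(df[x_name], df["BU"], c=labels, cmap="viridis")
    ax.set_xlabel(x_name)
    ax.set_title(f"BU vs {x_name} (hierarchical, 4 clusters)")
axes[0].set_ylabel("BU")
fig.savefig("hierarchical_clusters.png")  # hypothetical output file name
```

Plotting IU and GU separately against BU gives two flat projections of the three-dimensional point cloud, which is usually easier to paste into a worksheet than a 3D plot.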
2D.    Generate an “elbow” chart for the data set. Label the axes and title as appropriate. Paste the chart below. Comment on how well you think a choice of 4 clusters would work for this data, based solely on the chart.
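The elbow chart in Part 2D plots the within-cluster sum of squares against the number of clusters. The sketch below assumes k-means inertia is the quantity plotted and tries k from 1 to 8 on the first fifteen rows of the Info Tech table; the range of k, `n_init=10` and the file name are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the chart is saved to a PNG file
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# First fifteen rows of the Info Tech table (columns IU, BU, GU).
X = np.array([
    [3.5, 3.3, 3.6], [2.7, 2.8, 2.6], [4.8, 3.3, 3.2], [4.0, 3.3, 4.3],
    [6.2, 4.7, 4.9], [5.8, 5.5, 4.7], [4.7, 3.6, 4.6], [6.2, 3.9, 5.6],
    [2.0, 3.0, 3.7], [5.9, 5.1, 4.5], [2.1, 3.4, 2.7], [2.8, 3.1, 3.5],
    [2.9, 2.9, 3.2], [3.9, 3.0, 2.5], [3.1, 3.3, 3.5],
])

ks = list(range(1, 9))
wcss = []  # within-cluster sum of squares for each k
for k in ks:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    wcss.append(km.inertia_)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(ks, wcss, marker="o")
ax.set_title("Elbow chart for IT usage data")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Within-cluster sum of squares")
fig.savefig("elbow.png")  # hypothetical output file name
```

The "elbow" is the value of k after which adding another cluster reduces the sum of squares only slightly. If the curve flattens near k = 4, that supports the choice of 4 clusters; if it bends earlier or keeps dropping steeply, the chart points elsewhere.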
2E.    Now use k-means clustering to model the data. Use 4 clusters and random state = 0. Again, create two scatterplots similar to those done in Part C. How do the results from hierarchical clustering and k-means clustering compare to each other? Provide your charts and answer below.
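Part 2E can be sketched in the same way, with a cross-tabulation added to compare the k-means labels against the hierarchical ones. The cross-tabulation and the adjusted Rand index are suggested comparison tools, not steps the assignment requires, and the data is again the first fifteen rows of the Info Tech table.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the chart is saved to a PNG file
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

rows = [
    (3.5, 3.3, 3.6), (2.7, 2.8, 2.6), (4.8, 3.3, 3.2), (4.0, 3.3, 4.3),
    (6.2, 4.7, 4.9), (5.8, 5.5, 4.7), (4.7, 3.6, 4.6), (6.2, 3.9, 5.6),
    (2.0, 3.0, 3.7), (5.9, 5.1, 4.5), (2.1, 3.4, 2.7), (2.8, 3.1, 3.5),
    (2.9, 2.9, 3.2), (3.9, 3.0, 2.5), (3.1, 3.3, 3.5),
]
df = pd.DataFrame(rows, columns=["IU", "BU", "GU"])

km_labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(df)
hc_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(df)

# Same two views as in Part C, coloured by the k-means clusters.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, x_name in zip(axes, ["IU", "GU"]):
    ax.scatter(df[x_name], df["BU"], c=km_labels, cmap="viridis")
    ax.set_xlabel(x_name)
    ax.set_title(f"BU vs {x_name} (k-means, 4 clusters)")
axes[0].set_ylabel("BU")
fig.savefig("kmeans_clusters.png")  # hypothetical output file name

# Cluster numbers are arbitrary, so compare memberships, not label values.
table = pd.crosstab(hc_labels, km_labels,
                    rownames=["hierarchical"], colnames=["k-means"])
print(table)

# 1.0 means identical groupings; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(hc_labels, km_labels)
print("Adjusted Rand index:", ari)
```

If each row of the cross-tabulation has most of its count in a single column, the two methods grouped the countries in much the same way, even when the cluster numbers differ.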

Info Tech
IU    BU    GU
3.5    3.3    3.6
2.7    2.8    2.6
4.8    3.3    3.2
4    3.3    4.3
6.2    4.7    4.9
5.8    5.5    4.7
4.7    3.6    4.6
6.2    3.9    5.6
2    3    3.7
5.9    5.1    4.5
2.1    3.4    2.7
2.8    3.1    3.5
2.9    2.9    3.2
3.9    3    2.5
3.1    3.3    3.5
4.7    3.6    3.5
4.9    3.4    3.2
1.2    2.4    2.3
2.7    3.2    2.9
1.9    3.5    3.2
5.6    4.8    5
3.6    3.3    3.5
1.2    2.5    2.5
4.8    3.8    4.5
3.8    3.8    4.5
5.9    5.4    4.9
4    3.4    4.7
4.7    3.9    4
2.5    3.5    3.6
5.1    3.3    3.4
4.8    3.7    3.6
5.7    4.2    3.3
6.8    5.6    4.6
3.1    3.4    3.4
3.4    3.4    3.8
3.7    2.9    3.7
3.2    3.4    3.5
6.2    4.3    5.3
1.5    2.9    3.7
6.5    5.7    4.9
5.9    4.9    5.2
2.4    3.1    2.8
2.5    3.4    3.6
4    3.1    4
6.1    5.7    4.7
3.4    3.4    3.3
4.8    3.4    3.4
2.7    3.8    2.8
1.7    2.7    2.4
2.6    3.4    3.3
1.7    2.7    2.1
2.7    3.8    3.6
6.2    4.8    4.6
5.2    3.5    3.7
6.5    5    4.6
2    3.5    4
3.2    4    3.8
3.2    3    3.4
5.8    4.8    4.8
5.5    5.7    5.2
5.4    3.7    3.9
3.4    3.6    3.5
6.3    5.8    5.3
4    3.8    4.3
4.7    3.5    4.7
2.5    3.8    4.3
6.4    5.3    5.5
5.5    3.5    3.6
3.4    3.1    2.9
1.9    3.3    3.2
5.4    4    4.2
5    3.3    2.8
2    3    2.8
1.7    3.1    2.8
5.4    4.2    4.6
6.7    5.3    5.3
4.9    3.3    4
1.5    3.3    2.7
1.4    3    2.7
5    4.6    5.4
2.4    3    3.1
5.8    3.9    4.2
2.1    2.7    2.4
4.2    3.7    4.2
3.5    3.5    4.1
4.2    3.1    3.8
3.6    3.6    4.1
4.5    3.3    4.1
4.1    3.2    4.5
1.8    3.1    3.2
1.7    2.5    2.2
2.9    3.6    3.4
2.1    2.9    2.6
6.5    5.7    5.3
6    4.9    5.3
2.4    2.9    2.2
2.4    3.4    3.2
6.6    5.4    5.1
5.2    3.3    4.6
2    3.1    3.2
3.9    3.9    4
3    3    2.6
3.1    3.3    3.6
3.7    3.9    3.9
5.2    3.5    3.5
5    4.1    4.7
5.9    4.7    5.4
4.6    3.5    3.4
5.2    3.5    4.3
1.8    3.6    5.2
5.9    3.8    5.3
2.5    3.7    3.7
4.8    3    3.2
4.2    3.5    3.6
6.3    5.3    6.2
5.5    3.8    3.6
5.3    4.2    3.5
3.8    4.1    3.2
5.5    3.8    4.6
2.7    3.8    4.9
2.3    3.1    2.6
6.6    5.9    4.9
6.5    6    4.4
2.2    3.3    3
1.6    3    3.3
4.2    3.8    3.7
4.6    3.4    3.4
3.8    3.2    4
4.2    3.7    4
1.8    3.2    3.3
3.8    3.5    3
6.1    4.5    6.1
6.5    5.1    5.3
6.1    5.8    5.3
5.1    3.3    4.7
3.8    2.9    2.9
3.5    3.4    3.9
1.9    3.5    3.2
2.4    3    2.8

## Solution

Sathishkumar answered on Apr 11 2022