PYTHON ASSIGNMENT Send code as well as paste plots generated in a word, also explain what it concludes Let us use data analytical skills to determine which factors contribute to higher medical costs....

Python programming using any of these: matplotlib, pyplot, numpy, pandas


PYTHON ASSIGNMENT
Send the code, paste the generated plots into a Word document, and explain what they conclude.

Let us use data analytical skills to determine which factors contribute to higher medical costs. The insurance.csv dataset is related to individual medical costs billed by health insurance companies. It also includes some personal information. Use any of these: matplotlib, pyplot, numpy, pandas.

Assignment Data Description
· age: age of primary beneficiary
· sex: insurance contractor gender, 1 (female), 0 (male)
· bmi: body mass index, an objective index of body weight (kg/m^2) based on the ratio of height to weight, indicating weights that are relatively high or low for a given height; ideally 18.5 to 24.9
· children: number of children covered by health insurance / number of dependents
· smoker: 1 (smoking), 0 (non-smoking)
· region: the beneficiary's residential area in the US, 0 (southwest), 1 (southeast), 2 (northwest), 3 (northeast)
· charges: individual medical costs billed by health insurance

Questions
1. We will examine if bmi has an impact on the medical costs. Put the bmi on the x-axis. The color of each point will be set according to whether the patient is a smoker. Set the transparency to be 0.7. Be sure to include the colorbar, and set appropriate labels for x-axis, y-axis and the colorbar. What business insights can you get?
2. We further compare the distribution of the medical costs of smokers and that of non-smokers. Plot the distribution of medical costs of smokers first. Then on the same figure, plot the distribution of medical costs of non-smokers and set the transparency to 0.6. The number of bins is 12 for both plots. Set appropriate labels and legends.
3. We study whether age is an important factor by comparing the distribution of medical costs of young people and that of elder people. On the same plot, generate a histogram of medical costs of patients younger than 40 years old, and then another histogram representing the rest of the patients. Set the transparency of the second histogram to 0.7. The number of bins is 15 for both histograms. Set appropriate labels and legends. What can you conclude from this figure?
4. Open-ended question. Now it is your turn to discover something interesting and valuable! What else can you conclude from this dataset using the data visualization skills we learnt? Generate two more figures and explain your findings.

PART 2 of Assignment
Visualization Practice: Bike Sharing Systems

Bike sharing systems are a new generation of traditional bike rentals where the whole process of membership, rental and return has become automatic. Through these systems, a user is able to easily rent a bike from a particular position and return it at another position. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

Data Description: We will be using the daily version of the Capital Bikeshare System dataset from the UCI Machine Learning Repository. This data set contains information about the daily count of bike rental checkouts in Washington, D.C.’s bikeshare program between 2011 and 2012. It also includes information about the weather and seasonal/temporal features for that day (like whether it was a weekday).
• day: Day of the record (relative to day 1: 2011-01-01)
• season: Season (1: winter, 2: spring, 3: summer, 4: fall)
• weekday: Day of the week (0 = Sunday, 6 = Saturday)
• workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
• weathersit:
  1: Clear, Few clouds, Partly cloudy
  2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
• temp: Normalized temperature in Celsius
• windspeed: Normalized wind speed
• casual: Count of checkouts by casual/non-registered users
• registered: Count of checkouts by registered users
• cnt: Total checkouts

import pandas as pd
daily = pd.read_csv('day.csv')
daily.head()

Questions:
1. Understand Trends. Generate a line chart to show the checkouts over time by using day column as the x-axis and cnt column as the y-axis. Label the x-axis as ‘Day’, and y-axis as ‘Check Outs’. What can you conclude?
2. Explore Relationships. We will plot the daily count of bikes that were checked out by casual/non-registered users against the temperature. Color the points to be ‘#539cab’. Set the transparency to be 0.7. Be sure to include appropriate labels for x-axis and y-axis. What insight can you get?
3. Explore Relationships with Multidimensional Information. We will plot the daily count of bikes that were checked out by casual/non-registered users against the temperature. The color of each point will be set according to whether it is a working day. Set the transparency to be 0.7. Be sure to include appropriate labels for x-axis and y-axis. Change the legend of the color bar to whether it is a working day. What additional insights can you get?
4. Examine Distributions. Let’s first build a histogram of the registered bike checkouts with the number of bins as 10. Set appropriate labels. Also set the title to be “Distribution of Registered Check Outs”.
5. Compare Distributions. We now compare the distributions of registered and casual checkouts. To make the figure easy to understand, in addition to the histogram we made for the previous question, we will set the transparency of the casual one to 0.8 and the number of bins to 5. Set appropriate labels.
6. How do the temperatures change across the seasons? You need to choose the type of visualization that best serves this purpose. What are the mean and median temperatures?
7. What else can you conclude from this dataset by using various data exploration?

insurance.csv (excerpt as pasted; the last row is cut off):
age,sex,bmi,children,smoker,region,charges
19,1,27.9,0,1,0,16884.924
18,0,33.77,1,0,1,1725.5523
28,0,33,3,0,1,4449.462
33,0,22.705,0,0,2,21984.47061
32,0,28.88,0,0,2,3866.8552
31,1,25.74,0,0,1,3756.6216
46,1,33.44,1,0,1,8240.5896
37,1,27.74,3,0,2,7281.5056
37,0,29.83,2,0,3,6406.4107
60,1,25.84,0,0,2,28923.13692
25,0,26.22,0,0,3,2721.3208
62,1,26.29,0,1,1,27808.7251
23,0,34.4,0,0,0,1826.843
56,1,39.82,0,0,1,11090.7178
27,0,42.13,0,1,1,39611.7577
19,0,24.6,1,0,0,1837.237
52,1,30.78,1,0,3,10797.3362
23,0,23.845,0,0,3,2395.17155
56,0,40.3,0,0,0,10602.385
30,0,35.3,0,1,0,36837.467
60,1,36.005,0,0,3,13228.84695
30,1,32.4,1,0,0,4149.736
18,0,34.1,0,0,1,1137.011
34,1,31.92,1,1,3,37701.8768
37,0,28.025,2,0,2,6203.90175
59,1,27.72,3,0,1,14001.1338
63,1,23.085,0,0,3,14451.83515
55,1,32.775,2,0,2,12268.63225
23,0,17.385,1,0,2,2775.19215
31,0,36.3,2,1,0,38711
22,0,35.6,0,1,0,35585.576
18,1,26.315,0,0,3,2198.18985
19,1,28.6,5,0,0,4687.797
63,0,28.31,0,0,2,13770.0979
28,0,36.4,1,1,0,51194.55914
19,0,20.425,0,0,2,1625.43375
62,1,32.965,3,0,2,15612.19335
26,0,20.8,0,0,0,2302.3
35,0,36.67,1,1,3,39774.2763
60,0,39.9,0,1,0,48173.361
24,1,26.6,0,0,3,3046.062
31,1,36.63,2,0,1,4949.7587
41,0,21.78,1,0,1,6272.4772
37,1,30.8,2,0,1,6313.759
38,0,37.05,1,0,3,6079.6715
55,0,37.3,0,0,0,20630.28351
18,1,38.665,2,0,3,3393.35635
28,1,34.77,0,0,2,3556.9223
60,1,24.53,0,0,1,12629.8967
36,0,35.2,1,1,1,38709.176
18,1,35.625,0,0,3,2211.13075
21,1,33.63,2,0,2,3579.8287
48,0,28,1,1,0,23568.272
36,0,34.43,0,1,1,37742.5757
40,1,28.69,3,0,2,8059.6791
58,0,36.955,2,1,2,47496.49445
58,1,31.825,2,0,3,13607.36875
18,0,31.68,2,1,1,34303.1672
53,1,22.88,1,1,1,23244.7902
34,1,37.335,2,0,2,5989.52365
43,0,27.36,3,0,3,8606.2174
25,0,33.66,4,0,1,4504.6624
64,0,24.7,1,0,2,30166.61817
28,1,25.935,1,0,2,4133.64165
20,1,22.42,0,1,2,14711.7438
19,1,28.9,0,0,0,1743.214
61,1,39.1,2,0,0,14235.072
40,0,26.315,1,0,2,6389.37785
40,1,36.19,0,0,1,5920.1041
28,0,23.98,3,1,1,17663.1442
27,1,24.75,0,1,1,16577.7795
31,0,28.5,5,0,3,6799.458
53,1,28.1,3,0,0,11741.726
58,0,32.01,1,0,1,11946.6259
44,0,27.4,2,0,0,7726.854
57,0,34.01,0,0,2,11356.6609
29,1,29.59,1,0,1,3947.4131
21,0,35.53,0,0,1,1532.4697
22,1,39.805,0,0,3,2755.02095
41,1,32.965,0,0,2,6571.02435
31,0,26.885,1,0,3,4441.21315
45,1,38.285,0,0,3,7935.29115
22,0,37.62,1,1,1,37165.1638
48,1,41.23,4,0,2,11033.6617
37,1,34.8,2,1,0,39836.519
45,0,22.895,2,1,2,21098.55405
57,1,31.16,0,1,2,43578.9394
56,1,27.2,0,0,0,11073.176
46,1,27.74,0,0,2,8026.6666
55,1,26.98,0,0,2,11082.5772
21,1,39.49,0,0,1,2026.9741
53,1,24.795,1,0,2,10942.13205
59,0,29.83,3,1,3,30184.9367
35,0,34.77,2,0,2,5729.0053
64,1,31.3,2,1,0,47291.055
28,1,37.62,1,0,1,3766.8838
54,1,30.8,3
Answered Same Day, Jul 22, 2021


Rajashekar answered on Jul 22 2021
PYTHON ASSIGNMENT
1. Insurance Dataset
Use data analytical skills to determine which factors contribute to higher medical costs. The insurance.csv dataset is related to individual medical costs billed by health insurance companies. It also includes some personal information.
1.1. Questions
1. We will examine if bmi has an impact on the medical costs. Put the bmi on the x-axis. The color of each point will be set according to whether the patient is a smoker. Set the transparency to be 0.7. Be sure to include the color bar, and set appropriate labels for x-axis, y-axis and the color bar. What business insights can you get?
Insights
The most obvious trend is that non-smokers accumulate lower charges on average than smokers.
Other insights that can be derived from the plot are as follows:
1. Most non-smokers incur charges of no more than 15000, with a few outliers that do not exceed 40000.
2. The BMI of non-smokers is spread fairly evenly between 20 and 40.
3. Smokers with a BMI between 15 and 30 incur higher charges than non-smokers, ranging between 15000 and 30000.
4. A significant number of smokers with a BMI between 30 and 40 incur the highest charges, ranging from 30000 to 50000.
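The plot behind these insights is not reproduced on this page, so here is a minimal sketch of how question 1 could be drawn. It assumes the column names from the data description and uses six rows copied from the pasted excerpt in place of the full file; the colormap and label wording are my own choices, not the answerer's.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Six rows copied from the insurance.csv excerpt in the question.
# With the real file this would be: df = pd.read_csv("insurance.csv")
sample = io.StringIO("""age,sex,bmi,children,smoker,region,charges
19,1,27.9,0,1,0,16884.924
18,0,33.77,1,0,1,1725.5523
28,0,33,3,0,1,4449.462
27,0,42.13,0,1,1,39611.7577
30,0,35.3,0,1,0,36837.467
23,0,17.385,1,0,2,2775.19215
""")
df = pd.read_csv(sample)

fig, ax = plt.subplots()
# c= maps the 0/1 smoker flag onto the colormap; alpha=0.7 is the required transparency.
points = ax.scatter(df["bmi"], df["charges"], c=df["smoker"], cmap="coolwarm", alpha=0.7)
cbar = fig.colorbar(points, ax=ax, ticks=[0, 1])
cbar.set_label("Smoker (0 = no, 1 = yes)")
ax.set_xlabel("BMI (kg/m^2)")
ax.set_ylabel("Charges")
# plt.show() would display the figure in an interactive session.
```

With only six rows the figure is not meaningful; the numbered insights can only be checked against the full dataset.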
2. We further compare the distribution of the medical costs of smokers and that of non-smokers. Plot the distribution of medical costs of smokers first. Then on the same figure, plot the distribution of medical costs of non-smokers and set the transparency to 0.6. The number of bins is 12 for both plots. Set appropriate labels and legends.
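No figure or insight for this question survives on the page. A minimal sketch of the two overlaid histograms, assuming the `smoker` and `charges` columns from the data description and using ten rows taken from the pasted excerpt, might look like this:

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# smoker and charges values of ten rows from the insurance.csv excerpt in the question.
sample = io.StringIO("""smoker,charges
1,16884.924
0,1725.5523
0,4449.462
0,21984.47061
1,27808.7251
1,39611.7577
0,1837.237
1,36837.467
0,13228.84695
1,37701.8768
""")
df = pd.read_csv(sample)

smokers = df.loc[df["smoker"] == 1, "charges"]
non_smokers = df.loc[df["smoker"] == 0, "charges"]

fig, ax = plt.subplots()
# Smokers first, then non-smokers on the same axes with alpha=0.6, 12 bins each.
n_smokers, _, _ = ax.hist(smokers, bins=12, label="Smokers")
n_non_smokers, _, _ = ax.hist(non_smokers, bins=12, alpha=0.6, label="Non-smokers")
ax.set_xlabel("Charges")
ax.set_ylabel("Number of patients")
ax.legend()
```

Each call computes its own 12 bin edges from its own data; passing a shared array of edges to `bins` is an alternative if the bars should line up.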
3. We study whether age is an important factor by comparing the distribution of medical costs of young people and that of elder people. On the same plot, generate a histogram of medical costs of patients younger than 40 years old, and then another histogram representing the rest of the patients. Set the transparency of the second histogram to 0.7. The number of bins is 15 for both histograms. Set appropriate labels and legends. What can you conclude from this figure?
Insights
1. The majority of young patients incur low charges, which suggests better health at a lower age. Most young patients incur charges below 10000, with a small share incurring between 20000 and 35000.
2. Compared with young patients, far fewer of the other patients have charges around 10000 (200-250 compared with 350 young patients). These patients incur higher costs on average than younger patients, with the highest costs exceeding 60000.
3. The costs incurred increase with age.
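One way to produce the age comparison, and to back the visual impression with a number, is sketched below. It assumes the `age` and `charges` columns from the data description and uses ten rows from the pasted excerpt; the median comparison is my addition and is not part of the original answer.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# age and charges values of ten rows from the insurance.csv excerpt in the question.
sample = io.StringIO("""age,charges
19,16884.924
18,1725.5523
28,4449.462
33,21984.47061
46,8240.5896
60,28923.13692
62,27808.7251
56,11090.7178
52,10797.3362
23,2395.17155
""")
df = pd.read_csv(sample)

younger = df.loc[df["age"] < 40, "charges"]   # patients younger than 40
older = df.loc[df["age"] >= 40, "charges"]    # the rest of the patients

fig, ax = plt.subplots()
ax.hist(younger, bins=15, label="Younger than 40")
ax.hist(older, bins=15, alpha=0.7, label="40 and older")
ax.set_xlabel("Charges")
ax.set_ylabel("Number of patients")
ax.legend()

# A numeric companion to the histogram: median charges per age group.
median_younger = younger.median()
median_older = older.median()
```

On the real file, `median_younger` and `median_older` give a quick check of whether older patients do pay more on typical bills, independent of the outliers visible in the histogram.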
4. Open-ended question. Now it is your turn to discover something interesting and valuable! What else can you conclude from this dataset using the data visualization skills we learnt? Generate two more figures and explain your findings.
Insights
When we compare how charges are distributed for male and female patients, the distributions are mostly similar, with a higher number of male patients incurring larger charges between 30000 and 50000. This suggests that charges are determined mostly by other factors, such as age and smoking, as explored earlier.
Insights
1. Women aged between 30 and 53 incur the highest charges.
2. Men aged between 43 and 52 incur the highest charges.
3. This suggests that women are admitted to the hospital over a wider range of ages than men.
Insights
The southeast region has the highest number of smokers and consequently incurs the highest charges. Non-smokers are evenly distributed across all regions, with the southwest region having more non-smokers than smokers.
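A regional smoker count like the one described here can be tabulated before plotting. The sketch below assumes the region coding from the data description (0 southwest, 1 southeast, 2 northwest, 3 northeast) and uses the `smoker` and `region` values of ten rows from the pasted excerpt; `pd.crosstab` is one possible tool, not necessarily what the answerer used.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# smoker and region values of ten rows from the insurance.csv excerpt in the question.
sample = io.StringIO("""smoker,region
1,0
0,1
0,1
0,2
0,2
0,1
1,1
1,1
1,0
1,3
""")
df = pd.read_csv(sample)

# Coding taken from the assignment's data description.
region_names = {0: "southwest", 1: "southeast", 2: "northwest", 3: "northeast"}
smoker_names = {0: "Non-smoker", 1: "Smoker"}

# Rows: region, columns: smoker status, cells: number of patients.
counts = pd.crosstab(df["region"].map(region_names), df["smoker"].map(smoker_names))

fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)  # grouped bars, one group per region
ax.set_xlabel("Region")
ax.set_ylabel("Number of patients")
```

Ten rows say nothing about the real regional split; the table is only useful once it is built from the full insurance.csv.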
2. Bike Rental Dataset
We use the daily version of the Capital Bikeshare System dataset from the UCI Machine Learning Repository. This data set contains information about the daily count of bike rental checkouts in Washington, D.C.’s bikeshare program between 2011 and 2012. It also includes information about the weather and seasonal/temporal features for that day (like whether it was a weekday).
2.1. Questions
1. Understand Trends. Generate a line chart to show the checkouts over time by using day column as the x-axis and cnt column as the y-axis. Label the x-axis as ‘Day’, and y-axis as ‘Check Outs’. What can you conclude?
Insights
1. The general trend for both years shows that the number of checkouts steadily increases over the year until it peaks mid-year.
2. Checkouts then steadily decrease until the end of the year and pick up again as the next year progresses, following the same trend as the previous year.
3. The overall number of checkouts increases significantly in the second year.
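The line chart for this question could be drawn roughly as follows. day.csv is not included on this page, so the sketch generates synthetic `day` and `cnt` values, shaped to have a seasonal swing and a slow rise purely for illustration; the 30-day rolling mean is an optional extra of mine that makes a trend easier to read.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# SYNTHETIC stand-in for day.csv (assumption); with the real file:
# daily = pd.read_csv("day.csv")
rng = np.random.default_rng(0)
day = np.arange(1, 731)
cnt = (3000 + 2 * day
       + 1500 * np.sin(2 * np.pi * (day - 100) / 365)
       + rng.normal(0, 300, day.size)).clip(min=0).round()
daily = pd.DataFrame({"day": day, "cnt": cnt})

# Centered 30-day rolling mean; the first and last days have no full window (NaN).
trend = daily["cnt"].rolling(30, center=True).mean()

fig, ax = plt.subplots()
(line,) = ax.plot(daily["day"], daily["cnt"], label="Daily checkouts")
ax.plot(daily["day"], trend, label="30-day rolling mean")
ax.set_xlabel("Day")
ax.set_ylabel("Check Outs")
ax.legend()
```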
2. Explore Relationships. We will plot the daily count of bikes that were checked out by casual/non-registered users against the temperature. Color the points to be ‘#539cab’. Set the transparency to be 0.7. Be sure to include appropriate labels for x-axis and y-axis. What insight can you get?
Insights
1. People rent fewer bikes when it is colder, as is evident from the graph, which shows checkouts ranging from 1500 to 4000 up to a temp of 0.4. Other factors such as road conditions, body temperature, etc. may also play a role.
2. The highest number of checkouts occurs at mild temperatures, with the majority ranging between 4000 and 8000.
3. As the temperature increases further, checkouts decrease but remain significantly higher than at lower temperatures.
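For the casual-versus-temperature scatter, a sketch under the same caveat (day.csv is unavailable here, so the `temp` and `casual` values below are synthetic and built to rise with temperature) could be:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# SYNTHETIC stand-in for day.csv (assumption): casual counts that rise with temp plus noise.
rng = np.random.default_rng(1)
temp = rng.uniform(0.05, 0.85, size=200)
casual = (2000 * temp + rng.normal(0, 200, size=200)).clip(min=0)
daily = pd.DataFrame({"temp": temp, "casual": casual})

fig, ax = plt.subplots()
# A single fixed color for all points, as the question asks, with alpha=0.7.
points = ax.scatter(daily["temp"], daily["casual"], color="#539cab", alpha=0.7)
ax.set_xlabel("Normalized temperature")
ax.set_ylabel("Casual check outs")

# Pearson correlation as a one-number summary of the relationship.
r = daily["temp"].corr(daily["casual"])
```

Note that the y-axis here is the `casual` column; reading value ranges off the real plot is only meaningful if the same column is plotted.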
3. Explore Relationships with Multidimensional Information. We will plot the daily count of bikes that were checked out by casual/non-registered users against the temperature. The color of each point will be set according to whether it is a working day. Set the transparency to be 0.7. Be sure to include appropriate labels for x-axis and y-axis. Change the legend of the color bar to whether it is a working day. What additional insights can you get?
Insights
People rent bikes on working days much more than on non-working days. This suggests that people use bikes to travel to work more than for leisure on non-working days.
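A claim like this is easy to check numerically alongside the plot. The sketch below colors the scatter by `workingday`, relabels the colorbar ticks, and computes the mean casual count per group. The data are synthetic and deliberately carry no working-day effect, and the weekday rule ignores holidays, so only the real day.csv can confirm or refute the insight.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# SYNTHETIC stand-in for day.csv (assumption). Day 1 is treated as a Saturday,
# matching 2011-01-01; holidays are ignored in this simplified workingday flag.
n = 140
rng = np.random.default_rng(2)
weekday = (np.arange(n) + 6) % 7                      # 0 = Sunday, 6 = Saturday
workingday = ((weekday >= 1) & (weekday <= 5)).astype(int)
temp = rng.uniform(0.05, 0.85, size=n)
casual = (2000 * temp + rng.normal(0, 200, size=n)).clip(min=0)
daily = pd.DataFrame({"temp": temp, "casual": casual, "workingday": workingday})

fig, ax = plt.subplots()
points = ax.scatter(daily["temp"], daily["casual"],
                    c=daily["workingday"], cmap="viridis", alpha=0.7)
cbar = fig.colorbar(points, ax=ax)
cbar.set_ticks([0, 1])
cbar.set_ticklabels(["No", "Yes"])   # replace the numeric ticks with readable labels
cbar.set_label("Working day")
ax.set_xlabel("Normalized temperature")
ax.set_ylabel("Casual check outs")

# Mean casual checkouts on non-working (0) and working (1) days.
mean_by_daytype = daily.groupby("workingday")["casual"].mean()
```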
4. Examine Distributions. Let’s first build a histogram of the registered bike checkouts with the number of bins as 10. Set appropriate labels. Also set the title to be “Distribution of Registered Check Outs”.
Insights
The number of checkouts per day mostly falls in the 3000-4000 range.
5. Compare Distributions. We now compare the distributions of registered and casual checkouts. To make the figure easy to understand, in addition to the histogram we made for the previous question, we will set the transparency of the casual one to 0.8 and the number of bins to 5. Set appropriate labels.
Insights
1. Casual renters generally rent fewer times than registered renters, indicating that most casual renters are not returning customers.
2. For casual renters, the distribution steadily decreases and tops out at about 3000 checkouts.
3. From this we can infer that casual renters check out only once, or that their checkouts happen only on non-working days or holidays.
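Questions 4 and 5 can share one figure with two panels: the registered histogram alone, then the same histogram with the casual one laid over it. This is a sketch on synthetic `registered` and `casual` values (day.csv is not available here), and the two-panel layout and the second title are my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# SYNTHETIC stand-in for day.csv (assumption); shapes are arbitrary.
rng = np.random.default_rng(3)
daily = pd.DataFrame({
    "registered": rng.normal(3600, 1200, size=300).clip(min=0),
    "casual": rng.gamma(shape=2.0, scale=400.0, size=300),
})

fig, (ax_reg, ax_both) = plt.subplots(1, 2, figsize=(10, 4))

# Question 4: registered checkouts, 10 bins, with the required title.
ax_reg.hist(daily["registered"], bins=10)
ax_reg.set_title("Distribution of Registered Check Outs")
ax_reg.set_xlabel("Registered check outs per day")
ax_reg.set_ylabel("Number of days")

# Question 5: same histogram plus casual checkouts (5 bins, alpha=0.8).
reg_counts, _, _ = ax_both.hist(daily["registered"], bins=10, label="Registered")
cas_counts, _, _ = ax_both.hist(daily["casual"], bins=5, alpha=0.8, label="Casual")
ax_both.set_title("Registered vs Casual Check Outs")
ax_both.set_xlabel("Check outs per day")
ax_both.set_ylabel("Number of days")
ax_both.legend()
```

Both columns are daily totals, so a histogram shows how many days fall in each range of checkouts; it does not show how often an individual customer rents.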
6. How do the temperatures change across the seasons? You need to choose the type of visualization that best serves this purpose. What are the mean and median temperatures?
Insights
1. The temperature distribution differs from season to season, as shown in the graphs.
2. The mean temperature varies by season: 0.3 in winter, 0.55 in spring, 0.72 in summer and 0.41 in fall.
3. The highest temperatures, reaching 0.85, occur in summer, which also has the highest average; winter has the lowest average and the lowest temperature, 0.1, as expected.
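A box plot per season is one reasonable chart choice for this question, and the mean and median fall out of a `groupby`. The sketch below uses synthetic temperatures centered on the seasonal means quoted in this answer (an assumption made only to have something to plot) and the season coding from the data description.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# SYNTHETIC stand-in for day.csv (assumption): 50 days per season, scattered
# around the seasonal means reported in the answer.
rng = np.random.default_rng(4)
centers = {1: 0.30, 2: 0.55, 3: 0.72, 4: 0.41}
season = np.repeat([1, 2, 3, 4], 50)
temp = np.array([centers[s] for s in season]) + rng.normal(0, 0.08, size=season.size)
daily = pd.DataFrame({"season": season, "temp": temp.clip(0, 1)})

# Season coding from the assignment's data description.
names = {1: "Winter", 2: "Spring", 3: "Summer", 4: "Fall"}
groups = [daily.loc[daily["season"] == s, "temp"].to_numpy() for s in names]

fig, ax = plt.subplots()
ax.boxplot(groups)                      # one box per season, at positions 1..4
ax.set_xticks(range(1, 5))
ax.set_xticklabels(list(names.values()))
ax.set_xlabel("Season")
ax.set_ylabel("Normalized temperature")

# Mean and median per season, plus the overall figures the question asks for.
by_season = daily.groupby("season")["temp"].agg(["mean", "median"])
overall_mean = daily["temp"].mean()
overall_median = daily["temp"].median()
```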
7. What else can you conclude from this dataset by using various data exploration?
Insights
1. If we categorise the conditions of the day into 'Clear', 'Misty' and 'Rainy' and then plot the number of checkouts, we observe that people check out a similar number of bikes on misty and clear days, with the highest checkouts on clear days.
2. On rainy days these numbers drop significantly, possibly due to road conditions and safety considerations.
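The categorisation described here amounts to mapping the `weathersit` codes to labels and averaging `cnt` per label. A small sketch follows; the nine rows are made-up numbers chosen only so the arithmetic is easy to follow, and the label names come from this answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# MADE-UP toy rows (assumption), not values from day.csv.
daily = pd.DataFrame({
    "weathersit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "cnt": [5000, 4000, 6000, 4000, 4500, 3500, 2000, 1500, 1000],
})

# Labels used in the answer for weathersit codes 1, 2 and 3.
labels = {1: "Clear", 2: "Misty", 3: "Rainy"}
order = ["Clear", "Misty", "Rainy"]

# Average total checkouts per weather category, in a fixed display order.
mean_cnt = daily.groupby(daily["weathersit"].map(labels))["cnt"].mean().reindex(order)

fig, ax = plt.subplots()
ax.bar(list(mean_cnt.index), mean_cnt.values)
ax.set_xlabel("Weather")
ax.set_ylabel("Mean check outs per day")
```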
3. Source code
import pandas as pd  # import relevant libraries
import numpy as np
import matplotlib.pyplot as plt  # this is the tool we will use to perform EDA
# Shell command, not Python (prefix with % or ! inside a notebook):
# pip install --upgrade matplotlib  # required to upgrade matplotlib to the latest version for some code to...