Structured Data Processing:


For the purposes of this write-up, we will use examples from the Donors data (Donors_Data.csv) to help convey the requirements.


The main outline of your assignment is to write a program that reads in data from a file, such as a .csv, .tsv, or .txt file, or a file saved from Excel. The file will be in a structured format in which each line of data represents one unit (e.g. one donor in the donors file). Your program must represent the data using the Python data structures you have learned. You may choose the overall structure to be one of the following:




  • Dictionaries, lists, or tuples




  • NumPy arrays (this topic will be covered in class on 9/21)




  • pandas DataFrame (this topic will be covered in class on 9/28)




  • Or some combination of the above
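As a rough illustration of the first option, the sketch below reads delimited text into a list of dictionaries, one per data line, using the standard-library csv module. The column names (donor_id, wealth, home_value) and the inline sample are assumptions made for this example; the real Donors_Data.csv headers may differ.

```python
import csv
import io

# Hypothetical sample standing in for a file such as Donors_Data.csv;
# the column names here are assumptions, not the real headers.
SAMPLE = """donor_id,wealth,home_value
1,3,1200
2,3,800
3,7,2500
"""

def read_rows(handle, delimiter=","):
    """Return a list of dicts, one per data line, keyed by the header row."""
    return list(csv.DictReader(handle, delimiter=delimiter))

rows = read_rows(io.StringIO(SAMPLE))
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

For a real file, the same function could be called on an open file handle, for example `with open("Donors_Data.csv", newline="") as f: rows = read_rows(f)`, passing `delimiter="\t"` for a .tsv file. Note that every value arrives as a string, so numeric fields still need converting during cleaning.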


    You will perform data cleaning and exploration on this data.


    The programs you write will process the data into a form that answers two questions, as described below, and will write files with the data suitable for answering each question. Graphing is optional.
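One possible shape for the cleaning step is sketched below: string fields are stripped and converted to numbers, and rows that cannot be converted are dropped. The field names and the drop-the-row policy are assumptions for illustration; depending on the data, filling in a default value may be more appropriate than discarding a row.

```python
# Hypothetical raw rows as they might come from csv.DictReader:
# every value is a string, and some fields are blank or padded.
raw_rows = [
    {"wealth": "3", "home_value": " 1200 "},
    {"wealth": "", "home_value": "800"},      # missing wealth type
    {"wealth": "7", "home_value": "n/a"},     # unparseable home value
    {"wealth": "7", "home_value": "2500"},
]

def clean_row(row):
    """Return a row with numeric fields converted, or None if it is unusable."""
    try:
        return {
            "wealth": int(row["wealth"].strip()),
            "home_value": float(row["home_value"].strip()),
        }
    except ValueError:
        # Blank or non-numeric text cannot be converted; signal a bad row.
        return None

cleaned = [c for c in (clean_row(r) for r in raw_rows) if c is not None]
print(f"kept {len(cleaned)} of {len(raw_rows)} rows")
```

Reporting how many rows were kept and dropped is also useful material for the data cleaning description in the report.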


    Data:


    You must first choose a dataset to work with. As a guideline, choose a dataset with roughly 500 to 4,000 lines of data and between 5 and 50 columns.


    If the data comes in an Excel spreadsheet with a lot of columns, it is OK to first edit the Excel file to remove columns that you don’t need for your processing. For example, in the Donors data, you might wish to create a separate Excel spreadsheet with only a few columns of data.


    Questions:


    For this assignment, at least one question that you choose should look at the data in a different unit of analysis than is present in the data file. For example, instead of looking at individual donors, you could look at the donors of each of the 9 income or wealth types.




Simple example question (NOTE: you should do a more complex problem than this): For each wealth type, what is the average home value of all the donors of that type?




  • Unit of analysis: wealth types




  • Comparison: for each wealth type, compute the average home value of the neighborhoods of all the donors of that type




  • Output: should be in a file with 9 rows of data (you may also produce header and label rows), where each row has an income type (1 – 9) and the average home values.
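The simple example question above could be sketched roughly as follows, using only the standard library. The field names (wealth, home_value), the three sample rows, and the output column headers are assumptions for illustration; with the real data there would be one output row per wealth type.

```python
import csv
import io
from collections import defaultdict

# Hypothetical cleaned rows; field names are assumptions.
donors = [
    {"wealth": 1, "home_value": 1000.0},
    {"wealth": 1, "home_value": 2000.0},
    {"wealth": 2, "home_value": 500.0},
]

def average_by(rows, key, value):
    """Map each distinct rows[key] to the mean of rows[value]."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in sorted(totals)}

def write_summary(averages, handle):
    """Write a header row plus one row per group."""
    writer = csv.writer(handle)
    writer.writerow(["wealth_type", "avg_home_value"])
    for group, avg in averages.items():
        writer.writerow([group, f"{avg:.2f}"])

averages = average_by(donors, "wealth", "home_value")
out = io.StringIO()  # stand-in for open("wealth_summary.csv", "w", newline="")
write_summary(averages, out)
print(out.getvalue())
```

The key point is the change in unit of analysis: the input has one row per donor, while the output has one row per wealth type.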


    One way to increase the complexity of this particular question would be to add more items to compare across income types (e.g. add columns to the output with average total gifts or values of the last gifts).


    Another option would be to introduce a more detailed unit of analysis: for example, for each income level, report by gender, giving the average home values for both men and women in each category.
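A more detailed unit of analysis like this can be handled by grouping on a composite key. The sketch below groups on an (income, gender) tuple; the field names, the "F"/"M" codes, and the sample values are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical cleaned rows; field names and gender codes are assumptions.
donors = [
    {"income": 1, "gender": "F", "home_value": 1000.0},
    {"income": 1, "gender": "M", "home_value": 3000.0},
    {"income": 1, "gender": "F", "home_value": 2000.0},
    {"income": 2, "gender": "M", "home_value": 500.0},
]

# Collect home values under a composite (income, gender) key.
groups = defaultdict(list)
for row in donors:
    groups[(row["income"], row["gender"])].append(row["home_value"])

# One summary entry per (income, gender) pair, in sorted key order.
summary = {key: sum(vals) / len(vals) for key, vals in sorted(groups.items())}
for (income, gender), avg in summary.items():
    print(income, gender, f"{avg:.2f}")
```

Pairs that never occur in the data simply do not appear in the summary, so the output may have fewer rows than the full income-by-gender grid.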


    Other ideas:


    Compare donors in the various zip codes with various types or amounts of giving. Compare donors by the number of promotions with the total amount of donations and the frequency of donations.
    Compare the number of months since the last donation to the donation amounts.


    Deliverable [total: 15 points]:
    For this mini project, you must submit your data set, a program*, and a report**. Your program must be submitted as a Jupyter notebook file (.ipynb) and your report should be a .docx (please do not send a .pdf, as adding comments is challenging).






  • *A program (.ipynb) which does the following [subtotal: 10 points]:
    o Reads in data from a file [1 point]
    o Cleans and formats the data [2 points]
    o Analyzes/summarizes the data in two different ways [6 points (3 x 2)]
    o Outputs the table summaries to console (or optional graphs) [1 point]




  • **A report which describes the following [subtotal: 5 points]:
    o The data and its source [1 point]
    o A description of your data exploration and data cleaning steps [1 point]
    o Two clearly stated comparison questions with the unit of analysis, the comparison values and how they are computed [1 point]
    o A description of the program [1 point]
    o A description of the output files [1 point]


For your program, you may use any of the code developed in class as a template, but it is absolutely essential that you use appropriate variable names and that you write original comments for what your program does. Recall that good comments demonstrate your understanding of the code that you write and the problem that you are trying to solve.

Answered Same Day Oct 06, 2021

Sudipta answered on Oct 07 2021
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Defining function for reading data"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n",
"0 1461 20 RH 80.0 11622 Pave NaN
Reg \n",
"1 1462 20 RL 81.0 14267 Pave NaN IR1 \n",
"2 1463 60 RL 74.0 13830 Pave NaN IR1 \n",
"3 1464 60 RL 78.0 9978 Pave NaN IR1 \n",
"4 1465 120 RL 43.0 5005 Pave NaN IR1 \n",
"5 1466 60 RL 75.0 10000 Pave NaN IR1 \n",
"6 1467 20 RL NaN 7980 Pave NaN IR1 \n",
"7 1468 60 RL 63.0 8402 Pave NaN IR1 \n",
"8 1469 20 RL 85.0 10176 Pave NaN Reg \n",
"9 1470 20 RL 70.0 8400 Pave NaN Reg \n",
"\n",
" LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence \\\n",
"0 Lvl AllPub ... 120 0 NaN MnPrv \n",
"1 Lvl AllPub ... 0 0 NaN NaN \n",
"2 Lvl AllPub ... 0 0 NaN MnPrv \n",
"3 Lvl AllPub ... 0 0 NaN NaN \n",
"4 HLS AllPub ... 144 0 NaN NaN \n",
"5 Lvl AllPub ... 0 0 NaN NaN \n",
"6 Lvl AllPub ... 0 0 NaN GdPrv \n",
"7 Lvl AllPub ... 0 0 NaN NaN \n",
"8 Lvl AllPub ... 0 0 NaN NaN \n",
"9 Lvl AllPub ... 0 0 NaN MnPrv \n",
"\n",
" MiscFeature MiscVal MoSold YrSold SaleType SaleCondition \n",
"0 NaN 0 6 2010 WD Normal \n",
"1 Gar2 12500 6 2010 WD Normal \n",
"2 NaN 0 3 2010 WD Normal \n",
"3 NaN 0 6 2010 WD Normal \n",
"4 NaN 0 1 2010 WD Normal \n",
"5 NaN 0 4 2010 WD Normal \n",
"6 Shed 500 3 2010 WD Normal \n",
"7 NaN 0 5 2010 WD Normal \n",
"8 NaN 0 2 2010 WD Normal \n",
"9 NaN 0 4 2010 WD Normal \n",
"\n",
"[10 rows x 80 columns]\n",
"1459\n",
"80\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"#Function defined for reading the data from an excel file.\n",
"def readFile():\n",
" df=pd.read_excel(r'path to data here\\house.xlsx')\n",
" df=pd.DataFrame(df)\n",
" return(df)\n",
"df=readFile()\n",
"index=df.index\n",
"number_of_rows=len(index)\n",
"number_of_columns=df.columns\n",
"#prints first 10 records\n",
"print(df.head(10))\n",
"#prints number of rows\n",
"print(number_of_rows)\n",
"#print number of columns \n",
"print(len(number_of_columns))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"# Cleaning data where columns with heading 'PoolQC', '3SsnPorch' and 'Alley' are removed, which are not required or dump value fields."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Id MSSubClass MSZoning LotFrontage LotArea Street LotShape \\\n",
"0 1461 20 RH 80.0 11622 Pave Reg \n",
"1 1462 20 RL 81.0 14267 Pave IR1 \n",
"2 1463 60 RL 74.0 13830 Pave IR1 \n",
"3 1464 60 RL 78.0 9978 Pave IR1 \n",
"4 1465 120 RL 43.0 5005 Pave IR1 \n",
"5 1466 60 RL 75.0 10000 Pave IR1 \n",
"6 1467 20 RL NaN 7980 Pave IR1 \n",
"7 1468 60 RL 63.0 8402 Pave IR1 \n",
"8 1469 20 RL 85.0 10176 Pave Reg \n",
"9 1470 20 RL 70.0 8400 Pave Reg \n",
"\n",
" LandContour Utilities LotConfig ... EnclosedPorch ScreenPorch \\\n",
"0 Lvl AllPub Inside ... 0 120 \n",
"1 Lvl AllPub Corner ... 0 0 \n",
"2 Lvl AllPub Inside ... 0 0 \n",
"3 Lvl AllPub Inside ... 0 0 \n",
"4 HLS AllPub Inside ... 0 144 \n",
"5 Lvl AllPub Corner ... ...