
1. Preparing Text: For this part, you will start by reading the Income.json file into a DataFrame.


   a. Convert all text to lowercase letters.


   b. Remove all punctuation from the text.


   c. Remove stop words.


   d. Apply NLTK's PorterStemmer.


2. Use a tf-idf vector instead of the word-frequency vector (a minimal sketch of steps 1 and 2 appears after this list).


3. Complete the 5.3 Encoding Dictionaries of Features examples. Be sure to keep track of how many times a word is used in a document, and be sure to run the sample code in 6.9. Finally, consider tokenizing words or sentences (see 6.4) and tagging parts of speech (see 6.7). Be sure to review how to encode days of the week (see 7.6).


4. You can start with the #1 program and add to it, or you can start a new program. Provide me with an example (besides counting words in a document) of how these techniques could be used. (Just a couple of sentences.)


5. Then implement at least 3 of these text techniques in a program demonstrating how your example could be accomplished. Be sure to include lots of comments.


6. Create a data file or use one from the resources file. You must use DataFrames!

7. The completed task must be in a Jupyter Notebook and returned with the completed data file.
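For orientation, here is a minimal sketch of steps 1 and 2 in Python. It assumes a small list of hypothetical documents (Income.json itself only carries short "State" strings, so a realistic run would use a genuine text column); the names docs, prepare, and cleaned are illustrative, not part of the assignment.

import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# nltk.download('stopwords')  # first run only

docs = ["The quick brown fox!", "Foxes are quick, and they jump."]  # hypothetical documents
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

def prepare(text):
    text = text.lower()                                               # a. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # b. strip punctuation
    words = [w for w in text.split() if w not in stop_words]          # c. drop stop words
    return " ".join(porter.stem(w) for w in words)                    # d. Porter stemming

cleaned = [prepare(d) for d in docs]

tfidf = TfidfVectorizer()                 # 2. tf-idf instead of raw word frequencies
matrix = tfidf.fit_transform(cleaned)
print(tfidf.get_feature_names_out())      # use get_feature_names() on scikit-learn < 1.0
print(matrix.toarray())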


Handling Categorical Data, Text, Dates & Times. Use the data file "Income.json", whose contents are:

{"# of kids":{"0":5,"1":5,"2":2,"3":2,"4":0,"5":1,"6":1,"7":3,"8":3,"9":3},"Income":{"0":25000,"1":122500,"2":142007,"3":42007,"4":14704,"5":200704,"6":120070,"7":207040,"8":48000,"9":79000},"State":{"0":"CA","1":"NY","2":"TX","3":"TX","4":"TX","5":"TX","6":"CA","7":"NY","8":"NY","9":"NY"}}
Answered Same Day, Sep 04, 2021

Ximi answered on Sep 08 2021
{
"cells": [
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import requests, json \n",
"from nltk.corpus import stopwords"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"gn: right;\">\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
# of kidsIncomeState
0525000CA
15122500NY
22142007TX
3242007TX
4014704TX
51200704TX
61120070CA
73207040NY
8348000NY
9379000NY
\n",
"
"
],
"text/plain": [
" # of kids Income State\n",
"0 5 25000 CA\n",
"1 5 122500 NY\n",
"2 2 142007 TX\n",
"3 2 42007 TX\n",
"4 0 14704 TX\n",
"5 1 200704 TX\n",
"6 1 120070 CA\n",
"7 3 207040 NY\n",
"8 3 48000 NY\n",
"9 3 79000 NY"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reading JSON file \n",
"df = pd.read_json('income-q1ojinmg.json')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"# converting into records\n",
"data_records = [{key:value for key,value in zip(df.columns, list(item)[1:])} for item in df.to_records()]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"df['State'] = df['State'].apply(lambda x: x.lower())"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# to be used later\n",
"en_stop_words = set(stopwords.words('english'))"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is an example of string with punctuation\n"
]
}
],
"source": [
"# Remove punctuation\n",
"import string\n",
"input_str = \"This &is [an] example? {of} string. with.? punctuation!!!!\" # Sample string\n",
"result = input_str.translate(str.maketrans(\"\",\"\", string.punctuation))\n",
"print(result)"
]
},
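{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Added sketch (not in the original answer):** steps c and d (stop-word removal, Porter stemming) applied to the punctuation-free string above, followed by a tf-idf vector for item 2. The code reuses `result` and `en_stop_words` from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# c. & d. remove stop words, then apply NLTK's PorterStemmer\n",
"from nltk.stem.porter import PorterStemmer\n",
"\n",
"porter = PorterStemmer()\n",
"tokens = result.lower().split()  # 'result' is the punctuation-free string above\n",
"filtered = [w for w in tokens if w not in en_stop_words]  # c. drop stop words\n",
"stemmed = [porter.stem(w) for w in filtered]  # d. stem each remaining word\n",
"print(stemmed)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2. tf-idf vector instead of a raw word-frequency vector\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"tfidf = TfidfVectorizer()\n",
"matrix = tfidf.fit_transform([' '.join(stemmed)])  # a single toy document, for illustration only\n",
"print(tfidf.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0\n",
"print(matrix.toarray())"
]
},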
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5.1 sample code - Encoding Nominal Categorical Features**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['Texas']\n",
" ['California']\n",
" ['Texas']\n",
" ['Delaware']\n",
" ['Texas']]\n",
"feature created\n"
]
}
],
"source": [
"#create feature\n",
"feature = np.array([[\"Texas\"],\n",
" [\"California\"],\n",
" [\"Texas\"],\n",
" [\"Delaware\"],\n",
" [\"Texas\"]])\n",
"print(feature)\n",
"print(\"feature created\")\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0 0 1]\n",
" [1 0 0]\n",
" [0 0 1]\n",
" [0 1 0]\n",
" [0 0 1]]\n",
"States are encoded\n"
]
}
],
"source": [
"#create one-hot encoder\n",
"from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer\n",
"one_hot = LabelBinarizer()\n",
"print(one_hot.fit_transform(feature))\n",
"print(\"States are encoded\")\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['California' 'Delaware' 'Texas' 'Delaware' 'Texas']\n",
"feature classes printed\n"
]
}
],
"source": [
"#view feature classes\n",
"one_hot.classes_\n",
"print(np.array(['California', 'Delaware', 'Texas', 'Delaware', 'Texas'], dtype=' "print(\"feature classes printed\")\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" California Delaware Texas\n",
"0 0 0 1\n",
"1 1 0 0\n",
"2 0 0 1\n",
"3 0 1 0\n",
"4 0 0 1\n",
"dummy variables\n"
]
}
],
"source": [
"#create dummy variables from feature\n",
"print(pd.get_dummies(feature[:,0]))\n",
"print(\"dummy variables\")\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Texas', 'Florida'), ('California', 'Alabama'), ('Texas', 'Florida'), ('Delaware', 'Florida'), ('Texas', 'Alabama')]\n",
"multiclass_feature\n",
"[[0 0 0 1 1]\n",
" [1 1 0 0 0]\n",
" [0 0 0 1 1]\n",
" [0 0 1 1 0]\n",
" [1 0 0 0 1]]\n",
"MultiLabelBinarizer(classes=None, sparse_output=False)\n",
"one_hot_multiclass\n",
"['Alabama' 'California' 'Delaware' 'Florida' 'Texas']\n",
"multiclass classes\n"
]
}
],
"source": [
"#multiclass features\n",
"multiclass_feature =[(\"Texas\", \"Florida\"),\n",
" (\"California\", \"Alabama\"),\n",
" (\"Texas\", \"Florida\"),\n",
" (\"Delaware\", \"Florida\"),\n",
" (\"Texas\", \"Alabama\")]\n",
"print (multiclass_feature)\n",
"print(\"multiclass_feature\")\n",
"\n",
"one_hot_multiclass = MultiLabelBinarizer()\n",
"print(one_hot_multiclass.fit_transform(multiclass_feature))\n",
"print(one_hot_multiclass)\n",
"print(\"one_hot_multiclass\")\n",
"print(one_hot_multiclass.classes_)\n",
"print(\"multiclass classes\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5.2 sample code - Encoding Ordinal Categorical Features**"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Score\n",
"0 Low\n",
"1 Low\n",
"2 Medium\n",
"3 Medium\n",
"4 High\n",
"features\n"
]
}
],
"source": [
"#create features\n",
"dataframe =...