
1. Preparing Text: For this part, you will start by reading the Income.json file into a DataFrame.


   a. Convert all text to lowercase letters.


   b. Remove all punctuation from the text.


   c. Remove stop words.


   d. Apply NLTK's PorterStemmer.


2. Use a tf-idf vector instead of the word-frequency vector (a minimal sketch of steps 1 and 2 appears after this list).


3. Complete the 5.3 Encoding Dictionaries of Features examples. Be sure to keep track of how many times a word is used in a document, and be sure to run the sample code in 6.9. Finally, consider tokenizing words or sentences (see 6.4) and tagging parts of speech (see 6.7). Be sure to review how to encode days of the week (see 7.6).


4. You can start with the #1 program and add to it, or you can start a new program. Provide me with an example (besides counting words in a document) of how these techniques could be used. (Just a couple of sentences.)


5. Then implement at least 3 of these text techniques in a program demonstrating how your example could be accomplished. Be sure to include lots of comments.


6. Create a data file or use one from the resources file. You must use DataFrames!

7. The completed task must be in a Jupyter Notebook and returned with the completed data file.
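For orientation, here is a minimal sketch of steps 1 and 2 in Python. It assumes a small list of hypothetical documents (Income.json itself only carries short "State" strings, so a realistic run would use a genuine text column); the names docs, prepare, and cleaned are illustrative, not part of the assignment.

import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# nltk.download('stopwords')  # first run only

docs = ["The quick brown fox!", "Foxes are quick, and they jump."]  # hypothetical documents
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

def prepare(text):
    text = text.lower()                                               # a. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # b. strip punctuation
    words = [w for w in text.split() if w not in stop_words]          # c. drop stop words
    return " ".join(porter.stem(w) for w in words)                    # d. Porter stemming

cleaned = [prepare(d) for d in docs]

tfidf = TfidfVectorizer()                 # 2. tf-idf instead of raw word frequencies
matrix = tfidf.fit_transform(cleaned)
print(tfidf.get_feature_names_out())      # use get_feature_names() on scikit-learn < 1.0
print(matrix.toarray())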


Handling Categorical Data, Text, Dates & Times. Use the data file "Income.json", whose contents are:

{"# of kids":{"0":5,"1":5,"2":2,"3":2,"4":0,"5":1,"6":1,"7":3,"8":3,"9":3},"Income":{"0":25000,"1":122500,"2":142007,"3":42007,"4":14704,"5":200704,"6":120070,"7":207040,"8":48000,"9":79000},"State":{"0":"CA","1":"NY","2":"TX","3":"TX","4":"TX","5":"TX","6":"CA","7":"NY","8":"NY","9":"NY"}}
Answered Same Day, Sep 04, 2021

Ximi answered on Sep 08 2021
{
"cells": [
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import requests, json \n",
"from nltk.corpus import stopwords"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"gn: right;\">\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
# of kidsIncomeState
0525000CA
15122500NY
22142007TX
3242007TX
4014704TX
51200704TX
61120070CA
73207040NY
8348000NY
9379000NY
\n",
"
"
],
"text/plain": [
" # of kids Income State\n",
"0 5 25000 CA\n",
"1 5 122500 NY\n",
"2 2 142007 TX\n",
"3 2 42007 TX\n",
"4 0 14704 TX\n",
"5 1 200704 TX\n",
"6 1 120070 CA\n",
"7 3 207040 NY\n",
"8 3 48000 NY\n",
"9 3 79000 NY"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reading JSON file \n",
"df = pd.read_json('income-q1ojinmg.json')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"# converting into records\n",
"data_records = [{key:value for key,value in zip(df.columns, list(item)[1:])} for item in df.to_records()]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"df['State'] = df['State'].apply(lambda x: x.lower())"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# to be used later\n",
"en_stop_words = set(stopwords.words('english'))"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is an example of string with punctuation\n"
]
}
],
"source": [
"# Remove punctuation\n",
"import string\n",
"input_str = \"This &is [an] example? {of} string. with.? punctuation!!!!\" # Sample string\n",
"result = input_str.translate(str.maketrans(\"\",\"\", string.punctuation))\n",
"print(result)"
]
},
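{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Added sketch (not in the original answer):** steps c and d (stop-word removal, Porter stemming) applied to the punctuation-free string above, followed by a tf-idf vector for item 2. The code reuses `result` and `en_stop_words` from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# c. & d. remove stop words, then apply NLTK's PorterStemmer\n",
"from nltk.stem.porter import PorterStemmer\n",
"\n",
"porter = PorterStemmer()\n",
"tokens = result.lower().split()  # 'result' is the punctuation-free string above\n",
"filtered = [w for w in tokens if w not in en_stop_words]  # c. drop stop words\n",
"stemmed = [porter.stem(w) for w in filtered]  # d. stem each remaining word\n",
"print(stemmed)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2. tf-idf vector instead of a raw word-frequency vector\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"tfidf = TfidfVectorizer()\n",
"matrix = tfidf.fit_transform([' '.join(stemmed)])  # a single toy document, for illustration only\n",
"print(tfidf.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0\n",
"print(matrix.toarray())"
]
},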
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5.1 sample code - Encoding Nominal Categorical Features**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['Texas']\n",
" ['California']\n",
" ['Texas']\n",
" ['Delaware']\n",
" ['Texas']]\n",
"feature created\n"
]
}
],
"source": [
"#create feature\n",
"feature = np.array([[\"Texas\"],\n",
" [\"California\"],\n",
" [\"Texas\"],\n",
" [\"Delaware\"],\n",
" [\"Texas\"]])\n",
"print(feature)\n",
"print(\"feature created\")\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0 0 1]\n",
" [1 0 0]\n",
" [0 0 1]\n",
" [0 1 0]\n",
" [0 0 1]]\n",
"States are encoded\n"
]
}
],
"source": [
"#create one-hot encoder\n",
"from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer\n",
"one_hot = LabelBinarizer()\n",
"print(one_hot.fit_transform(feature))\n",
"print(\"States are encoded\")\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['California' 'Delaware' 'Texas' 'Delaware' 'Texas']\n",
"feature classes printed\n"
]
}
],
"source": [
"#view feature classes\n",
"one_hot.classes_\n",
"print(np.array(['California', 'Delaware', 'Texas', 'Delaware', 'Texas'], dtype=' "print(\"feature classes printed\")\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" California Delaware Texas\n",
"0 0 0 1\n",
"1 1 0 0\n",
"2 0 0 1\n",
"3 0 1 0\n",
"4 0 0 1\n",
"dummy variables\n"
]
}
],
"source": [
"#create dummy variables from feature\n",
"print(pd.get_dummies(feature[:,0]))\n",
"print(\"dummy variables\")\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Texas', 'Florida'), ('California', 'Alabama'), ('Texas', 'Florida'), ('Delaware', 'Florida'), ('Texas', 'Alabama')]\n",
"multiclass_feature\n",
"[[0 0 0 1 1]\n",
" [1 1 0 0 0]\n",
" [0 0 0 1 1]\n",
" [0 0 1 1 0]\n",
" [1 0 0 0 1]]\n",
"MultiLabelBinarizer(classes=None, sparse_output=False)\n",
"one_hot_multiclass\n",
"['Alabama' 'California' 'Delaware' 'Florida' 'Texas']\n",
"multiclass classes\n"
]
}
],
"source": [
"#multiclass features\n",
"multiclass_feature =[(\"Texas\", \"Florida\"),\n",
" (\"California\", \"Alabama\"),\n",
" (\"Texas\", \"Florida\"),\n",
" (\"Delaware\", \"Florida\"),\n",
" (\"Texas\", \"Alabama\")]\n",
"print (multiclass_feature)\n",
"print(\"multiclass_feature\")\n",
"\n",
"one_hot_multiclass = MultiLabelBinarizer()\n",
"print(one_hot_multiclass.fit_transform(multiclass_feature))\n",
"print(one_hot_multiclass)\n",
"print(\"one_hot_multiclass\")\n",
"print(one_hot_multiclass.classes_)\n",
"print(\"multiclass classes\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5.2 sample code - Encoding Ordinal Categorical Features**"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Score\n",
"0 Low\n",
"1 Low\n",
"2 Medium\n",
"3 Medium\n",
"4 High\n",
"features\n"
]
}
],
"source": [
"#create features\n",
"dataframe =...