Complete the Hypothesis Case Study Part 1 tutorial. It is not a complete case study; it is just the...

Question

Complete the Hypothesis Case Study Part 1 tutorial. It is not a complete case study; it is just the steps you might take to do Graph Analysis. I have provided sample code for you to use as you go through the tutorial. I need you to provide comments for each step and run the steps separately so I can fully understand what you are doing at each step of the analysis.

(I am using the first part of it to practice Graphic Analytics, but the updates to Anaconda messed up some of the packages and I can’t run Python.)

I got some of the code done. The data set is large (205 MB).

I will have to share it through a Dropbox link.

Testing Hypothesis Exercise

#Hypothesis: Articles about Climate Change are more likely to be published by "Liberal" sources

NOTE: This case study is not complete!

Here is some additional sample code to use:

import pandas as pd
import numpy as np
import json
import sys
import warnings
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn import datasets, linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import NMF
from sklearn import datasets
from sklearn.model_selection import train_test_split

#9.1 reducing features using Principal Components
digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)
pca = PCA(n_components=0.99, whiten=True)
features_pca = pca.fit_transform(features)
print("original number of features:", features.shape[1])
print("reduced number of features:", features_pca.shape[1])
print("output from 9.1 done!")

#9.4 Reducing Features Using Matrix Factorization
features = digits.data
nmf = NMF(n_components=10, random_state=1)
features_nmf = nmf.fit_transform(features)
print("Original number of features:", features.shape[1])
print("reduced number of features:", features_nmf.shape[1])
print("output from 9.4 done!")

#10.1 - Thresholding Numerical Feature Variance
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
#import data
iris = datasets.load_iris()
#create features and target
features = iris.data
target = iris.target
#create thresholder
thresholder = VarianceThreshold(threshold=.5)
#create high variance feature matrix and print
features_high_variance = thresholder.fit_transform(features)
print(features_high_variance[0:3])

#10.2 - Thresholding Binary Feature Variance
features = [[0,1,0], [0,1,1], [0,1,0], [0,1,1], [1,0,0]]
thresholder = VarianceThreshold(threshold=(.75*(1-.75)))
print(thresholder.fit_transform(features))
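
A minimal sketch of how the hypothesis stated above might be checked once the data is loaded: flag articles whose text mentions "climate change" and compare the share of such articles between groups of sources. It assumes a DataFrame with `publication` and `content` columns. The four-row DataFrame is a made-up stand-in for the real file, and the `leaning` mapping and the helper name `climate_share_by_leaning` are hypothetical, introduced only for illustration; they are not part of the assignment.

```python
import pandas as pd

# Made-up stand-in for the real article data (illustration only).
articles = pd.DataFrame({
    "publication": ["New York Times", "New York Times", "Breitbart", "Breitbart"],
    "content": [
        "New report on climate change and rising seas.",
        "Congress debates the budget.",
        "Senators argue over Climate Change policy.",
        "Sports roundup for the weekend.",
    ],
})

# HYPOTHETICAL labels: how each source is classified is an assumption
# the analyst has to make and justify; it does not come from the data.
leaning = {"New York Times": "liberal", "Breitbart": "conservative"}


def climate_share_by_leaning(df, leaning_map):
    """Return the fraction of articles mentioning 'climate change' per leaning."""
    # Case-insensitive substring match; missing content counts as no mention.
    mentions = df["content"].str.contains("climate change", case=False, na=False)
    labeled = df.assign(
        mentions_climate=mentions,
        leaning=df["publication"].map(leaning_map),
    )
    # The mean of a boolean column is the share of True values.
    return labeled.groupby("leaning")["mentions_climate"].mean()


shares = climate_share_by_leaning(articles, leaning)
print(shares)
```

Comparing shares rather than raw counts matters here because the sources contribute very different numbers of articles, so a source with more articles would otherwise dominate the comparison.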

articles-frzlghck.ipynb testing-hypothesis-exercise-gzdu0usd.docx

Kshitij · Accepted Answer

45265.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CINDY HERRERA DSC550 WEEK 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Applied Text Analysis With Python Exercises"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import string\n",
    "import re\n",
    "import matplotlib.pyplot as plt\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Step 1:  Load data into a dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "addr1 = \"articles1.csv\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2:  check the dimension of the table/look at the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The dimension of the table is:  (50000, 10)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "   Unnamed: 0     id                                              title  \\\n",
       "0           0  17283  House Republicans Fret About Winning Their Hea...   \n",
       "1           1  17284  Rift Between Officers and Residents as Killing...   \n",
       "2           2  17285  Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...   \n",
       "3           3  17286  Among Deaths in 2016, a Heavy Toll in Pop Musi...   \n",
       "4           4  17287  Kim Jong-un Says North Korea Is Preparing to T...   \n",
       "\n",
       "      publication                         author        date    year  month  \\\n",
       "0  New York Times                     Carl Hulse  2016-12-31  2016.0   12.0   \n",
       "1  New York Times  Benjamin Mueller and Al Baker  2017-06-19  2017.0    6.0   \n",
       "2  New York Times                   Margalit Fox  2017-01-06  2017.0    1.0   \n",
       "3  New York Times               William McDonald  2017-04-10  2017.0    4.0   \n",
       "4  New York Times                  Choe Sang-Hun  2017-01-02  2017.0    1.0   \n",
       "\n",
       "   url                                            content  \n",
       "0  NaN  WASHINGTON  —   Congressional Republicans have...  \n",
       "1  NaN  After the bullet shells get counted, the blood...  \n",
       "2  NaN  When Walt Disney’s “Bambi” opened in 1942, cri...  \n",
       "3  NaN  Death may be the great equalizer, but it isn’t...  \n",
       "4  NaN  SEOUL, South Korea  —   North Korea’s leader, ...  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "articles = pd.read_csv(addr1)\n",
    "\n",
    "\n",
    "print(\"The dimension of the table is: \", articles.shape)\n",
    "\n",
    "# Display the first 5 rows of the dataframe\n",
    "# to get an idea of what kind of data it contains\n",
    "articles.head(5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: what type of variables are in the table "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Describe Data\n",
      "         Unnamed: 0            id          year         month  url\n",
      "count  50000.000000  50000.000000  50000.000000  50000.000000  0.0\n",
      "mean   25694.378380  44432.454800   2016.273700      5.508940  NaN\n",
      "std    15350.143677  15773.615179      0.634694      3.333062  NaN\n",
      "min        0.000000  17283.000000   2011.000000      1.000000  NaN\n",
      "25%    12500.750000  31236.750000   2016.000000      3.000000  NaN\n",
      "50%    25004.500000  43757.500000   2016.000000      5.000000  NaN\n",
      "75%    38630.250000  57479.250000   2017.000000      8.000000  NaN\n",
      "max    53291.000000  73469.000000   2017.000000     12.000000  NaN\n",
      "Summarized Data\n",
      "                                                    title publication  \\\n",
      "count                                               50000       50000   \n",
      "unique                                              49920           5   \n",
      "top     The 10 most important things in the world righ...   Breitbart   \n",
      "freq                                                    7       23781   \n",
      "\n",
      "                author        date        content  \n",
      "count            43694       50000          50000  \n",
      "unique            3603         983          49888  \n",
      "top     Breitbart News  2016-08-22  advertisement  \n",
      "freq              1559         221             42  \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Unnamed: 0       int64\n",
       "id               int64\n",
       "title           object\n",
       "publication     object\n",
       "author          object\n",
       "date            object\n",
       "year           float64\n",
       "month          float64\n",
       "url            float64\n",
       "content         object\n",
       "dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the types of the variables in the table, as follows\n",
    "print(\"Describe Data\")\n",
    "print(articles.describe())\n",
    "print(\"Summarized Data\")\n",
    "print(articles.describe(include=['O']))\n",
    "\n",
    "# this will return the datatypes of the columns\n",
    "articles.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "50000\n"
     ]
    }
   ],
   "source": [
    "#display length of data\n",
    "print(len(articles))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png":

	Unnamed: 0	id	title	publication	author	date	year	month	url	content
0	0	17283	House Republicans Fret About Winning Their Hea...	New York Times	Carl Hulse	2016-12-31	2016.0	12.0	NaN	WASHINGTON — Congressional Republicans have...
1	1	17284	Rift Between Officers and Residents as Killing...	New York Times	Benjamin Mueller and Al Baker	2017-06-19	2017.0	6.0	NaN	After the bullet shells get counted, the blood...
2	2	17285	Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...	New York Times	Margalit Fox	2017-01-06	2017.0	1.0	NaN	When Walt Disney’s “Bambi” opened in 1942, cri...
3	3	17286	Among Deaths in 2016, a Heavy Toll in Pop Musi...	New York Times	William McDonald	2017-04-10	2017.0	4.0	NaN	Death may be the great equalizer, but it isn’t...
4	4	17287	Kim Jong-un Says North Korea Is Preparing to T...	New York Times	Choe Sang-Hun	2017-01-02	2017.0	1.0	NaN	SEOUL, South Korea — North Korea’s leader, ...
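
The table above shows a `publication` column, so a natural first graph-analysis step might be a bar chart of article counts per publication. The following is only a sketch under that assumption; it is not the code from the answer notebook, whose figure output is cut off. The three-row DataFrame is a made-up stand-in for the real data, and the helper name `plot_publication_counts` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the real article data (illustration only).
articles = pd.DataFrame(
    {"publication": ["New York Times", "Breitbart", "Breitbart"]}
)


def plot_publication_counts(df):
    """Draw a bar chart of the number of articles per publication."""
    # value_counts sorts publications from most to fewest articles.
    counts = df["publication"].value_counts()
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(counts.index, counts.values)
    ax.set_xlabel("Publication")
    ax.set_ylabel("Number of articles")
    ax.set_title("Articles per publication")
    fig.tight_layout()
    return counts, ax


counts, ax = plot_publication_counts(articles)
print(counts)
```

In a notebook, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` after the function would display the chart inline.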