https://colab.research.google.com/drive/1ReN6mNkuUE0WtzPbQJKL5SMivl10Rlfz?usp=sharing{ "nbformat":...

Question

https://colab.research.google.com/drive/1ReN6mNkuUE0WtzPbQJKL5SMivl10Rlfz?usp=sharing{   "nbformat": 4,   "nbformat_minor": 0,   "metadata": {     "colab": {       "name": "ISTA_322_HW2_blank.ipynb",       "provenance": [],       "collapsed_sections": [         "FLRf120imkkc",         "NO73i3jGuu0F",         "xhiS7p5zwkAe"       ]     },     "kernelspec": {       "name": "python3",       "display_name": "Python 3"     }   },   "cells": [     {       "cell_type": "markdown",       "metadata": {         "id": "Hm-lEcSS1qVm",         "colab_type": "text"       },       "source": [         "# HW2 - Transforms!
",         " 
",         "In this homework we'll be practicing some data aggregations, joins, and flattening JSON files in order to make inference from two datasets!
",         " 
",         "The first is a dataset with 25,000,000 movie reviews.  These are already in flat files and are clean.  But, we obviously want to do some filtering to remove movies that only have a review or two.  And we'll want to do a couple of aggregations and joins to distill down those 25 million reviews into something understandable and useful.  
",         " 
",         "The other dataset will have you going to Yelp and making a dev account.  You'll then access various bits of information about local businesses and their reviews in order to figure out which ones are doing the best in Covid times.  
"       ]     },     {       "cell_type": "markdown",       "metadata": {         "id": "4dK6_6V0_VY0",         "colab_type": "text"       },       "source": [         "## Wranging and aggregating movie review data
",         " 
",         "The website MovieLens.com has a research group which provides open access to millions of reviews, for free!  We're going to work with those data for this homework.  [Feel free to check out the website here.](https://grouplens.org/datasets/movielens/)  You can go and download the raw data, but in this case I was nice and tossed it all on Amazon S3 for fast direct downloads.  
",         " 
",         "There are two datasets we'll be working with.
",         "* movies - this is a file of 60,000+ movies
",         "* reviews - this is a file of 25 *million* individual reviews for the 60k movies
",         " 
",         "The goal for this section of the homework is to do two types of data aggregations that will allow for someone to make inference on which movies were the most popular, reviewed, polarizing and were cult classics.  
",         " 
",         "To do this we'll first start by making a simple data set that brings just overall review properties together with movies.  We'll then do some deeper groping to create a dataset that looks at the same properties but over time. "       ]     },     {       "cell_type": "markdown",       "metadata": {         "id": "EGSFQInVD6ws",         "colab_type": "text"       },       "source": [         "### Data first
",         " 
",         "Let's bring in our two files and libraries.  The ratings file is understandably large.  So it's a good idea to download it and then save a copy as something else and work with that.  This way if you mess up you don't have to download it all over again
"       ]     },     {       "cell_type": "code",       "metadata": {         "id": "39GB_QJjD57f",         "colab_type": "code",         "colab": {}       },       "source": [         "# Libraries
",         "import pandas as pd
",         "from datetime import datetime
",         "import seaborn as sns
",         "import matplotlib.pyplot as plt
",         "import numpy as np
",         "
"       ],       "execution_count": null,       "outputs": []     },     {       "cell_type": "code",       "metadata": {         "id": "6MZ0784dEfia",         "colab_type": "code",         "colab": {}       },       "source": [         "# Movies data
",         "movies = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/movies.csv')"       ],       "execution_count": null,       "outputs": []     },     {       "cell_type": "code",       "metadata": {         "id": "p0yc4ooJdcUC",         "colab_type": "code",         "colab": {}       },       "source": [         "# Ratings data
",         "ratings = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/ratings.csv')"       ],       "execution_count": null,       "outputs": []     },     {       "cell_type": "code",       "metadata": {         "id": "qI52SkIGPmAW",         "colab_type": "code",         "colab": {}       },       "source": [         "# Make a copy of ratings
",         "ratings_backup = ratings.copy"       ],       "execution_count": null,       "outputs": []     },     {       "cell_type": "markdown",       "metadata": {         "id": "iPR1QogNFAgm",         "colab_type": "text"       },       "source": [         "## Explore your data - 1 point
",         " 
",         "Below take some time to explore your date.  Check out the following items.
",         " 
",         "* Head and tail of both datasets
",         "* Shape
",         "* Datatypes
",         "* The number of NaN values in the rating column of the ratings dataset
",         " 
",         "**Task** Do the head,tail, and shape operations all in one cell.  Count the number of NaNs in another. "       ]     },     {       "cell_type": "code",       "metadata": {         "id": "E6hE7ikTYeGk",         "colab_type": "code",         "colab": {}       },       "source": [         "# head, tail, shape of both datastes
",         "..."       ],       "execution_count": null,       "outputs": []     },     {       "cell_type": "code",       "metadata": {         "id": "NIxsJVC1F7jK",         "colab_type": "code",         "colab": {}       },       "source": [         "# Get count of NaNs
",         "..."       ],

Abr Writing · Accepted Answer
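
Before the full notebook, here is a minimal, self-contained sketch of the same steps (copy, explore, count NaNs, convert the timestamp, aggregate, join) on a tiny made-up sample. The frames `movies_small` and `ratings_small` and all of their values are assumptions for illustration only, not the MovieLens data; only the column names mirror the ones used in the assignment.

```python
import pandas as pd

# Hypothetical stand-ins for the movies and ratings files (made-up values).
movies_small = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
ratings_small = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [1, 2, 1, 2],
    "rating": [5.0, 3.0, 4.0, None],          # one missing rating on purpose
    "timestamp": [0, 86400, 172800, 1490195805],
})

# .copy() has to be called; without the parentheses the name is bound
# to the method object rather than to a copy of the data.
ratings_backup = ratings_small.copy()

# Shapes of both frames and the NaN count in the rating column.
shapes = (movies_small.shape, ratings_small.shape)
nan_count = int(ratings_small["rating"].isna().sum())

# Vectorized conversion of epoch seconds to datetimes for the whole column.
ratings_small["review_dt"] = pd.to_datetime(ratings_small["timestamp"], unit="s")

# Per-movie mean, standard deviation and count of ratings.
ratings_by_movie = (
    ratings_small.groupby("movieId")["rating"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
ratings_by_movie.columns = ["movieId", "rating_mean", "rating_std", "rating_count"]

# Left join: every movie is kept, including ones without any rating.
movies_with_ratings = movies_small.merge(
    ratings_by_movie, how="left", on="movieId", validate="1:1"
)

print(shapes, nan_count)
print(movies_with_ratings)
```

The left join is a deliberate choice in this sketch: a movie with no ratings (here `Movie C`) stays in the result with NaN statistics instead of being dropped.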

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Hm-lEcSS1qVm"
   },
   "source": [
    "# HW2 - Transforms!
",
    " 
",
    "In this homework we'll be practicing some data aggregations, joins, and flattening JSON files in order to make inference from two datasets!
",
    " 
",
    "The first is a dataset with 25,000,000 movie reviews.  These are already in flat files and are clean.  But, we obviously want to do some filtering to remove movies that only have a review or two.  And we'll want to do a couple of aggregations and joins to distill down those 25 million reviews into something understandable and useful.  
",
    " 
",
    "The other dataset will have you going to Yelp and making a dev account.  You'll then access various bits of information about local businesses and their reviews in order to figure out which ones are doing the best in Covid times.  
"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "4dK6_6V0_VY0"
   },
   "source": [
    "## Wrangling and aggregating movie review data
",
    " 
",
    "The website MovieLens.com has a research group which provides open access to millions of reviews, for free!  We're going to work with those data for this homework.  [Feel free to check out the website here.](https://grouplens.org/datasets/movielens/)  You can go and download the raw data, but in this case I was nice and tossed it all on Amazon S3 for fast direct downloads.  
",
    " 
",
    "There are two datasets we'll be working with.
",
    "* movies - this is a file of 60,000+ movies
",
    "* reviews - this is a file of 25 *million* individual reviews for the 60k movies
",
    " 
",
    "The goal for this section of the homework is to do two types of data aggregations that will allow for someone to make inference on which movies were the most popular, reviewed, polarizing and were cult classics.  
",
    " 
",
    "To do this we'll first start by making a simple data set that brings just overall review properties together with movies.  We'll then do some deeper groping to create a dataset that looks at the same properties but over time. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "EGSFQInVD6ws"
   },
   "source": [
    "### Data first
",
    " 
",
    "Let's bring in our two files and libraries.  The ratings file is understandably large.  So it's a good idea to download it and then save a copy as something else and work with that.  This way if you mess up you don't have to download it all over again
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "39GB_QJjD57f"
   },
   "outputs": [],
   "source": [
    "# Libraries
",
    "import pandas as pd
",
    "from datetime import datetime
",
    "import seaborn as sns
",
    "import matplotlib.pyplot as plt
",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "6MZ0784dEfia"
   },
   "outputs": [],
   "source": [
    "# Movies data
",
    "movies = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/movies.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "p0yc4ooJdcUC"
   },
   "outputs": [],
   "source": [
    "# Ratings data
",
    "ratings = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/ratings.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qI52SkIGPmAW"
   },
   "outputs": [],
   "source": [
    "# Make a copy of ratings
",
    "ratings_backup = ratings.copy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "iPR1QogNFAgm"
   },
   "source": [
    "## Explore your data - 1 point
",
    " 
",
    "Below take some time to explore your date.  Check out the following items.
",
    " 
",
    "* Head and tail of both datasets
",
    "* Shape
",
    "* Datatypes
",
    "* The number of NaN values in the rating column of the ratings dataset
",
    " 
",
    "**Task** Do the head,tail, and shape operations all in one cell.  Count the number of NaNs in another. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((62423, 3), (25000095, 4))"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies.shape, ratings.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
",
       "
",
       "    .dataframe tbody tr th:only-of-type {
",
       "        vertical-align: middle;
",
       "    }
",
       "
",
       "    .dataframe tbody tr th {
",
       "        vertical-align: top;
",
       "    }
",
       "
",
       "    .dataframe thead th {
",
       "        text-align: right;
",
       "    }
",
       "
",
       "
",
       "  
",
       "    
",
       "      
",
       "      movieId
",
       "      title
",
       "      genres
",
       "    
",
       "  
",
       "  
",
       "    
",
       "      0
",
       "      1
",
       "      Toy Story (1995)
",
       "      Adventure|Animation|Children|Comedy|Fantasy
",
       "    
",
       "    
",
       "      1
",
       "      2
",
       "      Jumanji (1995)
",
       "      Adventure|Children|Fantasy
",
       "    
",
       "    
",
       "      2
",
       "      3
",
       "      Grumpier Old Men (1995)
",
       "      Comedy|Romance
",
       "    
",
       "    
",
       "      3
",
       "      4
",
       "      Waiting to Exhale (1995)
",
       "      Comedy|Drama|Romance
",
       "    
",
       "    
",
       "      4
",
       "      5
",
       "      Father of the Bride Part II (1995)
",
       "      Comedy
",
       "    
",
       "  
",
       "
",
       ""
      ],
      "text/plain": [
       "   movieId                               title  \
",
       "0        1                    Toy Story (1995)   
",
       "1        2                      Jumanji (1995)   
",
       "2        3             Grumpier Old Men (1995)   
",
       "3        4            Waiting to Exhale (1995)   
",
       "4        5  Father of the Bride Part II (1995)   
",
       "
",
       "                                        genres  
",
       "0  Adventure|Animation|Children|Comedy|Fantasy  
",
       "1                   Adventure|Children|Fantasy  
",
       "2                               Comedy|Romance  
",
       "3                         Comedy|Drama|Romance  
",
       "4                                       Comedy  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "
",
       "
",
       "    .dataframe tbody tr th:only-of-type {
",
       "        vertical-align: middle;
",
       "    }
",
       "
",
       "    .dataframe tbody tr th {
",
       "        vertical-align: top;
",
       "    }
",
       "
",
       "    .dataframe thead th {
",
       "        text-align: right;
",
       "    }
",
       "
",
       "
",
       "  
",
       "    
",
       "      
",
       "      userId
",
       "      movieId
",
       "      rating
",
       "      timestamp
",
       "    
",
       "  
",
       "  
",
       "    
",
       "      0
",
       "      1
",
       "      296
",
       "      5.0
",
       "      1147880044
",
       "    
",
       "    
",
       "      1
",
       "      1
",
       "      306
",
       "      3.5
",
       "      1147868817
",
       "    
",
       "    
",
       "      2
",
       "      1
",
       "      307
",
       "      5.0
",
       "      1147868828
",
       "    
",
       "    
",
       "      3
",
       "      1
",
       "      665
",
       "      5.0
",
       "      1147878820
",
       "    
",
       "    
",
       "      4
",
       "      1
",
       "      899
",
       "      3.5
",
       "      1147868510
",
       "    
",
       "  
",
       "
",
       ""
      ],
      "text/plain": [
       "   userId  movieId  rating   timestamp
",
       "0       1      296     5.0  1147880044
",
       "1       1      306     3.5  1147868817
",
       "2       1      307     5.0  1147868828
",
       "3       1      665     5.0  1147878820
",
       "4       1      899     3.5  1147868510"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "E6hE7ikTYeGk"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(   movieId                               title  \
",
       " 0        1                    Toy Story (1995)   
",
       " 1        2                      Jumanji (1995)   
",
       " 2        3             Grumpier Old Men (1995)   
",
       " 3        4            Waiting to Exhale (1995)   
",
       " 4        5  Father of the Bride Part II (1995)   
",
       " 
",
       "                                         genres  
",
       " 0  Adventure|Animation|Children|Comedy|Fantasy  
",
       " 1                   Adventure|Children|Fantasy  
",
       " 2                               Comedy|Romance  
",
       " 3                         Comedy|Drama|Romance  
",
       " 4                                       Comedy  ,
",
       "        movieId                           title                  genres
",
       " 62418   209157                       We (2018)                   Drama
",
       " 62419   209159       Window of the Soul (2001)             Documentary
",
       " 62420   209163                Bad Poems (2018)            Comedy|Drama
",
       " 62421   209169             A Girl Thing (2001)      (no genres listed)
",
       " 62422   209171  Women of Devil's Island (1962)  Action|Adventure|Drama,
",
       "    userId  movieId  rating   timestamp
",
       " 0       1      296     5.0  1147880044
",
       " 1       1      306     3.5  1147868817
",
       " 2       1      307     5.0  1147868828
",
       " 3       1      665     5.0  1147878820
",
       " 4       1      899     3.5  1147868510,
",
       "           userId  movieId  rating   timestamp
",
       " 25000090  162541    50872     4.5  1240953372
",
       " 25000091  162541    55768     2.5  1240951998
",
       " 25000092  162541    56176     2.0  1240950697
",
       " 25000093  162541    58559     4.0  1240953434
",
       " 25000094  162541    63876     5.0  1240952515,
",
       " (62423, 3),
",
       " (25000095, 4))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# head, tail, shape of both datasets
",
    "movies.head(),movies.tail(),ratings.head(),ratings.tail(), movies.shape, ratings.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Movies DF 
",
      " movieId     int64
",
      "title      object
",
      "genres     object
",
      "dtype: object 
",
      "
",
      "Ratings DF 
",
      " userId         int64
",
      "movieId        int64
",
      "rating       float64
",
      "timestamp      int64
",
      "dtype: object
"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(None, None)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print("Movies DF \n",movies.dtypes,"\n"), print("Ratings DF \n",ratings.dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "NIxsJVC1F7jK"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of NaNs in 'rating' column 0
"
     ]
    }
   ],
   "source": [
    "# Get count of NaNs
",
    "print("Number of NaNs in 'rating' column {0}".format(ratings['rating'].isna().sum()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "IbY5XgE33Kmc"
   },
   "source": [
    "## Convert timestamp in ratings to a datetime. - 0.5 points
",
    " 
",
    "One issue that you can see from your exploration is that the ratings only have a timestamp.  This timestamp is measured in the number of seconds since 00:00:00 on January 1st, 1970.  You'll need to convert this to a datetime in order to actually do our later data aggregations.  
",
    " 
",
    "You use `pd.to_datetime` on timestamps like this.  [For full details on the various ways to use this function please look at the Pandas documentation.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) Briefly, this is how it's used. 
",
    " 
",
    "```
",
    ">>> pd.to_datetime(1490195805, unit='s')
",
    "Timestamp('2017-03-22 15:16:45')
",
    "```
",
    " 
",
    "**Task** Go and make a new column in the movies dataframe called `review_dt` that contains this timestamp but converted to a datetime.  
",
    " 
",
    "Note, it's worth doing this by assigning to a test vector first to make sure it works.  Once it works, then assign to the `reviews` dataframe. 
",
    " 
",
    "For example
",
    "```
",
    "test_vec = pd.to_datetime(arguments)
",
    "test_vec # to check what it contains
",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ecldZK15dqQC"
   },
   "outputs": [],
   "source": [
    "# Make test vector
",
    "test_vec = pd.to_datetime(ratings['timestamp'][0],unit='s')
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "0rWM0NKhowev"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Timestamp('2006-05-17 15:34:04')"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check vector and datatype to make sure it makes sense.  
",
    "test_vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "RBXoYE2Mo1pR"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "
",
       "
",
       "    .dataframe tbody tr th:only-of-type {
",
       "        vertical-align: middle;
",
       "    }
",
       "
",
       "    .dataframe tbody tr th {
",
       "        vertical-align: top;
",
       "    }
",
       "
",
       "    .dataframe thead th {
",
       "        text-align: right;
",
       "    }
",
       "
",
       "
",
       "  
",
       "    
",
       "      
",
       "      userId
",
       "      movieId
",
       "      rating
",
       "      timestamp
",
       "      review_dt
",
       "    
",
       "  
",
       "  
",
       "    
",
       "      0
",
       "      1
",
       "      296
",
       "      5.0
",
       "      1147880044
",
       "      2006-05-17 15:34:04
",
       "    
",
       "    
",
       "      1
",
       "      1
",
       "      306
",
       "      3.5
",
       "      1147868817
",
       "      2006-05-17 12:26:57
",
       "    
",
       "    
",
       "      2
",
       "      1
",
       "      307
",
       "      5.0
",
       "      1147868828
",
       "      2006-05-17 12:27:08
",
       "    
",
       "    
",
       "      3
",
       "      1
",
       "      665
",
       "      5.0
",
       "      1147878820
",
       "      2006-05-17 15:13:40
",
       "    
",
       "    
",
       "      4
",
       "      1
",
       "      899
",
       "      3.5
",
       "      1147868510
",
       "      2006-05-17 12:21:50
",
       "    
",
       "  
",
       "
",
       ""
      ],
      "text/plain": [
       "   userId  movieId  rating   timestamp           review_dt
",
       "0       1      296     5.0  1147880044 2006-05-17 15:34:04
",
       "1       1      306     3.5  1147868817 2006-05-17 12:26:57
",
       "2       1      307     5.0  1147868828 2006-05-17 12:27:08
",
       "3       1      665     5.0  1147878820 2006-05-17 15:13:40
",
       "4       1      899     3.5  1147868510 2006-05-17 12:21:50"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Now add this to the ratings dataframe (vectorized conversion of the whole column)
",
    "ratings['review_dt'] = pd.to_datetime(ratings['timestamp'], unit='s')
",
    "ratings.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle
",
    "with open('ratings_with_review_dt.pickle','wb') as handle:
",
    "    pickle.dump(ratings,handle,protocol=pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "5RQV_35CdqdS"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "
",
       "
",
       "    .dataframe tbody tr th:only-of-type {
",
       "        vertical-align: middle;
",
       "    }
",
       "
",
       "    .dataframe tbody tr th {
",
       "        vertical-align: top;
",
       "    }
",
       "
",
       "    .dataframe thead th {
",
       "        text-align: right;
",
       "    }
",
       "
",
       "
",
       "  
",
       "    
",
       "      
",
       "      userId
",
       "      movieId
",
       "      rating
",
       "      timestamp
",
       "      review_dt
",
       "    
",
       "  
",
       "  
",
       "    
",
       "      0
",
       "      1
",
       "      296
",
       "      5.0
",
       "      1147880044
",
       "      2006-05-17 15:34:04
",
       "    
",
       "    
",
       "      1
",
       "      1
",
       "      306
",
       "      3.5
",
       "      1147868817
",
       "      2006-05-17 12:26:57
",
       "    
",
       "    
",
       "      2
",
       "      1
",
       "      307
",
       "      5.0
",
       "      1147868828
",
       "      2006-05-17 12:27:08
",
       "    
",
       "    
",
       "      3
",
       "      1
",
       "      665
",
       "      5.0
",
       "      1147878820
",
       "      2006-05-17 15:13:40
",
       "    
",
       "    
",
       "      4
",
       "      1
",
       "      899
",
       "      3.5
",
       "      1147868510
",
       "      2006-05-17 12:21:50
",
       "    
",
       "  
",
       "
",
       ""
      ],
      "text/plain": [
       "   userId  movieId  rating   timestamp           review_dt
",
       "0       1      296     5.0  1147880044 2006-05-17 15:34:04
",
       "1       1      306     3.5  1147868817 2006-05-17 12:26:57
",
       "2       1      307     5.0  1147868828 2006-05-17 12:27:08
",
       "3       1      665     5.0  1147878820 2006-05-17 15:13:40
",
       "4       1      899     3.5  1147868510 2006-05-17 12:21:50"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check the head of your dataframe again
",
    "ratings.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "jt8syWKWdqji"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Oldest Review: 1995-01-09 11:46:49
",
      "Newest Review: 2019-11-21 09:15:03
"
     ]
    }
   ],
   "source": [
    "# What's the oldest and newest review? 
",
    "print("Oldest Review: {0}".format(ratings['review_dt'].min()))
",
    "print("Newest Review: {0}".format(ratings['review_dt'].max()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3lQ8tFNgpPjF"
   },
   "source": [
    "## Your first aggregation and join - 1.5 points
",
    "
",
    "The first aggregation and join I want you to do is just at the whole movie level.  Your movies dataframe just has a single row for each movie, but obviously there are thousands of individual reviews for each of those movies.  We want to come up with some summary statistics on the reviews for each movie and then join them to the movies dataframe.  
",
    "
",
    "**Task:** I want you to do the following:
",
    "* Create an aggregated dataframe called `ratings_by_movie`.  This dataframe should be grouped by each movie.  I want you to use `.agg()` to calculate the mean, standard deviation ('std') and the number of reviews for each movie.  
",
    "* rename the columns of that dataframe `'movieId', 'rating_mean', 'rating_std', and 'rating_count'`
",
    "* Join this new `ratings_by_movie` dataframe such that it attaches all those summary statistics to their corresponding movies.  Called the joined dataframe `movies_with_ratings`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "1diokGd0fghb"
   },
   "outputs": [],
   "source": [
    "# Make ratings_by_movie
",
    "ratings_by_movie = ratings.groupby('movieId', as_index=False).rating.agg(['mean','std','count']).reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "vaoGm6endqVu"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "
",
       "
",
       "    .dataframe tbody tr th:only-of-type {
",
       "        vertical-align: middle;
",
       "    }
",
       "
",
       "    .dataframe tbody tr th {
",
       "        vertical-align: top;
",
       "    }
",
       "
",
       "    .dataframe thead th {
",
       "        text-align: right;
",
       "    }
",
       "
",
       "
",
       "  
",
       "    
",
       "      
",
       "      movieId
",
       "      mean
",
       "      std
",
       "      count
",
       "    
",
       "  
",
       "  
",
       "    
",
       "      0
",
       "      1
",
       "      3.893708
",
       "      0.921552
",
       "      57309
",
       "    
",
       "    
",
       "      1
",
       "      2
",
       "      3.251527
",
       "      0.959851
",
       "      24228
",
       "    
",
       "    
",
       "      2
",
       "      3
",
       "      3.142028
",
       "      1.008443
",
       "      11804
",
       "    
",
       "    
",
       "      3
",
       "      4
",
       "      2.853547
",
       "      1.108531
",
       "      2523
",
       "    
",
       "    
",
       "      4
",
       "      5
",
       "      3.058434
",
       "      0.996611
",
       "      11714
",
       "    
",
       "  
",
       "
",
       ""
      ],
      "text/plain": [
       "   movieId      mean       std  count
",
       "0        1  3.893708  0.921552  57309
",
       "1        2  3.251527  0.959851  24228
",
       "2        3  3.142028  1.008443  11804
",
       "3        4  2.853547  1.108531   2523
",
       "4        5  3.058434  0.996611  11714"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check it
",
    "ratings_by_movie.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "eiNm1Qo4sUIr"
   },
   "outputs": [],
   "source": [
    "# Rename columns
",
    "ratings_by_movie.columns = ['movieId', 'rating_mean', 'rating_std','rating_count']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "gfwRPPcxrY71"
   },
   "outputs": [],
   "source": [
    "# Join it and call movies with ratings
",
    "movies_with_ratings = pd.merge(movies,
",
    "                              ratings_by_movie,
",
    "                              how='left',
",
    "                              on='movieId',
",
    "                              validate='1:1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_Ee4dKTTsIqd"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   movieId                               title  \
",
       "0        1                    Toy Story (1995)   
",
       "1        2                      Jumanji (1995)   
",
       "2        3             Grumpier Old Men (1995)   
",
       "3        4            Waiting to Exhale (1995)   
",
       "4        5  Father of the Bride Part II (1995)   
",
       "
",
       "                                        genres  rating_mean  rating_std  \
",
       "0  Adventure|Animation|Children|Comedy|Fantasy     3.893708    0.921552   
",
       "1                   Adventure|Children|Fantasy     3.251527    0.959851   
",
       "2                               Comedy|Romance     3.142028    1.008443   
",
       "3                         Comedy|Drama|Romance     2.853547    1.108531   
",
       "4                                       Comedy     3.058434    0.996611   
",
       "
",
       "   rating_count  
",
       "0       57309.0  
",
       "1       24228.0  
",
       "2       11804.0  
",
       "3        2523.0  
",
       "4       11714.0  "
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check movies_with_ratings.  
",
    "movies_with_ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "V7asUN4XtlA6"
   },
   "source": [
    "## Filtering and more transformations - 1 point
",
    " 
",
    "Now we want to clean up this dataset a bit and then do a couple more transforms. One issue you can see from your check above is that many movies only have one rating.  We want to set a minimum number of reviews needed to be included.  We also want to do some binning, where movies with certain rating levels are grouped into categories. 
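One detail worth checking before filtering: because the join above is a left join, a movie with no ratings at all gets `NaN` in the rating columns, and a comparison such as `>= 10` is `False` for `NaN`, so those rows drop out together with the sparsely rated ones. Here is a minimal sketch on made-up data; the toy frames `movies_toy` and `stats_toy` and their values are illustrative assumptions, not MovieLens values.

```python
import pandas as pd

# Toy stand-ins for `movies` and `ratings_by_movie` (hypothetical values).
movies_toy = pd.DataFrame({'movieId': [1, 2, 3],
                           'title': ['A', 'B', 'C']})
stats_toy = pd.DataFrame({'movieId': [1, 2],
                          'rating_count': [57, 4]})

# A left join keeps every movie; movie 3 has no stats, so its count is NaN.
joined = pd.merge(movies_toy, stats_toy, how='left', on='movieId', validate='1:1')

# NaN >= 10 evaluates to False, so the unrated movie is dropped as well.
kept = joined.loc[joined['rating_count'] >= 10]
print(kept)
```

This is also why `rating_count` shows up as a float column after the join: an integer column cannot hold `NaN` by default.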
",
    " 
",
    "**Task:** Please do the following operations
",
    "* Filter `movies_with_ratings` so it only contains movies that have at least 10 ratings
",
    "* Use the function `cut()` to automatically bin our `rating_mean` column into three groups of 'bad', 'fine', or 'good' movies.  Call this `rating_group`.
",
    "* Use the same function to take the standard deviation in rating and make three groups of 'agreement', 'average', 'controversial'.  Thus, movies with low standard deviation have agreement in the rating, while movies with high standard deviation have controversy in the ratings.  Call this column `ratings_agreement`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((62423, 6), (62423, 3))"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies_with_ratings.shape, movies.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "A-q2L_UDtoVU"
   },
   "outputs": [],
   "source": [
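    "# Filter first, being sure to overwrite the dataframe.
",
    "# .copy() makes the result an independent dataframe so new columns can be added later without a SettingWithCopyWarning.
",
    "movies_with_ratings = movies_with_ratings.loc[movies_with_ratings['rating_count'] >= 10].copy()"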
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "51jG155vvC64"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of rows left after filtering 'movies_with_ratings' : 24330
"
     ]
    }
   ],
   "source": [
    "# Check how many rows you're left with
",
    "print("Number of rows left after filtering 'movies_with_ratings' : {0}".format(movies_with_ratings.shape[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Gdax-0B9x2dN"
   },
   "source": [
    "I didn't show you how to use `cut()` in the lesson, but it's a transform just like anything else.  You could use an `np.where()` statement like we did, but `cut()` is a bit easier.  All it does is take a column as the first argument, the number of bins you want to group it into as the second argument, and then the labels you want to give those bins as the third.  It automatically divides the range of values into equal-width bins. 
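To see what the automatic binning does, `cut()` can also hand back the bin edges it computed. This is a small sketch on made-up scores (the `scores` list is an assumption chosen only for illustration):

```python
import pandas as pd

# Made-up scores used only for illustration.
scores = [1.0, 2.5, 3.0, 4.5, 5.0]

# retbins=True makes cut() also return the bin edges it computed.
groups, edges = pd.cut(scores, 3, labels=['bad', 'fine', 'good'], retbins=True)

print(list(groups))  # one label per score
print(edges)         # four edges delimiting three bins
```

The bins split the value range evenly, not the rows, so one bin can end up holding most of the data. If bins with roughly equal counts are wanted instead, `pd.qcut()` is the quantile-based counterpart.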
",
    " 
",
    "For example, if I make the following list:
",
    "```
",
    "rating = [2, 4, 9, 8, 5, 3, 6, 10, 2, 1, 6, 7]
",
    "```
",
    " 
",
    "And run `cut()` on it with three bins and levels 'bad', 'fine', and 'good':
",
    "```
",
    "pd.cut(rating, 3, labels=['bad', 'fine', 'good'])
",
    "```
",
    " 
",
    "I get a return of:
",
    "```
",
    "[bad, bad, good, good, fine, ..., good, bad, bad, fine, fine]
",
    "Length: 12
",
    "Categories (3, object): [bad < fine < good]
",
       "
",
       "    .dataframe tbody tr th:only-of-type {
",
       "        vertical-align: middle;
",
       "    }
",
       "
",
       "    .dataframe tbody tr th {
",
       "        vertical-align: top;
",
       "    }
",
       "
",
       "    .dataframe thead th {
",
       "        text-align: right;
",
       "    }
",
       "
",
       "
",
       "  
",
       "    
",
       "      
",
       "      movieId
",
       "      title
",
       "      genres
",
       "      rating_mean
",
       "      rating_std
",
       "      rating_count
",
       "      rating_group
",
       "    
",
       "  
",
       "  
",
       "    
",
       "      0
",
       "      1
",
       "      Toy Story (1995)
",
       "      Adventure|Animation|Children|Comedy|Fantasy
",
       "      3.893708
",
       "      0.921552
",
       "      57309.0
",
       "      good
",
       "    
",
       "    
",
       "      1
",
       "      2
",
       "      Jumanji (1995)
",
       "      Adventure|Children|Fantasy
",
       "      3.251527
",
       "      0.959851
",
       "      24228.0
",
       "      good
",
       "    
",
       "    
",
       "      2
",
       "      3
",
       "      Grumpier Old Men (1995)
",
       "      Comedy|Romance
",
       "      3.142028
",
       "      1.008443
",
       "      11804.0
",
       "      fine
",
       "    
",
       "    
",
       "      3
",
       "      4
",
       "      Waiting to Exhale (1995)
",
       "      Comedy|Drama|Romance
",
       "      2.853547
",
       "      1.108531
",
       "      2523.0
",
       "      fine
",
       "    
",
       "    
",
       "      4
",
       "      5
",
       "      Father of the Bride Part II (1995)
",
       "      Comedy
",
       "      3.058434
",
       "      0.996611
",
       "      11714.0
",
       "      fine
",
       "    
",
       "    
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "    
",
       "    
",
       "      61967
",
       "      207405
",
       "      Doctor Sleep (2019)
",
       "      Horror
",
       "      3.730769
",
       "      0.785109
",
       "      39.0
",
       "      good
",
       "    
",
       "    
",
       "      62027
",
       "      207612
",
       "      Tell Me Who I Am (2019)
",
       "      Documentary
",
       "      3.357143
",
       "      1.099450
",
       "      14.0
",
       "      good
",
       "    
",
       "    
",
       "      62088
",
       "      207830
",
       "      Terminator: Dark Fate (2019)
",
       "      Action|Sci-Fi
",
       "      3.372727
",
       "      0.794764
",
       "      55.0
",
       "      good
",
       "    
",
       "    
",
       "      62313
",
       "      208715
",
       "      Let It Snow (2019)
",
       "      Comedy|Romance
",
       "      2.933333
",
       "      1.222799
",
       "      15.0
",
       "      fine
",
       "    
",
       "    
",
       "      62319
",
       "      208737
",
       "      Midway (2019)
",
       "      Action|Drama|War
",
       "      3.333333
",
       "      0.887625
",
       "      12.0
",
       "      good
",
       "    
",
       "  
",
       "
",
       "24330 rows × 7 columns
",
       ""
      ],
      "text/plain": [
       "       movieId                               title  \
",
       "0            1                    Toy Story (1995)   
",
       "1            2                      Jumanji (1995)   
",
       "2            3             Grumpier Old Men (1995)   
",
       "3            4            Waiting to Exhale (1995)   
",
       "4            5  Father of the Bride Part II (1995)   
",
       "...        ...                                 ...   
",
       "61967   207405                 Doctor Sleep (2019)   
",
       "62027   207612             Tell Me Who I Am (2019)   
",
       "62088   207830        Terminator: Dark Fate (2019)   
",
       "62313   208715                  Let It Snow (2019)   
",
       "62319   208737                       Midway (2019)   
",
       "
",
       "                                            genres  rating_mean  rating_std  \
",
       "0      Adventure|Animation|Children|Comedy|Fantasy     3.893708    0.921552   
",
       "1                       Adventure|Children|Fantasy     3.251527    0.959851   
",
       "2                                   Comedy|Romance     3.142028    1.008443   
",
       "3                             Comedy|Drama|Romance     2.853547    1.108531   
",
       "4                                           Comedy     3.058434    0.996611   
",
       "...                                            ...          ...         ...   
",
       "61967                                       Horror     3.730769    0.785109   
",
       "62027                                  Documentary     3.357143    1.099450   
",
       "62088                                Action|Sci-Fi     3.372727    0.794764   
",
       "62313                               Comedy|Romance     2.933333    1.222799   
",
       "62319                             Action|Drama|War     3.333333    0.887625   
",
       "
",
       "       rating_count rating_group  
",
       "0           57309.0         good  
",
       "1           24228.0         good  
",
       "2           11804.0         fine  
",
       "3            2523.0         fine  
",
       "4           11714.0         fine  
",
       "...             ...          ...  
",
       "61967          39.0         good  
",
       "62027          14.0         good  
",
       "62088          55.0         good  
",
       "62313          15.0         fine  
",
       "62319          12.0         good  
",
       "
",
       "[24330 rows x 7 columns]"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check it
",
    "movies_with_ratings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "xZpmz44l0AZB"
   },
   "outputs": [],
   "source": [
    "# Now use cut() again to create your ratings_agreement column.  
",
    "# Use three bins and order of 'agreement', 'average',  and 'controversial'
",
    "movies_with_ratings['ratings_agreement'] = pd.cut(movies_with_ratings['rating_std'], 3, labels=['agreement', 'average','controversial'])
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "       movieId                               title  \
",
       "0            1                    Toy Story (1995)   
",
       "1            2                      Jumanji (1995)   
",
       "2            3             Grumpier Old Men (1995)   
",
       "3            4            Waiting to Exhale (1995)   
",
       "4            5  Father of the Bride Part II (1995)   
",
       "...        ...                                 ...   
",
       "61967   207405                 Doctor Sleep (2019)   
",
       "62027   207612             Tell Me Who I Am (2019)   
",
       "62088   207830        Terminator: Dark Fate (2019)   
",
       "62313   208715                  Let It Snow (2019)   
",
       "62319   208737                       Midway (2019)   
",
       "
",
       "                                            genres  rating_mean  rating_std  \
",
       "0      Adventure|Animation|Children|Comedy|Fantasy     3.893708    0.921552   
",
       "1                       Adventure|Children|Fantasy     3.251527    0.959851   
",
       "2                                   Comedy|Romance     3.142028    1.008443   
",
       "3                             Comedy|Drama|Romance     2.853547    1.108531   
",
       "4                                           Comedy     3.058434    0.996611   
",
       "...                                            ...          ...         ...   
",
       "61967                                       Horror     3.730769    0.785109   
",
       "62027                                  Documentary     3.357143    1.099450   
",
       "62088                                Action|Sci-Fi     3.372727    0.794764   
",
       "62313                               Comedy|Romance     2.933333    1.222799   
",
       "62319                             Action|Drama|War     3.333333    0.887625   
",
       "
",
       "       rating_count rating_group ratings_agreement  
",
       "0           57309.0         good           average  
",
       "1           24228.0         good           average  
",
       "2           11804.0         fine           average  
",
       "3            2523.0         fine           average  
",
       "4           11714.0         fine           average  
",
       "...             ...          ...               ...  
",
       "61967          39.0         good           average  
",
       "62027          14.0         good           average  
",
       "62088          55.0         good           average  
",
       "62313          15.0         fine           average  
",
       "62319          12.0         good           average  
",
       "
",
       "[24330 rows x 8 columns]"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies_with_ratings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SwMX5K530m-j"
   },
   "source": [
    "###  Exploring our data 
",
    " 
",
    "Making bins like this lets us find things like movies that are rated badly but that people disagree about.  For example, we can filter for movies that are in the bad category but have a lot of controversy about those ratings.  These could be 'cult classic' movies where, despite the low rating, some people actually really love them.  
",
    " 
",
    "A dataset like this could be used in a recommendation engine: if you see people liking these 'bad but good' movies, you could suggest others that meet the same criteria.  
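One way to get an overview before filtering is to count how many movies fall into each combination of the two bins. This sketch assumes the column names `rating_group` and `ratings_agreement` from the task above; the rows in `toy` are made up.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the homework.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rating_group': ['bad', 'bad', 'good', 'fine'],
    'ratings_agreement': ['controversial', 'agreement', 'average', 'average'],
})

# Count movies in each (rating_group, ratings_agreement) combination.
counts = pd.crosstab(toy['rating_group'], toy['ratings_agreement'])
print(counts)

# Select the 'bad but controversial' candidates with a combined boolean mask.
cult = toy[(toy['rating_group'] == 'bad') &
           (toy['ratings_agreement'] == 'controversial')]
print(cult['title'].tolist())
```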
",
    "
",
    "**Task:** There are no points here - only code to let you make figures to see what the dataset could be used for.  Also, if this code works it means your answers above are probably right :)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "8k2C-Lzs0Z33"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "       movieId                              title            genres  \
",
       "8618     26157    Manos: The Hands of Fate (1966)            Horror   
",
       "16964    89281  Birdemic: Shock and Terror (2010)  Romance|Thriller   
",
       "
",
       "       rating_mean  rating_std  rating_count rating_group ratings_agreement  
",
       "8618      1.798165    1.550268         109.0          bad     controversial  
",
       "16964     1.521739    1.503492         115.0          bad     controversial  "
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies_with_ratings[(movies_with_ratings['rating_group'] == 'bad') & 
",
    "                    (movies_with_ratings['ratings_agreement'] == 'controversial') & 
",
    "                    (movies_with_ratings['rating_count'] >= 100)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "zPBvEkD42NDF"
   },
   "source": [
    "## Grouping within years - 1 point
",
    " 
",
    "Now that we've done our overall grouping by movie, let's get a bit more detail about these ratings. Specifically, let's engineer a dataset that breaks down the average rating not only by movie, but also by the year the person provided the review.  This would allow for someone to see which movies continue to do well over time, which ones become more popular, and which ones don't age well!
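The idea of summarising within each movie and review year can be sketched on a tiny made-up frame first. The frame `ratings_toy` and its values are assumptions; only the column names `movieId`, `rating`, and `review_dt` follow the homework.

```python
import pandas as pd

# Four hypothetical reviews across two movies and two years.
ratings_toy = pd.DataFrame({
    'movieId': [1, 1, 1, 2],
    'rating': [4.0, 5.0, 3.0, 2.0],
    'review_dt': pd.to_datetime(['2000-01-05', '2000-06-01',
                                 '2001-03-10', '2001-07-04']),
})

# Group by movie and by the calendar year of each review, then summarise.
by_year = (ratings_toy
           .groupby(['movieId', ratings_toy['review_dt'].dt.year])['rating']
           .agg(['mean', 'std', 'count'])
           .reset_index())
print(by_year)
```

Groups with a single review get `NaN` for the standard deviation, which is one reason a minimum-count filter is useful afterwards.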
",
    " 
",
    "**Task:** You're going to do the following steps:
",
    "* Create a new `ratings_by_movie` that groups by both `movieId` and your `review_dt` column.  I want you to group into year intervals.  
",
    "* Join `movies` to `ratings_by_movie` so that you have the summary review statistics for each year the movie has been out
",
    "* Clean up and filter your dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tioZCC2YB7Qa"
   },
   "source": [
    "First, create `ratings_by_movie`.  You can group by two levels by just adding a list of what levels you want to group by in the `groupby()` statement.  I'll give you some help there, but you have to complete the rest in order to group by movieId first and review_dt second. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "nXvUmhqxyiAp"
   },
   "outputs": [],
   "source": [
    "# Note the groupby syntax.  I am first grouping by movieId,
",
    "# but then also calling dt.year on our datetime column.
",
    "# This tells pandas to do the aggregations within each year as well.
",
    "ratings_by_movie = ratings.groupby(['movieId', ratings['review_dt'].dt.year]).rating.agg(['mean','std','count']).reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "jZ8EJYtD0QXM"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "        movieId  review_dt      mean       std  count
",
       "0             1       1996  4.132756  0.884990   6237
",
       "1             1       1997  3.872500  0.894079   6000
",
       "2             1       1998  3.889515  0.944972    887
",
       "3             1       1999  3.975140  0.913960   2494
",
       "4             1       2000  4.136634  0.865127   3535
",
       "...         ...        ...       ...       ...    ...
",
       "323732   209157       2019  1.500000       NaN      1
",
       "323733   209159       2019  3.000000       NaN      1
",
       "323734   209163       2019  4.500000       NaN      1
",
       "323735   209169       2019  3.000000       NaN      1
",
       "323736   209171       2019  3.000000       NaN      1
",
       "
",
       "[323737 rows x 5 columns]"
      ]
     },
     "execution_count": 107,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check
",
    "ratings_by_movie"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "owv03D5lCh6s"
   },
   "source": [
    "You need to rename columns, but this is a bit trickier as you have two levels of your dataframe index. I'm going to give you the code below.  But, what it's doing is resetting that one level of the index `review_dt` and putting it back as a regular column.  I'm then renaming the resulting columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "J-Xp1BY_dp3r"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "        movieId  year  rating_mean  rating_std  rating_count
",
       "0             1  1996     4.132756    0.884990          6237
",
       "1             1  1997     3.872500    0.894079          6000
",
       "2             1  1998     3.889515    0.944972           887
",
       "3             1  1999     3.975140    0.913960          2494
",
       "4             1  2000     4.136634    0.865127          3535
",
       "...         ...   ...          ...         ...           ...
",
       "323732   209157  2019     1.500000         NaN             1
",
       "323733   209159  2019     3.000000         NaN             1
",
       "323734   209163  2019     4.500000         NaN             1
",
       "323735   209169  2019     3.000000         NaN             1
",
       "323736   209171  2019     3.000000         NaN             1
",
       "
",
       "[323737 rows x 5 columns]"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reset index
",
    "#ratings_by_movie = ratings_by_movie.reset_index(level = 'review_dt')
",
    "# Keep 'movieId' as a regular column rather than as the index; this makes the next steps easier.
",
    "ratings_by_movie.columns = ['movieId','year', 'rating_mean', 'rating_std', 'rating_count']
",
    "ratings_by_movie"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "q9EFQCOshC4U"
   },
   "outputs": [],
   "source": [
    "# Now join the movies dataframe onto ratings_by_movie
",
    "movies_with_ratings = pd.merge(ratings_by_movie,
",
    "                               movies,
",
    "                               how='left',
",
    "                               on='movieId',
",
    "                               validate='m:1'
",
    "                              )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "yBLjjJ880vvh"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "  
",
       "    
",
       "      
",
       "      movieId
",
       "      year
",
       "      rating_mean
",
       "      rating_std
",
       "      rating_count
",
       "      title
",
       "      genres
",
       "    
",
       "  
",
       "  
",
       "    
",
       "      0
",
       "      1
",
       "      1996
",
       "      4.132756
",
       "      0.884990
",
       "      6237
",
       "      Toy Story (1995)
",
       "      Adventure|Animation|Children|Comedy|Fantasy
",
       "    
",
       "    
",
       "      1
",
       "      1
",
       "      1997
",
       "      3.872500
",
       "      0.894079
",
       "      6000
",
       "      Toy Story (1995)
",
       "      Adventure|Animation|Children|Comedy|Fantasy
",
       "    
",
       "    
",
       "      2
",
       "      1
",
       "      1998
",
       "      3.889515
",
       "      0.944972
",
       "      887
",
       "      Toy Story (1995)
",
       "      Adventure|Animation|Children|Comedy|Fantasy
",
       "    
",
       "    
",
       "      3
",
       "      1
",
       "      1999
",
       "      3.975140
",
       "      0.913960
",
       "      2494
",
       "      Toy Story (1995)
",
       "      Adventure|Animation|Children|Comedy|Fantasy
",
       "    
",
       "    
",
       "      4
",
       "      1
",
       "      2000
",
       "      4.136634
",
       "      0.865127
",
       "      3535
",
       "      Toy Story (1995)
",
       "      Adventure|Animation|Children|Comedy|Fantasy
",
       "    
",
       "    
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "      ...
",
       "    
",
       "    
",
       "      323732
",
       "      209157
",
       "      2019
",
       "      1.500000
",
       "      NaN
",
       "      1
",
       "      We (2018)
",
       "      Drama
",
       "    
",
       "    
",
       "      323733
",
       "      209159
",
       "      2019
",
       "      3.000000
",
       "      NaN
",
       "      1
",
       "      Window of the Soul (2001)
",
       "      Documentary
",
       "    
",
       "    
",
       "      323734
",
       "      209163
",
       "      2019
",
       "      4.500000
",
       "      NaN
",
       "      1
",
       "      Bad Poems (2018)
",
       "      Comedy|Drama
",
       "    
",
       "    
",
       "      323735
",
       "      209169
",
       "      2019
",
       "      3.000000
",
       "      NaN
",
       "      1
",
       "      A Girl Thing (2001)
",
       "      (no genres listed)
",
       "    
",
       "    
",
       "      323736
",
       "      209171
",
       "      2019
",
       "      3.000000
",
       "      NaN
",
       "      1
",
       "      Women of Devil's Island (1962)
",
       "      Action|Adventure|Drama
",
       "    
",
       "  
",
       "
",
       "323737 rows × 7 columns
",
       ""
      ],
      "text/plain": [
       "        movieId  year  rating_mean  rating_std  rating_count  \
",
       "0             1  1996     4.132756    0.884990          6237   
",
       "1             1  1997     3.872500    0.894079          6000   
",
       "2             1  1998     3.889515    0.944972           887   
",
       "3             1  1999     3.975140    0.913960          2494   
",
       "4             1  2000     4.136634    0.865127          3535   
",
       "...         ...   ...          ...         ...           ...   
",
       "323732   209157  2019     1.500000         NaN             1   
",
       "323733   209159  2019     3.000000         NaN             1   
",
       "323734   209163  2019     4.500000         NaN             1   
",
       "323735   209169  2019     3.000000         NaN             1   
",
       "323736   209171  2019     3.000000         NaN             1   
",
       "
",
       "                                 title  \
",
       "0                     Toy Story (1995)   
",
       "1                     Toy Story (1995)   
",
       "2                     Toy Story (1995)   
",
       "3                     Toy Story (1995)   
",
       "4                     Toy Story (1995)   
",
       "...                                ...   
",
       "323732                       We (2018)   
",
       "323733       Window of the Soul (2001)   
",
       "323734                Bad Poems (2018)   
",
       "323735             A Girl Thing (2001)   
",
       "323736  Women of Devil's Island (1962)   
",
       "
",
       "                                             genres  
",
       "0       Adventure|Animation|Children|Comedy|Fantasy  
",


	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy

	userId	movieId	rating	timestamp
0	1	296	5.0	1147880044
1	1	306	3.5	1147868817
2	1	307	5.0	1147868828
3	1	665	5.0	1147878820
4	1	899	3.5	1147868510

	userId	movieId	rating	timestamp	review_dt
0	1	296	5.0	1147880044	2006-05-17 15:34:04
1	1	306	3.5	1147868817	2006-05-17 12:26:57
2	1	307	5.0	1147868828	2006-05-17 12:27:08
3	1	665	5.0	1147878820	2006-05-17 15:13:40
4	1	899	3.5	1147868510	2006-05-17 12:21:50
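
The `review_dt` column in the preview above looks like the integer `timestamp` read as seconds since the Unix epoch. A minimal sketch of that conversion, assuming pandas and using two rows copied from the preview (the `year` column is an illustrative extra, not part of that table):

```python
import pandas as pd

# Two rows copied from the ratings preview above.
ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [296, 306],
    "rating": [5.0, 3.5],
    "timestamp": [1147880044, 1147868817],
})

# Interpret the integer column as seconds since the Unix epoch (UTC).
ratings["review_dt"] = pd.to_datetime(ratings["timestamp"], unit="s")

# The .dt accessor exposes datetime parts such as the review year.
ratings["year"] = ratings["review_dt"].dt.year

print(ratings[["timestamp", "review_dt", "year"]])
```

Pulling the year out this way is one option for building a per-year grouping key later on.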

	movieId	mean	std	count
0	1	3.893708	0.921552	57309
1	2	3.251527	0.959851	24228
2	3	3.142028	1.008443	11804
3	4	2.853547	1.108531	2523
4	5	3.058434	0.996611	11714

	movieId	title	genres	rating_mean	rating_std	rating_count	rating_group	ratings_agreement
8618	26157	Manos: The Hands of Fate (1966)	Horror	1.798165	1.550268	109.0	bad	controversial
16964	89281	Birdemic: Shock and Terror (2010)	Romance|Thriller	1.521739	1.503492	115.0	bad	controversial

	movieId	review_dt	mean	std	count
0	1	1996	4.132756	0.884990	6237
1	1	1997	3.872500	0.894079	6000
2	1	1998	3.889515	0.944972	887
3	1	1999	3.975140	0.913960	2494
4	1	2000	4.136634	0.865127	3535
...	...	...	...	...	...
323732	209157	2019	1.500000	NaN	1
323733	209159	2019	3.000000	NaN	1
323734	209163	2019	4.500000	NaN	1
323735	209169	2019	3.000000	NaN	1
323736	209171	2019	3.000000	NaN	1
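
The per-movie, per-year table above can be reproduced in miniature. The sketch below is only an illustration under assumptions: the tiny `ratings` and `movies` frames are made-up stand-ins for the real data, and grouping on `review_dt.dt.year` is one plausible way to get the year level. It shows the two-level index produced by the group-by, `reset_index()` turning those levels back into columns, the rename, and the many-to-one merge.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the notebook's `ratings` and `movies`.
ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "rating": [4.0, 5.0, 3.0, 2.5],
    "review_dt": pd.to_datetime(
        ["1996-03-01", "1996-07-15", "1997-01-10", "1997-02-02"]
    ),
})
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
})

# Group by movie and review year; the result has a two-level row index.
ratings_by_movie = (
    ratings.groupby(["movieId", ratings["review_dt"].dt.year])["rating"]
    .agg(["mean", "std", "count"])
)

# Move both index levels back into ordinary columns, then rename everything.
ratings_by_movie = ratings_by_movie.reset_index()
ratings_by_movie.columns = [
    "movieId", "year", "rating_mean", "rating_std", "rating_count"
]

# validate="m:1" raises a MergeError if `movies` has duplicate movieId values.
movies_with_ratings = pd.merge(
    ratings_by_movie, movies, how="left", on="movieId", validate="m:1"
)

print(movies_with_ratings)
```

Groups with a single review get `NaN` for `rating_std`, which matches the `NaN` values in the 2019 rows above.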