{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "ISTA_322_HW2_blank.ipynb", "provenance": [], "collapsed_sections": [ "FLRf120imkkc", "NO73i3jGuu0F", "xhiS7p5zwkAe" ] },...

2 answer below ยป
https://colab.research.google.com/drive/1ReN6mNkuUE0WtzPbQJKL5SMivl10Rlfz?usp=sharing


{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "ISTA_322_HW2_blank.ipynb", "provenance": [], "collapsed_sections": [ "FLRf120imkkc", "NO73i3jGuu0F", "xhiS7p5zwkAe" ] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "Hm-lEcSS1qVm", "colab_type": "text" }, "source": [ "# HW2 - Transforms!\n", " \n", "In this homework we'll be practicing some data aggregations, joins, and flattening JSON files in order to make inference from two datasets!\n", " \n", "The first is a dataset with 25,000,000 movie reviews. These are already in flat files and are clean. But, we obviously want to do some filtering to remove movies that only have a review or two. And we'll want to do a couple of aggregations and joins to distill down those 25 million reviews into something understandable and useful. \n", " \n", "The other dataset will have you going to Yelp and making a dev account. You'll then access various bits of information about local businesses and their reviews in order to figure out which ones are doing the best in Covid times. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "4dK6_6V0_VY0", "colab_type": "text" }, "source": [ "## Wranging and aggregating movie review data\n", " \n", "The website MovieLens.com has a research group which provides open access to millions of reviews, for free! We're going to work with those data for this homework. [Feel free to check out the website here.](https://grouplens.org/datasets/movielens/) You can go and download the raw data, but in this case I was nice and tossed it all on Amazon S3 for fast direct downloads. \n", " \n", "There are two datasets we'll be working with.\n", "* movies - this is a file of 60,000+ movies\n", "* reviews - this is a file of 25 *million* individual reviews for the 60k movies\n", " \n", "The goal for this section of the homework is to do two types of data aggregations that will allow for someone to make inference on which movies were the most popular, reviewed, polarizing and were cult classics. \n", " \n", "To do this we'll first start by making a simple data set that brings just overall review properties together with movies. We'll then do some deeper groping to create a dataset that looks at the same properties but over time. " ] }, { "cell_type": "markdown", "metadata": { "id": "EGSFQInVD6ws", "colab_type": "text" }, "source": [ "### Data first\n", " \n", "Let's bring in our two files and libraries. The ratings file is understandably large. So it's a good idea to download it and then save a copy as something else and work with that. This way if you mess up you don't have to download it all over again\n" ] }, { "cell_type": "code", "metadata": { "id": "39GB_QJjD57f", "colab_type": "code", "colab": {} }, "source": [ "# Libraries\n", "import pandas as pd\n", "from datetime import datetime\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "6MZ0784dEfia", "colab_type": "code", "colab": {} }, "source": [ "# Movies data\n", "movies = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/movies.csv')" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "p0yc4ooJdcUC", "colab_type": "code", "colab": {} }, "source": [ "# Ratings data\n", "ratings = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/ratings.csv')" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "qI52SkIGPmAW", "colab_type": "code", "colab": {} }, "source": [ "# Make a copy of ratings\n", "ratings_backup = ratings.copy" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "iPR1QogNFAgm", "colab_type": "text" }, "source": [ "## Explore your data - 1 point\n", " \n", "Below take some time to explore your date. Check out the following items.\n", " \n", "* Head and tail of both datasets\n", "* Shape\n", "* Datatypes\n", "* The number of NaN values in the rating column of the ratings dataset\n", " \n", "**Task** Do the head,tail, and shape operations all in one cell. Count the number of NaNs in another. " ] }, { "cell_type": "code", "metadata": { "id": "E6hE7ikTYeGk", "colab_type": "code", "colab": {} }, "source": [ "# head, tail, shape of both datastes\n", "..." ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "NIxsJVC1F7jK", "colab_type": "code", "colab": {} }, "source": [ "# Get count of NaNs\n", "..." ],
Answered Same DaySep 22, 2021

Answer To: { "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "ISTA_322_HW2_blank.ipynb",...

Abr Writing answered on Sep 25 2021
150 Votes
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Hm-lEcSS1qVm"
},
"source": [
"# HW2 - Transforms!\n",
" \n",
"In this homework we'll be practicing some data aggregations, joins, and flattening JSON files in order to make inference from two datasets!\n",
" \n",
"The first is a dataset with 25,000,000 movie reviews. These are already in flat files and are clean. But, we obviously want to do some filtering to remove movies that only have a review or two. And we'll want to do a couple of aggregations and joins to distill down those 25 million reviews into something understandable and useful. \n",
" \n",
"The other dataset will have you going to Yelp and making a dev account. You'll then access various bits of information about local businesses and their reviews in order to figure out which ones are doing the best in Covid times. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4dK6_6V0_VY0"
},
"source": [
"## Wranging and aggregating movie review data\n",
" \n",
"The website MovieLens.com has a research group which provides open access to millions of reviews, for free! We're going to work with those data for this homework. [Feel free to check out the website here.](https://grouplens.org/datasets/movielens/) You can go and download the raw data, but in this case I was nice and tossed it all on Amazon S3 for fast direct downloads. \n",
" \n",
"There are two datasets we'll be working with.\n",
"* movies - this is a file of 60,000+ movies\n",
"* reviews - this is a file of 25 *million* individual reviews for the 60k movies\n",
" \n",
"The goal for this section of the homework is to do two types of data aggregations that will allow for someone to make inference on which movies were the most popular, reviewed, polarizing and were cult classics. \n",
" \n",
"To do this we'll first start by making a simple data set that brings just overall review properties together with movies. We'll then do some deeper groping to create a dataset that looks at the same properties but over time. "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "EGSFQInVD6ws"
},
"source": [
"### Data first\n",
" \n",
"Let's bring in our two files and libraries. The ratings file is understandably large. So it's a good idea to download it and then save a copy as something else and work with that. This way if you mess up you don't have to download it all over again\n"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "39GB_QJjD57f"
},
"outputs": [],
"source": [
"# Libraries\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "6MZ0784dEfia"
},
"outputs": [],
"source": [
"# Movies data\n",
"movies = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/movies.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "p0yc4ooJdcUC"
},
"outputs": [],
"source": [
"# Ratings data\n",
"ratings = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/ratings.csv')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "qI52SkIGPmAW"
},
"outputs": [],
"source": [
"# Make a copy of ratings\n",
"ratings_backup = ratings.copy"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "iPR1QogNFAgm"
},
"source": [
"## Explore your data - 1 point\n",
" \n",
"Below take some time to explore your date. Check out the following items.\n",
" \n",
"* Head and tail of both datasets\n",
"* Shape\n",
"* Datatypes\n",
"* The number of NaN values in the rating column of the ratings dataset\n",
" \n",
"**Task** Do the head,tail, and shape operations all in one cell. Count the number of NaNs in another. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((62423, 3), (25000095, 4))"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.shape, ratings.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
movieIdtitlegenres
01Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
12Jumanji (1995)Adventure|Children|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama|Romance
45Father of the Bride Part II (1995)Comedy
\n",
"
"
],
"text/plain": [
" movieId title \\\n",
"0 1 Toy Story (1995) \n",
"1 2 Jumanji (1995) \n",
"2 3 Grumpier Old Men (1995) \n",
"3 4 Waiting to Exhale (1995) \n",
"4 5 Father of the Bride Part II (1995) \n",
"\n",
" genres \n",
"0 Adventure|Animation|Children|Comedy|Fantasy \n",
"1 Adventure|Children|Fan
tasy \n",
"2 Comedy|Romance \n",
"3 Comedy|Drama|Romance \n",
"4 Comedy "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
userIdmovieIdratingtimestamp
012965.01147880044
113063.51147868817
213075.01147868828
316655.01147878820
418993.51147868510
\n",
"
"
],
"text/plain": [
" userId movieId rating timestamp\n",
"0 1 296 5.0 1147880044\n",
"1 1 306 3.5 1147868817\n",
"2 1 307 5.0 1147868828\n",
"3 1 665 5.0 1147878820\n",
"4 1 899 3.5 1147868510"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {},
"colab_type": "code",
"collapsed": true,
"id": "E6hE7ikTYeGk"
},
"outputs": [
{
"data": {
"text/plain": [
"( movieId title \\\n",
" 0 1 Toy Story (1995) \n",
" 1 2 Jumanji (1995) \n",
" 2 3 Grumpier Old Men (1995) \n",
" 3 4 Waiting to Exhale (1995) \n",
" 4 5 Father of the Bride Part II (1995) \n",
" \n",
" genres \n",
" 0 Adventure|Animation|Children|Comedy|Fantasy \n",
" 1 Adventure|Children|Fantasy \n",
" 2 Comedy|Romance \n",
" 3 Comedy|Drama|Romance \n",
" 4 Comedy ,\n",
" movieId title genres\n",
" 62418 209157 We (2018) Drama\n",
" 62419 209159 Window of the Soul (2001) Documentary\n",
" 62420 209163 Bad Poems (2018) Comedy|Drama\n",
" 62421 209169 A Girl Thing (2001) (no genres listed)\n",
" 62422 209171 Women of Devil's Island (1962) Action|Adventure|Drama,\n",
" userId movieId rating timestamp\n",
" 0 1 296 5.0 1147880044\n",
" 1 1 306 3.5 1147868817\n",
" 2 1 307 5.0 1147868828\n",
" 3 1 665 5.0 1147878820\n",
" 4 1 899 3.5 1147868510,\n",
" userId movieId rating timestamp\n",
" 25000090 162541 50872 4.5 1240953372\n",
" 25000091 162541 55768 2.5 1240951998\n",
" 25000092 162541 56176 2.0 1240950697\n",
" 25000093 162541 58559 4.0 1240953434\n",
" 25000094 162541 63876 5.0 1240952515,\n",
" (62423, 3),\n",
" (25000095, 4))"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# head, tail, shape of both datastes\n",
"movies.head(),movies.tail(),ratings.head(),ratings.tail(), movies.shape, ratings.shape"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Movies DF \n",
" movieId int64\n",
"title object\n",
"genres object\n",
"dtype: object \n",
"\n",
"Ratings DF \n",
" userId int64\n",
"movieId int64\n",
"rating float64\n",
"timestamp int64\n",
"dtype: object\n"
]
},
{
"data": {
"text/plain": [
"(None, None)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"Movies DF \\n\",movies.dtypes,\"\\n\"), print(\"Ratings DF \\n\",ratings.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "NIxsJVC1F7jK"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of NaNs in 'rating' column 0\n"
]
}
],
"source": [
"# Get count of NaNs\n",
"print(\"Number of NaNs in 'rating' column {0}\".format(ratings['rating'].isna().sum()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IbY5XgE33Kmc"
},
"source": [
"## Convert timestamp in ratings to a datetime. - 0.5 points\n",
" \n",
"One issue that you can see from your exploration is that the ratings only have a timestamp. This timestamp is measured in the number of seconds since 00:00:00 on January 1st, 1970. You'll need to convert this to a datetime in order to actually do our later data aggregations. \n",
" \n",
"You use `pd.to_datetime` on timestamps like this. [For full details on the various ways to use this function please look at the Pandas documentation.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) Briefly, this is how it's used. \n",
" \n",
"```\n",
">>> pd.to_datetime(1490195805, unit='s')\n",
"Timestamp('2017-03-22 15:16:45')\n",
"```\n",
" \n",
"**Task** Go and make a new column in the movies dataframe called `review_dt` that contains this timestamp but converted to a datetime. \n",
" \n",
"Note, it's worth doing this by assigning to a test vector first to make sure it works. Once it works, then assign to the `reviews` dataframe. \n",
" \n",
"For example\n",
"```\n",
"test_vec = pd.to_datetime(arguments)\n",
"test_vec # to check what it contains\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ecldZK15dqQC"
},
"outputs": [],
"source": [
"# Make test vector\n",
"test_vec = pd.to_datetime(ratings['timestamp'][0],unit='s')\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "0rWM0NKhowev"
},
"outputs": [
{
"data": {
"text/plain": [
"Timestamp('2006-05-17 15:34:04')"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check vector and datatype to make sure it make sense. \n",
"test_vec"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "RBXoYE2Mo1pR"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
userIdmovieIdratingtimestampreview_dt
012965.011478800442006-05-17 15:34:04
113063.511478688172006-05-17 12:26:57
213075.011478688282006-05-17 12:27:08
316655.011478788202006-05-17 15:13:40
418993.511478685102006-05-17 12:21:50
\n",
"
"
],
"text/plain": [
" userId movieId rating timestamp review_dt\n",
"0 1 296 5.0 1147880044 2006-05-17 15:34:04\n",
"1 1 306 3.5 1147868817 2006-05-17 12:26:57\n",
"2 1 307 5.0 1147868828 2006-05-17 12:27:08\n",
"3 1 665 5.0 1147878820 2006-05-17 15:13:40\n",
"4 1 899 3.5 1147868510 2006-05-17 12:21:50"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now ad this to the ratings dataframe\n",
"ratings['review_dt'] = ratings['timestamp'].apply(lambda x : pd.to_datetime(x,unit='s'))\n",
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"with open('ratings_with_review_dt.pickle','wb') as handle:\n",
" pickle.dump(ratings,handle,protocol=pickle.HIGHEST_PROTOCOL)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "5RQV_35CdqdS"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
userIdmovieIdratingtimestampreview_dt
012965.011478800442006-05-17 15:34:04
113063.511478688172006-05-17 12:26:57
213075.011478688282006-05-17 12:27:08
316655.011478788202006-05-17 15:13:40
418993.511478685102006-05-17 12:21:50
\n",
"
"
],
"text/plain": [
" userId movieId rating timestamp review_dt\n",
"0 1 296 5.0 1147880044 2006-05-17 15:34:04\n",
"1 1 306 3.5 1147868817 2006-05-17 12:26:57\n",
"2 1 307 5.0 1147868828 2006-05-17 12:27:08\n",
"3 1 665 5.0 1147878820 2006-05-17 15:13:40\n",
"4 1 899 3.5 1147868510 2006-05-17 12:21:50"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the head of your dataframe again\n",
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "jt8syWKWdqji"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Oldest Review: 1995-01-09 11:46:49\n",
"Newest Review: 2019-11-21 09:15:03\n"
]
}
],
"source": [
"# What's the oldest and newest review? \n",
"print(\"Oldest Review: {0}\".format(ratings['review_dt'].min()))\n",
"print(\"Newest Review: {0}\".format(ratings['review_dt'].max()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3lQ8tFNgpPjF"
},
"source": [
"## Your first aggregation and join - 1.5 points\n",
"\n",
"The first aggregation and join I want you to do is just at the whole movie level. Your movies dataframe just has a single row for each movie, but obviously there are thousands of individual reviews for each of those movies. We want to come up with some summary statistics on the reviews for each movie and then join them to the movies dataframe. \n",
"\n",
"**Task:** I want you to do the following:\n",
"* Create an aggregated dataframe called `ratings_by_movie`. This dataframe should be grouped by each movie. I want you to use `.agg()` to calculate the mean, standard deviation ('std') and the number of reviews for each movie. \n",
"* rename the columns of that dataframe `'movieId', 'rating_mean', 'rating_std', and 'rating_count'`\n",
"* Join this new `ratings_by_movie` dataframe such that it attaches all those summary statistics to their corresponding movies. Called the joined dataframe `movies_with_ratings`"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "1diokGd0fghb"
},
"outputs": [],
"source": [
"# Make ratings_by_movie\n",
"ratings_by_movie = ratings.groupby('movieId', as_index=False).rating.agg(['mean','std','count']).reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "vaoGm6endqVu"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
movieIdmeanstdcount
013.8937080.92155257309
123.2515270.95985124228
233.1420281.00844311804
342.8535471.1085312523
453.0584340.99661111714
\n",
"
"
],
"text/plain": [
" movieId mean std count\n",
"0 1 3.893708 0.921552 57309\n",
"1 2 3.251527 0.959851 24228\n",
"2 3 3.142028 1.008443 11804\n",
"3 4 2.853547 1.108531 2523\n",
"4 5 3.058434 0.996611 11714"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check it\n",
"ratings_by_movie.head()"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "eiNm1Qo4sUIr"
},
"outputs": [],
"source": [
"# Rename columns\n",
"ratings_by_movie.columns = ['movieId', 'rating_mean', 'rating_std','rating_count']"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "gfwRPPcxrY71"
},
"outputs": [],
"source": [
"# Join it and call movies with ratings\n",
"movies_with_ratings = pd.merge(movies,\n",
" ratings_by_movie,\n",
" how='left',\n",
" on='movieId',\n",
" validate='1:1')"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "_Ee4dKTTsIqd"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
movieIdtitlegenresrating_meanrating_stdrating_count
01Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy3.8937080.92155257309.0
12Jumanji (1995)Adventure|Children|Fantasy3.2515270.95985124228.0
23Grumpier Old Men (1995)Comedy|Romance3.1420281.00844311804.0
34Waiting to Exhale (1995)Comedy|Drama|Romance2.8535471.1085312523.0
45Father of the Bride Part II (1995)Comedy3.0584340.99661111714.0
\n",
"
"
],
"text/plain": [
" movieId title \\\n",
"0 1 Toy Story (1995) \n",
"1 2 Jumanji (1995) \n",
"2 3 Grumpier Old Men (1995) \n",
"3 4 Waiting to Exhale (1995) \n",
"4 5 Father of the Bride Part II (1995) \n",
"\n",
" genres rating_mean rating_std \\\n",
"0 Adventure|Animation|Children|Comedy|Fantasy 3.893708 0.921552 \n",
"1 Adventure|Children|Fantasy 3.251527 0.959851 \n",
"2 Comedy|Romance 3.142028 1.008443 \n",
"3 Comedy|Drama|Romance 2.853547 1.108531 \n",
"4 Comedy 3.058434 0.996611 \n",
"\n",
" rating_count \n",
"0 57309.0 \n",
"1 24228.0 \n",
"2 11804.0 \n",
"3 2523.0 \n",
"4 11714.0 "
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check movies_with_ratings. \n",
"movies_with_ratings.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "V7asUN4XtlA6"
},
"source": [
"## Filtering and more transformations - 1 point\n",
" \n",
"Now we want to clean up this dataset a bit and then do a couple more transforms. One issue you can see from your check above is that many movies only have one rating. We want to set a minimum number of reviews needed to be included. We also want to do some binning where movies with certain ratings levels \n",
" \n",
"**Task:** Please do the following operations\n",
"* Filter `movies_with_ratings` so it only contains movies that have at least 10 ratings\n",
"* Use the function `cut()` to automatically bin our `ratings_mean` column into three groups of 'bad', 'fine', or 'good' movies. Call this `rating_group`.\n",
"* Use the same function to take the standard deviation in rating and make three groups of 'agreement', 'average', 'controversial'. Thus, movies with low standard deviation have agreement in the rating, while movies with high standard deviation have controversy in the ratings. Call this column `ratings_agreement`."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((62423, 6), (62423, 3))"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_with_ratings.shape, movies.shape"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "A-q2L_UDtoVU"
},
"outputs": [],
"source": [
"# Filter first being sure to overwrite dataframe\n",
"movies_with_ratings = movies_with_ratings.loc[movies_with_ratings['rating_count'] >= 10]"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "51jG155vvC64"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows left after filtering 'movies_with_ratings' : 24330\n"
]
}
],
"source": [
"# Check how many rows you're left with\n",
"print(\"Number of rows left after filtering 'movies_with_ratings' : {0}\".format(movies_with_ratings.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Gdax-0B9x2dN"
},
"source": [
"I didn't show you how to use `cut()` in the lesson, but it's a transform just like anything else. You could use a `np.where()` statement like we did, but `cut()` is a bit easier. All it does is take a column as the first argument, the number of bins you want to group it in as the second argument, and then the labels you want to give those bins as the third. It automatically divides them up into equal sized bins. \n",
" \n",
"For example, if I make the following list:\n",
"```\n",
"rating = [2, 4, 9, 8, 5, 3, 6, 10, 2, 1, 6, 7]\n",
"```\n",
" \n",
"And run `cut()` on it with three bins and levels 'bad', 'fine', and 'good':\n",
"```\n",
"pd.cut(rating, 3, labels=['bad', 'fine', 'good'])\n",
"```\n",
" \n",
"I get a return of:\n",
"```\n",
"[bad, bad, good, good, fine, ..., good, bad, bad, fine, fine]\n",
"Length: 12\n",
"Categories (3, object): [bad < fine < good]\n",
"```\n",
" \n",
"Note how it orders them for you based on the order of the labels. "
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "SYVhXGaGx1-7"
},
"outputs": [
{
"data": {
"text/plain": [
"['bad', 'bad', 'good', 'good', 'fine', ..., 'good', 'bad', 'bad', 'fine', 'fine']\n",
"Length: 12\n",
"Categories (3, object): ['bad' < 'fine' < 'good']"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# If you want to test it!\n",
"rating = [2, 4, 9, 8, 5, 3, 6, 10, 2, 1, 6, 7]\n",
"pd.cut(rating, 3, labels=['bad', 'fine', 'good'])"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "3DVYIrbDvG4i"
},
"outputs": [],
"source": [
"# Now make 'bad', 'fine', 'good' levels for ratings\n",
"# Assign to new column called 'rating_group' \n",
"movies_with_ratings['rating_group'] = pd.cut(movies_with_ratings['rating_mean'], 3, labels=['bad', 'fine', 'good'])"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "VmC2I-kFwWgo"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
movieIdtitlegenresrating_meanrating_stdrating_countrating_group
01Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy3.8937080.92155257309.0good
12Jumanji (1995)Adventure|Children|Fantasy3.2515270.95985124228.0good
23Grumpier Old Men (1995)Comedy|Romance3.1420281.00844311804.0fine
34Waiting to Exhale (1995)Comedy|Drama|Romance2.8535471.1085312523.0fine
45Father of the Bride Part II (1995)Comedy3.0584340.99661111714.0fine
........................
61967207405Doctor Sleep (2019)Horror3.7307690.78510939.0good
62027207612Tell Me Who I Am (2019)Documentary3.3571431.09945014.0good
62088207830Terminator: Dark Fate (2019)Action|Sci-Fi3.3727270.79476455.0good
62313208715Let It Snow (2019)Comedy|Romance2.9333331.22279915.0fine
62319208737Midway (2019)Action|Drama|War3.3333330.88762512.0good
\n",
"

24330 rows รƒย— 7 columns

\n",
"
"
],
"text/plain": [
" movieId title \\\n",
"0 1 Toy Story (1995) \n",
"1 2 Jumanji (1995) \n",
"2 3 Grumpier Old Men (1995) \n",
"3 4 Waiting to Exhale (1995) \n",
"4 5 Father of the Bride Part II (1995) \n",
"... ... ... \n",
"61967 207405 Doctor Sleep (2019) \n",
"62027 207612 Tell Me Who I Am (2019) \n",
"62088 207830 Terminator: Dark Fate (2019) \n",
"62313 208715 Let It Snow (2019) \n",
"62319 208737 Midway (2019) \n",
"\n",
" genres rating_mean rating_std \\\n",
"0 Adventure|Animation|Children|Comedy|Fantasy 3.893708 0.921552 \n",
"1 Adventure|Children|Fantasy 3.251527 0.959851 \n",
"2 Comedy|Romance 3.142028 1.008443 \n",
"3 Comedy|Drama|Romance 2.853547 1.108531 \n",
"4 Comedy 3.058434 0.996611 \n",
"... ... ... ... \n",
"61967 Horror 3.730769 0.785109 \n",
"62027 Documentary 3.357143 1.099450 \n",
"62088 Action|Sci-Fi 3.372727 0.794764 \n",
"62313 Comedy|Romance 2.933333 1.222799 \n",
"62319 Action|Drama|War 3.333333 0.887625 \n",
"\n",
" rating_count rating_group \n",
"0 57309.0 good \n",
"1 24228.0 good \n",
"2 11804.0 fine \n",
"3 2523.0 fine \n",
"4 11714.0 fine \n",
"... ... ... \n",
"61967 39.0 good \n",
"62027 14.0 good \n",
"62088 55.0 good \n",
"62313 15.0 fine \n",
"62319 12.0 good \n",
"\n",
"[24330 rows x 7 columns]"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check it\n",
"movies_with_ratings"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "xZpmz44l0AZB"
},
"outputs": [],
"source": [
"# Now use cut() again to create your ratings_agreement column. \n",
"# Use three bins and order of 'agreement', 'average', and 'controversial'\n",
"movies_with_ratings['ratings_agreement'] = pd.cut(movies_with_ratings['rating_std'], 3, labels=['agreement', 'average','controversial'])\n"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
movieIdtitlegenresrating_meanrating_stdrating_countrating_groupratings_agreement
01Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy3.8937080.92155257309.0goodaverage
12Jumanji (1995)Adventure|Children|Fantasy3.2515270.95985124228.0goodaverage
23Grumpier Old Men (1995)Comedy|Romance3.1420281.00844311804.0fineaverage
34Waiting to Exhale (1995)Comedy|Drama|Romance2.8535471.1085312523.0fineaverage
45Father of the Bride Part II (1995)Comedy3.0584340.99661111714.0fineaverage
...........................
61967207405Doctor Sleep (2019)Horror3.7307690.78510939.0goodaverage
62027207612Tell Me Who I Am (2019)Documentary3.3571431.09945014.0goodaverage
62088207830Terminator: Dark Fate (2019)Action|Sci-Fi3.3727270.79476455.0goodaverage
62313208715Let It Snow (2019)Comedy|Romance2.9333331.22279915.0fineaverage
62319208737Midway (2019)Action|Drama|War3.3333330.88762512.0goodaverage
\n",
"

24330 rows รƒย— 8 columns

\n",
"
"
],
"text/plain": [
" movieId title \\\n",
"0 1 Toy Story (1995) \n",
"1 2 Jumanji (1995) \n",
"2 3 Grumpier Old Men (1995) \n",
"3 4 Waiting to Exhale (1995) \n",
"4 5 Father of the Bride Part II (1995) \n",
"... ... ... \n",
"61967 207405 Doctor Sleep (2019) \n",
"62027 207612 Tell Me Who I Am (2019) \n",
"62088 207830 Terminator: Dark Fate (2019) \n",
"62313 208715 Let It Snow (2019) \n",
"62319 208737 Midway (2019) \n",
"\n",
" genres rating_mean rating_std \\\n",
"0 Adventure|Animation|Children|Comedy|Fantasy 3.893708 0.921552 \n",
"1 Adventure|Children|Fantasy 3.251527 0.959851 \n",
"2 Comedy|Romance 3.142028 1.008443 \n",
"3 Comedy|Drama|Romance 2.853547 1.108531 \n",
"4 Comedy 3.058434 0.996611 \n",
"... ... ... ... \n",
"61967 Horror 3.730769 0.785109 \n",
"62027 Documentary 3.357143 1.099450 \n",
"62088 Action|Sci-Fi 3.372727 0.794764 \n",
"62313 Comedy|Romance 2.933333 1.222799 \n",
"62319 Action|Drama|War 3.333333 0.887625 \n",
"\n",
" rating_count rating_group ratings_agreement \n",
"0 57309.0 good average \n",
"1 24228.0 good average \n",
"2 11804.0 fine average \n",
"3 2523.0 fine average \n",
"4 11714.0 fine average \n",
"... ... ... ... \n",
"61967 39.0 good average \n",
"62027 14.0 good average \n",
"62088 55.0 good average \n",
"62313 15.0 fine average \n",
"62319 12.0 good average \n",
"\n",
"[24330 rows x 8 columns]"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_with_ratings"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "SwMX5K530m-j"
},
"source": [
"### Exploring our data \n",
" \n",
"Making bins like this allows us to figure out things like which movies are both bad, but have differing opinions on. For example, we can filter out movies that are in the bad category but have a lot of controversy about those ratings. This could mean they're 'cult classic' movies where despite the low rating some people actually really love the movies. \n",
" \n",
"A dataset like this could be used in a recommendation engine where if you see people liking these 'bad but good' movies you could suggest others that meet the came criteria. \n",
"\n",
"**Task:** There is no points here - only code to let you make figures to see what the dataset could be used for. Also, if this code works it means you probably did your answers above are right :)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "8k2C-Lzs0Z33"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
movieIdtitlegenresrating_meanrating_stdrating_countrating_groupratings_agreement
861826157Manos: The Hands of Fate (1966)Horror1.7981651.550268109.0badcontroversial
1696489281Birdemic: Shock and Terror (2010)Romance|Thriller1.5217391.503492115.0badcontroversial
\n",
"
"
],
"text/plain": [
" movieId title genres \\\n",
"8618 26157 Manos: The Hands of Fate (1966) Horror \n",
"16964 89281 Birdemic: Shock and Terror (2010) Romance|Thriller \n",
"\n",
" rating_mean rating_std rating_count rating_group ratings_agreement \n",
"8618 1.798165 1.550268 109.0 bad controversial \n",
"16964 1.521739 1.503492 115.0 bad controversial "
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_with_ratings[(movies_with_ratings['rating_group'] == 'bad') & \n",
" (movies_with_ratings['ratings_agreement'] == 'controversial') & \n",
" (movies_with_ratings['rating_count'] >= 100)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "zPBvEkD42NDF"
},
"source": [
"## Grouping within years - 1 point\n",
" \n",
"Now that we've done our overall grouping by movie, let's get a bit more detail about these ratings. Specifically, let's engineer a dataset that breaks down the average rating not only by movie, but also by the year the person provided the review. This would allow for someone to see which movies continue to do well over time, which ones become more popular, and which ones don't age well!\n",
" \n",
"**Task:** You're going to do the following steps:\n",
"* Create a new `ratings_by_movie` that groups both by `movieId` but also by your `review_dt` column. I want you to group into year intervals. \n",
"* Join `movies` to `ratings_by_movie` so that you have the summary review statistics for each year the movie has been out\n",
"* Clean up and filter your dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "tioZCC2YB7Qa"
},
"source": [
"First, create `ratings_by_movie`. You can group by two levels by just adding a list of what levels you want to group by in the `groupby()` statement. I'll give you some help there, but you have to complete the rest in order to group by movieId first and review_dt second. "
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "nXvUmhqxyiAp"
},
"outputs": [],
"source": [
"# Note the groupby syntax. I first am grouping by movieId\n",
"# But then also am calling dt.year on our datetime column\n",
"# This will then tell Python to do the aggregations within year as well\n",
"ratings_by_movie = ratings.groupby(['movieId',ratings['review_dt'].dt.year], as_index=False).rating.agg(['mean','std','count']).reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "jZ8EJYtD0QXM"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
movieIdreview_dtmeanstdcount
0119964.1327560.8849906237
1119973.8725000.8940796000
2119983.8895150.944972887
3119993.9751400.9139602494
4120004.1366340.8651273535
..................
32373220915720191.500000NaN1
32373320915920193.000000NaN1
32373420916320194.500000NaN1
32373520916920193.000000NaN1
32373620917120193.000000NaN1
\n",
"

323737 rows รƒย— 5 columns

\n",
"
"
],
"text/plain": [
" movieId review_dt mean std count\n",
"0 1 1996 4.132756 0.884990 6237\n",
"1 1 1997 3.872500 0.894079 6000\n",
"2 1 1998 3.889515 0.944972 887\n",
"3 1 1999 3.975140 0.913960 2494\n",
"4 1 2000 4.136634 0.865127 3535\n",
"... ... ... ... ... ...\n",
"323732 209157 2019 1.500000 NaN 1\n",
"323733 209159 2019 3.000000 NaN 1\n",
"323734 209163 2019 4.500000 NaN 1\n",
"323735 209169 2019 3.000000 NaN 1\n",
"323736 209171 2019 3.000000 NaN 1\n",
"\n",
"[323737 rows x 5 columns]"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check\n",
"ratings_by_movie"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "owv03D5lCh6s"
},
"source": [
"You need to rename columns, but this is a bit trickier as you have two levels of your dataframe index. I'm going to give you the code below. But, what it's doing is resetting that one level of the index `review_dt` and putting it back as a regular column. I'm then renaming the resulting columns. "
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "J-Xp1BY_dp3r"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
movieIdyearrating_meanrating_stdrating_count
0119964.1327560.8849906237
1119973.8725000.8940796000
2119983.8895150.944972887
3119993.9751400.9139602494
4120004.1366340.8651273535
..................
32373220915720191.500000NaN1
32373320915920193.000000NaN1
32373420916320194.500000NaN1
32373520916920193.000000NaN1
32373620917120193.000000NaN1
\n",
"

323737 rows รƒย— 5 columns

\n",
"
"
],
"text/plain": [
" movieId year rating_mean rating_std rating_count\n",
"0 1 1996 4.132756 0.884990 6237\n",
"1 1 1997 3.872500 0.894079 6000\n",
"2 1 1998 3.889515 0.944972 887\n",
"3 1 1999 3.975140 0.913960 2494\n",
"4 1 2000 4.136634 0.865127 3535\n",
"... ... ... ... ... ...\n",
"323732 209157 2019 1.500000 NaN 1\n",
"323733 209159 2019 3.000000 NaN 1\n",
"323734 209163 2019 4.500000 NaN 1\n",
"323735 209169 2019 3.000000 NaN 1\n",
"323736 209171 2019 3.000000 NaN 1\n",
"\n",
"[323737 rows x 5 columns]"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reset index\n",
"#ratings_by_movie = ratings_by_movie.reset_index(level = 'review_dt')\n",
"# Here we are making 'movieId' as a column and not making it as an index. This helps us in easier processing of next steps.\n",
"ratings_by_movie.columns = ['movieId','year', 'rating_mean', 'rating_std', 'rating_count']\n",
"ratings_by_movie"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "q9EFQCOshC4U"
},
"outputs": [],
"source": [
"# Now join the movie dataframe onto rating_by_movie\n",
"movies_with_ratings = pd.merge(ratings_by_movie,\n",
" movies,\n",
" how='left',\n",
" on='movieId',\n",
" validate='m:1'\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "yBLjjJ880vvh"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
" ...
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions ยป

Submit New Assignment

Copy and Paste Your Assignment Here
movieIdyearrating_meanrating_stdrating_counttitlegenres
0119964.1327560.8849906237Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
1119973.8725000.8940796000Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
2119983.8895150.944972887Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
3119993.9751400.9139602494Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
4120004.1366340.8651273535Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
........................
32373220915720191.500000NaN1We (2018)Drama
323733209159