Background Since your work with Jennifer on the SellBy project was so successful, you’ve been tasked with another, larger project: analyzing Amazon reviews written by members of the paid Amazon Vine...


Background


Since your work with Jennifer on the SellBy project was so successful, you’ve been tasked with another, larger project: analyzing Amazon reviews written by members of the paid Amazon Vine program. The Amazon Vine program is a service that allows manufacturers and publishers to receive reviews for their products. Companies like SellBy pay a small fee to Amazon and provide products to Amazon Vine members, who are then required to publish a review.


In this project, you’ll have access to approximately 50 datasets. Each one contains reviews of a specific product, from clothing apparel to wireless products. You’ll need to pick one of these datasets and use PySpark to perform the ETL process to extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into pgAdmin. Next, you’ll use PySpark, Pandas, or SQL to determine if there is any bias toward favorable reviews from Vine members in your dataset. Then, you’ll write a summary of the analysis for Jennifer to submit to the SellBy stakeholders.


What You're Creating


This new assignment consists of two technical analysis deliverables and a written report. You will submit the following:



  • Deliverable 1: Perform ETL on Amazon Product Reviews

  • Deliverable 2: Determine Bias of Vine Reviews

  • Deliverable 3: A Written Report on the Analysis (README.md)


Files


Use the following links to download the Challenge starter codes.


Download theSQL table schema(Links to an external site.).


Download theAmazon ETL starter code(Links to an external site.).


Before You Start


Create a new GitHub repository entitled "Amazon_Vine_Analysis", and initialize the repository with a README.


IMPORTANT

Be sure to regularly monitor your AWS usage so you don’t go above the free tier. Consult Lessons16.9.2and16.9.3for more on shutting down your AWS instances and checking your AWS billing, respectively.


Deliverable 1: Perform ETL on Amazon Product Reviews (40 points)


Deliverable 1 Instructions


Using your knowledge of the cloud ETL process, you’ll create an AWS RDS database with tables in pgAdmin, pick a dataset from theAmazon Review datasets(Links to an external site.), and extract the dataset into a DataFrame. You'll transform the DataFrame into four separate DataFrames that match the table schema in pgAdmin. Then, you'll upload the transformed data into the appropriate tables and run queries in pgAdmin to confirm that the data has been uploaded.


REWIND

For this deliverable, you’ve already done the following in this module:



Follow the instructions below to complete Deliverable 1.



  1. From the followingAmazon Review datasets(Links to an external site.), pick a dataset that you would like to analyze. All the datasets have the same schemata, as shown in this image:



The format and information of Amazon Review datasets columns.




  1. Create a new database with Amazon RDS just as you did in this module.




  2. In pgAdmin, create a new database in your Amazon RDS server that you just create.




  3. Download thechallenge_schema.sqlfile to your computer.




  4. In pgAdmin, run a new query to create the tables for your new database using the code from thechallenge_schema.sqlfile.



    • After you run the query, you should have the following four tables in your database:customers_table,products_table,review_id_table, andvine_table.




  5. Download theAmazon_Reviews_ETL_starter_code.ipynbfile, then upload the file as a Google Colab Notebook, and rename itAmazon_Reviews_ETL.




NOTE

If you try to open theAmazon_Reviews_ETL_starter_code.ipynbwith jupyter notebook it will give you an error.



  1. Firstextractone of the review datasets, then create a new DataFrame.

  2. Next, follow the steps below totransformthe dataset into four DataFrames that will match the schema in the pgAdmin tables:


NOTE

Some datasets have a large number of rows, which will affect the time it takes to complete the following steps.


The customers_table DataFrame


To create thecustomers_table, use the code in theAmazon_Reviews_ETL_starter_code.ipynbfile and follow the steps below to aggregate the reviews bycustomer_id.



  • Use thegroupby()function on thecustomer_idcolumn of the DataFrame you created in Step 6.

  • Count all the customer ids using theagg()function by chaining it to thegroupby()function. After you use this function, a new column will be created,count(customer_id).

  • Rename thecount(customer_id)column using thewithColumnRenamed()function so it matches the schema for thecustomers_tablein pgAdmin.

  • The finalcustomers_tableDataFrame should look like this:



The customers_table DataFrame,


The products_table DataFrame


To create theproducts_table, use theselect()function to select theproduct_idandproduct_title, then drop duplicates with thedrop_duplicates()function to retrieve only uniqueproduct_ids. Refer to the code snippet provided in theAmazon_Reviews_ETL_starter_code.ipynbfile for assistance.


The finalproducts_tableDataFrame should look like this:



The products_table DataFrame,


The review_id_table DataFrame


To create thereview_id_table, use theselect()function to select the columns that are in thereview_id_tablein pgAdmin (as shown in the following image), and convert thereview_datecolumn to a date using the code snippet provided in theAmazon_Reviews_ETL_starter_code.ipynbfile.


The finalreview_id_tableDataFrame should look like this:



The review_id_table DataFrame,


The vine_table DataFrame


To create thevine_table, use theselect()function to select only the columns that are in thevine_tablein pgAdmin (as shown in the following image).


The finalvine_tableDataFrame should look like this:



The vine_table DataFrame.


Load the DataFrames into pgAdmin



  1. Make the connection to your AWS RDS instance.

  2. Load the DataFrames that correspond to tables in pgAdmin.

  3. In pgAdmin, run a query to check that the tables have been populated.


IMPORTANT

Before uploading anything to GitHub be sure to remove all sensitive information such as passwords and connection strings. If you have accidentally done so already see thislink(Links to an external site.)for more information.


When you’re done, export yourAmazon_Reviews_ETLGoogle Colab Notebook as anipynbfile, and save it to your Amazon_Vine_Analysis GitHub repository.


NOTE

Uploading each DataFrame can take up to 10 minutes or longer, so it’s a good idea to double-check your work before uploading. If you have problems uploading your work, you may have to shut down the pgAdmin server and restart. Alternatively, you may have to delete the tables and create them again, then re-run yourAmazon_Reviews_ETLGoogle Colab Notebook.


IMPORTANT

Be sure that you don’t leave your RDS instance up too long. Try to get all your work for Deliverable 1 done in one sitting, then shut down your instance. Please consult the AWS clean-up videos for more information about shutting down your RDS instance. You will not be graded on anything contained strictly in your RDS, so be sure to shut it down.


Deliverable 1 Requirements


You will earn a perfect score for Deliverable 1 by completing all requirements below:



  • TheAmazon_Reviews_ETL.ipynbfile does the following:

    • An Amazon Review dataset is extracted as a DataFrame(10 pt)

    • The extracted dataset is transformed into four DataFrames with the correct columns(20 pt)

    • All four DataFrames are loaded into their respective tables in pgAdmin(10 pt)




Deliverable 2: Determine Bias of Vine Reviews (40 points)


Deliverable 2 Instructions


Using your knowledge of PySpark, Pandas, or SQL, you’ll determine if there is any bias towards reviews that were written as part of the Vine program. For this analysis, you'll determine if having a paid Vine review makes a difference in the percentage of 5-star reviews.


REWIND

For this deliverable, you’ve already done the following in this module using PySpark:



Using either PySpark, Pandas, or SQL, follow the instructions below to complete Deliverable 2.




  1. Filter the data and create a new DataFrame or table to retrieve all the rows where thetotal_votescount is equal to or greater than 20 to pick reviews that are more likely to be helpful and to avoid having division by zero errors later on.




  2. Filter the new DataFrame or table created in Step 1 and create a new DataFrame or table to retrieve all the rows where the number ofhelpful_votesdivided bytotal_votesis equal to or greater than 50%.



    • If you use the SQL option below, you’ll need to cast your columns as floats usingWHERE CAST(helpful_votes AS FLOAT)/CAST(total_votes AS FLOAT) >=0.5.




  3. Filter the DataFrame or table created in Step 2, and create a new DataFrame or table that retrieves all the rows where a review was written as part of the Vine program (paid),vine == 'Y'.




  4. Repeat Step 3, but this time retrieve all the rows where the review was not part of the Vine program (unpaid),vine == 'N'.




  5. Determine the total number of reviews, the number of 5-star reviews, and the percentage of 5-star reviews for the two types of review (paid vs unpaid).




NOTE

We recommend using either PySpark or Pandas to perform the analysis. Using SQL queries is more challenging, especially for Step 5 above, and is better suited for intermediate and experienced SQL programmers.


Using PySpark



  1. Create a new Google Colab Notebook, and name itVine_Review_Analysis.

  2. Extract the dataset you used in Deliverable 1.

  3. Recreate thevine_table, and perform your analysis using the steps above.

  4. Export yourVine_Review_AnalysisGoogle Colab Notebook as anipynbfile, and save it to your Amazon_Vine_Analysis GitHub repository.


Using Pandas



  1. From pgAdmin, export thevine_tableas a CSV file, and save it to your Amazon_Vine_Analysis GitHub repository.

  2. Create a new Jupyter Notebook, and name itVine_Review_Analysis.ipynb.

  3. Read in thevine_table.csvfile as a DataFrame, and perform your analysis using the steps above.

  4. Save yourVine_Review_Analysis.ipynbfile to your Amazon_Vine_Analysis GitHub repository.


Using SQL in pgAdmin



  1. From your AWS database, export thevine_tableas a CSV file and save it to your Amazon_Vine_Analysis GitHub repository.

  2. In pgAdmin, create a new database that is not linked to your AWS RDS instance. This way, you don’t have to keep incurring charges while connected to your AWS RDS instance.

  3. Create a new SQL file and name itVine_Review_Analysis.sql.

  4. Recreate thevine_tableusing the schema provided in thechallenge_schema.sqlfile.

  5. Import thevine_table.csvfile into the table, and perform your analysis using the steps above.

  6. Save all your SQL queries to theVine_Review_Analysis.sqlfile, then add it to your Amazon_Vine_Analysis GitHub repository.


Deliverable 2 Requirements


You will earn a perfect score for Deliverable 2 by completing all requirements below:



  • The analysis does the following:

    • There is a DataFrame or table for thevine_tabledata using one of three methods above(5 pt)

    • The data is filtered to create a DataFrame or table where there are 20 or more total votes(5 pt)

    • The data is filtered to create a DataFrame or table where the percentage ofhelpful_votesis equal to or greater than 50%(5 pt)

    • The data is filtered to create a DataFrame or table where there is a Vine review(5 pt)

    • The data is filtered to create a DataFrame or table where there isn’t a Vine review(5 pt)

    • The total number of reviews, the number of 5-star reviews, and the percentage 5-star reviews are calculated for all Vine and non-Vine reviews(15 pt)




Deliverable 3: A Written Report on the Analysis (20 points)


Deliverable 3 Instructions


For this part of the Challenge, you’ll write a report that summarizes the analysis you performed in Deliverable 2.


The report should contain the following:





  1. Overview of the analysis:Explain the purpose of this analysis.





  2. Results:Using bulleted lists and images of DataFrames as support, address the following questions:



    • How many Vine reviews and non-Vine reviews were there?

    • How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?

    • What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?





  3. Summary:In your summary, state if there is any positivity bias for reviews in the Vine program. Use the results of your analysis to support your statement. Then, provide one additional analysis that you could do with the dataset to support your statement.




Deliverable 3 Requirements


Structure, Organization, and Formatting (6 points)


The written analysis has the following structure, organization, and formatting:



  • There is a title, and there are multiple sections(2 pt)

  • Each section has a heading and subheading(2 pt)

  • Links to images are working, and code is formatted and displayed correctly(2 pt).


Analysis (14 points)


The written analysis has the following:




  • Overview of the analysis of the Vine program:



    • The purpose of this analysis is well defined(3 pt)




  • Results:



    • There is a bulleted list that addresses the three questions for unpaid and paid program reviews(7 pt)




  • Summary:



    • The summary states whether or not there is bias, and the results support this statement(2 pt)

    • An additional analysis is recommended to support the statement(2 pt)




Submission


Once you’re ready to submit, make sure to check your work against the rubric to ensure you are meeting the requirements for this Challenge one final time. It’s easy to overlook items when you’re in the zone!


As a reminder, the deliverables for this Challenge are as follows:



  • Deliverable 1: Perform ETL on Amazon Product Reviews

  • Deliverable 2: Determine Bias of Vine Reviews

  • Deliverable 3: A Written Report on the Analysis (README.md)


Upload the following to your Amazon_Vine_Analysis GitHub repository:



  • YourAmazon_Reviews_ETL.ipynbfile.

  • YourVine_Review_Analysis.ipynborVine_Review_Analysis.sqlfile.

  • An updated README.md that has your written analysis.


To submit your challenge assignment for grading in Bootcamp Spot, click Start Assignment, click the Website URL tab, then provide the URL of your Amazon_Vine_Analysis GitHub repository, and then click Submit. Comments are disabled for graded submissions in BootCampSpot. If you have questions about your feedback, please notify your instructional staff or the Student Success Manager. If you would like to resubmit your work for an improved grade, you can use theRe-Submit Assignmentbutton to upload new links. You may resubmit up to 3 times for a total of 4 submissions.


IMPORTANT

Once you receive feedback on your Challenge, make any suggested updates or adjustments to your work. Then, add this week’s Challenge to your professional portfolio.


NOTE

You are allowed to miss up to two Challenge assignments and still earn your certificate. If you complete all Challenge assignments, your lowest two grades will be dropped. If you wish to skip this assignment, click Next, and move on to the next Module.


Rubric

Module-16 RubricModule-16 Rubric
































CriteriaRatingsPts
This criterion is linked to a learning outcomeDeliverable 1: Perform ETL on Amazon Product Reviews










40to >36.0PtsDemonstrating Proficiency✓An Amazon Review dataset is extracted as a DataFrame. ✓The extracted dataset is transformed into four DataFrames with the correct columns. ✓ALL FOUR DataFrames are loaded into their respective tables in pgAdmin.36to >31.0PtsApproaching Proficiency✓An Amazon Review dataset is extracted as a DataFrame. ✓The extracted dataset is transformed into four DataFrames with the correct columns, AND ONE df has incorrect column names. ✓THREE of the FOUR DataFrames are loaded into pgAdmin tables.31to >27.0PtsDeveloping Proficiency✓An Amazon Review dataset is extracted as a DataFrame. ✓The extracted dataset is transformed into four DataFrames, but the products df has duplicate rows, OR ONE or more dfs has incorrect column names. ✓TWO of the FOUR DataFrames are loaded into pgAdmin tables.27to >0.0PtsEmerging✓An Amazon Review dataset is extracted as a DataFrame. ✓The extracted dataset is transformed into four DataFrames, but the products df has duplicate rows, AND ONE or more dfs has incorrect column names. ✓TWO of the FOUR DataFrames are loaded into pgAdmin tables.0PtsIncomplete

40pts

This criterion is linked to a learning outcomeDeliverable 2: Determine Bias of Vine Reviews










40to >36.0PtsDemonstrating Proficiency✓There is a DataFrame or table for the vine_table data. ✓The data is filtered to create a DataFrame or table where there are 20 or more total votes. ✓The data is filtered to create a DataFrame or table where the percentage of helpful_votes is equal to or greater than 50%. ✓The data is filtered to create a DataFrame or table where there is a Vine review. ✓The data is filtered to create a DataFrame or table where there isn’t a Vine review. ✓The total number of votes, the number of 5 star reviews, and the percentage of 5 star reviews are calculated.36to >34.0PtsApproaching Proficiency✓There is a DataFrame or table for the vine_table data. ✓The data is filtered to create a DataFrame or table where there are 20 or more total votes. ✓The data is filtered to create a DataFrame or table where the percentage of helpful_votes is equal to or greater than 50%. ✓The data is filtered to create a DataFrame or table where there is a Vine review. ✓The data is filtered to create a DataFrame or table where there isn’t a Vine review. ✓The total number of votes and the number of 5 star reviews are calculated, but there is an error calculating the percentage of 5 star reviews.34to >31.0PtsDeveloping Proficiency✓There is a DataFrame or table for the vine_table data. ✓The data is filtered to create a DataFrame or table where there are 20 or more total votes. ✓The data is filtered to create a DataFrame or table where the percentage of helpful_votes is equal to or greater than 50%. ✓The data is filtered to create a DataFrame or table where there is a Vine review. ✓The data is filtered to create a DataFrame or table where there isn’t a Vine review. ✓The total number of votes AND the number of 5 star reviews are calculated.31to >0.0PtsEmerging✓There is a DataFrame or table for the vine_table data. ✓The data is filtered to create a DataFrame or table where there are 20 or more total votes. ✓The data is filtered to create a DataFrame or table where the percentage of helpful_votes is equal to or greater than 50%. ✓The data is filtered to create a DataFrame or table where there is a Vine review. ✓The data is filtered to create a DataFrame or table where there isn’t a Vine review. ✓Either the total number of votes OR the number of 5 star reviews are calculated.0PtsIncomplete

40pts

This criterion is linked to a learning outcomeDeliverable 3: Structure, Organization, and Formatting










6to >5.0PtsDemonstrating ProficiencyThe written analysis has ALL of the following: ✓There is a title, and there are multiple sections. ✓Each section has a heading and subheading. ✓There are images and references to code, and they are formatted and displayed correctly.5to >4.0PtsApproaching ProficiencyThe written analysis has ALL of the following: ✓There is a title, and there are multiple sections. ✓Each section has a heading and subheading. ✓There are images and references to code, and they are formatted and displayed correctly with one or two minor errors.4to >3.0PtsDeveloping ProficiencyThe written analysis has ALL of the following: ✓There is a title, and there are multiple sections. AND ONE of the following: ✓Each section may have a heading and subheading. ✓There are images and references to code, and they are formatted and displayed correctly with one or two minor errors.3to >0.0PtsEmergingThe written analysis has ALL of the following: ✓There is a title. ✓There may be a subheading for a section. ✓There are no headings for each section, but there are three sections.0PtsIncomplete

6pts

This criterion is linked to a learning outcomeDeliverable 3: Analysis










14to >12.0PtsDemonstrating Proficiency✓The purpose is well defined. ✓ALL THREE questions are addressed in the results. ✓The summary states whether or not there is bias and is supported by the results and there is a supporting additional recommendation.12to >10.0PtsApproaching Proficiency✓The purpose is well defined. ✓ALL THREE questions are addressed in the results, but there is a misinterpretation. ✓The summary states whether or not there is bias and is supported by the results, but the additional recommendation doesn’t support the statement.10to >8.0PtsDeveloping Proficiency✓The purpose is well defined. ✓TWO of the THREE questions are addressed in the results. ✓The summary states whether or not there is bias and is supported by the results, but there is no additional recommendation.8to >0.0PtsEmerging✓The purpose is well defined. ✓ONE of the THREE questions is addressed in the results. ✓The summary states whether or not there is bias, but it is not supported by the results.0PtsIncomplete

14pts

Total points:100

16.9.3: Check AWS billing" style="float: left;">Previous
Module 16 Career Connection" style="float: right;">Next© 2020 - 2022 Trilogy Education Services, a 2U, Inc. brand. All Rights Reserved.
Feb 27, 2022
SOLUTION.PDF

Get Answer To This Question

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here