Programming Assignment: Machine Problem 5: Spark MapReduce

Deadline: Pass this assignment by Mar 5, 9:59 PM PST

Instructions

1. Overview

Welcome to the Spark MapReduce programming assignment. You will implement the solution to this machine problem in Python. To work on this assignment, you need Docker Desktop installed.

2. General Requirements

Please note that our grader runs in a Docker container that is NOT connected to the internet. Therefore, no additional libraries are allowed for this assignment (you may only use Python's default libraries; no pip installs). Also, you are NOT allowed to create any file or folder outside the current folder (i.e., you may only create files and folders inside the folder that holds your solutions).

3. Setup

Download the Dockerfile, build a Docker image, and run it in a container. If you have already created this container, do not create a new one. Copy the commands below:

# clone the repository and find the docker file
git clone https://github.com/UIUC-CS498-Cloud/MP5_SparkMapReduce_Template.git
cd MP5_SparkMapReduce_Template/Docker

# build an image for mp5 based on the docker file
docker build -t mp5 .

# create a container named 'mp5-cntr' for mp5 using the image mp5
docker run --name mp5-cntr -it mp5

# or start the 'mp5-cntr' container if you have created it
docker start -a mp5-cntr

4. Sorting

When selecting the top N items in a list, sorting is necessary. Use the following steps to sort:

1. Sort the list ASCENDING, based on count first and then on the key. If the key is a string, sort lexicographically.
2. Select the bottom N items of the sorted list as the top items.

This logic is implemented in the third example of the Hadoop MapReduce Tutorial.

For example, to select the top 5 items in the list {"A": 100, "B": 99, "C": 98, "D": 97, "E": 96, "F": 96, "G": 90}, first sort the items ASCENDING:

"G": 90
"E": 96
"F": 96
"D": 97
"C": 98
"B": 99
"A": 100

Then, the bottom 5 items are A, B, C, D, F.

As another example, to select the top 5 items in the list {"43": 100, "12": 99, "44": 98, "12": 97, "1": 96, "100": 96, "99": 90}, first sort the items ASCENDING:

"99": 90
"1": 96
"100": 96
"12": 97
"44": 98
"12": 99
"43": 100

Then, the bottom 5 items are 43, 12, 44, 12, 100.
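This selection rule is easy to express in plain Python. The sketch below only illustrates the rule and is not part of the provided templates; the helper name top_n and the list-of-pairs input format are our own choices:

# A minimal sketch of the Sorting rule above (illustrative only).
def top_n(pairs, n):
    """pairs is a list of (key, count) tuples; returns the top n items."""
    # Sort ascending by count first, then by key (lexicographic for strings).
    ordered = sorted(pairs, key=lambda kv: (kv[1], kv[0]))
    # The bottom n entries of the ascending list are the top items.
    return ordered[-n:]

# First example from the text: the tie on count 96 is broken by key,
# so "E" sorts before "F" and "F" makes the cut.
items = [("A", 100), ("B", 99), ("C", 98), ("D", 97),
         ("E", 96), ("F", 96), ("G", 90)]
print(top_n(items, 5))
# [('F', 96), ('D', 97), ('C', 98), ('B', 99), ('A', 100)]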
Submission

1. Requirements

This assignment will be graded based on Python 3.6.

2. Procedures

Step 1: Launch and go into the 'mp5-cntr' container after the setup. Note that files inside the container and on the host machine are not shared, so you should clone the repository again within the container. To download the templates and change into the working folder, run:

git clone https://github.com/UIUC-CS498-Cloud/MP5_SparkMapReduce_Template.git
cd MP5_SparkMapReduce_Template/PythonTemplate

Step 2: Finish the exercises by editing the provided template files. All you need to do is complete the parts marked with TODO. Please note that you are NOT allowed to import any additional libraries.

• Each exercise has one or more code templates. Simply edit these files.
• Our autograder runs the code on the provided Docker image. More information about the exercises is provided in the next section.

Step 3: After you are done with the exercises, put all 5 Python files (TitleCountSpark.py, TopTitleStatisticsSpark.py, OrphanPagesSpark.py, TopPopularLinksSpark.py, PopularityLeagueSpark.py) into a .zip file named "MP5.zip". Remember not to include the parent folder. Submit your "MP5.zip".

Exercise A: Top Titles

In this exercise, you will implement a counter for words in Wikipedia titles and find the top words used in these titles. We have provided a template for this exercise in the following file: TitleCountSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your application takes a list of Wikipedia titles (one per line) as input. It first tokenizes the titles using the provided delimiters, then lowercases the tokens and removes the common words listed in the provided stopwords file. Next, it selects the top 10 words and saves their counts in the output. Use the method in the Sorting section to select the top words.

You can test your output with:

# spark-submit TitleCountSpark.py stopwords.txt delimiters.txt dataset/titles/ partA
# cat partA

[example output omitted; it shows the top 5 words in alphabetical order, but the autograder requires the top 10 (after they are chosen based on count)]

The order of lines matters: sort the output in alphabetical order, and make sure the key and value in each line of the final output are tab-separated.
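For orientation, here is a hedged sketch of how the pieces of Exercise A could fit together. It is not the actual template: the argument order follows the spark-submit command above, but the assumptions that stopwords.txt holds one word per line, that delimiters.txt holds the delimiter characters on a single line, and that the output is written as a plain file are ours.

# A hedged sketch of an Exercise A pipeline (not the provided template).
import sys
from pyspark import SparkContext

def tokenize(line, delimiters):
    # Replace every delimiter character with a space, then split.
    for d in delimiters:
        line = line.replace(d, ' ')
    return line.split()

if __name__ == '__main__':
    stopwords = set(open(sys.argv[1]).read().split())   # assumed format
    delimiters = open(sys.argv[2]).read().strip('\n')   # assumed format

    sc = SparkContext(appName='TitleCount')
    counts = (sc.textFile(sys.argv[3])
                .flatMap(lambda line: tokenize(line, delimiters))
                .map(lambda w: w.lower())
                .filter(lambda w: w and w not in stopwords)
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

    # Top 10 via the Sorting rule: ascending by (count, word), bottom 10.
    top10 = sorted(counts.collect(), key=lambda kv: (kv[1], kv[0]))[-10:]

    # Final output: alphabetical order, tab-separated key/value.
    with open(sys.argv[4], 'w') as f:
        for word, count in sorted(top10):
            f.write('%s\t%d\n' % (word, count))
    sc.stop()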
Exercise B: Top Title Statistics

In this exercise, you will implement an application that computes statistics about the top words used in Wikipedia titles. We have provided a template for this exercise in the following file: TopTitleStatisticsSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your output from Exercise A will be used as the input here. The application saves the following statistics about the top words to the output: "Mean", "Sum", "Minimum", "Maximum", and "Variance" of the counts. All values should be floored to integers; for the sake of simplicity, use integers in all calculations (for example, compute the mean with integer division and reuse that integer mean when computing the variance).

The following is the sample command we will use to run the application:

# spark-submit TopTitleStatisticsSpark.py partA partB
# cat partB

[example output omitted; it shows the statistics of an application that selects the top 5 words, though we still require the top 10 as described above]

Make sure each statistic and its result are tab-separated.

Exercise C: Orphan Pages

In this exercise, you will implement an application to find orphan pages in Wikipedia. We have provided a template for this exercise in the following file: OrphanPagesSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your application takes a list of Wikipedia links (not Wikipedia titles anymore) as input. All pages are represented by their ID numbers. Each line starts with a page ID, followed by a list of pages that the ID links to. [sample input line omitted] In this sample, page 2 has links to pages 3, 747213, and so on. Note that links are not necessarily two-way.

The application should save the IDs of orphan pages to the output. Orphan pages are pages to which no pages link. A page that links to itself is NOT an orphan page.

The following is the sample command we will use to run the application:

# spark-submit OrphanPagesSpark.py dataset/links/ partC
# cat partC
# head partC

[example output omitted]

The order of lines matters: please sort your output (key value) in alphabetical order.

Exercise D: Top Popular Links

In this exercise, you will implement an application to find the most popular pages in Wikipedia. We have provided a template for this exercise in the following file: TopPopularLinksSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your application takes a list of Wikipedia links (not Wikipedia titles anymore) as input. All pages are represented by their ID numbers. Each line starts with a page ID, followed by a list of pages that the ID links to. [sample input line omitted] In this sample, page 2 has links to pages 3, 747213, and so on. Note that links are not necessarily two-way.

The application should save the IDs of the top 10 popular pages, together with the number of links to each, in the output. A page is more popular when more pages link to it. Use the method in the Sorting section to select the top links.

The following is the sample command we will use to run the application:

# spark-submit TopPopularLinksSpark.py dataset/links/ partD
# cat partD

[example output omitted; it shows the top 5 popular links]

The order of lines matters: sort your output (key value) in alphabetical order, and make sure the key and value in each line of the final output are tab-separated.

Exercise E: Popularity League

In this exercise, you will implement an application to find the most popular pages in Wikipedia. Again, we have provided a template for this exercise in the following file: PopularityLeagueSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your application takes a list of Wikipedia links as input. All pages are represented by their ID numbers. Each line starts with a page ID, followed by the pages the ID links to. [sample input line omitted] In this sample, page 2 has links to pages 3, 747213, and so on. Note that links are not necessarily two-way.

The popularity of a page is determined by the number of pages in the whole Wikipedia graph that link to that specific page (the same number as in Exercise D). The application also takes a list of page IDs as a second input (also called the league list). The goal of the application is to calculate the rank of the pages in the league using their popularity: a page's rank is the number of pages in the league with lower popularity than the page itself.

The following is the sample command we will use to run the application:

# spark-submit PopularityLeagueSpark.py dataset/links/ dataset/league.txt partE
# cat partE

[example outputs omitted; one run uses League={5300058, 3294332, 3078798, 1804986, 2370447, 81615, 3, 1} and another uses League={88822, 774931, 4861926, 1650573, 66877, 5115901, 75323, 4189215}]

The order matters: sort your output (key value) in alphabetical order, and make sure the key and value in each line of the final output are tab-separated. Note that we will use a different league file in our autograder runs.
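To make the rank definition concrete, here is a hedged sketch of how the core of Exercise E might be computed. It is not the provided template: the links line format ("2: 3 747213 ...", a source ID and a colon followed by the target IDs) and the assumption that league.txt lists one page ID per line are guesses on our part.

# A hedged sketch of the Exercise E core (not the provided template).
import sys
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName='PopularityLeague')

    # Popularity = in-link count over the whole graph (as in Exercise D).
    # Assumed line format: "sourceID: targetID targetID ...".
    popularity = (sc.textFile(sys.argv[1])
                    .flatMap(lambda line: line.split(':')[1].split())
                    .map(lambda page: (page, 1))
                    .reduceByKey(lambda a, b: a + b))

    # Assumed league format: one page ID per line.
    league = set(sc.textFile(sys.argv[2]).collect())

    # Restrict to league pages; a page nobody links to has popularity 0.
    league_pop = dict(popularity.filter(lambda kv: kv[0] in league).collect())
    for page in league:
        league_pop.setdefault(page, 0)

    # Rank = number of league pages with strictly lower popularity.
    ranks = {p: sum(1 for q in league if league_pop[q] < league_pop[p])
             for p in league}

    # Alphabetical order, tab-separated key/value.
    with open(sys.argv[3], 'w') as f:
        for page in sorted(ranks):
            f.write('%s\t%d\n' % (page, ranks[page]))
    sc.stop()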