Programming Assignment: Machine Problem 5: Spark MapReduce

Deadline: Pass this assignment by Mar 5, 9:59 PM PST

Instructions

1. Overview

Welcome to the Spark MapReduce programming assignment. You will implement the solution to this machine problem in Python. To work on this assignment, you need Docker Desktop installed.

2. General Requirements

Please note that our grader runs in a Docker container that is NOT connected to the internet. Therefore, no additional libraries are allowed for this assignment (you may only use Python's default libraries; no pip installs). Also, you are NOT allowed to create any file or folder outside the current folder (i.e., you may only create files and folders inside the folder that holds your solutions).

3. Setup

Download the Dockerfile, build a Docker image, and run it in a container. If you have already created this container, do not create a new one. Copy the commands below:

# clone the repository and find the docker file
git clone https://github.com/UIUC-CS498-Cloud/MP5_SparkMapReduce_Template.git
cd MP5_SparkMapReduce_Template/Docker

# build an image for mp5 based on the docker file
docker build -t mp5 .

# create a container named 'mp5-cntr' for mp5 using the image mp5
docker run --name mp5-cntr -it mp5

# or start the 'mp5-cntr' container if you have created it
docker start -a mp5-cntr

4. Sorting

When selecting the top N items in a list, sorting is necessary. Use the following steps to sort:

1. Sort the list ASCENDING, based on count first and then on the key. If the key is a string, sort lexicographically.
2. Select the bottom N items of the sorted list as the top items.

This logic is implemented in the third example of the Hadoop MapReduce Tutorial.

For example, to select the top 5 items in the list {"A": 100, "B": 99, "C": 98, "D": 97, "E": 96, "F": 96, "G": 90}, first sort the items ASCENDING:

"G": 90
"E": 96
"F": 96
"D": 97
"C": 98
"B": 99
"A": 100

Then, the bottom 5 items are A, B, C, D, F.

As another example, to select the top 5 items in the list {"43": 100, "12": 99, "44": 98, "12": 97, "1": 96, "100": 96, "99": 90}, first sort the items ASCENDING:

"99": 90
"1": 96
"100": 96
"12": 97
"44": 98
"12": 99
"43": 100

Then, the bottom 5 items are 43, 12, 44, 12, 100.
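This selection rule is easy to express in plain Python. The sketch below only illustrates the rule and is not part of the provided templates; the helper name top_n and the list-of-pairs input format are our own choices:

# A minimal sketch of the Sorting rule above (illustrative only).
def top_n(pairs, n):
    """pairs is a list of (key, count) tuples; returns the top n items."""
    # Sort ascending by count first, then by key (lexicographic for strings).
    ordered = sorted(pairs, key=lambda kv: (kv[1], kv[0]))
    # The bottom n entries of the ascending list are the top items.
    return ordered[-n:]

# First example from the text: the tie on count 96 is broken by key,
# so "E" sorts before "F" and "F" makes the cut.
items = [("A", 100), ("B", 99), ("C", 98), ("D", 97),
         ("E", 96), ("F", 96), ("G", 90)]
print(top_n(items, 5))
# [('F', 96), ('D', 97), ('C', 98), ('B', 99), ('A', 100)]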
Submission

1. Requirements

This assignment will be graded based on Python 3.6.

2. Procedures

Step 1: Launch and go into the 'mp5-cntr' container after the setup. Note that files inside the container and on the host machine are not shared, so you should clone the repository again within the container. To download the templates and change into the working folder, run:

git clone https://github.com/UIUC-CS498-Cloud/MP5_SparkMapReduce_Template.git
cd MP5_SparkMapReduce_Template/PythonTemplate

Step 2: Finish the exercises by editing the provided template files. All you need to do is complete the parts marked with TODO. Please note that you are NOT allowed to import any additional libraries.

• Each exercise has one or more code templates. Simply edit these files.
• Our autograder runs the code on the provided Docker image. More information about the exercises is provided in the next section.

Step 3: After you are done with the exercises, put all 5 Python files (TitleCountSpark.py, TopTitleStatisticsSpark.py, OrphanPagesSpark.py, TopPopularLinksSpark.py, PopularityLeagueSpark.py) into a .zip file named "MP5.zip". Remember not to include the parent folder. Submit your "MP5.zip".

Exercise A: Top Titles

In this exercise, you will implement a counter for words in Wikipedia titles and find the top words used in these titles. We have provided a template for this exercise in the following file: TitleCountSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your application takes a list of Wikipedia titles (one per line) as input. It first tokenizes the titles using the provided delimiters, then lowercases the tokens and removes the common words listed in the provided stopwords file. Next, it selects the top 10 words and saves their counts in the output. Use the method in the Sorting section to select the top words.

You can test your output with:

# spark-submit TitleCountSpark.py stopwords.txt delimiters.txt dataset/titles/ partA
# cat partA

[example output omitted; it shows the top 5 words in alphabetical order, but the autograder requires the top 10 (after they are chosen based on count)]

The order of lines matters: sort the output in alphabetical order, and make sure the key and value in each line of the final output are tab-separated.
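For orientation, here is a hedged sketch of how the pieces of Exercise A could fit together. It is not the actual template: the argument order follows the spark-submit command above, but the assumptions that stopwords.txt holds one word per line, that delimiters.txt holds the delimiter characters on a single line, and that the output is written as a plain file are ours.

# A hedged sketch of an Exercise A pipeline (not the provided template).
import sys
from pyspark import SparkContext

def tokenize(line, delimiters):
    # Replace every delimiter character with a space, then split.
    for d in delimiters:
        line = line.replace(d, ' ')
    return line.split()

if __name__ == '__main__':
    stopwords = set(open(sys.argv[1]).read().split())   # assumed format
    delimiters = open(sys.argv[2]).read().strip('\n')   # assumed format

    sc = SparkContext(appName='TitleCount')
    counts = (sc.textFile(sys.argv[3])
                .flatMap(lambda line: tokenize(line, delimiters))
                .map(lambda w: w.lower())
                .filter(lambda w: w and w not in stopwords)
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

    # Top 10 via the Sorting rule: ascending by (count, word), bottom 10.
    top10 = sorted(counts.collect(), key=lambda kv: (kv[1], kv[0]))[-10:]

    # Final output: alphabetical order, tab-separated key/value.
    with open(sys.argv[4], 'w') as f:
        for word, count in sorted(top10):
            f.write('%s\t%d\n' % (word, count))
    sc.stop()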
Exercise B: Top Title Statistics

In this exercise, you will implement an application that computes statistics about the top words used in Wikipedia titles. We have provided a template for this exercise in the following file: TopTitleStatisticsSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your output from Exercise A will be used as the input here. The application saves the following statistics about the top words to the output: "Mean", "Sum", "Minimum", "Maximum", and "Variance" of the counts. All values should be floored to integers; for the sake of simplicity, use integers in all calculations (for example, compute the mean with integer division and reuse that integer mean when computing the variance).

The following is the sample command we will use to run the application:

# spark-submit TopTitleStatisticsSpark.py partA partB
# cat partB

[example output omitted; it shows the statistics of an application that selects the top 5 words, though we still require the top 10 as described above]

Make sure each statistic and its result are tab-separated.

Exercise C: Orphan Pages

In this exercise, you will implement an application to find orphan pages in Wikipedia. We have provided a template for this exercise in the following file: OrphanPagesSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your application takes a list of Wikipedia links (not Wikipedia titles anymore) as input. All pages are represented by their ID numbers. Each line starts with a page ID, followed by a list of pages that the ID links to. [sample input line omitted] In this sample, page 2 has links to pages 3, 747213, and so on. Note that links are not necessarily two-way.

The application should save the IDs of orphan pages to the output. Orphan pages are pages to which no pages link. A page that links to itself is NOT an orphan page.

The following is the sample command we will use to run the application:

# spark-submit OrphanPagesSpark.py dataset/links/ partC
# cat partC
# head partC

[example output omitted]

The order of lines matters: please sort your output (key value) in alphabetical order.

Exercise D: Top Popular Links

In this exercise, you will implement an application to find the most popular pages in Wikipedia. We have provided a template for this exercise in the following file: TopPopularLinksSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your application takes a list of Wikipedia links (not Wikipedia titles anymore) as input. All pages are represented by their ID numbers. Each line starts with a page ID, followed by a list of pages that the ID links to. [sample input line omitted] In this sample, page 2 has links to pages 3, 747213, and so on. Note that links are not necessarily two-way.

The application should save the IDs of the top 10 popular pages, together with the number of links to each, in the output. A page is more popular when more pages link to it. Use the method in the Sorting section to select the top links.

The following is the sample command we will use to run the application:

# spark-submit TopPopularLinksSpark.py dataset/links/ partD
# cat partD

[example output omitted; it shows the top 5 popular links]

The order of lines matters: sort your output (key value) in alphabetical order, and make sure the key and value in each line of the final output are tab-separated.

Exercise E: Popularity League

In this exercise, you will implement an application to find the most popular pages in Wikipedia. Again, we have provided a template for this exercise in the following file: PopularityLeagueSpark.py. You need to make the necessary changes to the parts marked with TODO.

Your application takes a list of Wikipedia links as input. All pages are represented by their ID numbers. Each line starts with a page ID, followed by the pages the ID links to. [sample input line omitted] In this sample, page 2 has links to pages 3, 747213, and so on. Note that links are not necessarily two-way.

The popularity of a page is determined by the number of pages in the whole Wikipedia graph that link to that specific page (the same number as in Exercise D). The application also takes a list of page IDs as a second input (also called the league list). The goal of the application is to calculate the rank of the pages in the league using their popularity: a page's rank is the number of pages in the league with lower popularity than the page itself.

The following is the sample command we will use to run the application:

# spark-submit PopularityLeagueSpark.py dataset/links/ dataset/league.txt partE
# cat partE

[example outputs omitted; one run uses League={5300058, 3294332, 3078798, 1804986, 2370447, 81615, 3, 1} and another uses League={88822, 774931, 4861926, 1650573, 66877, 5115901, 75323, 4189215}]

The order matters: sort your output (key value) in alphabetical order, and make sure the key and value in each line of the final output are tab-separated. Note that we will use a different league file in our autograder runs.
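To make the rank definition concrete, here is a hedged sketch of how the core of Exercise E might be computed. It is not the provided template: the links line format ("2: 3 747213 ...", a source ID and a colon followed by the target IDs) and the assumption that league.txt lists one page ID per line are guesses on our part.

# A hedged sketch of the Exercise E core (not the provided template).
import sys
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName='PopularityLeague')

    # Popularity = in-link count over the whole graph (as in Exercise D).
    # Assumed line format: "sourceID: targetID targetID ...".
    popularity = (sc.textFile(sys.argv[1])
                    .flatMap(lambda line: line.split(':')[1].split())
                    .map(lambda page: (page, 1))
                    .reduceByKey(lambda a, b: a + b))

    # Assumed league format: one page ID per line.
    league = set(sc.textFile(sys.argv[2]).collect())

    # Restrict to league pages; a page nobody links to has popularity 0.
    league_pop = dict(popularity.filter(lambda kv: kv[0] in league).collect())
    for page in league:
        league_pop.setdefault(page, 0)

    # Rank = number of league pages with strictly lower popularity.
    ranks = {p: sum(1 for q in league if league_pop[q] < league_pop[p])
             for p in league}

    # Alphabetical order, tab-separated key/value.
    with open(sys.argv[3], 'w') as f:
        for page in sorted(ranks):
            f.write('%s\t%d\n' % (page, ranks[page]))
    sc.stop()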