Hadoop MapReduce (due on 12/13/2020 11:59 PM)
Can you do this homework?



Hadoop MapReduce

Due on 12/13/2020 11:59 PM

Assignment

This project assignment covers the use of Hadoop, the MapReduce framework originally developed at Yahoo, to execute the WordCount application. The assignment is broken down into two parts: 1) set up a virtual cluster of 3 nodes and install Hadoop; 2) run 3 applications on the Hadoop cluster:

1. Example 1: Calculate π
2. Example 2: Distributed grep
3. Example 3: Word count in Python

Instructions for building the Hadoop cluster and executing all three example applications are provided in the slides.

Requirements

· Build a Hadoop cluster with one master node and two worker nodes
· Successfully run application example 1 to calculate π
· Successfully run application example 2 to grep information from a large number of files
· Successfully run the WordCount application using Hadoop MapReduce

Please feel free to make any necessary modifications to the original source code to support your additional functionality.

You need to submit a report that includes the following sections:

· Introduction
· Implementation of your Hadoop cluster (step-by-step configuration with screenshots)
· Test: test your WordCount application with different input files (with screenshots)

Deliverables

· You need to submit an IEEE-format report on Blackboard
· You need to submit all the source code as a zip file on Blackboard

Grading

· Report format (10%)
· Implementation (50%)
· Test (40%)

Week 8 slides: MapReduce (Multithreaded Distributed Programming)

Motivation: Large-Scale Data Processing

Want to:

· Process lots of data (TB ~ PB)
· Automatically parallelize across hundreds/thousands of CPUs
· Have status and monitoring tools
· Provide a clean abstraction for programmers
· Make this easy

MapReduce

“A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.”
- Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.

Typical Problem

· Iterate over a large number of records
· Map: extract something of interest from each record
· Shuffle and sort intermediate results
· Reduce: aggregate intermediate results
· Generate final output
· Key idea: provide an abstraction at the point of these two operations

MapReduce: Programming Model

· Process data using special map() and reduce() functions
· The map() function is called on every item in the input and emits a series of intermediate key/value pairs
· All values associated with a given key are grouped together
· The reduce() function is called on every unique key and its value list, and emits a value that is added to the output

Programming Model

· Borrows from functional programming
· Users implement an interface of two functions:
  map (in_key, in_value) -> (out_key, intermediate_value) list
  reduce (out_key, intermediate_value list) -> out_value list

map

· Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
· map() produces one or more intermediate values along with an output key from the input. (A word-count sketch of this interface follows below.)
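To make the two-function interface concrete, here is a minimal word-count pair written in plain Python. This is an illustration only, not from the slides and not tied to any Hadoop API; the names map_fn and reduce_fn are placeholders.

    # Word count written against the interface above.
    # in_key: document name (unused here); in_value: the document's contents.
    def map_fn(in_key, in_value):
        # Emit one intermediate (word, 1) pair per word in the document.
        return [(word, 1) for word in in_value.split()]

    # out_key: a word; values: the list of every count emitted for that word.
    def reduce_fn(out_key, values):
        # Sum the partial counts into the word's total frequency.
        return [sum(values)]

The framework, not the programmer, is responsible for grouping all the (word, 1) pairs by word between the two calls.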
reduce

· After the map phase is over, all the intermediate values for a given output key are combined together into a list
· reduce() combines those intermediate values into one or more final values for that same output key
· (In practice, there is usually only one final value per key)

MapReduce Examples

· Word frequency (as in the word-count sketch above)
· Distributed grep
  · The map function emits the line if it matches the search criteria
  · The reduce function is the identity function
· URL access frequency
  · The map function processes web logs and emits (URL, 1) pairs
  · The reduce function sums the values for each URL and emits (URL, total count)

MapReduce Execution Overview

1. The user program, via the MapReduce library, shards the input data.
2. The user program creates process copies distributed on a machine cluster. One copy will be the “Master” and the others will be workers.
3. The Master distributes M map tasks and R reduce tasks to idle workers.
   · M == the number of shards
   · R == the number of parts the intermediate key space is divided into
4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs, buffered in RAM.
5. Each worker flushes its intermediate values, partitioned into R regions, to disk and notifies the Master process.
6. The Master process gives the disk locations to an available reduce-task worker, which reads all the associated intermediate data.
7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
8. The Master process wakes up the user process when all tasks have completed. The output is contained in R output files.

(A toy, single-process simulation of steps 1-8 appears below, after the Fault Tolerance notes.)

MapReduce Runtime System

1. Partitions input data
2. Schedules execution across a set of machines
3. Handles machine failure
4. Manages interprocess communication

Parallelism

· map() functions run in parallel, creating different intermediate values from different input data sets
· reduce() functions also run in parallel, each working on a different output key
· All values are processed independently
· Bottleneck: the reduce phase cannot start until the map phase is completely finished

Locality

· The Master divvies up tasks based on the location of the data: it tries to schedule map() tasks on the same machine as the physical file data, or at least on the same rack
· map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks

Fault Tolerance

· The Master detects worker failures
  · Re-executes completed and in-progress map() tasks
  · Re-executes in-progress reduce() tasks
· The Master notices when particular input key/values cause crashes in map(), and skips those values on re-execution
  · Effect: can work around bugs in third-party libraries!
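The following toy simulation runs the whole map -> partition -> sort -> reduce pipeline in a single process. It is a sketch for intuition only (the names run_mapreduce, shards, and R are invented for this example); a real cluster spreads the same phases across machines with the fault-tolerance machinery described above.

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, shards, R=2):
        # Map phase: every shard yields intermediate (key, value) pairs,
        # partitioned into R regions by hashing the key (steps 3-5 above).
        regions = [defaultdict(list) for _ in range(R)]
        for shard_id, shard in enumerate(shards):
            for key, value in map_fn(shard_id, shard):
                regions[hash(key) % R][key].append(value)

        # Reduce phase: each region is sorted by key, and reduce_fn is
        # called once per unique key with that key's full value list
        # (steps 6-7 above). Each region yields one "output file".
        return [[(key, reduce_fn(key, values))
                 for key, values in sorted(region.items())]
                for region in regions]

    # Word count over two input shards; reduce returns one total per word.
    shards = ["foo foo quux", "labs foo bar quux"]
    print(run_mapreduce(
        lambda k, text: [(w, 1) for w in text.split()],
        lambda word, counts: sum(counts),
        shards))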
Optimizations

· No reduce can start until the map phase is complete:
  · A single slow disk controller can rate-limit the whole process
· The Master redundantly executes “slow-moving” map tasks and uses the results of the first copy to finish

MapReduce Conclusions

· MapReduce has proven to be a useful abstraction
· Greatly simplifies large-scale computations at Google
· The functional programming paradigm can be applied to large-scale applications
· Fun to use: focus on the problem, let the library deal with the messy details
· Greatly reduces parallel programming complexity
  · Reduces synchronization complexity
  · Automatically partitions data
  · Provides failure transparency
  · Handles load balancing

Week 10 slides: Hadoop Programming in Python (Multithreaded Distributed Programming)

Dealing with HDFS

· Make a directory: hdfs dfs -mkdir -p dir_name
  · Use the “-p” option on first-time directory creation only; it creates all parent directories.
· Put files into HDFS: hdfs dfs -put files location_in_hdfs
· Copy a file to the local filesystem: hdfs dfs -get files local_path
· List the contents of a directory: hdfs dfs -ls dir_name
· Print the contents of a file: hdfs dfs -cat file

To test your Hadoop cluster

· Run test 1 to calculate the value of π:

    $ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 30 100

· Run test 2 to grep information from files:

    $ hdfs dfs -mkdir input
    $ hdfs dfs -put etc/hadoop/*.xml input
    $ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep input output 'dfs[a-z.]+'

· Check the results:

    $ hdfs dfs -get output output
    $ cat output/*

Write your Python word count program — mapper

    #!/usr/bin/env python
    import sys

    # --- get all lines from stdin ---
    for line in sys.stdin:
        # --- remove leading and trailing whitespace ---
        line = line.strip()
        # --- split the line into words ---
        words = line.split()
        # --- output tuples [word, 1] in tab-delimited format ---
        for word in words:
            print('%s\t%s' % (word, "1"))

· Make your Python code executable: chmod +x mapper.py

Write your Python word count program — reducer

    #!/usr/bin/env python
    import sys

    # maps words to their counts
    word2count = {}

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # skip blank lines (e.g., a trailing newline)
        if not line:
            continue
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to an int
        try:
            count = int(count)
        except ValueError:
            continue
        word2count[word] = word2count.get(word, 0) + count

    # write the tuples to stdout
    # Note: they are unsorted
    for word in word2count:
        print('%s\t%s' % (word, word2count[word]))

· Make your Python code executable: chmod +x reducer.py

Test your scripts

    [hadoop@node1 ~]$ echo "foo foo quux labs foo bar quux" | ./mapper.py
    foo 1
    foo 1
    quux 1
    labs 1
    foo 1
    bar 1
    quux 1

    [hadoop@node1 ~]$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py
    labs 1
    quux 2
    foo 3
    bar 1

    [hadoop@node1 ~]$ cat mapper.py | ./mapper.py | sort | ./reducer.py
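The slides test the scripts locally but stop short of running them on the cluster. Here is a sketch of how that is typically done with Hadoop Streaming, assuming the same relative hadoop/ install path used in the test commands above, that the 3.3.0 streaming jar sits in share/hadoop/tools/lib/ (check your own installation), and a placeholder input file named input.txt:

    $ hdfs dfs -mkdir -p wordcount_input
    $ hdfs dfs -put input.txt wordcount_input
    $ hadoop jar hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
          -files mapper.py,reducer.py \
          -mapper mapper.py \
          -reducer reducer.py \
          -input wordcount_input \
          -output wordcount_output
    $ hdfs dfs -cat wordcount_output/*

Streaming pipes each input split through mapper.py on stdin/stdout, shuffles and sorts by the tab-delimited key, then pipes the sorted stream through reducer.py, exactly like the local sort-based test above.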


Robert answered on Dec 13, 2021
Assignment 2
Word Count using MapReduce
(Fig 1)
Fig 1 shows various details about our cluster, such as:
· It shows information about the master (manager) node
· It shows information about the worker (slave) nodes
(Both can also be confirmed from the command line, as sketched below.)
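For reference, the same details the figure shows can be checked with standard Hadoop commands (these are not part of the original answer's figures):

    $ hdfs dfsadmin -report   # NameNode's view of the live DataNodes
    $ yarn node -list         # NodeManagers registered with the ResourceManager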

(Fig 2)
Figure 2 shows the directory setup for the MapReduce job (a command sketch of these steps follows this list):
· Here, Grey is the parent directory, created at the /user/cloudera location
· With the help of the ls command we list all the directories...
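A minimal sketch of the commands the figure appears to show, assuming the Cloudera-style HDFS home /user/cloudera and the Grey directory named above:

    $ hdfs dfs -mkdir -p /user/cloudera/Grey
    $ hdfs dfs -ls /user/cloudera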