Hadoop MapReduce (due on 12/13/2020 11:59 PM)
Can you do this homework?



Hadoop MapReduce

Due on 12/13/2020 11:59 PM

Assignment

This project assignment covers the use of Hadoop, the MapReduce framework originally developed at Yahoo, to execute the WordCount application. The assignment is broken down into two parts: 1) set up a virtual cluster of 3 nodes and install Hadoop; 2) run 3 applications on the Hadoop cluster:

1. Example 1: Calculate π
2. Example 2: Distributed grep
3. Example 3: Word count in Python

Instructions for building the Hadoop cluster and executing all three example applications are provided in the slides.

Requirements

· Build a Hadoop cluster with one master node and two worker nodes
· Successfully run application example 1 to calculate π
· Successfully run application example 2 to grep information from a large number of files
· Successfully run the WordCount application using Hadoop MapReduce

Please feel free to make any necessary modifications to the original source code to support your additional functionality.

You need to submit a report that includes the following sections:

· Introduction
· Implementation of your Hadoop cluster (step-by-step configuration with screenshots)
· Test: test your WordCount application with different input files (with screenshots)

Deliverables

· You need to submit an IEEE-format report on Blackboard
· You need to submit all the source code as a zip file on Blackboard

Grading

· Report format (10%)
· Implementation (50%)
· Test (40%)

Week 8 slides: MapReduce (Multithreaded Distributed Programming)

Motivation: Large-Scale Data Processing

Want to:

· Process lots of data (TB ~ PB)
· Automatically parallelize across hundreds/thousands of CPUs
· Have status and monitoring tools
· Provide a clean abstraction for programmers
· Make this easy

MapReduce

“A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.”
- Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.

Typical Problem

· Iterate over a large number of records
· Map: extract something of interest from each record
· Shuffle and sort intermediate results
· Reduce: aggregate intermediate results
· Generate final output
· Key idea: provide an abstraction at the point of these two operations

MapReduce: Programming Model

· Process data using special map() and reduce() functions
· The map() function is called on every item in the input and emits a series of intermediate key/value pairs
· All values associated with a given key are grouped together
· The reduce() function is called on every unique key and its value list, and emits a value that is added to the output

Programming Model

· Borrows from functional programming
· Users implement an interface of two functions:
  map (in_key, in_value) -> (out_key, intermediate_value) list
  reduce (out_key, intermediate_value list) -> out_value list

map

· Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
· map() produces one or more intermediate values along with an output key from the input. (A word-count sketch of this interface follows below.)
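To make the two-function interface concrete, here is a minimal word-count pair written in plain Python. This is an illustration only, not from the slides and not tied to any Hadoop API; the names map_fn and reduce_fn are placeholders.

    # Word count written against the interface above.
    # in_key: document name (unused here); in_value: the document's contents.
    def map_fn(in_key, in_value):
        # Emit one intermediate (word, 1) pair per word in the document.
        return [(word, 1) for word in in_value.split()]

    # out_key: a word; values: the list of every count emitted for that word.
    def reduce_fn(out_key, values):
        # Sum the partial counts into the word's total frequency.
        return [sum(values)]

The framework, not the programmer, is responsible for grouping all the (word, 1) pairs by word between the two calls.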
reduce

· After the map phase is over, all the intermediate values for a given output key are combined together into a list
· reduce() combines those intermediate values into one or more final values for that same output key
· (In practice, there is usually only one final value per key)

MapReduce Examples

· Word frequency (as in the word-count sketch above)
· Distributed grep
  · The map function emits the line if it matches the search criteria
  · The reduce function is the identity function
· URL access frequency
  · The map function processes web logs and emits (URL, 1) pairs
  · The reduce function sums the values for each URL and emits (URL, total count)

MapReduce Execution Overview

1. The user program, via the MapReduce library, shards the input data.
2. The user program creates process copies distributed on a machine cluster. One copy will be the “Master” and the others will be workers.
3. The Master distributes M map tasks and R reduce tasks to idle workers.
   · M == the number of shards
   · R == the number of parts the intermediate key space is divided into
4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs, buffered in RAM.
5. Each worker flushes its intermediate values, partitioned into R regions, to disk and notifies the Master process.
6. The Master process gives the disk locations to an available reduce-task worker, which reads all the associated intermediate data.
7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
8. The Master process wakes up the user process when all tasks have completed. The output is contained in R output files.

(A toy, single-process simulation of steps 1-8 appears below, after the Fault Tolerance notes.)

MapReduce Runtime System

1. Partitions input data
2. Schedules execution across a set of machines
3. Handles machine failure
4. Manages interprocess communication

Parallelism

· map() functions run in parallel, creating different intermediate values from different input data sets
· reduce() functions also run in parallel, each working on a different output key
· All values are processed independently
· Bottleneck: the reduce phase cannot start until the map phase is completely finished

Locality

· The Master divvies up tasks based on the location of the data: it tries to schedule map() tasks on the same machine as the physical file data, or at least on the same rack
· map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks

Fault Tolerance

· The Master detects worker failures
  · Re-executes completed and in-progress map() tasks
  · Re-executes in-progress reduce() tasks
· The Master notices when particular input key/values cause crashes in map(), and skips those values on re-execution
  · Effect: can work around bugs in third-party libraries!
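The following toy simulation runs the whole map -> partition -> sort -> reduce pipeline in a single process. It is a sketch for intuition only (the names run_mapreduce, shards, and R are invented for this example); a real cluster spreads the same phases across machines with the fault-tolerance machinery described above.

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, shards, R=2):
        # Map phase: every shard yields intermediate (key, value) pairs,
        # partitioned into R regions by hashing the key (steps 3-5 above).
        regions = [defaultdict(list) for _ in range(R)]
        for shard_id, shard in enumerate(shards):
            for key, value in map_fn(shard_id, shard):
                regions[hash(key) % R][key].append(value)

        # Reduce phase: each region is sorted by key, and reduce_fn is
        # called once per unique key with that key's full value list
        # (steps 6-7 above). Each region yields one "output file".
        return [[(key, reduce_fn(key, values))
                 for key, values in sorted(region.items())]
                for region in regions]

    # Word count over two input shards; reduce returns one total per word.
    shards = ["foo foo quux", "labs foo bar quux"]
    print(run_mapreduce(
        lambda k, text: [(w, 1) for w in text.split()],
        lambda word, counts: sum(counts),
        shards))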
Optimizations

· No reduce can start until the map phase is complete:
  · A single slow disk controller can rate-limit the whole process
· The Master redundantly executes “slow-moving” map tasks and uses the results of the first copy to finish

MapReduce Conclusions

· MapReduce has proven to be a useful abstraction
· Greatly simplifies large-scale computations at Google
· The functional programming paradigm can be applied to large-scale applications
· Fun to use: focus on the problem, let the library deal with the messy details
· Greatly reduces parallel programming complexity
  · Reduces synchronization complexity
  · Automatically partitions data
  · Provides failure transparency
  · Handles load balancing

Week 10 slides: Hadoop Programming in Python (Multithreaded Distributed Programming)

Dealing with HDFS

· Make a directory: hdfs dfs -mkdir -p dir_name
  · Use the “-p” option on first-time directory creation only; it creates all parent directories.
· Put files into HDFS: hdfs dfs -put files location_in_hdfs
· Copy a file to the local filesystem: hdfs dfs -get files local_path
· List the contents of a directory: hdfs dfs -ls dir_name
· Print the contents of a file: hdfs dfs -cat file

To test your Hadoop cluster

· Run test 1 to calculate the value of π:

    $ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 30 100

· Run test 2 to grep information from files:

    $ hdfs dfs -mkdir input
    $ hdfs dfs -put etc/hadoop/*.xml input
    $ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep input output 'dfs[a-z.]+'

· Check the results:

    $ hdfs dfs -get output output
    $ cat output/*

Write your Python word count program — mapper

    #!/usr/bin/env python
    import sys

    # --- get all lines from stdin ---
    for line in sys.stdin:
        # --- remove leading and trailing whitespace ---
        line = line.strip()
        # --- split the line into words ---
        words = line.split()
        # --- output tuples [word, 1] in tab-delimited format ---
        for word in words:
            print('%s\t%s' % (word, "1"))

· Make your Python code executable: chmod +x mapper.py

Write your Python word count program — reducer

    #!/usr/bin/env python
    import sys

    # maps words to their counts
    word2count = {}

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # skip blank lines (e.g., a trailing newline)
        if not line:
            continue
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to an int
        try:
            count = int(count)
        except ValueError:
            continue
        word2count[word] = word2count.get(word, 0) + count

    # write the tuples to stdout
    # Note: they are unsorted
    for word in word2count:
        print('%s\t%s' % (word, word2count[word]))

· Make your Python code executable: chmod +x reducer.py

Test your scripts

    [hadoop@node1 ~]$ echo "foo foo quux labs foo bar quux" | ./mapper.py
    foo 1
    foo 1
    quux 1
    labs 1
    foo 1
    bar 1
    quux 1

    [hadoop@node1 ~]$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py
    labs 1
    quux 2
    foo 3
    bar 1

    [hadoop@node1 ~]$ cat mapper.py | ./mapper.py | sort | ./reducer.py
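The slides test the scripts locally but stop short of running them on the cluster. Here is a sketch of how that is typically done with Hadoop Streaming, assuming the same relative hadoop/ install path used in the test commands above, that the 3.3.0 streaming jar sits in share/hadoop/tools/lib/ (check your own installation), and a placeholder input file named input.txt:

    $ hdfs dfs -mkdir -p wordcount_input
    $ hdfs dfs -put input.txt wordcount_input
    $ hadoop jar hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
          -files mapper.py,reducer.py \
          -mapper mapper.py \
          -reducer reducer.py \
          -input wordcount_input \
          -output wordcount_output
    $ hdfs dfs -cat wordcount_output/*

Streaming pipes each input split through mapper.py on stdin/stdout, shuffles and sorts by the tab-delimited key, then pipes the sorted stream through reducer.py, exactly like the local sort-based test above.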


Robert answered on Dec 13, 2021
Assignment 2
Word Count using MapReduce
(Fig 1)
Fig 1 shows various details about our cluster, such as:
· It shows information about the master (manager) node
· It shows information about the worker (slave) nodes
(Both can also be confirmed from the command line, as sketched below.)
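For reference, the same details the figure shows can be checked with standard Hadoop commands (these are not part of the original answer's figures):

    $ hdfs dfsadmin -report   # NameNode's view of the live DataNodes
    $ yarn node -list         # NodeManagers registered with the ResourceManager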

(Fig 2)
Figure 2 shows the directory setup for the MapReduce job (a command sketch of these steps follows this list):
· Here, Grey is the parent directory, created at the /user/cloudera location
· With the help of the ls command we list all the directories...
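A minimal sketch of the commands the figure appears to show, assuming the Cloudera-style HDFS home /user/cloudera and the Grey directory named above:

    $ hdfs dfs -mkdir -p /user/cloudera/Grey
    $ hdfs dfs -ls /user/cloudera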