This assignment will introduce you to working with the MapReduce framework and the Hadoop cluster.

I. Modify WordCount class

You can use the existing example (WordCount.java) in D2L to create a new class. In this class, we will make a few improvements.

• Update the mapper so it produces (Text, LongWritable) pairs, so we are working with 64-bit integers. (You should also fix the reducer and everywhere else in your code; a sketch follows below.)
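For reference, here is a minimal sketch of what the updated mapper and reducer might look like, assuming the same structure as the WordCount.java example. Only the value type changes from IntWritable to LongWritable; remember to also update setOutputValueClass (and setCombinerClass, if used) in the driver.

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, LongWritable> {

    private final static LongWritable one = new LongWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit (Text, LongWritable) instead of (Text, IntWritable)
        }
    }
}

public static class LongSumReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private LongWritable result = new LongWritable();

    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;                    // 64-bit accumulator
        for (LongWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}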

Test your job to make sure it still has the same behaviour. Now we have a bigger problem…

Redefine “words”

When you have a closer look at the output from the previous WordCount class, you will notice things like this:

better 144

better'; 1

better, 14

better," 1

better,' 1

better. 9

better." 3

better.' 1

better; 4

All of these are really instances of the word “better”, but with some punctuation making them count as different words. This is the fault of the StringTokenizer used in the example, which just breaks up the line on whitespace; that apparently isn't quite the right concept of a “word” in the English language.

• Define words as being separated by any spaces or any punctuation. You can use the java.util.regex.Pattern class and its .split() method.

• Update your mapper so that it ignores punctuation by splitting on the regular expression. Make it ignore case by applying .toLowerCase() to each word.

• One artifact of this method: it will sometimes emit an empty string as a “word”. Make sure your code ignores any zero-length words and doesn't count them (see the sketch below).
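A possible version of the mapper body, assuming the word-separator pattern shown here (adjust the regular expression to your own definition of a “word”), and reusing the word and one fields from the mapper sketch above:

// at the top of WordCount.java:
import java.util.regex.Pattern;

// words are separated by any run of punctuation and/or whitespace
private static final Pattern WORD_SEP = Pattern.compile("[\\p{Punct}\\s]+");

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    for (String token : WORD_SEP.split(value.toString())) {
        if (token.length() == 0) {
            continue;                     // splitting can leave empty strings; don't count them
        }
        word.set(token.toLowerCase());    // ignore case
        context.write(word, one);
    }
}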

Put the data set files in HDFS

The files that Hadoop jobs use as input (and produce as output) are stored in the cluster's HDFS (Hadoop Distributed File System). There are a few things you need to do.

Use the hdfs dfs command (which is a synonym for the hadoop fs command) to interact with the file system and do the following:

• Create a directory with your first name to hold input files for the first job and copy the data set files into it.
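For example (the directory and file names here are placeholders; use your own first name and the actual data set file names from D2L):

hdfs dfs -mkdir firstname
hdfs dfs -put wordcount-data/*.txt firstname/
hdfs dfs -ls firstname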

Compile a job

You have a large collection of text files and would like to count the number of times each word is used in the text, by developing a MapReduce program in the Java programming language.

• You need to build a .jar file that can be submitted to the cluster and that contains the WordCount class. Copy the JAR to somewhere in your home directory on the cluster.
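One way to do this, assuming the Hadoop tools are on your path (the file names and cluster address are placeholders):

javac -classpath $(hadoop classpath) WordCount.java      # compile against the Hadoop libraries
jar cf wordcount.jar WordCount*.class                    # include the inner Mapper/Reducer classes
scp wordcount.jar username@cluster.example.com:~/        # copy to your home directory on the cluster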

Run a job

When we run this job, it takes two arguments on the command line: the directories for input and output files.

• Write the command to submit the job to the cluster.

• Write the command to inspect the output the job created.

There was one file created in the output directory because there was one reducer responsible for combining all of the map output (one is the default).

• Write the command to re-run the job (with three reducers instead of one) and store the output in a different output directory.

• Write the command to force Hadoop to dump the output from the mappers (the intermediate output) without going through the reduce stage, and store the output in a different output directory.
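As a sketch, the commands for this section might look like the following. The directory names are placeholders, and the -D options are only honoured if the driver uses ToolRunner/GenericOptionsParser; otherwise set the number of reducers in the driver with job.setNumReduceTasks.

# submit the job: input directory and output directory as arguments
yarn jar wordcount.jar WordCount firstname output-1

# inspect the output (one part-r-* file per reducer)
hdfs dfs -ls output-1
hdfs dfs -cat output-1/part-r-00000 | less

# re-run with three reducers, writing to a different output directory
yarn jar wordcount.jar WordCount -D mapreduce.job.reduces=3 firstname output-2

# zero reducers dumps the intermediate mapper output (map-only run)
yarn jar wordcount.jar WordCount -D mapreduce.job.reduces=0 firstname output-3

Note that each output directory must not already exist when the job is submitted.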

II. JSON Input & Reddit Comments

It is quite common for large data sets to be distributed with each record represented as a JSON object, with one object per line in the file(s). The input files look like this:

{"key1": "value1", "key2": 2}

{"key1": "value3", "key2": 4}

• Download the data set files named reddit-1 from D2L.

• Create a MapReduce program to calculate the average score in each subreddit.

Parsing JSON

The input to our mapper will be lines (from TextInputFormat) of JSON-encoded data. In the mapper, we will need to parse the JSON into actual data we can work with.

We will use the org.json package, which is the reference Java JSON implementation. Start by downloading the JAR file with the classes. Full docs are available, but here is a very quick tutorial on the parts we need:

import org.json.JSONObject;

JSONObject record = new JSONObject(input_string);

System.out.println((String) record.get("subreddit"));

System.out.println((Integer) record.get("score"));
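Note that record.getString("subreddit") and record.getInt("score") are equivalent, slightly more idiomatic accessors that avoid the casts.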

Mapping: Passing Pairs

Since we want to calculate average score, we will need to pass around pairs: the number of comments we have seen, and the sum of their scores.

The mapper will probably produce lots of pairs like this:

canada (1,1)

canada (1,9)

canada (1,8)

canada (1,1)

Those all have to be shuffled to the reducer, but it would be much more efficient to combine them into:

canada (4,19)
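Here is a sketch of a mapper and combiner that does this. It assumes LongPairWritable is the pair class provided with the assignment, with a (long, long) constructor and get_0()/get_1() accessors; adjust these names to the actual class you are given.

public static class RedditScoreMapper
        extends Mapper<LongWritable, Text, Text, LongPairWritable> {

    private Text subreddit = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        JSONObject record = new JSONObject(value.toString());
        subreddit.set(record.getString("subreddit"));
        long score = record.getLong("score");
        context.write(subreddit, new LongPairWritable(1, score));   // (count, sum of scores)
    }
}

public static class PairSumCombiner
        extends Reducer<Text, LongPairWritable, Text, LongPairWritable> {

    public void reduce(Text key, Iterable<LongPairWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        long sum = 0;
        for (LongPairWritable pair : values) {
            count += pair.get_0();    // assumed accessor names
            sum += pair.get_1();
        }
        context.write(key, new LongPairWritable(count, sum));
    }
}

Register the combiner in the driver with job.setCombinerClass(PairSumCombiner.class); since the combiner produces the same key/value types as the mapper, the reducer sees identical input whether or not the combiner runs.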

Reducing to Averages

Write a reducer that takes the mapper's key/value output (Text, LongPairWritable) and calculates the average for each subreddit. It should write one (Text, DoubleWritable) pair for each subreddit.
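A sketch of such a reducer, with the same assumptions about LongPairWritable as above:

public static class AverageReducer
        extends Reducer<Text, LongPairWritable, Text, DoubleWritable> {

    private DoubleWritable result = new DoubleWritable();

    public void reduce(Text key, Iterable<LongPairWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        long sum = 0;
        for (LongPairWritable pair : values) {
            count += pair.get_0();
            sum += pair.get_1();
        }
        result.set((double) sum / count);   // average score for this subreddit
        context.write(key, result);
    }
}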

III. Submission

• Make sure to submit, in a PDF file, screenshots of the output showing the program working correctly

• ZIP all your source files and the output files

• Submit them to the related entry in D2L