Learning Outcomes

There are three specific objectives to this assignment:

· To learn about the general nature of the MapReduce paradigm.
· To implement a correct and efficient MapReduce framework using threads and related functions.
· To gain more experience writing concurrent code, especially using locks and condition variables.

Intro

In 2004, engineers at Google introduced a new paradigm for large-scale parallel data processing known as MapReduce (see the original paper here, and make sure to look at the citations at the end). One key aspect of MapReduce is that it makes programming such tasks on large-scale clusters easy for developers; instead of worrying about how to manage parallelism, handle machine crashes, and many other complexities common within clusters of machines, the developer can just focus on writing little bits of code (described below) and the infrastructure handles the rest.

In this project, you'll be building a simplified version of MapReduce for just a single machine. While building MapReduce for a single machine is somewhat easier, there are still numerous challenges, mostly in building the correct concurrency support. Thus, you'll have to think a bit about how to design the MapReduce implementation, and then build it to work efficiently and correctly.

Background

To understand how to make progress on any project that involves concurrency, you should understand the basics of thread creation, mutual exclusion (with locks), and signaling/waiting (with condition variables). These are described in the following book chapters:

· Intro to Threads
· Threads API (C)
· Locks
· Using Locks
· Condition Variables (C)
· Java Lock and Condition (this is a must read, but you can ignore the built-in synchronization primitives)

General Idea

Let's now get into the exact code you'll have to build. The MapReduce infrastructure you will build supports the execution of user-defined Map() and Reduce() functions. As described in the original paper:

"Map(), written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key K and passes them to the Reduce() function."

"The Reduce() function, also written by the user, accepts an intermediate key K and a set of values for that key. It merges together these values to form a possibly smaller set of values; typically just zero or one output value is produced per Reduce() invocation. The intermediate values are supplied to the user's reduce function via an iterator."

A classic example, written here in pseudocode, shows how to count the number of occurrences of each word in a set of documents:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        print key, result;

What's fascinating about MapReduce is how many different kinds of relevant computations can be mapped onto this framework. The original paper lists many examples, including word counting (as above), a distributed grep, a URL access frequency counter, a reverse web-link graph application, a term-vector-per-host analysis, and others. What's also quite interesting is how easy it is to parallelize: many mappers can be running at the same time, and later, many reducers can be running at the same time. (A small worked trace of this pipeline follows.)
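To make the data flow concrete, here is an illustrative trace of the word-count pipeline. The documents and their contents are made up for this example and are not part of the assignment:

    Input:         doc1 = "the cat sat", doc2 = "the dog"
    Map phase:     map("doc1", ...) emits ("the","1"), ("cat","1"), ("sat","1")
                   map("doc2", ...) emits ("the","1"), ("dog","1")
    Grouping:      "cat" -> ["1"], "dog" -> ["1"], "sat" -> ["1"], "the" -> ["1","1"]
    Reduce phase:  reduce("cat", ...) prints "cat 1", ..., reduce("the", ...) prints "the 2"

Note that the grouping step in the middle is performed entirely by the framework; the user only writes the map and reduce logic at either end.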
Users don't have to worry about how to parallelize their applications; rather, they just write Map() and Reduce() functions and the infrastructure does the rest.

Code Overview

(All the starter code can be downloaded from here.) We give you starter code that specifies exactly what you must build in your MapReduce class. The most important function is MRRun, which takes the command-line parameters of a given program, an instance of a user-defined class that extends the MapperReducerAPI class and implements the Map and Reduce functions, the number of mapper threads your MapReduce class should create (numMappers), and the number of reducers (numReducers). Thus, when a user is writing a MapReduce computation with your class, they will subclass MapperReducerAPI, implement a Map function, implement a Reduce function, possibly implement a Partitioner function, and then call MRRun(). Please also see the following Wordcount example. The infrastructure will then create threads as appropriate and run the computation.

One basic assumption is that your class will create numMappers threads that perform the map tasks. Another is that your class will create numReducers threads to perform the reduction tasks (Invariant: numReducers = numPartitions = numSorters). Finally, your class will create some kind of internal data structure to pass keys and values from mappers to reducers; more on this below.

Simple Example: Wordcount

Here is a simple (but functional) wordcount program in Java, written to use this infrastructure by extending the abstract class MapperReducerAPI (the code listing did not survive in this copy of the handout; a hedged sketch of the full class appears at the end of this section).

Let's walk through this code, in order to see what it is doing. First, notice that Map() is called with a file name. In general, we assume that this type of computation is being run over many files; each invocation of Map() is thus handed one file name and is expected to process that file in its entirety. In this example, the code just reads through the file, one line at a time, splitting each line into tokens. Each token is then emitted using the MREmit() function, which takes two strings as input: a key and a value. The key here is the word itself, and the value is just a count, in this case 1 (as a string).

The MREmit() function is thus another key part of your class; it needs to take key/value pairs from the many different mappers and store them in a way that later reducers can access them, given the constraints described below. Designing and implementing this data structure is thus a central challenge of the project.

After the mappers are finished, your class should have stored the key/value pairs in such a way that the Reduce() function can be called. Reduce() is invoked once per key, and is passed the key along with a function that enables iteration over all of the values that produced that same key. To iterate, the code just calls MRGetNext() repeatedly until a null value is returned; MRGetNext() returns a value passed in earlier by the MREmit() function above, or null once all of the key's values have been consumed. The output, in the example, is just a count of how many times a given word has appeared, and is written to the output file by MRPostProcess(). Reduce() appears in the continuation of the Wordcount class (it is included in the sketch below). All of this computation is started off by a call to MRRun() in the main() routine of the user program.
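Since the original code listing is missing from this copy, here is a minimal sketch of what the Wordcount class might look like, pieced together from the walkthrough above. The exact method names and signatures (Map, Reduce, MREmit, MRGetNext, MRPostProcess, MRRun) are assumptions here; defer to the actual starter code.

    // SKETCH ONLY: signatures are inferred from the handout's description,
    // not copied from the starter code.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class Wordcount extends MapperReducerAPI {

        // Map() is handed one file name; it reads the file line by line
        // and emits (word, "1") for every token it finds.
        @Override
        public void Map(String fileName) {
            try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
                String line;
                while ((line = in.readLine()) != null) {
                    for (String token : line.split("\\s+")) {
                        if (!token.isEmpty()) {
                            MREmit(token, "1");   // key = word, value = count of 1
                        }
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        // Reduce() is called once per key; it iterates via MRGetNext()
        // until null is returned, summing the counts.
        // (Parameter list is a guess; the starter code may differ.)
        @Override
        public void Reduce(String key, int partitionNumber) {
            int count = 0;
            String value;
            while ((value = MRGetNext(key, partitionNumber)) != null) {
                count += Integer.parseInt(value);
            }
            MRPostProcess(key, count);   // write "key count" to the output file
        }

        public static void main(String[] args) {
            // 10 mappers and 10 reducers -- numbers chosen arbitrarily here;
            // whether MRRun is static on a MapReduce class is also assumed.
            MapReduce.MRRun(args, new Wordcount(), 10, 10);
        }
    }

Everything between the two user-written phases (grouping values by key, partitioning, sorting) is the framework's job, which is exactly what you are building.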
MRRun() is passed the argv array, and assumes that argv[0] (or a user-entered file name) contains the name of the file to be processed; this file will be split evenly into as many pieces as there are mappers, so that each mapper gets one split of the file.

One interesting function that is needed by the MapReduce class is the partitioning function. In most cases, programs will use the default function defined in MapperReducerAPI (the listing is missing from this copy; a hedged reconstruction appears after the Restrictions list below). The function's role is to take a given key and map it to a number, from 0 to numPartitions - 1. Its use is internal to the MapReduce class, but critical. Specifically, your MapReduce class should use this function to decide which partition (and hence, which sorter and reducer thread) gets a particular key and its list of values to process. For some applications, it does not matter which reducer thread processes a particular key (and thus the default function should not be overridden by the customer's MapperReducerAPI subclass); for others, it does matter, and this is why the user can implement their own partitioning function in their MapperReducerAPI subclass as need be.

One last requirement: for each partition, keys (and the value list associated with each key) should be sorted in ascending key order; thus, when a particular reducer thread (and its associated partition) is working, the Reduce() function should be called on each key of that partition in order.

Considerations

Here are a few things to consider in your implementation:

· Thread Management. This part is fairly straightforward. You should create numMappers mapping threads, and assign a file to each Map() invocation. You should also create numReducers sorter and reducer threads at some point, to work on the map output.
· Partitioning and Sorting. Your central data structure, partitions (or PartitionTable), should be made concurrent, allowing mappers to put values into different partitions correctly and efficiently. You are only required to support this data structure using coarse-grained locking, that is, each partition protected by a single lock [see Extra-credits]. Once the mappers have completed, a sorting phase should order the key/value lists. Then, finally, each reducer thread should start calling the user-defined Reduce() function on the keys in sorted order per partition. You should think about what type of locking is needed throughout this process for correctness.

Restrictions

· Except for the linked list provided to you in the starter code, you may only use primitive types in Java.
· You must use the Java threads API (create, join, etc.), condition variables, locks, and/or semaphores, which are generic abstractions commonly supported across different programming languages. You are not allowed to use Java's built-in synchronization primitives (such as Object.notify()) or Java thread pools (such as Executor), which are specific to the Java language.
· Your implementation must show clear pipelined processing of the input data in the log: sorters must wait for all mappers to complete, and a reducer cannot start working on a partition until that partition has been sorted by a sorter (hint: you can use condition variables (await, signal); see the sketch after this list).
· You can add your own code and custom classes, but there should be no modification or deletion of the existing code in the starter-code folder, especially the testing and logging code, without permission from the instructor.
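The default partitioner listing is missing from this copy of the handout. A common default, and a reasonable guess at what the starter code does (though you should defer to the actual MapperReducerAPI source), is to hash the key and take the result modulo the number of partitions:

    // Hypothetical reconstruction of a typical default partitioner; check
    // the real MapperReducerAPI for the actual implementation.
    public int Partitioner(String key, int numPartitions) {
        // Mask off the sign bit so the result is always in [0, numPartitions - 1].
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

For the pipelining restriction above, one way to coordinate the phases is a small monitor that tracks how many mappers are still running and which partitions have been sorted. The following is a minimal sketch using one lock and two condition variables; the class name PhaseBarrier and its methods are made up for illustration and are not part of the starter code:

    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical helper that enforces the map -> sort -> reduce pipeline.
    public class PhaseBarrier {
        private final Lock lock = new ReentrantLock();
        private final Condition allMapped = lock.newCondition();
        private final Condition sorted = lock.newCondition();

        private int mappersRemaining;             // mappers still running
        private final boolean[] partitionSorted;  // true once partition p is sorted

        public PhaseBarrier(int numMappers, int numPartitions) {
            this.mappersRemaining = numMappers;
            this.partitionSorted = new boolean[numPartitions];
        }

        // Called by each mapper thread when it finishes its split.
        public void mapperDone() {
            lock.lock();
            try {
                if (--mappersRemaining == 0) {
                    allMapped.signalAll();        // wake every waiting sorter
                }
            } finally {
                lock.unlock();
            }
        }

        // Called by a sorter thread before it starts sorting its partition.
        public void awaitAllMappers() throws InterruptedException {
            lock.lock();
            try {
                while (mappersRemaining > 0) {
                    allMapped.await();
                }
            } finally {
                lock.unlock();
            }
        }

        // Called by a sorter thread after its partition is sorted.
        public void sorterDone(int partition) {
            lock.lock();
            try {
                partitionSorted[partition] = true;
                sorted.signalAll();               // wake reducers waiting on a partition
            } finally {
                lock.unlock();
            }
        }

        // Called by a reducer thread before it starts reducing its partition.
        public void awaitSorted(int partition) throws InterruptedException {
            lock.lock();
            try {
                while (!partitionSorted[partition]) {
                    sorted.await();
                }
            } finally {
                lock.unlock();
            }
        }
    }

With one sorter and one reducer per partition, each reducer calls awaitSorted(p) for its own partition p, so a reducer for an already-sorted partition can start while other partitions are still being sorted; this is the pipelined behavior the log is supposed to show.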
Testing

Your code will be measured for correctness, ensuring that it performs the maps and reductions correctly; it will then be measured for performance improvement.

· Test scripts are automatically invoked at the end to check whether your implementation produces correct output.
· Logs are printed to show the progress and ordering of the running mappers, sorters, and reducers. Logging is not only useful for checking that the pipeline is implemented correctly, but is also a powerful debugging tool for you.

Extra-credits

· 20% extra credit for fine-grained concurrency support in the partitions data structure using hand-over-hand locking (see the sketch below).
· 20% extra credit for writing another MapReduce application using your own MapReduce framework! Talk to the instructor for the PageRank dataset.
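For the first extra credit, hand-over-hand (lock-coupling) locking gives each node of a partition's linked list its own lock; a traversal acquires the next node's lock before releasing the current one, so multiple threads can operate inside the same list at once. Here is a minimal sketch of the traversal pattern for a sorted insert; the Node type is made up for illustration, and the starter code's linked list may differ:

    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical node type; the starter code's list may look different.
    class Node {
        final ReentrantLock lock = new ReentrantLock();
        String key;
        String value;
        Node next;
    }

    class HandOverHandList {
        // A sentinel head simplifies the protocol: there is always a node
        // to lock before touching the first real element.
        private final Node head = new Node();

        // Insert (key, value) in ascending key order via a lock-coupled walk.
        void insert(String key, String value) {
            Node prev = head;
            prev.lock.lock();
            Node curr = prev.next;
            if (curr != null) curr.lock.lock();
            try {
                // Hold at most two locks at a time: acquire the next node's
                // lock before releasing the previous one ("hand over hand").
                while (curr != null && curr.key.compareTo(key) < 0) {
                    prev.lock.unlock();
                    prev = curr;
                    curr = curr.next;
                    if (curr != null) curr.lock.lock();
                }
                Node n = new Node();
                n.key = key;
                n.value = value;
                n.next = curr;
                prev.next = n;
            } finally {
                if (curr != null) curr.lock.unlock();
                prev.lock.unlock();
            }
        }
    }

Because only a small window of the list is locked at any moment, mappers emitting into the same partition can proceed concurrently instead of serializing on one per-partition lock.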