Working with MapReduce DS730 In this project, you will be working with input, output, Python and Hadoop framework. You will be writing multiple mappers and reducers to solve a few different problems....

Please see attached file


Working with MapReduce DS730

In this project, you will be working with input, output, Python and the Hadoop framework. You will be writing multiple mappers and reducers to solve a few different problems. If you recall, the map and reduce functions are stateless, and this is especially important when dealing with Hadoop and distributed work. We can't guarantee that any one mapper will read in all of the data, nor can we guarantee that certain inputs will end up on the same machine for mapping. Rather, one mapper will likely read in a small portion of the data. The output that your mapper produces must depend only on the current input value. For the reducer, the only guarantee is that (key,value) pairs with the same key will end up on the same reducer.

Your mapper and reducer cannot be trivial. For example, do not have all of your mappers use the same key and then solve everything in the reducer. Such a solution defeats the purpose of MapReduce because all (key,value) pairs will end up on the same reducer. If you are unsure whether your keys are trivial, post a private message to the message board for the instructors and we will let you know.

A couple of very important things:
1. Make sure your key is separated from your value by a tab. Hadoop will only work if this is the case. Otherwise, Hadoop has no idea what your "key" is, nor will it know what your "value" is.
2. Make sure that this is the first line of your mapper and reducer: #!/usr/bin/env python

You must write 1 mapper file and 1 reducer file to solve each of the following problems. Make sure you name your files mapperX.py and reducerX.py, where X is the problem number. For each problem, Hadoop will define what your input files are, so there is no need to read in from any file. Simply read in from the command line (standard input). You are encouraged to use the "starter" mapper and reducer as shown in the activity.
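To make the tab-separated (key,value) requirement concrete, here is a minimal local sketch of what happens between a mapper and a reducer: the mapper output is sorted by key so that equal keys arrive together. The helper names emit, parse and shuffle are hypothetical and exist only for this illustration; they are not part of Hadoop or of the course starter code.

```python
from itertools import groupby

def emit(key, value):
    """Format one (key, value) pair as key<TAB>value."""
    return f"{key}\t{value}"

def parse(line):
    """Split a line back into (key, value) at the first tab only."""
    key, value = line.rstrip("\n").split("\t", 1)
    return key, value

def shuffle(mapper_output):
    """Simulate the shuffle locally: sort by key, then group equal keys."""
    pairs = sorted((parse(line) for line in mapper_output), key=lambda kv: kv[0])
    return {key: [value for _, value in group]
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

# Three mapper output lines, deliberately out of key order.
lines = [emit("b", 1), emit("a", 1), emit("b", 1)]
print(shuffle(lines))
```

Because the split happens at the first tab only, a value may itself contain tabs, but a key may not.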
Discovering Contacts

On many social media websites, it is common for the company to provide a list of suggested contacts for you to connect with. Many of these suggestions come from your own list of current contacts. The basic idea behind this concept is: I am connected with person A and person B but not person C. Person A and person B are both connected to person C. None of my contacts are connected to person D. It is more likely that I know person C than some other random person D who is connected to no one I know. For this problem, all connections are mutual (i.e. if A is connected to B, then B is connected to A).

In this problem, you will read in an input file that is delimited in the following manner:

PersonA : PersonX PersonY PersonZ PersonQ
PersonB : PersonF PersonY PersonX PersonG PersonM
…

The person to the left of the colon is the current person. All people to the right of the colon are the people that the current person is connected to. All people will be separated by a single space. In the example above, PersonA is connected to PersonX, Y, Z and Q. In all inputs, all people will be replaced with positive integer ids to keep things simple. The following is a sample input file:

6 : 2 9 8 10
1 : 3 5 8 9 10 12
4 : 2 5 7 8 9
2 : 3 4 7 6 13
12 : 1 7 5 9
3 : 9 11 10 1 2 13
10 : 1 3 6 11
5 : 4 1 7 11 12
13 : 2 3
8 : 1 6 4 11
7 : 5 2 4 9 12
11 : 3 5 10 8
9 : 12 1 3 6 4 7

The people on the right hand side of the input can appear in any order. Your goal is this: you must output potential contacts based on the following 2 criteria:

1. Someone who might be someone you know. For someone to be suggested here, the person must not currently be a connection of yours and that person must be a connection of exactly 2 or 3 of your current connections. For example, consider person 2 in the above example. Person 2 is connected with 3, 4, 6, 7 and 13. Person 4 is connected to 8, person 6 is connected to 8, person 3 is not connected to 8, person 7 is not connected to 8 and person 13 is not connected to 8. Therefore, person 2 has two connections (4 and 6) that are connected to 8 and person 2 is not currently connected to 8. Therefore, person 2 might know person 8.

2. Someone you probably know. For someone to be suggested here, the person must not currently be a connection of yours and that person must be a connection of 4 or more of your current connections. For example, consider person 2 in the above example. Person 2 is connected with 3, 4, 6, 7 and 13. Person 4 is connected to 9, person 6 is connected to 9, person 3 is connected to 9 and person 7 is connected to 9. Therefore, person 2 has at least four connections that are connected to 9 and person 2 is not currently connected to 9. Therefore, person 2 probably knows person 9.

Your output must be formatted in the following fashion:

personID:Might(personA,…,personX) Probably(personA,…,personX)

Each line has a personID followed by a colon. The colon is followed by the list of Might's separated by commas (but no space). If a person has no one they might be connected to, this list is not printed at all (see person 13 below for example). The Might list is followed by a single space and then by the Probably list separated by commas (but no space). If a person has no one they probably are connected to, this list is not printed at all (see person 3 for example). If a person has neither a Might list nor a Probably list, that person only has their id along with a colon (see person 13 for example). The Might list must appear before the Probably list. If there is no Might list but there is a Probably list, there is no space between the colon and the Probably list. The integers within each list must appear in increasing order. However, the rows of the output need not appear in any specific order. For example, the row for 5 might appear before the row for 3.

As a concrete example from the above sample input, this would be a potential sample output:

1:Might(4,6,7) Probably(11)
2:Might(5,8,10) Probably(9)
3:Might(4,5,6,7,8,12)
4:Might(1,3,6,11,12)
5:Might(2,3,8,10) Probably(9)
6:Might(1,3,4,7,11)
7:Might(1,3,6)
8:Might(2,3,5,9,10)
9:Might(8,10) Probably(2,5)
10:Might(2,5,8,9)
11:Might(4,6) Probably(1)
12:Might(3,4)
13:

For each question, the rows do not have to be in any specific order. The following is also a valid output for number 3:

9:Might(8,10) Probably(2,5)
1:Might(4,6,7) Probably(11)
3:Might(4,5,6,7,8,12)
4:Might(1,3,6,11,12)
5:Might(2,3,8,10) Probably(9)
11:Might(4,6) Probably(1)
6:Might(1,3,4,7,11)
7:Might(1,3,6)
2:Might(5,8,10) Probably(9)
8:Might(2,3,5,9,10)
10:Might(2,5,8,9)
12:Might(3,4)
13:

Other Important Information

1. You can use any Python libraries as long as they are installed by default on the Hortonworks machine and your code works on Hadoop. All projects will be tested using Hadoop on Hortonworks. You should ensure that your code executes correctly on that platform before submitting.
2. The order of the rows in the output does not matter. This is shown in the last part of problem 3. For problem 2, for example, the :1 row could come after the eo:2 row.
3.
The format must be:

mapper.py

#!/usr/bin/python3.6
#Be sure the indentation is identical and also be sure the line above this is on
#the first line
import sys
import re
line = sys.stdin.readline()
pattern = re.compile("[a-zA-Z0-9]+")
while line:
    for word in pattern.findall(line):
        print(word+"\t"+"1")
    line = sys.stdin.readline()

reducer.py

#!/usr/bin/python3.6
#Be sure the indentation is correct and also be sure the line above this is on the first line
import sys
current_word = None
current_count = 0
word = None
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
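For the Discovering Contacts problem described above, one possible stateless approach can be sketched as follows. This is an illustrative sketch, not the instructors' reference solution and not the answer posted below. Each input line emits, for every pair of people in the same connection list, a record saying they share the line's owner as a mutual connection. The names map_line, reduce_person and run_local, and the S/F/M value tags, are assumptions made for this example. run_local imitates the sort-and-group step locally so the logic can be checked against the sample input.

```python
from collections import Counter
from itertools import groupby

SAMPLE = """6 : 2 9 8 10
1 : 3 5 8 9 10 12
4 : 2 5 7 8 9
2 : 3 4 7 6 13
12 : 1 7 5 9
3 : 9 11 10 1 2 13
10 : 1 3 6 11
5 : 4 1 7 11 12
13 : 2 3
8 : 1 6 4 11
7 : 5 2 4 9 12
11 : 3 5 10 8
9 : 12 1 3 6 4 7
"""

def map_line(line):
    """Yield key<TAB>value records that depend only on this one input line."""
    person, _, rest = line.partition(":")
    person, friends = person.strip(), rest.split()
    if not person:
        return
    yield f"{person}\tS,"                 # S: the person exists, even with no suggestions
    for f in friends:
        yield f"{person}\tF,{f}"          # F: f is already a direct connection
    for a in friends:
        for b in friends:
            if a != b:
                yield f"{a}\tM,{b}"       # M: a and b share this person as a connection

def reduce_person(person, values):
    """Build one output row from all values that share the key `person`."""
    direct, mutual = set(), Counter()
    for value in values:
        tag, other = value.split(",", 1)
        if tag == "F":
            direct.add(other)
        elif tag == "M":
            mutual[other] += 1

    def fmt(label, keep):
        ids = sorted(int(c) for c, n in mutual.items() if c not in direct and keep(n))
        return f"{label}({','.join(map(str, ids))})" if ids else ""

    parts = [fmt("Might", lambda n: 2 <= n <= 3), fmt("Probably", lambda n: n >= 4)]
    return f"{person}:" + " ".join(p for p in parts if p)

def run_local(text):
    """Imitate map, sort-by-key and reduce on one machine for testing."""
    pairs = sorted(kv.split("\t", 1) for line in text.splitlines() for kv in map_line(line))
    return [reduce_person(k, [v for _, v in g]) for k, g in groupby(pairs, key=lambda kv: kv[0])]

for row in run_local(SAMPLE):
    print(row)
```

The key is the person receiving suggestions, so the work spreads across reducers instead of collapsing onto a single key. The cost is that a person with n connections emits on the order of n squared records.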
Answered 1 days After Sep 23, 2021

Karthi answered on Sep 25 2021
68_python_map_reduce_91859/mapperz.py
#!/usr/bin/env python
import sys
import re
###########################
#Procedure to split the lines.
#Input List with person, connection1 connection2 connection3...
#Output person_out, list of connections
###########################
def split_lines(line_in):
    person_out = line_in[0].strip()
    connections_out = line_in[1].replace('\n', '')
    connections_out = connections_out.strip()
    connections_out = connections_out.split()
    return person_out,connections_out
###########################
##########################
#Procedure to create a list of connections.
#input - person, list of friends
#output - list of all (person, friend) combinations.
##########################
def create_connection_friend_list(connection_in, friend_in):
    list_out = []
    connection_in_strip = connection_in.strip()
    for friends in friend_in:
        if friends != ' ' and friends != '' and friends != '\n' :
            list_out.append((connection_in_strip,friends))
    return list_out
#########################
#Procedure to print all combinations of person, friend, and friends of friends.
#Input - List of person and friends.
#Output - Print Person + Friends, and Possible Future Friends.
#########################
def create_connection_friend_friend_list(list_in):
    list_in_loop = list_in
    for first,args in list_in:
        for first2,args2 in list_in_loop:
            if args == first2:
                print(first+"\t"+args+'+'+args2)
    return
#########################
#main procedure
#########################
def main(argv):
    connection_friend_list =[]     #Reset List
    line = sys.stdin.readline()     #Read first Line
    while line:
        line = line.split(":")         #Split by Colon
        Person,Connections = split_lines(line)
        call_list = []
        call_list = create_connection_friend_list(Person,Connections)
        connection_friend_list.extend(call_list)

        line = sys.stdin.readline() #Read Line

    #Note: the pairs from every line read by this mapper are buffered above
    #and joined here, after the loop, so the join only sees this mapper's input.
    create_connection_friend_friend_list(connection_friend_list)
#Note there are two underscores around name and main
if __name__ == "__main__":
    main(sys.argv)
68_python_map_reduce_91859/mapperx.py
#!/usr/bin/env python
import sys
import re
##########################
#Procedure to Split the lines into individual variables.
#Input - Line of data
#Output - InvoiceNo_out , CustomerID_out , Country_out , Amount_Spent_out , InvoiceMonth_out
##########################
def split_lines(line_in):
    InvoiceNo_out = line_in[0] #Read Invoice Number
    CustomerID_out = line_in[6] #Read Customer ID
    Quantity_out = line_in[3] #Read Quantity
    UnitPrice_out = line_in[5] #Read Unit Price
    Country_out = line_in[7] #Read Country
    Country_out = Country_out.replace('\r', '').replace('\n', '')             #Strip Line Return
    InvoiceDate_out = line_in[4]                 #Read Invoice Date
    Amount_Spent_out = 0
    #Note: both values are truncated to whole numbers before multiplying
    Amount_Spent_out = int(float(Quantity_out)) * int(float(UnitPrice_out))     #Calculate Amount Spent
    InvoiceMonth_out, date_out, year_time_out = InvoiceDate_out.split('/')         #Get Month
    InvoiceMonth_out = str(InvoiceMonth_out).zfill(2)                 #Zero-pad to 2 characters
    return InvoiceNo_out , CustomerID_out , Country_out , Amount_Spent_out , InvoiceMonth_out
##########################
#Main Procedure
##########################
def main(argv):
    line = sys.stdin.readline() #Read Header
    line = sys.stdin.readline() #Read first Line
    while line:
        line = line.split(",") #Split by CSV
        InvoiceNo,CustomerID,Country,Amount_Spent,InvoiceMonth = split_lines(line)
        if CustomerID != '' and InvoiceNo[0] != 'C':    #Exclude C invoices and rows with no Customer ID
            print(Country+'+'+InvoiceMonth+"\t"+str(Amount_Spent)+'+'+CustomerID)
        line = sys.stdin.readline() #Read Line

#Note there are two underscores around name and main
if __name__ == "__main__":
    main(sys.argv)
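The mapper above splits each line on every comma, so a quoted field that itself contains a comma would shift the later columns. A minimal sketch of a more tolerant parse with the standard csv module follows. It assumes the same column positions the mapper uses (3 quantity, 5 unit price, 6 customer id, 7 country). The function names parse_row and amount_spent and the sample row are made up for illustration.

```python
import csv

def parse_row(line):
    """Parse one CSV line, honouring double-quoted fields that contain commas."""
    return next(csv.reader([line]))

def amount_spent(fields):
    """Quantity times unit price as floats (no truncation to whole numbers)."""
    return float(fields[3]) * float(fields[5])

# Hypothetical row: the third field is quoted and contains a comma.
row = '536365,85123A,"HOLDER, WHITE",6,12/1/2010 8:26,2.55,17850,United Kingdom'

naive = row.split(",")
fields = parse_row(row)
print(len(naive), len(fields))   # the naive split yields one field too many
print(fields[7], amount_spent(fields))
```

Whether float amounts are appropriate depends on how the matching reducer parses the value, which is not fully visible in this listing.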
68_python_map_reduce_91859/mappery.py
#!/usr/bin/env python
import sys
import re
##########################
#Main Procedure
##########################
def main(argv):
    line = sys.stdin.readline()            #read the first line
    pattern = re.compile("[a-zA-Z0-9]+")

    while line:
        for word in pattern.findall(line):
            vowels = ''
            for i in word.lower():
                if(i=='a' or i=='e' or i=='i' or i=='o' or i=='u'):
                    vowels += i
            vowels_sorted = sorted(vowels)
            vowels_print = ''.join(vowels_sorted)
            if len(vowels_print) == 0:        #If there are no vowels, print an underscore as the key
                print("_"+"\t"+"1")
            if len(vowels_print) >= 1:        #If there is at least 1 vowel, print the sorted vowels with 1.
                print(vowels_print+"\t"+"1")
        line = sys.stdin.readline()
#Note there are two underscores around name and main
if __name__ == "__main__":
    main(sys.argv)
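The key-building step of the mapper above can be isolated into a small pure function, which makes it easy to check by hand. The name vowel_key is hypothetical; the "_" placeholder for words without vowels follows the mapper's own choice.

```python
def vowel_key(word, placeholder="_"):
    """Return the word's vowels, lowercased and sorted, or a placeholder if it has none."""
    vowels = sorted(ch for ch in word.lower() if ch in "aeiou")
    return "".join(vowels) or placeholder

# Emit key<TAB>1 for a few sample words, as the mapper does per word.
for w in ["Hello", "rhythm", "Audio"]:
    print(vowel_key(w) + "\t1")
```

Words with the same multiset of vowels share a key, so they end up on the same reducer and can be summed there.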
68_python_map_reduce_91859/reducerx.py
#!/usr/bin/env python

import sys
import pandas as pd
from collections import Counter
##########################
#Procedure to find the biggest spender.
#input
#list of values by the following keys, invoice month, country,
#output
#InvoiceMonth_in, Country_in, big spender
##########################
def print_rows(list_in,InvoiceMonth_in,Country_in):
    #Combine the Amounts for each Customer.
    d = {x:0 for x,_ in list_in}
    for name,num in list_in: d[name] += num
    #Use a list so it can be scanned more than once (map() is a one-shot iterator in Python 3)
    Result = list(d.items())
    #Find the max value in the List.
    max_Cust = max(Result,key=lambda item:item[1])[0]
    max_val = max(Result,key=lambda item:item[1])[1]
    #If there is a tie print them all
    for first,args in Result:
        if args ==...
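The listing is cut off at this point in the source. As an independent sketch, and not the remainder of the answerer's code, per-customer totals and all tied top spenders could be computed along these lines. The function name top_spenders and the sample pairs are hypothetical.

```python
from collections import defaultdict

def top_spenders(pairs):
    """Sum amounts per customer and return (sorted list of tied top customers, top total)."""
    totals = defaultdict(float)
    for customer, amount in pairs:
        totals[customer] += amount
    if not totals:
        return [], 0.0
    best = max(totals.values())
    return sorted(c for c, t in totals.items() if t == best), best

# Two customers end up tied at 25.0, so both are reported.
pairs = [("17850", 10.0), ("13047", 25.0), ("17850", 15.0), ("12583", 5.0)]
print(top_spenders(pairs))
```

Comparing totals with == is safe for these sample values; for arbitrary floating-point sums a tolerance, or integer amounts in cents, may be preferable.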