
No Referencing needed!


Part A - Spark RDD with text (8 marks)

Detecting popular and trending topics from news articles is an important task for public opinion monitoring. In Part A your task is to perform text data analysis over a dataset of Australian news from ABC (Australian Broadcasting Corporation) using Spark RDD. The dataset contains news headlines published over several years. In this text file, each line is the headline of a news article, in the format "date, term1 term2 ... ...". The lines are sorted by date, and the terms are separated by the space character. A sample file looks like this:

    20030219,council chief executive fails to secure position
    20030219,council welcomes ambulance levy decision
    20030219,council welcomes insurance breakthrough
    20030219,fed opp to re introduce national insurance
    20040501,cowboys survive eels comeback
    20040501,cowboys withstand eels fightback
    20040502,castro vows cuban socialism to survive bush
    20200401,coronanomics things learnt about how coronavirus economy
    20200401,coronavirus at home test kits selling in the chinese community
    20200401,coronavirus campbell remess streams bear making classes
    20201015,coronavirus pacific economy foriegn aid china
    20201016,china builds pig apartment blocks to guard against swine flu

When you click the panel on the right you'll get a connection to a server that has, in your home directory, a text file called "abcnews.txt", containing some sample text (feel free to open the file and explore its contents). The entire dataset can be downloaded from https://www.kaggle.com/therohk/million-headlines.

Your task is to find the top-3 most frequent terms for each year. That is, for each year, select the 3 terms that appeared in the most articles of that year, which represent the hot topics. If some words appear in the same number of articles, sort them in ascending alphabetical order.
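The counting described above can be sketched as follows. This is only one possible approach under stated assumptions, not the expected solution: the helper names (`year_terms`, `top3`, `local_check`, `top_terms_by_year`) are hypothetical, stop words are passed in as a Python set, and the tab between the year and its terms is a guess at the output separator. The two helpers are plain Python so they can be checked without a cluster; `top_terms_by_year` shows how they might be wired into RDD operations and needs a live SparkContext.

```python
from heapq import nsmallest
from operator import add


def year_terms(line, stop_words):
    """Turn 'date,term1 term2 ...' into ((year, term), 1) pairs, one per distinct term."""
    date, _, headline = line.partition(",")
    # set() so a term repeated inside one headline still counts as one article
    terms = set(headline.split()) - stop_words
    return [((date[:4], term), 1) for term in terms]


def top3(term_counts):
    """Pick three (term, count) pairs: highest count first, ties alphabetical."""
    return nsmallest(3, term_counts, key=lambda tc: (-tc[1], tc[0]))


def local_check(lines, stop_words):
    """Pure-Python stand-in for the RDD pipeline, for small samples only."""
    counts = {}
    for line in lines:
        for key, one in year_terms(line, stop_words):
            counts[key] = counts.get(key, 0) + one
    by_year = {}
    for (year, term), n in counts.items():
        by_year.setdefault(year, []).append((term, n))
    return {year: [t for t, _ in top3(tc)] for year, tc in sorted(by_year.items())}


def top_terms_by_year(sc, in_path, out_path, stop_words):
    """Spark wiring (assumes a SparkContext `sc`; not executed in this sketch)."""
    (sc.textFile(in_path)
       .flatMap(lambda line: year_terms(line, stop_words))
       .reduceByKey(add)                               # ((year, term), n)
       .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))  # (year, (term, n))
       .groupByKey()
       .mapValues(top3)
       .sortByKey()
       # Assumption: tab after the year, spaces between terms
       .map(lambda kv: kv[0] + "\t" + " ".join(t for t, _ in kv[1]))
       .saveAsTextFile(out_path))
```

Since efficiency is marked, note that `groupByKey` moves every (term, count) pair of a year to one place; replacing it with `aggregateByKey` and keeping only three candidates per partition would likely shuffle less data.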
Please ignore the "stop words" which are frequent but meaningless for this task, including: "to", "a", "an", "the", "for", "in", "on", "of", "at", "over", "with", "after", "and", "from", "new", "us", "by", "as", "man", "up", "says", "in", "out", "is", "be", "are", "not", "pm", "am", "off", "more", "less", "no", "how".

In your output, sort the results by year. For each year (on one line), sort the top-3 terms first by their article frequencies and then by the terms in alphabetical order. For example, given the above data set, the output should be (using Spark RDD):

    2003 council insurance welcomes
    2004 cowboys eels survive
    2020 coronavirus china economy

Write a Python program that uses Spark RDD to do this. A file called "rdd.py" has been created for you - you just need to fill in the details. Note that the efficiency (the time complexity) of your method will be considered for marking.

To debug your code, you can first test everything in pyspark, and then write the code in "rdd.py". To test your program, you first need to create your default directory in Hadoop, and then copy abcnews.txt to it:

    $ hdfs dfs -mkdir -p /user/user
    $ hdfs dfs -put abcnews.txt

Similarly, please also upload the file "stopwords.txt" to HDFS, also in the folder "/user/user". You can run your program on Spark by running the following command:

    $ spark-submit rdd.py

Please save your results in the 'result-rdd' folder in HDFS.

Part B - Spark RDD with CSV (4 marks)

In Part B your task is to answer a question about the data in a CSV file using Spark RDD. When you click the panel on the right you'll get a connection to a server that has, in your home directory, the CSV file "orders.csv". It's one that you've seen before.
Here are the fields in the file:

    OrderDate (date)
    ISBN (string)
    Title (string)
    Category (string)
    PriceEach (decimal)
    Quantity (integer)
    FirstName (string)
    LastName (string)
    City (string)

Your task is to find the number of books ordered each day, sorted by the number of books descending, then order date ascending. Your results should appear as the following:

    2009-04-03,10
    2009-04-02,8
    2009-04-01,7
    2009-04-04,6
    2009-03-31,5
    2009-04-05,4
    2009-04-08,4

First (4 marks)

Write a Python program that uses Spark RDDs to do this. A file called "rdd.py" has been created for you - you just need to fill in the details. You should be able to modify programs that you have already seen in this week's content. To sort the RDD results, you can use sortBy, and here is an example of it.

Hint:

    >>> tmp = [('a', 3), ('b', 2), ('a', 1), ('d', 4), ('2', 5)]
    >>> sc.parallelize(tmp).sortBy(lambda x: (x[0],x[1])).collect()
    Output: [('2', 5), ('a', 1), ('a', 3), ('b', 2), ('d', 4)]

To test your program you first need to create your default directory in Hadoop, and copy orders.csv to it:

    $ hdfs dfs -mkdir -p /user/user
    $ hdfs dfs -put orders.csv

You can test your program by running the following command:

    $ spark-submit rdd.py

Please save your results in the 'result-rdd' folder in HDFS.

Part A - Hive with text (4 marks)

In Part A your task is to answer a question about the data in an unprocessed text file using Hive. When you click the panel on the right you'll get a connection to a server that has, in your home directory, a text file called "walden.txt", containing some sample text (feel free to open the file and explore its contents) (it's an extract from Walden, by Henry David Thoreau). In this text file, each line is a sentence. It is worth noting that there are multiple spaces at the end of each line in this unprocessed text file.

Your task is to find the average word lengths according to the first letters of sentences. For example, given a toy input file as shown below:

    Aaa bbb cc. Ab b.
The output should be:

    Letter A: 2.6

Because, for A, we have (3*3+2*2)/5 = 2.6. You can assume that sentences are separated by full stops, and words are separated by spaces. For simplicity, we include all punctuation, like ',' and '.', when calculating word length, like what we did in example1 and example2. (So, the length of 'cc.' is 3 instead of 2.) The case of letters can be ignored.

Given the walden.txt file as input, the format of the output is "letter: avg_word_length" (the result should be rounded to two decimal places, with round(x,2)), as shown below:

    Letter A 4.17
    Letter B 4.82
    Letter F 4.18
    Letter I 4.16
    Letter S 4.32
    Letter T 4.09
    Letter W 4.89

Write a Hive script to do this. A file called "script.hql" has been created for you - you just need to fill in the details. You should be able to modify Hive scripts that you have already seen in this week's content. You might use some User-Defined Functions (UDFs) which can be found here. You can test your script by running the following command (it tells Hive to execute the commands contained in the file script.hql):

    $ hive -f script.hql

This is worth 4 marks. When you are happy that your job and script are correct, click "Submit".

Part B - Spark SQL with CSV (2 marks)

In Part B your task is to answer a question about the data in a CSV file using Spark DataFrames and SQL. When you click the panel on the right you'll get a connection to a server that has, in your home directory, the CSV file "orders.csv". It's one that you've seen before. Here are the fields in the file:

    OrderDate (date)
    ISBN (string)
    Title (string)
    Category (string)
    PriceEach (decimal)
    Quantity (integer)
    FirstName (string)
    LastName (string)
    City (string)

Your task is to find the number of books ordered each day, sorted by the number of books descending, then order date ascending.
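A minimal sketch of the aggregation and two-level sort just described, written as a SQL query. Two assumptions are made here: "number of books" is taken to mean the sum of Quantity per OrderDate, and the table name `orders` is hypothetical. The query is run against an in-memory SQLite table purely so it can be checked without a cluster; with Spark, the same string could presumably be passed to `spark.sql` after loading orders.csv into a DataFrame and registering it with `createOrReplaceTempView("orders")`.

```python
import sqlite3

# Assumption: "number of books" = SUM(Quantity), not the number of order rows.
BOOKS_PER_DAY_SQL = """
SELECT OrderDate, SUM(Quantity) AS books
FROM orders
GROUP BY OrderDate
ORDER BY books DESC, OrderDate ASC
"""


def books_per_day(rows):
    """rows: iterable of (OrderDate, Quantity) tuples; returns 'date,count' strings."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (OrderDate TEXT, Quantity INTEGER)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = [f"{day},{books}" for day, books in con.execute(BOOKS_PER_DAY_SQL)]
    con.close()
    return result


if __name__ == "__main__":
    # Hypothetical rows, not taken from orders.csv
    demo = [("2009-04-02", 3), ("2009-04-01", 2), ("2009-04-02", 1),
            ("2009-04-05", 2), ("2009-04-01", 2)]
    for line in books_per_day(demo):
        print(line)
```

ISO-formatted dates sort correctly as plain strings, which is why the date column needs no conversion for the ascending tie-break.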
Your results should appear as the following:

    2009-04-03,10
    2009-04-02,8
    2009-04-01,7
    2009-04-04,6
    2009-03-31,5
    2009-04-05,4
    2009-04-08,4

Write a Python program that uses Spark DataFrames and SQL to do this. A file called "sql.py" has been created for you - you just need to fill in the details. Again, you should be able to modify programs that you have already seen in this week's content. You can test your program by running the following command:

    $ spark-submit sql.py

Please save your results in the 'result-sql' folder in HDFS. When you are happy that your two programs are correct, click "Submit".

Part C - Spark SQL with CSV (6 marks)

COVID-19 has affected our lives significantly in recent years. In Part C your task is to do a data analysis task over a COVID-19 data set stored in the CSV format using Spark DataFrames and SQL. The COVID-19 dataset contains the cases by notification date and postcode, local health district, and local government area in NSW, Australia. The dataset is updated daily, except on weekends. Here are the fields in the file:

    notification_date (date) -- e.g. 2020-03-29, 2020-03-30 etc.
    postcode (integer) -- e.g. 2011, 2035, etc.
    lhd_2010_code (string) -- local health district code, e.g. X720, X760, etc.
    lhd_2010_name (string) -- local health district name, e.g. South Eastern Sydney, Northern Sydney, etc.
    lga_code19 (string) -- local government area code, e.g. 17200, 16550, etc.
    lga_name19 (string) -- local government area name, e.g. Sydney (C), Randwick (C), etc.

When you click the panel on the right you'll get a connection to a server, and in your home directory you can see a sample of the data set named "cases-locations.csv".

Your task is to find the maximum daily cases number in each local health district (lhd) together with the date. Each line of your result should contain the local health district, the local health district code, the date and the maximum daily increase of total confirmed cases.
The results should be sorted first by the daily increase in descending order, then by the date in ascending order, and finally by the local health district name (lhd_2010_name) in descending order. For a certain local health district, if there are multiple dates that have the same maximum daily cases number, please return all such dates. For example, given the sample data set, your results should be as below:

    Northern Sydney,X760,2020-03-27,44
    South Eastern Sydney,X720,2020-03-27,41
    Western Sydney,X740,2020-03-29,24
    Hunter New England,X800,2020-03-28,22
    South Western Sydney,X710
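One way to express the Part C requirements (peak day per district, all tied dates kept, three-level sort) is a query with two common table expressions, sketched below. It rests on an assumption the task does not state outright: each row of the CSV is one confirmed case, so the daily figure is a row count. The table name `locations` and the column alias `daily_cases` are hypothetical. As a convenience the query is run against in-memory SQLite; with Spark, the same string could presumably be given to `spark.sql` once the CSV is registered as a temporary view of that name.

```python
import sqlite3

# Assumption: one CSV row = one confirmed case, so daily cases = COUNT(*).
PEAK_SQL = """
WITH daily AS (
    SELECT lhd_2010_name, lhd_2010_code, notification_date,
           COUNT(*) AS daily_cases
    FROM locations
    GROUP BY lhd_2010_name, lhd_2010_code, notification_date
),
peak AS (
    SELECT lhd_2010_code, MAX(daily_cases) AS max_cases
    FROM daily
    GROUP BY lhd_2010_code
)
SELECT d.lhd_2010_name, d.lhd_2010_code, d.notification_date, d.daily_cases
FROM daily d
JOIN peak p
  ON d.lhd_2010_code = p.lhd_2010_code AND d.daily_cases = p.max_cases
ORDER BY d.daily_cases DESC, d.notification_date ASC, d.lhd_2010_name DESC
"""


def peak_days(rows):
    """rows: (notification_date, lhd_2010_code, lhd_2010_name) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE locations ("
        "notification_date TEXT, lhd_2010_code TEXT, lhd_2010_name TEXT)"
    )
    con.executemany("INSERT INTO locations VALUES (?, ?, ?)", rows)
    # Joining on the maximum keeps every date that ties for the peak.
    result = [",".join(map(str, r)) for r in con.execute(PEAK_SQL)]
    con.close()
    return result
```

Joining the daily counts back to the per-district maximum, rather than taking a single top row per district, is what returns every date when several share the same peak.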
Answered 6 days after Jul 27, 2022

Rushendra answered on Aug 03 2022