Spark RDD with text (8 marks) Detecting popular and trending topics from the news articles is an important task for public opinion monitoring. In Part A your task is to perform text data analysis over...

1 answer below »

Spark RDD with text (8 marks)





Detecting popular and trending topics from the news articles is an important task for public opinion monitoring. In Part A your task is to perform text data analysis over a dataset of Australian news from ABC (Australian Broadcasting Corporation) using Spark RDD.



The dataset you are going to use contains data of news headlines published over several years. In this text file, each line is a
headline of a news article, in format of "date,
term1 term2 ... ... ". The lines are sorted by the date, and the terms are separated by the space character. A sample file is like below:




20030219,council chief executive fails to secure position 20030219,council welcomes ambulance levy decision 20030219,council welcomes insurance breakthrough 20030219,fed opp to re introduce national insurance 20040501,cowboys survive eels comeback 20040501,cowboys withstand eels fightback 20040502,castro vows cuban socialism to survive bush 20200401,coronanomics things learnt about how coronavirus economy 20200401,coronavirus at home test kits selling in the chinese community 20200401,coronavirus campbell remess streams bear making classes 20201015,coronavirus pacific economy foriegn aid china 20201016,china builds pig apartment blocks to guard against swine flu





When you click the panel on the right you'll get a connection to a server that has, in your home directory, a text file called "abcnews.txt", containing some sample text (feel free to open the file and explore its contents). The entire dataset can be downloaded from
https://www.kaggle.com/therohk/million-headlines.



Your task is to
find the top-3 most frequent terms for each year. That is, for each year, select 3 terms that appeared in the most articles of that year, which represent the hot topics. If some words appear in the same number of articles, sort them in ascending order alphabetically.



Please ignore the "stop words" which are frequent but meaningless for this task, including: "to", "a", "an", "the", "for", "in", "on", "of", "at", "over", "with", "after", "and", "from", "new", "us", "by", "as", "man", "up", "says", "in", "out", "is", "be", "are", "not", "pm", "am", "off", "more", "less", "no", "how".



In your output, sort the results by years. For each year (in one line), sort the top-3 terms first by their article frequencies and then by the terms in alphabetical order. For example, given the above data set, the output should be (using Spark RDD):




2003 council insurance welcomes 2004 cowboys eels survive 2020 coronavirus china economy




Write a Python program that uses Spark RDD to do this. A file called "rdd.py" has been created for you - you just need to fill in the details.
Note that
the efficiency
(the time complexity) of your method will be considered for marking.



To debug your code, you can first test everything in pyspark, and then write the codes in "rdd.py". To test your program, you first need to create your default directory in Hadoop, and then copy abcnews.txt to it:




$ hdfs dfs -mkdir -p /user/user $ hdfs dfs -put abcnews.txt




Similarly, please also update the file "stopwords.txt" to HDFS, also in the folder "/user/user".



You can run your program on Spark by running the following command:




$ spark-submit rdd.py




Please save your results in the 'result-rdd' folder in HDFS (i.e., use saveAsTextFile("result-rdd") in your code).



Answered Same DayAug 09, 2022

Answer To: Spark RDD with text (8 marks) Detecting popular and trending topics from the news articles is an...

Aditi answered on Aug 10 2022
68 Votes
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here