4/7/2021 A2: Count non-fluencies using PySpark
https://umd.instructure.com/courses/1300184/assignments/5531983?module_item_id=10557802

A2: Count non-fluencies using PySpark
Due: Feb 24 by 11:59pm
Points: 20
Submitting: a file upload

Total Points: 20

Task:
In this assignment, you will use PySpark to analyze large amounts of text data and find actionable insights from them. In particular, you will find frequencies of non-fluencies in tweets.

A non-fluency is an extra word in a sentence that does not contribute to the overall meaning of the sentence. Consider the following example:

    I'll play with you um um like after my snack

Here, um is a non-fluency with frequency 2. Such non-fluencies can serve as an important feature for understanding informal language, spontaneous speech, expressions, and so on.

Please use the following (key, value) pairs as the definition of the different types of non-fluencies. The '+' and '*' signs are regular expression (regex) operators and should be used to match the non-fluencies. You can use Python's built-in re package for this purpose.

    non_fluencies_dict = [('MM', ['mm+']),
                          ('OH', ['oh+', 'ah+']),
                          ('SIGH', ['sigh', 'sighed', 'sighing', 'sighs', 'ugh', 'uh']),
                          ('UM', ['umm*', 'hmm*', 'huh'])]

The dictionary contains four different types of non-fluencies. The final output should include the frequencies of non-fluencies across all tweets that fall under each category.

Sample input:
Tweet1: Oh, I ate a burger in the afternoon. That was so great, ummm. Oh, and guess what!, I ate pizza at dinner. Hmm, that's good, right?
Tweet2: Oh, I had a terrible dream. Ohhh! I just want to forget it.

Sample output:
('MM', []), ('OH', [('oh', 3), ('ohhh', 1)]), ('SIGH', []), ('UM', [('ummm', 1), ('hmm', 1)])

The implementation must use PySpark and HDFS. You should first test your program with small text files.
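Before moving to Spark, the matching rules can be checked on the sample input in plain Python. The following is a minimal sketch, not the reference solution. It assumes that a tweet is split into runs of lowercase letters and that a token counts only when the whole token matches a pattern (so "that" does not count as a hit for 'ah+'). The names classify, count_non_fluencies and SAMPLE_TWEETS are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

# Category definitions copied from the assignment text.
non_fluencies_dict = [
    ('MM', ['mm+']),
    ('OH', ['oh+', 'ah+']),
    ('SIGH', ['sigh', 'sighed', 'sighing', 'sighs', 'ugh', 'uh']),
    ('UM', ['umm*', 'hmm*', 'huh']),
]

# One compiled alternation per category, e.g. 'oh+|ah+' for OH.
COMPILED = [(cat, re.compile('|'.join(pats))) for cat, pats in non_fluencies_dict]

# The two sample tweets from the assignment.
SAMPLE_TWEETS = [
    "Oh, I ate a burger in the afternoon. That was so great, ummm. "
    "Oh, and guess what!, I ate pizza at dinner. Hmm, that's good, right?",
    "Oh, I had a terrible dream. Ohhh! I just want to forget it.",
]


def classify(token):
    """Return the category whose pattern matches the whole token, else None."""
    for cat, pattern in COMPILED:
        # fullmatch is an assumption: it rejects partial hits inside longer words.
        if pattern.fullmatch(token):
            return cat
    return None


def count_non_fluencies(tweets):
    """Count matching tokens per category across all tweets."""
    counts = {cat: Counter() for cat, _ in non_fluencies_dict}
    for tweet in tweets:
        # Lowercase first: the patterns do not match upper-case characters.
        for token in re.findall(r'[a-z]+', tweet.lower()):
            cat = classify(token)
            if cat is not None:
                counts[cat][token] += 1
    # Tokens are sorted alphabetically within each category.
    return [(cat, sorted(counts[cat].items())) for cat, _ in non_fluencies_dict]


if __name__ == "__main__":
    print(count_non_fluencies(SAMPLE_TWEETS))
```

Under these assumptions the counts agree with the sample output; only the order of tokens inside a category differs, because this sketch sorts them alphabetically.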
Then use these small (https://drive.google.com/file/d/1QruCplvur66c6VwLf_aaPvyq_9-qSBWX/view?usp=sharing) and large (https://drive.google.com/file/d/1N3nv6cQTAIGDX6iZ6rXBsGUKvcSx6pB1/view?usp=sharing) tweets for the final results.

Please convert the texts to lower case. The regexes given above will not match upper-case characters. You can use Python's built-in lower function for this purpose.

Deliverables:
Submit the following files:
1. Analyze.py containing your Python implementation. Please provide appropriate comments in your code. Good programming practices are 10% of your total marks.
2. A screenshot containing the final results obtained from the analysis.
3. A readme file in case you have special instructions for running the code (you have a different OS, python3, or system requirements).
Please upload the files separately, not as a ZIP.
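The PySpark and HDFS requirement can be approached as a word-count style job: each tweet is mapped to ((category, token), 1) pairs, the pairs are summed with reduceByKey, and the small result is regrouped on the driver. The sketch below is one possible shape under stated assumptions, not the expected solution. It assumes one tweet per line in the input file and whole-token matching. The helper names tweet_to_pairs, assemble and run_job are hypothetical.

```python
import re
from operator import add

# Category definitions copied from the assignment text.
NON_FLUENCIES = [
    ('MM', ['mm+']),
    ('OH', ['oh+', 'ah+']),
    ('SIGH', ['sigh', 'sighed', 'sighing', 'sighs', 'ugh', 'uh']),
    ('UM', ['umm*', 'hmm*', 'huh']),
]
COMPILED = [(cat, re.compile('|'.join(pats))) for cat, pats in NON_FLUENCIES]


def tweet_to_pairs(line):
    """Map one tweet (one input line, by assumption) to ((category, token), 1) pairs."""
    pairs = []
    for token in re.findall(r'[a-z]+', line.lower()):
        for cat, pattern in COMPILED:
            if pattern.fullmatch(token):  # whole-token match is an assumption
                pairs.append(((cat, token), 1))
                break
    return pairs


def assemble(counted):
    """Regroup ((category, token), n) records into [(category, [(token, n), ...]), ...]."""
    by_cat = {cat: [] for cat, _ in NON_FLUENCIES}
    for (cat, token), n in counted:
        by_cat[cat].append((token, n))
    return [(cat, sorted(by_cat[cat])) for cat, _ in NON_FLUENCIES]


def run_job(input_path):
    """Run the count on a text file; input_path may be an hdfs:// URI."""
    # Imported here so the helpers above can be tested without a Spark install.
    from pyspark import SparkContext

    sc = SparkContext(appName="NonFluencyCount")
    try:
        counted = (sc.textFile(input_path)
                     .flatMap(tweet_to_pairs)
                     .reduceByKey(add)
                     .collect())  # the result is small: one record per distinct token
    finally:
        sc.stop()
    return assemble(counted)
```

To use it, a script such as Analyze.py could call run_job with the HDFS location of an uploaded tweet file and print the returned list. A path like hdfs:///user/student/tweets_small.txt is only a placeholder; the actual location depends on where the file was put in HDFS.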
Apr 08, 2021