MIST.3050 Programming Assignment: Basis Text Analysis. This assignment will give you the opportunity to develop a program that analyzes text by recognizing and counting the words of different...

I need someone to do my programming assignment.


MIST.3050 Programming Assignment: Basis Text Analysis

This assignment will give you the opportunity to develop a program that analyzes text by recognizing and counting words of different sentiments. The program can be useful in several application scenarios. For example, in a cover letter for a job application you may want to avoid negative words: you can use the program to identify the negative words, then revise your letter to avoid them. You may also use the program to count positive and negative words in online reviews, then use these counts as an alternative to star ratings in a predictive model that you need to build.

1. Background

Words such as “able” and “joy” often have a positive sentiment, while words such as “novice” and “revoke” often have a negative sentiment. Linguists and text analysis researchers have categorized words and documented their results as dictionaries. Several dictionaries are used in practice. The following are two popular ones:

(1) General Inquirer Harvard IV-4 Dictionary (IV-4). Originally developed by Professor Philip Stone at Harvard University for applications in psychology and sociology, the dictionary has been widely used in many areas. The dictionary has 11,888 entries and more than 180 categories, including different categories developed by different researchers. Additional information about this dictionary can be found at http://www.wjh.harvard.edu/~inquirer/. Spreadsheet versions can be found at http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm. A plain text version can be found at wjh.harvard.edu/~inquirer/inqdict.txt.

(2) Master Dictionary for Business (MD). Developed by professors Tim Loughran and Bill McDonald at the University of Notre Dame, this dictionary is specific to financial reports and can be used for business-related texts.
One of the motivations for developing this business-specific dictionary is that words frequently used in business documents, such as “liability” and “vice”, have no negative implications there, yet they are categorized as negative in the Harvard IV-4 dictionary. The master dictionary has more than 80,000 entries. Different forms of the same base word are included as separate entries. These different forms, also known as inflections in natural language processing (NLP), can be singular vs. plural for nouns and different tenses for verbs. Information about this Master Dictionary can be found at https://sraf.nd.edu/textual-analysis/resources/.

With either dictionary, you can look up a word to see if it has been categorized as positive or negative. Words are in uppercase in both dictionaries.

The main difference between the two dictionaries is how they classify word sentiment. Recall the example given earlier: “liability” is negative according to IV-4, but it is not negative according to MD. Likewise, “joy” is positive in IV-4 but not in MD.

Another notable difference is that IV-4 captures different word meanings, while MD captures different word forms. A word can have different meanings, also known as word senses in NLP. In that case, the word has multiple entries in IV-4, each listed as the word followed by “#” and a number, e.g., “ABOVE#1”, “ABOVE#2”, etc. In contrast, MD has only one entry for the word “above”. In terms of word forms, IV-4 is quite limited, but MD includes all forms of a word. For example, IV-4 has two entries for the word “book”: book and booking. In contrast, MD has book, books, booking, bookings, booked, and many book-prefixed words such as bookends and bookkeeper, for a total of 53 entries. As a result, MD ends up having more entries.

2. Tasks

Your main task in this assignment is to identify and count the occurrences of positive words and negative words in any given text. This is conceptually simple: for each word in the text, look it up in a dictionary (or in both dictionaries) to see if it is positive or negative. For this assignment, you may choose either dictionary or use both.

For example, given the text (which is from an online review):

I thought they were pretty nice at first. Sound quality is good. They charged quick in the case. But transparency mode amplified sounds like crazy. At the gym someone was moving weights, the cling of them hitting was so loud it hurt my ears. When I took the airpods out it was much quieter in real life. Even in noise cancelling mode they pick up voices and then they end up sounding robotic. There's a white noise in the background too. It's like a little static, annoying as hell. Just now I started hearing a popping noise in the right bud. It kept happening even when the music was off. When I turned to transparency mode that popping became a buzz.

using MD, your program should be able to output something like the following:

    Positive words: {'transparency': 2, 'good': 1}
    Negative words: {'hurt': 1, 'cancelling': 1, 'annoying': 1}
    Positive occurrences: 3; Negative occurrences: 3

In addition, your program should produce a decorated version of the text where positive words are bolded and negative words are underlined:

I thought they were pretty nice at first. Sound quality is good. They charged quick in the case. But transparency mode amplified sounds like crazy. At the gym someone was moving weights, the cling of them hitting was so loud it hurt my ears. When I took the airpods out it was much quieter in real life. Even in noise cancelling mode they pick up voices and then they end up sounding robotic. There's a white noise in the background too. It's like a little static, annoying as hell. Just now I started hearing a popping noise in the right bud. It kept happening even when the music was off. When I turned to transparency mode that popping became a buzz.

The decoration can be done through HTML, additional information about which will be provided.

3. Implementation

As you know, there is usually more than one way to accomplish a certain task. The following suggestions are intended to give you some ideas about implementation; you may develop your own implementation that differs from them.

You may use either dictionary or both dictionaries (you can earn up to 5% bonus points if you successfully use both). In the following description, I assume you use only one dictionary. Think about how you want to represent the dictionary so that you can easily check whether a word is categorized as positive or negative. You may want to implement functions or use OOP to support word lookup.

The process has two main steps:

1. Represent the dictionary to support word lookup.
2. For each token in the text, obtain the word and determine if it is positive or negative.

3.1 Processing Dictionary File

For step 1, you need to process the dictionary file, find positive and negative words, and represent these words to support lookup in the second step. In IV-4, a positive word is marked by Positiv in the column named Positiv; in MD, a positive word is marked by a non-zero value in the column named Positive. Negative words are marked using the Negativ and Negative columns, respectively. See the Appendix for example entries in the two dictionary files.
The following example code gives you some idea about how to read the IV-4 dictionary file, find positive words, and remove every character that is not a letter or a space (e.g., converting “ABOVE#1” to “ABOVE”):

    import re, csv

    print("Positive words in IV-4")
    with open('iv4.csv', 'r') as f:
        csvreader = csv.DictReader(f)
        for row in csvreader:
            # The Positiv column is non-empty only for positive words
            if row['Positiv']:
                # Strip sense markers such as "#1" from the entry
                print(re.sub(r'[^a-zA-Z\s]', '', row['Entry']))

Note that the example uses two modules from the Python standard library (meaning that you need not install additional packages; of course, you could use the pandas package if you prefer). It uses the csv module's DictReader, which treats the first row as the headers (which define the column names) and converts each row of data into a dictionary (with the column name as the key). The sub function in the regular expression (re) module replaces anything that is not a letter or whitespace with an empty string.

Similarly, the following code processes the MD file, finds positive words, cleans each word (not necessary, because MD word entries use only letters), and converts the word to uppercase (also unnecessary, because MD entries are already in uppercase; it is included here to show how it is done, because you will need case conversion for words in the text):

    import re, csv

    print("Positive words in MD")
    with open('md2020.csv', 'r') as f:
        csvreader = csv.DictReader(f)
        for row in csvreader:
            # A non-zero value in the Positive column marks a positive word
            if int(row['Positive']) > 0:
                print(re.sub(r'[^a-zA-Z\s]', '', row['Word']).upper())

Note that the above examples only find the words. You need to decide how to represent the words to support lookup. I recommend a Python dictionary (dict); you may also use a list or a set.

3.2 Processing Text

The text can be stored in a text file and read in by your program. If this becomes a challenge, you may hard code it (i.e., store it in a variable in your code).
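Returning to the lookup representation recommended in section 3.1, here is a minimal sketch of one way to build it, assuming the MD column names Word, Positive, and Negative described above. The function names `clean` and `load_md_sentiments` are hypothetical, and the three sample rows are made up to stand in for the real file (the value 1 is only a placeholder for "non-zero"), so treat this as an illustration, not as the required design.

```python
import csv
import io
import re


def clean(word):
    """Drop everything except letters and whitespace, then uppercase."""
    return re.sub(r'[^a-zA-Z\s]', '', word).upper()


def load_md_sentiments(f):
    """Build {WORD: 'positive' or 'negative'} from an MD-style CSV file object."""
    sentiments = {}
    for row in csv.DictReader(f):
        word = clean(row['Word'])
        if int(row['Positive']) > 0:
            sentiments[word] = 'positive'
        elif int(row['Negative']) > 0:
            sentiments[word] = 'negative'
        # Words with zero in both columns are simply not stored.
    return sentiments


# Made-up sample rows standing in for the real dictionary file.
SAMPLE_MD = """Word,Negative,Positive
GOOD,0,1
HURT,1,0
MODE,0,0
"""

lookup = load_md_sentiments(io.StringIO(SAMPLE_MD))
print(lookup.get('GOOD'))  # positive
print(lookup.get('MODE'))  # None
```

With a real file, the same function could be called as `load_md_sentiments(open('md2020.csv', newline=''))`. A dict keeps lookup to a single `get` call per word; two sets (one per sentiment) would work just as well.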
As noted earlier, text often contains characters other than letters and spaces. For example, the following text includes quotation marks and a question mark:

    It is a “good” practice?

You need to remove these punctuation marks for sentiment lookup. The same regular expression shown in the example code earlier can help you accomplish this task:

    >>> re.sub(r'[^a-zA-Z\s]', '', 'It is a "good" practice?').upper()
    'IT IS A GOOD PRACTICE'

The same regular expression also works on a single token. You can use the split method of the string object to convert the text into a list, with each token as a list element. This can be useful when you need to preserve the order of the tokens.

3.3 Using HTML to Highlight Positive and Negative Words

HTML is used for webpages. It tells the browser how to render text and other content. In this assignment, we will use in-line style
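
Putting sections 3.2 and 3.3 together, one possible sketch of the second step looks like this. It assumes a lookup table in the shape `{WORD: 'positive' or 'negative'}`; the `LOOKUP` table below is hand-written for illustration and is not taken from either dictionary file. The `span` elements with `style` attributes are also an assumption, since the assignment's own HTML instructions are cut off in this copy.

```python
import re
from collections import Counter
from html import escape

# Hypothetical, hand-written lookup table used only for this illustration.
LOOKUP = {'GOOD': 'positive', 'HURT': 'negative'}

# Assumed in-line styles for the decorated output.
STYLES = {
    'positive': 'font-weight: bold',
    'negative': 'text-decoration: underline',
}


def clean(token):
    """Drop everything except letters and whitespace, then uppercase."""
    return re.sub(r'[^a-zA-Z\s]', '', token).upper()


def analyze(text, lookup):
    """Count sentiment words and build an HTML-decorated copy of the text."""
    counts = {'positive': Counter(), 'negative': Counter()}
    decorated = []
    for token in text.split():          # split() preserves token order
        word = clean(token)
        sentiment = lookup.get(word)    # None when the word has no sentiment
        shown = escape(token)           # keep the original spelling for display
        if sentiment is not None:
            counts[sentiment][word.lower()] += 1
            shown = f'<span style="{STYLES[sentiment]}">{shown}</span>'
        decorated.append(shown)
    return counts['positive'], counts['negative'], ' '.join(decorated)


positive, negative, html_text = analyze(
    'Sound quality is good. It hurt my ears.', LOOKUP)
print('Positive words:', dict(positive))  # Positive words: {'good': 1}
print('Negative words:', dict(negative))  # Negative words: {'hurt': 1}
print(html_text)
```

Because each token is cleaned only for the lookup, the decorated text keeps the original capitalization and punctuation; a side effect is that trailing punctuation (as in "good.") ends up inside the styled span.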
Answered 4 days after Oct 05, 2021

Answer To: MIST.3050 Programming Assignment: Basis Text Analysis. This assignment will give you the...

Karthi answered on Oct 08 2021
117 Votes
1. First, read the given datasets. The important point is that we do not read all the columns; we read only the specific columns we need to compare.
2. Read only two columns, Positive and Negative, from both datasets.
3. Append all the words, both negative and positive, into a list...