Assignment 1: Text Analytics

Time trials leaderboard: https://cgi.cse.unsw.edu.au/~cs2521/21T2/ass/ass1/time-trials/index.cgi/

Changelog

All changes to the assignment specification and files will be listed here.

[24/06 21:20] Added clarification regarding the Nwords command-line argument.
[27/06 20:10] Added more clarifications from the forum; removed the comment next to the word field in WFreq.h.

Aims

To give you practice with processing text data
To give you practice implementing binary search trees
To appreciate the importance of using efficient data structures and algorithms

Admin

Marks: contributes 15% towards your final mark (see the Assessment section for more details)
Submit: see the Submission section
Deadline: submit by 20:00 on Friday 16th July
Late penalty: 1% off the maximum mark for each hour late. For example, if an assignment worth 80% was submitted 15 hours late, the late penalty would have no effect. If the same assignment was submitted 24 hours late, it would be awarded 76%, the maximum mark it can achieve at that time.

Background

The field of data analytics is currently hot, and text analytics is an important subfield of it. Data extracted from text documents is used in applications such as web retrieval, sentiment analysis, and authorship determination. In this assignment, we aim to write a program which can extract one important text analytics "measure": the frequency of occurrence of the most common words in the text.

The text documents we will use are drawn from Project Gutenberg (https://www.gutenberg.org/), a long-running project aimed at digitizing out-of-copyright books in simple text format and making them available for free, for anyone to use. The books tend to be classics (such as "Moby Dick"), but are important works which Project Gutenberg aims to preserve in a simple, resilient format (ASCII text).

Project Gutenberg books contain the full text of the book, but this is surrounded by meta-data and legal requirements, and it is a condition of use that these be left intact. Fortunately, the actual text of the book can be easily delineated from the other text by the following markers:

... meta-data, such as when uploaded, who provided the text, ...
*** START OF THIS PROJECT GUTENBERG EBOOK title of book ***
... actual text of book ...
*** END OF THIS PROJECT GUTENBERG EBOOK title of book ***
... tons of text giving licensing/legal details ...

Preprocessing

Text analysis is not as simple as carving a document into word tokens and then using those tokens. Some additional processing is needed on the tokens before they are used in determining analytics. Three forms of processing are typically applied:

tokenising/normalising
English text consists of words and punctuation. We are interested primarily in the words, so we need to extract individual words from a document. We define a word as any sequence of characters that includes only alphabetics (upper and lower case), numbers, single-quote and hyphen. Once we have extracted a token, we "normalise" it by reducing it to all lower-case. This simple approach to word extraction occasionally leads to strange "words" like "'''" or "--" or "-'-". Since these kinds of words occur infrequently, we allow them, and don't apply any further restrictions such as requiring at least one alphabetic character. However, we do ignore any "words" containing just a single character.
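To make the tokenising/normalising rules concrete, here is a minimal sketch of a tokeniser following the definition above. The names isWordChar and tokenise are illustrative only; they are not part of the supplied code.

#include <ctype.h>
#include <stdio.h>

#define MAXWORD 100

// A character belongs to a word if it is alphanumeric, a
// single-quote or a hyphen.
static int isWordChar(int c) {
    return isalnum(c) || c == '\'' || c == '-';
}

// Print each normalised (lower-cased) token in the given line.
static void tokenise(const char *line) {
    char word[MAXWORD + 1];
    int len = 0;
    for (const char *p = line; ; p++) {
        if (*p != '\0' && isWordChar((unsigned char)*p)) {
            if (len < MAXWORD) {
                word[len++] = (char) tolower((unsigned char)*p);
            }
        } else if (len > 0) {
            word[len] = '\0';   // token ended - emit it
            printf("%s\n", word);
            len = 0;
        }
        if (*p == '\0') break;
    }
}

int main(void) {
    tokenise("\"You may not have lived much under the sea-\"");
    return 0;
}

Compiled on its own, this prints one normalised token per line (you, may, not, ..., sea-); note that the trailing hyphen in "sea-" is kept, consistent with the rules above.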
stopword removal
Some words are very common and make little contribution to distinguishing documents or defining the semantics of a given document, e.g., "an", "the", "you", "your", "since", etc. Such words are called "stopwords" and are typically skipped (ignored) in text analysis. We have supplied a stopword list for use in this task.

stemming
Words occur in different forms, e.g., "love", "loves", "lovely", "dog", "dogs", "doggy". We do not wish to distinguish such variations, and so text analysis typically reduces words to their stem. For example, "dogs" reduces to "dog", and the forms of "love" might all reduce to "lov". We have supplied a stemming module for use in this task. The supplied stemmer is an implementation of the classic Porter stemming algorithm. It sometimes produces "unexpected" results, e.g., reducing "prince" and "princes" to "princ". This is OK; don't question the stemmer - take what the stemmer produces as The Answer.

Example

The following example shows how a small piece of text would be reduced to a sequence of word stems for use in later analytic tasks. Consider the following piece of text:

"You may not have lived much under the sea-" ("I haven't," said Alice)-"and perhaps you were never even introduced to a lobster-" (Alice began to say "I once tasted-" but checked herself hastily, and said "No, never") "-so you can have no idea what a delightful thing a Lobster Quadrille is!"

After tokenisation:

You may not have lived much under the sea- I haven't said Alice - and perhaps you were never even introduced to a lobster- Alice began to say I once tasted- but checked herself hastily and said No never -so you can have no idea what a delightful thing a Lobster Quadrille is

After normalisation:

you may not have lived much under the sea- i haven't said alice - and perhaps you were never even introduced to a lobster- alice began to say i once tasted- but checked herself hastily and said no never -so you can have no idea what a delightful thing a lobster quadrille is

After discarding one-character tokens and stopwords:

lived sea- alice introduced lobster- alice began tasted- checked hastily -so idea delightful lobster quadrille

After stemming the remaining tokens:

live sea- alic introduc lobster- alic began tasted- check hastili -so idea delight lobster quadril

Notice that this process may produce some strange words such as "sea-", "alic", "introduc" and "hastili", but this is fine - again, don't question the stemmer.
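Following the example, here is a minimal sketch of the per-token pipeline order: normalise, discard one-character tokens, discard stopwords, stem. The isStopword and stem functions below are stand-in stubs so the sketch compiles on its own; in the assignment, the real versions come from the supplied stopwords list and stemmer module, whose interfaces are not reproduced here.

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Stub: the real check would look the word up in the supplied
// stopwords file (e.g., via the Dict ADT).
static bool isStopword(const char *w) {
    return strcmp(w, "the") == 0 || strcmp(w, "you") == 0;
}

// Stub: the real stemmer is the supplied Porter implementation.
static void stem(char *w) {
    size_t n = strlen(w);
    if (n > 1 && w[n - 1] == 's') w[n - 1] = '\0';
}

// Apply the pipeline to one token; returns false if it is discarded.
static bool processToken(char *w) {
    for (char *p = w; *p != '\0'; p++) {
        *p = (char) tolower((unsigned char)*p);   // normalise
    }
    if (strlen(w) < 2) return false;              // one-char token
    if (isStopword(w)) return false;              // stopword
    stem(w);                                      // stem what remains
    return true;
}

int main(void) {
    char tokens[][16] = { "You", "lived", "dogs", "a" };
    for (int i = 0; i < 4; i++) {
        if (processToken(tokens[i])) printf("%s\n", tokens[i]);
    }
    return 0;
}

The order matters: stopword matching is done on the normalised (but unstemmed) word, which is why "lived" survives here even though its stem differs.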
Setting Up

Change into the directory you created for the assignment and run the following command:

$ unzip /web/cs2521/21T2/ass/ass1/downloads/code.zip

If you're working at home, download code.zip (https://cgi.cse.unsw.edu.au/~cs2521/21T2/ass/ass1/downloads/code.zip) and then run the unzip command on the downloaded file.

If you've done the above correctly, you should now have the following files:

Makefile: a set of dependencies used to control compilation
stopwords: a list of English stopwords, one per line
Dict.h: interface to the Dict ADT
Dict.c: incomplete implementation of the Dict ADT
WFreq.h: type definition for (word, freq) pairs
stemmer.h: interface to the stemming module
stemmer.c: the stemming module, implemented using the Porter algorithm
tw.c: a main program to compute word frequencies (incomplete)
linenos.c: a sample program that demonstrates how to read a file with fgets

Data Files

The data files for this assignment are a small selection of English Project Gutenberg ebooks. If you're working at home, you'll need to download them to your local machine and place them in a directory called data in the assignment directory. The data files are available in the zip file /web/cs2521/21T2/ass/ass1/downloads/data.zip (https://cgi.cse.unsw.edu.au/~cs2521/21T2/ass/ass1/downloads/data.zip).

If you're working on the CSE servers, you probably shouldn't copy the data files to your home directory, as they will consume a large amount of disk space. Instead, run the following command while in your assignment directory:

$ ln -s /web/cs2521/21T2/ass/ass1/data/ ./data

This will produce a symlink (i.e., a shortcut) to the data directory under the class account, and you can access it like you would a normal directory.

As supplied, the code will compile but does nothing except check command-line arguments. You can check this by running commands like:

$ ./tw data/0011.txt
$ ./tw 20 data/0011.txt

The above assumes that the data files are accessible in a directory or symlink called data in the current directory, which you should have if you followed the above instructions.

File Reading

You will be required to read files in this assignment. However, the only file-related functions you will need are fopen (https://linux.die.net/man/3/fopen), fgets (https://linux.die.net/man/3/fgets) and fclose (https://linux.die.net/man/3/fclose). To help you get familiar with these functions, we have provided a sample program called linenos.c, which contains explanations in comments. The provided Makefile does not compile this program, so you'll need to compile it on your own (or add instructions for it to the Makefile).

Task

Your main objective is to complete the program tw (short for Top Words), which finds and displays the most frequent words (more accurately, word stems) in a given English Project Gutenberg ebook.

tw.c

tw.c is the entry point of the tw program. It:

takes one or two command-line arguments (./tw [Nwords] File), where the first (optional) argument gives the number of words to be output and the second gives the name of a text file; if the Nwords argument is not given, the default value of 10 is used, and if the given Nwords argument is less than 10, it is set to 10
reads text from the file, and computes word (stem) frequencies
prints a list of the top Nwords most frequent words (word stems), from most frequent to least frequent, where words with the same frequency are in increasing lexicographic order

A sketch of this overall structure is given below.
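Here is a minimal sketch of how tw's argument handling and fopen/fgets/fclose reading loop might be structured. The processLine helper is a hypothetical placeholder for the per-line tokenising and counting work, and the exact error messages are illustrative; only the argument rules and the file-reading pattern come from this spec.

#include <stdio.h>
#include <stdlib.h>

#define MAXLINE 1000   // maximum line length (see the limits below)

// Hypothetical placeholder: tokenise/normalise/count one line.
static void processLine(char *line) {
    (void)line;
}

int main(int argc, char *argv[]) {
    int nWords = 10;            // default number of words to output
    char *fileName = NULL;

    if (argc == 2) {
        fileName = argv[1];
    } else if (argc == 3) {
        nWords = atoi(argv[1]);
        if (nWords < 10) nWords = 10;   // Nwords below 10 is set to 10
        fileName = argv[2];
    } else {
        fprintf(stderr, "usage: %s [Nwords] File\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    FILE *in = fopen(fileName, "r");
    if (in == NULL) {
        fprintf(stderr, "Can't open %s\n", fileName);
        exit(EXIT_FAILURE);
    }

    char line[MAXLINE + 1];
    while (fgets(line, sizeof line, in) != NULL) {
        processLine(line);
    }
    fclose(in);

    // ... print the top nWords (word, frequency) pairs here ...
    return 0;
}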
Here are some more details on how the program should behave:

The program should only process the actual text of the book, not the surrounding meta-data and licensing/legal details. See the Background section for details.

The program should process words exactly as described in the Background section. To summarise, it should: (1) tokenise, (2) normalise, (3) discard single-character tokens, (4) discard stopwords, (5) stem. See the Background section for details and an example.

The program must output one (word, frequency) pair per line. Each line must contain a frequency (a number) followed by the associated word, separated by a single space. There should be no leading or trailing spaces. For example, this is the expected output for 0011.txt:

$ ./tw 10 data/0011.txt
386 alic
76 thought
74 time
69 queen
62 king
58 began
58 head
58 turtl
58 well
57 mock

The program may assume that the maximum length of a line in any data file is 1000 characters, including the newline character. It may also assume that the maximum length of a word is 100 characters. tw.c contains #defines for these limits (MAXLINE and MAXWORD).

The program may assume that the data file contains at most one line containing the start-of-text marker, and at most one line containing the end-of-text marker, as described in the Background section.
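To illustrate the required output ordering, here is a sketch of a qsort comparison function for (word, frequency) pairs: decreasing frequency, with ties broken by increasing lexicographic order. The struct below mirrors the kind of pair WFreq.h defines, but the exact field names in the supplied header may differ; treat this as illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *word;   // a word (stem)
    int   freq;   // number of occurrences
} WFreq;

// Order by decreasing frequency; break ties by increasing
// lexicographic order of the word.
static int cmpWFreq(const void *a, const void *b) {
    const WFreq *x = a;
    const WFreq *y = b;
    if (x->freq != y->freq) {
        return y->freq - x->freq;        // higher frequency first
    }
    return strcmp(x->word, y->word);     // then lexicographic order
}

int main(void) {
    WFreq pairs[] = {
        { "head", 58 }, { "thought", 76 }, { "began", 58 },
    };
    int n = (int)(sizeof pairs / sizeof pairs[0]);
    qsort(pairs, n, sizeof(WFreq), cmpWFreq);
    for (int i = 0; i < n; i++) {
        printf("%d %s\n", pairs[i].freq, pairs[i].word);  // "freq word"
    }
    return 0;
}

This prints "76 thought", then "58 began", then "58 head", matching the output format above: frequency, a single space, the word, and no extra spaces.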