
The handout describes the overall project and its requirements and characteristics. However, I only need the parts done in the timing demo (the parsing of the words into the AVL tree, etc.).

I have also attached a sample data set file. As explained in the final project handout, the program must be able to extract each word and insert it into the AVL tree, among the other things described in the timing demo PDF.


The Final Project Handout: Did it Make the News?

Overview

Millions of people visit news websites daily. In some cases, a knowledge worker might need the ability to focus a search on more specific topics or industries. One very relevant and recent example related to Covid-19 is a dataset named CORD-19. A virologist or biochemist might be interested in targeted searches of the ~500,000 scientific articles it contains. Similarly, someone who works in the financial markets might be interested in a search engine that operates on financial news source data. And that is the type of data you're going to use in this project.

For your final project in CS 2341, you're going to build a search engine for a large collection of financial news articles from January through May 2018. The dataset contains more than 300,000 articles. You can download the dataset from Kaggle at https://www.kaggle.com/jeet2016/us-financial-news-articles (you will need to make a Kaggle account to download it). Note that the download is around 1.3 GB and the uncompressed dataset is around 2.5 GB.

Search Engine Architecture

Search engines are designed to allow users to quickly locate the information they want. Building a custom search engine requires input of the documents that the user will eventually want to search; this collection is called the corpus. Once the corpus is indexed, users can begin entering search queries. The search engine will take a query, find the documents that satisfy the request, and order them by some measure of relevancy. As you can see, there are two fundamental "roles" here: 1) the search engine maintainer, and 2) the search engine user. The four major components of a typical search engine are:¹

1. Document parser/processor,
2. Query processor,
3. Search processor, and
4. Ranking processor.

Figure 1 provides a general overview of a search engine system architecture. The fundamental "document" for this project is one news article with its associated metadata such as publication venue, date, author, associated entities, and the full text of the article. The files containing the news articles are in JSON format. JSON is a "lightweight data interchange format" (https://www.json.org/json-en.html) that is easily understood by both humans and machines. There are a number of open-source JSON parsing libraries available. The "officially supported parser" for this project is RapidJSON.

¹ http://www.infotoday.com/searcher/may01/liddy.htm

Figure 1 – Sample Search Engine System Architecture
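Since the officially supported parser is RapidJSON, the sketch below shows one way a single article file might be loaded and its text and entities pulled out. This is a minimal sketch, not the required implementation: the field names ("title", "text", "entities", "organizations", "persons", "name") and the file name "sample_article.json" are assumptions about the dataset's layout and should be verified against the actual files.

```cpp
// Minimal sketch of loading one article with RapidJSON.
// Field names below are assumptions -- verify against the dataset files.
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "rapidjson/document.h"
#include "rapidjson/istreamwrapper.h"

struct Article {
    std::string title;
    std::string text;
    std::vector<std::string> organizations;
    std::vector<std::string> persons;
};

Article loadArticle(const std::string& path) {
    std::ifstream in(path);
    rapidjson::IStreamWrapper isw(in);
    rapidjson::Document doc;
    doc.ParseStream(isw);

    Article a;
    if (doc.HasMember("title") && doc["title"].IsString())
        a.title = doc["title"].GetString();
    if (doc.HasMember("text") && doc["text"].IsString())
        a.text = doc["text"].GetString();

    // Entity metadata: assumed to be arrays of objects with a "name" field.
    if (doc.HasMember("entities")) {
        const auto& ents = doc["entities"];
        if (ents.HasMember("organizations"))
            for (const auto& o : ents["organizations"].GetArray())
                a.organizations.push_back(o["name"].GetString());
        if (ents.HasMember("persons"))
            for (const auto& p : ents["persons"].GetArray())
                a.persons.push_back(p["name"].GetString());
    }
    return a;
}

int main() {
    Article a = loadArticle("sample_article.json");  // hypothetical file name
    std::cout << a.title << "\n";
}
```

If the entity metadata is laid out differently (for example, as a flat list of entities with a type field), the extraction loop needs to be adjusted accordingly.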
An Explanation of the Parts of a Search Engine

The index handler, the workhorse of the search engine, is responsible for:
● Reading from and writing to the main word index. You'll be creating an inverted file index, which stores references from each element to be indexed to the corresponding document(s) in which those elements exist.
● Creating and maintaining an index of ORGANIZATION entities and an index of PERSON entities.
● Searching the inverted file index based on a request from the query processor.
● Storing other data with each indexed item (such as word frequency or entity frequency).

The document parser/processor is responsible for the following tasks:
● Processing each news article in the corpus. The dataset contains one news article per file, and each document is in JSON format. Processing of an article involves the following steps:
○ Removing stopwords from the articles. Stopwords are common words that appear in text but provide little useful information about the value of a document relative to a query, precisely because they are so common. Example stop words include "a", "the", and "if". One possible list of stop words to use for this project can be found at http://www.webconfs.com/stop-words.php. You may use other stop word lists you find online.
○ Stemming words. Stemming refers to removing certain grammatical modifications to words.² For instance, the stemmed version of "running" may be "run". For this project, you may make use of any previously implemented stemming algorithm that you can find online (a sketch of the stopword-removal and stemming steps follows this list).
■ One such algorithm is the Porter Stemming algorithm. More information as well as implementations can be found at http://tartarus.org/~martin/PorterStemmer/.
■ Another option is http://www.oleandersolutions.com/stemming/stemming.html.
■ A C++ implementation of Porter 2: https://bitbucket.org/smassung/porter2_stemmer/src.
● Computing/maintaining information for relevancy ranking. You'll have to design and implement some algorithm to determine how to rank the results that will be returned from the execution of a query. You can make use of the metadata provided, important words in the articles (look up the term frequency – inverse document frequency metric), and/or a combination of several metrics.

² See https://en.wikipedia.org/wiki/Stemming for more information.
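A minimal sketch of the first two processing steps above (stopword removal and stemming) on an already-tokenized article. It assumes a plain-text stopword file with one lowercase word per line and the Porter2Stemmer::stem(std::string&) interface from the porter2_stemmer repository linked above; if you use a different stemmer, swap in its call.

```cpp
// Sketch: stopword removal + stemming for one tokenized article.
// Assumes tokenization/punctuation stripping already happened upstream and
// that the linked porter2_stemmer library provides Porter2Stemmer::stem().
#include <algorithm>
#include <cctype>
#include <fstream>
#include <string>
#include <unordered_set>
#include <vector>
#include "porter2_stemmer.h"   // https://bitbucket.org/smassung/porter2_stemmer

std::unordered_set<std::string> loadStopWords(const std::string& path) {
    std::unordered_set<std::string> stops;
    std::ifstream in(path);
    std::string word;
    while (in >> word)
        stops.insert(word);
    return stops;
}

// Lowercase each token, drop stopwords, stem what remains.
std::vector<std::string> cleanTokens(const std::vector<std::string>& tokens,
                                     const std::unordered_set<std::string>& stops) {
    std::vector<std::string> out;
    for (std::string word : tokens) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (word.empty() || stops.count(word))
            continue;
        Porter2Stemmer::stem(word);   // assumed in-place stemming call
        out.push_back(word);
    }
    return out;
}
```

The same cleaning should be applied to query terms before they are looked up, as the helpful hint in the User Interface section below points out.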
The query processor is responsible for:
● Parsing queries entered by the user of the search engine. For this project, you'll implement functionality to handle simple prefix Boolean queries entered by the user.
○ The Boolean expression will be prefixed with a Boolean operator of either AND or OR if there is more than one word of interest.
○ No query will contain both AND and OR.
○ Single-word queries (not counting NOT or the additional operators below) do not need a Boolean operator.
○ Trailing search terms may be preceded by the NOT operator, which indicates that articles containing that term should be removed from the result set.
○ Additional operators: a query can contain zero or more of the following:
■ ORG - the ORG operator will search a special index you maintain related to organizations mentioned in the article's entity metadata.
■ PERSON - the PERSON operator will search a special index you maintain related to persons mentioned in the article's entity metadata.
■ Additional operator notes:
● The order of ORG and PERSON doesn't matter (meaning you should accept queries that have them in either order).
● The operators will always be entered in all caps.
● You may assume that neither ORG nor PERSON will be search terms themselves.
● Here are some examples:
○ markets
■ This query should return all articles that contain the word markets.
○ AND social network
■ This query should return all articles that contain the words "social" and "network" (not necessarily as a two-word phrase).
○ AND social network PERSON cramer
■ This query should return all articles that contain the words social and network and that mention cramer as a person entity.
○ AND social network ORG facebook PERSON cramer
■ This query should return all articles that contain the words social and network, that have an organization entity of facebook, and that mention cramer as a person entity.
○ OR snap facebook
■ This query should return all articles that contain either snap or facebook.
○ OR facebook meta NOT profits
■ This query should return all articles that contain facebook or meta but that do not contain the word profits.
○ bankruptcy NOT facebook
■ This query should return all articles that contain bankruptcy, but not facebook.
○ OR facebook instagram NOT bankruptcy ORG snap PERSON cramer
■ This query should return any article that contains the word facebook or instagram, does not contain the word bankruptcy, has an organization entity of snap, and mentions cramer as a person entity.
● Ranking the results. Relevancy ranking refers to organizing the results of a query so that "more relevant" documents appear higher in the result set than less relevant documents. The difficulty is determining what "more relevant" means. One way of calculating relevancy is with a basic term frequency – inverse document frequency (tf/idf) statistic.³ tf/idf is used to determine how important a particular word is to a document from the corpus. If a query term appears frequently in document dt but infrequently in the other documents, then dt would be ranked higher than another document ds in which the term also appears frequently but is common across the rest of the corpus as well. There is quite a bit of other information you can use for relevancy ranking as well, such as the article's date of publication.

³ http://en.wikipedia.org/wiki/Tf-idf or http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html for more information.

The Index

The inverted file index⁴ is a data structure that relates each unique word from the corpus to the document(s) in which it appears. It allows for efficient execution of a query by quickly determining in which documents a particular query term appears. For instance, let's assume we have the following documents with the contents shown:
• doc d1 = Computer network security
• doc d2 = network cryptography
• doc d3 = database security
The inverted file index for these documents would contain, at a very minimum, the following:
• computer = d1
• network = d1, d2
• security = d1, d3
• cryptography = d2
• database = d3
The query "AND computer security" would find the intersection of the documents that contain computer and the documents that contain security:
• set of documents containing computer = d1
• set of documents containing security = d1, d3
• the intersection of the sets of documents containing computer AND security = d1

Inverted File Index Implementation Details

The heart of this project is the inverted file index.
● To index the text of the articles, you will create an inverted index with an AVL tree. Each node of the tree represents a word being indexed and provides information about all the articles that contain that word.
● To index organizations and persons, you will use separate instances of an AVL tree. In other words, you'll have 3 AVL trees: one for the main index of words, one for organizations, and one for persons.

⁴ See http://en.wikipedia.org/wiki/Inverted_index for more information.
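Since the timing demo centers on inserting the parsed words into the AVL tree, here is a minimal sketch of an AVL-backed main word index along the lines described above. The class and method names (WordIndex, addWord, find) are placeholders, not names from the handout, and the node stores a document-id-to-frequency map as one possible choice of "other data" to keep with each indexed item.

```cpp
// Minimal AVL-backed inverted index sketch (names are placeholders).
// Each node keys on a stemmed word and stores doc-id -> frequency postings.
#include <algorithm>
#include <map>
#include <string>

struct IndexNode {
    std::string word;
    std::map<std::string, int> postings;  // document id -> frequency in that doc
    IndexNode* left = nullptr;
    IndexNode* right = nullptr;
    int height = 1;
    explicit IndexNode(std::string w) : word(std::move(w)) {}
};

class WordIndex {
public:
    // Record one occurrence of `word` in document `docId`.
    void addWord(const std::string& word, const std::string& docId) {
        root_ = insert(root_, word, docId);
    }

    // Return the postings for `word`, or nullptr if it was never indexed.
    const std::map<std::string, int>* find(const std::string& word) const {
        const IndexNode* cur = root_;
        while (cur) {
            if (word == cur->word) return &cur->postings;
            cur = (word < cur->word) ? cur->left : cur->right;
        }
        return nullptr;
    }

private:
    IndexNode* root_ = nullptr;

    static int height(const IndexNode* n) { return n ? n->height : 0; }
    static int balance(const IndexNode* n) {
        return n ? height(n->left) - height(n->right) : 0;
    }
    static void update(IndexNode* n) {
        n->height = 1 + std::max(height(n->left), height(n->right));
    }

    static IndexNode* rotateRight(IndexNode* y) {
        IndexNode* x = y->left;
        y->left = x->right;
        x->right = y;
        update(y);
        update(x);
        return x;
    }
    static IndexNode* rotateLeft(IndexNode* x) {
        IndexNode* y = x->right;
        x->right = y->left;
        y->left = x;
        update(x);
        update(y);
        return y;
    }

    static IndexNode* insert(IndexNode* node, const std::string& word,
                             const std::string& docId) {
        if (!node) {
            node = new IndexNode(word);
            node->postings[docId] = 1;
            return node;
        }
        if (word == node->word) {            // word already indexed: bump frequency
            ++node->postings[docId];
            return node;
        }
        if (word < node->word)
            node->left = insert(node->left, word, docId);
        else
            node->right = insert(node->right, word, docId);

        update(node);
        int b = balance(node);
        if (b > 1 && word < node->left->word)   return rotateRight(node);  // LL
        if (b < -1 && word > node->right->word) return rotateLeft(node);   // RR
        if (b > 1 && word > node->left->word) {                             // LR
            node->left = rotateLeft(node->left);
            return rotateRight(node);
        }
        if (b < -1 && word < node->right->word) {                           // RL
            node->right = rotateRight(node->right);
            return rotateLeft(node);
        }
        return node;
    }
};
```

Reusing (or templating) the same class gives the separate ORGANIZATION and PERSON indexes the handout calls for; node cleanup and index persistence are intentionally left out of the sketch.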
User Interface

The user interface of the application should provide the following options:
● Allow the user to clear the index completely.
● Allow the user to manage the persistent index (see Index Persistence above for more info).
● Allow the user to parse a document dataset to populate the index OR read from the persistence file.
● Allow the user to enter a Boolean query (as described above).
○ You may assume the query is properly formatted.
○ The results should display the article's identifying information, including article title, publication, and date published. If the result set contains more than 15 results, display the 15 with the highest relevancy. If fewer than 15 are returned, display all of them ordered by relevance. If you'd like to show more, please paginate.
○ The user should be allowed to choose one of the articles from the result set above and have the complete text of the article printed.
○ Helpful hint: query terms should have stop words removed and be stemmed before querying the index.
● Output
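To tie the Boolean query rules above to the inverted-index intersection example, here is a minimal sketch of evaluating a prefix query once each term's postings have been fetched from the indexes as sets of document IDs. The names DocSet and evaluate are placeholders; query tokenization, ORG/PERSON lookups, and relevancy ordering are left out.

```cpp
// Sketch: evaluating a prefix Boolean query (AND / OR / NOT) over
// already-fetched document-ID sets.
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <vector>

using DocSet = std::set<std::string>;

DocSet intersect(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

DocSet unite(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.begin()));
    return out;
}

DocSet subtract(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.begin()));
    return out;
}

// op is "AND", "OR", or "" for a single-term query; `terms` holds the doc
// sets of the positive search terms, `notTerms` those of terms after NOT.
DocSet evaluate(const std::string& op,
                const std::vector<DocSet>& terms,
                const std::vector<DocSet>& notTerms) {
    if (terms.empty()) return {};
    DocSet result = terms.front();
    for (std::size_t i = 1; i < terms.size(); ++i)
        result = (op == "OR") ? unite(result, terms[i])
                              : intersect(result, terms[i]);
    for (const DocSet& n : notTerms)
        result = subtract(result, n);
    return result;
}
```

ORG and PERSON terms follow the same pattern: fetch their document sets from the organization and person indexes and intersect them into the result before ranking and display.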