
Read Chapter 1 carefully, then do Assignment 8: Text Mining (maximal grade: 80 points).


4/9/22, 3:33 AM — Assignment 8: Text Mining
https://wssu.instructure.com/courses/19153/assignments/325909
Due Tuesday by 11:59pm. Points: 80. Submitting: a file upload.

Attached files:
- Assignment 8.pdf (https://wssu.instructure.com/courses/19153/files/2959899/download?download_frd=1)
- Departments.ARFF (https://wssu.instructure.com/courses/19153/files/2735771/download?download_frd=1)

Experiments with Text Mining

You can do some experiments with text mining in Weka by creating a text file where each document is represented on a separate line in plain-text format. In the ARFF file, each document's content is enclosed in quotation marks ("), each data line begins with the document's name, and the file begins with a header, e.g.:

@relation departments_string
@attribute document_name string
@attribute document_content string
@data
Anthropology, "Anthropology consists of four ..."

The attached file Departments.ARFF contains data about academic departments in string format. The data is scraped from a collection of web pages describing the departments in the CCSU School of Arts and Sciences. The data from each department webpage is named after the department and classified into one of two groups: "A" (science disciplines) and "B" (humanities).

Task 1. Create three versions of this data set: term-count, binary, and TFIDF.

1.1 To create the term-count version of the dataset:

a. Load the original file in Weka; in the Preprocess tab choose the StringToNominal filter and apply it to the first attribute, document_name.
b. Then choose the StringToWordVector filter (use the More button to see the meaning of its parameters) and apply it to document_content with the following parameters:
   - attributeIndices=2 and outputWordCounts=TRUE.
   - Change "attributeNamePrefix" to "t-" so that all the new features/attributes (representing terms in the documents) start with it. Change "wordsToKeep" to "10000" to keep them all.
   - Change the setting of "tokenizer" (select AlphabeticTokenizer if you want only alphabetic tokens) and of "stopwordsHandler" to see how the results change.

c. Now you have a document–term matrix loaded in Weka. Use the "Edit" option to see it in tabular format, where you can also change its content or copy it to other applications (e.g., MS Excel). Once created in Weka, the table can be stored in an ARFF file through the "Save" option.
   Weka shows some interesting statistics about the terms. In the visualization area (Preprocess mode), change the class to document_name. If you then click "Visualize All" you will see the distribution of each term over the documents as bar diagrams. Examine the diagrams (the color indicates the document) and find which terms are very specific to some documents. For example, compare the diagrams of anthropology and chair and try to explain the difference. Which one is more representative, and for which document?
d. Move document_class to the last position: in the Preprocess tab, click Edit, right-click the "document_class" attribute, and select "Attribute as class".
e. Save the dataset as Departments-termcount.ARFF.

1.2 To obtain the binary representation, apply the NumericToBinary filter to the term-count version. Save it as Departments-binary.ARFF. What changed in the diagrams?

1.3 Now create a TFIDF representation of the Departments dataset. Start again from the original string representation and apply the StringToWordVector filter with IDFTransform=true:
a. Choose the StringToNominal filter and apply it to the first attribute, document_name.
b. Apply the StringToWordVector filter to document_content with the parameter settings attributeIndices=2, IDFTransform=TRUE, attributeNamePrefix=t-, outputWordCounts=TRUE, tokenizer=AlphabeticTokenizer, and stopwordsHandler=MultiStopwords.
   Note: IDFTransform sets whether the word frequencies in a document should be transformed into fij * log(number of docs / number of docs with word i), where fij is the frequency of word i in document (instance) j.
c. Move document_class to the last position: click Edit, right-click the "document_class" attribute, and select "Attribute as class".
d. Save the transformed dataset as Departments-tfidf.ARFF for further use.
e. Examine the document–term table and the diagrams. Explain why some columns (e.g., chair) are all zero.

Task 2. The nearest-neighbor algorithm is a straightforward application of similarity (or distance) for the purposes of classification. When the K-NN classification algorithm is used on text data, closeness is measured by minimal distance or maximal similarity between the documents' term vectors. The most common approach is to represent both the test and training documents in the TFIDF framework and to compute the cosine similarity between the document vectors.

2.1 Use the K-NN classifier (IBk) with the binary, term-count, and TFIDF versions of the Departments data created in Task 1. Vary the parameter k (e.g., set it to 2, 5, 7, 9) and the weighting scheme. Distance weighting allows the algorithm to use more than a single instance (or even all instances), so with a weighting scheme it makes sense to use a larger k. Examine how each of these parameters affects the classification accuracy measured with 10-fold cross-validation. Present the results in a table.
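As an illustration of what the IDFTransform computes (and of the cosine similarity used in Task 2), here is a minimal Python sketch — not Weka's implementation — that applies the fij * log(N / ni) weighting to toy term-count vectors. The document and term names are made up for the example:

```python
import math

def tfidf(docs):
    """Apply the assignment's TFIDF weighting: f_ij * log(N / n_i),
    where f_ij is the count of term i in document j, N is the number
    of documents, and n_i is the number of documents containing term i.
    `docs` is a list of {term: count} dictionaries."""
    n_docs = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    return [
        {t: d.get(t, 0) * math.log(n_docs / df[t]) for t in vocab}
        for d in docs
    ]

def cosine(u, v):
    """Cosine similarity between two term-weight dictionaries
    that share the same vocabulary."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A term such as chair that occurs in every document has ni = N, so its weight is fij * log(1) = 0 everywhere — which is exactly why such columns come out all zero in the TFIDF table.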
Find the data set and the parameter setting that produce maximal accuracy.

Task 3. Select attributes that produce the highest accuracy. Use the dataset that produced maximal accuracy in the previous experiment and apply Information Gain (InfoGainAttributeEval) and instance-based (ReliefFAttributeEval) attribute evaluation. This can be done through Weka's "Select Attributes" panel (to examine the attribute ranking) or in the Preprocess panel through filters (to actually reorder attributes according to their rank). Following the steps below, apply each of the two filters and then run the IBk algorithm with an increasing number of attributes chosen from the beginning of the ranked attribute list, e.g., 1, 5, 10, 50, 100, 200, 300, ...

3.1 For each filter:

a. Apply the filter to the data set. If you use Weka's "Select Attributes" panel to set up InfoGainAttributeEval or ReliefFAttributeEval for examining the attribute ranking, choose Ranker as the search method. Set numToSelect to the number of attributes to retain; the default value (-1) indicates that all attributes are retained. Use either this option or a threshold to reduce the attribute set.
b. Choose the top-ranked attributes (copy them from the output window).
c. Go to the Preprocess panel and discard the rest of the attributes: choose the Remove filter and paste the selected attributes into the "attributeIndices" parameter field along with the index of the class; then set invertSelection to "True", since we want to remove the remaining attributes.
d. Now run the IBk algorithm, with accuracy measured by 10-fold cross-validation.
e. Repeat the previous steps with another number of attributes to be selected: first Undo (to restore the removed attributes), then change the value of the numToSelect parameter in the Ranker settings, keep only the top-ranked attributes, and apply IBk.
For example, run with 1, 5, 10, 50, 100, 200, 300, ... attributes. Plot the accuracy for each run to create a graph of the number of selected attributes vs. accuracy.

3.2 Compare the graphs produced with the two attribute-selection methods (InfoGainAttributeEval and ReliefFAttributeEval) and analyze the results. Determine the optimal number of attributes for classification for each attribute-selection method. Which attribute-selection method works better? Comment on this.

Task 4. Use the binary, term-count, and TFIDF datasets and run on each one the Naïve Bayes algorithm and the multinomial Naïve Bayes algorithm (NaiveBayesMultinomial). Compare the accuracies (produced with 10-fold cross-validation) and comment on the results.

II. Write a report on the experiments described in Tasks 2, 3, and 4 above, including all comparisons, analyses, and the answers to all questions. Use tables whenever possible and include only experimental data relevant to the analysis.
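To make Task 3's attribute evaluation concrete, here is a small Python sketch of the score that InfoGainAttributeEval ranks attributes by: the class entropy minus the class entropy conditioned on the attribute. This is a toy re-implementation on made-up binary data, not Weka's code:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of one attribute with respect to the class:
    H(class) - sum over attribute values v of p(v) * H(class | v)."""
    n = len(labels)
    cond = 0.0
    for v in set(values):
        idx = [i for i, x in enumerate(values) if x == v]
        cond += len(idx) / n * entropy([labels[i] for i in idx])
    return entropy(labels) - cond

def rank_attributes(rows, labels):
    """Rank attribute indices by information gain, best first --
    the role the Ranker search plays in Weka's Select Attributes panel."""
    n_attrs = len(rows[0])
    gains = [(info_gain([r[i] for r in rows], labels), i) for i in range(n_attrs)]
    return [i for g, i in sorted(gains, reverse=True)]
```

An attribute that perfectly separates the two classes gets the maximal gain of 1 bit (for a balanced binary class), while an attribute independent of the class scores near 0 and falls to the bottom of the ranking.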
Answered 3 days after Apr 09, 2022

Answer to: Assignment 8: Text Mining

Mohd answered on Apr 12 2022
86 Votes
Analysis Report:
Tasks 1, 2 & 4
As per the instructions, we created three datasets, named TermCount, Binary, and TFIDF. We trained and evaluated IBk classification models with varied parameters (KNN and distance weighting). All datasets reach their maximum accuracy at KNN = 5.
The Departments-binary dataset has the maximum classification accuracy of 80 percent at KNN = 5 with no distance weighting. With distance weighting, the accuracy on the binary dataset increased by 5 percent. Hence the maximum classification accuracy is obtained with the binary dataset, KNN = 5, and distance weighting by 1 - distance or by 1/distance.
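The two weighting schemes referred to here are IBk's "weight by 1/distance" and "weight by 1 - distance" options. The following toy Python sketch shows how such a weighted vote can differ from plain majority voting; it is illustrative only, not Weka's IBk, and the distances are made up:

```python
from collections import defaultdict

def weighted_knn_vote(neighbors, scheme="1/distance"):
    """Combine the k nearest neighbours' class votes using one of the
    two distance-weighting schemes IBk offers: weight by 1/distance
    or by 1 - distance. `neighbors` is a list of (distance, label) pairs."""
    votes = defaultdict(float)
    for dist, label in neighbors:
        if scheme == "1/distance":
            w = 1.0 / dist if dist > 0 else float("inf")
        else:  # "1-distance"
            w = 1.0 - dist
        votes[label] += w
    return max(votes, key=votes.get)
```

With neighbours at distances 0.1 (class A), 0.2 (class B), and 0.6 (class B), plain majority voting would pick B (two votes to one), but 1/distance weighting lets the single very close A neighbour dominate, which is why a larger k is safe once weighting is on.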
     
     
Classifier                                    Test option                 TermCount       Binary          TFIDF
                                                                          KNN  Accuracy   KNN  Accuracy   KNN  Accuracy
weka.classifiers.IBk, no distance weighting   10-fold cross-validation    5    70.0%      5    80%        5    65%
Distance weighted (weka.classifiers.IBk, weight by ...