4/9/22, 3:33 AM — Assignment 8: Text Mining
https://wssu.instructure.com/courses/19153/assignments/ XXXXXXXXXX/4

Assignment 8: Text Mining
Due Tuesday by 11:59pm    Points 80    Submitting a file upload

Attachments:
Assignment 8.pdf (https://wssu.instructure.com/courses/19153/files/2959899/download?download_frd=1)
Departments.ARFF (https://wssu.instructure.com/courses/19153/files/2735771/download?download_frd=1)
Experiments with Text Mining
You can experiment with text mining in Weka by creating a text file where each document is
represented on a separate line in plain text format. In the ARFF file, each document's content is
enclosed in quotation marks ("), each line begins with the document's name, and the file begins
with a header, e.g.:
@relation departments_string
@attribute document_name string
@attribute document_content string
@data
Anthropology, "Anthropology consists of four ..."
The attached file Departments.ARFF contains data about academic departments in string format. The
data is scraped from a collection of web pages describing the departments in the CCSU School of Arts
and Sciences. The data from each department webpage is named after the department and classified
into one of two groups: "A" – science disciplines and "B" – humanities.
Task 1. Create 3 versions of this data set: term-count, binary, and TFIDF versions.
1.1 To create a term-count version of the dataset:
a. Load the original file in Weka; in the Preprocess tab choose the StringToNominal filter and apply
it to the first attribute, document_name.
b. Then choose the StringToWordVector filter (use the More button to see the meaning of its
parameters) and apply it to document_content with parameters:
attributeIndices=2 and outputWordCounts=TRUE.
Change “attributeNamePrefix” to “t-“ so that all the new features/attributes (representing terms
in the documents) start with it. Change “wordsToKeep” to “10000” to keep them all.
Change the setting of “tokenizer” (select Alphabetic Tokenizer, if you want only alphabetic
tokens) and “stopwordsHandler” to see how the results change.
c. Now you have a document–term matrix loaded in Weka. Use the “Edit” option to see it in a
tabular format, where you can also change its content or copy it to other applications (e.g., MS
Excel). Once created in Weka, the table can be stored in an ARFF file through the “Save” option.
Weka shows some interesting statistics about the terms. In the visualization area (preprocess
mode), change the class to document_name. Then if you click “Visualize All” you will see the
distribution of each term over documents as bar diagrams. You are encouraged to examine
the diagrams (the color indicates the document) and find which terms are very specific for
some documents. For example, compare the diagrams of anthropology and chair and try to
explain the difference. Which one is more representative, and for which document?
d. Move the document_class to the last position:
In the Preprocess tab, click Edit.
Right-click the “document_class” attribute and select “Attribute as a class”.
e. Save the dataset as Departments-termcount.ARFF.
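For intuition, the term-count construction above (tokenize, count, one column per term) can be sketched in plain Python — a minimal stdlib illustration, not Weka's implementation; the "t-" prefix mirrors the settings above, but the toy documents are made up:

```python
import re
from collections import Counter

def term_counts(docs, prefix="t-"):
    """Build a document-term (term-count) matrix from raw strings.

    Roughly mirrors StringToWordVector with outputWordCounts=TRUE and an
    alphabetic tokenizer: tokens are maximal runs of letters, lowercased.
    """
    tokenize = lambda text: re.findall(r"[A-Za-z]+", text.lower())
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted({t for c in counts for t in c})
    header = [prefix + t for t in vocab]   # attributeNamePrefix = "t-"
    matrix = [[c.get(t, 0) for t in vocab] for c in counts]
    return header, matrix

# Toy documents standing in for the department descriptions:
header, matrix = term_counts(["Anthropology consists of four fields",
                              "Chemistry consists of labs"])
print(header)   # ['t-anthropology', 't-chemistry', 't-consists', ...]
print(matrix)
```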
1.2 To obtain the binary representation, apply the NumericToBinary filter to the term-count
representation created above. Save it as Departments-binary.ARFF. What changed in the
diagrams?
1.3 Now create a TFIDF representation of the Departments dataset. For the TFIDF representation, use
the original string representation and apply the StringToWordVector filter with IDFTransform=true.
a. Choose the StringToNominal filter and apply it to the first attribute, document_name.
b. Apply the StringToWordVector filter to document_content with parameter settings:
attributeIndices=2; IDFTransform=TRUE; attributeNamePrefix=t-; outputWordCounts=TRUE;
tokenizer=Alphabetic Tokenizer; and stopwordsHandler=MultiStopwords.
Note: IDFTransform sets whether the word frequencies in a document should be transformed into:
    f_ij * log(num of Docs / num of Docs with word i)
where f_ij is the frequency of word i in document (instance) j.
c. Move the document_class to the last position:
Click Edit.
Right-click the “document_class” attribute and select “Attribute as a class”.
d. Save the transformed dataset as Departments-tfidf.ARFF for further use.
e. Examine the document–term table and the diagrams. Explain why some columns (e.g., chair)
are all zero.
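As a check on the formula in the note above, here is a stdlib sketch of the transform; the toy count matrix is made up, and Weka's actual filter may additionally normalize or transform frequencies depending on its settings:

```python
import math

def tfidf(count_matrix):
    """Apply f_ij * log(N / n_i), where N is the number of documents and
    n_i the number of documents containing term i (one column per term)."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    df = [sum(1 for row in count_matrix if row[i] > 0) for i in range(n_terms)]
    return [[row[i] * math.log(n_docs / df[i]) if row[i] else 0.0
             for i in range(n_terms)]
            for row in count_matrix]

# A term that occurs in every document (as "chair" does on every
# department page) gets weight f * log(N/N) = 0, so its whole column
# becomes zero.
weights = tfidf([[2, 1],
                 [0, 1]])   # second term appears in both docs -> zeros
```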

Task 2. The nearest-neighbor algorithm is a straightforward application of similarity (or distance) for
the purposes of classification. When using the k-NN classification algorithm on text data, closeness
is measured by minimal distance or maximal similarity between the documents' term vectors. The most
common approach is to represent both the test and training documents in the TFIDF framework and
to compute the cosine similarity between the document vectors.
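The closeness computation described above can be sketched as follows — a stdlib illustration with made-up term vectors, not the IBk implementation (IBk's default metric is actually Euclidean distance; cosine similarity is the common choice in text-retrieval practice):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(test_vec, train, k=5):
    """Majority label among the k training vectors most similar to
    test_vec; train is a list of (vector, label) pairs."""
    ranked = sorted(train, key=lambda pair: cosine(test_vec, pair[0]),
                    reverse=True)
    top = [label for _, label in ranked[:k]]
    return max(set(top), key=top.count)

# Made-up TFIDF vectors for two classes ("A" science, "B" humanities):
train = [([1, 0, 2], "A"), ([0, 3, 1], "B"), ([2, 0, 1], "A")]
print(knn_predict([1, 0, 1], train, k=3))   # -> A
```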
2.1 Use the k-NN classifier (IBk) with the binary, term-count, and TFIDF versions of the departments data
created in Task 1. Vary the parameter k (e.g., set it to 2, 5, 7, 9) and the weighting scheme. The
distance-weighting approach allows the algorithm to use not just a single instance but several or even all
instances, so with distance weighting it makes sense to use a larger k. Examine how each of these
parameters affects the classification accuracy measured with 10-fold cross-validation. Present the
results in a table. Find the dataset and the parameter setting that produce maximal accuracy.
Task 3. Select attributes that produce the highest accuracy: use the dataset that produced maximal
accuracy in the previous experiment and apply Information Gain (InfoGainAttributeEval) and instance-
based (ReliefFAttributeEval) attribute evaluation. This can be done through Weka's "Select
Attributes" panel (to examine the attribute ranking) or in the Preprocess panel through filters (to actually
reorder attributes according to their rank).
Following the steps below, apply each of the two filters and then run the IBk algorithm with increasing
number of attributes chosen from the beginning of the ranked attribute list, e.g., 1, 5, 10, 50, 100, 200,
300, ...
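Information gain, the quantity InfoGainAttributeEval ranks by, can be sketched for a discrete (e.g., binary-presence) attribute as follows — a stdlib illustration with hypothetical labels, not Weka's code:

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def info_gain(values, labels):
    """Gain of a discrete attribute with respect to the class:
    H(class) - sum_v p(v) * H(class | attribute == v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# With classes A, A, B, B: a term whose presence perfectly splits the
# classes has gain 1 bit; a term present in one document of each class
# carries no information about the class (gain 0).
labels = ["A", "A", "B", "B"]
print(info_gain([1, 1, 0, 0], labels), info_gain([1, 0, 1, 0], labels))
```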
3.1 For each filter:
a. Apply the filter to the data set.
If you use Weka's "Select Attributes" panel to set the InfoGainAttributeEval and
ReliefFAttributeEval evaluators for examining the attribute ranking, choose Ranker as the
Search Method. Set numToSelect to the number of attributes to retain. The default value (-1)
indicates that all attributes are to be retained. Use either this option or a threshold to reduce
the attribute set.
b. Choose the top-ranked attributes (copy them from the output window).
c. Go to the Preprocess panel and discard the rest of the attributes:
Choose the Remove filter and paste the selected attributes into the field for the parameter
"attributeIndices", along with the index of the class; then set invertSelection to "True", since
we want to remove the remaining attributes.
d. Now run the IBk algorithm, measuring accuracy with 10-fold cross-validation.
e. Repeat the previous steps with another number of attributes to be selected: first Undo (the
removed attributes), then change the numToSelect parameter's value in the Ranker
settings, retain only the top-ranked attributes (removing the rest), and apply IBk. For example, run with 1, 5, 10, 50, 100, 200, 300,
… attributes. Plot the accuracy of each run to create a graph of the number of selected
attributes vs. accuracy.
3.2 Compare the graphs produced with the two attribute selection methods (InfoGainAttributeEval
and ReliefFAttributeEval) and analyze the results.
Determine the optimal number of attributes for classification for each attribute selection method.
Which attribute selection method works better? Comment on this.
Task 4. Use the binary, term-count, and TFIDF datasets and run the Naïve Bayes algorithm
and the multinomial Naïve Bayes algorithm (NaiveBayesMultinomial) on each one. Compare the accuracies
(produced with 10-fold cross-validation) and comment on the results.
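For intuition about why the multinomial variant suits term counts, here is a stdlib sketch of multinomial Naive Bayes with Laplace smoothing — toy data, not Weka's NaiveBayesMultinomial implementation:

```python
import math
from collections import Counter

def train_mnb(docs, labels):
    """docs: list of {term: count} dicts; labels: one class per doc.
    Returns log-priors and smoothed log-conditional term probabilities."""
    classes = set(labels)
    vocab = {t for d in docs for t in d}
    priors = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    cond = {}
    for c in classes:
        merged = Counter()
        for d, l in zip(docs, labels):
            if l == c:
                merged.update(d)
        total = sum(merged.values())
        # Laplace (add-one) smoothing over the vocabulary:
        cond[c] = {t: math.log((merged[t] + 1) / (total + len(vocab)))
                   for t in vocab}
    return priors, cond

def predict_mnb(model, doc):
    """argmax_c of log P(c) + sum_t count(t) * log P(t|c); term counts
    enter as exponents, which is what the multinomial model assumes."""
    priors, cond = model
    scores = {c: priors[c] + sum(n * cond[c].get(t, 0.0)
                                 for t, n in doc.items())
              for c in priors}
    return max(scores, key=scores.get)

# Hypothetical term-count documents for classes A (science) and B:
model = train_mnb([{"lab": 2, "experiment": 1}, {"poem": 3}, {"lab": 1}],
                  ["A", "B", "A"])
print(predict_mnb(model, {"lab": 1}))   # -> A
```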
II. Write a report on the experiments described in Tasks 2, 3, and 4 above, including all
comparisons, analyses, and answers to all questions. Use tables whenever possible and include
only experimental data relevant to the analysis.

Answered 3 days after (Apr 09, 2022)

Solution

Mohd answered on Apr 12, 2022
Analysis Report:
Tasks 1, 2 & 4
As per the instructions, we created three datasets, named TermCount, Binary, and TFIDF. We trained and evaluated IBk classification models with varied parameters (k, distance weighting). All datasets reach maximum accuracy at k = 5.
The Departments-Binary dataset has a maximum classification accuracy of 80 percent at k = 5 with no distance weighting. With distance weighting, the accuracy for the binary dataset increased by 5 percent. Hence the maximum classification accuracy is obtained with the binary dataset, k = 5, and weighting by 1 − distance or 1/distance.
     
     
Classifiers                                    Test Option                 TermCount       Binary          TFIDF
                                                                           k   Accuracy    k   Accuracy    k   Accuracy
weka.classifiers.IBk, no distance weighting    10-fold cross validation    5   70.0%       5   80%         5   65%
Distance weighted (weka.classifiers.IBk, weight by...