Assignment 7: Clustering
Due Tuesday by 11:59pm Points 80 Submitting a file upload
I. Do the following experiments:
Use the original loan data set (without any discretization or nominalization of the numeric
attributes) represented as an ARFF file. Apply the following clustering algorithms with the specified
parameter settings. Use the guidelines provided in the Week 13 Handout Clustering Experiments.
1. k-means (weka.clusterers.SimpleKMeans): Generate two clusters in Classes to clusters
evaluation mode. Compare the cluster centroids with the data models generated by the Weka
classifiers; that is, if we consider these centroids as instances, how are they classified according to the
models produced by OneR or J48?
2. EM (weka.clusterers.EM): Generate two clusters in Classes to clusters evaluation mode (change
the default number of clusters to 2). Compare the clusters and their parameters with the class
descriptions produced by the Naive Bayes classifier. Are they similar?
3. Cobweb (weka.clusterers.Cobweb): Use training set cluster mode and ignore the class
attribute (loan approval). Describe only the top level split (the successors of the root) in the clustering
hierarchy (using the approach described in the Handout Clustering Experiments). Compute the classes
to clusters evaluation error for the top level clustering by hand (see again Handout Clustering
Experiments). Weka cannot be used for this purpose because it computes the evaluation for the leaf
clusters only.
Hints about parameter tuning:
1. For k-means, try several different seeds (used by the random number generator) to find the best
clustering; a scripted version of this experiment is sketched after these hints.
2. In EM, use numClusters = 2 and try several seeds as well.
3. In Cobweb, try different cutoffs (category utility threshold) to generate the smallest possible clustering
hierarchy (with a minimal number of clusters).
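
For reference, the SimpleKMeans experiment can also be scripted against the Weka Java API instead of the Explorer GUI. The sketch below is only an illustration under assumptions that are not part of the assignment: the ARFF file is assumed to be named loan.arff and the class attribute (loan approval) is assumed to be the last one. The calls follow the standard Weka pattern for classes-to-clusters evaluation (build the clusterer on data with the class removed, then evaluate on data with the class set).

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class LoanClusteringExperiment {
    public static void main(String[] args) throws Exception {
        // Assumed file name; the class attribute (loan approval) is assumed to be last.
        Instances data = DataSource.read("loan.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // The clusterer itself must not see the class attribute.
        Remove remove = new Remove();
        remove.setAttributeIndices(String.valueOf(data.classIndex() + 1));
        remove.setInputFormat(data);
        Instances dataNoClass = Filter.useFilter(data, remove);

        for (int seed : new int[] {2, 5, 10, 15, 20}) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(2);   // two clusters, as the assignment asks
            km.setSeed(seed);       // hint 1: try several seeds
            km.buildClusterer(dataNoClass);

            // Classes-to-clusters evaluation uses the copy that still has the class set.
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(km);
            eval.evaluateClusterer(data);

            System.out.println("=== seed " + seed + " ===");
            System.out.println(eval.clusterResultsToString());
        }
    }
}

Swapping SimpleKMeans for weka.clusterers.EM (with setNumClusters(2)) gives the corresponding EM experiment.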

II. Write a report on the experiments described above, including all comparisons, analyses and the
answers to all questions. Use tables whenever possible and include only experimental data which are
relevant to the analysis.

Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 4 (Algorithms: Clustering) of Data Mining by I. H. Witten, E. Frank, M. A. Hall and C. J. Pal
Clustering
• Clustering techniques apply when there is no class to be predicted: they perform unsupervised learning
• Aim: divide instances into “natural” groups
• As we have seen, clusters can be:
• disjoint vs. overlapping
• deterministic vs. probabilistic
• flat vs. hierarchical
• We will look at a classic clustering algorithm called k-means
• k-means clusters are disjoint, deterministic, and flat
The k-means algorithm
• Step 1: Choose k random cluster centers
• Step 2: Assign each instance to its closest cluster center based on Euclidean distance
• Step 3: Recompute cluster centers by computing the average (aka centroid) of the instances pertaining to each cluster
• Step 4: If cluster centers have moved, go back to Step 2
• This algorithm minimizes the squared Euclidean distance of the instances from their corresponding cluster centers
• Determines a solution that achieves a local minimum of the squared Euclidean distance
• Equivalent termination criterion: stop when the assignment of instances to cluster centers has not changed
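
As a concrete illustration of Steps 1-4, here is a minimal, self-contained k-means sketch on plain double arrays; it is not the book's or Weka's implementation, and the class and method names are made up for this example. It terminates using the equivalent criterion above: stop when no assignment changes.

import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {

    /** Runs k-means on the rows of data; returns the cluster index of each row. */
    static int[] kMeans(double[][] data, int k, long seed) {
        Random rnd = new Random(seed);
        int n = data.length, d = data[0].length;

        // Step 1: choose k distinct random instances as the initial cluster centers.
        double[][] centers = new double[k][];
        boolean[] taken = new boolean[n];
        for (int c = 0; c < k; c++) {
            int idx;
            do { idx = rnd.nextInt(n); } while (taken[idx]);
            taken[idx] = true;
            centers[c] = data[idx].clone();
        }

        int[] assign = new int[n];
        Arrays.fill(assign, -1);
        boolean changed = true;
        while (changed) {                      // Step 4: repeat until no assignment changes
            changed = false;
            // Step 2: assign each instance to its closest center (squared Euclidean distance).
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) {
                        double diff = data[i][j] - centers[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Step 3: recompute each center as the mean (centroid) of its instances.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assign[i]]++;
                for (int j = 0; j < d; j++) sums[assign[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int j = 0; j < d; j++) centers[c][j] = sums[c][j] / counts[c];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] toy = {{1, 1}, {1.2, 0.8}, {5, 5}, {5.1, 4.9}};
        System.out.println(Arrays.toString(kMeans(toy, 2, 42)));
    }
}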
The k-means algorithm: example (figure)
Discussion
• Algorithm minimizes squared distance to cluster centers
• Result can vary significantly based on the initial choice of seeds
• Can get trapped in a local minimum
• Example: (figure showing instances and initial cluster centres)
• To increase the chance of finding the global optimum: restart with different random seeds
• Hierarchical k-means: apply recursively with k = 2 to the resulting clusters
Faster distance calculations
• Can we use kD-trees or ball trees to speed up the process? Yes, we can:
• First, build the tree data structure, which remains static, for all the data points
• At each node, store the number of instances and the sum of all instances (summary statistics)
• In each iteration of k-means, descend the tree and find out which cluster each node belongs to
• Can stop descending as soon as we find out that a node belongs entirely to a particular cluster
• Use the summary statistics stored previously at the nodes to compute the new cluster centers
Example scenario using a ball tree (figure)
Choosing the number of clusters
• Big question in practice: what is the right number of clusters, i.e., what is the right value for k?
• Cannot simply optimize squared distance on the training data to choose k
• Squared distance decreases monotonically with increasing values of k
• Need some measure that balances distance with the complexity of the model, e.g., based on the MDL principle (covered later)
• Finding the right-size model using MDL becomes easier when applying a recursive version of k-means (bisecting k-means):
• Compute A: the information required to store the data centroid, and the location of each instance with respect to this centroid
• Split the data into two clusters using 2-means
• Compute B: the information required to store the two new cluster centroids, and the location of each instance with respect to these two
• If A > B, split the data and recurse (just like in other tree learners)
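
The recursion can be sketched as follows. The information costs A and B are replaced here by a crude stand-in (a fixed per-centroid penalty plus squared distances), because the actual MDL encoding is not spelled out on this slide; the penalty value, the class names, and the reuse of the kMeans sketch above are all assumptions for illustration.

import java.util.ArrayList;
import java.util.List;

/** Bisecting k-means sketch. The split test below uses a crude stand-in for the
 *  MDL comparison of A and B (a fixed per-centroid penalty plus squared distances);
 *  it is illustrative only, not the book's actual encoding. */
public class BisectingKMeans {

    static final double CENTROID_PENALTY = 1.0;   // assumed cost of storing one centroid

    static double[] centroid(double[][] data) {
        double[] c = new double[data[0].length];
        for (double[] x : data)
            for (int j = 0; j < c.length; j++) c[j] += x[j] / data.length;
        return c;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    static double spread(double[][] data) {       // "location of each instance" term
        double[] c = centroid(data);
        double s = 0;
        for (double[] x : data) s += sqDist(x, c);
        return s;
    }

    /** Recursively split the data while the two-cluster description is cheaper (A > B). */
    static void split(double[][] data, List<double[][]> leaves) {
        if (data.length < 2) { leaves.add(data); return; }

        // A: one centroid plus the spread of the instances around it.
        double costA = CENTROID_PENALTY + spread(data);

        // B: run 2-means (reusing the kMeans sketch above) and cost the two clusters.
        int[] assign = KMeansSketch.kMeans(data, 2, 1L);
        List<double[]> left = new ArrayList<>(), right = new ArrayList<>();
        for (int i = 0; i < data.length; i++)
            (assign[i] == 0 ? left : right).add(data[i]);
        if (left.isEmpty() || right.isEmpty()) { leaves.add(data); return; }
        double[][] l = left.toArray(new double[0][]);
        double[][] r = right.toArray(new double[0][]);
        double costB = 2 * CENTROID_PENALTY + spread(l) + spread(r);

        if (costA > costB) {          // splitting pays off: recurse on both halves
            split(l, leaves);
            split(r, leaves);
        } else {
            leaves.add(data);         // keep this subset as a single cluster
        }
    }

    public static void main(String[] args) {
        double[][] toy = {{1, 1}, {1.2, 0.8}, {5, 5}, {5.1, 4.9}};
        List<double[][]> leaves = new ArrayList<>();
        split(toy, leaves);
        System.out.println("clusters found: " + leaves.size());
    }
}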
Hierarchical clustering
• Bisecting k-means performs hierarchical clustering in a top-down manner
• Standard hierarchical clustering performs clustering in a bottom-up manner; it performs agglomerative clustering:
• First, make each instance in the dataset into a trivial mini-cluster
• Then, find the two closest clusters and merge them; repeat
• Clustering stops when all clusters have been merged into a single cluster
• Outcome is determined by the distance function that is used:
• Single-linkage clustering: distance of two clusters is measured by finding the two closest instances, one from each cluster, and taking their distance
• Complete-linkage clustering: use the two most distant instances instead
• Average-linkage clustering: take the average distance between all instances
• Centroid-linkage clustering: take the distance of the cluster centroids
• Group-average clustering: take the average distance in the merged clusters
• Ward’s method: optimize the k-means criterion (i.e., squared distance)
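
A minimal bottom-up sketch of agglomerative clustering with single- or complete-linkage distances (the other linkage variants would only change the linkage function); the class and method names are illustrative, not from the book.

import java.util.ArrayList;
import java.util.List;

/** Agglomerative clustering sketch with single- or complete-linkage distances. */
public class AgglomerativeSketch {

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    /** Single linkage: closest pair of instances; complete linkage: most distant pair. */
    static double linkage(List<double[]> c1, List<double[]> c2, boolean single) {
        double best = single ? Double.MAX_VALUE : 0;
        for (double[] a : c1)
            for (double[] b : c2) {
                double d = Math.sqrt(sqDist(a, b));
                best = single ? Math.min(best, d) : Math.max(best, d);
            }
        return best;
    }

    /** Start with one trivial mini-cluster per instance, then repeatedly merge the closest pair. */
    static List<List<double[]>> cluster(double[][] data, int targetClusters, boolean single) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] x : data) {
            List<double[]> c = new ArrayList<>();
            c.add(x);
            clusters.add(c);
        }
        while (clusters.size() > targetClusters) {
            int bestI = 0, bestJ = 1;
            double bestD = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j), single);
                    if (d < bestD) { bestD = d; bestI = i; bestJ = j; }
                }
            clusters.get(bestI).addAll(clusters.remove(bestJ));  // merge the closest pair
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] toy = {{1, 1}, {1.2, 0.8}, {5, 5}, {5.1, 4.9}};
        System.out.println(cluster(toy, 2, true).size() + " clusters (single linkage)");
    }
}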
Example: complete linkage (figure)
Example: single linkage (figure)
Incremental clustering
• Heuristic approach (COBWEB/CLASSIT)
• Forms a hierarchy of clusters incrementally
• Start: the tree consists of an empty root node
• Then: add instances one by one and update the tree appropriately at each stage
• To update, find the right leaf for an instance; this may involve restructuring the tree by merging or splitting nodes
• Update decisions are based on a goodness measure called category utility
Clustering the weather data I
ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
E Rainy Cool Normal False
F Rainy Cool Normal True
G Overcast Cool Normal True
H Sunny Mild High False
I Sunny Cool Normal False
J Rainy Mild Normal False
K Sunny Mild Normal True
L Overcast Mild High True
M Overcast Hot Normal False
N Rainy Mild High True
(tree diagrams for stages 1, 2 and 3 omitted)
Clustering the weather data II
(weather table as on the previous slide; tree diagrams omitted)
• Merge the best host and the runner-up
• Consider splitting the best host if merging does not help
Final clustering (figure)
Example: clustering a subset of the iris data (figure)
Example: iris data with cutoff (figure)
The category utility measure
• Category utility: a quadratic loss function defined on conditional probabilities:

  CU(C_1, C_2, \ldots, C_k) = \frac{1}{k} \sum_{l} P(C_l) \sum_{i} \sum_{j} \left( P(a_i = v_{ij} \mid C_l)^2 - P(a_i = v_{ij})^2 \right)

• If every instance is placed in a different category, the numerator reaches its maximum value

  m - \sum_{i} \sum_{j} P(a_i = v_{ij})^2

  where m is the number of attributes.
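
The formula can be computed directly for nominal data. The sketch below assumes the instances are given as a matrix of attribute-value strings and the clustering as a cluster index per row; these conventions and the class name are assumptions for illustration.

import java.util.HashMap;
import java.util.Map;

/** Category utility for nominal data: rows are instances, columns are attributes,
 *  assign[r] is the cluster index of row r. */
public class CategoryUtility {

    static double categoryUtility(String[][] rows, int[] assign, int k) {
        int n = rows.length, m = rows[0].length;

        // P(a_i = v_ij) over the whole data set, one map per attribute.
        Map<String, Double>[] overall = new HashMap[m];  // unchecked warning, fine for a sketch
        for (int i = 0; i < m; i++) overall[i] = new HashMap<>();
        for (String[] row : rows)
            for (int i = 0; i < m; i++)
                overall[i].merge(row[i], 1.0 / n, Double::sum);

        double cu = 0;
        for (int l = 0; l < k; l++) {
            // Collect the rows of cluster l and the counts needed for P(a_i = v_ij | C_l).
            int size = 0;
            Map<String, Double>[] inCluster = new HashMap[m];
            for (int i = 0; i < m; i++) inCluster[i] = new HashMap<>();
            for (int r = 0; r < n; r++)
                if (assign[r] == l) {
                    size++;
                    for (int i = 0; i < m; i++)
                        inCluster[i].merge(rows[r][i], 1.0, Double::sum);
                }
            if (size == 0) continue;
            double pCl = (double) size / n;

            double inner = 0;
            for (int i = 0; i < m; i++)
                for (Map.Entry<String, Double> e : overall[i].entrySet()) {
                    double pCond = inCluster[i].getOrDefault(e.getKey(), 0.0) / size;
                    inner += pCond * pCond - e.getValue() * e.getValue();
                }
            cu += pCl * inner;
        }
        return cu / k;   // divide by the number of clusters
    }

    public static void main(String[] args) {
        String[][] weather = {
            {"Sunny", "Hot", "High", "False"}, {"Sunny", "Hot", "High", "True"},
            {"Rainy", "Cool", "Normal", "False"}, {"Rainy", "Cool", "Normal", "True"}};
        int[] assign = {0, 0, 1, 1};
        System.out.println(categoryUtility(weather, assign, 2));
    }
}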
Numeric attributes?
• Assume a normal distribution:

  f(a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(a-\mu)^2}{2\sigma^2} \right)

• Then the sum over attribute values corresponds to an integral:

  \sum_{j} P(a_i = v_{ij})^2 \;\Leftrightarrow\; \int f(a_i)^2 \, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}

• Thus the category utility becomes:

  CU = \frac{1}{k} \sum_{l} \Pr[C_l] \, \frac{1}{2\sqrt{\pi}} \sum_{i} \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right)

• A pre-specified minimum variance can be enforced to combat overfitting (called the acuity parameter)
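
Given per-cluster and overall standard deviations (already floored by the acuity parameter), the numeric version of category utility reduces to the short computation below; the parameter layout and names are assumptions for illustration.

/** Category utility for numeric attributes, given per-cluster and overall
 *  standard deviations (assumed already floored by the acuity parameter). */
public class NumericCategoryUtility {

    /**
     * prior[l]      = Pr[C_l]
     * sigmaIL[l][i] = std. dev. of attribute i within cluster l
     * sigmaI[i]     = std. dev. of attribute i over all the data
     */
    static double categoryUtility(double[] prior, double[][] sigmaIL, double[] sigmaI) {
        int k = prior.length;
        double cu = 0;
        for (int l = 0; l < k; l++) {
            double inner = 0;
            for (int i = 0; i < sigmaI.length; i++)
                inner += 1.0 / sigmaIL[l][i] - 1.0 / sigmaI[i];
            cu += prior[l] * inner / (2 * Math.sqrt(Math.PI));
        }
        return cu / k;   // divide by the number of clusters
    }

    public static void main(String[] args) {
        // Two clusters, one attribute: tighter clusters (smaller sigma) give higher utility.
        double[] prior = {0.5, 0.5};
        double[][] sigmaIL = {{0.5}, {0.5}};
        double[] sigmaI = {2.0};
        System.out.println(categoryUtility(prior, sigmaIL, sigmaI));
    }
}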
Multi-instance learning
• Recap: multi-instance learning is concerned with examples
corresponding to sets (really, bags or multi-sets) of instances
• All instances have the same attributes but the full set of instances is split
into subsets of related instances that are not necessarily independent
• These subsets are the examples for learning
• Example applications of multi-instance learning: image
classification, classification of molecules
• Simplicity-first methodology can be applied to multi-instance
learning with surprisingly good results
• Two simple approaches to multi-instance learning, both using
standard single-instance learners:
• Manipulate the input to learning
• Manipulate the output of learning
Aggregating the input
• Idea: convert multi-instance learning problem into a single-
instance one
• Summarize the instances in a bag by computing the mean, mode,
minimum and maximum, etc., of each attribute as new attributes
• “Summary” instance retains the class label of its bag
• To classify a new bag the same process is used
• Any single-instance learning method, e.g., decision tree
learning, can be applied to the resulting data
• This approach discards most of the information in a bag of
instances but works surprisingly well in practice
• Should be used as a base line when evaluating more
advanced approaches to multi-instance learning
• More sophisticated region-based summary statistics can be
applied to extend this basic approach
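
A sketch of the input-aggregation idea: each bag of numeric instances is collapsed into one "summary" instance holding the per-attribute mean, minimum and maximum, keeping the bag's class label. The record types and names are illustrative (Java 16+ records), and a real setup would then feed the summaries to any single-instance learner.

import java.util.Arrays;

/** Sketch of input aggregation for multi-instance learning. */
public class AggregateInput {

    /** A bag: several instances over the same attributes, sharing one class label. */
    record Bag(double[][] instances, String label) {}

    /** One "summary" instance per bag: per-attribute mean, min and max, plus the bag's label. */
    record Summary(double[] features, String label) {}

    static Summary summarize(Bag bag) {
        int d = bag.instances()[0].length;
        double[] mean = new double[d], min = new double[d], max = new double[d];
        Arrays.fill(min, Double.MAX_VALUE);
        Arrays.fill(max, -Double.MAX_VALUE);
        for (double[] x : bag.instances())
            for (int j = 0; j < d; j++) {
                mean[j] += x[j] / bag.instances().length;
                min[j] = Math.min(min[j], x[j]);
                max[j] = Math.max(max[j], x[j]);
            }
        double[] features = new double[3 * d];
        System.arraycopy(mean, 0, features, 0, d);
        System.arraycopy(min, 0, features, d, d);
        System.arraycopy(max, 0, features, 2 * d, d);
        // The summary retains the class label of its bag; any single-instance
        // learner (e.g., a decision tree) can then be trained on these rows.
        return new Summary(features, bag.label());
    }

    public static void main(String[] args) {
        Bag bag = new Bag(new double[][] {{1, 4}, {3, 2}}, "positive");
        Summary s = summarize(bag);
        System.out.println(Arrays.toString(s.features()) + " -> " + s.label());
    }
}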
Aggregating the output
• Idea: learn a single-instance classifier directly from the
original instances in all the bags
• Each instance is given the class of the bag it originates from
• But: bags can contain differing numbers of instances, so give each instance a weight inversely proportional to the bag's size
• To classify a new bag:
• Produce a prediction for each instance in the bag
• Aggregate the predictions to produce a prediction for the bag as a
whole
• One approach: treat predictions as votes for the various class labels;
alternatively, average the class probability estimates
• This approach treats all instances as independent at training
time, which is not the case in true multi-instance applications
• Nevertheless, it often works very well in practice and should be used as a base line
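
The output-aggregation side can be sketched as follows: instances carry a weight inversely proportional to their bag's size at training time, and at prediction time the per-instance class-probability estimates are averaged over the bag. The ProbabilisticClassifier interface below is a hypothetical stand-in for any single-instance learner, not a Weka type.

/** Sketch of output aggregation for multi-instance learning. */
public class AggregateOutput {

    /** Hypothetical single-instance model that returns class-probability estimates. */
    interface ProbabilisticClassifier {
        double[] distributionFor(double[] instance);
    }

    /** Weight given to every instance of a bag at training time (inverse of bag size). */
    static double instanceWeight(int bagSize) {
        return 1.0 / bagSize;
    }

    /** Classify a bag by averaging the probability estimates of its instances. */
    static int classifyBag(ProbabilisticClassifier model, double[][] bag, int numClasses) {
        double[] avg = new double[numClasses];
        for (double[] instance : bag) {
            double[] dist = model.distributionFor(instance);
            for (int c = 0; c < numClasses; c++) avg[c] += dist[c] / bag.length;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++)
            if (avg[c] > avg[best]) best = c;
        return best;
    }

    public static void main(String[] args) {
        // Toy model: leans towards class 1 when the first attribute is positive.
        ProbabilisticClassifier toy = x -> x[0] > 0 ? new double[] {0.2, 0.8}
                                                    : new double[] {0.7, 0.3};
        double[][] bag = {{1.5, 0.0}, {-0.2, 1.0}, {2.0, -1.0}};
        System.out.println("predicted class: " + classifyBag(toy, bag, 2));
    }
}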

Clustering
• Clustering is an unsupervised learning method: there is no target value (class label) to be predicted; the goal is finding common patterns or grouping similar examples.
• Differences between models/algorithms for clustering:
– Conceptual

Solution

Mohd answered on Apr 06 2022
1. k-means (weka.clusterers.SimpleKMeans): Generate two clusters in Classes to clusters
evaluation mode. Compare the cluster centroids with the data models generated by the Weka
classifiers, that is, if we consider these centroids as instances how are they classified according to the
models produced by OneR or J48?
A. We ran k-means clustering on the loan data to obtain two clusters; the resulting centroids, treated as new instances, can then be classified with J48 to find out which class each belongs to. J48 has higher accuracy than the k-means clustering.
     
Results for weka.clusterers.SimpleKMeans (classes-to-clusters evaluation):

Seed   Number of clusters   Incorrectly classified instances
10     2                    35%
2      2                    45%
5      2                    45%
15     2                    45%
20     2                    35%
2. EM (weka.clusterers.EM): Generate two clusters in Classes to clusters evaluation mode (change
the default number of clusters to 2). Compare the clusters and their parameters with the class
descriptions produced by the Naive Bayes classifier. Are they similar?
A. They are somewhat similar in terms of the class attribute levels and the number of clusters. In Naive Bayes we have a response variable used to classify the data, whereas in EM clustering we have two clusters to...