4/2/22, 3:03 AM Assignment 7: Clustering
https://wssu.instructure.com/courses/19153/assignments/XXXXXXXXXX
Assignment 7: Clustering
Due: Tuesday by 11:59pm · Points: 80 · Submitting: a file upload
I. Do the following experiments:
Use the original loan data set (without any discretization or nominalization of the numeric
attributes) represented as an ARFF file. Apply the following clustering algorithms with the specified
parameter settings. Use the guidelines provided in the Week 13 Handout Clustering Experiments.
1. k-means (weka.clusterers.SimpleKMeans): Generate two clusters in Classes to clusters
evaluation mode. Compare the cluster centroids with the data models generated by the Weka
classifiers; that is, if we consider these centroids as instances, how are they classified according to the
models produced by OneR or J48?
2. EM (weka.clusterers.EM): Generate two clusters in Classes to clusters evaluation mode (change
the default number of clusters to 2). Compare the clusters and their parameters with the class
descriptions produced by the Naive Bayes classifier. Are they similar?
3. Cobweb (weka.clusterers.Cobweb): Use training set cluster mode and ignore the class
attribute (loan approval). Describe only the top-level split (the successors of the root) in the clustering
hierarchy (using the approach described in the Handout Clustering Experiments). Compute the classes
to clusters evaluation error for the top-level clustering by hand (see again Handout Clustering
Experiments). Weka cannot be used for this purpose because it computes the evaluation for the leaf
clusters only.
Hints about parameter tuning:
1. For k-means, try several different seeds (used by the random number generator) to find the best
clustering.
2. In EM, use numClusters = 2 and try several seeds as well.
3. In Cobweb, try different cutoffs (category utility threshold) to generate the smallest possible clustering
hierarchy (with minimal number of clusters).
II. Write a report on the experiments described above, including all comparisons, analyses and the
answers to all questions. Use tables whenever possible and include only experimental data which are
relevant to the analysis.
Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 4, Algorithms: Clustering
by I. H. Witten, E. Frank, M. A. Hall and C. J. Pal
Clustering

• Clustering techniques apply when there is no class to be
predicted: they perform unsupervised learning
• Aim: divide instances into “natural” groups
• As we have seen, clusters can be:
  • disjoint vs. overlapping
  • deterministic vs. probabilistic
  • flat vs. hierarchical
• We will look at a classic clustering algorithm called k-means
• k-means clusters are disjoint, deterministic, and flat
The k-means algorithm

• Step 1: Choose k random cluster centers
• Step 2: Assign each instance to its closest cluster center based on
Euclidean distance
• Step 3: Recompute cluster centers by computing the average
(aka centroid) of the instances pertaining to each cluster
• Step 4: If cluster centers have moved, go back to Step 2
• This algorithm minimizes the squared Euclidean distance of the
instances from their corresponding cluster centers
• It determines a solution that achieves a local minimum of the squared
Euclidean distance
• Equivalent termination criterion: stop when the assignment of
instances to cluster centers has not changed
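The four steps above can be sketched in plain Python. This is a minimal illustration (not Weka's SimpleKMeans); the function name and parameters are our own, and instances are assumed to be equal-length tuples of floats.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means: returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # Step 1: k random centers
    assign = None
    for _ in range(max_iter):
        # Step 2: assign each instance to its closest center (squared Euclidean)
        new_assign = [min(range(k),
                          key=lambda c: sum((p - q) ** 2
                                            for p, q in zip(pt, centers[c])))
                      for pt in points]
        if new_assign == assign:                  # Step 4: stop when assignments are stable
            break
        assign = new_assign
        # Step 3: recompute each center as the centroid of its instances
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:                           # keep the old center if a cluster empties
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assign
```

Running this with several seed values and keeping the result with the smallest total squared distance implements the restart idea discussed below.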
The k-means algorithm: example
Discussion

• The algorithm minimizes the squared distance to cluster centers
• The result can vary significantly based on the initial choice of seeds
• Can get trapped in a local minimum
• Example: [figure showing the instances and two initial cluster centres]
• To increase the chance of finding the global optimum: restart with
different random seeds
• Hierarchical k-means: apply recursively with k = 2 to the
resulting clusters
Faster distance calculations

• Can we use kD-trees or ball trees to speed up the process?
Yes, we can:
• First, build the tree data structure, which remains static, for all the
data points
• At each node, store the number of instances and the sum of all
instances (summary statistics)
• In each iteration of k-means, descend the tree and find out which
cluster each node belongs to
• Can stop descending as soon as we find out that a node belongs
entirely to a particular cluster
• Use the summary statistics stored previously at the nodes to compute
the new cluster centers
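The idea above can be sketched for a kD-tree. This is a simplified illustration with our own names (`Node`, `owner`, `assign_pass`); each node stores the count, the attribute-wise sum, and the bounding box of its points. The ownership test checks the box corners, which is exact because the squared-distance difference between two centers is linear in the point.

```python
import itertools

class Node:
    """kD-tree-style node storing summary statistics: instance count,
    attribute-wise sum, and the bounding box of the points below it."""
    def __init__(self, points, depth=0, leaf_size=2):
        self.count = len(points)
        self.sums = tuple(sum(xs) for xs in zip(*points))
        self.lo = tuple(min(xs) for xs in zip(*points))
        self.hi = tuple(max(xs) for xs in zip(*points))
        if len(points) <= leaf_size:
            self.children, self.points = [], points
        else:
            axis = depth % len(points[0])                  # cycle through split axes
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.children = [Node(pts[:mid], depth + 1), Node(pts[mid:], depth + 1)]
            self.points = None

def nearest(x, centers):
    return min(range(len(centers)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(x, centers[c])))

def owner(node, centers):
    """Index of the center owning the node's whole bounding box, or None.
    d(x,c1)^2 - d(x,c2)^2 is linear in x, so checking the corners is exact."""
    winners = {nearest(corner, centers)
               for corner in itertools.product(*zip(node.lo, node.hi))}
    return winners.pop() if len(winners) == 1 else None

def assign_pass(node, centers, totals):
    """One k-means assignment pass; totals[c] accumulates [count, sum-vector]."""
    own = owner(node, centers)
    if own is not None:                   # whole subtree in one cluster: use the stats
        totals[own][0] += node.count
        totals[own][1] = [s + t for s, t in zip(totals[own][1], node.sums)]
    elif node.children:
        for child in node.children:
            assign_pass(child, centers, totals)
    else:                                 # mixed leaf: fall back to per-point work
        for pt in node.points:
            c = nearest(pt, centers)
            totals[c][0] += 1
            totals[c][1] = [s + p for s, p in zip(totals[c][1], pt)]
```

After one pass, each new center is simply `totals[c][1]` divided by `totals[c][0]`, so no individual instance needs to be touched for subtrees that lie entirely inside one cluster.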
Example scenario (using a ball tree)
Choosing the number of clusters

• Big question in practice: what is the right number of clusters,
i.e., what is the right value for k?
• Cannot simply optimize the squared distance on the training data to
choose k
  • Squared distance decreases monotonically with increasing values of k
• Need some measure that balances distance with the complexity
of the model, e.g., based on the MDL principle (covered later)
• Finding the right-size model using MDL becomes easier when
applying a recursive version of k-means (bisecting k-means):
  • Compute A: the information required to store the data centroid, and the
location of each instance with respect to this centroid
  • Split the data into two clusters using 2-means
  • Compute B: the information required to store the two new cluster
centroids, and the location of each instance with respect to these two
  • If A > B, split the data and recurse (just like in other tree learners)
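The recursion can be sketched as follows. Since the MDL computation itself is only covered later, this toy version splits whenever 2-means reduces the total squared distance by more than a threshold, a stand-in for the A > B comparison; all function names (`two_means`, `sse`, `bisect`) are our own, and two distinct points are assumed to exist.

```python
import random

def two_means(points, seed=0, iters=50):
    """Minimal 2-means used as the splitting step."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for pt in points:
            d = [sum((p - q) ** 2 for p, q in zip(pt, c)) for c in centers]
            groups[d[1] < d[0]].append(pt)         # index 1 iff closer to center 1
        if all(groups):                            # recompute centroids if both nonempty
            centers = [tuple(sum(xs) / len(g) for xs in zip(*g)) for g in groups]
    return groups

def sse(points):
    """Sum of squared distances of the points to their centroid."""
    centroid = tuple(sum(xs) / len(points) for xs in zip(*points))
    return sum(sum((p - q) ** 2 for p, q in zip(pt, centroid)) for pt in points)

def bisect(points, min_gain=1.0):
    """Recursively split with 2-means while the split reduces the SSE by
    more than min_gain (a toy stand-in for the MDL test A > B)."""
    if len(points) < 2:
        return [points]
    left, right = two_means(points)
    if not left or not right or sse(points) - (sse(left) + sse(right)) <= min_gain:
        return [points]
    return bisect(left, min_gain) + bisect(right, min_gain)
```

The recursion stops automatically once further splits no longer pay for themselves, which is exactly the role the MDL comparison plays in the real procedure.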
Hierarchical clustering

• Bisecting k-means performs hierarchical clustering in a top-down
manner
• Standard hierarchical clustering performs clustering in a bottom-up
manner; it performs agglomerative clustering:
  • First, make each instance in the dataset into a trivial mini-cluster
  • Then, find the two closest clusters and merge them; repeat
  • Clustering stops when all clusters have been merged into a single cluster
• The outcome is determined by the distance function that is used:
  • Single-linkage clustering: the distance of two clusters is measured by finding
the two closest instances, one from each cluster, and taking their distance
  • Complete-linkage clustering: use the two most distant instances instead
  • Average-linkage clustering: take the average distance between all instances
  • Centroid-linkage clustering: take the distance of the cluster centroids
  • Group-average clustering: take the average distance in the merged clusters
  • Ward’s method: optimize the k-means criterion (i.e., squared distance)
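The agglomerative loop can be sketched directly from the description above; this is a naive O(n³) illustration (our own `agglomerate` helper, supporting the single- and complete-linkage variants):

```python
def agglomerate(points, linkage="single"):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the closest pair. Returns the merge history as (cluster_a, cluster_b)
    pairs. linkage: 'single' (closest instances) or 'complete' (farthest)."""
    def dist(a, b):
        pair_dists = [sum((p - q) ** 2 for p, q in zip(x, y)) for x in a for y in b]
        return min(pair_dists) if linkage == "single" else max(pair_dists)

    clusters = [[pt] for pt in points]   # every instance starts as a trivial mini-cluster
    merges = []
    while len(clusters) > 1:
        # find the two closest clusters under the chosen linkage
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

The merge history is exactly the dendrogram built bottom-up; changing `linkage` changes which pairs get merged first, as the two examples below illustrate.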
Example: complete linkage

Example: single linkage
Incremental clustering

• Heuristic approach (COBWEB/CLASSIT)
• Forms a hierarchy of clusters incrementally
• Start: the tree consists of an empty root node
• Then: add the instances one by one, updating the tree appropriately
at each stage
  • To update, find the right leaf for an instance
  • This may involve restructuring the tree by merging or splitting nodes
• Update decisions are based on a goodness measure
called category utility
Clustering the weather data I

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Figures 1-3: the clustering tree after adding the first instances]
Clustering the weather data II

(same weather data as above)
[Figure 4: merge the best host and the runner-up]
[Figure 5: consider splitting the best host if merging does not help]
Final clustering

Example: clustering a subset of the iris data

Example: iris data with cutoff
The category utility measure

• Category utility: a quadratic loss function defined on conditional probabilities:

  CU(C_1, C_2, \ldots, C_k) = \frac{1}{k} \sum_l P(C_l) \sum_i \sum_j \left( P(a_i = v_{ij} \mid C_l)^2 - P(a_i = v_{ij})^2 \right)

• If every instance is in a different category, the numerator becomes

  m - \sum_i \sum_j P(a_i = v_{ij})^2

  which is its maximum value (m = number of attributes)
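For nominal attributes the formula can be computed directly from counts; a small sketch (our own function, not Weka's implementation), where each cluster is a list of equal-length tuples of attribute values:

```python
from collections import Counter

def category_utility(clusters):
    """CU(C1,...,Ck) = (1/k) * sum_l P(Cl) * sum_i sum_j
       [ P(ai = vij | Cl)^2 - P(ai = vij)^2 ]"""
    k = len(clusters)
    all_rows = [row for cl in clusters for row in cl]
    n, n_attrs = len(all_rows), len(all_rows[0])

    def sum_sq_probs(rows, i):
        # sum over j of P(ai = vij)^2, estimated from the given rows
        counts = Counter(row[i] for row in rows)
        return sum((c / len(rows)) ** 2 for c in counts.values())

    cu = 0.0
    for cl in clusters:
        p_cl = len(cl) / n                                   # P(Cl)
        cu += p_cl * sum(sum_sq_probs(cl, i) - sum_sq_probs(all_rows, i)
                         for i in range(n_attrs))
    return cu / k
```

A partition that groups identical instances together scores high, while the trivial one-cluster partition scores zero, which is the behavior COBWEB's update decisions rely on.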
Numeric attributes?

• Assume a normal distribution:

  f(a) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(a-\mu)^2}{2\sigma^2}}

• Then:

  \sum_j P(a_i = v_{ij})^2 \Leftrightarrow \int f(a_i)^2 \, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}

• Thus CU becomes

  CU = \frac{1}{k} \sum_l P(C_l) \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right)

• A pre-specified minimum variance can be enforced to
combat overfitting (called the acuity parameter)
Multi-instance learning

• Recap: multi-instance learning is concerned with examples
corresponding to sets (really, bags or multisets) of instances
• All instances have the same attributes, but the full set of instances is split
into subsets of related instances that are not necessarily independent
• These subsets are the examples for learning
• Example applications of multi-instance learning: image
classification, classification of molecules
• The simplicity-first methodology can be applied to multi-instance
learning with surprisingly good results
• Two simple approaches to multi-instance learning, both using
standard single-instance learners:
  • Manipulate the input to learning
  • Manipulate the output of learning
Aggregating the input

• Idea: convert the multi-instance learning problem into a single-instance one
  • Summarize the instances in a bag by computing the mean, mode,
minimum and maximum, etc., of each attribute as new attributes
  • The “summary” instance retains the class label of its bag
  • To classify a new bag, the same process is used
• Any single-instance learning method, e.g., decision tree
learning, can be applied to the resulting data
• This approach discards most of the information in a bag of
instances but works surprisingly well in practice
• It should be used as a baseline when evaluating more
advanced approaches to multi-instance learning
• More sophisticated region-based summary statistics can be
applied to extend this basic approach
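The summarization step can be sketched as follows for numeric attributes (our own helper names; bags are lists of equal-length instance tuples):

```python
def aggregate_bag(bag):
    """Summarize a bag of numeric instances into one instance by taking
    the min, max and mean of each attribute as new attributes."""
    summary = []
    for values in zip(*bag):                    # iterate attribute-wise
        summary += [min(values), max(values), sum(values) / len(values)]
    return tuple(summary)

def bags_to_single_instances(bags):
    """(bag, label) pairs -> (summary_instance, label) pairs, ready for
    any standard single-instance learner."""
    return [(aggregate_bag(bag), label) for bag, label in bags]
```

The resulting (summary, label) pairs form an ordinary single-instance dataset, so a decision tree learner, for example, can be trained on them unchanged.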
Aggregating the output

• Idea: learn a single-instance classifier directly from the
original instances in all the bags
  • Each instance is given the class of the bag it originates from
  • But bags can contain differing numbers of instances, so give each
instance a weight inversely proportional to the bag's size
• To classify a new bag:
  • Produce a prediction for each instance in the bag
  • Aggregate the predictions to produce a prediction for the bag as a
whole
  • One approach: treat the predictions as votes for the various class labels;
alternatively, average the class probability estimates
• This approach treats all instances as independent at training
time, which is not the case in true multi-instance applications
• Nevertheless, it often works very well in practice and should
be used as a baseline
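Both halves of this approach, the weighted flattening for training and the voting for prediction, can be sketched briefly (our own helper names; `instance_classifier` stands in for any trained single-instance model):

```python
from collections import Counter

def train_instances(bags):
    """Flatten (bag, label) pairs into weighted single instances: each
    instance gets its bag's label and weight 1/len(bag), so every bag
    contributes equally regardless of its size."""
    return [(inst, label, 1.0 / len(bag)) for bag, label in bags for inst in bag]

def classify_bag(bag, instance_classifier):
    """Predict each instance's class, then let the instances vote."""
    votes = Counter(instance_classifier(inst) for inst in bag)
    return votes.most_common(1)[0][0]
```

Averaging class probability estimates instead of counting votes only changes the aggregation step; the training side stays the same.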
Clustering

• Clustering is an unsupervised learning method: there is no target value
(class label) to be predicted; the goal is finding common patterns or
grouping similar examples.
• Differences between models/algorithms for clustering:
– Conceptual