

In some large data sets, we are interested in discovering groups of items that are similar in some way. For example, suppose we are trying to determine whether a suite of test results is indicative of a malignant tumor. If we have these test results for a large number of patients, together with information about whether each patient has been diagnosed with a malignant tumor, then we can test whether clusters of similar test results correspond uniformly to the same diagnosis. If they do, the clustering is evidence that the tests can be used to screen for malignancy, and the clusters give insights into the range of results that correspond to that diagnosis. Because data clustering can result in deep insights like this, it is another fundamental technique used in data mining.



Task #1: Loading the Medical Data




The comma-separated file "breast-cancer-data.csv" has been provided for you. This file contains 683 rows, where each row represents an individual. Each row consists of 11 values:



  • ID -- sample number

  • Clump Thickness -- integer scale from 1 - 10

  • Uniformity of Cell Size -- integer scale from 1 - 10

  • Uniformity of Cell Shape -- integer scale from 1 - 10

  • Marginal Adhesion -- integer scale from 1 - 10

  • Single Epithelial Cell Size -- integer scale from 1 - 10

  • Bare Nuclei -- integer scale from 1 - 10

  • Bland Chromatin -- integer scale from 1 - 10

  • Normal Nucleoli -- integer scale from 1 - 10

  • Mitoses -- integer scale from 1 - 10

  • Class -- classification of tumor (2 = benign, 4 = malignant)


These values were taken from actual patients. Columns 2-10 have all been normalized to a 1-10 integer scale. Column 11 is the actual diagnosis from a medical expert.




We want to take this data and create an automated system that can classify new measurements (Columns 2-10) as either benign or malignant tumors.




Create the following function:




read_file(filename) -- Return two lists (test results and diagnoses) based on the input file.




This function reads the file specified by filename and creates two lists. The first contains the 9 test results (Columns 2-10), so it is a list of lists of 9 integers each. The second contains the diagnoses (Column 11), so it is a list of integers (either a 2 or a 4 in each position). The function returns a tuple of these lists.
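As a guide, here is a minimal sketch of read_file, assuming the file has no header row (each of the 683 lines is one individual, as described above):

import csv

def read_file(filename):
    results = []    # list of lists of 9 integers (Columns 2-10)
    diagnoses = []  # list of integers: 2 = benign, 4 = malignant
    with open(filename, 'r', newline='') as infile:
        for row in csv.reader(infile):
            values = [int(value) for value in row]
            results.append(values[1:10])  # skip the ID in Column 1
            diagnoses.append(values[10])  # Column 11: the diagnosis
    return results, diagnoses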



Clustering Algorithms




Algorithms that cluster items according to their similarity are easiest to understand initially with two-dimensional data (e.g., longitude and latitude) that can be represented visually. Therefore, before you tackle the tumor diagnosis problem, we will look at simple, two-dimensional examples.




The clustering problem turns out to be very difficult, and there are no known efficient algorithms that solve it exactly. Instead, people use heuristics. A heuristic is an algorithm that does not necessarily give the best answer, but tends to work well in practice. Colloquially, you might think of a heuristic as a "good rule of thumb." We will discuss a common clustering heuristic known as k-means clustering.




In k-means clustering, the data is partitioned into k clusters, and each cluster has an associated centroid, defined to be the mean of the points in the cluster. The k-means clustering heuristic consists of a number of iterations in which each point is (re)assigned to the cluster of the closest centroid, and then the centroids of each cluster are recomputed based on their potentially new membership.




Task #2: Defining Similarity




Similarity among items in a data set can be defined in a variety of ways; here we will define similarity simply in terms of normal Euclidean distance. If each item in our data set has m numerical attributes, then we can think of these attributes as a point in m-dimensional space. Given two m-dimensional points p and q, we can find the distance between them using the formula


distance(p, q) = sqrt((p[0] - q[0])**2 + (p[1] - q[1])**2 + ... + (p[m-1] - q[m-1])**2)


In Python, we can represent each point as a list of m numbers. For example, each of the following three lists might represent six test results (m = 6) for one patient:


p1 = [85, 92, 45, 27, 31, 0.0]
p2 = [85, 64, 59, 32, 23, 0.0]
p3 = [86, 54, 33, 16, 54, 0.0]


But since we are not going to need to change any element of a point once we read it from a data file, a better alternative is to store each point in a tuple, as follows:


p1 = (85, 92, 45, 27, 31, 0.0)
p2 = (85, 64, 59, 32, 23, 0.0)
p3 = (86, 54, 33, 16, 54, 0.0)


To find the distance between tuples p1 and p2, we can compute


sqrt((85 - 85)**2 + (92 - 64)**2 + (45 - 59)**2 + (27 - 32)**2 + (31 - 23)**2 + (0 - 0)**2) ~ 32.7




and to find the distance between tuples p1 and p3, we can compute


sqrt((85 - 86)**2 + (92 - 54)**2 + (45 - 33)**2 + (27 - 16)**2 + (31 - 54)**2 + (0 - 0)**2) ~ 47.3.




Because the distance between p1 and p2 is less than the distance between p1 and p3, we consider the results for patient p1 to be more similar to the results for patient p2 than to the results for patient p3.




Create the following function:




distance(p, q) -- Return the Euclidean distance between p and q.




This function returns the Euclidean distance between two m-dimensional points.
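A minimal sketch, translating the formula above directly into Python:

from math import sqrt

def distance(p, q):
    # Sum the squared differences in each dimension, then take the square root.
    return sqrt(sum((p[i] - q[i]) ** 2 for i in range(len(p))))

With the tuples above, distance(p1, p2) gives about 32.7 and distance(p1, p3) about 47.3.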



Task #3: Calculating Centroids




Once we have our data points assigned to a particular cluster, we have to determine the center point (centroid) of that cluster. This is simply the average value of all points in the cluster. (Examples of this can be found below in the full k-means example.)




Create the following function:




centroid(cluster, data) -- Return the centroid of the cluster.




This function computes the centroid of the given cluster. The centroid is an m-element tuple (c1, c2, ..., cm), where each ci is the average of the i-th coordinate values of the points in the cluster.
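Here is one possible sketch, assuming (as in the k-means outline below) that cluster is a list of indices into data:

def centroid(cluster, data):
    m = len(data[0])                  # dimensionality of each point
    totals = [0] * m
    for dataIndex in cluster:         # sum each coordinate over the cluster
        for i in range(m):
            totals[i] += data[dataIndex][i]
    return tuple(total / len(cluster) for total in totals)

For example, with the six points from the walkthrough below, centroid([0, 1, 3], [(0.5, 5), (1.75, 5.75), (3.5, 2), (4, 3.5), (5, 2), (7, 1)]) returns approximately (2.08, 4.75), matching the red centroid computed there.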





A k-means Clustering Example




To illustrate in more detail how the heuristic works, consider the following small set of six points, and suppose we want to group them into k = 2 clusters. The lower left corner of the rectangle has coordinates (0, 0) and each square is one unit on a side. So our six points have coordinates (0.5, 5), (1.75, 5.75), (3.5, 2), (4, 3.5), (5, 2), and (7, 1).









We begin by randomly choosing k = 2 of our points to be the initial centroids. In the figure below, we chose one centroid to be the red point at (0.5, 5) and the other centroid to be the blue point at (7, 1).









In the first iteration of the heuristic, for every point, we compute the distance between the point and each of the two centroids, represented by the dashed line segments. We assign each point to a cluster associated with the centroid to which it is closest. In this example, the three points circled in red below are closest to the red centroid and the three points circled in blue below are closest to the blue centroid.









Once every point is assigned to a cluster, we compute a new centroid for each cluster, defined to be the mean of all the points in the cluster. In our example, the x and y coordinates of the new centroid of the red cluster are (0.5 + 1.75 + 4)/3 = 25/12 ~ 2.08 and (5 + 5.75 + 3.5)/3 = 4.75. The x and y coordinates of the new centroid of the blue cluster are (3.5 + 5 + 7)/3 = 31/6 ~ 5.17 and (2 + 2 + 1)/3 = 5/3 ~ 1.67. These are the new red and blue points shown below.









In the second iteration of the heuristic, we once again compute the closest centroid for each point. As illustrated below, the point at (4, 3.5) is now grouped with the lower right points because it is closer to the blue centroid than to the red centroid (distance((2.08, 4.75), (4, 3.5)) ~ 2.29 and distance((5.17, 1.67), (4, 3.5)) ~ 2.17).









Next, we once again compute new centroids. The x and y coordinates of the new centroid of the red cluster are (0.5 + 1.75)/2 = 1.125 and (5 + 5.75)/2 = 5.375. The x and y coordinates of the new centroid of the blue cluster are (3.5 + 4 + 5 + 7)/4 = 4.875 and (1 + 2 + 2 + 3.5)/4 = 2.125. These new centroids are shown below.









In the third iteration, when we find the closest centroid for each point, we find that nothing changes. Therefore, these clusters are the final ones chosen by the heuristic.



Task #4: Implementing k-means Clustering




To implement this heuristic in Python, we define a function named kmeans with three parameters: a list of tuples named data, the number of clusters k, and iterations, the number of iterations in the heuristic. If we were calling the function with the points in the previous example, then data would be assigned the list


[(0.5, 5), (1.75, 5.75), (3.5, 2), (4, 3.5), (5, 2), (7, 1)]




The function needs to use the sample function from the random module to choose k random tuples from data to be the initial centroids, assigned to a variable named centroids. The remainder of the function should consist of iterations passes over the list of points, controlled by a for loop, as sketched below.
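For instance, the function might begin like this (a fragment, not the whole function):

import random

def kmeans(data, k, iterations):
    centroids = random.sample(data, k)  # k distinct random starting centroids
    for iteration in range(iterations):
        ...                             # one pass, described next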




Next, within the loop, we first create a list of k empty lists, one for each cluster. For example, if k were 4, then clusters would be assigned the list [[], [], [], []].




The first empty list, clusters[0], will be used to hold the indices of the points in the cluster associated with centroids[0]; the second empty list, clusters[1], will be used to hold the indices of the points in the cluster associated with centroids[1]; and so on. Some care has to be taken to create this list of lists. (Saying clusters = [[]] * k will not work. If you try it, you will see why.)
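One safe way is a list comprehension:

# [[]] * k would create k references to the same single list, so appending
# to one cluster would appear to append to all of them.
clusters = [[] for _ in range(k)]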




Next, we iterate over the indices of the points in data. In each iteration, we need to find the centroid that is closest to the point data[dataIndex]. This inner for loop is essentially the same algorithm that we would use to find the minimum value in a list. The difference here is that we need to find the index of the centroid with the minimum distance to data[dataIndex]. Therefore, we maintain the index of the closest centroid in the variable named minIndex. Once we have found the final value of minIndex, we append dataIndex to the list of indices assigned to the cluster with index minIndex.
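In code, this assignment step might look like the following sketch, using the variable names from the description above:

for dataIndex in range(len(data)):
    minIndex = 0                      # index of the closest centroid so far
    for clusterIndex in range(1, k):
        if (distance(data[dataIndex], centroids[clusterIndex])
                < distance(data[dataIndex], centroids[minIndex])):
            minIndex = clusterIndex
    clusters[minIndex].append(dataIndex)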




After we have assigned all of the points to a cluster, we compute each new centroid with our centroid function.
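That step might be written as:

# Replace each centroid with the mean of its cluster's current members.
centroids = [centroid(clusters[clusterIndex], data) for clusterIndex in range(k)]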




Create the following function:




kmeans(data, k, iterations) -- Cluster data into k clusters using k-means.




This function should be implemented as outlined in the previous section.







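As an additional guide, here is a minimal Python sketch that puts everything together. It assumes the distance and centroid functions from Tasks #2 and #3; since the description above does not specify a return value, this sketch returns the list of clusters:

import random

def kmeans(data, k, iterations):
    centroids = random.sample(data, k)          # k random starting centroids
    clusters = [[] for _ in range(k)]
    for iteration in range(iterations):
        clusters = [[] for _ in range(k)]       # k independent empty lists
        for dataIndex in range(len(data)):
            minIndex = 0                        # index of the closest centroid
            for clusterIndex in range(1, k):
                if (distance(data[dataIndex], centroids[clusterIndex])
                        < distance(data[dataIndex], centroids[minIndex])):
                    minIndex = clusterIndex
            clusters[minIndex].append(dataIndex)
        # Recompute each centroid from its cluster's new membership.
        centroids = [centroid(clusters[clusterIndex], data)
                     for clusterIndex in range(k)]
    return clusters

Note that this sketch does not guard against a cluster becoming empty, which would make centroid divide by zero; with this data set and a small k, that is unlikely but possible.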
Once you have implemented all of the functions correctly, the main program should report that the two clusters have (benign=9, malignant=221) and (benign=435, malignant=18). It will also produce a scatterplot of the clusters along two chosen dimensions (1 and 2 in the default case, which correspond to uniformity of cell size and cell shape). Realize that clusters may overlap when viewed in two dimensions even though they don't overlap in the full nine dimensions.
