

Overview and Assignment Goals:


The objectives of the assignment are the following:



  • Implement the Bisecting K-Means algorithm.

  • Work with text data (news records) in document-term sparse matrix format.

  • Design a proximity function for text data.

    • Think about the curse of dimensionality.

  • Think about the best metrics for evaluating clustering solutions.
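
As a rough illustration of the sparse-format and proximity goals above, here is a minimal Python sketch. It is not the course's reference code. It assumes each input line holds whitespace-separated (term id, value) pairs with 1-based term ids, and it picks cosine similarity as the proximity function, a common choice for high-dimensional text. The function names `read_pairs_as_csr`, `l2_normalize_rows`, and `cosine_similarity_matrix` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def read_pairs_as_csr(lines, one_based=True):
    """Build a CSR matrix from lines of 'term_id value term_id value ...'.

    Assumption: term ids are 1-based unless one_based=False.
    """
    indptr, indices, data = [0], [], []
    offset = 1 if one_based else 0
    for line in lines:
        tokens = line.split()
        # Walk the tokens two at a time: (term id, value).
        for j in range(0, len(tokens) - 1, 2):
            indices.append(int(tokens[j]) - offset)
            data.append(float(tokens[j + 1]))
        indptr.append(len(indices))  # marks where this row ends
    ncols = max(indices) + 1 if indices else 0
    return csr_matrix((data, indices, indptr),
                      shape=(len(indptr) - 1, ncols), dtype=float)


def l2_normalize_rows(mat):
    """Scale every row to unit Euclidean length; all-zero rows stay zero."""
    norms = np.sqrt(np.asarray(mat.multiply(mat).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty rows
    return csr_matrix(diags(1.0 / norms) @ mat)


def cosine_similarity_matrix(normed):
    """Pairwise cosine similarities of rows that are already unit length."""
    return (normed @ normed.T).toarray()
```

With unit-length rows, cosine similarity reduces to a sparse dot product, which keeps the proximity computation cheap even when the vocabulary is very large. Building the full pairwise matrix is only practical for small inputs; for clustering, similarities to a handful of centroids are usually enough.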




Announcement by instructor:

I am getting a lot of emails regarding the input format for the data in Program 3. As I explained, each row of the file represents an object, and the numbers are pairs. You have been working with SciPy csr_matrix structures all semester, so it should not be very hard to read the text file into such a structure. For those of you who have figured it out, good for you! For the rest, I would rather you focus on the main algorithms than on this step, so I am providing you with a csr_read function in the attached file. Analyze it so you know how to do this in the future. As a side note, I have seen a lot of interview questions involving sparse data that rely on a good understanding of these structures: how to create them, how to traverse them, etc.

Having read the data into a CSR matrix, you now need to:

  • Implement the K-Means algorithm in a function that takes as input the CSR matrix, a subset of the rows in the matrix (e.g., as a list of row ids), and a number of clusters k. The algorithm needs to partition the list of rows into k distinct sets, using the K-Means algorithm discussed in class, where the objects are the associated rows in the input matrix.

  • Implement the Bisecting K-Means algorithm, which relies on the K-Means algorithm to do the partitioning by splitting lists in two, one at a time, until you have k lists. You just need to implement a function that figures out the next list (cluster) to be bisected (split in two). Look at the slides for adequate criteria to help you make that choice.

  • Run the Bisecting K-Means algorithm per the requirements in the assignment, writing out a file with the cluster ID of each row in the input matrix, where cluster IDs start at 1.

You can use libraries for internal measures of clustering effectiveness needed for the report.
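
The K-Means and Bisecting K-Means steps described in the announcement might be sketched as follows. This is one possible reading, not the instructor's solution. It assumes the rows of the CSR matrix are already L2-normalized, uses cosine similarity to assign rows to centroids (spherical K-Means), and picks the cluster with the largest sum of squared errors (SSE) as the next one to bisect, which is only one of several reasonable criteria. The names `kmeans_rows`, `cluster_sse`, and `bisecting_kmeans` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix


def kmeans_rows(mat, row_ids, k, n_iter=20, seed=0):
    """Partition row_ids into k lists with cosine-based K-Means.

    Assumption: rows of mat are L2-normalized and len(row_ids) >= k.
    """
    rng = np.random.default_rng(seed)
    rows = np.asarray(row_ids)
    sub = mat[rows]
    # Initial centroids: k distinct rows picked at random.
    centroids = sub[rng.choice(len(rows), size=k, replace=False)].toarray()
    labels = np.zeros(len(rows), dtype=int)
    for it in range(n_iter):
        sims = np.asarray(sub @ centroids.T)  # dense (n_rows, k) similarities
        new_labels = sims.argmax(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = sub[np.flatnonzero(labels == c)]
            if members.shape[0] > 0:
                centroid = np.asarray(members.mean(axis=0)).ravel()
                norm = np.linalg.norm(centroid)
                centroids[c] = centroid / norm if norm > 0 else centroid
    return [rows[labels == c].tolist() for c in range(k)]


def cluster_sse(mat, row_ids):
    """Sum of squared distances from each row to the cluster mean."""
    sub = mat[row_ids]
    mean = np.asarray(sub.mean(axis=0)).ravel()
    return float(sub.multiply(sub).sum() - len(row_ids) * mean.dot(mean))


def bisecting_kmeans(mat, k, seed=0):
    """Return one cluster ID per row, with IDs starting at 1."""
    clusters = [list(range(mat.shape[0]))]
    while len(clusters) < k:
        candidates = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not candidates:
            break  # nothing left that can be split
        # Assumed criterion: bisect the cluster with the largest SSE.
        target = max(candidates, key=lambda i: cluster_sse(mat, clusters[i]))
        rows = clusters.pop(target)
        left, right = kmeans_rows(mat, rows, 2, seed=seed + len(clusters))
        if not left or not right:
            # Degenerate split (e.g. identical rows): cut the list in half
            # so the loop always terminates.
            half = len(rows) // 2
            left, right = rows[:half], rows[half:]
        clusters += [left, right]
    labels = np.zeros(mat.shape[0], dtype=int)
    for cid, rows in enumerate(clusters, start=1):
        labels[rows] = cid
    return labels
```

The resulting labels could then be written one per line with something like `np.savetxt("clusters.txt", labels, fmt="%d")`, where the file name is a placeholder. For the report, a library-provided internal measure such as scikit-learn's `silhouette_score(mat, labels, metric="cosine")` is one option for comparing solutions across values of k.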
Nov 28, 2019