Assignment 2: k-Means Clustering (Programming)

  

Your task for this assignment is to implement and evaluate the k-means clustering algorithm. 

 

1.      Implement the k-means clustering algorithm.   

a.      You can use any programming language that you are familiar with.

b.      The program must accept at least three command-line parameters: the name of the dataset file, k, and the name of the output file.

2.      Select two datasets from the UCI repository or other publicly available sources, decide how you will measure the quality of the clusters produced, and evaluate your implementation on both datasets for varying values of k.

3.      Write a brief report to:

a.      Describe the datasets and your quality metrics. 

b.      Describe your experimental setup and implementation details, such as how you preprocessed the data (if at all), and which distance metric you chose and why.

c.      Present the experimental results in plots as a function of k.

d.      Discuss the insights and conclusions from your experiments. For example, does k-means work well for the datasets you selected? Why or why not? What is the impact of k? How might data preprocessing help?

4.      Your deliverable should be a single tar or zip file that contains your source files, the executable, a README file explaining how to compile and run your program, and the report in PDF format.
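
As an illustration of items 1 and 2 only, here is a minimal sketch of what a submission might look like in Python. It is not a reference solution. Everything specific in it is an assumption and not a requirement of the assignment: the input is taken to be a headerless CSV file with numeric columns only, the output is one cluster label per line, the distance is squared Euclidean, the initial centroids are k randomly sampled points (so k must not exceed the number of points), and the quality metric is the sum of squared errors (SSE). The function names are hypothetical.

```python
import csv
import random
import sys


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point (ties go to the lowest index)."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))


def main(argv):
    """Expects: <dataset file> <k> <output file>."""
    dataset_path, k, output_path = argv[1], int(argv[2]), argv[3]
    with open(dataset_path, newline="") as f:
        points = [[float(v) for v in row] for row in csv.reader(f) if row]
    labels, centroids = kmeans(points, k)
    with open(output_path, "w", newline="") as f:
        csv.writer(f).writerows([lab] for lab in labels)
    print(f"k={k} SSE={sse(points, labels, centroids):.4f}")


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

If the sketch were saved as kmeans.py (a hypothetical file name), it could be run as `python kmeans.py data.csv 3 labels.txt`. Running it for a range of k values and recording the printed SSE gives the data for the plots requested in item 3c. SSE always tends to decrease as k grows, so it is usually read by looking for the point where the curve flattens, not by picking the minimum.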