Assignment 1: Frequent Itemsets Mining and Performance Competition (Programming)

 

Out: 8/29/2017

Due: 9/12/2017, 11:59pm

 

Your task for this assignment is to implement and evaluate an Apriori-based algorithm for frequent itemset mining.

 

1.      Implement the Apriori algorithm originally proposed by Agrawal and Srikant [AS94b] for frequent itemset mining.  You can find the pseudocode and its related procedures in the lecture slides and the textbook.  You are encouraged to use any optimization techniques for the Apriori algorithm, existing or your own.  If you do, explain and discuss the techniques you used and provide the appropriate references in the report.

2.      Test your implementation on the dataset T10I4D100K (.dat, .gz) and measure the execution time and the number of frequent itemsets for various minimum support values.  The test dataset is a synthetic dataset that contains 100,000 transactions with an average size of 10 items, drawn from a set of 1000 distinct items.  A detailed description of the dataset can be found in [AS94b].  You can also try your program on other frequent itemset mining datasets.

3.      Write a brief report in PDF presenting your results on the test dataset and on any other datasets you tried.  Explain and discuss any algorithmic optimizations you used in your implementation.  Discuss your experience and the lessons you learned from the implementation.

4.      You can work as a team of up to two.  If you work on your own, you get 5 bonus points.  If you work as a team of two, please explain the contribution of each team member in your report.  

5.      Your submission should be a zip or tar file that contains the PDF report and the program deliverables: your source files, the executable, a readme file explaining how to compile and run your program, and the output file for the test dataset with minimum support count 500 (which corresponds to a relative support of 0.5%).
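
To make the level-wise structure in item 1 concrete, here is a minimal, unoptimized sketch of the join, prune, and count steps on a toy dataset. It is an illustration only, not the reference solution: the function name apriori, the variable names, and the toy transactions are all made up for this example, and the plain nested-loop counting would be far too slow for T10I4D100K.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    Hypothetical helper for illustration; min_count is an absolute support count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(s) in frequent for s in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: one scan over the transactions (naive subset tests).
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: n for s, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result


# Toy data (made up): five transactions over items 1..3.
toy = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
found = apriori(toy, min_count=3)
for itemset, count in sorted(found.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

With a minimum support count of 3, the toy run finds the three single items and the three pairs; {1, 2, 3} appears in only two transactions and is dropped. For the real dataset, a relative support of 0.5% over 100,000 transactions gives the count of 500 mentioned in item 5.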

 

Note: Please start early, and be warned that an implementation without efficient data structures could run for days!  There are a few online repositories of frequent pattern mining implementations, most notably the FIMI repository.  You may study them, but you are asked not to copy their implementations for this assignment.
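
As one small illustration of why the data structure matters, the sketch below contrasts two ways of counting candidate support. The first tests every candidate against every transaction; the second enumerates the k-subsets of each transaction and looks each one up in a hash table. The function names and toy data are hypothetical, and this is only one possible approach: [AS94b] describes a hash tree for this step, and subset enumeration itself becomes expensive for long transactions or large k.

```python
from itertools import combinations


def count_by_scanning(transactions, candidates):
    """Naive counting: test every candidate against every transaction."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        items = frozenset(t)
        for c in candidates:
            if c <= items:
                counts[c] += 1
    return counts


def count_by_lookup(transactions, candidates, k):
    """Enumerate the k-subsets of each transaction and look them up in a dict."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(set(t)), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts


# Toy data (made up): candidate 2-itemsets over five small transactions.
toy = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
cands = {frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})}

# Both strategies must agree on the counts; they differ only in cost.
print(count_by_scanning(toy, cands) == count_by_lookup(toy, cands, 2))
```

For a sense of scale: a transaction of 10 items has only 45 two-item subsets, while 1000 distinct items allow up to 499,500 candidate pairs, so the cost per transaction of the first strategy grows with the number of candidates and the second does not.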

 

Competition

We will run a competition using a few test datasets and select the top three winners that offer the best performance with correct and complete results.  For fairness, no multithreading or parallelization may be used for the competition.  You can see a sample of the competition results from a previous offering of the class.