Assignment 1: Frequent Itemsets Mining and Performance Competition (Programming)


Out: 1/31/2018

Due: 2/14/2018, 11:59pm


Your task for this assignment is to implement and evaluate an Apriori-based algorithm for frequent itemset mining.


1.      Implement the Apriori algorithm originally proposed by Agrawal et al. [AS94b] for frequent itemset mining. You can find the pseudocode and its related procedures in the lecture slides and the textbook.  You are not required, but are encouraged, to apply optimization techniques to the Apriori algorithm, whether existing ones or your own.  If you do, explain and discuss the techniques you used and, where applicable, provide the appropriate references in the report.

2.      Test your implementation on the dataset T10I4D100K (.dat, .gz) and measure the execution time and the number of frequent itemsets for various minimum support values.  The test dataset is a synthetic dataset that contains 100,000 transactions with an average size of 10 items drawn from a set of 1000 distinct items.  A detailed description of the dataset can be found in [AS94b].  You can also try your program on other frequent itemset mining datasets.

3.      Write a brief report in PDF presenting your results on the test dataset and on any other datasets you tried.  Explain and discuss any algorithmic optimizations you used in your implementation.  Discuss the experiences and lessons you learned from the implementation.

4.      You can work as a team of up to two.  If you work on your own, you get 5 bonus points.  If you work as a team of two, please explain the contribution of each team member in your report.  

5.      Submission. You (or your partner) will upload two items to Canvas: your PDF report and a zip or tar file. This zip/tar file must contain:
your source files (include your name(s) in commented form at the top of all source files),
the executable (which takes three parameters: input filename, minimum support count, and output filename, in that order),
a README file explaining how to compile/run your program,
and a file named 'output500.dat' which is your solution for the dataset T10I4D100K with minimum support count 500 (which corresponds to a relative support of 0.5%).
Any deviations from these submission instructions will irritate your overwhelmed TA while he is determining your grade.
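
The level-wise search asked for in item 1 can be sketched as follows. This is a minimal, unoptimized Python illustration and not the reference pseudocode from [AS94b]. The function names read_transactions and apriori are hypothetical, and the input is assumed to hold one transaction per line with whitespace-separated items. The naive counting loop is exactly the kind of code that becomes slow once the candidate set grows.

```python
from collections import defaultdict
from itertools import combinations


def read_transactions(path):
    """Read one transaction per line (assumed format: whitespace-separated items)."""
    with open(path) as handle:
        return [frozenset(line.split()) for line in handle if line.strip()]


def apriori(transactions, min_count):
    """Return {itemset (frozenset): support count} for every frequent itemset."""
    # Pass 1: count single items and keep those meeting the threshold.
    counts = defaultdict(int)
    for transaction in transactions:
        for item in transaction:
            counts[frozenset([item])] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: unite pairs of frequent (k-1)-itemsets that give a k-itemset.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in frequent
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: one scan over the data, testing each candidate naively.
        counts = defaultdict(int)
        for transaction in transactions:
            for candidate in candidates:
                if candidate <= transaction:
                    counts[candidate] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result
```

Calling apriori(read_transactions("T10I4D100K.dat"), 500) would then correspond to the minimum support count named in item 5, assuming the file matches the format above. How the result is written to the output file is left open here, since the handout does not specify an output format.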


Note: Please start early, and be warned that an implementation without careful planning or efficient data structures could run for days!  There are a few online repositories of frequent pattern mining implementations, most notably the FIMI repository.  You may study them, but you are asked not to copy their implementations for this assignment. Remember that the Honor Code applies, and an automatic plagiarism checker will be used on submissions.
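
As one illustration of why data structures matter, the support-counting step can look up each transaction's k-item subsets in a hash set instead of testing every candidate against every transaction. The sketch below is only one possible technique and is not prescribed by the assignment. The name count_candidates and the representation of candidates as sorted tuples are assumptions made for this example. It pays off when transactions are short, as in T10I4D100K, and degrades when transactions are long and k is large.

```python
from collections import Counter
from itertools import combinations


def count_candidates(transactions, candidates, k):
    """Count k-item candidates by enumerating k-subsets of each transaction.

    `candidates` is assumed to be a set of sorted tuples of length k.
    """
    # Items that appear in no candidate can never contribute to a count.
    useful_items = {item for candidate in candidates for item in candidate}
    counts = Counter()
    for transaction in transactions:
        # Filter, then sort so subsets are generated in the same canonical
        # order as the candidate tuples.
        kept = sorted(item for item in transaction if item in useful_items)
        if len(kept) < k:
            continue
        for subset in combinations(kept, k):
            if subset in candidates:  # average O(1) hash lookup
                counts[subset] += 1
    return counts
```

With about 10 items per transaction, the pair-counting pass performs at most a few dozen lookups per transaction, regardless of how many candidate pairs exist.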




We will run a competition using a few test datasets and select the top three winners that offer the best performance with correct and complete results. For fairness, no multithreading or parallelization should be used for the competition.  Prizes include an extra late pass, choice of final project presentation date, and Starbucks gift certificates! :) You can see a sample of the competition results from a previous offering of the class.
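
For the timing measurements in item 2, and for checking single-threaded performance before the competition, a small harness along these lines may help. It is a sketch under assumptions: time_miner is a hypothetical helper, and mine stands for whatever mining function you implement, assumed to take a list of transactions and a minimum support count and to return a collection of frequent itemsets.

```python
import time


def time_miner(mine, transactions, min_counts):
    """Run `mine(transactions, min_count)` once per threshold and time it.

    Returns a list of (min_count, number_of_itemsets, seconds) rows.
    """
    rows = []
    for min_count in min_counts:
        start = time.perf_counter()  # monotonic, high-resolution clock
        itemsets = mine(transactions, min_count)
        elapsed = time.perf_counter() - start
        rows.append((min_count, len(itemsets), elapsed))
    return rows
```

The returned rows can go straight into the report as a table of minimum support count, number of frequent itemsets, and execution time.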