Assignment 1. Differentially private histogram (programming)
The main goal of this task is to understand the concept of differential privacy by implementing
and evaluating a differentially private histogram. A differentially private histogram
can be implemented using the standard Laplace mechanism to generate noisy histogram bin counts.
Synthetic records can then be generated from the differentially private
histogram (you can round all negative bin counts to zero) to answer random range
queries. The query accuracy is primarily measured by the relative error
between the true answer from the original data and the answer from the
differentially private data.
(You can find more details in "Differentially Private Histogram and Synthetic Data Publication".)
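The noisy-histogram step described above can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes neighboring datasets differ by adding or removing one record (so each bin count has sensitivity 1 and receives Laplace noise of scale 1/ϵ), it uses the attribute domains listed later in this handout, and the function names are hypothetical. Laplace noise is drawn as the difference of two exponential samples so that only the standard library is needed.

```python
import random
from collections import Counter
from itertools import product

# Attribute domains as listed in the assignment.
AGES = range(17, 91)
GENDERS = (1, 2)
RACES = range(1, 6)


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(records, epsilon, rng=None):
    """Return {(age, gender, race): noisy count} over the full domain.

    Assumption: sensitivity 1 (add/remove one record), so every bin,
    including empty ones, gets independent Laplace(1 / epsilon) noise.
    """
    if rng is None:
        rng = random.Random()
    true_counts = Counter(records)
    scale = 1.0 / epsilon
    hist = {}
    for cell in product(AGES, GENDERS, RACES):
        noisy = true_counts.get(cell, 0) + laplace_noise(scale, rng)
        hist[cell] = max(0.0, noisy)  # round negative bin counts up to zero
    return hist
```

Noise is added to every cell of the domain, not only to cells that appear in the data, because publishing only the non-empty cells would itself reveal which combinations occur.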
Your task for this assignment is to implement a differentially private histogram using the Laplace mechanism
and evaluate it using random range queries.
Input and output
Your program should take the following parameters:
- Input dataset file (in csv format)
- Privacy budget ϵ (a value in the open interval (0, 1))
- Output synthetic dataset file (in csv format)
- Number of random queries to run when reporting the query error
- Output relative error
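One possible way to accept these five parameters from the command line is sketched below with Python's argparse module. The argument names and their order are assumptions made for illustration, and the last parameter is treated here as a file path that receives the average relative error, which the handout does not specify.

```python
import argparse


def epsilon_value(text):
    """Parse epsilon and reject values outside the open interval (0, 1)."""
    value = float(text)
    if not 0.0 < value < 1.0:
        raise argparse.ArgumentTypeError("epsilon must lie in (0, 1)")
    return value


def build_parser():
    """Build a parser for the five positional parameters (names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Differentially private histogram")
    parser.add_argument("input_csv", help="input dataset file (csv)")
    parser.add_argument("epsilon", type=epsilon_value, help="privacy budget in (0, 1)")
    parser.add_argument("output_csv", help="output synthetic dataset file (csv)")
    parser.add_argument("num_queries", type=int, help="number of random range queries")
    parser.add_argument("error_file", help="file that receives the average relative error")
    return parser
```

A call such as `build_parser().parse_args(["adult.csv", "0.5", "synthetic.csv", "1000", "error.txt"])` would then yield the parsed values as attributes of the returned namespace.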
Your program needs to do the following:
- Construct private histograms with noisy bin counts using the basic
Laplace mechanism and the given privacy budget
- Construct private synthetic datasets based on the private histogram
- Perform random range queries over the three attributes and compute
the average relative error.
- E.g., Select Count(*) where age in [min, max] and gender = x and race = y.
- Age: [17,18, …, 90],
- Gender: 1 (male) or 2 (female),
- Race: [1,2,…,5]
- You are free to use any optimizations (check the DPBench paper from
our reading list for existing techniques) to enhance accuracy
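The synthetic-data and evaluation steps in the list above could look roughly like the sketch below. Several details are assumptions rather than requirements of the assignment: noisy counts are rounded to the nearest integer before records are generated, query ranges are drawn uniformly over the stated domains, and the relative error uses a hypothetical `floor` parameter as the denominator whenever the true answer is smaller, which avoids division by zero. All function names are illustrative.

```python
import random


def synthesize(hist):
    """Expand a noisy histogram {(age, gender, race): count} into records.

    Assumption: each count is clamped at zero and rounded to the nearest integer.
    """
    records = []
    for cell, count in hist.items():
        records.extend([cell] * int(round(max(0.0, count))))
    return records


def range_count(records, age_lo, age_hi, gender, race):
    """Count records with age in [age_lo, age_hi] and the given gender and race."""
    return sum(1 for a, g, r in records
               if age_lo <= a <= age_hi and g == gender and r == race)


def average_relative_error(original, synthetic, num_queries, rng, floor=1.0):
    """Average |true - private| / max(true, floor) over random range queries.

    `floor` is a hypothetical sanity bound, not part of the assignment text.
    """
    total = 0.0
    for _ in range(num_queries):
        # Draw two ages and order them to form a valid range [lo, hi].
        lo, hi = sorted((rng.randint(17, 90), rng.randint(17, 90)))
        gender = rng.randint(1, 2)
        race = rng.randint(1, 5)
        true = range_count(original, lo, hi, gender, race)
        private = range_count(synthetic, lo, hi, gender, race)
        total += abs(true - private) / max(true, floor)
    return total / num_queries
```

Scanning all records for every query keeps the sketch short; for many queries, answering them from per-cell counts would be faster and give the same results.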
You can test your implementation using the provided Adult dataset
with three attributes (age, gender, race), which is extracted from the original Adult dataset
from the 1994 Census database at the UCI data repository.
You can use any programming language that you are familiar with. Document your code using comments.
- Write a brief report in PDF presenting your results on the provided test dataset
and optionally any other dataset you have tested. Explain and discuss any algorithmic optimizations you have used in your implementation. Discuss the experiences and lessons you have learned from the implementation.
- Your submission should be a zip file that contains the brief PDF report as well as the program deliverables, including your source files and a readme file explaining how to compile/run your program.
We will run a competition using a few test datasets and award prizes to the two
winners whose programs achieve the best accuracy. :)