### Finding Frequent Elements in a Data Stream

• Problem Description

• A (very long) stream of integers contains different values

Example:

 ``` 1 2 4 3 4 7 8 4 8 4 4 9 15 4 8 1 4 8 15 4 98 4 765 4 9825 8 2 4 8 4 2 ... ```

• You are given no information about the input stream

• Relative frequency:

```
                              # times that a appears in input
 Relative frequency of a = ---------------------------------
                              Total number of input values
```

• Find all elements in the input data stream with relative frequency > θ

(E.g., find elements in the input data stream with relative frequency > 0.2)

• Note: (restriction)

 You cannot assume any upper bound on the number of integers in the input stream.
 You cannot assume that the number of different integers is finite.
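For a finite prefix of the stream, the definition of relative frequency can be evaluated directly. The sketch below is only a baseline for comparison: it keeps one counter per distinct value, so its memory grows with the number of different integers, which is exactly what the restriction above rules out for the real stream. The function name `frequent_exact` is an assumption made for illustration.

```python
from collections import Counter

def frequent_exact(values, theta):
    """Return {value: relative frequency} for every value whose
    relative frequency in `values` is greater than theta."""
    counts = Counter(values)          # one counter per distinct value
    n = len(values)                   # total number of input values
    return {v: c / n for v, c in counts.items() if c / n > theta}

# The 31 values shown in the example stream above
prefix = [1, 2, 4, 3, 4, 7, 8, 4, 8, 4, 4, 9, 15, 4, 8, 1,
          4, 8, 15, 4, 98, 4, 765, 4, 9825, 8, 2, 4, 8, 4, 2]

print(sorted(frequent_exact(prefix, 0.2)))   # [4]
```

On this prefix only the value 4 qualifies: it occurs 12 times out of 31.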

• Prelude: Maximum number of frequent elements

• Claim:

 The number of elements that have relative frequency greater than θ is at most 1/θ

• Proof:

• Let N = total number of items in the input

• An element a with relative frequency greater than θ occurs more than θ×N times in the input

• If there were more than 1/θ such elements, each occurring more than θN times, then the total number of input values would be:

 ``` Total # input values > (1/θ) × (θN) = N ```

But there are exactly N values in the stream, not more than N.

This is impossible, so there can be at most 1/θ such elements.
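The same counting argument can be written out in symbols. In this sketch, $c_a$ (the number of occurrences of $a$) and $F$ (the set of elements with relative frequency greater than θ) are notation introduced here for illustration; they are not part of the original notes.

```latex
\begin{align*}
F &= \{\, a : c_a / N > \theta \,\} \\
N \;\ge\; \sum_{a \in F} c_a \;&>\; |F| \cdot \theta N \\
\Longrightarrow\quad |F| &< \frac{1}{\theta}
\end{align*}
```

The first inequality holds because distinct elements occupy distinct positions in the stream; the second because every element of $F$ occurs more than $\theta N$ times.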

• Note: limit on the size of the algorithm's data structure

 There are at most 1/θ values that have the threshold frequency.
 Therefore, we only need an array of size roughly 1/θ to store the different values !!!

• Algorithm to find frequent elements in a data stream

• Proposed Algorithm:

```
Variable utilization:

   K        = set of potentially heavy elements
   count[e] = adjusted count of element e ∈ K

K = ∅;     // Initialize

while ( not end of stream )
{
   x = next item in stream;

   /* ----------------------------------------
      Insert phase (executed for each input)
      ---------------------------------------- */
   if ( x ∈ K )
   {
      count[x]++;
   }
   else
   {
      insert x into K;
      count[x] = 1;
   }

   /* --------------------------------------------------
      Delete phase (executed whenever |K| exceeds 1/θ)
      -------------------------------------------------- */
   if ( |K| > 1/θ )
   {
      for ( each e in K )
      {
         count[e]--;
         if ( count[e] == 0 )
         {
            delete e from K;
         }
      }
   }
}
```
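The pseudocode above translates almost line for line into Python. The following is a minimal sketch, not a reference implementation: the function name `frequent_candidates` is an assumption, and a dictionary stands in for both the set K (its keys) and the `count[]` array (its values).

```python
def frequent_candidates(stream, theta):
    """One pass over `stream`; returns {element: adjusted count} for the
    elements still in K when the input ends."""
    limit = 1 / theta               # delete phase triggers when |K| > 1/theta
    count = {}                      # keys play the role of K
    for x in stream:
        # Insert phase (executed for each input)
        count[x] = count.get(x, 0) + 1
        # Delete phase (executed whenever K has grown beyond 1/theta elements)
        if len(count) > limit:
            for e in list(count):   # copy the keys so deleting inside the loop is safe
                count[e] -= 1
                if count[e] == 0:
                    del count[e]
    return count

# The 31 values shown in the example stream
prefix = [1, 2, 4, 3, 4, 7, 8, 4, 8, 4, 4, 9, 15, 4, 8, 1,
          4, 8, 15, 4, 98, 4, 765, 4, 9825, 8, 2, 4, 8, 4, 2]

print(sorted(frequent_candidates(prefix, 0.2)))   # [2, 4, 8]
```

Because `stream` can be any iterable, the function also works on a generator that never materializes the whole input; memory use is bounded by the size of the dictionary.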

• Example:

• Suppose θ = 0.2

• Input:

 ``` 1 2 4 3 4 7 8 4 8 4 4 9 15 4 8 1 4 8 15 4 98 4 765 4 9825 8 2 4 8 4 2 ... ```

• Execution of the algorithm:

```
1/θ = 1/(0.2) = 5
Delete phase is activated when |K| = 6 items

Process items until |K| = 6:    1 2 4 3 4 7 8

   After insert phase:               (1, 1), (2, 1), (4, 2), (3, 1), (7, 1), (8, 1)     (6 items)
   Delete phase decrements by 1:     (1, 0), (2, 0), (4, 1), (3, 0), (7, 0), (8, 0)
   After deleting 0-count items:     (4, 1)

Process items until |K| = 6:    4 8 4 4 9 15 4 8 1 4 8 15 4 98

   After insert phase:               (4, 7), (8, 3), (9, 1), (15, 2), (1, 1), (98, 1)   (6 items)
   Delete phase decrements by 1:     (4, 6), (8, 2), (9, 0), (15, 1), (1, 0), (98, 0)
   After deleting 0-count items:     (4, 6), (8, 2), (15, 1)

Process items until |K| = 6:    4 765 4 9825 8 2

   After insert phase:               (4, 8), (8, 3), (15, 1), (765, 1), (9825, 1), (2, 1)   (6 items)
   Delete phase decrements by 1:     (4, 7), (8, 2), (15, 0), (765, 0), (9825, 0), (2, 0)
   After deleting 0-count items:     (4, 7), (8, 2)

Process the remaining items shown:    4 8 4 2

   After insert phase:               (4, 9), (8, 3), (2, 1)     (3 items, no delete phase)
```

• Output of algorithm:

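 If an element x has relative frequency > θ, then x will be included in the output.
 Example: 4 (12 of the 31 values shown)

 However, the algorithm may generate false positives.
 Example: 2 (3 of the 31 values shown)

 Note that 8 occurs 6 times in the 31 values shown (6/31 ≈ 0.19), which is just below the threshold for this prefix; whether 8 is truly frequent depends on the rest of the stream.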
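When the input can be read a second time (for instance a stored prefix, which a true one-pass stream does not allow), the candidates can be checked with exact counts. The sketch below assumes that setting; the function name `verify_candidates` is hypothetical, and the candidate set {4, 8, 2} is the output of the example run. Only one counter per candidate is kept, so memory stays bounded by the size of the output.

```python
def verify_candidates(values, candidates, theta):
    """Second pass: count only the candidates exactly, then split them into
    truly frequent elements and false positives."""
    exact = dict.fromkeys(candidates, 0)   # one counter per candidate only
    n = 0                                  # total number of input values
    for x in values:
        n += 1
        if x in exact:
            exact[x] += 1
    frequent = {e for e, c in exact.items() if c / n > theta}
    return frequent, set(candidates) - frequent

# The 31 values shown in the example stream
prefix = [1, 2, 4, 3, 4, 7, 8, 4, 8, 4, 4, 9, 15, 4, 8, 1,
          4, 8, 15, 4, 98, 4, 765, 4, 9825, 8, 2, 4, 8, 4, 2]

frequent, false_positives = verify_candidates(prefix, {4, 8, 2}, 0.2)
print(sorted(frequent), sorted(false_positives))   # [4] [2, 8]
```

On the 31 values shown, 8 occurs 6 times (6/31 ≈ 0.19), so for this prefix it falls just under θ = 0.2 and is reported as a false positive alongside 2.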

• Final note

• Manku has improved this algorithm; the improved algorithm can remove the false positives