# ε-approximate quantiles - efficiency issues

• Coverage

• Recall that when an input value $v$ arrives and $v_{i-1} \le v < v_i$, the following tuple is inserted into the quantile summary S:

 $(v,\ 1,\ g_i + \Delta_i - 1)$

Let us denote this tuple as (v, 1, Δ)

• A tuple t in the quantile summary S covers an input value v if:

 either the tuple (v, 1, Δ) was merged into the tuple t directly, or a tuple x that covers the input value v has been merged into the tuple t, so that t covers v indirectly

• Recall that merging two adjacent tuples in the summary proceeds as follows:

```
Replace:

   ..... (v_i, g_i, Δ_i), (v_{i+1}, g_{i+1}, Δ_{i+1}) ....

by:

   ..... (v_{i+1}, g_i + g_{i+1}, Δ_{i+1}) ....
```

Example:

```
Input stream:   ... 5 9 6 ....

Insertion into Summary

Summary:  ... (5, 1, ?), (6, 1, ?), (9, 1, ?) ....

Merge (5, 1, ?), (6, 1, ?):

Summary:  ... (6, 2, ?), (9, 1, ?) ....        (6, 2, ?) now covers 5

Merge (6, 2, ?), (9, 1, ?):

Summary:  ... (9, 3, ?) ....                   (9, 3, ?) now covers 5 and 6
```

• A tuple (v, 1, ...) always covers itself

• The number of input values covered by the tuple (v, g, Δ) is equal to:

 number of input values covered = g
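The merge rule and the coverage count can be sketched in a few lines of Java. This is only an illustration, not code from the lecture: the class `CoverageSketch`, the `Tuple` type and the helper `mergeIntoRight` are hypothetical names, and Δ is set to 0 as a placeholder because the example above leaves it open ("?").

```java
import java.util.ArrayList;
import java.util.List;

class CoverageSketch {
    /** One summary entry (v, g, delta); the type is illustrative only. */
    static final class Tuple {
        final int v, g, delta;
        Tuple(int v, int g, int delta) { this.v = v; this.g = g; this.delta = delta; }
        @Override public String toString() { return "(" + v + ", " + g + ", " + delta + ")"; }
    }

    /** Merges s.get(i) into its right neighbour s.get(i + 1), in place. */
    static void mergeIntoRight(List<Tuple> s, int i) {
        Tuple left = s.get(i);
        Tuple right = s.get(i + 1);
        // The right tuple keeps its value and delta and absorbs the left g.
        s.set(i + 1, new Tuple(right.v, left.g + right.g, right.delta));
        s.remove(i);
    }

    public static void main(String[] args) {
        List<Tuple> s = new ArrayList<>();
        s.add(new Tuple(5, 1, 0));   // delta = 0 is a placeholder value
        s.add(new Tuple(6, 1, 0));
        s.add(new Tuple(9, 1, 0));
        mergeIntoRight(s, 0);        // (6, 2, 0) now covers 5 and 6
        mergeIntoRight(s, 0);        // (9, 3, 0) now covers 5, 6 and 9
        System.out.println(s);       // prints [(9, 3, 0)]
    }
}
```

After both merges the remaining tuple has g = 3, which is exactly the number of input values it covers.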

• "Fullness" of an entry in the summary

• By Corollary 1, the algorithm to report quantiles must always maintain the property:

 $\max_i\,(g_i + \Delta_i) \le 2 \varepsilon n$

• "Inspired" by this property, an entry (vi, gi, Δi) is full when:

 $g_i + \Delta_i = \lfloor 2 \varepsilon n \rfloor$

• The algorithm must never allow an entry in the summary to become "overfilled", i.e., this must never happen:

 $g_i + \Delta_i > 2 \varepsilon n$

• Bands: categorizing the coverage of entries in the summary

• The algorithm tries to minimize the number of entries used in the summary

• Recall that the number of values covered by (v, g, Δ) is equal to g

So ideally, you want entries with large g...

• Due to the requirement that:

 $g_i + \Delta_i \le \lfloor 2 \varepsilon n \rfloor$

the value of g cannot grow arbitrarily large...

• The entries that have the potential to let g grow large must therefore have:

 a smaller value for Δ !!!

• The notion of a band is used in the algorithm to categorize the entries in the summary:

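 A highest band entry has the potential to cover more than 1/2 of the $\lfloor 2 \varepsilon n \rfloor$ values ($\lfloor 2 \varepsilon n \rfloor$ = the maximum possible)

 A highest band entry will have $\Delta_i$ = 0 .. $\lfloor 2 \varepsilon n \rfloor / 2$

 A second highest band entry has the potential to cover between 1/4 and 1/2 of the $\lfloor 2 \varepsilon n \rfloor$ values

 A second highest band entry will have $\Delta_i$ roughly between $\lfloor 2 \varepsilon n \rfloor / 2 + 1$ and $3 \lfloor 2 \varepsilon n \rfloor / 4$

 And so on, down to band(0), whose entries have $\Delta_i = \lfloor 2 \varepsilon n \rfloor$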

• Example:

```
2 ε n = 14

                 Δ
-------------------------------
Band(0):   14
Band(1):   12 13
Band(2):   11 10 9 8
Band(3):   7 6 5 4 3 2 1 0
```

• A smaller value for Δ will allow g to grow larger

Hence, a larger value for the band is more desirable

• The definition of a band allows us to group tuples with similar "capacity" together.

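• Here's a function that lets you compute the band() value for a given Δ and p = 2 ε n:

```
// Returns the band of an entry with the given delta, where p = 2 ε n.
// Assumes 0 <= delta <= p.
static int findband(int delta, int p)
{
   int diff = p - delta + 1;

   // floor(log2(diff)), computed on integers so that floating-point
   // rounding cannot push an exact power of 2 into the wrong band;
   // diff == 1 (delta == p) gives band 0
   return 31 - Integer.numberOfLeadingZeros(diff);
}
```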
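To check the band function against the table for 2 ε n = 14, here is a small self-contained Java sketch. The class name `BandTable` and the helper `bands` are made up for this demo, and the band is computed with integer bit operations, which is assumed to be equivalent to taking ⌊log2(p − Δ + 1)⌋.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BandTable {
    /** floor(log2(p - delta + 1)); assumes 0 <= delta <= p. */
    static int findband(int delta, int p) {
        return 31 - Integer.numberOfLeadingZeros(p - delta + 1);
    }

    /** Groups every delta in 0..p by its band (band -> list of deltas). */
    static Map<Integer, List<Integer>> bands(int p) {
        Map<Integer, List<Integer>> table = new TreeMap<>();
        for (int delta = p; delta >= 0; delta--) {
            table.computeIfAbsent(findband(delta, p), b -> new ArrayList<>()).add(delta);
        }
        return table;
    }

    public static void main(String[] args) {
        // p = 14 is the value of 2 * epsilon * n used in the band example.
        bands(14).forEach((band, deltas) ->
                System.out.println("Band(" + band + "): " + deltas));
    }
}
```

For p = 14 this groups Δ = 14 into band 0, Δ = 13, 12 into band 1, Δ = 11 .. 8 into band 2 and Δ = 7 .. 0 into band 3, the same grouping as in the table.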


• Tree-structure formed by the bands

• The algorithm by Greenwald and Khanna deletes tuples in groups

A group of tuples is defined using a tree-structure imposed upon the values of their band

• Imposing a tree structure on a series of numbers:

```
Given: any sequence of integers

Example:   0 1 3 0 2 1 3 0 1 2 3
```

Lift up the largest value:

```
    3       3       3
0 1   0 2 1   0 1 2
```

Lift up the next largest value:

```
    3       3       3
        2         2
0 1   0   1   0 1
```

Lift up the next largest value:

```
    3       3       3
        2         2
  1       1     1
0     0       0
```

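Tree: parent(x) = the first number to the right of x that is > x (a number with no larger number to its right hangs off the root)

```
root
 +-- 3                 (3rd number)
 |    +-- 1
 |         +-- 0
 +-- 3                 (7th number)
 |    +-- 2
 |    |    +-- 0
 |    +-- 1
 +-- 3                 (11th number)
      +-- 2
           +-- 1
                +-- 0
```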
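The parent rule can be computed in one right-to-left pass with a stack. The following Java sketch is an illustration only; the class `BandTree` and the method `parents` are hypothetical names, and -1 is used here to stand for the root.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

class BandTree {
    /** parent[i] = index of the first element to the right of i that is larger, or -1 (root). */
    static int[] parents(int[] bands) {
        int[] parent = new int[bands.length];
        Deque<Integer> stack = new ArrayDeque<>();   // indices of candidate parents to the right
        for (int i = bands.length - 1; i >= 0; i--) {
            // A candidate that is not larger than bands[i] can never be the parent
            // of i or of anything further left, so it is dropped.
            while (!stack.isEmpty() && bands[stack.peek()] <= bands[i]) {
                stack.pop();
            }
            parent[i] = stack.isEmpty() ? -1 : stack.peek();
            stack.push(i);
        }
        return parent;
    }

    public static void main(String[] args) {
        int[] example = {0, 1, 3, 0, 2, 1, 3, 0, 1, 2, 3};
        // prints [1, 2, -1, 4, 6, 6, -1, 8, 9, 10, -1]
        System.out.println(Arrays.toString(parents(example)));
    }
}
```

The indices are 0-based: the first 0 hangs under the 1 at index 1, which hangs under the 3 at index 2, and the three 3's are children of the root.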

• The Greenwald & Khanna Algorithm

• Greenwald & Khanna's algorithm:

```
*** ε is the error margin (a parameter of the algorithm)

S = {};      // S contains the summary structure, which is:
             //    <(v_0, g_0, Δ_0), (v_1, g_1, Δ_1) ... >
             // NOTE: S is an ordered list !!!

N = 0;       // Number of items processed

while ( not EOS )
{
   /* ---------------------------------------------
      Delete phase: executed once every 1/(2×ε) insertions
      --------------------------------------------- */
   if ( N % [1/(2×ε)] == 0 )
   {
      COMPRESS();     // <-------- Delete some entries in summary
   }

   /* ------------------------------------
      Insert phase
      ------------------------------------ */
   v = next value in input

   /* --------------------------------------------
      Find insert position for v in S
      -------------------------------------------- */
   Find a tuple (v_i, g_i, Δ_i) ∈ S such that:  v_{i-1} ≤ v < v_i

   if ( v is inserted at the head or tail of S )
      Δ = 0;
   else
      Δ = g_i + Δ_i - 1 ;

   INSERT "(v, 1, Δ)" into S between v_{i-1} and v_i;

   N++;
}
```

• We will discuss COMPRESS() in more detail next....

• Deleting elements from the summary

• The COMPRESS() routine works in a similar fashion to the basic delete routine:

• Scans the summary from the end to the beginning

• If a sequence of entries

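 $(v_j, g_j, \Delta_j),\ (v_{j+1}, g_{j+1}, \Delta_{j+1}),\ \dots,\ (v_i, g_i, \Delta_i),\ (v_{i+1}, g_{i+1}, \Delta_{i+1})$

can be found such that:

 $g_j + g_{j+1} + \dots + g_i + g_{i+1} + \Delta_{i+1} \le 2 \varepsilon N$

Then all tuples $(v_j, g_j, \Delta_j),\ (v_{j+1}, g_{j+1}, \Delta_{j+1}),\ \dots,\ (v_i, g_i, \Delta_i)$ are merged into the tuple $(v_{i+1}, g_{i+1}, \Delta_{i+1})$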

• The only difference is how COMPRESS() determines the set of tuples:

 $(v_j, g_j, \Delta_j),\ (v_{j+1}, g_{j+1}, \Delta_{j+1}),\ \dots,\ (v_i, g_i, \Delta_i)$     and     $(v_{i+1}, g_{i+1}, \Delta_{i+1})$

• Here is the COMPRESS() routine:

 COMPRESS()

```
Input: S     // S contains the summary structure, which is:
             //   <(v_0, g_0, Δ_0), (v_1, g_1, Δ_1) ... (v_{s-1}, g_{s-1}, Δ_{s-1})>

for "i from (s-2) to 0" do
{
   if ( band(Δ_i, 2εN) ≤ band(Δ_{i+1}, 2εN)
        &&
        g* + g_{i+1} + Δ_{i+1} < 2εN )
   {
      Merge the subtree rooted at (v_i, g_i, Δ_i) into tuple (v_{i+1}, g_{i+1}, Δ_{i+1})
   }
}

NOTE: g* = sum of the g values in the subtree rooted at (v_i, g_i, Δ_i)
```
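To make the insert and delete phases concrete, here is a compact Java sketch of both. It is a sketch under stated assumptions, not the authors' implementation: the class `GKSketch` and its field and method names are invented here, values are plain `int`s, the band is taken as ⌊log2(2εN − Δ + 1)⌋, the subtree of a tuple is taken to be the run of lower-band tuples directly to its left, and the first tuple (the minimum) is never merged away.

```java
import java.util.ArrayList;
import java.util.List;

class GKSketch {
    final List<int[]> s = new ArrayList<>();   // entries {v, g, delta}, ordered by v
    final double eps;                          // error margin
    int n = 0;                                 // number of items processed

    GKSketch(double eps) { this.eps = eps; }

    /** Band of delta for p = floor(2 * eps * n): floor(log2(p - delta + 1)). */
    static int band(int delta, int p) {
        return 31 - Integer.numberOfLeadingZeros(Math.max(1, p - delta + 1));
    }

    void insert(int v) {
        int period = Math.max(1, (int) Math.floor(1.0 / (2 * eps)));
        if (n % period == 0) {
            compress();                                   // delete phase
        }
        int i = 0;
        while (i < s.size() && s.get(i)[0] <= v) {        // find v_{i-1} <= v < v_i
            i++;
        }
        boolean headOrTail = (i == 0 || i == s.size());
        int delta = headOrTail ? 0 : s.get(i)[1] + s.get(i)[2] - 1;
        s.add(i, new int[] {v, 1, delta});
        n++;
    }

    void compress() {
        int p = (int) Math.floor(2 * eps * n);
        // Assumption: index 0 (the minimum) is never merged away.
        for (int i = s.size() - 2; i >= 1; i--) {
            int[] right = s.get(i + 1);
            int bandI = band(s.get(i)[2], p);
            if (bandI > band(right[2], p)) {
                continue;
            }
            // Subtree rooted at i: i plus the run of lower-band tuples directly to its left.
            int j = i;
            int gStar = s.get(i)[1];
            while (j - 1 >= 1 && band(s.get(j - 1)[2], p) < bandI) {
                j--;
                gStar += s.get(j)[1];
            }
            if (gStar + right[1] + right[2] < p) {
                right[1] += gStar;               // right neighbour absorbs the subtree
                s.subList(j, i + 1).clear();     // drop tuples j..i
                i = j;                           // resume just left of the merged tuple
            }
        }
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int[] t : s) {
            sb.append("(").append(t[0]).append(",").append(t[1]).append(",").append(t[2]).append(") ");
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        GKSketch gk = new GKSketch(0.25);
        for (int v : new int[] {12, 10, 11, 10, 1, 10, 11, 9}) {
            gk.insert(v);
        }
        gk.compress();            // the COMPRESS() that would run before item 9
        System.out.println(gk);   // prints (1,1,0) (9,1,0) (10,3,0) (12,3,0)
    }
}
```

With ε = 0.25 and the input 12 10 11 10 1 10 11 9, the sketch ends with four tuples. Keeping index 0 out of the loop is a deliberate choice: the pseudocode loops down to i = 0, which would allow the minimum to be merged.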

• Why use bands to delete elements

• The authors proved a number of lemmas that state how much memory space the algorithm will use.

The lemmas make use of the way that the tuples are organized in bands.

• I will not go through the proofs, just state the lemmas...

• Lemma 1:

 A tuple from band α will never cover a value from a band > α

• Lemma 2:

 The total number of observations covered by tuples with band values 0 .. α is at most $2^{\alpha}/\varepsilon$

• Lemma 3:

 For any α, there are at most 3/(2ε) nodes in the tree that have a child node with band value = α

 Or: there are at most 3/(2ε) parent nodes for a node in band α

• Lemma 4:

 For any α, there are at most 4/ε nodes with band value = α that are right-side partners in a full tuple pair

• Lemma 5:

 For any α, the maximum number of tuples with band value = α is ≤ 11/(2ε)

• Theorem 1:

 The total number of tuples in summary S is at most (11/(2ε)) log(2εn)

• An example

• Assume that ε = 0.25

Hence, one period is equal to 1/(2 ε) = 2 items

The COMPRESS() method is invoked once after every 2 items received

• Input sequence:

 ``` 12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3 ```

• Processing item 1 (= 12)

```
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

S = (12, 1, 0)
```

• Processing item 2 (= 10)

```
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted input so far:   10 12

S = (10, 1, 0) (12, 1, 0)
```

• N % (1/(2 ε)) == 0, perform COMPRESS()

```
2εN = 1

According to the band program (findband above):

               Δ
            -------
 band(0) =     1
 band(1) =     0

S = (10, 1, 0) (12, 1, 0)
      band(1)    band(1)

Testing: (10, 1, 0)

   Band: 1 ≤ 1       ==> TRUE
   1 + 1 + 0 < 1     ==> FALSE

   cannot delete (10, 1, 0)

DONE
```

• Processing item 3 (= 11)

```
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted input so far:   10 11 12

S = (10, 1, 0) (11, 1, 0) (12, 1, 0)

Δ = 1 + 0 - 1 = 0
```

• Processing item 4 (= 10)

```
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted input so far:   10 10 11 12

S = (10, 1, 0) (10, 1, 0) (11, 1, 0) (12, 1, 0)

Δ = 1 + 0 - 1 = 0
```

• N % (1/(2 ε)) == 0, perform COMPRESS()

```
2εN = 2

According to the band program (findband above):

               Δ
            -------
 band(0) =     2
 band(1) =     1 0

S = (10, 1, 0) (10, 1, 0) (11, 1, 0) (12, 1, 0)
      band(1)    band(1)    band(1)    band(1)

Testing: (11, 1, 0), (12, 1, 0)

   Band: 1 ≤ 1       ==> TRUE
   1 + 1 + 0 < 2     ==> FALSE

   cannot delete (11, 1, 0)

Testing: (10, 1, 0), (11, 1, 0)

   Band: 1 ≤ 1       ==> TRUE
   1 + 1 + 0 < 2     ==> FALSE

   cannot delete (10, 1, 0)

Testing: (10, 1, 0), (10, 1, 0)

   Band: 1 ≤ 1       ==> TRUE
   1 + 1 + 0 < 2     ==> FALSE

   cannot delete (10, 1, 0)

DONE
```

• Processing item 5 (= 1)

```
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted input so far:   1 10 10 11 12

S = (1,1,0) (10,1,0) (10,1,0) (11,1,0) (12,1,0)

(Δ = 0 because (1, 1, 0) is inserted at the head of S)
```

• Processing item 6 (= 10)

```
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted input so far:   1 10 10 10 11 12

S = (1,1,0) (10,1,0) (10,1,0) (10,1,0) (11,1,0) (12,1,0)

Δ = 1 + 0 - 1 = 0
```

• N % (1/(2 ε)) == 0, perform COMPRESS()

```
2εN = 3  !!!!

According to the band program (findband above):

               Δ
            -------
 band(0) =     3
 band(1) =     1 2
 band(2) =     0

S = (1,1,0) (10,1,0) (10,1,0) (10,1,0) (11,1,0) (12,1,0)
    band(2)  band(2)  band(2)  band(2)  band(2)  band(2)

Testing: (11,1,0) (12,1,0)

   Band: 2 ≤ 2       ==> TRUE
   1 + 1 + 0 < 3     ==> TRUE

   DELETE subtree (11, 1, 0): replace (11,1,0) (12,1,0) by (12,2,0)

S = (1,1,0) (10,1,0) (10,1,0) (10,1,0) (12,2,0)
    band(2)  band(2)  band(2)  band(2)  band(2)

Testing: (10,1,0) (12,2,0)

   Band: 2 ≤ 2       ==> TRUE
   1 + 2 + 0 < 3     ==> FALSE

   cannot delete (10, 1, 0)

Testing: (10,1,0) (10,1,0)

   Band: 2 ≤ 2       ==> TRUE
   1 + 1 + 0 < 3     ==> TRUE

   DELETE (10, 1, 0): replace (10,1,0) (10,1,0) by (10,2,0)

S = (1,1,0) (10,1,0) (10,2,0) (12,2,0)
    band(2)  band(2)  band(2)  band(2)

Testing: (10,1,0) (10,2,0)

   Band: 2 ≤ 2       ==> TRUE
   1 + 2 + 0 < 3     ==> FALSE

   cannot delete (10, 1, 0)

Testing: (1,1,0) (10,1,0)

   Band: 2 ≤ 2       ==> TRUE
   1 + 1 + 0 < 3     ==> TRUE

   DELETE (1, 1, 0): replace (1,1,0) (10,1,0) by (10,2,0)

S = (10,2,0) (10,2,0) (12,2,0)

Strange... according to the algorithm description, the min. value can be merged...
The paper said that v_0 = min value (and v_{s-1} = max value)

Let's assume that v_0 = min value is always kept, so the summary is:

S = (1,1,0) (10,1,0) (10,2,0) (12,2,0)

DONE
```

Assessing the state:

```
Input:   1 10 10 10 11 12

State:   S = (1,1,0) (10,1,0) (10,2,0) (12,2,0)

   Or:   S = 1:[1..1]  10:[2..2]  10:[4..4]  12:[6..6]

Sample query:

   Rank:    1   2   3   4   5   6
           ------------------------
   Value:   1  10  10  10  11  12

Answer:  S = 1:[1..1]  10:[2..2]  10:[4..4]  12:[6..6]
```

We can answer any φ-quantile query with an error of at most 1 rank position.

That is acceptable because ⌊ε×N⌋ is 1.

Notice that we have removed 2 items and need to maintain less information

• Example Continued...

• Processing item 7 (= 11)

```
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted input so far:   1 10 10 10 11 11 12

S = (1,1,0) (10,1,0) (10,2,0) (11,1,1) (12,2,0)

Δ = 2 + 0 - 1 = 1
```

• Processing item 8 (= 9)

```
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted input so far:   1 9 10 10 10 11 11 12

S = (1,1,0) (9,1,0) (10,1,0) (10,2,0) (11,1,1) (12,2,0)

Δ = 1 + 0 - 1 = 0
```

• N % (1/(2 ε)) == 0, perform COMPRESS()

```
2εN = 4   (wiggle room)

According to the band program (findband above):

               Δ
            -------
 band(0) =     4
 band(1) =     2 3
 band(2) =     0 1

S = (1,1,0) (9,1,0) (10,1,0) (10,2,0) (11,1,1) (12,2,0)
    band(2) band(2)  band(2)  band(2)  band(2)  band(2)

Testing: (11,1,1) (12,2,0)

   Band: 2 ≤ 2       ==> TRUE
   1 + 2 + 0 < 4     ==> TRUE

   DELETE (11, 1, 1): replace (11,1,1) (12,2,0) by (12,3,0)

S = (1,1,0) (9,1,0) (10,1,0) (10,2,0) (12,3,0)

Testing: (10,2,0) (12,3,0)

   Band: 2 ≤ 2       ==> TRUE
   2 + 3 + 0 < 4     ==> FALSE

   cannot delete (10, 2, 0)

Testing: (10,1,0) (10,2,0)

   Band: 2 ≤ 2       ==> TRUE
   1 + 2 + 0 < 4     ==> TRUE

   DELETE (10, 1, 0): replace (10,1,0) (10,2,0) by (10,3,0)

S = (1,1,0) (9,1,0) (10,3,0) (12,3,0)

Testing: (9,1,0) (10,3,0)

   Band: 2 ≤ 2       ==> TRUE
   1 + 3 + 0 < 4     ==> FALSE

   cannot delete (9, 1, 0)

Do not delete: (1, 1, 0)

S = (1,1,0) (9,1,0) (10,3,0) (12,3,0)

DONE
```

• Assessment:

```
Max positional error:  εN = 2     (more wiggle room now !!)

Input processed:   1 9 10 10 10 11 11 12

Summary:   S = (1,1,0) (9,1,0) (10,3,0) (12,3,0)

     Or:   S = 1:[1..1]  9:[2..2]  10:[5..5]  12:[8..8]

User query processing:

   Rank:            1   2   3   4   5   6   7   8
   ------------------------------------------------
   actual answer:   1   9  10  10  10  11  11  12

   r = 1   ===>  rmax(9)  = 2 > 1      answer:  1:[1..1]
   r = 2   ===>  rmax(10) = 5 > 2      answer:  9:[2..2]
   r = 3   ===>  rmax(10) = 5 > 3      answer:  9:[2..2]
   r = 4   ===>  rmax(10) = 5 > 4      answer:  9:[2..2]
   r = 5   ===>  rmax(12) = 8 > 5      answer:  10:[5..5]
```

We can answer any φ-quantile query with an error of at most 2 rank positions.

That is acceptable because ⌊ε×N⌋ is now 2 !!
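The query processing in the assessment can be replayed with a short Java sketch. The lookup rule used here is inferred from that assessment and should be read as an assumption: scan for the first tuple whose rmax exceeds r and answer with the tuple just before it, where rmin is the running sum of the g values and rmax = rmin + Δ. The class `RankQuery` and the method `query` are hypothetical names.

```java
class RankQuery {
    /** Summary rows are {v, g, delta}; returns the value reported for rank r (1-based). */
    static int query(int[][] summary, int r) {
        int rmin = 0;
        for (int i = 0; i < summary.length; i++) {
            rmin += summary[i][1];                 // rmin(i) = g_0 + ... + g_i
            int rmax = rmin + summary[i][2];       // rmax(i) = rmin(i) + delta_i
            if (rmax > r) {
                // Answer with the tuple just before the first one whose rmax exceeds r.
                return summary[Math.max(i - 1, 0)][0];
            }
        }
        return summary[summary.length - 1][0];     // no rmax exceeds r: report the last tuple
    }

    public static void main(String[] args) {
        int[][] s = {{1, 1, 0}, {9, 1, 0}, {10, 3, 0}, {12, 3, 0}};
        for (int r = 1; r <= 8; r++) {
            System.out.println("r = " + r + "  ->  " + query(s, r));
        }
    }
}
```

For r = 4 the sketch reports 9, which sits at rank 2 while the true answer 10 sits at rank 4; the error of 2 positions is within ⌊ε×N⌋ = 2.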