### Finding V-Optimal histogram (part 2) - searching for the best bucket partitions

• Finding the histogram with the minimum error - Part 2 best bucket boundaries

• Previously:

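• The optimal value for $h_r$ that minimizes the error in a bucket $b_r$ was found using calculus:

 $h_r = \mathrm{Average}(f_{s_r}, f_{s_r+1}, f_{s_r+2}, \ldots, f_{e_r})$

The minimum error in a bucket $b_r$ is:

 $S_r = \left(f_{s_r}^2 + f_{s_r+1}^2 + \cdots + f_{e_r}^2\right) - \dfrac{\left(f_{s_r} + f_{s_r+1} + \cdots + f_{e_r}\right)^2}{p}$, where $p = e_r - s_r + 1$ is the number of values in the bucket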
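The closed form for the bucket error can be checked numerically. The following is a minimal Java sketch, not part of the original notes: the class name `BucketError`, the method names, and the sample values are illustrative assumptions. It computes the error of one bucket both with the closed form and directly from the definition (sum of squared deviations from the average).

```java
final class BucketError {
    // Error of one bucket holding f[s..e] (0-based, inclusive), using the
    // closed form S = (sum of squares) - (sum)^2 / p.
    static double sqError(double[] f, int s, int e) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (int i = s; i <= e; i++) {
            sum += f[i];
            sumSq += f[i] * f[i];
        }
        int p = e - s + 1;                 // number of values in the bucket
        return sumSq - sum * sum / p;
    }

    // The same error from its definition: sum of (f_i - h)^2 with h = average.
    static double sqErrorFromDefinition(double[] f, int s, int e) {
        int p = e - s + 1;
        double sum = 0.0;
        for (int i = s; i <= e; i++) {
            sum += f[i];
        }
        double h = sum / p;
        double err = 0.0;
        for (int i = s; i <= e; i++) {
            err += (f[i] - h) * (f[i] - h);
        }
        return err;
    }

    public static void main(String[] args) {
        double[] f = {4, 2, 3, 6};         // arbitrary sample values
        System.out.println(sqError(f, 0, 3));               // prints 8.75
        System.out.println(sqErrorFromDefinition(f, 0, 3)); // prints 8.75
    }
}
```

Both methods agree, which is what the calculus result predicts: choosing the average as the bucket value gives exactly the closed-form error.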

• The next problem that we must solve to find the V-optimal histogram is finding the best boundaries for the buckets

This step requires computer science...

• Jagadish et al. presented a dynamic programming approach to find the optimal bucket partitioning

• Searching for the best bucket assignment

• Problem formulation:

 Find the bucket assignment for a histogram that uses B buckets and minimizes the squared error

This problem is solved by a brute force search

(A smart brute force search :-))

• Basic idea for the search algorithm:

Meaning of the figure:

 The red area represents an optimal histogram with k buckets.

 The green area represents an optimal histogram with k-1 buckets.

 The last bucket contains the data for x, x+1, ..., b.

• Suppose we need to construct a histogram using k-1 buckets for the data a, a+1, ..., x-1

 The optimal solution is the histogram in the green area

Therefore:

 Optimal Histogram for [a,b] using k buckets = Optimal Histogram of [a..x-1] using k-1 buckets + Optimal Histogram of [x..b] using 1 bucket

• Question:

 How do we determine x?

 There is no mathematical formula that tells us which last bucket (x..b) is "best". The only way is to try every possible case:

 • Last bucket is: $x_N$

 • Last bucket is: $x_{N-1} \ldots x_N$

 • Last bucket is: $x_{N-2} \ldots x_N$

 • ...

 • Last bucket is: $x_1 \ldots x_N$

 One of them must have the smallest squared error. That is the optimal partition!

• In other words, we have the following recursive relationship:

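 Optimal Histogram for [a,b] using k buckets = $\min_{x = a+k-1, \ldots, b}$ { Optimal Histogram of [a..x-1] using k-1 buckets + last bucket is [x..b] }

 (x starts at a+k-1 so that each of the first k-1 buckets holds at least one value.)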
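The recursive relationship can be written down almost literally. The following is a minimal Java sketch under stated assumptions: the left end a is fixed at the first value, indices are 0-based, and the names `RecursiveVOpt`, `best`, and `sqError` are illustrative, not from the original notes. It tries every possible start x of the last bucket and recurses on the values to its left.

```java
final class RecursiveVOpt {
    // Squared error of one bucket holding f[s..e] (0-based, inclusive).
    static double sqError(double[] f, int s, int e) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (int i = s; i <= e; i++) {
            sum += f[i];
            sumSq += f[i] * f[i];
        }
        return sumSq - sum * sum / (e - s + 1);
    }

    // Smallest total squared error when f[0..b] is split into k buckets.
    // Returns infinity when there are fewer values than buckets.
    static double best(double[] f, int b, int k) {
        if (b + 1 < k) {
            return Double.POSITIVE_INFINITY;
        }
        if (k == 1) {
            return sqError(f, 0, b);
        }
        double bestErr = Double.POSITIVE_INFINITY;
        // x = first index of the last bucket; the k-1 buckets on the left
        // need at least k-1 values, so x starts at k-1.
        for (int x = k - 1; x <= b; x++) {
            double err = best(f, x - 1, k - 1) + sqError(f, x, b);
            if (err < bestErr) {
                bestErr = err;
            }
        }
        return bestErr;
    }

    public static void main(String[] args) {
        double[] f = {4, 2, 3, 6};               // arbitrary sample values
        System.out.println(best(f, 3, 2));       // prints 2.0
    }
}
```

This plain recursion recomputes the same sub-histograms many times, so its running time grows exponentially with the number of buckets. Storing each sub-result in a table once, as the dynamic programming approach does, avoids that repetition.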

• Worked example

• Consider the following input:

 $f_1 = 4$, $f_2 = 2$, $f_3 = 3$, $f_4 = 6$, $f_5 = 5$, $f_6 = 6$, $f_7 = 12$, $f_8 = 16$

Problem: construct a V-optimal histogram with B = 3 buckets

• Step 1: construct V-optimal histogram with B = 1 bucket

Note: with a single bucket there are no boundaries to search for, so this step only needs the minimum-error formula for a single bucket given at the start of this page.

```
 Histogram with 1 bucket:

   Values:    | 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8  |
   -----------+------+------+------+------+------+------+------+-------+
   Min Error: | 0.0  | 2.0  | 2.0  | 8.75 | 10.0 | 13.3 | 63.7 | 161.5 |
```
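The 1-bucket row can be reproduced with running sums. The following is a minimal Java sketch, not from the original notes: the class name `OneBucketRow` and the method name are illustrative assumptions. It keeps a running sum and a running sum of squares, and applies the single-bucket error formula to each prefix 1..i.

```java
final class OneBucketRow {
    // row[i] = error of a single bucket over the first i values (i = 1..n).
    // row[0] is unused.
    static double[] oneBucketErrors(double[] f) {
        int n = f.length;
        double[] p = new double[n + 1];    // p[i]  = f_1 + ... + f_i
        double[] pp = new double[n + 1];   // pp[i] = f_1^2 + ... + f_i^2
        double[] row = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            p[i] = p[i - 1] + f[i - 1];
            pp[i] = pp[i - 1] + f[i - 1] * f[i - 1];
            row[i] = pp[i] - p[i] * p[i] / i;
        }
        return row;
    }

    public static void main(String[] args) {
        double[] f = {4, 2, 3, 6, 5, 6, 12, 16};
        double[] row = oneBucketErrors(f);
        for (int i = 1; i < row.length; i++) {
            System.out.printf("1..%d  %.2f%n", i, row[i]);
        }
    }
}
```

Up to rounding, the printed values should match the Min Error row in the table above.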

• Step 2: construct V-optimal histogram with B = 2 buckets

• Initially:

```
 Input:  4  2  3  6  5  6  12  16

 Histogram with 2 buckets:

   Values:    | 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
   -----------+------+------+------+------+------+------+------+------+
   Min Error: |  x   | 0.0  |  ??  |  ??  |  ??  |  ??  |  ??  |  ??  |
                        ^
                        |
              each value in its own bucket
```

• To find the best bucket partition for values 4 2 3, we try:

```
   [ 4 2 ] [ 3 ]   ===>   MinError[1][2] + 0
   [ 4 ] [ 2 3 ]   ===>   MinError[1][1] + (2 - 2.5)^2 + (3 - 2.5)^2
   |     |
   +-----+
   1 bucket optimal histogram

 Using the result from the 1 bucket optimal histogram:

   [ 4 2 ] [ 3 ]   ===>   2.0 + 0   = 2.0
   [ 4 ] [ 2 3 ]   ===>   0.0 + 0.5 = 0.5    <---- Min
```

Result:

```
 Input:  4  2  3  6  5  6  12  16

 Histogram with 2 buckets:

   Values:    | 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
   -----------+------+------+------+------+------+------+------+------+
   Min Error: |  x   | 0.0  | 0.5  |  ??  |  ??  |  ??  |  ??  |  ??  |
```

• To find the best bucket partition for values 4 2 3 6, we try:

```
   [ 4 2 3 ] [ 6 ]   ===>   MinError[1][3] + 0
   [ 4 2 ] [ 3 6 ]   ===>   MinError[1][2] + (3 - 4.5)^2 + (6 - 4.5)^2
   [ 4 ] [ 2 3 6 ]   ===>   MinError[1][1] + (2 - 3.66)^2 + (3 - 3.66)^2 + (6 - 3.66)^2
   |       |
   +-------+
   1 bucket optimal histogram

 Using the result from the 1 bucket optimal histogram:

   [ 4 2 3 ] [ 6 ]   ===>   2.0 + 0     = 2.0      <--- Min
   [ 4 2 ] [ 3 6 ]   ===>   2.0 + 4.5   = 6.5
   [ 4 ] [ 2 3 6 ]   ===>   0.0 + 8.666 = 8.666
```

Result:

```
 Input:  4  2  3  6  5  6  12  16

 Histogram with 2 buckets:

   Values:    | 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
   -----------+------+------+------+------+------+------+------+------+
   Min Error: |  x   | 0.0  | 0.5  | 2.0  |  ??  |  ??  |  ??  |  ??  |
```

• And so on... - Final result:

```
 Input:  4  2  3  6  5  6  12  16

 V-optimal Histogram with 2 buckets:

   Values:    | 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
   -----------+------+------+------+------+------+------+------+------+
   Min Error: |  x   | 0.0  | 0.5  | 2.0  | 2.5  | 2.66 | 13.3 | 21.3 |
```
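The "and so on" steps are mechanical: every entry of the k-bucket row is a minimum over entries of the (k-1)-bucket row plus the error of one last bucket. The following is a minimal Java sketch of that single step, not from the original notes. The names `NextRow`, `prefix`, `firstRow`, and `nextRow` are illustrative assumptions, and infinity stands in for the "x" entries of the tables.

```java
import java.util.Arrays;

final class NextRow {
    // Prefix sums: out[i] = v[0] + ... + v[i-1], or the same over squares.
    static double[] prefix(double[] v, boolean squared) {
        double[] out = new double[v.length + 1];
        for (int i = 1; i <= v.length; i++) {
            double t = v[i - 1];
            out[i] = out[i - 1] + (squared ? t * t : t);
        }
        return out;
    }

    // Error of one bucket over values a..b (1-based, inclusive).
    static double sqError(double[] p, double[] pp, int a, int b) {
        double s1 = p[b] - p[a - 1];
        double s2 = pp[b] - pp[a - 1];
        return s2 - s1 * s1 / (b - a + 1);
    }

    // Row for one bucket: row[i] = error of a single bucket over values 1..i.
    static double[] firstRow(double[] f) {
        double[] p = prefix(f, false);
        double[] pp = prefix(f, true);
        double[] row = new double[f.length + 1];
        for (int i = 1; i <= f.length; i++) {
            row[i] = sqError(p, pp, 1, i);
        }
        return row;
    }

    // Given prev[i] = best error with k-1 buckets on values 1..i,
    // build the row for k buckets. Impossible entries stay infinite.
    static double[] nextRow(double[] f, double[] prev) {
        double[] p = prefix(f, false);
        double[] pp = prefix(f, true);
        double[] row = new double[f.length + 1];
        Arrays.fill(row, Double.POSITIVE_INFINITY);
        for (int i = 1; i <= f.length; i++) {
            for (int j = 1; j < i; j++) {          // last bucket is j+1..i
                double err = prev[j] + sqError(p, pp, j + 1, i);
                if (err < row[i]) {
                    row[i] = err;
                }
            }
        }
        return row;
    }

    public static void main(String[] args) {
        double[] f = {4, 2, 3, 6, 5, 6, 12, 16};
        double[] row2 = nextRow(f, firstRow(f));
        System.out.println(Arrays.toString(row2));
    }
}
```

Calling `nextRow` once more on the 2-bucket row would give the 3-bucket row, which is exactly what Step 3 does by hand.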

• Step 3: construct V-optimal histogram with B = 3 buckets

• Initially:

```
 Input:  4  2  3  6  5  6  12  16

 Histogram with 3 buckets:

   Values:    | 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
   -----------+------+------+------+------+------+------+------+------+
   Min Error: |  x   |  x   | 0.0  |  ??  |  ??  |  ??  |  ??  |  ??  |
                               ^
                               |
                     each value in its own bucket
```

• To find the best bucket partition for values 4 2 3 6, we try:

```
   { 4 2 3 } [ 6 ]   ===>   MinError[2][3] + 0
   { 4 2 } [ 3 6 ]   ===>   MinError[2][2] + (3 - 4.5)^2 + (6 - 4.5)^2
   { 4 } [ 2 3 6 ]   ===>   MinError[2][1] + (2 - 3.66)^2 + (3 - 3.66)^2 + (6 - 3.66)^2
   |       |
   +-------+
   2 bucket optimal histogram

 Using the result from the 2 bucket optimal histogram:

   { 4 2 3 } [ 6 ]   ===>   0.5 + 0   = 0.5    <---- Min
   { 4 2 } [ 3 6 ]   ===>   0.0 + 4.5 = 4.5
   { 4 } [ 2 3 6 ]   ===>   not possible: MinError[2][1] = x (one value cannot fill 2 buckets)
```

Result:

```
 Input:  4  2  3  6  5  6  12  16

 Histogram with 3 buckets:

   Values:    | 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
   -----------+------+------+------+------+------+------+------+------+
   Min Error: |  x   |  x   | 0.0  | 0.5  |  ??  |  ??  |  ??  |  ??  |
```

• To find the best bucket partition for values 4 2 3 6 5, we try:

```
   { 4 2 3 6 } [ 5 ]   ===>   MinError[2][4] + 0
   { 4 2 3 } [ 6 5 ]   ===>   MinError[2][3] + (6 - 5.5)^2 + (5 - 5.5)^2
   { 4 2 } [ 3 6 5 ]   ===>   MinError[2][2] + (3 - 4.66)^2 + (6 - 4.66)^2 + (5 - 4.66)^2
   { 4 } [ 2 3 6 5 ]   ===>   MinError[2][1] + (2 - 4)^2 + (3 - 4)^2 + (6 - 4)^2 + (5 - 4)^2
   |         |
   +---------+
   2 bucket optimal histogram

 Using the result from the 2 bucket optimal histogram:

   { 4 2 3 6 } [ 5 ]   ===>   2.0 + 0     = 2.0
   { 4 2 3 } [ 6 5 ]   ===>   0.5 + 0.5   = 1.0      <--- Min
   { 4 2 } [ 3 6 5 ]   ===>   0.0 + 4.666 = 4.666
   { 4 } [ 2 3 6 5 ]   ===>   not possible: MinError[2][1] = x (one value cannot fill 2 buckets)
```

Result:

```
 Input:  4  2  3  6  5  6  12  16

 Histogram with 3 buckets:

   Values:    | 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
   -----------+------+------+------+------+------+------+------+------+
   Min Error: |  x   |  x   | 0.0  | 0.5  | 1.0  |  ??  |  ??  |  ??  |
```

• And so on... - Final result:

```
 Input:  4  2  3  6  5  6  12  16

 V-optimal Histogram with 3 buckets:

   Values:    | 1..1 | 1..2 | 1..3 | 1..4 | 1..5 | 1..6 | 1..7 | 1..8 |
   -----------+------+------+------+------+------+------+------+------+
   Min Error: |  x   |  x   | 0.0  | 0.5  | 1.0  | 1.16 | 2.66 | 10.6 |
```

• The V-optimal Algorithm

• Algorithm in pseudocode:

```
   /* ------------------------------------------------
      Help function to compute the error in a bucket
      that holds the values x[a] .. x[b]
      ------------------------------------------------ */
   SqError(int a, int b)
   {
      s2 = PP[b] - PP[a-1];        // sum of squares of x[a]..x[b]
      s1 = P[b]  - P[a-1];         // sum of x[a]..x[b]

      return (s2 - s1*s1/(b-a+1));
   }

   /* ----------------------------------------------
      Prepare arrays to compute error efficiently
      ---------------------------------------------- */
   P[0]  = 0;
   PP[0] = 0;

   for (i = 1; i <= N; i++)
   {
      P[i]  = P[i-1]  + x[i];           // running sum
      PP[i] = PP[i-1] + x[i]*x[i];      // running sum of squares
   }

   /* ---------------------------------------------
      Compute the best error for 1 bucket histogram
      --------------------------------------------- */
   for (i = 1; i <= N; i++)
   {
      // Single bucket: use error formula...
      BestError[1][i] = SqError(1,i);
   }

   /* ---------------------------------------------------------
      Now we compute the V-opt. histogram with B buckets

      Output: BestError[k][i] = best error of histogram using
                                k buckets on data points (1..i)
      --------------------------------------------------------- */
   // The dynamic algorithm uses these variables:
   //
   //   k = # buckets
   //   i = current item - items processed are: (1..i)
   //   BestError[k][i] = min. error in histogram of k buckets for x[1]..x[i]

   for (k = 2; k <= B; k++)       // k = 1 was handled above
   {
      // Find optimal histogram using k buckets

      for (i = 1; i <= N; i++)
      {
         // Multiple buckets: search
         BestError[k][i] = INFINITE;        // Start value (stays INFINITE if i < k)

         // Try every possible size for the last bucket.
         // The first k-1 buckets need at least k-1 values, so j starts at k-1.
         for (j = k-1; j <= i-1; j++)       // Last bucket is [j+1..i]
         {
            if ( BestError[k-1][j] + SqError(j+1,i) < BestError[k][i] )
            {
               BestError[k][i] = BestError[k-1][j] + SqError(j+1,i);   // Better division found
            }
         }
      }
   }
```


• Finding the buckets in the histogram

• Add the lines marked with ********* to the program above to record the bucket boundaries, then print them:

```
   /* ------------------------------------------------
      Help function to compute the error in a bucket
      that holds the values x[a] .. x[b]
      ------------------------------------------------ */
   SqError(int a, int b)
   {
      s2 = PP[b] - PP[a-1];        // sum of squares of x[a]..x[b]
      s1 = P[b]  - P[a-1];         // sum of x[a]..x[b]

      return (s2 - s1*s1/(b-a+1));
   }

   /* ----------------------------------------------
      Prepare arrays to compute error efficiently
      ---------------------------------------------- */
   P[0]  = 0;
   PP[0] = 0;

   for (i = 1; i <= N; i++)
   {
      P[i]  = P[i-1]  + x[i];           // running sum
      PP[i] = PP[i-1] + x[i]*x[i];      // running sum of squares
   }

   /* ---------------------------------------------
      Compute the best error for 1 bucket histogram
      --------------------------------------------- */
   for (i = 1; i <= N; i++)
   {
      // Single bucket: use error formula...
      BestError[1][i] = SqError(1,i);
      min_index[1][i] = 1;       // A single bucket starts at 1   *************
   }

   /* ---------------------------------------------------------
      Now we compute the V-opt. histogram with B buckets

      Output: BestError[k][i] = best error of histogram using
                                k buckets on data points (1..i)
              min_index[k][i] = start of the last bucket in
                                that histogram
      --------------------------------------------------------- */
   for (k = 2; k <= B; k++)       // k = 1 was handled above
   {
      // Find optimal histogram using k buckets

      for (i = 1; i <= N; i++)
      {
         // Multiple buckets: search
         BestError[k][i] = INFINITE;        // Start value (stays INFINITE if i < k)

         // Try every possible size for the last bucket
         for (j = k-1; j <= i-1; j++)       // Last bucket is [j+1..i]
         {
            if ( BestError[k-1][j] + SqError(j+1,i) < BestError[k][i] )
            {
               BestError[k][i] = BestError[k-1][j] + SqError(j+1,i);   // Better division found

               min_index[k][i] = j+1;       // Remember where it starts   ********************
            }
         }
      }
   }

   /* ---------------------------------
      Print bucket boundaries
      (from the last bucket to the first; assumes B <= N)
      --------------------------------- */
   i = B;        // number of buckets still to print
   j = N;        // end point of the current last bucket

   while (i >= 2)
   {
      int end_point;

      end_point = j;
      j = min_index[i][j];       // start of the last bucket of the i-bucket histogram
      System.out.println("[" + j + " .. " + end_point + "]");
      j--;                       // the previous bucket ends just before it
      i--;
   }
   System.out.println("[" + 1 + " .. " + j + "]");
```
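For readers who want to run the algorithm, the following is a minimal Java sketch, offered as one possible transcription and not as the program from the original notes. The class name `VOptBuckets`, the method names, and the choice to return the buckets in left-to-right order (the pseudocode prints them last to first) are assumptions made for illustration. It records the start of the best last bucket for every (k, i) pair and then walks back from the full range.

```java
import java.util.ArrayList;
import java.util.List;

final class VOptBuckets {
    // Error of one bucket over values a..b (1-based, inclusive).
    static double sqError(double[] p, double[] pp, int a, int b) {
        double s1 = p[b] - p[a - 1];
        double s2 = pp[b] - pp[a - 1];
        return s2 - s1 * s1 / (b - a + 1);
    }

    // Returns the B buckets of a minimum-error histogram as {start, end}
    // pairs (1-based, inclusive), ordered from left to right.
    static List<int[]> buckets(double[] f, int B) {
        int n = f.length;
        if (B < 1 || B > n) {
            throw new IllegalArgumentException("need 1 <= B <= number of values");
        }
        double[] p = new double[n + 1];
        double[] pp = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            p[i] = p[i - 1] + f[i - 1];
            pp[i] = pp[i - 1] + f[i - 1] * f[i - 1];
        }

        double[][] best = new double[B + 1][n + 1];
        int[][] start = new int[B + 1][n + 1];   // start of the last bucket
        for (int i = 1; i <= n; i++) {
            best[1][i] = sqError(p, pp, 1, i);
            start[1][i] = 1;
        }
        for (int k = 2; k <= B; k++) {
            for (int i = 1; i <= n; i++) {
                best[k][i] = Double.POSITIVE_INFINITY;
                for (int j = k - 1; j < i; j++) {     // last bucket is j+1..i
                    double err = best[k - 1][j] + sqError(p, pp, j + 1, i);
                    if (err < best[k][i]) {
                        best[k][i] = err;
                        start[k][i] = j + 1;
                    }
                }
            }
        }

        // Walk back from the full range, peeling off one last bucket per step.
        List<int[]> result = new ArrayList<>();
        int end = n;
        for (int k = B; k >= 1; k--) {
            int s = start[k][end];
            result.add(0, new int[] {s, end});
            end = s - 1;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] f = {4, 2, 3, 6, 5, 6, 12, 16};
        for (int[] b : buckets(f, 3)) {
            System.out.println("[" + b[0] + " .. " + b[1] + "]");
        }
        // prints [1 .. 3], [4 .. 6], [7 .. 8] on separate lines
    }
}
```

The boundary table is indexed by both k and i because the best last bucket for the values 1..i generally differs between a 2-bucket and a 3-bucket histogram. For the worked example, the three buckets have errors 2.0, about 0.67, and 8.0, which add up to the 10.6 in the final 3-bucket table.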