### Motivation for developing multi-dimensional indexes

• Are muliti-dimensional indexes necessary ???

• Question:

 Can our one-dimensional index technique support geometrical (2-dimensional) queries efficiently ???

• Case study:

 Let's try to process a range query using a B-tree index.....

• Database and query description

• Database: object locations

 ``` Object( x, y, other-attributes ) // x and y are the // coordinates of the object ```

• Query:

• Find all objects that lies with in a rectangle:

• Processing the geometrical query using a B-tree index

• Suppose we have B+-tree indexes on:

 The x-coordinate attribute of Object       and        The y-coordinate attribute of Object

• The B+-tree on the x-coordinate information looks like this:

(The point with the smallest x-coordinate value is the left-most leaf key)

• The B+-tree on the y-coordinate information looks like this:

(The point with the smallest y-coordinate value is the left-most leaf key)

• Range query:

• Find all points x such that:

 xL ≤ x ≤ xH          and          yL ≤ y ≤ yH

Graphically:

• How to use the B+-tree indexes to process the range query:

1. Use the x-B+-tree index and find the first value that is ≥ xL:

Traverse the leaf nodes to find all record pointers for which:

 ``` xL ≤ x ≤ xH ```

Graphically:

2. Do the same for the y-coordinate:

Graphically:

3. Compute the intersection of the 2 pointer sets:

4. Retrieve the records using the record pointers in the intersection.

• These records are guarantee to satisfy:

 xL ≤ x ≤ xH          and          yL ≤ y ≤ yH

• Sounds good.... no ???

 You will see that this is useless This solution is not faster than scanning the entire relation !!!

• A concrete example....

• Consider the following situation:

• We have a 1000×1000 grid

• There are 1,000,000 points on the grid --- we assume a uniform distribution of the points over the grid

Query:

 Find all the points (x,y) within the square where 450 ≤ x ≤ 550 and 450 ≤ y ≤ 550

Graphically:

• First, let's look at some statistics about th einputs:

 ``` Total area of the grid = 1000 × 1000 = 1,000,000 Selected area = 100 × 100 = 10,000 = 0.01 × Total area Total # points in the grid = 1,000,000 # points in selected area ~= 0.01 × 1,000,000 = 10,000 Other statistics: # points with x-coordinate in [450,550] ~= 0.1 × 1,000,000 = 100,000 # points with y-coordinate in [450,550] ~= 0.1 × 1,000,000 = 100,000 ```

• Assumptions on storage information (needed to compute the processing cost = # disk blocks accessed)

• 1 disk block contains 100 points

• 1 B+-tree block (node) contains an average of 200 (key, ptr) pairs

 The internal nodes of the B+-tree have all been read (and stored) in memory The leaf nodes of the B+-tree (= the index file) are stored on disk

• Computing the processing cost (= # disk blocks accessed) to find all point in the selected area:

1. Use the x-coord B+-tree index and find the first value that is ≥ xL:

Cost to find the database address of the B+ leaf node with xL = 450:

 Cost = 0 (Because we assume that the internal nodes of the B+ tree index are stored in memory)

2. Starting at the B+ node containing the search key 450:

• Traverse the leaf nodes to find all record pointers for which:

 ``` 450 ≤ x ≤ 550 ```

Graphically:

We will access approximately 0.1 × 1,000,000 = 100,000 search keys

The cost to access 100,000 search keys (stored in leaf nodes of the B+-tree) is:

• # B+ leaf nodes that we must access:

 ``` Read: 100,000 search keys (200 search keys per leaf node) = 500 leaf nodes (= index blocks) ```

• Cost to access the x-coord B+-tree    =    500 disk block

3. We must do the same for the y-coordinate:

Graphically:

So: cost to access the y-coord B+-tree    =    500 disk block

4. Compute the intersection of the 2 "search key" sets:

Cost of this step:

 Cost = 0 Because no disk access is required

The intersection will contain approximately 10,000 (= 0.01 × 1,000,000) record pointers (database addresses) to points (records)

5. We must then retrieve the point records using the 10,000 record pointers in the intersection:

• Storage assumption:

 We assume the records are stored randomly (i.e., not ordered by the x or y coordinate) Different records will likely be stored in different blocks

• Then: Accessing the 10,000 records using the record pointers will result in:

 Accessing 10,000 (disk) blocks

6. Total number of disk blocks accessed:

 ``` 500 + 500 + 10,000 = 11,000 disk blocks ^ ^ ^ | | | | | +--- accessing the point records | +------------ accessing the y-coord leaf nodes +------------------ accessing the x-coord leaf nodes ```

Comment:

 The number index disk blocks accessed is usually much smaller than the number data disk blocks accessed !!! We can often discard the index blocks if we want to approximate the number of disk blocks accessed by an operation !!!

Alternate solution

• Suppose:

 We do not have any index on the points records....

How to find all point in the selected region:

 ``` while ( not EOF ) { Read next data block; // Scan ALL data blocks !! for ( each point (x,y) ∈ data block ) { if ( 450 ≤ x ≤ 550 && 450 ≤ y ≤ 550 ) Output (x,y); } } ```

• The cost of finding the points by scanning the entire relation is:

 ``` There are 1,000,000 points 1 disk block stores 100 points # disk blocks used = 1,000,000 / 100 = 10,000 blocks ```

So we would need access:

 ``` 10,000 disk blocks.... (Compare this number to 11,000 blocks when using B+-trees !!) ```

So using the B-tree index does not help us improve performance !!!

• Conclusion:

• We cannot store geographically "related" data randomly

 If related geographical data is store randomly, we will need to access too many data blocks !!!

Lesson:

 We must store geographically "related" data (i.e.: points that are close to each other) in the same data block !!!