### Indexing based on Hashing

• Hash Function:

 Hash function = a function that maps a search key to an index in the range [0 .. B-1] (where B = the size of the hash table)
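For concreteness, one simple way to map a string key into [0 .. B-1] is to sum its character codes modulo B (a toy hash function for illustration only; real systems use stronger functions):

```python
B = 8  # B = the number of buckets in the hash table

def h(key: str) -> int:
    """Toy hash function: sum of character codes, modulo B."""
    return sum(ord(c) for c in key) % B

print(h("Smith"))  # → 5, an index in the range 0..B-1
```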

• Bucket:

 Bucket = a location (slot) in the bucket array

• A note on the hashing technique:

 Different search keys can be hashed into the same hash bucket !!!

• Hashing used as an indexing technique

• How to use Hashing as an indexing technique to find records stored on disk:

• The Bucket Array:

 Each bucket in the bucket array contains (search key, block/record pointer) mappings

• The bucket array is stored on disk:

 1 bucket = 1 disk block

 1 bucket will contain n (search key, block ptr) pairs !!!

• Storing hash tables on disk

• Recall that a file entry in a directory contains a list of block numbers:

 ```
                                    B block numbers
                                 <------------------->
 +-----------+-------------------+-+-+-+-+-+ ...... +-+
 | Filename  | owner/access info | | | | | |        | |
 +-----------+-------------------+-+-+-+-+-+ ...... +-+
 ```

• Each data block of a Hash index file is one bucket:

 The Hash index file has B buckets

• How to find data records using a Hash Index

• Sample hash index file:

• How to use a Hash index to find the data record with search key W:

• Steps to find the data record containing the search key W:

1. Compute the hash value h(W)

2. Read the disk block(s) of bucket h(W) into memory

3. Search the bucket h(W) (a linear search algorithm is sufficient because in-memory search is fast) for:

 ``` W, RecordPtr(W) ```

4. Use RecordPtr(W) to access W on disk
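The four steps can be sketched with a small in-memory simulation (illustrative only: each bucket is a chain of fixed-capacity "blocks" of (search key, record pointer) pairs; `lookup`, `BLOCK_SIZE`, the toy hash, and the pointer value 1234 are all made up):

```python
B = 4            # number of buckets
BLOCK_SIZE = 2   # (key, ptr) pairs that fit in one simulated disk block

def h(key: str) -> int:
    return sum(ord(c) for c in key) % B    # step 1: compute the hash value

# bucket i = a chain of blocks; each block holds (search key, recordPtr) pairs
buckets = [[[]] for _ in range(B)]
buckets[h("W")][0].append(("W", 1234))     # pretend 1234 points at W's record

def lookup(key):
    for block in buckets[h(key)]:          # step 2: read each block of bucket h(key)
        for k, record_ptr in block:        # step 3: linear search within the block
            if k == key:
                return record_ptr          # step 4: follow this pointer to the record
    return None

print(lookup("W"))  # → 1234
```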

• Inserting into a Hash index

• Insert Algorithm:

 ```
 Insert( K, recordPtr(K) )
 {
    k = h(K);      // Hash the key

    Find the hash bucket k   (= a CHAIN of disk blocks !!!)

    if ( hash bucket has space )
    {
       insert (K, recordPtr(K)) in bucket;
    }
    else
    {
       allocate an overflow block;
       insert (K, recordPtr(K)) in overflow block;
    }
 }
 ```
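As a runnable counterpart to the pseudocode (the same in-memory simulation idea: each bucket is a chain of fixed-capacity blocks; `BLOCK_SIZE` and the toy hash function are assumptions for illustration):

```python
B = 4
BLOCK_SIZE = 2   # (key, ptr) pairs per simulated disk block

def h(key: str) -> int:
    return sum(ord(c) for c in key) % B

buckets = [[[]] for _ in range(B)]     # bucket = a chain of blocks

def insert(key, record_ptr):
    chain = buckets[h(key)]            # find the hash bucket (a chain of blocks)
    for block in chain:
        if len(block) < BLOCK_SIZE:    # a block in the bucket has space
            block.append((key, record_ptr))
            return
    chain.append([(key, record_ptr)])  # no space: allocate an overflow block

for i, k in enumerate(["aa", "ae", "ai"]):  # all three keys hash to bucket 2
    insert(k, i)

print(len(buckets[2]))  # → 2: the primary block plus one overflow block
```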

• Deleting from a Hash table

• Delete Algorithm:

 ```
 Delete( K )
 {
    k = h(K);      // Hash the key

    Find the hash bucket k   (a CHAIN of disk blocks !!!)

    if ( K is found in hash bucket )
    {
       Delete (K, recordPtr(K)) from bucket;
    }

    Optional: Consolidate overflow blocks of a bucket into fewer blocks
 }
 ```
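A matching sketch of delete with the optional consolidation step (again an in-memory simulation; the starting bucket contents are made up so that an overflow block can be seen being reclaimed):

```python
B = 4
BLOCK_SIZE = 2

def h(key: str) -> int:
    return sum(ord(c) for c in key) % B

# bucket 2 starts with a full primary block plus one overflow block
buckets = [[[]] for _ in range(B)]
buckets[2] = [[("aa", 0), ("ae", 1)], [("ai", 2)]]

def delete(key):
    chain = buckets[h(key)]   # find the hash bucket (a chain of blocks)
    pairs = [p for block in chain for p in block if p[0] != key]
    # optional: consolidate the remaining pairs into as few blocks as possible
    buckets[h(key)] = [pairs[i:i + BLOCK_SIZE]
                       for i in range(0, len(pairs), BLOCK_SIZE)] or [[]]

delete("ae")       # h("ae") = 2
print(buckets[2])  # → [[('aa', 0), ('ai', 2)]]: the overflow block is gone
```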

Delete oscillation caveat:

• Consolidation must be done with care because:

• Inserts after deletes can cause oscillation !!!

Oscillation:

 Insert: a block splits
 Delete: 2 blocks merge
 Insert: a block splits again
 Delete: 2 blocks merge again
 ....

( Oscillation can result in poor performance !!! )

• Performance of Hash index

• Fact:

• The block numbers in the index file form an array of block pointers that is:

 small enough to be stored entirely in main memory

Therefore:

 We disregard the access time to a block pointer

• Performance of a Hash index:

• Suppose we look up using key x:

 1. We read the first index block into memory: the search for key x fails....

 2. We read the next index block into memory: the search for key x succeeds

 3. We use the corresponding block/record pointer to access the data

• Conclusion:

 Performance of a hash index depends on the number of overflow blocks used....

• Ideal hash index

• The ideal hash index:

 No overflow blocks in use....

• Performance:

 Total: 2 disk block accesses to look up any record !!!
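The cost claim follows from simple counting (a rough model, assuming the bucket-array block pointers are held in memory as stated above; `lookup_cost` is an illustrative name):

```python
def lookup_cost(overflow_blocks_read: int) -> int:
    """Disk block accesses for one lookup: the bucket's primary block,
    any overflow blocks scanned, plus the data block holding the record."""
    return 1 + overflow_blocks_read + 1

print(lookup_cost(0))  # → 2: the ideal hash index (no overflow blocks)
print(lookup_cost(3))  # → 5: performance degrades with each overflow block
```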

• Hash index vs ordered index (B+-tree)

• Hash index strengths and weaknesses:

• Hash index is generally faster (fewer disk accesses) than an ordered index (B+-tree)

• Hash index cannot support ordered searches, like:

 ``` SELECT fname, lname FROM employee WHERE lname > 'Smith' ```

• Ordered index strength over a hash index:

• An ordered index (B+-tree) can support ordered searches, like:

 ``` SELECT fname, lname FROM employee WHERE lname > 'Smith' ```

• Problem with Hash Tables

• Problem:

• When many keys are inserted into the hash table, we will have many overflow blocks

 Overflow blocks will require more disk block read operations and slow down performance

• We can reduce the # of overflow blocks by increasing the size (= B) of the hash table

• The Hash Table size (= B) is very hard to change

• Changing the hash table size will usually require:

 Re-hashing all keys in the existing hash table into a new table (with the new size) !!!
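The resizing cost can be seen in a short sketch: growing the table from B buckets to a new size forces every existing (key, ptr) entry to be re-hashed (toy hash function; buckets simplified to flat lists for brevity):

```python
def h(key: str, B: int) -> int:
    return sum(ord(c) for c in key) % B   # toy hash, parameterized by table size

def resize(buckets, new_B):
    """Re-hash every key from the old table into a new table of new_B buckets."""
    new_buckets = [[] for _ in range(new_B)]
    for bucket in buckets:
        for key, ptr in bucket:
            new_buckets[h(key, new_B)].append((key, ptr))  # every entry moves
    return new_buckets

old = [[] for _ in range(4)]
for i, k in enumerate(["aa", "ab", "ba", "bb"]):
    old[h(k, 4)].append((k, i))

new = resize(old, 8)   # all 4 entries are re-hashed into the bigger table
```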

• Solution: Dynamic hashing:

 Dynamic hashing = a hashing technique that allows the size of the hash table to change at relatively low cost