### Using a Partitioned Hashing

• Using a Partitioned Hashing (with commonly used multi-dim queries)

• Partial Match queries

 The query specifies conditions on some dimensions but not on all dimensions

Example:

• Find people with age = 50 and salary = unspecified

We can only hash the age value (⇒ 0) and must use every possible hash value for salary:

Explanation:

• Age = 50 will hash to the hash value:

 ``` Hash(age) = 0xx ```

• We start at bucket 000 and scan to bucket 011

Conclussion:

 Partial Hash index can be used to reduce the search space The amount of search space that is reduced will depend on the sizes (# bits used) of the hash values

• Range queries

• Find people such that:

 ``` 35 ≤ age ≤ 50 ⇒ age = 35, 36, 37, ..., 50 50K ≤ salary ≤ 100 ⇒ salary = 50K, 51K, 52K, ..., 100K ```

• Solution:

• Hash all values inside the specified range and find their bucket (block) pointers:

 ``` hash(35, 50K) --> block pointer 1 hash(36, 50K) --> block pointer 2 .... hash(50, 50K) And so on: (35, 55K) (36, 55K), .... (50, 55K) ... (35, 100K) (36, 100K), .... (50, 100K) ```

Graphically:

Note: the block pointers can have duplicates !!!

• Collect all the bucket pointers

(I.e.: eliminate any duplicate bucket (block) pointers !!!)

• Access all (unique) buckets (disk blocks)

Comment:

 Hashing is in general not appropriate for range queries because a hash function does not preserve the "closeness" of values

• Nearest neighbor queries:

• Hashing is completely useless for nearest neighbor type queries

Because:

 The is no notion of distance in the hash function !!!!

• Example: find records that with distance ≤ 1 to search key = 1:

• We hash the search key 1:

• However, we cannot use the distance in the hash table to locate "nearby" objects (records) because:

The value 2 is near the value 1, but may get hash very far away !!!

• Property of hashing:

 Closeness of bucket indexes has nothing to do with real distance between data points (because hashing computes a random number) !!!!

• Where-am-I queries:

 Hashing is also not useful here either.... (Because hashing provide no information on distance)

• Advantage of Partitioned Hashing over the grid index

• Good hash functions will randomize the records:

 Partitioned hashing will achieve good occupancy rate per bucket

• A major problem with Grid Index files is:

• Poor occupancy rate at many grid buckets:

(Especially when you have 3 or more dimensions.

You will have many buckets that are empty !!!)