Hashing: Static Methods
- Course Notes -
* Hashing is a method of Information Retrieval - typically used in database management systems and other systems in which rapid storage and retrieval of information is necessary.
* Typical problem - Search for a record/object in a database that is associated with some key
* Hashing is done with a hash function such that -
H: KEY------> INDEX or ADDRESS
* Hash function:
H(Primary Key) = External Key
The problem that occurs:
H(K1) = IK= H(K2)
That is, 2 different keys hash to the same external location! This is called a COLLISION.
* Hashing takes a potentially huge range of values and maps it to a much smaller range of values -
E.G.s -
1) Students here at Chico State -
Large potential # of SSNs -
9 digits, 10 choices each ---> 10^9 possible combinations
yet only about 15,000 actual students, so
H : SSN ----> Student File is "many to few"
2) All possible C++ identifiers (program names) - the compiler stores these in a symbol table, hashed for rapid retrieval:
If identifiers are 32 characters long, with a letter (all caps) first and alphanumerics (all caps) for the other 31:
26 * 36^31 possible identifiers
* Sample Hash Functions:
1) Division Method (MODULO arithmetic):
Note: '%' is the C++ MODULO operator
H: Key ----> Integer Index
E.g. - Table size of 100
3 Digit numbers are the keys
999 possible items
Indices 0..99 on the table
999 % 100 = 99 (100 is Table size)
524 % 100 = 24
199 % 100 = 99 (COLLISION)
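The division method above can be sketched in a few lines of C++. This is only an illustration: the function name divisionHash and the choice of unsigned long for the key are assumptions, not part of the notes.

```cpp
#include <cassert>
#include <cstddef>

// Division-method hash: maps a non-negative integer key to an index
// in the range 0..tableSize-1. The name divisionHash is illustrative only.
std::size_t divisionHash(unsigned long key, std::size_t tableSize) {
    assert(tableSize > 0);  // a table must have at least one slot
    return key % tableSize;
}
```

With a table size of 100, divisionHash(999, 100) and divisionHash(199, 100) both give 99, which is the collision shown above.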
2) Mid-Square Method - Concat, Square and Remove the Middle!
E.G. 32 character identifiers being hashed -
Table size [0..99]
A..Z ---> 1,2, ...26
0..9 ----> 27,...36
Identifier: CS1 ---> 3, 19, 28 (concat) = 31,928
(31,928)^2 = 1,019,397,184 - 10 digits
extract middle 2 digits (5th and 6th)
get 39, so:
H(CS1) = 39
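A minimal C++ sketch of the mid-square steps above, assuming short identifiers only (the squared value must fit in an unsigned long long). The names charCode and midSquareHash are hypothetical, and taking the two digits around the midpoint of the square is an assumption that matches the 10-digit case in the notes.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Character code from the notes: A..Z -> 1..26, 0..9 -> 27..36.
int charCode(char c) {
    const unsigned char u = static_cast<unsigned char>(c);
    if (std::isalpha(u)) return std::toupper(u) - 'A' + 1;
    assert(std::isdigit(u));  // only letters and digits are expected
    return c - '0' + 27;
}

// Mid-square hash for SHORT identifiers only: concatenate the codes,
// square the number, and return the two middle digits (0..99).
int midSquareHash(const std::string& id) {
    assert(!id.empty());
    std::string concat;
    for (char c : id) concat += std::to_string(charCode(c));
    const unsigned long long n = std::stoull(concat);
    const std::string sq = std::to_string(n * n);
    assert(sq.size() >= 2);
    const std::size_t mid = sq.size() / 2;  // for 10 digits: the 6th digit
    return (sq[mid - 1] - '0') * 10 + (sq[mid] - '0');
}
```

For the identifier CS1 this gives 39, matching the hand calculation. A full 32-character identifier would overflow this version, so a real symbol table would need a wider number type or a different way of combining the characters.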
3) Folding Method:
a) break key up into binary segments (ASCII)
b) XOR these together
c) Calculate the numeric integer equivalent
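One possible reading of the folding steps, sketched in C++: each character's 8-bit ASCII code is treated as one segment. The name foldHash and the final reduction modulo the table size are assumptions added so the result is a usable index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// XOR-folding: treat each character's 8-bit ASCII code as one segment,
// XOR all segments together, then reduce the result to a table index.
std::size_t foldHash(const std::string& key, std::size_t tableSize) {
    assert(tableSize > 0);
    unsigned char acc = 0;
    for (char c : key) acc ^= static_cast<unsigned char>(c);
    return acc % tableSize;  // assumed final step, not stated in the notes
}
```

Because XOR ignores order, keys made of the same characters collide: foldHash("AB", 100) and foldHash("BA", 100) both give 3.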
Hashing Examples:
1) Basic Division Method-
H(Key) = Key % 15,
Values to be hashed all > 0
0 indicates null value - or nothing there
Results after hashing 41, 58, 12, 92, 50 and 91:
Index Key
| 0 | 0 |
| 1 | 91 |
| 2 | 92 |
| 3 | 0 |
| 4 | 0 |
| 5 | 50 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |
| 10 | 0 |
| 11 | 41 |
| 12 | 12 |
| 13 | 58 |
| 14 | 0 |
This is a nice distribution of values, no collisions!
2) Same hash function -
Now with values 10, 20, 30, 40, 50, 60, 70 -
Index Value Overflow
| 0 | 30 | 60 (collision) |
| 1 | 0 | |
| 2 | 0 | |
| 3 | 0 | |
| 4 | 0 | |
| 5 | 20 | 50 (collision) |
| 6 | 0 | |
| 7 | 0 | |
| 8 | 0 | |
| 9 | 0 | |
| 10 | 10 | 40, 70 (collisions) |
| 11 | 0 | |
| 12 | 0 | |
| 13 | 0 | |
| 14 | 0 | |
Conclusion: % 15 is a BAD HASH FUNCTION for this particular set of values!
In general: Choose a prime table size - the prime nearest to 1.5 times the number of keys you are hashing! (Here: 7 keys, 1.5 * 7 = 10.5, so use 11.)
3) H(Key) = Key % 11
Same values as in last e.g. above:
Index Value
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 70 |
| 5 | 60 |
| 6 | 50 |
| 7 | 40 |
| 8 | 30 |
| 9 | 20 |
| 10 | 10 |
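The comparison between % 15 and % 11 can be checked with a small C++ sketch that counts collisions. The helper name countCollisions is hypothetical, and keys are assumed to be non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many keys land on an index that an earlier key already used
// under H(key) = key % tableSize. Keys are assumed non-negative.
std::size_t countCollisions(const std::vector<int>& keys, std::size_t tableSize) {
    assert(tableSize > 0);
    std::vector<bool> used(tableSize, false);
    std::size_t collisions = 0;
    for (int key : keys) {
        const std::size_t index = static_cast<std::size_t>(key) % tableSize;
        if (used[index]) ++collisions;   // slot already taken by an earlier key
        else used[index] = true;
    }
    return collisions;
}
```

For the values 10, 20, ..., 70 this reports 4 collisions with a table size of 15 and none with a table size of 11.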
* Handling Collisions - Techniques:
Two Major Strategies:
1) Open Addressing - Find another spot in the "Table" (same contiguous address space)
2) Chaining - Find another spot outside the "Table"
___________________________________________
*Open Addressing Techniques:
a) Linear Probing - search sequentially (with wrap-around) until you find the first vacant slot
b) Quadratic Probing -
Hashed value to index i - slot i is occupied!
1st try after i ---> try i + 1^2 = i + 1
2nd try after i ---> try i + 2^2
3rd try after i ---> try i + 3^2
(Always % tablesize, of course)
ETC.
c) Rehashing: When the spot is occupied, hash the original key again with a second hash function to find another spot in the table.
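Rehashing can be implemented in more than one way. The C++ sketch below uses one common variant (double hashing), in which the second hash function supplies the step size between probes. The functions h1 and h2 and the name insertRehash are assumptions for illustration, not the method prescribed by these notes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = 0;  // the notes use 0 to mark a vacant slot

// Hypothetical hash pair: h1 is the division method; h2 is an assumed
// second function that never returns 0, so the probe always moves.
std::size_t h1(int key, std::size_t size) {
    return static_cast<std::size_t>(key) % size;
}
std::size_t h2(int key, std::size_t size) {
    return 1 + static_cast<std::size_t>(key) % (size - 1);
}

// Inserts key (> 0); on a collision, steps by h2(key) until a vacant slot
// is found. Returns the index used, or -1 if no vacant slot was reached.
int insertRehash(std::vector<int>& table, int key) {
    const std::size_t size = table.size();
    assert(size > 1 && key > 0);
    std::size_t index = h1(key, size);
    const std::size_t step = h2(key, size);
    for (std::size_t tries = 0; tries < size; ++tries) {
        if (table[index] == EMPTY) {
            table[index] = key;
            return static_cast<int>(index);
        }
        index = (index + step) % size;  // wrap around the table
    }
    return -1;
}
```

In a table of size 11, the keys 13, 24 and 46 all hash to index 2 under h1, but their different h2 steps send them to slots 2, 7 and 9 rather than into one cluster. A prime table size ensures every step value can reach every slot.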
___________________________________________
* Chaining Techniques:
This technique "Chains" the item that collided to a location outside the "Table" - to another block of memory
(You'll do this with Dynamic, Extendible Hashing Techniques)
_____________________________________________
Problems with both Open Addressing and Chaining - an item that collided with many other items can take a very long search to find!
E.g.s -
1) Open Addressing with Linear Probing
Clustering can occur :
Suppose keys 160, 204, 219, 119, 412, 390, 263 are loaded and H is biased for returning 38-40!
Index Value Hash values
| 0 | | |
| 1 | | |
| 2 | | |
| .... | ..... | ..... |
| 38 | 160 | H(160)=38 |
| 39 | 204 | H(204)=38 |
| 40 | 219 | H(219)=38 |
| 41 | 119 | H(119)=39 |
| 42 | 412 | H(412)=39 |
| 43 | 390 | H(390)=39 |
| 44 | 263 | H(263)=40 |
| .... | ..... | ..... |
| size | | |
Conclusion: Clustering can occur due to a biased hash function with linear probing as a collision resolution technique!
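The clustering above can be reproduced with a short linear-probing sketch in C++. Since the notes do not give the biased hash function, the home index is passed in directly. The name linearProbeInsert is hypothetical, and the table size is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Places key at the first vacant slot at or after its home index,
// wrapping around at the end of the table. 0 marks a vacant slot.
// Returns the index used, or -1 if the table is full.
int linearProbeInsert(std::vector<int>& table, int key, std::size_t home) {
    const std::size_t size = table.size();
    assert(home < size && key > 0);
    for (std::size_t i = 0; i < size; ++i) {
        const std::size_t index = (home + i) % size;
        if (table[index] == 0) {
            table[index] = key;
            return static_cast<int>(index);
        }
    }
    return -1;
}
```

Loading 160, 204, 219 (home 38), then 119, 412, 390 (home 39), then 263 (home 40) into an empty table of at least 45 slots fills indices 38 through 44 in that order, so finding 263 takes five probes even though its home slot is 40.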
2) Quadratic Probing:
H(Key) = Key %11, Hashing values 13, 3, 24, 46, 90:
Index Value
| 0 | 46 |
| 1 | |
| 2 | 13 |
| 3 | 3 |
| 4 | |
| 5 | |
| 6 | 24 |
| 7 | 90 |
| 8 | |
| 9 | |
| 10 | |
Note the wraparound calculations!
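A minimal C++ sketch of quadratic probing that reproduces the table above. The name quadraticInsert is an assumption, and 0 is used as the vacant marker as in the earlier examples.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inserts key (> 0) using H(key) = key % table.size() and quadratic probing:
// the j-th try after a collision at home index i is (i + j*j) % size.
// 0 marks a vacant slot. Returns the index used, or -1 if none of the
// probed slots was vacant (possible even when the table is not full).
int quadraticInsert(std::vector<int>& table, int key) {
    const std::size_t size = table.size();
    assert(size > 0 && key > 0);
    const std::size_t home = static_cast<std::size_t>(key) % size;
    for (std::size_t j = 0; j < size; ++j) {
        const std::size_t index = (home + j * j) % size;
        if (table[index] == 0) {
            table[index] = key;
            return static_cast<int>(index);
        }
    }
    return -1;
}
```

With a table of size 11, inserting 13, 3, 24, 46, 90 in that order places them at indices 2, 3, 6, 0 and 7. For 46, the probes are 2, 3, 6 and then (2 + 9) % 11 = 0, which is the wraparound.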
Also, quadratic probing may never "get anywhere"-
H(K) = K % 8 ......
Index Value
| 0 | |
| 1 | |
| 2 | |
| 3 | X - Initial hash position, plus 4th, 8th, 12th.... probes |
| 4 | 1st, 3rd, 5th, 7th, 9th.... probes after collision at position 3 |
| 5 | |
| 6 | |
| 7 | 2nd, 6th, 10th.... probes after collision at position 3 |
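To see which slots a quadratic probe sequence can ever reach, the following C++ sketch collects the distinct indices (i + j*j) % size. The helper name reachableSlots is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Returns the distinct indices visited by the probe sequence
// (home + j*j) % size for j = 0, 1, ..., size-1. Larger j only repeats
// these, since (j + size)^2 and j^2 leave the same remainder mod size.
std::set<std::size_t> reachableSlots(std::size_t home, std::size_t size) {
    assert(size > 0 && home < size);
    std::set<std::size_t> slots;
    for (std::size_t j = 0; j < size; ++j) {
        slots.insert((home + j * j) % size);
    }
    return slots;
}
```

For a table of size 8 and home position 3, only slots 3, 4 and 7 are ever probed, as in the table above. For a table of size 11 and home position 2, six different slots are reachable, which is one reason a prime table size is preferred.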
______________________________________________________________
Also, keys can be hashed to "Buckets" - this is more like the database situation, where a "Bucket" is the size of a disk block that can fit, say, n records/objects of size k each:
E.g. - A Bucket Size of 3, H(K) = K%10
Index Slot 1 Slot 2 Slot 3
| 0 | record with key 400 | 310 | 20 |
| 1 | record with key 501 | 211 | Empty |
| ... | | | |
| ... | | | |
| 9 | record with key 89 | Empty | Empty |
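A bucket table like the one above might be sketched in C++ as follows. Each record is reduced to its integer key, 0 marks an empty slot, and the names BUCKET_SIZE, Bucket and bucketInsert are assumptions. What to do when a bucket overflows is left open.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t BUCKET_SIZE = 3;        // records per bucket, as in the example
using Bucket = std::array<int, BUCKET_SIZE>;  // 0 marks an empty slot

// Stores key in the first empty slot of bucket key % table.size().
// Returns false when that bucket is already full (overflow is not handled).
bool bucketInsert(std::vector<Bucket>& table, int key) {
    assert(!table.empty() && key > 0);
    Bucket& bucket = table[static_cast<std::size_t>(key) % table.size()];
    for (int& slot : bucket) {
        if (slot == 0) {
            slot = key;
            return true;
        }
    }
    return false;
}
```

With 10 buckets, the keys 400, 310 and 20 fill bucket 0. A fourth key such as 50 would also hash to bucket 0, and bucketInsert would return false for it.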
How does this contrast with chaining?
E.g. of Chaining -
Index Hashtable Chains
0 400 ---->310 ----->20 ----->50
1 501 ---->211
.....
9 89
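For contrast with buckets, here is a minimal chaining sketch in C++. As a simplification, every key lives in a linked list hanging off its index, rather than the first key sitting in the table itself. The names ChainedTable, chainInsert and chainContains are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// One linked list (chain) per table index.
using ChainedTable = std::vector<std::list<int>>;

// Appends key to the chain at index key % table.size(). A chain can grow
// without limit, so an insert never fails the way a full bucket does.
void chainInsert(ChainedTable& table, int key) {
    assert(!table.empty() && key >= 0);
    table[static_cast<std::size_t>(key) % table.size()].push_back(key);
}

// Searches only the one chain that key hashes to.
bool chainContains(const ChainedTable& table, int key) {
    assert(!table.empty() && key >= 0);
    for (int stored : table[static_cast<std::size_t>(key) % table.size()]) {
        if (stored == key) return true;
    }
    return false;
}
```

Inserting 400, 310, 20 and 50 into a table of 10 chains gives a chain of length 4 at index 0. The fourth key fits, where a bucket of size 3 would have overflowed, but a search for 50 now has to walk past three other keys.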
*******************************************
Hash Table General Class Methods: