Hashing: Static Methods
- Course Notes -
* Hashing is a method of Information Retrieval - typically used in database management systems and other systems in which rapid storage and retrieval of information is necessary.
* Typical problem - Search for a record/object in a database that is associated with some key
* Hashing is done with a hash function such that -
H: KEY------> INDEX or ADDRESS
* Hash function:
H(Primary Key) = External Key
The problem that occurs:
H(K1) = IK = H(K2), where K1 and K2 are different keys
That is, 2 different keys hash to the same external location! This is called a COLLISION.
* Hashing takes a potentially huge range of values and maps it to a much smaller range of values -
E.G.s -
1) Students here at
Large potential # of SSNs -
9 digits ---> 10^9 possible combinations
yet only about 15,000 actual students, so
H : SSN ----> Student File is "many to few"
2) All possible C++ identifiers (program names) - the compiler stores these in a symbol table by hashing for rapid retrieval:
If limited to 32 characters, with alpha only for the first character (all caps) and alphanumeric (all caps) for the others, the number of 32-character identifiers is:
26*(36)^31
* Sample Hash Functions:
1) Division Method (MODULO arithmetic):
Note: '%' is the C++ MODULO operator
H: Key ----> Integer Index
E.g. - Table size of 100
3 Digit numbers are the keys
999 possible items
Indices 0..99 on the table
999 % 100 = 99 (100 is Table size)
524 % 100 = 24
199 % 100 = 99 (COLLISION)
2) Mid-Square Method - Concat, Square and Remove the Middle!
E.G. 32 character identifiers being hashed -
Table size [0..99]
A..Z ---> 1,2, ...26
0..9 ----> 27,...36
Identifier: CS1 ---> C=3, S=19, 1=28 ---> concatenated: 31,928
(31,928)^2 = 1,019,397,184 - 10 digits
extract middle 2 digits (5th and 6th)
get 39, so:
H(CS1) = 39
3) Folding Method:
a) break key up into binary segments (ASCII)
b) XOR these together
c) Calculate the numeric integer equivalent
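The three sample hash functions above can be sketched in C++ roughly as follows. This is only an illustration under some assumptions: the function names are made up for these notes, the mid-square version is fixed to a table of size 100 and to short identifiers, and reducing the folded value with % tableSize is an added step that the notes do not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 1) Division method: H(key) = key % tableSize.
std::size_t divisionHash(std::uint64_t key, std::size_t tableSize) {
    assert(tableSize > 0);
    return static_cast<std::size_t>(key % tableSize);
}

// Character codes from the notes: A..Z -> 1..26, 0..9 -> 27..36.
// Any other character maps to 0 (an assumption; the notes do not cover it).
int charCode(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A' + 1;
    if (c >= '0' && c <= '9') return c - '0' + 27;
    return 0;
}

// 2) Mid-square method for a table of size 100: concatenate the character
// codes, square the result, and extract the middle two digits.
// Only valid for short identifiers whose squared code fits in 64 bits.
std::size_t midSquareHash(const std::string& id) {
    assert(!id.empty());
    std::string concat;
    for (char c : id) concat += std::to_string(charCode(c));
    const std::uint64_t n = std::stoull(concat);        // "CS1" -> 31928
    const std::string squared = std::to_string(n * n);  // 31928^2 = 1019397184
    if (squared.size() < 2) return static_cast<std::size_t>(squared[0] - '0');
    const std::size_t mid = squared.size() / 2;  // for 10 digits: 5th and 6th
    return static_cast<std::size_t>((squared[mid - 1] - '0') * 10 +
                                    (squared[mid] - '0'));
}

// 3) Folding method: treat each character as an 8-bit ASCII segment and XOR
// the segments together. Reducing with % tableSize is an added assumption.
std::size_t foldingHash(const std::string& key, std::size_t tableSize) {
    assert(tableSize > 0);
    unsigned int folded = 0;
    for (unsigned char c : key) folded ^= c;
    return folded % tableSize;
}
```

With these definitions, divisionHash(199, 100) and divisionHash(999, 100) both give 99 (the collision shown above), and midSquareHash("CS1") gives 39.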
Hashing Examples:
1) Basic Division Method-
H(Key) = Key % 15,
Values to be hashed all > 0
0 indicates null value - or nothing there
Results after hashing 41, 58, 12, 92, 50 and 91:
Index | Key
0     | 0
1     | 91
2     | 92
3     | 0
4     | 0
5     | 50
6     | 0
7     | 0
8     | 0
9     | 0
10    | 0
11    | 41
12    | 12
13    | 58
14    | 0
This is a nice distribution of values, no collisions!
2) Same hash function -
Now with values 10, 20, 30, 40, 50, 60, 70 -
Index | Value | Overflow
0     | 30    | 60 (collision)
1     | 0     |
2     | 0     |
3     | 0     |
4     | 0     |
5     | 20    | 50 (collision)
6     | 0     |
7     | 0     |
8     | 0     |
9     | 0     |
10    | 10    | 40, 70 (collisions)
11    | 0     |
12    | 0     |
13    | 0     |
14    | 0     |
Conclusion: % 15 is a BAD HASH FUNCTION for this particular set of values!
In general: Choose a table size that is the prime number nearest to 1.5 times the size of the data set you are hashing! Here there are 7 values, 1.5 * 7 = 10.5, and the nearest prime is 11.
3) H(Key) = Key % 11
Same values as in last e.g. above:
Index | Value
0     | 0
1     | 0
2     | 0
3     | 0
4     | 70
5     | 60
6     | 50
7     | 40
8     | 30
9     | 20
10    | 10
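The three examples above can be checked with a small C++ sketch that counts how many keys hash to an index that an earlier key already hashed to. The function name countCollisions is made up for illustration; it only counts collisions and does not resolve them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count the keys that hash to an index already used by an earlier key,
// with H(key) = key % tableSize.
std::size_t countCollisions(const std::vector<int>& keys, std::size_t tableSize) {
    assert(tableSize > 0);
    std::vector<std::size_t> hits(tableSize, 0);  // keys seen per index
    std::size_t collisions = 0;
    for (int key : keys) {
        assert(key > 0);  // the notes reserve 0 for "nothing there"
        const std::size_t index = static_cast<std::size_t>(key) % tableSize;
        if (hits[index] > 0) ++collisions;
        ++hits[index];
    }
    return collisions;
}
```

For the keys 10, 20, ..., 70 this reports 4 collisions with a table size of 15 (60, 50, 40 and 70) and none with a table size of 11, which matches the tables above.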
* Handling Collisions - Techniques:
Two Major Strategies:
1) Open Addressing - Find another spot in the "Table" (same contiguous address space)
2) Chaining - Find another spot outside the "Table"
___________________________________________
* Open Addressing Techniques:
a) Linear Probing - search sequentially (with wrap-around) until you find the first vacant slot
b) Quadratic Probing -
Key hashes to index i, but slot i is occupied!
1st try after i ---> try i + 1 (that is, i + 1^2)
2nd try after i ---> try i + 2^2
3rd try after i ---> try i + 3^2
ETC.
(Always % tablesize, of course)
c) Rehashing: When the spot is occupied, hash the original key again with a second hash function to find another spot in the table.
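A minimal C++ sketch of the three open addressing techniques, under some assumptions: the table is a vector of positive integer keys with 0 marking an empty slot, all function names are made up for illustration, and the rehashing variant shown is double hashing with a hypothetical second hash function (1 + key % (tableSize - 1)), which is one common choice and not necessarily the one intended in these notes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Slot examined on a given attempt; attempt 0 is the home slot itself.
std::size_t linearProbe(std::size_t home, std::size_t attempt, std::size_t tableSize) {
    return (home + attempt) % tableSize;  // i, i+1, i+2, ... with wrap-around
}

std::size_t quadraticProbe(std::size_t home, std::size_t attempt, std::size_t tableSize) {
    return (home + attempt * attempt) % tableSize;  // i, i+1^2, i+2^2, ...
}

// Rehashing sketch (double hashing): a second, hypothetical hash function
// picks the step size. The step is never 0, so every probe moves.
std::size_t rehashProbe(int key, std::size_t attempt, std::size_t tableSize) {
    assert(key > 0 && tableSize > 1);
    const std::size_t k = static_cast<std::size_t>(key);
    const std::size_t home = k % tableSize;
    const std::size_t step = 1 + k % (tableSize - 1);
    return (home + attempt * step) % tableSize;
}

// Insert a positive key with linear probing; 0 marks an empty slot.
// Returns the slot used, or table.size() if the table is full.
std::size_t insertLinear(std::vector<int>& table, int key) {
    assert(key > 0 && !table.empty());
    const std::size_t size = table.size();
    const std::size_t home = static_cast<std::size_t>(key) % size;
    for (std::size_t attempt = 0; attempt < size; ++attempt) {
        const std::size_t slot = linearProbe(home, attempt, size);
        if (table[slot] == 0) {
            table[slot] = key;
            return slot;
        }
    }
    return size;  // every slot was occupied
}
```

For example, inserting 10, 40 and 70 into an empty table of size 15 with insertLinear places them in slots 10, 11 and 12, since all three keys hash to 10.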
___________________________________________
* Chaining Techniques:
This technique "Chains" the item that collided to a location outside the "Table" - to another block of memory
(You'll do this with Dynamic, Extendible Hashing Techniques)
_____________________________________________
Problem with both Open Addressing and Chaining - searches can be very long for an item that collided with many other items!
E.g.s -
1) Open Addressing with Linear Probing
Clustering can occur :
Suppose keys 160, 204, 219, 119, 412, 390, 263 are loaded in that order and H is biased toward returning 38-40!
Index | Value | Hash values
0     |       |
1     |       |
2     |       |
....  | ..... | .....
38    | 160   | H(160)=38
39    | 204   | H(204)=38
40    | 219   | H(219)=38
41    | 119   | H(119)=39
42    | 412   | H(412)=39
43    | 390   | H(390)=39
44    | 263   | H(263)=40
....  | ..... | ......
size  |       |
Conclusion: Clustering can occur due to a biased hash function with linear probing as a collision resolution technique!
2) Quadratic Probing:
H(Key) = Key % 11, Hashing values 13, 3, 24, 46, 90:
Index | Value
0     | 46
1     |
2     | 13
3     | 3
4     |
5     |
6     | 24
7     | 90
8     |
9     |
10    |
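Note the wraparound calculations! 46 hashes to 2 and finally lands at (2 + 3^2) % 11 = 0; 90 hashes to 2 and finally lands at (2 + 4^2) % 11 = 7.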
Also, quadratic probing may never "get anywhere"-
H(K) = K % 8 ......
Index | Value
0     |
1     |
2     |
3     | X - Initial hash position, plus 4th, 8th, 12th.... probes
4     | 1st, 3rd, 5th, 7th, 9th.... probes after collision at position 3
5     |
6     |
7     | 2nd, 6th, 10th.... probes after collision at position 3
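Both quadratic probing examples above can be reproduced with a short C++ sketch. It assumes positive integer keys with 0 marking an empty slot, and the names insertQuadratic and reachableSlots are made up for illustration. Since (j + size)^2 and j^2 leave the same remainder when divided by size, the probe sequence repeats after size attempts, so the loops stop there.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Insert a positive key using H(key) = key % size and probes home + j*j.
// 0 marks an empty slot. Returns the slot used, or table.size() if no probe
// finds a free slot (which can happen even when the table is not full).
std::size_t insertQuadratic(std::vector<int>& table, int key) {
    assert(key > 0 && !table.empty());
    const std::size_t size = table.size();
    const std::size_t home = static_cast<std::size_t>(key) % size;
    for (std::size_t j = 0; j < size; ++j) {
        const std::size_t slot = (home + j * j) % size;  // wrap-around
        if (table[slot] == 0) {
            table[slot] = key;
            return slot;
        }
    }
    return size;
}

// All distinct slots that quadratic probing can ever visit from a home slot.
std::set<std::size_t> reachableSlots(std::size_t home, std::size_t size) {
    assert(size > 0);
    std::set<std::size_t> slots;
    for (std::size_t j = 0; j < size; ++j) slots.insert((home + j * j) % size);
    return slots;
}
```

Inserting 13, 3, 24, 46, 90 into an empty table of size 11 fills slots 2, 3, 6, 0 and 7, as in the first table. For a table of size 8 and home slot 3, reachableSlots returns only 3, 4 and 7, which is why the probing never "gets anywhere" once those three slots are occupied.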
______________________________________________________________
Can also hash to "Buckets" - more like the database situation, where a "Bucket" is the size of a disk block that can hold, say, n records/objects of size k each:
E.g. - A Bucket Size of 3, H(K) = K%10
Index | Slot 1              | Slot 2 | Slot 3
0     | record with key 400 | 310    | 20
1     | record with key 501 | 211    | Empty
...   |                     |        |
...   |                     |        |
9     | record with key 89  | Empty  | Empty
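The bucket example above might be sketched in C++ as follows. This is only an illustration: it stores bare integer keys instead of whole records, uses -1 as a made-up marker for an empty slot, and simply reports failure when a bucket is full instead of handling the overflow.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBucketSize = 3;  // slots per bucket, as in the example
constexpr int kEmpty = -1;              // marker for an unused slot (assumption)

using Bucket = std::array<int, kBucketSize>;

// Build a table of bucketCount buckets with every slot empty.
std::vector<Bucket> makeTable(std::size_t bucketCount) {
    Bucket empty;
    empty.fill(kEmpty);
    return std::vector<Bucket>(bucketCount, empty);
}

// Store a non-negative key in the first free slot of bucket key % table.size().
// Returns false if that bucket is already full (overflow is not handled here).
bool insertIntoBucket(std::vector<Bucket>& table, int key) {
    assert(key >= 0 && !table.empty());
    Bucket& bucket = table[static_cast<std::size_t>(key) % table.size()];
    for (int& slot : bucket) {
        if (slot == kEmpty) {
            slot = key;
            return true;
        }
    }
    return false;
}
```

With 10 buckets, the keys 400, 310 and 20 fill bucket 0; a fourth key such as 50 would then be rejected, because a bucket has a fixed number of slots.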
How does this contrast with chaining?
E.g. of Chaining -
Index Hashtable Chains
0 400 ---->310 ----->20 ----->50
1 501 ---->211
.....
9 89
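A minimal C++ sketch of the chaining example above, assuming integer keys and H(K) = K % 10. The names are made up for illustration, and unlike the diagram, where the first item sits in the table itself, this sketch keeps every item of a chain in a linked list hanging off the table entry.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Each table entry heads a chain of all keys that hashed to that index.
using ChainedTable = std::vector<std::list<int>>;

// Append the key to the chain at index key % table.size().
void insertChained(ChainedTable& table, int key) {
    assert(key >= 0 && !table.empty());
    table[static_cast<std::size_t>(key) % table.size()].push_back(key);
}

// Search only the one chain that the key hashes to.
bool containsChained(const ChainedTable& table, int key) {
    assert(key >= 0 && !table.empty());
    const std::list<int>& chain = table[static_cast<std::size_t>(key) % table.size()];
    for (int stored : chain) {
        if (stored == key) return true;
    }
    return false;
}
```

Inserting 400, 310, 20, 50, 501, 211 and 89 into a table of 10 chains gives the chain 400, 310, 20, 50 at index 0. The chain grows as needed, so 50 still fits, but a search for it has to walk past the three items ahead of it.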
*******************************************
Hash Table General Class Methods: