Hashing: Static Methods

- Course Notes -

* Hashing is a method of information retrieval, typically used in database management systems and other systems in which rapid storage and retrieval of information is necessary.

* Typical problem - Search for a record/object in a database that is associated with some key

* Hashing is done with a hash function such that -

            H: KEY------> INDEX or ADDRESS

* Hash function:

            H(Primary Key) = External Key

    The problem that occurs:

            H(K1) = IK = H(K2)

     That is, 2 different keys hash to the same external location!  This is called a COLLISION.

* Hashing takes a potentially huge range of values and maps it to a much smaller range of values -

E.G.s -

1) Students here at Chico State -

    Large potential # of SSNs -

    9 digits ---> 10^9 possible combinations

    yet only about 15,000 actual students, so

    H : SSN ----> Student File is "many to few"

2) All possible C++ identifiers (program names) - the compiler stores these in a symbol table by hashing for rapid retrieval:

    If limited to 32 characters, with a letter (all caps) first and alphanumerics (all caps) for the others, the number of full-length identifiers is:

                        26 * 36^31

* Sample Hash Functions:

1) Division Method (MODULO arithmetic):

    Note: '%' is the C++ MODULO operator

    H: Key ----> Integer Index

    E.g. - Table size of 100

               3 Digit numbers are the keys

                999 possible items

                Indices 0..99 on the table

                999 % 100 = 99 (100 is Table size)

                524 % 100 = 24

                199 % 100 = 99 (COLLISION)
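
A minimal C++ sketch of the division method above. The function name divisionHash is hypothetical, and keys are assumed to be non-negative integers:

```cpp
#include <cassert>
#include <cstddef>

// Division-method hash: maps a non-negative integer key to an index in
// the range 0 .. table_size - 1. (divisionHash is an illustrative name.)
std::size_t divisionHash(unsigned long key, std::size_t table_size) {
    assert(table_size > 0);  // modulo by zero is undefined
    return key % table_size;
}
```

With a table size of 100, divisionHash(999, 100) and divisionHash(199, 100) both give 99, which is the collision shown above.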

2) Mid-Square Method - Concat, Square and Remove the Middle!

E.G. 32 character identifiers being hashed -

Table size [0..99]

A..Z ---> 1,2, ...26

0..9 ----> 27,...36

Identifier: CS1 ---> 3, 19, 28 (concat) = 31,928

                    (31,928)^2 = 1,019,397,184 - 10 digits

                    extract middle 2 digits (5th and 6th)

                    get 39, so:

                    H(CS1) = 39
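
The mid-square steps above can be sketched in C++ roughly as follows. The names charCode and midSquareHash are hypothetical. This sketch assumes ASCII, upper-case letters and digits only, and identifiers short enough that the concatenated code has at most 9 digits (so the square fits in 64 bits). A full 32-character identifier would need big-number arithmetic. For a square with an odd number of digits, the choice of "middle" is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Maps 'A'..'Z' to 1..26 and '0'..'9' to 27..36, as in the notes.
int charCode(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A' + 1;
    assert(c >= '0' && c <= '9');  // only caps and digits are supported here
    return c - '0' + 27;
}

// Mid-square hash into a table indexed 0..99: concatenate the character
// codes, square the result, and extract the middle two digits.
int midSquareHash(const std::string& identifier) {
    std::string digits;
    for (char c : identifier) digits += std::to_string(charCode(c));
    assert(!digits.empty() && digits.size() <= 9);  // keep the square within 64 bits

    const std::uint64_t n = std::stoull(digits);
    const std::string squared = std::to_string(n * n);
    if (squared.size() < 2) return static_cast<int>(n * n);

    // For a 10-digit square this picks the 5th and 6th digits.
    const std::size_t start = (squared.size() - 2) / 2;
    return std::stoi(squared.substr(start, 2));
}
```

Under these assumptions, midSquareHash("CS1") follows the worked example: 31928 squared is 1019397184, and the middle two digits give 39.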

3) Folding Method:

    a) break key up into binary segments (ASCII)

    b) XOR these together

    c) Calculate the numeric integer equivalent
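
One possible reading of the folding steps above, as a C++ sketch. It assumes each segment is one 8-bit ASCII character, and it adds a final % table_size so the result is a valid index. Neither detail is spelled out in the notes, and the name foldingHash is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Folding hash: treat each character as an 8-bit segment, XOR the
// segments together, and reduce the integer result to a table index.
std::size_t foldingHash(const std::string& key, std::size_t table_size) {
    assert(table_size > 0);
    unsigned char folded = 0;
    for (unsigned char segment : key) {
        folded ^= segment;  // XOR this segment into the running value
    }
    return static_cast<std::size_t>(folded) % table_size;
}
```

Note that with one-byte segments the XOR result is at most 255, and any two keys with the same characters in a different order hash to the same place. Wider segments would spread the values more.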

 

Hashing Examples:

1) Basic Division Method-

    H(Key) = Key % 15,

    Values to be hashed all > 0

     0 indicates null value - or nothing there

    Results after hashing 41, 58, 12, 92, 50 and 91:

    Index   Key

    0       0
    1       91
    2       92
    3       0
    4       0
    5       50
    6       0
    7       0
    8       0
    9       0
    10      0
    11      41
    12      12
    13      58
    14      0

        This is a nice distribution of values, no collisions!

2) Same hash function -

    Now with values 10, 20, 30, 40, 50, 60, 70 -

    Index   Value   Overflow

    0       30      60 (collision)
    1       0
    2       0
    3       0
    4       0
    5       20      50 (collision)
    6       0
    7       0
    8       0
    9       0
    10      10      40, 70 (collisions)
    11      0
    12      0
    13      0
    14      0

Conclusion:  % 15 is a BAD HASH FUNCTION for this particular set of values!

In general: Choose a prime table size close to 1.5 times the number of keys you are hashing! (Here: 7 keys * 1.5 = 10.5, so try 11.)

3) H(Key) = Key % 11

    Same values as in last e.g. above:

    Index   Value

    0       0
    1       0
    2       0
    3       0
    4       70
    5       60
    6       50
    7       40
    8       30
    9       20
    10      10
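
To compare table sizes on a given set of keys, a small collision counter helps. This is only a sketch: countCollisions is a hypothetical name, and it counts every key that lands on an index already taken by an earlier key.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many keys land on an index already used by an earlier key
// when hashing with H(Key) = Key % table_size.
std::size_t countCollisions(const std::vector<int>& keys, std::size_t table_size) {
    assert(table_size > 0);
    std::vector<bool> used(table_size, false);
    std::size_t collisions = 0;
    for (int key : keys) {
        assert(key > 0);  // the notes reserve 0 for "nothing there"
        const std::size_t index = static_cast<std::size_t>(key) % table_size;
        if (used[index]) {
            ++collisions;
        } else {
            used[index] = true;
        }
    }
    return collisions;
}
```

For the keys 10, 20, 30, 40, 50, 60, 70 this counts 4 collisions with a table size of 15 and none with a table size of 11, matching the tables above.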

* Handling Collisions - Techniques:

Two Major Strategies:

1) Open Addressing - Find another spot in the "Table" (same contiguous address space)

2) Chaining - Find another spot outside the "Table"

___________________________________________

* Open Addressing Techniques:

a) Linear Probing - search sequentially (with wrap-around) until you find the first vacant slot

b) Quadratic Probing -

    Hashed value to index i - slot i is occupied!

    1st try after i         ---> try i + 1^2

    2nd try after i         ---> try i + 2^2

    3rd try after i         ---> try i + 3^2

   (Always % tablesize, of course)

    ETC.

c) Rehashing:  When you see that the spot is occupied, hash the original key again with a second hash function to find another spot in the table.
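
As an illustration of technique (a), here is a minimal linear probing sketch in C++. It assumes H(Key) = Key % table size, positive integer keys, and 0 as the "nothing there" marker, as in the earlier examples. The names EMPTY, insertLinear and findLinear are hypothetical, and deletion is not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = 0;  // 0 marks a vacant slot, as in the notes

// Inserts key with linear probing. Returns the slot used, or -1 if the
// table is full.
int insertLinear(std::vector<int>& table, int key) {
    assert(!table.empty() && key > 0);
    const std::size_t size = table.size();
    const std::size_t home = static_cast<std::size_t>(key) % size;
    for (std::size_t step = 0; step < size; ++step) {
        const std::size_t slot = (home + step) % size;  // wrap around
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return static_cast<int>(slot);
        }
    }
    return -1;
}

// Follows the same probe path as insertLinear. Returns the slot holding
// key, or -1 if the key is not in the table.
int findLinear(const std::vector<int>& table, int key) {
    assert(!table.empty() && key > 0);
    const std::size_t size = table.size();
    const std::size_t home = static_cast<std::size_t>(key) % size;
    for (std::size_t step = 0; step < size; ++step) {
        const std::size_t slot = (home + step) % size;
        if (table[slot] == key) return static_cast<int>(slot);
        if (table[slot] == EMPTY) return -1;  // a vacant slot ends the path
    }
    return -1;
}
```

With a table of size 15 and the keys 10, 20, 30, 40, 50, 60, 70 inserted in that order, the colliding keys 40, 50, 60 and 70 end up in slots 11, 6, 1 and 12 instead of an overflow area.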

___________________________________________

* Chaining Techniques:

This technique "Chains" the item that collided to a location outside the "Table" - to another block of memory

(You'll do this with Dynamic, Extendible Hashing Techniques)

_____________________________________________

Problems with both Open Addressing and Chaining - searches can be very long for an item that collided with many other items!

E.g.s -

1) Open Addressing with Linear Probing

Clustering can occur:

Suppose keys 160, 204, 219, 119, 412, 390, 263 are loaded and H is biased toward returning 38-40!

    Index   Value   Hash values

    0
    1
    2
    ....    .....   .....
    38      160     H(160)=38
    39      204     H(204)=38
    40      219     H(219)=38
    41      119     H(119)=39
    42      412     H(412)=39
    43      390     H(390)=39
    44      263     H(263)=40
    ....    .....   ......

    size

Conclusion:  Clustering can occur due to a biased hash function with linear probing as a collision resolution technique!
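
The cost of such a cluster can be made visible with a small C++ sketch. Since the notes do not define the biased H, the function below takes each key's hash value H(key) directly. The name placeByLinearProbing and the idea of counting "extra probes" (steps past the home slot) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given each key's home index H(key), returns the slot each key ends up
// in under linear probing (-1 if the table is full). extra_probes is set
// to the total number of steps taken past the home slots.
std::vector<int> placeByLinearProbing(const std::vector<std::size_t>& homes,
                                      std::size_t table_size,
                                      std::size_t& extra_probes) {
    assert(table_size > 0);
    std::vector<bool> occupied(table_size, false);
    std::vector<int> slots;
    extra_probes = 0;
    for (std::size_t home : homes) {
        assert(home < table_size);
        int placed = -1;
        for (std::size_t step = 0; step < table_size; ++step) {
            const std::size_t slot = (home + step) % table_size;
            if (!occupied[slot]) {
                occupied[slot] = true;
                placed = static_cast<int>(slot);
                extra_probes += step;
                break;
            }
        }
        slots.push_back(placed);
    }
    return slots;
}
```

Feeding in the hash values 38, 38, 38, 39, 39, 39, 40 (with any table size above 44, an assumption since the notes leave the size open) reproduces slots 38 through 44. Every later search for one of these keys has to walk through part of the same cluster.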

2) Quadratic Probing:

H(Key) = Key %11, Hashing values 13, 3, 24, 46, 90:

    Index   Value

    0       46
    1
    2       13
    3       3
    4
    5
    6       24
    7       90
    8
    9
    10

Note the wraparound calculations!
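
A minimal C++ sketch of quadratic probing insertion that reproduces the table above. The names EMPTY and insertQuadratic are hypothetical. As in the earlier examples, 0 marks a vacant slot and keys are assumed positive. The sketch gives up after as many attempts as there are slots.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = 0;  // 0 marks a vacant slot, as in the notes

// Inserts key with quadratic probing: tries i, i + 1^2, i + 2^2, ...
// (always % table size). Returns the slot used, or -1 if no vacant slot
// was found within table.size() attempts.
int insertQuadratic(std::vector<int>& table, int key) {
    assert(!table.empty() && key > 0);
    const std::size_t size = table.size();
    const std::size_t home = static_cast<std::size_t>(key) % size;
    for (std::size_t attempt = 0; attempt < size; ++attempt) {
        const std::size_t slot = (home + attempt * attempt) % size;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return static_cast<int>(slot);
        }
    }
    return -1;  // quadratic probing may not reach every slot
}
```

For example, 46 hashes to 2, then tries 3 and 6 (both occupied), then 2 + 9 = 11, which wraps around to slot 0.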

Also, quadratic probing may never "get anywhere"-

H(K) = K % 8 ......

    Index   Value

    0
    1
    2
    3       X - Initial hash position, plus 4th, 8th, 12th.... probes
    4       1st, 3rd, 5th, 7th, 9th.... probes after collision at position 3
    5
    6
    7       2nd, 6th, 10th.... probes after collision at position 3
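
To check which slots a quadratic probe sequence can ever reach, the following C++ sketch collects the distinct slots visited. The name reachableSlots and the max_probes cutoff are hypothetical choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Returns the distinct slots visited by the probe sequence
// home, home + 1^2, home + 2^2, ... (each % table_size),
// up to and including probe number max_probes.
std::set<std::size_t> reachableSlots(std::size_t home,
                                     std::size_t table_size,
                                     std::size_t max_probes) {
    assert(table_size > 0);
    std::set<std::size_t> seen;
    for (std::size_t attempt = 0; attempt <= max_probes; ++attempt) {
        seen.insert((home + attempt * attempt) % table_size);
    }
    return seen;
}
```

Starting from position 3 in a table of size 8, only slots 3, 4 and 7 are ever visited, no matter how many probes are made. With a prime size such as 11, the sequence reaches 6 of the 11 slots, which is one more reason to prefer prime table sizes.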

______________________________________________________________

Also, you can hash to "Buckets". This is more like the database situation, where a "Bucket" is the size of a disk block that can fit n records/objects of size k, say:

E.g. - A Bucket Size of 3, H(K) = K%10

    Index   Slot 1                 Slot 2   Slot 3

    0       record with key 400    310      20
    1       record with key 501    211      Empty
    ...
    ...
    9       record with key 89     Empty    Empty
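
A bucket table like this one can be sketched in C++ as a vector of fixed-size arrays. The names BUCKET_SIZE, EMPTY, Bucket and insertIntoBucket are hypothetical, and each record is reduced to its integer key. What to do when a bucket is full is left open, so the sketch just reports it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t BUCKET_SIZE = 3;  // records per bucket, as in the example
const int EMPTY = 0;                // 0 marks an unused slot

using Bucket = std::array<int, BUCKET_SIZE>;

// Hashes key to a bucket with Key % number of buckets and stores it in
// the first free slot. Returns the slot number within the bucket, or -1
// if the bucket is already full (overflow).
int insertIntoBucket(std::vector<Bucket>& table, int key) {
    assert(!table.empty() && key > 0);
    Bucket& bucket = table[static_cast<std::size_t>(key) % table.size()];
    for (std::size_t slot = 0; slot < BUCKET_SIZE; ++slot) {
        if (bucket[slot] == EMPTY) {
            bucket[slot] = key;
            return static_cast<int>(slot);
        }
    }
    return -1;
}
```

With 10 buckets, the keys 400, 310 and 20 fill bucket 0. A fourth key such as 50 would overflow it, since the bucket has a fixed capacity.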

How does this contrast with chaining?

E.g. of Chaining -

Index   Hashtable    Chains

0             400             ---->310   ----->20    ----->50

1             501             ---->211

.....

9             89
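
A chained table can be sketched in C++ with one linked list per index. This simplifies the picture above: instead of keeping the first item in the table and the rest outside, every item for an index lives in that index's list. The names ChainedTable, insertChained and containsChained are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

using ChainedTable = std::vector<std::list<int>>;

// Appends key to the chain for index Key % table size. A chain can grow
// without limit, so an insert never fails for lack of space in the table.
void insertChained(ChainedTable& table, int key) {
    assert(!table.empty() && key >= 0);
    table[static_cast<std::size_t>(key) % table.size()].push_back(key);
}

// Walks the chain for the key's index and reports whether key is there.
bool containsChained(const ChainedTable& table, int key) {
    assert(!table.empty() && key >= 0);
    const std::list<int>& chain =
        table[static_cast<std::size_t>(key) % table.size()];
    return std::find(chain.begin(), chain.end(), key) != chain.end();
}
```

With 10 indices, inserting 400, 310, 20 and 50 builds a chain of four items at index 0. The price is that a search for 50 has to walk past the three items ahead of it.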

*******************************************

Hash Table General Class Methods:
