Information Retrieval and Hashing

Hash Tables: Chapter 11

A dictionary is a collection of pairs of the form (k, e), where k is a key and e is the element associated with the key k (equivalently, e is the element whose key is k). No two pairs in the dictionary have the same key. The following operations are performed on a dictionary:
  1. Get the element associated with a specified key from the dictionary
  2. Insert or put an element with a specified key into the dictionary
  3. Delete or remove an element with a specified key

Hashing is a randomized scheme that may be used to search, insert, and remove elements.
It provides expected performance of (1) and has worse-case performance of (n).

Java provides several implementations of a hash table: java.util.HashTable, java.util.HashMap, and java.util.HashSet,
as well as java.util.IdentityHashMap, java.util.LinkedHashMap, and java.util.LinkedHashSet

Hashing

See text: page 415

In computer science, we generally use the term symbol table rather than dictionary when referring to the ADT.

The notion of a symbol table arises frequently in computer science.

When building loaders, assemblers, compilers, or any keyword driven translator a symbol table is used. (Again, much like the idea of a dictionary... image)

In these contexts a symbol table is a set of name-attribute pairs. Associated with each name in the table is an attribute, a collection of attributes, or some direction about what further processing is needed. (example: in a thesarus: word -> synonyms)

The operations that one generally wants to perform on symbol tables are:

  1. ask if a particular name is already present,
  2. retrieve the attributes of that name,
  3. modify the attributes of that name,
  4. insert a new name and its value, and
  5. delete a name and its value.
ADT for symbol table (dictionary) (pg. 401)

Is a symbol table a set of names or can the same name appear more than once?
(Example of a need for names appearing more than once: scope of variables.)

We need a mechanism to determine which of several occurances of the same name has been the most recently introduced

First things first. The general problem:

Typical problem: search for the record associated with some key.

Methods:

  1. Sequential search (e.g. file, table) on a collection of n items -> O(n) worst case time
  2. Binary search (sorted collection) -> 0 (n log n)
  3. Table lookup (arrays, vectors) the key is the array(or vector)-index:
    F:I -> A where I is the index and A is the actual memory address

  4. Hashing
    Hash function H
    H: Key--> index or address
*Tables, files, etc. are viewed as groups of objects (records)
*Each object has a unique distinguishing piece of information-- a key or keys.

e.g. Table of employee (or student) records (instances).

Each employee has an ID number (possibly a social security number). You want to be able to find the records of an individual employee quickly, given the employee ID number.

If the ID number is a small integer, say 1, 2, 3, etc., you could use a Vector or an ArrayList (why are these nice in Java?), but if you use social security numbers, they will be too long for an index.

In static hashing the identifiers are stored in a fixed size table called a hash table.

The address or location of an identifier X, is obtained by computing some arithmetic function, f, of X. f(X) gives the address of X in the table. This address will be referred to as the hash or home address of X. The memory available to maintain the symbol table is assumed to be sequential.

Specifically: The Hashtable class in Java provides this mechanism. The hash table computes a small integer (called the hash code) out of the external key .

The computation does not actually matter much, except that it provide a quick and evenly distributed set of resulting integers. Then the hash table stores the key and the data item (reference) in an array, at the location of the hash code.

All of this is invisible to the user of the Hashtable class.

First, we will go over how hash tables work, then look at their use specifically in Java.

Hashing is typically a many to few correspondence.

Example

  1. Large potential number of Social Security numbers
      10 digits--> 109 possible permutations but only perhaps 10,000 students

      H: SSN -> student file is "many to few"

  2. All possible Java identifiers of length 6, stored in a symbol table
    T = 54*(64)5 possibilities (26 lower, 26 caps, _, $) ( actually, plus unicode)
    is >109...Any reasonable application would use far less!

Let's call the hash table ht. The hash table is partioned into b buckets, ht[0], ... ht[b]. Each bucket is said to consist of s slots, each slot being large enough to hold one record. Usually, s=1 and each bucket can hold exactly one record.

In the above example of identifiers, where T is the number of distinct possibilities, we would say that

Since the number of identifiers, n, in use, is usually several orders of magnitude less than the total number of possible identifiers, T, the number of buckets, b, in the hash table is also much less than T. (don't want to waste the space)

Therefore, the hash function f must map several different identifiers into the same bucket. Ideally, every bucket contains, at most, one object, but collisions are inevitable.

Specifically:

Two identifiers (I1&2) are said to be synonyms with respect to Hash Function f if : f(I1)=ht(j)=f(I2)

f(primary key)= External key to hash table (index j)

Possible collisions (not of KEYS, but of the hash value ... keys should be unique)

Distinct synonyms are entered into the same bucket so long as the s slots in that bucket have not been used. A collision occurs when two nonidentical identifiers are hashed into the same bucket. An overflow is said to occur when a new identifier I is mapped or hashed by f into a full bucket. When the bucket size s is 1, collisions and overflows occur simultaneously.


Note: In the case where no overflow occurs, the time to enter or search for identifiers using hashing depends only on the time required to compute the hash function f and the time to search one bucket.

Since the bucket size is usually small the search for an identifier within a bucket is carried out using sequential search.

The time, then, is independent of n, the number of identifiers in use.

A hashing function, f, transforms an identifier X into an index in the hash table. The desired properties of such a function are thus that it be easily computable and that it minimize the number of collisions.

Thus there are two major considerations in hash table mechanisms:

  1. what should the function be?

  2. how should one handle overflows?

Sample Hash Functions:

(pages 417-421)

  1. Division method (mod arithmetic):

    H:Key -> integer index

    e.g. table size of 100 3-digit numbers are the keys 999 possible items

    indices 0--99 on the table (note table size is 100)

      999 mod 100 (999%100) = 99

      524 mod 100 (524%100) = 24

      199 mod 100 (199%100) = 99 (collision)

    Note: the choice of M (in f(X) = X mod M) is critical.

    Text shows the Division Hash Function with a table size of 11. Here are its results Fig. 11.2

    We see that there are many cases of collision. If M is a power of 2, then what? Clustering
    Collisions and hence Overflows occur often. See page 419, Example 11.10 [Selecting the Hash Function Divisor C] for good hash function choices. For the Division Method, best results are obtained when D (and therefor the number of buckets b) is either a prime number or has no prime factors less than 20.

    The most popular techniques to deal with overflows are linear probing and chaining, which we will address later.

  2. Mid-square method

    (remove the middle):

    e.g. Java identifiers to be hashed : Consider AZ2

    Table size [0..99]

    A - Z 1,2,...,26
    a - z 27,...,52
    0-9, _, $ 53,...,64

    Concat

    AZ2--> 1+26+55= 12655

    (12655)2 = 160149025 to make even say 0160149025

    Extract middle 2 digits --- hence Mid-square

    H(AZ2)= 14

    Try some and note that identifiers with the same first value have differing mid-squares (hopefully preventing collisions of perhaps frequently used alphabet letters) .

    If X is an identifier chosen at random from the identifier space, then we want the probability that f(X) = i to be 1/b for all buckets i. Then a random X has an equal chance of hashing into any of the b buckets. A hash function satisfying this probability is termed a uniform hash function.

    Good idea...

  3. Folding Method

    ->Break the identifier x up into segments, all but possibly the last being of equal length

    ->add these partitions together to get the hash address for x

    Specific example:

    ->Break the indentifier up into binary segments (or ASCII)

    ->Xor these together

    ->Calculate the numeric integer equivalent

    Two ways of adding:

    1. shift folding (all but the last partition are shifted so that the least significant bit of each lines up with the corresponding bit of the last partition)

    2. folding at the boundaries (the identifier is folded at the partition boundaries, and digits falling into the same position are added together. This is equivalent to reversing every other partition and then adding.)

    Handling Collisions:

    (pages 421)

    Two methods:

    1. open-addressing (assume ht is an array)
    2. chaining

    1) Place new item elsewhere in the table (open-addressing)

    Example

    1.    Linear probing

      1. compute h(x)
      2. search sequentially at positions ht[h(x)], ht[h(x) +1], ..., ht[ h(x) +j]
        until one of the following happens:
          ht[ h(x) +j] = x ; x is found
          ht[ h(x) +j] is null ; x is not in table
          we return to the starting position h(x) ; the table is full and x is not in the table
        and x

      Example:

      h(key) = (key mod tablespace) + 1

      Array with tablespace 15 loads with keys 41,58,12,92,50, and 91 using hashing

      ArrayKey
      10
      291
      392
      40
      50
      650
      70
      80
      90
      100
      110
      1241
      1312
      1458
      150

      The same array when loaded with a different set of keys shows several collisions.
      (again) h(key) = (key mod tablespace) + 1

      ArrayKeyCollision 1Collision 2
      13060 
      20  
      30  
      40  
      50  
      62050 
      70  
      80  
      90  
      100  
      11104070
      120  
      130  
      140  
      150  

      An array with tablesize 11 and the same set of keys as the previous example; no collisions result

      ArrayKey
      10
      20
      30
      40
      570
      660
      750
      840
      930
      1020
      1110

      One more: Suppose keys 160, 204, 219, 119, 412, 390 and 263 are loaded with

      h(160) = 38

      h(204) = 38

      h(219) = 38

      h(119) = 39

      h(412) = 39

      h(390) = 39

      h(263) = 40

      Hash function is biased to range 38-40. If a linear method were not used, these keys would not necessarily cluster in positions 38-40

      SO?

      An improvement:

    2.    Quadratic probing-- hashed to index I
        1st try after I-> try I+1
        2nd try after I-> try I+22= I+4
        3rd try after I-> try I+32= I+9
             Etc. (each result is taken mod the table size...why?)

      Consider the following possibility with an initial hash position of 4.

           Array		Key
      	1		
      	2
      	3
      	4		  x	*4th, 8th, 12th ...probes after collision at pos. 4
      	5			*1st, 3rd, 5th, 7th, 9th ... probes after collision
      	6
      	7
      	8			*2nd, 6th, 10th,  ... probes after collision
      
      Quadratic probing will never check positions 1,2,3,6,7 after initial hash to 4. Not good!

      If the table size (with bucket size 1) have a value 4j+3 ( where j is an integer), the quadratic search will examine every bucket.
      (I am not making this up - it has been proven.)

    3. A variation of linear: double hashing

      On second attempt, rehash with a different function

        h1: key--> address1 in table

        Use of result h1 as input to second
        h2: address1 in table--> another address

        OR mathematically speaking

        h2 o h1(primary key)

    4. 1.5)Kind of in between sequential and chaining:

      Provide an overflow storage area for linked collision processing. Each key also has a link to the next key (in the event of a collision). Overflow is filled sequentially.

    2) Chaining

    1. Chaining - linked lists

      Chain the new item to a location outside the table

      See Figure 11.3

      Now b is the number of head nodes -- (more like buckets that can be filled with > 1 object; each bucket holds a link)

    2. Bucket Hashing

      This same idea could also be done without the links by storage of buckets with size > 1 in an array: physically contiguous buckets, each of same size

      Another visual (2-D array)

    Summary

    With a good hash function, hashing can reduce the time of search to O(1). Can't do better than that!

    Comparisons
    Conclusions
    Pointers


    OK, back to Java. java.util.Hashtable - notice that it inherits from java.util.Dictionary BUT that it says the class is obsolete and that one should use the Map interface. Apparently the Dictionary class was a "totally" abstract class with no methods implemented - so now we are to be ontologically honest and use an interface ;-) . Note that all of these are in the Collections Framework and my notes on Collections.

    The following material is from Core Java.

    Java provides a Hashtable class. Note the lower case t. Supposedly under Windows you get strange error messages if you use HashTable since the Windows file system is not case sensitive but the Java compiler is.

    Hashtable staff = new Hashtable();
    Employee harry = new Employee("Harry Hacker");
    staff.put("987-98-9996", harry);
    ...

    To retrieve an object, you must remember the key

    String s = "987-98-9996";
    e = (Employee)staff.get(s); // gets harry

    If no information is stored in the hash table with the particular key, then get returns null.

    As with vectors, hash tables store everything as an object. This means you must cast the return value of get to the correct type.

    Keys must be unique. You cannot store two values with the same key. If you call the put method twice on the same key, then the second value replaces the first one. In fact, put returns the previous value stored with the key parameter. (This is useful; if put returns a non-null value, then you know you replaced a previous entry.)

    Methods

    If you want more control over the performance of the hash table, you can specify the initial bucket count.

    If you know approximately how many elements will eventually be in the table, then you can set the initial bucket size to about 150% of the expected element count. (remember the prime number idea). Note that the class information suggests load factor (remember load factor is n/(sb) and if s=1 then n/b (#of buckets)) of 75%. If the load factor is 75% this means 3/4 b = n and hence b = 4/3 n or 133% of full. I just made it easier and said 150%

    For example, if you need to store about 100 entries, set the intial bucket size to 151:

    Hashtable staff = new Hashtable(151);

    You can also set the load factor in the constructor.

    Hashtable(int initialCapacity, float loadFactor);

    The load factor (remember: number of items/hashtable size) gives the percentage threshold before Java moves the hash table to a new location. For example, if the load factor is 0.75 (the default) and the hash table becomes more than 75% full, then Java automatically allocates a larger table with twice as many entries and then Java reorganizes all the entries into the larger table.

    Why? Hash tables that are nearly full perform poorly.

    We are able to use strings as keys because the String class has a hashCode method that computes the hash code for a string.

    Creating your own Hashable Classes:

    In the Java library, all classes that define their own hashCode method form the hash code according to the contents of their objects.

    If your hash table keys are objects of a class that you defined yourself, you should define the hashValue method of the key class.

    This method should return an integer (which may be negative.)

    The hash table code will later reduce the integer by dividing by the bucket count and taking the remainder. . (so...what method does Java use?) mid-square, division, ?

    "Just scramble up the hash codes of the data fields in some way that will give the hash codes for different objects a good chance of being widely scattered."

    Their example:

    pn = new PartNumber("WDJT", 4);
    
    class PartNumber()
    	{	private String description;
    		private int version;
    	...
    		public int hashCode()
    			{return 13*description.hashValue()+17*version;}
    	...
    	}
    
    There is another important method that must be defined for user-defined key classes, namely equals. The Object class defines equals, but the method only tests whether or not two objects are identical. We need to redefine equals to check for equal contents, otherwise a comparison in the hash table would always fail:
    class PartNumber()
    	{ ...
    		public booean equals(Object b)
    			{ if (b instanceof PartNumber)
    				{ PartNumber p = (PartNumber)b;
    					return description.equals(p.description) && version == p.version;
    													}
    				else return false;
    	}
    

    Enumerations

    If you want to print all of the employees in the staff hash table, we use the elements() method in the Hashtable class.

    Enumeration e = staff.elements();

    This returns an object (e) of the type that implements the Enumeration interface

    The Enumeration interface has two methods, hasMoreElements() and nextElement().

    The following loop uses those methods to iterate through all values in the hash table:

    while (e.hasMoreElements())
    	{ String id = (String)e.nextElement();
    	...
    	}
    
    Notice this method does not need to know to which data structure the enumeration is attached - many Java classes return objects that implement the Enumeration interface (e.g. vectors).

    v.elements() returns an enumeration object that can traverse all elements of the vector. (ahh ... inheritence)


    Some links to other sources and animations for hashing:

    MIT notes quadratic hash function, another quadratci probing
    more on probing, etc