Hashing is a randomized scheme that may be used to search, insert, and remove elements.
It provides expected performance of
(1) and has worse-case performance of
(n).
Java provides several implementations of a hash table: java.util.HashTable, java.util.HashMap, and java.util.HashSet,
as well as java.util.IdentityHashMap, java.util.LinkedHashMap, and java.util.LinkedHashSet
See text: page 415
In computer science, we generally use the term symbol table rather than dictionary when referring to the ADT.
The notion of a symbol table arises frequently in computer science.
When building loaders, assemblers, compilers, or any keyword driven translator
a symbol table is used. (Again, much like the idea of a dictionary... image)
In these contexts a symbol table is a set of name-attribute pairs.
Associated with each name in the table is an attribute, a collection of
attributes, or some direction about what further processing is needed.
(example: in a thesarus: word -> synonyms)
The operations that one generally wants to perform on symbol tables are:
Is a symbol table a set of names or can the same name appear more than
once?
We need a mechanism to determine which of several occurances of the same name
has been the most recently introduced
Methods:
e.g. Table of employee (or student) records (instances).
Each employee has an ID number (possibly a social security number). You want
to be able to find the records of an individual employee quickly, given
the employee ID number.
If the ID number is a small integer, say 1, 2, 3, etc., you could use a Vector or an ArrayList
(why are these nice in Java?),
but if you use social security numbers, they will be too long for an
index.
In static hashing the identifiers are stored in a fixed size table called a
hash table.
The address or location of an identifier X, is obtained by computing some
arithmetic function, f, of X. f(X) gives the address of X in the table. This
address will be referred to as the hash or home address of X. The memory
available to maintain the symbol table is assumed to be sequential.
Specifically: The Hashtable class in Java provides this mechanism.
The hash table computes a small integer (called the hash code) out of the
external key .
The computation does not actually matter much, except that it provide a quick
and evenly distributed set of resulting integers. Then the hash table stores
the key and the data item (reference) in an array, at the location of the hash
code.
All of this is invisible to the user of the Hashtable class.
First, we will go over how hash tables work, then look at their use
specifically in Java.
Hashing is typically a many to few correspondence.
Example
H: SSN -> student file is "many to few"
Let's call the hash table ht. The hash table is partioned into b
buckets, ht[0], ... ht[b]. Each bucket is said to consist of s slots,
each slot being large enough to hold one record. Usually, s=1 and each bucket
can hold exactly one record.
In the above example of identifiers, where T is the number of distinct
possibilities, we would say that
Since the number of identifiers, n, in use, is usually several orders of
magnitude less than the total number of possible identifiers, T, the number of
buckets, b, in the hash table is also much less than T. (don't want to waste
the space)
Therefore, the hash function f must map several different identifiers into the
same bucket. Ideally, every bucket contains, at most, one object, but
collisions are inevitable.
Specifically:
Two identifiers (I1&2) are said to be synonyms with respect to Hash
Function f if : f(I1)=ht(j)=f(I2)
f(primary key)= External key to hash table (index j)
Distinct synonyms are entered into the same bucket so long as the s slots in
that bucket have not been used. A collision occurs when two
nonidentical identifiers are hashed into the same bucket. An overflow
is said to occur when a new identifier I is mapped or hashed by f into a full
bucket. When the bucket size s is 1, collisions and overflows occur
simultaneously.
Since the bucket size is usually small the search for an identifier within a
bucket is carried out using sequential search.
The time, then, is independent of n, the number of identifiers in use.
A hashing function, f, transforms an identifier X into an index in the hash
table. The desired properties of such a function are thus that it be easily
computable and that it minimize the number of collisions.
Thus there are two major considerations in hash table mechanisms:
H:Key -> integer index
e.g. table size of 100 3-digit numbers are the keys 999 possible items
indices 0--99 on the table (note table size is 100)
524 mod 100 (524%100) = 24
199 mod 100 (199%100) = 99 (collision)
Text shows the Division Hash Function with a table size of 11. Here are its results Fig. 11.2
We see that there are many cases of collision. If M is a power of 2,
then what? Clustering
The most popular
techniques to deal with overflows are linear probing and chaining, which we will address later.
(remove the middle):
e.g. Java identifiers to be hashed : Consider AZ2
Table size [0..99]
Concat
AZ2--> 1+26+55= 12655
(12655)2 = 160149025 to make even say 0160149025
Extract middle 2 digits --- hence Mid-square
H(AZ2)= 14
Try some and note that identifiers with the same first value have differing
mid-squares (hopefully preventing collisions of perhaps frequently used alphabet letters) .
If X is an identifier chosen at random from the identifier space, then we want
the probability that f(X) = i to be 1/b for all buckets i. Then a random X has
an equal chance of hashing into any of the b buckets. A hash function
satisfying this probability is termed a uniform hash function.
Good idea...
->Break the identifier x up into segments, all but possibly the last being of equal
length
->add these partitions together to get the hash address for x
Specific example:
->Break the indentifier up into binary segments (or ASCII)
->Xor these together
->Calculate the numeric integer equivalent
Two ways of adding:
Two methods:
Example
The same array when loaded with a different set of keys shows several
collisions.
An array with tablesize 11 and the same set of keys as the previous example;
no collisions result
One more: Suppose keys 160, 204, 219, 119, 412, 390 and 263 are loaded with
h(160) = 38
h(204) = 38
h(219) = 38
h(119) = 39
h(412) = 39
h(390) = 39
h(263) = 40
Hash function is biased to range 38-40. If a linear method were not used,
these keys would not necessarily cluster in positions 38-40
SO?
An improvement:
Consider the following possibility with an initial hash position of 4.
If the table size (with bucket size 1) have a value 4j+3 ( where j is an
integer), the quadratic search will examine every bucket.
2) Chaining
Chain the new item to a location outside the table
See Figure 11.3
Now b is the number of head nodes --
(more like buckets that can be filled with > 1 object; each bucket holds a
link) Comparisons
OK, back to Java. java.util.Hashtable - notice
that it inherits from java.util.Dictionary BUT that it
says the class is obsolete and that one should use the Map interface. Apparently
the Dictionary class was a "totally" abstract class with no methods implemented - so now we are to be ontologically honest and use an interface ;-) .
Note that all of these are in the
Collections Framework and my notes on Collections.
The following material is from Core Java.
Java provides a Hashtable class. Note the lower case t.
Supposedly under Windows you get strange error messages if you use HashTable
since the Windows file system is not case sensitive but the Java compiler
is.
To retrieve an object, you must remember the key
If no information is stored in the hash table with the particular key,
then get returns null.
As with vectors, hash tables store everything as an object. This means
you must cast the return value of get to the correct type.
Keys must be unique. You cannot store two values with the same key. If
you call the put method twice on the same key, then the second value replaces
the first one. In fact, put returns the previous value stored with the key
parameter. (This is useful; if put returns a non-null value, then you know you
replaced a previous entry.)
Methods
If you want more control over the performance of the hash table, you can
specify the initial bucket count.
If you know approximately how many elements will eventually be in the
table, then you can set the initial bucket size to about 150% of the expected
element count. (remember the prime number idea). Note that the class information suggests load factor
(remember load factor is n/(sb) and if s=1 then n/b (#of buckets))
of 75%. If the load factor is 75% this means 3/4 b = n and hence b = 4/3 n or 133% of full. I just made it
easier and said 150%
For example, if you need to store about 100 entries, set the intial bucket
size to 151:
Hashtable staff = new Hashtable(151);
You can also set the load factor in the constructor.
Hashtable(int initialCapacity, float loadFactor);
The load factor (remember: number of items/hashtable size) gives the
percentage threshold before Java moves the hash table to a new location. For
example, if the load factor is 0.75 (the default) and the hash table becomes
more than 75% full, then Java automatically allocates a larger table with twice
as many entries and then Java reorganizes all the entries into the larger
table.
Why? Hash tables that are nearly full perform poorly.
We are able to use strings as keys because the String class has a
hashCode method that computes the hash code for a string.
In the Java library, all classes that define their own hashCode method
form the hash code according to the contents of their objects.
If your hash table keys are objects of a class that you defined yourself,
you should define the hashValue method of the key class.
This method should return an integer (which may be negative.)
The hash table code will later reduce the integer by dividing by the
bucket count and taking the remainder. . (so...what method does Java use?) mid-square, division, ?
"Just scramble up the hash codes of the data fields in some way that will give the hash codes for different objects a good chance of being widely scattered."
Their example:
This returns an object (e) of the type that implements the Enumeration interface
The
The following loop uses those methods to iterate through all values in the hash table:
Some links to other sources and animations for hashing:
MIT notes
quadratic hash function, another quadratci probingHashing
ADT for symbol table (dictionary) (pg. 401)
(Example of a need for names appearing more than once: scope of variables.)First things first. The general problem:
Typical problem: search for the record associated with some key.
*Tables, files, etc. are viewed as groups of objects (records)
F:I -> A where I is the index and A is the actual memory address
Hash function H
H: Key--> index or address
*Each object has a unique distinguishing piece of information-- a key or
keys.
10 digits--> 109 possible permutations but only perhaps 10,000
students
T = 54*(64)5 possibilities (26 lower, 26 caps, _, $) ( actually, plus unicode)
is >109...Any reasonable application would use far less!
Possible collisions (not of KEYS, but of the hash value ... keys should be
unique)
Note: In the case where no overflow occurs, the time to enter or search for
identifiers using hashing depends only on the time required to compute the
hash function f and the time to search one bucket.Sample Hash Functions:
(pages 417-421)
999 mod 100 (999%100) = 99
Note: the choice of M (in f(X) = X mod M) is critical.
Collisions and hence Overflows occur often. See page 419, Example 11.10 [Selecting the Hash Function Divisor C] for
good hash function choices. For the Division Method, best results are obtained when D (and therefor the number of buckets b) is
either a prime number or has no prime factors less than 20.
A - Z 1,2,...,26
a - z 27,...,52
0-9, _, $ 53,...,64
Handling Collisions:
(pages 421)
1) Place new item elsewhere in the table (open-addressing)
until one of the following happens:
ht[ h(x) +j] = x ; x is found
and x
ht[ h(x) +j] is null ; x is not in table
we return to the starting position h(x) ; the table is full and x is not in the table
Array Key
1 0
2 91
3 92
4 0
5 0
6 50
7 0
8 0
9 0
10 0
11 0
12 41
13 12
14 58
15 0
(again) h(key) = (key mod tablespace) + 1
Array Key Collision 1 Collision 2
1 30 60
2 0
3 0
4 0
5 0
6 20 50
7 0
8 0
9 0
10 0
11 10 40 70
12 0
13 0
14 0
15 0
Array Key
1 0
2 0
3 0
4 0
5 70
6 60
7 50
8 40
9 30
10 20
11 10
identifiers tend to cluster together
adjacent clusters tend to coalesce thus increasing search time
function, keyvalues and table size "interact"
1st try after I-> try I+1
2nd try after I-> try I+22= I+4
3rd try after I-> try I+32= I+9
Etc. (each result is taken mod the table size...why?)
Array Key
1
2
3
4 x *4th, 8th, 12th ...probes after collision at pos. 4
5 *1st, 3rd, 5th, 7th, 9th ... probes after collision
6
7
8 *2nd, 6th, 10th, ... probes after collision
Quadratic probing will never check positions 1,2,3,6,7 after initial hash to 4.
Not good!
(I am not making this up - it has been proven.)
Summary
With a good hash function, hashing can reduce the time of search to O(1). Can't do better than that!
Conclusions
Pointers
Hashtable staff = new Hashtable();
Employee harry = new Employee("Harry Hacker");
staff.put("987-98-9996", harry);
...
String s = "987-98-9996";
e = (Employee)staff.get(s); // gets harry
get(Object key)
put(K key, V value)
size() // (returns the number of entries in the hash table)
remove(Object key) Creating your own Hashable Classes:
pn = new PartNumber("WDJT", 4);
class PartNumber()
{ private String description;
private int version;
...
public int hashCode()
{return 13*description.hashValue()+17*version;}
...
}
There is another important method that must be defined for user-defined key classes, namely equals. The Object class defines equals, but the method only tests
whether or not two objects are identical. We need to redefine equals to check for equal contents, otherwise a comparison in
the hash table would always fail:
class PartNumber()
{ ...
public booean equals(Object b)
{ if (b instanceof PartNumber)
{ PartNumber p = (PartNumber)b;
return description.equals(p.description) && version == p.version;
}
else return false;
}
Enumerations
If you want to print all of the employees in the staff hash table, we use the elements() method in the Hashtable class.
Enumeration e = staff.elements();
Enumeration interface has two methods, hasMoreElements() and nextElement().
while (e.hasMoreElements())
{ String id = (String)e.nextElement();
...
}
Notice this method does not need to know to which data structure the enumeration is attached - many Java classes return objects that implement the Enumeration interface (e.g. vectors).
v.elements() returns an enumeration object that can traverse all elements of the vector. (ahh ... inheritence)
more on probing, etc