CSCI 311 - Fall 2008 - Program 2

Extendible Hashing

Dr. Melody Stapleton, Instructor

Pratik Mehta, Teaching Assistant

Due Date: All Program Components: Monday, November 24, uploaded by midnight

Your TA, Pratik, will provide the final input file(s) to run your program against by Monday, October 27

You are to implement the Extendible Hashing Algorithm, as discussed in class and posted in the course notes on the web.  Extendible Hashing uses an array-structured directory that can double or halve in size based on the current contents of the hash buckets.  You are to use good design principles in developing appropriate C++ classes.

Your guiding principles in developing this assignment are two-fold:

1.      Minimize time: only generate as many bits of any pseudokey as are absolutely necessary.  Only do as much analysis and calculation as is absolutely necessary.

2.      Minimize space: only allocate a bucket if something will be immediately stored in it.  Only use as many buckets as are absolutely necessary, never more.  Also, never have a directory “bigger” than you need.  Keep the directory as small as possible, given the current contents of the buckets and local depths of those buckets.
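The first principle above (generate only as many pseudokey bits as are needed) can be sketched as follows.  This is a minimal illustration, not the required design: the helper names pseudokey_bits and leading_bits are hypothetical, keys are assumed to be non-negative, and the sketch assumes the directory is indexed by the leading (high-order) bits of the pseudokey.  Check the course notes for the convention you are expected to follow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of bits in a pseudokey when the hash is
// key % modulus and modulus is a power of 2 (e.g. modulus 8 -> 3 bits).
unsigned pseudokey_bits(std::uint32_t modulus) {
    unsigned bits = 0;
    while (modulus > 1) {
        modulus >>= 1;
        ++bits;
    }
    return bits;
}

// Hypothetical helper: returns only the leading `depth` bits of the
// pseudokey for `key`.  ASSUMPTION: high-order bits index the directory.
std::uint32_t leading_bits(std::uint32_t key, std::uint32_t modulus,
                           unsigned depth) {
    const unsigned total = pseudokey_bits(modulus);
    assert(depth <= total);  // cannot ask for more bits than exist
    const std::uint32_t pseudokey = key % modulus;
    return pseudokey >> (total - depth);
}
```

With a mod value of 8, key 46 has pseudokey 6 (binary 110), so leading_bits(46, 8, 2) yields 3 (binary 11).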

You will also need to implement "chaining to overflow buckets" as discussed in class and shown in the course notes.
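As stated later in this handout, chaining applies only when the colliding records have identical pseudokeys in all bits.  One way to test that condition is sketched below.  The function name must_chain and its parameters are hypothetical, keys are assumed to be non-negative, and the pseudokey is assumed to be key % modulus.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical check: returns true when the bucket is full AND the new key
// has the same full pseudokey as every key already in the bucket, so no
// number of splits could ever separate them.
bool must_chain(const std::vector<int>& bucket_keys, std::size_t bucket_size,
                int new_key, int modulus) {
    if (bucket_keys.size() < bucket_size) {
        return false;  // there is still room; no overflow at all
    }
    const int target = new_key % modulus;
    return std::all_of(bucket_keys.begin(), bucket_keys.end(),
                       [=](int k) { return k % modulus == target; });
}
```

For example, with a mod value of 8 and a bucket size of 3, the keys 46, 54, 62, and 70 all have pseudokey 6, so inserting 70 into a bucket holding the other three would require an overflow bucket.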

Here are examples of Extendible Hashing from the Solutions Manual of Elmasri and Navathe’s Database Systems text, 5th Edition.  Note that, unlike your requirements, these examples show allocation of a bucket even when no records have yet hashed to it.  You are NOT to do this in your program!

You are to structure your main program to accept and interpret file input with the following format:

Your program must first read the power of 2 (on the first line of the input file) that will be used as the mod value of the hash function.  If the value is not a power of 2, write an error message to the output file and then halt execution of (exit) the program.  The next line of the input file will be the bucket size to be used for each bucket.
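A minimal sketch of reading and validating these two header lines is shown below.  The function names and the wording of the error messages are hypothetical.  The sketch also assumes that 1 (2 to the power 0) counts as a power of 2 and that the bucket size must be positive; confirm both with your TA.

```cpp
#include <istream>
#include <ostream>
#include <sstream>

// A positive integer is a power of 2 exactly when it has a single 1 bit.
bool is_power_of_two(long n) {
    return n > 0 && (n & (n - 1)) == 0;
}

// Hypothetical helper: reads the mod value and the bucket size.  On bad
// input it writes a message to `out` and returns false so that the caller
// can exit the program.
bool read_header(std::istream& in, std::ostream& out,
                 long& modulus, long& bucket_size) {
    if (!(in >> modulus) || !is_power_of_two(modulus)) {
        out << "ERROR: hash mod value must be a power of 2\n";
        return false;
    }
    if (!(in >> bucket_size) || bucket_size < 1) {
        out << "ERROR: bucket size must be a positive integer\n";
        return false;
    }
    return true;
}
```

The caller would pass the input file stream and the output file stream, and return from main when read_header reports failure.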

Please note: We will use a value of -1 for our global directory prefix to indicate that there are no values stored in our hash structure at present.

Listed below are the possibilities for the different categories of commands that will follow the above 2 lines:

1)      If the global prefix is -1 you will print out the phrase “HASH STRUCTURE EMPTY”. 

2)      If the global prefix is 0, this indicates that there is only one bucket in the structure and no directory.  Print out the phrase “LOCAL BUCKET PREFIX” followed by a space and then the integer value that is the local prefix (depth) of that bucket, in this case 0.  On the next line, print the phrase “BUCKET CONTENTS:” followed by a space and a comma-separated list of the key values that are the contents of the bucket.  The format of this list is <value>,<space><value> etc.  In other words, a comma and a space separate values from each other.

3)      If the global prefix is 1 or more, this indicates that you do, in fact, have a directory structure.  Print out each directory index, in ascending binary order.  After printing a directory index value, on the next line print either the word NULL, to indicate that the bucket pointer is null for that directory index, or the phrase “LOCAL BUCKET PREFIX” followed by a space and then the integer value that is the local prefix of that bucket.  On the next line, print the phrase “BUCKET CONTENTS:” followed by a space and a comma-separated list of the key values that are the contents of the bucket.  The format of this list is <value>,<space><value> etc.  In other words, a comma and a space separate values from each other.  In the case of chained buckets where overflow has occurred, print the contents of the first bucket as above, and then print each subsequent bucket by beginning a new line and printing the phrase “OVERFLOW BUCKET” followed by a space and a comma-separated list as described above.  Note that chaining to overflow buckets occurs only when the number of records that collide by having IDENTICAL pseudokeys IN ALL BITS that can be generated exceeds the number of slots in a bucket.  An example of this case was covered in the notes.

By key values, we mean the integer key values prior to being hashed.
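The two formatting pieces described above, a directory index written in binary and a comma-and-space separated list of keys, can be sketched as follows.  The helper names are hypothetical, and padding the index to exactly as many digits as the global prefix is an assumption; the TA's sample output is the authority on the exact format.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: writes `index` as a binary string of exactly `depth`
// digits (ASSUMPTION: zero-padded to the global prefix), e.g. (2, 3) -> "010".
std::string to_binary(unsigned index, unsigned depth) {
    std::string digits(depth, '0');
    for (unsigned i = 0; i < depth; ++i) {
        if (index & (1u << (depth - 1 - i))) {
            digits[i] = '1';
        }
    }
    return digits;
}

// Hypothetical helper: joins the original (unhashed) key values with a
// comma and a space between them, with no trailing separator.
std::string join_keys(const std::vector<int>& keys) {
    std::string line;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i > 0) {
            line += ", ";
        }
        line += std::to_string(keys[i]);
    }
    return line;
}
```

A bucket line could then be built as "BUCKET CONTENTS: " followed by join_keys applied to the bucket's keys.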

Pratik will provide detailed examples of the expected output.

You will, of course, need to write a hash function, and you will need to create an Extendible_Hash class that has at least the following member functions:

Output: Send your output for this program to a file.  For inserts, it is an error to insert a duplicate value in the Hash structure; be sure to notify the user of such an attempt.  Tell the user if an operation is successful, as well.  Be sure to give appropriate "error" messages back to the user of your program, via the output file.  For example, if asked to delete a value that is not found in the Hash structure, you should reply that the value was not found and so could not be deleted.  For the Search operation, as well as returning the appropriate pointer as described above, if the item being sought is found, print out a message that says you found it and what its value is.  On the other hand, if the item is not found, print out a message to that effect.

Also Note: You are responsible for writing a program that deals with *all cases*.  That is, your program should not only *work well on some input files*, it needs to work on *all possible* input files.  This means it is not your TA's responsibility to give you an input file that allows your program to work.  It is your responsibility to write code that cannot be broken by any input file and so will work in all cases.  This will be the case for every program you write for this course.

A sample file follows that adheres to the input file format:

8

5

I  46

I 55

I 88

P

S 46

S 99

I 99

D 50

S 200

I 200

D 55

P

I 120

I 200

I 300

I 400

I 500

S 300

P

D 600

D 500
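Each command line in the sample above is a single letter, optionally followed by an integer key.  A minimal parsing sketch follows.  The Command struct and parse_command function are hypothetical names, and reading I, S, D, and P as insert, search, delete, and print is an inference from the operations named in the Output paragraph.  The sketch accepts upper-case letters only.

```cpp
#include <sstream>
#include <string>

// Hypothetical record for one parsed input line.
struct Command {
    char op;       // 'I', 'S', 'D', or 'P'
    bool has_key;  // false for 'P', which takes no key
    int key;       // meaningful only when has_key is true
};

// Hypothetical parser: returns false for blank lines, unknown command
// letters, and I/S/D lines that lack an integer key.  Stream extraction
// skips whitespace, so a line such as "I  46" (two spaces) parses correctly.
bool parse_command(const std::string& line, Command& cmd) {
    std::istringstream in(line);
    cmd = Command{'\0', false, 0};
    if (!(in >> cmd.op)) {
        return false;  // blank line
    }
    if (cmd.op == 'P') {
        return true;
    }
    if (cmd.op != 'I' && cmd.op != 'S' && cmd.op != 'D') {
        return false;  // unknown command letter
    }
    if (!(in >> cmd.key)) {
        return false;  // key missing or not an integer
    }
    cmd.has_key = true;
    return true;
}
```

A main loop could read the file with std::getline, call parse_command on each line, and report a rejected line to the output file rather than stopping.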