Chinese Kid IT World: Algorithm and Data structure: Hashing

Introduction

•Dictionaries are data structures that support search,
insert, and delete operations.
•One of the most important criteria in dictionary is to
provide fast searching of information.
•One the effective representations is a hash table.
•The implementation of hash table is frequently called
hashing.

Hash Table

•Hash Tables represent the storage space for the entire search. The primary concern in the Hash Table is to provide appropriate indexing and to allow many of the different possible keys to be mapped onto the same location.
• There are two common methods of Hash Tables:1.Array Hash Table2.Linked Hash Table

•An Array Hash Table is simply an array that stores all the elements.
•Since array is indexed by default, we do not have the problem of deciding the method of indexing the Hash Table.
•The problems that we face with Array Hash Table are collision and overflow.
•A Linked Hash Table uses an array for indexing.
•Each element of the array is a pointer that points to a linked list that stores the data.
•The elements are not stored in the array itself but rather in a separate list.
•This method is also known as chaining.
•Chaining provides a simple method to overcome collision.
•However, if the list of element is too long, it may slow down the searching process.

Hash Function Techniques
•If the hash function is uniform, or equally distributes the data keys among the hash table indices, then hashing effectively subdivides the list to be searched.
• Worst-case behavior occurs when all keys hash to the same index. Then we simply have a single linked list that must be sequentially searched.
•Consequently, it is important to choose a good hash function.
•Several methods may be used to hash key values.

Advantages of Linked Hash Table (Chaining)
•Space Saving - Linked list as dynamic memory storage save memory space because node can be added and remove from the list.
•Collision resolution. - In linked list, collision does not occur as in array. A collision occurs when 2 item need to be stored in the same location. This problem appears in array hash table and easily overcome by Chaining.
•Overflow - Linked list avoid record overflow as in array implementation. It is extremely useful when dealing with ambiguous number of record.
•Deletion - Deletion can be performed quick and easy task as performing deletion from simple linked list.

•Use of space - All the links require space. If the record s are large, then this space is negligible in comparison with that needed for the records themselves; but if the records are small, then it is not.
•Small records - Suppose, for example, that the links take one word each and that the items themselves take one word (which is the key alone). Such applications are quite common, where we use the hash table only to answer some yes-no question about the key. Suppose that we use chaining and make the hash table itself quite small, with the same number of n of entries as the number of items. Then we shall use 3n words of storage altogether: n for the hash table, n for the keys, and n for the links to find the next node (if any) on each chain.
•Search speed - Search time for chaining is typically quit slow than array implementation.

Collision and The Resolution with Open Addressing
•The idea of a hash table is to allow many of the different possible keys that might occur to be mapped to the same location under the action of the hash function.
•In array implementation, it is commonly that 2 records might be mapped into the same location.
•If this is case, we called it as Collision.
•A simple example is like this. If a hash function which hash a name base on the first alphabet, then John and James will be hashed into the same location.
•The simplest Method to resolve a collision is to start with the hash address (the location where the collision occurred) and do a sequential search for the desire key or an empty location. Hence this method searches in a straight line and is there for called linear probing. The array should be consider circular, so that when the last location is reached, the search proceeds to the first location of the array.
•The major drawback of linear probing is that, as the table becomes about half full, there is a tendency toward clustering; that is, records starts to appear in long string of adjacent position with gaps between the strings. Thus the sequential searches needed to find an empty position become longer and longer.
•The problem of clustering is essentially one of instability; if a few keys happen randomly to be near each other, then it become more and more likely that other keys will join them, and the distribution will become progressively more unbalanced.

•Rehashing
–If we are to avoid the problem of clustering, then we must use some more sophisticated way to select the sequence of location to check when a collision occurs. There are many ways to do so. One, called rehashing, use a second hash function to obtain the second position to consider. If this position is filled, then some other method is needed to get the third position, and so on. But if we have a fairly good spread from the first hash function, then little is to be gained by an independent second hash function.

•Quadratic Probing
–Another method is to use Quadratic Probing. If there is a collision at hash address h, this method probes the table at location h + 1, h + 4, h + 9,…., that is at location h + i^2 (% HASHSIZE), for I = 1, 2, 3, …. That is the increment function is i^2. This method substantially reduces Clustering, but it is not obvious that it will probe all locations in the table, and in fact it does not.

•Key-Independent Increments
–Rather than having the increment depend on the number of probes already made, we can let it be some simple function of the key itself. For example, we could truncate the key to a single character and use its code as the increment. In C, we might write
–increment = *key
–A good approach, when the remainder after division is taken as the hash function is to let the increment depend on the quotient of the same division. An optimizing compiler should specify the division only once, so the calculation will be fast, and the result generally satisfactory.
–In this method, the increment, once determined, remains constant. If HASHSIZE is a prime, it follows that the probes will step through all the entries of the array before any repetition. Hence overflow will not be indicated until the array is completely full.

•Random Probing
–A final method is to use a pseudorandom number generator to obtain the increment.
–The generator used should be one that always generates the same sequence provided it starts with the same seed.
–The seed, then, can be specified as some function of the key.
–This method is excellent in avoiding clustering, but is likely to be slower than the others.

Chinese Kid IT World

Saturday, September 20, 2008

Algorithm and Data structure: Hashing

No comments: