Hashing Techniques


Hash function is the technique which will help us to index huge data into small chunk utilizing the numerical data. The outcome of a hash function is a hash code, checksums etc helps to easily look up the index for retrieving corresponding data from a large data set. In our continuous journey in exploring technology, we landed up in researching distributed hash tables which led us to analyze the various algorithms running behind a hash function.

A brief about Distributed Hash Table

In a distributed network environment, implementation of Distributed Hash Table (DHT) results in a set of numerical data for a given set of data, which shall be stored in any of the participating node (i.e., in underlying peer-to-peer systems).

The main characteristics of the DHT

  • Decentralization
  • Scalability
  • Fault-Tolerance

Decentralization is the mantra behind a scalability of application. The distributed hash keys are stored in any participating node and replicated with its neighbor too. There is no centralized control to perform the distributed hashing. The DHT leads to a reliable system, independent of nodes joining/leaving or failing from the network, without impacting the application functionality.

DHT meets the needs of many applications such as Membase, Memcahe, session data and user related data in the distributed environment. The hashing functionality caters the quick retrieval of data which is mapped to the hash bucket, by performing the major role of storing and retrieving the data using the hash key in the bucket. While applying the hash function, we need to consider the following factors:

S.No Factors Description
1 Minimal Hash Collision When two sets of data having the same hash value, checksum etc., (Diagram), it results in Hash Collision.
2 Performance The latency in retrieving data should be minimal

Is there any “Best Hashing algorithm”, would solve the collision problem and improve the performance?

The simplest answer is “No”. The above question shall be addressed in an alternative way by identifying algorithms leading to less collision instead of digging into a collision free hashing mechanism. Though we are able to identify an efficient i.e., collision free hashing mechanism, the complexity will grow along with data and it would lead to hash collision.

A better hashing shall be achieved using the following algorithms:

  • FNVHash
  • knuthHash
  • BobJenkins
  • SuperFastHash

In the above set of hashing algorithms, FNVHash and BobJenkins are well suited for large set of data. Sample FNVHash implementation in below.

Thanks to Paul Hsieh’s test program, here are some performance numbers for these different implementations (as benchmarked on my aging PowerBook G4):

FNVHash : 3.9300s
knuthHash : 2.9700s
BobJenkins : 2.4600s
SuperFastHash   : 2.2800s

The latest version of Bob Jenkin’s hash is called Lookup3 is outsmarting the SuperFastHash by providing better collision properties and also meeting the performance benchmark of it.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s