Hashing is a technique used to map data of arbitrary size (the "key") to data of a fixed size (the "hash value" or "index"). This is used to implement a Hash Table, a data structure that provides very fast O(1) average-case insertion, deletion, and search.
It's like a "magic" array where you can store an item (e.g., a person's name) and find it instantly without searching.
key as input and computes an index (a hash value) for the Hash Table.
index = hash_function(key)
The goal is to store the data associated with "John Smith" at hash_table[index].
The Load Factor is a measure of how full the hash table is. It is crucial for performance.
α = n / M
where:
n = number of elements stored in the table
M = size of the hash table (number of buckets)
A high load factor (e.g., α > 1 for chaining, α > 0.7 for open addressing) leads to more collisions and slower performance. A low load factor is fast but wastes memory.
The simplest method. Take the key k and the table size M.
h(k) = k % M (k modulo M)
Tip: M should ideally be a prime number that is not close to a power of 2. This helps distribute the keys more uniformly.
Square the key k, and then take the middle r digits of the result as the index.
Example: k = 1234, M = 100 (so r=2).
k² = 1522756. Middle 2 digits: 27. index = 27.
The key is partitioned into several parts, and the parts are combined (e.g., added) to get the index.
Example: k = 12345678, M = 1000.
Fold into parts of 3: 123, 456, 78.
Add them: 123 + 456 + 78 = 657. index = 657.
A Collision occurs when a hash function maps two or more different keys to the same index (bucket).
h(key1) == h(key2), wherekey1 != key2
Since collisions are inevitable (Pigeonhole Principle), we must have a strategy to handle them.
There are two main categories for resolving collisions:
This is the simplest and most common technique.
M) does not hold the element itself, but instead holds a pointer to a linked list.index = h(key). Go to hash_table[index] and insert the new element at the beginning of that linked list (O(1)).index = h(key). Go to hash_table[index] and perform a linear search on the linked list (O(n/M) or O(α)).Advantages:
Disadvantages:
In this method, all elements are stored inside the hash table array itself. When a collision occurs, we "probe" (search) for the next available empty bucket according to a rule.
The hash function is modified to include a probe sequence i (where i = 0, 1, 2, ...).
h(key, i) = (h'(key) + f(i)) % M
Important: The load factor α must be less than 1 (α ≤ 1). The table can never be more than 100% full.
This is the simplest probing method. If a collision occurs, just check the next sequential bucket.
f(i) = i.h'(k), (h'(k) + 1) % M, (h'(k) + 2) % M, ...If collisions start to build up in one spot, they create a "cluster" of occupied buckets. A new insertion in that area will have to travel a long way to find an empty spot, making search/insert times degrade toward O(n).
This method probes using a quadratic step, which helps avoid primary clustering.
f(i) = c1*i + c2*i² (Commonly, f(i) = i²).h'(k), (h'(k) + 1) % M, (h'(k) + 4) % M, (h'(k) + 9) % M, ...If two keys have the same initial hash h'(k), they will follow the exact same probe sequence. This is less of a problem than primary clustering but still not ideal.
This is the best open addressing method. It uses a second hash function to determine the "step size" of the probe. The step size is different for every key.
f(i) = i * h2(k).h'(k), (h'(k) + h2(k)) % M, (h'(k) + 2*h2(k)) % M, ...h2(k) must be chosen carefully (e.g., h2(k) = 1 + (k % (M-1))) to ensure it never produces 0.| Feature | Separate Chaining (Closed Addr.) | Open Addressing (Linear Probing) | Open Addressing (Double Hashing) |
|---|---|---|---|
| Load Factor (α) | Can be > 1. | Must be < 1. (Performance degrades badly after 0.7). | Must be < 1. |
| Clustering | None (but lists can get long). | Primary Clustering (Bad). | None (Good). |
| Space Usage | Extra space for pointers. | No extra space. Good cache performance. | No extra space. Good cache performance. |
| Deletion | Easy. | Very difficult (must use "tombstones" or rehash). | Very difficult. |