Unit 5: Hashing
Table of Contents
1. Introduction to Hashing
Hashing is a technique used to map data of arbitrary size (the "key") to data of a fixed size (the "hash value" or "index"). This is used to implement a Hash Table, a data structure that provides very fast O(1) average-case insertion, deletion, and search.
It's like a "magic" array where you can store an item (e.g., a person's name) and find it instantly without searching.
Hash Table, Hash Key, Hash Function
- Hash Table: This is the data structure itself, typically an array. Each slot in the array is called a "bucket" or "slot."
- Hash Key: This is the input data (e.g., a string "John Smith", a number 12345) that we want to store or search for.
- Hash Function: This is the "magic" function. It takes the
keyas input and computes an index (a hash value) for theHash Table.index = hash_function(key)The goal is to store the data associated with "John Smith" at
hash_table[index].
Load Factor (α)
The Load Factor is a measure of how full the hash table is. It is crucial for performance.
α = n / M
where:
n = number of elements stored in the table
M = size of the hash table (number of buckets)
A high load factor (e.g., α > 1 for chaining, α > 0.7 for open addressing) leads to more collisions and slower performance. A low load factor is fast but wastes memory.
2. Hash Functions
Characteristics of Good Hash Functions
- Fast Computation: It must be very quick to compute (ideally O(1)).
- Uniform Distribution (Deterministic): It should map keys as evenly as possible across all buckets. Given a key, it must always produce the same index.
- Low Collisions: It should minimize the number of "collisions" (when two different keys map to the same index).
Types of Hash Functions
- 1. Division Method:
The simplest method. Take the key
kand the table sizeM.h(k) = k % M(k modulo M)Tip:
Mshould ideally be a prime number that is not close to a power of 2. This helps distribute the keys more uniformly. - 2. Mid-Square Method:
Square the key
k, and then take the middlerdigits of the result as the index.Example:
k = 1234,M = 100(sor=2).
k² = 1522756. Middle 2 digits:27.index = 27. - 3. Folding Method:
The key is partitioned into several parts, and the parts are combined (e.g., added) to get the index.
Example:
k = 12345678,M = 1000. Fold into parts of 3:123,456,78. Add them:123 + 456 + 78 = 657.index = 657.
3. Collision
A Collision occurs when a hash function maps two or more different keys to the same index (bucket).
h(key1) == h(key2), wherekey1 != key2
Since collisions are inevitable (Pigeonhole Principle), we must have a strategy to handle them.
Collision Resolution Techniques
There are two main categories for resolving collisions:
- Closed Addressing (or Separate Chaining): Store the colliding elements *outside* the table.
- Open Addressing (or Closed Hashing): Store the colliding elements in *another empty bucket* within the table.
- Open Hashing = Closed Addressing = Separate Chaining (Buckets are "open" to hold more than one item via a list).
- Closed Hashing = Open Addressing = Probing (The table is "closed"; all items must fit within the table's "address" space).
4. Closed Addressing (Separate Chaining)
This is the simplest and most common technique.
- Idea: Each bucket in the hash table (
M) does not hold the element itself, but instead holds a pointer to a linked list. - Operation:
- Insert: Compute
index = h(key). Go tohash_table[index]and insert the new element at the beginning of that linked list (O(1)). - Search: Compute
index = h(key). Go tohash_table[index]and perform a linear search on the linked list (O(n/M) or O(α)).
- Insert: Compute
Advantages:
- Very simple to implement.
- Load factor (α) can be greater than 1.
- Deletion is easy (just remove from the linked list).
Disadvantages:
- Wastes memory on pointers.
- Can be slow if one chain becomes very long (poor hash function).
5. Open Addressing (Probing)
In this method, all elements are stored inside the hash table array itself. When a collision occurs, we "probe" (search) for the next available empty bucket according to a rule.
The hash function is modified to include a probe sequence i (where i = 0, 1, 2, ...).
h(key, i) = (h'(key) + f(i)) % M Important: The load factor α must be less than 1 (α ≤ 1). The table can never be more than 100% full.
Linear Probing
This is the simplest probing method. If a collision occurs, just check the next sequential bucket.
- Function:
f(i) = i. - Probe Sequence:
h'(k),(h'(k) + 1) % M,(h'(k) + 2) % M, ... - Problem: Primary Clustering.
If collisions start to build up in one spot, they create a "cluster" of occupied buckets. A new insertion in that area will have to travel a long way to find an empty spot, making search/insert times degrade toward O(n).
Quadratic Probing
This method probes using a quadratic step, which helps avoid primary clustering.
- Function:
f(i) = c1*i + c2*i²(Commonly,f(i) = i²). - Probe Sequence:
h'(k),(h'(k) + 1) % M,(h'(k) + 4) % M,(h'(k) + 9) % M, ... - Problem: Secondary Clustering.
If two keys have the same initial hash
h'(k), they will follow the exact same probe sequence. This is less of a problem than primary clustering but still not ideal.
Double Hashing
This is the best open addressing method. It uses a second hash function to determine the "step size" of the probe. The step size is different for every key.
- Function:
f(i) = i * h2(k). - Probe Sequence:
h'(k),(h'(k) + h2(k)) % M,(h'(k) + 2*h2(k)) % M, ... h2(k)must be chosen carefully (e.g.,h2(k) = 1 + (k % (M-1))) to ensure it never produces 0.- Advantage: This method avoids both primary and secondary clustering. It distributes keys most uniformly.
6. Comparison of Collision Resolution Techniques
| Feature | Separate Chaining (Closed Addr.) | Open Addressing (Linear Probing) | Open Addressing (Double Hashing) |
|---|---|---|---|
| Load Factor (α) | Can be > 1. | Must be < 1. (Performance degrades badly after 0.7). | Must be < 1. |
| Clustering | None (but lists can get long). | Primary Clustering (Bad). | None (Good). |
| Space Usage | Extra space for pointers. | No extra space. Good cache performance. | No extra space. Good cache performance. |
| Deletion | Easy. | Very difficult (must use "tombstones" or rehash). | Very difficult. |
- If you need a simple implementation and deletion is common, use Separate Chaining.
- If memory is tight, speed is critical, and deletions are rare, use Open Addressing with Double Hashing.