Unit 5: Hashing

Table of Contents

1. Introduction to Hashing

Hashing is a technique used to map data of arbitrary size (the "key") to data of a fixed size (the "hash value" or "index"). This is used to implement a Hash Table, a data structure that provides very fast O(1) average-case insertion, deletion, and search.

It's like a "magic" array where you can store an item (e.g., a person's name) and find it instantly without searching.

Hash Table, Hash Key, Hash Function

Load Factor (α)

The Load Factor is a measure of how full the hash table is. It is crucial for performance.

α = n / M
where:
n = number of elements stored in the table
M = size of the hash table (number of buckets)

A high load factor (e.g., α > 1 for chaining, α > 0.7 for open addressing) leads to more collisions and slower performance. A low load factor is fast but wastes memory.

2. Hash Functions

Characteristics of Good Hash Functions

  1. Fast Computation: It must be very quick to compute (ideally O(1)).
  2. Uniform Distribution (Deterministic): It should map keys as evenly as possible across all buckets. Given a key, it must always produce the same index.
  3. Low Collisions: It should minimize the number of "collisions" (when two different keys map to the same index).

Types of Hash Functions

3. Collision

A Collision occurs when a hash function maps two or more different keys to the same index (bucket).

h(key1) == h(key2), where key1 != key2

Since collisions are inevitable (Pigeonhole Principle), we must have a strategy to handle them.

Collision Resolution Techniques

There are two main categories for resolving collisions:

  1. Closed Addressing (or Separate Chaining): Store the colliding elements *outside* the table.
  2. Open Addressing (or Closed Hashing): Store the colliding elements in *another empty bucket* within the table.
Confusing Terminology: The syllabus uses "Open Addressing & closed Addressing", which is ambiguous. We will follow the standard definitions where "Closed Addressing" means Separate Chaining and "Open Addressing" means Probing.

4. Closed Addressing (Separate Chaining)

This is the simplest and most common technique.

Advantages:

Disadvantages:

5. Open Addressing (Probing)

In this method, all elements are stored inside the hash table array itself. When a collision occurs, we "probe" (search) for the next available empty bucket according to a rule.

The hash function is modified to include a probe sequence i (where i = 0, 1, 2, ...).

h(key, i) = (h'(key) + f(i)) % M

Important: The load factor α must be less than 1 (α ≤ 1). The table can never be more than 100% full.

Linear Probing

This is the simplest probing method. If a collision occurs, just check the next sequential bucket.

Quadratic Probing

This method probes using a quadratic step, which helps avoid primary clustering.

Double Hashing

This is the best open addressing method. It uses a second hash function to determine the "step size" of the probe. The step size is different for every key.

6. Comparison of Collision Resolution Techniques

Feature Separate Chaining (Closed Addr.) Open Addressing (Linear Probing) Open Addressing (Double Hashing)
Load Factor (α) Can be > 1. Must be < 1. (Performance degrades badly after 0.7). Must be < 1.
Clustering None (but lists can get long). Primary Clustering (Bad). None (Good).
Space Usage Extra space for pointers. No extra space. Good cache performance. No extra space. Good cache performance.
Deletion Easy. Very difficult (must use "tombstones" or rehash). Very difficult.
When to use which?