Knowlet

Unit 5: Hashing

1. Introduction to Hashing

Hashing is a technique used to map data of arbitrary size (the "key") to data of a fixed size (the "hash value" or "index"). This is used to implement a Hash Table, a data structure that provides very fast O(1) average-case insertion, deletion, and search.

It's like a "magic" array where you can store an item (e.g., a person's name) and find it instantly without searching.

Hash Table, Hash Key, Hash Function

  • Hash Table: This is the data structure itself, typically an array. Each slot in the array is called a "bucket" or "slot."
  • Hash Key: This is the input data (e.g., a string "John Smith", a number 12345) that we want to store or search for.
  • Hash Function: This is the "magic" function. It takes the key as input and computes an index (a hash value) for the Hash Table.
    index = hash_function(key)

    The goal is to store the data associated with "John Smith" at hash_table[index].

Load Factor (α)

The Load Factor is a measure of how full the hash table is. It is crucial for performance.

α = n / M
where:
n = number of elements stored in the table
M = size of the hash table (number of buckets)

A high load factor (e.g., α > 1 for chaining, α > 0.7 for open addressing) leads to more collisions and slower performance. A low load factor is fast but wastes memory.

2. Hash Functions

Characteristics of Good Hash Functions

  1. Fast Computation: It must be very quick to compute (ideally O(1)).
  2. Uniform Distribution (Deterministic): It should map keys as evenly as possible across all buckets. Given a key, it must always produce the same index.
  3. Low Collisions: It should minimize the number of "collisions" (when two different keys map to the same index).

Types of Hash Functions

  • 1. Division Method:

    The simplest method. Take the key k and the table size M.

    h(k) = k % M (k modulo M)

    Tip: M should ideally be a prime number that is not close to a power of 2. This helps distribute the keys more uniformly.

  • 2. Mid-Square Method:

    Square the key k, and then take the middle r digits of the result as the index.

    Example: k = 1234, M = 100 (so r=2).
    k² = 1522756. Middle 2 digits: 27. index = 27.

  • 3. Folding Method:

    The key is partitioned into several parts, and the parts are combined (e.g., added) to get the index.

    Example: k = 12345678, M = 1000. Fold into parts of 3: 123, 456, 78. Add them: 123 + 456 + 78 = 657. index = 657.

3. Collision

A Collision occurs when a hash function maps two or more different keys to the same index (bucket).

h(key1) == h(key2), where key1 != key2

Since collisions are inevitable (Pigeonhole Principle), we must have a strategy to handle them.

Collision Resolution Techniques

There are two main categories for resolving collisions:

  1. Closed Addressing (or Separate Chaining): Store the colliding elements *outside* the table.
  2. Open Addressing (or Closed Hashing): Store the colliding elements in *another empty bucket* within the table.
Confusing Terminology:
  • Open Hashing = Closed Addressing = Separate Chaining (Buckets are "open" to hold more than one item via a list).
  • Closed Hashing = Open Addressing = Probing (The table is "closed"; all items must fit within the table's "address" space).
The syllabus uses "Open Addressing & closed Addressing", which is ambiguous. We will follow the standard definitions where "Closed Addressing" means Separate Chaining and "Open Addressing" means Probing.

4. Closed Addressing (Separate Chaining)

This is the simplest and most common technique.

  • Idea: Each bucket in the hash table (M) does not hold the element itself, but instead holds a pointer to a linked list.
  • Operation:
    • Insert: Compute index = h(key). Go to hash_table[index] and insert the new element at the beginning of that linked list (O(1)).
    • Search: Compute index = h(key). Go to hash_table[index] and perform a linear search on the linked list (O(n/M) or O(α)).

Advantages:

  • Very simple to implement.
  • Load factor (α) can be greater than 1.
  • Deletion is easy (just remove from the linked list).

Disadvantages:

  • Wastes memory on pointers.
  • Can be slow if one chain becomes very long (poor hash function).

5. Open Addressing (Probing)

In this method, all elements are stored inside the hash table array itself. When a collision occurs, we "probe" (search) for the next available empty bucket according to a rule.

The hash function is modified to include a probe sequence i (where i = 0, 1, 2, ...).

h(key, i) = (h'(key) + f(i)) % M

Important: The load factor α must be less than 1 (α ≤ 1). The table can never be more than 100% full.

Linear Probing

This is the simplest probing method. If a collision occurs, just check the next sequential bucket.

  • Function: f(i) = i.
  • Probe Sequence: h'(k), (h'(k) + 1) % M, (h'(k) + 2) % M, ...
  • Problem: Primary Clustering.

    If collisions start to build up in one spot, they create a "cluster" of occupied buckets. A new insertion in that area will have to travel a long way to find an empty spot, making search/insert times degrade toward O(n).

Quadratic Probing

This method probes using a quadratic step, which helps avoid primary clustering.

  • Function: f(i) = c1*i + c2*i² (Commonly, f(i) = i²).
  • Probe Sequence: h'(k), (h'(k) + 1) % M, (h'(k) + 4) % M, (h'(k) + 9) % M, ...
  • Problem: Secondary Clustering.

    If two keys have the same initial hash h'(k), they will follow the exact same probe sequence. This is less of a problem than primary clustering but still not ideal.

Double Hashing

This is the best open addressing method. It uses a second hash function to determine the "step size" of the probe. The step size is different for every key.

  • Function: f(i) = i * h2(k).
  • Probe Sequence: h'(k), (h'(k) + h2(k)) % M, (h'(k) + 2*h2(k)) % M, ...
  • h2(k) must be chosen carefully (e.g., h2(k) = 1 + (k % (M-1))) to ensure it never produces 0.
  • Advantage: This method avoids both primary and secondary clustering. It distributes keys most uniformly.

6. Comparison of Collision Resolution Techniques

Feature Separate Chaining (Closed Addr.) Open Addressing (Linear Probing) Open Addressing (Double Hashing)
Load Factor (α) Can be > 1. Must be < 1. (Performance degrades badly after 0.7). Must be < 1.
Clustering None (but lists can get long). Primary Clustering (Bad). None (Good).
Space Usage Extra space for pointers. No extra space. Good cache performance. No extra space. Good cache performance.
Deletion Easy. Very difficult (must use "tombstones" or rehash). Very difficult.
When to use which?
  • If you need a simple implementation and deletion is common, use Separate Chaining.
  • If memory is tight, speed is critical, and deletions are rare, use Open Addressing with Double Hashing.

Did this resource help you study?

Share feedback or report issues to help improve this resource.