Homeβ€Ί DSAβ€Ί Universal Hashing and Perfect Hashing

Universal Hashing and Perfect Hashing

Where developers are forged. Β· Structured learning Β· Free forever.
πŸ“ Part of: Hashing β†’ Topic 10 of 11
Learn universal hashing, 2-universal hash families, and perfect hashing.
πŸ”₯ Advanced β€” solid DSA foundation required
In this tutorial, you'll learn:
  • Fixed hash functions are attackable β€” adversary can force O(n) operations.
  • 2-universal family: Pr[h(x)=h(y)] ≀ 1/m for random h, any xβ‰ y.
  • Python 3.3+ randomises hash seeds β€” this is universal hashing in practice.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚑ Quick Answer
Any fixed hash function can be attacked β€” an adversary who knows your function can craft inputs that all collide, degrading O(1) hash table lookups to O(n). Universal hashing solves this: choose a random hash function from a family at runtime. The adversary doesn't know which one you picked, so they can't craft collisions. On average, any two keys collide with probability at most 1/m.

In 2011, the same hash collision vulnerability hit Python, PHP, Ruby, Java, and several other languages simultaneously. Attackers discovered that submitting carefully crafted HTTP POST data β€” with parameter names all hashing to the same slot β€” could cause web servers to spend seconds processing a single request. The fix was not changing the hash function; it was randomising the hash function at startup. This is universal hashing in practice.

Universal hashing is the theoretical foundation for this defence: if the hash function is chosen randomly from a 2-universal family, no adversary can reliably cause collisions without knowing which function was selected. The adversary's best attack strategy β€” crafting inputs that collide under the fixed function β€” is defeated.

Why Fixed Hash Functions Fail

Any deterministic hash function h can be attacked. Given h, an adversary can find n keys that all hash to the same slot β€” making every operation O(n). This is a real attack: hash DoS attacks against web servers were widely exploited before randomisation became standard (Crosby & Wallach, 2003).

⚠️
Hash DoS AttackIn 2011, multiple web frameworks (PHP, Python, Ruby, Java) were vulnerable to hash collision DoS attacks. An attacker could send POST requests with carefully crafted keys that all collided in the request parameter hash table, causing 100% CPU usage. Fix: randomise hash seeds.

Universal Hash Family Construction

The polynomial hash family for strings is 2-universal: choose random a,b modulo prime p > m: h_{a,b}(x) = (ax + b) mod p mod m

universal_hash.py Β· PYTHON
1234567891011121314151617181920212223
import random

class UniversalHash:
    """2-universal hash family for integers."""
    def __init__(self, m: int, prime: int = 10**9+7):
        self.m = m
        self.p = prime
        self.a = random.randint(1, prime - 1)
        self.b = random.randint(0, prime - 1)

    def hash(self, x: int) -> int:
        return (self.a * x + self.b) % self.p % self.m

    def rehash(self):
        """Pick a new random function from the family."""
        self.a = random.randint(1, self.p - 1)
        self.b = random.randint(0, self.p - 1)

# Demonstrate: collision probability β‰ˆ 1/m
h = UniversalHash(m=100)
collisions = sum(1 for _ in range(10000)
                 if h.hash(42) == h.hash(99))
print(f'Collision probability β‰ˆ {collisions/10000:.3f} (expected ≀ {1/100:.3f})')
β–Ά Output
Collision probability β‰ˆ 0.009 (expected ≀ 0.010

Python's Hash Randomisation

Since Python 3.3, Python randomises its string/bytes hash function at startup (PYTHONHASHSEED). This is exactly universal hashing in practice β€” prevents hash DoS attacks.

python_hash_seed.py Β· PYTHON
123456789
import os
# Python uses random seed at startup
print(hash('hello'))  # Different every run unless PYTHONHASHSEED set
print(os.environ.get('PYTHONHASHSEED', 'random'))  # 'random' = randomised

# Stable hashing when needed (e.g., caching)
import hashlib
stable = int(hashlib.md5(b'hello').hexdigest(), 16) % (10**9)
print(stable)  # Same every run
β–Ά Output
7483924618273645 # changes each run
random
294702384

Perfect Hashing β€” O(1) Worst Case

For a static set of n keys, FKS (Fredman-KomlΓ³s-SzemerΓ©di) perfect hashing achieves O(1) worst-case lookup with O(n) space: 1. Hash n keys into n/2 buckets (first level, some collisions expected) 2. For each bucket with k_i keys, use a second-level table of size k_iΒ² β€” guarantees no collisions with high probability

Total space: O(n) expected. Build time: O(n) expected. Lookup: O(1) worst case.

Used in compilers (keyword lookup), databases (static index), and network packet classification.

🎯 Key Takeaways

  • Fixed hash functions are attackable β€” adversary can force O(n) operations.
  • 2-universal family: Pr[h(x)=h(y)] ≀ 1/m for random h, any xβ‰ y.
  • Python 3.3+ randomises hash seeds β€” this is universal hashing in practice.
  • Perfect hashing: O(1) worst-case lookup for static key sets using two-level scheme.
  • FKS hashing achieves O(n) space and O(1) worst case for static sets.

Interview Questions on This Topic

  • QWhat is a universal hash family and why is it important?
  • QHow does Python prevent hash DoS attacks?
  • QExplain the two-level perfect hashing scheme.
  • QWhat is the collision probability guarantee for a 2-universal hash family?

Frequently Asked Questions

Is Python's dict vulnerable to hash collision attacks?

Not since Python 3.3 β€” PYTHONHASHSEED randomises string/bytes hashing at startup. Integers still hash deterministically (hash(n)=n for small n), but integer-keyed dict DoS is harder to exploit in practice.

πŸ”₯
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousMD5 Hash AlgorithmNext β†’Hamming Code β€” Error Detection and Correction
Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged