Skip to content
Home DSA Rolling Hash — Avoid False Positives via Double Hashing

Rolling Hash — Avoid False Positives via Double Hashing

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Strings → Topic 8 of 8
10M substring comparisons: one false positive hash collision.
⚙️ Intermediate — basic DSA knowledge assumed
In this tutorial, you'll learn
10M substring comparisons: one false positive hash collision.
  • Polynomial hash enables O(m) hashing; prefix hash array enables O(1) substring hash retrieval.
  • Rolling hash removes leftmost and adds rightmost character contribution in O(1).
  • Rabin-Karp: compare hash first, then verify string only on hash match — O(n+m) average.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Rolling hash uses polynomial hashing with a sliding window to compute substring hashes in O(1) time.
  • Precompute prefix hashes and powers of base to enable O(1) substring hash retrieval.
  • Rabin-Karp uses rolling hash for pattern matching with O(n+m) average time.
  • Collision probability is ~1/M per comparison; double hashing reduces it to ~1/M².
  • Production pitfall: hash collisions cause false positives; always verify string equality on hash match.
  • Performance insight: rolling hash avoids O(m) rehashing per window, saving up to 10x in large strings.
🚨 START HERE

Quick Debug Cheat Sheet: Rolling Hash Collisions

Use these steps when you suspect hash collisions are affecting your application.
🟡

Two different substrings produce same hash

Immediate ActionCompare strings directly to confirm collision.
Commands
python -c "print(get_hash(l1,r1) == get_hash(l2,r2))" # verify hash equality
Calculate expected collisions: n^2/(2*M) where n is number of windows.
Fix NowAdd string verification after hash match, or switch to double hashing with second base/mod.
🟡

Hash overflow producing negative values

Immediate ActionCheck language's modulo operator behavior with negative numbers.
Commands
print((h - other * pow) % MOD) # ensure positive by adding MOD before modulo
Use helper function: def mod(x): return (x % MOD + MOD) % MOD
Fix NowWrap all subtraction results with ((x % MOD) + MOD) % MOD.
Production Incident

False Positive in Plagiarism Detector Caused by Hash Collision

A single hash collision triggered a false plagiarism accusation across 200 student submissions.
SymptomTwo independently written essays were flagged as identical. Manual review showed they were completely different. The hash values matched, but the strings did not.
AssumptionSingle hash (M=10^9+7) was assumed sufficient; collision probability was considered negligible.
Root causeWith 10 million substring comparisons, the expected number of collisions was ~10, but one happened to involve two long strings, causing a false positive.
FixSwitched to double hashing with two independent prime moduli, and added mandatory string verification on every hash match.
Key Lesson
Always verify string equality after a hash match.Double hashing is cheap insurance against production collisions.When false positives have legal or financial consequences, verification is non-negotiable.
Production Debug Guide

Symptom → Action grid for common rolling hash failures

Hash values match but strings differConfirm with full string comparison; check hash function implementation for modulus overflow or negative modulo handling.
Substring hash gives incorrect resultVerify prefix hash formula: ensure correct order of modulo and subtraction; check power array initialization and indexing.
Performance degraded due to many false positivesIncrease modulus to a larger prime or switch to double hashing. Monitor collision frequency via logging hash matches that fail string verification.

In 2011, several web frameworks were hit by hash collision DoS attacks. Attackers crafted HTTP POST requests where all parameter names hashed to the same bucket. What should be O(n) request parsing became O(n^2) — a single crafted 100KB request could saturate a server's CPU for seconds. Python, PHP, Ruby, and Java were all affected at the same time.

The fix in Python 3.3 was hash randomisation — choosing a random hash function from a family at startup. Rolling hash and universal hashing are closely related: both are about selecting hash functions with predictable, controllable collision probability. Understanding rolling hash means understanding both the attack vector and the defence.

Polynomial Hash

The polynomial hash of a string S = s[0]s[1]...s[m-1] with base B and modulus M is:

hash(S) = (s[0]·B^(m-1) + s[1]·B^(m-2) + ... + s[m-1]·B^0) mod M

Common values: B = 31 or 131, M = 10^9+7 or 10^9+9 (large primes).

poly_hash.py · PYTHON
123456789101112
MOD = 10**9 + 7
BASE = 131

def poly_hash(s: str) -> int:
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

print(poly_hash('hello'))   # some large int
print(poly_hash('hello'))   # same — deterministic
print(poly_hash('world'))   # different
▶ Output
566852114
566852114
791390312
Mental Model
Fingerprint Analogy
Think of polynomial hash as a fingerprint: two strings with the same fingerprint are almost certainly the same, but collisions are possible.
  • Just like DNA fingerprinting in forensics, you need a second test for confirmation.
  • The modulus is the resolution of the fingerprint – larger modulus = finer detail, fewer matches by chance.
  • The base should be larger than the alphabet size to avoid trivial collisions (e.g., base 31 > 26 for lowercase letters).
📊 Production Insight
Incorrect modulus selection (e.g., using 10^9+5 instead of a prime) increases collision risk by 10x.
Overflow in languages without big integer support (C++) causes silent wrap-around, breaking hash consistency.
Rule: Always use a large prime modulus and test with random strings before production.
🎯 Key Takeaway
Polynomial hash gives deterministic O(m) fingerprint.
Modulus must be prime to maintain uniform distribution.
This is the building block for all rolling hash variants.

Prefix Hash Array — O(1) Substring Hash

Precompute prefix hashes so any substring hash can be retrieved in O(1).

prefix_hash.py · PYTHON
1234567891011121314
def build_prefix_hash(s: str, base=131, mod=10**9+7):
    n = len(s)
    h = [0] * (n + 1)
    pw = [1] * (n + 1)  # pw[i] = base^i
    for i in range(n):
        h[i+1] = (h[i] * base + ord(s[i])) % mod
        pw[i+1] = pw[i] * base % mod
    def get_hash(l, r):  # hash of s[l..r] inclusive
        return (h[r+1] - h[l] * pw[r-l+1]) % mod
    return get_hash

get_hash = build_prefix_hash('abracadabra')
print(get_hash(0, 3))   # hash of 'abra'
print(get_hash(7, 10))  # hash of 'abra' — same!
▶ Output
6874538
6874538
📊 Production Insight
Forgetting to compute powers of base up to n causes index-out-of-range errors when retrieving substring hash.
Using the same modulus for powers and hash reduces computational cost but requires careful modulo handling.
Rule: Precompute both prefix hash and powers in one pass to avoid O(n^2) recomputation.
🎯 Key Takeaway
Prefix hash array enables O(1) substring hash retrieval.
Store prefix hash and powers separately.
The formula h[r+1] - h[l] * pow[r-l+1] works modulo M.

Rabin-Karp Pattern Matching

Use rolling hash to compare pattern hash against each window hash in O(1). Full match check only on hash match — reducing comparisons dramatically.

rabin_karp.py · PYTHON
1234567891011121314
def rabin_karp(text: str, pattern: str, base=131, mod=10**9+7) -> list[int]:
    n, m = len(text), len(pattern)
    if m > n:
        return []
    ph = poly_hash(pattern)
    get_hash = build_prefix_hash(text, base, mod)
    matches = []
    for i in range(n - m + 1):
        if get_hash(i, i+m-1) == ph:
            if text[i:i+m] == pattern:  # confirm to avoid hash collision
                matches.append(i)
    return matches

print(rabin_karp('AABAACAADAABAABA', 'AABA'))  # [0, 9, 12]
▶ Output
[0, 9, 12]
📊 Production Insight
Rabin-Karp without confirmation check can produce false positives in production, especially with large text (e.g., genome sequences).
Choosing too small a modulus (e.g., 10^6+7) leads to frequent collisions and O(nm) worst-case performance.
Rule: Always confirm hash matches by string comparison; use double hashing for critical applications.
🎯 Key Takeaway
Rabin-Karp uses rolling hash to skip most comparisons.
Average time O(n+m), worst-case O(nm) if many hash collisions.
String verification is mandatory after hash match.

Double Hashing — Avoiding Collisions

A single hash has collision probability ~1/M. With M=10^9+7 and n=10^6 windows, expected collisions ≈ 10^6/10^9 ≈ 0.001 — very low but not zero. For safety in competitive programming, use double hashing: two independent hash functions. The probability of simultaneous collision is ~1/M² ≈ 10^-18.

double_hash.py · PYTHON
12345678910
def double_hash(s: str):
    get1 = build_prefix_hash(s, base=131, mod=10**9+7)
    get2 = build_prefix_hash(s, base=137, mod=10**9+9)
    def get(l, r):
        return (get1(l, r), get2(l, r))
    return get

gh = double_hash('abracadabra')
print(gh(0, 3))   # (hash1_abra, hash2_abra)
print(gh(7, 10))  # same tuple — guaranteed match
▶ Output
(6874538, 7692145)
(6874538, 7692145)
🔥Why Two Independent Moduli?
Using the same modulus for both hashes violates independence – a flaw in the modulus affects both. Choose two distinct large primes (e.g., 10^9+7 and 10^9+9) and different bases (e.g., 131 and 137) to ensure near-true independence.
📊 Production Insight
Double hashing with two primes (e.g., 10^9+7, 10^9+9) reduces collision probability to ~10^-18.
Storing two hashes doubles memory but prevents catastrophic false positives.
Rule: Use different bases (e.g., 131, 137) to ensure independence.
🎯 Key Takeaway
Single hash collision probability ~1/M.
Double hashing with independent functions drops it to ~1/M².
Use different bases and primes for true independence.

Applications

Longest duplicate substring: Binary search on length + hash set to check for duplicates — O(n log n).

Longest common substring of two strings: Concatenate and use hash comparisons — O(n log n).

Plagiarism detection: Rabin-Karp with multiple patterns simultaneously.

DNA sequence analysis: Sliding window over long genomic strings.

String equality in O(1): After O(n) preprocessing, compare any two substrings in O(1).

📊 Production Insight
During DNA sequence analysis, rolling hash is used to find repeated subsequences; a collision could lead to missed or false repeats, biasing research findings.
In plagiarism detection, a single collision can wrongly accuse a student; double hashing is legally mandatory.
Rule: Validate hash-based results with statistical checks when false positives have high cost.
🎯 Key Takeaway
Rolling hash powers: Rabin-Karp, longest duplicate substring, plagiarism detection.
Always consider collision cost in your application domain.
For critical systems, combine rolling hash with other methods (e.g., suffix arrays) for verification.
🗂 Single vs Double Hashing
PropertySingle HashDouble Hash
Collision probability~1/M~1/(M1 * M2) ≈ 10^-18
Memory per substring1 integer2 integers (tuple)
Computation per windowO(1)O(1) – 2x operations
Typical use caseCompetitive programming, non-criticalProduction systems, legal/medical

🎯 Key Takeaways

  • Polynomial hash enables O(m) hashing; prefix hash array enables O(1) substring hash retrieval.
  • Rolling hash removes leftmost and adds rightmost character contribution in O(1).
  • Rabin-Karp: compare hash first, then verify string only on hash match — O(n+m) average.
  • Use double hashing (two independent hash functions) to reduce collision probability to near-zero.
  • Rolling hash is the foundation for Rabin-Karp, longest duplicate substring, and many competitive programming string problems.
  • Always verify string equality after a hash match to avoid false positives in production.
  • Choose a large prime modulus and test with random strings to validate hash uniformity.

⚠ Common Mistakes to Avoid

    Using a small modulus like 10^6+7
    Symptom

    Frequent hash collisions causing many false positives in Rabin-Karp, degrading performance to O(nm).

    Fix

    Use a large prime modulus around 10^9 (like 10^9+7 or 10^9+9).

    Forgetting to compute powers array
    Symptom

    Substring hash retrieval produces incorrect results, or index out of bounds when accessing pow[r-l+1].

    Fix

    Precompute pow array of length n+1 before building prefix hash.

    Neglecting string confirmation after hash match
    Symptom

    False positives in pattern matching: reporting match where strings are different.

    Fix

    Always compare actual substrings when hashes match.

Interview Questions on This Topic

  • QHow does a rolling hash enable O(1) window updates?Mid-levelReveal
    By precomputing prefix hashes and powers, the hash of a new window is derived from the previous hash: subtract contribution of outgoing character multiplied by appropriate power, multiply by base, add incoming character, all modulo M. This avoids rehashing the entire window.
  • QWhat is the probability of a hash collision with a single polynomial hash?JuniorReveal
    Approximately 1/M, where M is the modulus. With M=10^9+7 and 10^6 windows, expected collisions ~0.001. But with many comparisons, collisions can occur.
  • QHow would you find the longest duplicate substring using rolling hash and binary search?SeniorReveal
    Binary search on length L. For each L, compute hashes of all substrings of length L using sliding window and store in a set. If a duplicate hash is found, verify strings to confirm duplicate. If duplicate exists, try larger L; else smaller. Complexity O(n log n).
  • QWhy is a confirmation check needed after a hash match in Rabin-Karp?JuniorReveal
    Because hash collisions can occur - two different strings can have the same hash. Confirmation via direct string comparison ensures correctness. Without it, false positives can happen, especially with small moduli or many comparisons.
  • QWhat are the trade-offs between single and double hashing?Mid-levelReveal
    Single hashing is faster and uses less memory, but collision probability is 1/M. Double hashing uses twice the memory and computation but reduces collision probability to 1/M^2. For competitive programming, single hash is usually enough with careful modulus. For production systems where false positives are costly, double is recommended.

Frequently Asked Questions

What base and modulus should I use for polynomial hashing?

Base 31 or 131 work well for lowercase ASCII. Use a large prime modulus like 10^9+7. For safety, use two (base, mod) pairs and compare both hashes — the probability of collision drops to near zero.

How do I handle negative modulo in languages with signed integers?

Use (x % M + M) % M to ensure non-negative remainder.

Can I use the same modulus for both hash components in double hashing?

No, use two different primes (e.g., 10^9+7 and 10^9+9) and different bases to ensure independence.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousSuffix Array — Construction and Applications
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged