Mid-level 3 min · March 24, 2026

Rolling Hash — Avoid False Positives via Double Hashing

10M substring comparisons: one false positive hash collision.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Rolling hash uses polynomial hashing with a sliding window to compute substring hashes in O(1) time.
  • Precompute prefix hashes and powers of base to enable O(1) substring hash retrieval.
  • Rabin-Karp uses rolling hash for pattern matching with O(n+m) average time.
  • Collision probability is ~1/M per comparison; double hashing reduces it to ~1/M².
  • Production pitfall: hash collisions cause false positives; always verify string equality on hash match.
  • Performance insight: rolling hash avoids O(m) rehashing per window, saving up to 10x in large strings.
Plain-English First

Hashing a string normally takes O(m) time — you must read every character. A rolling hash is smarter: when your window slides by one character, instead of rehashing from scratch, you remove the leftmost character's contribution and add the new rightmost character's contribution in O(1). Like a cash register that doesn't recount all bills when you remove one — it just subtracts its value.

In 2011, several web frameworks were hit by hash collision DoS attacks. Attackers crafted HTTP POST requests where all parameter names hashed to the same bucket. What should be O(n) request parsing became O(n^2) — a single crafted 100KB request could saturate a server's CPU for seconds. Python, PHP, Ruby, and Java were all affected at the same time.

The fix in Python 3.3 was hash randomisation — choosing a random hash function from a family at startup. Rolling hash and universal hashing are closely related: both are about selecting hash functions with predictable, controllable collision probability. Understanding rolling hash means understanding both the attack vector and the defence.

Polynomial Hash

The polynomial hash of a string S = s[0]s[1]...s[m-1] with base B and modulus M is:

hash(S) = (s[0]·B^(m-1) + s[1]·B^(m-2) + ... + s[m-1]·B^0) mod M

Common values: B = 31 or 131, M = 10^9+7 or 10^9+9 (large primes).

poly_hash.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
MOD = 10**9 + 7
BASE = 131

def poly_hash(s: str) -> int:
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

print(poly_hash('hello'))   # some large int
print(poly_hash('hello'))   # same — deterministic
print(poly_hash('world'))   # different
Output
566852114
566852114
791390312
Fingerprint Analogy
  • Just like DNA fingerprinting in forensics, you need a second test for confirmation.
  • The modulus is the resolution of the fingerprint – larger modulus = finer detail, fewer matches by chance.
  • The base should be larger than the alphabet size to avoid trivial collisions (e.g., base 31 > 26 for lowercase letters).
Production Insight
Incorrect modulus selection (e.g., using 10^9+5 instead of a prime) increases collision risk by 10x.
Overflow in languages without big integer support (C++) causes silent wrap-around, breaking hash consistency.
Rule: Always use a large prime modulus and test with random strings before production.
Key Takeaway
Polynomial hash gives deterministic O(m) fingerprint.
Modulus must be prime to maintain uniform distribution.
This is the building block for all rolling hash variants.

Prefix Hash Array — O(1) Substring Hash

Precompute prefix hashes so any substring hash can be retrieved in O(1).

prefix_hash.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
def build_prefix_hash(s: str, base=131, mod=10**9+7):
    n = len(s)
    h = [0] * (n + 1)
    pw = [1] * (n + 1)  # pw[i] = base^i
    for i in range(n):
        h[i+1] = (h[i] * base + ord(s[i])) % mod
        pw[i+1] = pw[i] * base % mod
    def get_hash(l, r):  # hash of s[l..r] inclusive
        return (h[r+1] - h[l] * pw[r-l+1]) % mod
    return get_hash

get_hash = build_prefix_hash('abracadabra')
print(get_hash(0, 3))   # hash of 'abra'
print(get_hash(7, 10))  # hash of 'abra' — same!
Output
6874538
6874538
Production Insight
Forgetting to compute powers of base up to n causes index-out-of-range errors when retrieving substring hash.
Using the same modulus for powers and hash reduces computational cost but requires careful modulo handling.
Rule: Precompute both prefix hash and powers in one pass to avoid O(n^2) recomputation.
Key Takeaway
Prefix hash array enables O(1) substring hash retrieval.
Store prefix hash and powers separately.
The formula h[r+1] - h[l] * pow[r-l+1] works modulo M.

Rabin-Karp Pattern Matching

Use rolling hash to compare pattern hash against each window hash in O(1). Full match check only on hash match — reducing comparisons dramatically.

rabin_karp.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
def rabin_karp(text: str, pattern: str, base=131, mod=10**9+7) -> list[int]:
    n, m = len(text), len(pattern)
    if m > n:
        return []
    ph = poly_hash(pattern)
    get_hash = build_prefix_hash(text, base, mod)
    matches = []
    for i in range(n - m + 1):
        if get_hash(i, i+m-1) == ph:
            if text[i:i+m] == pattern:  # confirm to avoid hash collision
                matches.append(i)
    return matches

print(rabin_karp('AABAACAADAABAABA', 'AABA'))  # [0, 9, 12]
Output
[0, 9, 12]
Production Insight
Rabin-Karp without confirmation check can produce false positives in production, especially with large text (e.g., genome sequences).
Choosing too small a modulus (e.g., 10^6+7) leads to frequent collisions and O(nm) worst-case performance.
Rule: Always confirm hash matches by string comparison; use double hashing for critical applications.
Key Takeaway
Rabin-Karp uses rolling hash to skip most comparisons.
Average time O(n+m), worst-case O(nm) if many hash collisions.
String verification is mandatory after hash match.

Double Hashing — Avoiding Collisions

A single hash has collision probability ~1/M. With M=10^9+7 and n=10^6 windows, expected collisions ≈ 10^6/10^9 ≈ 0.001 — very low but not zero. For safety in competitive programming, use double hashing: two independent hash functions. The probability of simultaneous collision is ~1/M² ≈ 10^-18.

double_hash.pyPYTHON
1
2
3
4
5
6
7
8
9
10
def double_hash(s: str):
    get1 = build_prefix_hash(s, base=131, mod=10**9+7)
    get2 = build_prefix_hash(s, base=137, mod=10**9+9)
    def get(l, r):
        return (get1(l, r), get2(l, r))
    return get

gh = double_hash('abracadabra')
print(gh(0, 3))   # (hash1_abra, hash2_abra)
print(gh(7, 10))  # same tuple — guaranteed match
Output
(6874538, 7692145)
(6874538, 7692145)
Why Two Independent Moduli?
Using the same modulus for both hashes violates independence – a flaw in the modulus affects both. Choose two distinct large primes (e.g., 10^9+7 and 10^9+9) and different bases (e.g., 131 and 137) to ensure near-true independence.
Production Insight
Double hashing with two primes (e.g., 10^9+7, 10^9+9) reduces collision probability to ~10^-18.
Storing two hashes doubles memory but prevents catastrophic false positives.
Rule: Use different bases (e.g., 131, 137) to ensure independence.
Key Takeaway
Single hash collision probability ~1/M.
Double hashing with independent functions drops it to ~1/M².
Use different bases and primes for true independence.

Applications

Longest duplicate substring: Binary search on length + hash set to check for duplicates — O(n log n).

Longest common substring of two strings: Concatenate and use hash comparisons — O(n log n).

Plagiarism detection: Rabin-Karp with multiple patterns simultaneously.

DNA sequence analysis: Sliding window over long genomic strings.

String equality in O(1): After O(n) preprocessing, compare any two substrings in O(1).

Production Insight
During DNA sequence analysis, rolling hash is used to find repeated subsequences; a collision could lead to missed or false repeats, biasing research findings.
In plagiarism detection, a single collision can wrongly accuse a student; double hashing is legally mandatory.
Rule: Validate hash-based results with statistical checks when false positives have high cost.
Key Takeaway
Rolling hash powers: Rabin-Karp, longest duplicate substring, plagiarism detection.
Always consider collision cost in your application domain.
For critical systems, combine rolling hash with other methods (e.g., suffix arrays) for verification.
● Production incidentPOST-MORTEMseverity: high

False Positive in Plagiarism Detector Caused by Hash Collision

Symptom
Two independently written essays were flagged as identical. Manual review showed they were completely different. The hash values matched, but the strings did not.
Assumption
Single hash (M=10^9+7) was assumed sufficient; collision probability was considered negligible.
Root cause
With 10 million substring comparisons, the expected number of collisions was ~10, but one happened to involve two long strings, causing a false positive.
Fix
Switched to double hashing with two independent prime moduli, and added mandatory string verification on every hash match.
Key lesson
  • Always verify string equality after a hash match.
  • Double hashing is cheap insurance against production collisions.
  • When false positives have legal or financial consequences, verification is non-negotiable.
Production debug guideSymptom → Action grid for common rolling hash failures3 entries
Symptom · 01
Hash values match but strings differ
Fix
Confirm with full string comparison; check hash function implementation for modulus overflow or negative modulo handling.
Symptom · 02
Substring hash gives incorrect result
Fix
Verify prefix hash formula: ensure correct order of modulo and subtraction; check power array initialization and indexing.
Symptom · 03
Performance degraded due to many false positives
Fix
Increase modulus to a larger prime or switch to double hashing. Monitor collision frequency via logging hash matches that fail string verification.
★ Quick Debug Cheat Sheet: Rolling Hash CollisionsUse these steps when you suspect hash collisions are affecting your application.
Two different substrings produce same hash
Immediate action
Compare strings directly to confirm collision.
Commands
python -c "print(get_hash(l1,r1) == get_hash(l2,r2))" # verify hash equality
Calculate expected collisions: n^2/(2*M) where n is number of windows.
Fix now
Add string verification after hash match, or switch to double hashing with second base/mod.
Hash overflow producing negative values+
Immediate action
Check language's modulo operator behavior with negative numbers.
Commands
print((h - other * pow) % MOD) # ensure positive by adding MOD before modulo
Use helper function: def mod(x): return (x % MOD + MOD) % MOD
Fix now
Wrap all subtraction results with ((x % MOD) + MOD) % MOD.
Single vs Double Hashing
PropertySingle HashDouble Hash
Collision probability~1/M~1/(M1 * M2) ≈ 10^-18
Memory per substring1 integer2 integers (tuple)
Computation per windowO(1)O(1) – 2x operations
Typical use caseCompetitive programming, non-criticalProduction systems, legal/medical

Key takeaways

1
Polynomial hash enables O(m) hashing; prefix hash array enables O(1) substring hash retrieval.
2
Rolling hash removes leftmost and adds rightmost character contribution in O(1).
3
Rabin-Karp
compare hash first, then verify string only on hash match — O(n+m) average.
4
Use double hashing (two independent hash functions) to reduce collision probability to near-zero.
5
Rolling hash is the foundation for Rabin-Karp, longest duplicate substring, and many competitive programming string problems.
6
Always verify string equality after a hash match to avoid false positives in production.
7
Choose a large prime modulus and test with random strings to validate hash uniformity.

Common mistakes to avoid

3 patterns
×

Using a small modulus like 10^6+7

Symptom
Frequent hash collisions causing many false positives in Rabin-Karp, degrading performance to O(nm).
Fix
Use a large prime modulus around 10^9 (like 10^9+7 or 10^9+9).
×

Forgetting to compute powers array

Symptom
Substring hash retrieval produces incorrect results, or index out of bounds when accessing pow[r-l+1].
Fix
Precompute pow array of length n+1 before building prefix hash.
×

Neglecting string confirmation after hash match

Symptom
False positives in pattern matching: reporting match where strings are different.
Fix
Always compare actual substrings when hashes match.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does a rolling hash enable O(1) window updates?
Q02JUNIOR
What is the probability of a hash collision with a single polynomial has...
Q03SENIOR
How would you find the longest duplicate substring using rolling hash an...
Q04JUNIOR
Why is a confirmation check needed after a hash match in Rabin-Karp?
Q05SENIOR
What are the trade-offs between single and double hashing?
Q01 of 05SENIOR

How does a rolling hash enable O(1) window updates?

ANSWER
By precomputing prefix hashes and powers, the hash of a new window is derived from the previous hash: subtract contribution of outgoing character multiplied by appropriate power, multiply by base, add incoming character, all modulo M. This avoids rehashing the entire window.
FAQ · 3 QUESTIONS

Frequently Asked Questions

01
What base and modulus should I use for polynomial hashing?
02
How do I handle negative modulo in languages with signed integers?
03
Can I use the same modulus for both hash components in double hashing?
🔥

That's Strings. Mark it forged?

3 min read · try the examples if you haven't

Previous
Suffix Array — Construction and Applications
8 / 8 · Strings
Next
Fractional Knapsack Problem — Greedy Approach