Rolling Hash — Avoid False Positives via Double Hashing
10M substring comparisons: one false positive hash collision.
- Rolling hash uses polynomial hashing with a sliding window to compute substring hashes in O(1) time.
- Precompute prefix hashes and powers of base to enable O(1) substring hash retrieval.
- Rabin-Karp uses rolling hash for pattern matching with O(n+m) average time.
- Collision probability is ~1/M per comparison; double hashing reduces it to ~1/M².
- Production pitfall: hash collisions cause false positives; always verify string equality on hash match.
- Performance insight: rolling hash avoids O(m) rehashing per window, saving up to 10x in large strings.
Hashing a string normally takes O(m) time — you must read every character. A rolling hash is smarter: when your window slides by one character, instead of rehashing from scratch, you remove the leftmost character's contribution and add the new rightmost character's contribution in O(1). Like a cash register that doesn't recount all bills when you remove one — it just subtracts its value.
In 2011, several web frameworks were hit by hash collision DoS attacks. Attackers crafted HTTP POST requests where all parameter names hashed to the same bucket. What should be O(n) request parsing became O(n^2) — a single crafted 100KB request could saturate a server's CPU for seconds. Python, PHP, Ruby, and Java were all affected at the same time.
The fix in Python 3.3 was hash randomisation — choosing a random hash function from a family at startup. Rolling hash and universal hashing are closely related: both are about selecting hash functions with predictable, controllable collision probability. Understanding rolling hash means understanding both the attack vector and the defence.
Polynomial Hash
The polynomial hash of a string S = s[0]s[1]...s[m-1] with base B and modulus M is:
hash(S) = (s[0]·B^(m-1) + s[1]·B^(m-2) + ... + s[m-1]·B^0) mod M
Common values: B = 31 or 131, M = 10^9+7 or 10^9+9 (large primes).
- Just like DNA fingerprinting in forensics, you need a second test for confirmation.
- The modulus is the resolution of the fingerprint – larger modulus = finer detail, fewer matches by chance.
- The base should be larger than the alphabet size to avoid trivial collisions (e.g., base 31 > 26 for lowercase letters).
Prefix Hash Array — O(1) Substring Hash
Precompute prefix hashes so any substring hash can be retrieved in O(1).
Rabin-Karp Pattern Matching
Use rolling hash to compare pattern hash against each window hash in O(1). Full match check only on hash match — reducing comparisons dramatically.
Double Hashing — Avoiding Collisions
A single hash has collision probability ~1/M. With M=10^9+7 and n=10^6 windows, expected collisions ≈ 10^6/10^9 ≈ 0.001 — very low but not zero. For safety in competitive programming, use double hashing: two independent hash functions. The probability of simultaneous collision is ~1/M² ≈ 10^-18.
Applications
Rolling hash appears in many problems:
Longest duplicate substring: Binary search on length + hash set to check for duplicates — O(n log n).
Longest common substring of two strings: Concatenate and use hash comparisons — O(n log n).
Plagiarism detection: Rabin-Karp with multiple patterns simultaneously.
DNA sequence analysis: Sliding window over long genomic strings.
String equality in O(1): After O(n) preprocessing, compare any two substrings in O(1).
False Positive in Plagiarism Detector Caused by Hash Collision
- Always verify string equality after a hash match.
- Double hashing is cheap insurance against production collisions.
- When false positives have legal or financial consequences, verification is non-negotiable.
Key takeaways
Common mistakes to avoid
3 patternsUsing a small modulus like 10^6+7
Forgetting to compute powers array
Neglecting string confirmation after hash match
Interview Questions on This Topic
How does a rolling hash enable O(1) window updates?
Frequently Asked Questions
That's Strings. Mark it forged?
3 min read · try the examples if you haven't