Rolling Hash — Avoid False Positives via Double Hashing
- Polynomial hash enables O(m) hashing; prefix hash array enables O(1) substring hash retrieval.
- Rolling hash removes leftmost and adds rightmost character contribution in O(1).
- Rabin-Karp: compare hash first, then verify string only on hash match — O(n+m) average.
- Rolling hash uses polynomial hashing with a sliding window to compute substring hashes in O(1) time.
- Precompute prefix hashes and powers of base to enable O(1) substring hash retrieval.
- Rabin-Karp uses rolling hash for pattern matching with O(n+m) average time.
- Collision probability is ~1/M per comparison; double hashing reduces it to ~1/M².
- Production pitfall: hash collisions cause false positives; always verify string equality on hash match.
- Performance insight: rolling hash avoids O(m) rehashing per window, saving up to 10x in large strings.
Quick Debug Cheat Sheet: Rolling Hash Collisions
Two different substrings produce same hash
python -c "print(get_hash(l1,r1) == get_hash(l2,r2))" # verify hash equalityCalculate expected collisions: n^2/(2*M) where n is number of windows.Hash overflow producing negative values
print((h - other * pow) % MOD) # ensure positive by adding MOD before moduloUse helper function: def mod(x): return (x % MOD + MOD) % MODProduction Incident
Production Debug GuideSymptom → Action grid for common rolling hash failures
In 2011, several web frameworks were hit by hash collision DoS attacks. Attackers crafted HTTP POST requests where all parameter names hashed to the same bucket. What should be O(n) request parsing became O(n^2) — a single crafted 100KB request could saturate a server's CPU for seconds. Python, PHP, Ruby, and Java were all affected at the same time.
The fix in Python 3.3 was hash randomisation — choosing a random hash function from a family at startup. Rolling hash and universal hashing are closely related: both are about selecting hash functions with predictable, controllable collision probability. Understanding rolling hash means understanding both the attack vector and the defence.
Polynomial Hash
The polynomial hash of a string S = s[0]s[1]...s[m-1] with base B and modulus M is:
hash(S) = (s[0]·B^(m-1) + s[1]·B^(m-2) + ... + s[m-1]·B^0) mod M
Common values: B = 31 or 131, M = 10^9+7 or 10^9+9 (large primes).
MOD = 10**9 + 7 BASE = 131 def poly_hash(s: str) -> int: h = 0 for ch in s: h = (h * BASE + ord(ch)) % MOD return h print(poly_hash('hello')) # some large int print(poly_hash('hello')) # same — deterministic print(poly_hash('world')) # different
566852114
791390312
- Just like DNA fingerprinting in forensics, you need a second test for confirmation.
- The modulus is the resolution of the fingerprint – larger modulus = finer detail, fewer matches by chance.
- The base should be larger than the alphabet size to avoid trivial collisions (e.g., base 31 > 26 for lowercase letters).
Prefix Hash Array — O(1) Substring Hash
Precompute prefix hashes so any substring hash can be retrieved in O(1).
def build_prefix_hash(s: str, base=131, mod=10**9+7): n = len(s) h = [0] * (n + 1) pw = [1] * (n + 1) # pw[i] = base^i for i in range(n): h[i+1] = (h[i] * base + ord(s[i])) % mod pw[i+1] = pw[i] * base % mod def get_hash(l, r): # hash of s[l..r] inclusive return (h[r+1] - h[l] * pw[r-l+1]) % mod return get_hash get_hash = build_prefix_hash('abracadabra') print(get_hash(0, 3)) # hash of 'abra' print(get_hash(7, 10)) # hash of 'abra' — same!
6874538
Rabin-Karp Pattern Matching
Use rolling hash to compare pattern hash against each window hash in O(1). Full match check only on hash match — reducing comparisons dramatically.
def rabin_karp(text: str, pattern: str, base=131, mod=10**9+7) -> list[int]: n, m = len(text), len(pattern) if m > n: return [] ph = poly_hash(pattern) get_hash = build_prefix_hash(text, base, mod) matches = [] for i in range(n - m + 1): if get_hash(i, i+m-1) == ph: if text[i:i+m] == pattern: # confirm to avoid hash collision matches.append(i) return matches print(rabin_karp('AABAACAADAABAABA', 'AABA')) # [0, 9, 12]
Double Hashing — Avoiding Collisions
A single hash has collision probability ~1/M. With M=10^9+7 and n=10^6 windows, expected collisions ≈ 10^6/10^9 ≈ 0.001 — very low but not zero. For safety in competitive programming, use double hashing: two independent hash functions. The probability of simultaneous collision is ~1/M² ≈ 10^-18.
def double_hash(s: str): get1 = build_prefix_hash(s, base=131, mod=10**9+7) get2 = build_prefix_hash(s, base=137, mod=10**9+9) def get(l, r): return (get1(l, r), get2(l, r)) return get gh = double_hash('abracadabra') print(gh(0, 3)) # (hash1_abra, hash2_abra) print(gh(7, 10)) # same tuple — guaranteed match
(6874538, 7692145)
Applications
Rolling hash appears in many problems:
Longest duplicate substring: Binary search on length + hash set to check for duplicates — O(n log n).
Longest common substring of two strings: Concatenate and use hash comparisons — O(n log n).
Plagiarism detection: Rabin-Karp with multiple patterns simultaneously.
DNA sequence analysis: Sliding window over long genomic strings.
String equality in O(1): After O(n) preprocessing, compare any two substrings in O(1).
| Property | Single Hash | Double Hash |
|---|---|---|
| Collision probability | ~1/M | ~1/(M1 * M2) ≈ 10^-18 |
| Memory per substring | 1 integer | 2 integers (tuple) |
| Computation per window | O(1) | O(1) – 2x operations |
| Typical use case | Competitive programming, non-critical | Production systems, legal/medical |
🎯 Key Takeaways
- Polynomial hash enables O(m) hashing; prefix hash array enables O(1) substring hash retrieval.
- Rolling hash removes leftmost and adds rightmost character contribution in O(1).
- Rabin-Karp: compare hash first, then verify string only on hash match — O(n+m) average.
- Use double hashing (two independent hash functions) to reduce collision probability to near-zero.
- Rolling hash is the foundation for Rabin-Karp, longest duplicate substring, and many competitive programming string problems.
- Always verify string equality after a hash match to avoid false positives in production.
- Choose a large prime modulus and test with random strings to validate hash uniformity.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QHow does a rolling hash enable O(1) window updates?Mid-levelReveal
- QWhat is the probability of a hash collision with a single polynomial hash?JuniorReveal
- QHow would you find the longest duplicate substring using rolling hash and binary search?SeniorReveal
- QWhy is a confirmation check needed after a hash match in Rabin-Karp?JuniorReveal
- QWhat are the trade-offs between single and double hashing?Mid-levelReveal
Frequently Asked Questions
What base and modulus should I use for polynomial hashing?
Base 31 or 131 work well for lowercase ASCII. Use a large prime modulus like 10^9+7. For safety, use two (base, mod) pairs and compare both hashes — the probability of collision drops to near zero.
How do I handle negative modulo in languages with signed integers?
Use (x % M + M) % M to ensure non-negative remainder.
Can I use the same modulus for both hash components in double hashing?
No, use two different primes (e.g., 10^9+7 and 10^9+9) and different bases to ensure independence.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.