Rolling Hash — Avoid False Positives via Double Hashing
10M substring comparisons: one false positive hash collision.
20+ years shipping performance-critical code where algorithms decide the bill. Notes here come from systems that actually shipped.
- Rolling hash uses polynomial hashing with a sliding window to compute substring hashes in O(1) time.
- Precompute prefix hashes and powers of base to enable O(1) substring hash retrieval.
- Rabin-Karp uses rolling hash for pattern matching with O(n+m) average time.
- Collision probability is ~1/M per comparison; double hashing reduces it to ~1/M².
- Production pitfall: hash collisions cause false positives; always verify string equality on hash match.
- Performance insight: rolling hash avoids O(m) rehashing per window, saving up to 10x in large strings.
Hashing a string normally takes O(m) time — you must read every character. A rolling hash is smarter: when your window slides by one character, instead of rehashing from scratch, you remove the leftmost character's contribution and add the new rightmost character's contribution in O(1). Like a cash register that doesn't recount all bills when you remove one — it just subtracts its value.
In 2011, several web frameworks were hit by hash collision DoS attacks. Attackers crafted HTTP POST requests where all parameter names hashed to the same bucket. What should be O(n) request parsing became O(n^2) — a single crafted 100KB request could saturate a server's CPU for seconds. Python, PHP, Ruby, and Java were all affected at the same time.
The fix in Python 3.3 was hash randomisation — choosing a random hash function from a family at startup. Rolling hash and universal hashing are closely related: both are about selecting hash functions with predictable, controllable collision probability. Understanding rolling hash means understanding both the attack vector and the defence.
Why Rolling Hash Is a Sliding Window Checksum
Rolling hash is a technique that computes a hash over a sliding window of data in O(1) time per shift, rather than O(k) where k is the window size. The core mechanic: treat the string as a base-b number modulo a large prime, then update the hash by subtracting the outgoing character's contribution, shifting left by one power of b, and adding the incoming character. This gives you a constant-time fingerprint for every substring of fixed length.
In practice, the hash is computed as h = (h - s[i] b^(k-1)) b + s[i+k] mod M. The choice of base b and modulus M determines collision probability. A single 64-bit modulus gives ~2^-64 collision chance per pair, but adversarial inputs can force collisions. Double hashing — using two independent moduli — drops collision probability to ~2^-128, making false positives negligible for most systems.
Use rolling hash when you need to find repeated substrings, detect plagiarism, or implement Rabin-Karp substring search. It matters because it replaces O(n*k) naive comparison with O(n) preprocessing and O(1) per window — critical when scanning gigabytes of text or streaming data. Real systems like rsync, deduplication engines, and DNA sequence aligners depend on this.
Polynomial Hash
The polynomial hash of a string S = s[0]s[1]...s[m-1] with base B and modulus M is:
hash(S) = (s[0]·B^(m-1) + s[1]·B^(m-2) + ... + s[m-1]·B^0) mod M
Common values: B = 31 or 131, M = 10^9+7 or 10^9+9 (large primes).
- Just like DNA fingerprinting in forensics, you need a second test for confirmation.
- The modulus is the resolution of the fingerprint – larger modulus = finer detail, fewer matches by chance.
- The base should be larger than the alphabet size to avoid trivial collisions (e.g., base 31 > 26 for lowercase letters).
Prefix Hash Array — O(1) Substring Hash
Precompute prefix hashes so any substring hash can be retrieved in O(1).
Rabin-Karp Pattern Matching
Use rolling hash to compare pattern hash against each window hash in O(1). Full match check only on hash match — reducing comparisons dramatically.
Double Hashing — Avoiding Collisions
A single hash has collision probability ~1/M. With M=10^9+7 and n=10^6 windows, expected collisions ≈ 10^6/10^9 ≈ 0.001 — very low but not zero. For safety in competitive programming, use double hashing: two independent hash functions. The probability of simultaneous collision is ~1/M² ≈ 10^-18.
Applications
Rolling hash appears in many problems:
Longest duplicate substring: Binary search on length + hash set to check for duplicates — O(n log n).
Longest common substring of two strings: Concatenate and use hash comparisons — O(n log n).
Plagiarism detection: Rabin-Karp with multiple patterns simultaneously.
DNA sequence analysis: Sliding window over long genomic strings.
String equality in O(1): After O(n) preprocessing, compare any two substrings in O(1).
How Modulo Arithmetic Kills Real-World Performance
Every rolling hash tutorial on the planet will tell you to use (base * hash + char) % MOD. That modulo operation looks innocent — it's not. On modern x86-64, integer division (which modulo requires) takes 20-80 cycles, versus 1-2 cycles for addition or multiplication. When you're hashing a 10MB string, you've just spent millions of cycles burning through your CPU's pipeline for no algorithmic benefit.
The fix? Use power-of-two modulus and let the hardware handle it. For example, MOD = 2^31-1 (a Mersenne prime) allows you to replace % MOD with bitwise AND and shift tricks. Or if you're hashing for equality only (not for cryptographic safety), pick MOD = 2^32 and rely on natural overflow. Java's int overflow is defined as wrapping — use that. The collisions stay rare, the speed jumps 10x.
The WHY: Your hash function doesn't need to be cryptographically sound. It only needs to separate strings that are actually different. Accept a 1-in-a-trillion collision rate and your algorithm becomes fast enough for streaming pipelines. Don't cargo-cult the modulo — understand the trade-off.
int overflow is fine for rolling hash if your base and window size don't cause power overflow that wraps to zero. If power wraps to zero, every subtraction zeroes out the out-going character — your hash becomes useless. Always compute power with a loop that multiplies before overflow, or use long for the power and cast down. Test with a length-10 window before trusting it.Double Hashing Is Not Enough — Watch for Base-Character Collisions
Everybody loves double hashing. Use two primes as bases, two modulos, compare both hashes — collisions become astronomically unlikely. But I've seen production systems burn for hours because of a specific degenerate case: when your base integer shares factors with the character value range. If your base is 31 (common) and your characters include values divisible by 31 (e.g., ASCII characters like '\x1F' = 31), then every multiplied hash step introduces a zero factor. Windows that differ only in those positions collide with probability 1.
Here's the fix: pick a base that is coprime to every possible character value. For ASCII (0-127), any base that is not a divisor of 128 works — use 131 or 137. For Unicode, where character values can be any 21-bit number, you need a base that is prime and larger than the maximum character code. 911 is a solid choice. Or normalize your characters first: subtract the smallest character value so the range starts at 1, not 0.
The WHY: This isn't a theoretical edge case. I've debugged a plagiarism detector that missed copy-pasted code because the substring "\x1F\x1F\x1F" (three unit-separator characters) collided with "\x00" repeated three times. The base-31 hash made them identical. Double hashing didn't save us because both bases were 31 and 37 — 37 is also coprime with 31 but the same character factor problem applies. Always check your character encoding before trusting a rolling hash.
Why Your Modulo Choice Is the Most Dangerous Line in the File
You picked 1e9+7 because every tutorial uses it. Stop. That modulus kills cache locality and makes your hash twice as slow on modern hardware. The real cost isn't the modulo operation itself — it's the data dependency. Your CPU can't pipeline the next hash computation because it's waiting for the modulo result from the previous one. Integer division is 20-80 cycles. On x86-64, that's 60-200 nanoseconds per hash update. Multiply by a million characters and your O(n) algorithm becomes O(n * pain).
The fix is surprisingly simple: use a power-of-two modulus like 2^31-1 (Mersenne prime) or just 2^32 with natural overflow. Java's long overflow wraps cleanly. You lose mathematical purity but gain real-world speed. Rabin-Karp doesn't need cryptographic strength — it needs to not catch fire on a 10MB file. Profile first, then decide if collision probability is actually your bottleneck.
Master DSA Topics - May 2026: Rolling Hash Meets Suffix Trees
By May 2026, every FAANG interviewer will expect you to know that rolling hash is the engine behind Ukkonen's suffix tree construction — but faster and with less memory. Stop treating rolling hash as a toy for substring search. It's the core of online suffix array building and LCP (Longest Common Prefix) queries. The trick: a rolling hash over a suffix gives you O(1) LCP between any two suffixes by binary searching on hash equality.
This is how you solve "longest repeated substring" in O(n log n) without building a suffix automaton. Use a prefix hash array, then for each length, binary search to check if any two substrings match via hash collision. Double hashing protects against false positives. The production cost is one extra array — trivial. In 2026, the senior candidates will write this from memory. The juniors will reach for a trie. Be the senior.
False Positive in Plagiarism Detector Caused by Hash Collision
- Always verify string equality after a hash match.
- Double hashing is cheap insurance against production collisions.
- When false positives have legal or financial consequences, verification is non-negotiable.
python -c "print(get_hash(l1,r1) == get_hash(l2,r2))" # verify hash equalityCalculate expected collisions: n^2/(2*M) where n is number of windows.Key takeaways
Common mistakes to avoid
3 patternsUsing a small modulus like 10^6+7
Forgetting to compute powers array
Neglecting string confirmation after hash match
Practice These on LeetCode
Interview Questions on This Topic
How does a rolling hash enable O(1) window updates?
Frequently Asked Questions
20+ years shipping performance-critical code where algorithms decide the bill. Notes here come from systems that actually shipped.
That's Strings. Mark it forged?
6 min read · try the examples if you haven't