Rolling Hash and Polynomial String Hashing
- Polynomial hash enables O(m) hashing; prefix hash array enables O(1) substring hash retrieval.
- Rolling hash removes leftmost and adds rightmost character contribution in O(1).
- Rabin-Karp: compare hash first, then verify string only on hash match β O(n+m) average.
In 2011, several web frameworks were hit by hash collision DoS attacks. Attackers crafted HTTP POST requests where all parameter names hashed to the same bucket. What should be O(n) request parsing became O(n^2) β a single crafted 100KB request could saturate a server's CPU for seconds. Python, PHP, Ruby, and Java were all affected at the same time.
The fix in Python 3.3 was hash randomisation β choosing a random hash function from a family at startup. Rolling hash and universal hashing are closely related: both are about selecting hash functions with predictable, controllable collision probability. Understanding rolling hash means understanding both the attack vector and the defence.
Polynomial Hash
The polynomial hash of a string S = s[0]s[1]...s[m-1] with base B and modulus M is:
hash(S) = (s[0]Β·B^(m-1) + s[1]Β·B^(m-2) + ... + s[m-1]Β·B^0) mod M
Common values: B = 31 or 131, M = 10^9+7 or 10^9+9 (large primes).
MOD = 10**9 + 7 BASE = 131 def poly_hash(s: str) -> int: h = 0 for ch in s: h = (h * BASE + ord(ch)) % MOD return h print(poly_hash('hello')) # some large int print(poly_hash('hello')) # same β deterministic print(poly_hash('world')) # different
566852114
791390312
Prefix Hash Array β O(1) Substring Hash
Precompute prefix hashes so any substring hash can be retrieved in O(1).
def build_prefix_hash(s: str, base=131, mod=10**9+7): n = len(s) h = [0] * (n + 1) pw = [1] * (n + 1) # pw[i] = base^i for i in range(n): h[i+1] = (h[i] * base + ord(s[i])) % mod pw[i+1] = pw[i] * base % mod def get_hash(l, r): # hash of s[l..r] inclusive return (h[r+1] - h[l] * pw[r-l+1]) % mod return get_hash get_hash = build_prefix_hash('abracadabra') print(get_hash(0, 3)) # hash of 'abra' print(get_hash(7, 10)) # hash of 'abra' β same!
6874538
Rabin-Karp Pattern Matching
Use rolling hash to compare pattern hash against each window hash in O(1). Full match check only on hash match β reducing comparisons dramatically.
def rabin_karp(text: str, pattern: str, base=131, mod=10**9+7) -> list[int]: n, m = len(text), len(pattern) if m > n: return [] ph = poly_hash(pattern) get_hash = build_prefix_hash(text, base, mod) matches = [] for i in range(n - m + 1): if get_hash(i, i+m-1) == ph: if text[i:i+m] == pattern: # confirm to avoid hash collision matches.append(i) return matches print(rabin_karp('AABAACAADAABAABA', 'AABA')) # [0, 9, 12]
Double Hashing β Avoiding Collisions
A single hash has collision probability ~1/M. With M=10^9+7 and n=10^6 windows, expected collisions β 10^6/10^9 β 0.001 β very low but not zero. For safety in competitive programming, use double hashing: two independent hash functions. The probability of simultaneous collision is ~1/MΒ² β 10^-18.
def double_hash(s: str): get1 = build_prefix_hash(s, base=131, mod=10**9+7) get2 = build_prefix_hash(s, base=137, mod=10**9+9) def get(l, r): return (get1(l, r), get2(l, r)) return get gh = double_hash('abracadabra') print(gh(0, 3)) # (hash1_abra, hash2_abra) print(gh(7, 10)) # same tuple β guaranteed match
(6874538, 7692145)
Applications
Rolling hash appears in many problems:
Longest duplicate substring: Binary search on length + hash set to check for duplicates β O(n log n).
Longest common substring of two strings: Concatenate and use hash comparisons β O(n log n).
Plagiarism detection: Rabin-Karp with multiple patterns simultaneously.
DNA sequence analysis: Sliding window over long genomic strings.
String equality in O(1): After O(n) preprocessing, compare any two substrings in O(1).
π― Key Takeaways
- Polynomial hash enables O(m) hashing; prefix hash array enables O(1) substring hash retrieval.
- Rolling hash removes leftmost and adds rightmost character contribution in O(1).
- Rabin-Karp: compare hash first, then verify string only on hash match β O(n+m) average.
- Use double hashing (two independent hash functions) to reduce collision probability to near-zero.
- Rolling hash is the foundation for Rabin-Karp, longest duplicate substring, and many competitive programming string problems.
Interview Questions on This Topic
- QHow does a rolling hash enable O(1) window updates?
- QWhat is the probability of a hash collision with a single polynomial hash?
- QHow would you find the longest duplicate substring using rolling hash and binary search?
- QWhy is a confirmation check needed after a hash match in Rabin-Karp?
Frequently Asked Questions
What base and modulus should I use for polynomial hashing?
Base 31 or 131 work well for lowercase ASCII. Use a large prime modulus like 10^9+7. For safety, use two (base, mod) pairs and compare both hashes β the probability of collision drops to near zero.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.