Homeβ€Ί DSAβ€Ί Rolling Hash and Polynomial String Hashing

Rolling Hash and Polynomial String Hashing

Where developers are forged. Β· Structured learning Β· Free forever.
πŸ“ Part of: Strings β†’ Topic 8 of 8
Master rolling hash (Rabin-Karp fingerprinting) and polynomial string hashing.
βš™οΈ Intermediate β€” basic DSA knowledge assumed
In this tutorial, you'll learn:
  • Polynomial hash enables O(m) hashing; prefix hash array enables O(1) substring hash retrieval.
  • Rolling hash removes leftmost and adds rightmost character contribution in O(1).
  • Rabin-Karp: compare hash first, then verify string only on hash match β€” O(n+m) average.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚑ Quick Answer
Hashing a string normally takes O(m) time β€” you must read every character. A rolling hash is smarter: when your window slides by one character, instead of rehashing from scratch, you remove the leftmost character's contribution and add the new rightmost character's contribution in O(1). Like a cash register that doesn't recount all bills when you remove one β€” it just subtracts its value.

In 2011, several web frameworks were hit by hash collision DoS attacks. Attackers crafted HTTP POST requests where all parameter names hashed to the same bucket. What should be O(n) request parsing became O(n^2) β€” a single crafted 100KB request could saturate a server's CPU for seconds. Python, PHP, Ruby, and Java were all affected at the same time.

The fix in Python 3.3 was hash randomisation β€” choosing a random hash function from a family at startup. Rolling hash and universal hashing are closely related: both are about selecting hash functions with predictable, controllable collision probability. Understanding rolling hash means understanding both the attack vector and the defence.

Polynomial Hash

The polynomial hash of a string S = s[0]s[1]...s[m-1] with base B and modulus M is:

hash(S) = (s[0]Β·B^(m-1) + s[1]Β·B^(m-2) + ... + s[m-1]Β·B^0) mod M

Common values: B = 31 or 131, M = 10^9+7 or 10^9+9 (large primes).

poly_hash.py Β· PYTHON
123456789101112
MOD = 10**9 + 7
BASE = 131

def poly_hash(s: str) -> int:
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

print(poly_hash('hello'))   # some large int
print(poly_hash('hello'))   # same β€” deterministic
print(poly_hash('world'))   # different
β–Ά Output
566852114
566852114
791390312

Prefix Hash Array β€” O(1) Substring Hash

Precompute prefix hashes so any substring hash can be retrieved in O(1).

prefix_hash.py Β· PYTHON
1234567891011121314
def build_prefix_hash(s: str, base=131, mod=10**9+7):
    n = len(s)
    h = [0] * (n + 1)
    pw = [1] * (n + 1)  # pw[i] = base^i
    for i in range(n):
        h[i+1] = (h[i] * base + ord(s[i])) % mod
        pw[i+1] = pw[i] * base % mod
    def get_hash(l, r):  # hash of s[l..r] inclusive
        return (h[r+1] - h[l] * pw[r-l+1]) % mod
    return get_hash

get_hash = build_prefix_hash('abracadabra')
print(get_hash(0, 3))   # hash of 'abra'
print(get_hash(7, 10))  # hash of 'abra' β€” same!
β–Ά Output
6874538
6874538

Rabin-Karp Pattern Matching

Use rolling hash to compare pattern hash against each window hash in O(1). Full match check only on hash match β€” reducing comparisons dramatically.

rabin_karp.py Β· PYTHON
1234567891011121314
def rabin_karp(text: str, pattern: str, base=131, mod=10**9+7) -> list[int]:
    n, m = len(text), len(pattern)
    if m > n:
        return []
    ph = poly_hash(pattern)
    get_hash = build_prefix_hash(text, base, mod)
    matches = []
    for i in range(n - m + 1):
        if get_hash(i, i+m-1) == ph:
            if text[i:i+m] == pattern:  # confirm to avoid hash collision
                matches.append(i)
    return matches

print(rabin_karp('AABAACAADAABAABA', 'AABA'))  # [0, 9, 12]
β–Ά Output
[0, 9, 12]

Double Hashing β€” Avoiding Collisions

A single hash has collision probability ~1/M. With M=10^9+7 and n=10^6 windows, expected collisions β‰ˆ 10^6/10^9 β‰ˆ 0.001 β€” very low but not zero. For safety in competitive programming, use double hashing: two independent hash functions. The probability of simultaneous collision is ~1/MΒ² β‰ˆ 10^-18.

double_hash.py Β· PYTHON
12345678910
def double_hash(s: str):
    get1 = build_prefix_hash(s, base=131, mod=10**9+7)
    get2 = build_prefix_hash(s, base=137, mod=10**9+9)
    def get(l, r):
        return (get1(l, r), get2(l, r))
    return get

gh = double_hash('abracadabra')
print(gh(0, 3))   # (hash1_abra, hash2_abra)
print(gh(7, 10))  # same tuple β€” guaranteed match
β–Ά Output
(6874538, 7692145)
(6874538, 7692145)

Applications

Rolling hash appears in many problems:

Longest duplicate substring: Binary search on length + hash set to check for duplicates β€” O(n log n).

Longest common substring of two strings: Concatenate and use hash comparisons β€” O(n log n).

Plagiarism detection: Rabin-Karp with multiple patterns simultaneously.

DNA sequence analysis: Sliding window over long genomic strings.

String equality in O(1): After O(n) preprocessing, compare any two substrings in O(1).

🎯 Key Takeaways

  • Polynomial hash enables O(m) hashing; prefix hash array enables O(1) substring hash retrieval.
  • Rolling hash removes leftmost and adds rightmost character contribution in O(1).
  • Rabin-Karp: compare hash first, then verify string only on hash match β€” O(n+m) average.
  • Use double hashing (two independent hash functions) to reduce collision probability to near-zero.
  • Rolling hash is the foundation for Rabin-Karp, longest duplicate substring, and many competitive programming string problems.

Interview Questions on This Topic

  • QHow does a rolling hash enable O(1) window updates?
  • QWhat is the probability of a hash collision with a single polynomial hash?
  • QHow would you find the longest duplicate substring using rolling hash and binary search?
  • QWhy is a confirmation check needed after a hash match in Rabin-Karp?

Frequently Asked Questions

What base and modulus should I use for polynomial hashing?

Base 31 or 131 work well for lowercase ASCII. Use a large prime modulus like 10^9+7. For safety, use two (base, mod) pairs and compare both hashes β€” the probability of collision drops to near zero.

πŸ”₯
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousSuffix Array β€” Construction and Applications
Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged