
Rabin-Karp String Matching: Rolling Hashes, Spurious Hits & Real-World Performance

📍 Part of: Hashing → Topic 6 of 11
Rabin-Karp string matching explained deeply — rolling hash internals, collision handling, worst-case pitfalls, and a production-ready Python implementation with real output.
🔥 Advanced — solid DSA foundation required
In this tutorial, you'll learn
  • Rabin-Karp uses a rolling hash to check text windows in O(1) per slide after O(m) setup.
  • Average case O(n+m); worst case O(n*m) when many hash collisions force full comparisons.
  • Always confirm hash matches with a full string comparison to avoid false positives.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer

Imagine you're looking for a specific page in a 1,000-page book by its fingerprint — instead of reading every word on every page, you stamp a unique number on each page and compare numbers first. Only when the numbers match do you actually read that page. Rabin-Karp does exactly that with text: it computes a numeric fingerprint (hash) for the pattern you're searching for, then slides a window across the text computing fingerprints on the fly. When fingerprints match, only then does it check the actual characters. It's the difference between comparing ID card photos one by one versus first sorting everyone by hair colour.

Every time your IDE finds all occurrences of a variable name, every time a plagiarism detector scans a 50,000-word essay, every time a database engine hunts for a substring — a string matching algorithm is doing the heavy lifting. Most developers reach for the naive O(n·m) approach without realising it silently tanks performance the moment pattern or text grows large. Rabin-Karp isn't just a clever trick; it's the algorithm underpinning multi-pattern search in tools like grep, antivirus scanners, and bioinformatics pipelines.

The core problem Rabin-Karp solves is the sliding-window comparison bottleneck. In a naive search, shifting the window one character to the right means re-comparing up to m characters all over again. Rabin-Karp replaces that character-by-character re-comparison with a single arithmetic operation — subtracting the outgoing character's contribution and adding the incoming one. This 'rolling hash' trick keeps the average case firmly at O(n + m) while letting you search for multiple patterns simultaneously by checking each window's hash against a set of pattern hashes in O(1).
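That subtract-then-add update can be sketched in a few lines of Python; the base and modulus here (256 and 101) are just illustrative choices, matching the worked example later in the article:

```python
BASE, PRIME = 256, 101

def direct_hash(s):
    """Polynomial hash computed from scratch in O(len(s))."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % PRIME
    return h

text, m = "xabcabc", 3
high = pow(BASE, m - 1, PRIME)   # weight of the character leaving the window
h = direct_hash(text[:m])        # hash of the first window

for i in range(len(text) - m):
    # O(1) slide: remove text[i]'s contribution, shift, add text[i + m]
    h = ((h - ord(text[i]) * high) * BASE + ord(text[i + m])) % PRIME
    assert h == direct_hash(text[i + 1:i + 1 + m])  # agrees with a full recompute
print("rolling hash agrees with direct hash at every window")
```

The assertion is the whole point: each O(1) slide produces exactly the value a fresh O(m) recomputation would.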

By the end of this article you'll understand exactly how the polynomial rolling hash is constructed and why each design choice matters, how to handle hash collisions (spurious hits) correctly, how to implement a production-safe version in Python with meaningful variable names and real output, and when Rabin-Karp beats KMP or Boyer-Moore — and when it doesn't. You'll also walk away knowing the three gotchas that trip up even experienced engineers implementing this for the first time.

What is Rabin-Karp? — Plain English

Rabin-Karp is a string-matching algorithm that uses hashing to find a pattern within a text. Instead of comparing the pattern character-by-character against every text window, it first computes a hash of the pattern and a rolling hash of each text window of the same length. Only when hashes match does it do a character-by-character confirmation. This makes the average case O(n+m), and the rolling hash allows moving the window one character to the right in O(1) time.

The real power of Rabin-Karp is when you need to search for multiple patterns simultaneously: store all pattern hashes in a set, then each rolling window hash lookup is O(1). This is why plagiarism-detection tools and code duplication finders often use Rabin-Karp.
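A minimal multi-pattern sketch follows, assuming all patterns share one length (mixed lengths would need one pass, or one hash set, per distinct length); the `rabin_karp_multi` name and the large prime `10**9 + 7` are this sketch's own choices:

```python
def rabin_karp_multi(text, patterns, base=256, prime=10**9 + 7):
    """Report (index, pattern) hits for any of the given equal-length patterns."""
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns)
    n = len(text)
    if m > n:
        return []

    def poly_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % prime
        return h

    # Map each hash to the patterns behind it (distinct patterns can collide)
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(poly_hash(p), []).append(p)

    high = pow(base, m - 1, prime)  # weight of the outgoing character
    window = poly_hash(text[:m])
    hits = []
    for i in range(n - m + 1):
        for p in by_hash.get(window, ()):
            if text[i:i + m] == p:  # confirm to rule out spurious hits
                hits.append((i, p))
        if i < n - m:
            window = ((window - ord(text[i]) * high) * base + ord(text[i + m])) % prime
    return hits

print(rabin_karp_multi("the cat sat on the mat", ["cat", "mat", "rat"]))
```

Note that the per-window cost stays O(1) on average no matter how many patterns are in the set — that's the property a per-pattern algorithm like KMP can't match.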

How Rabin-Karp Works — Step by Step

  1. Choose a base (e.g., 31 or 256) and a large prime modulus to reduce collisions.
  2. Compute the hash of the pattern and the hash of the first window of text (length m).
  3. For each window position i from 0 to n-m:
     a. If hash(text[i..i+m-1]) == hash(pattern), confirm with a character-by-character comparison (this handles hash collisions); if equal, record a match at index i.
     b. Slide the window right: drop text[i], add text[i+m], using the rolling hash update h = ((h - text[i]·base^(m-1))·base + text[i+m]) mod prime.
  4. Return all match positions.

Time complexity: O(n+m) average case. O(n*m) worst case (if every window causes a hash collision requiring full comparison — rare with a good hash function).

Worked Example — Tracing the Algorithm

Pattern: 'abc' (m=3). Text: 'xabcabc'. Base=256, Prime=101.

Hash of 'abc': h = (ord('a')·256² + ord('b')·256 + ord('c')) % 101 = (97·65536 + 98·256 + 99) % 101 = (6356992 + 25088 + 99) % 101 = 6382179 % 101 = 90

Window 0: 'xab'. hash = 39 ≠ 90. Slide. Window 1: 'abc'. hash = 90. Confirm 'abc' == 'abc'. MATCH at index 1. Window 2: 'bca'. hash = 28 ≠ 90. Slide. Window 3: 'cab'. hash = 9 ≠ 90. Slide. Window 4: 'abc'. hash = 90. Confirm 'abc' == 'abc'. MATCH at index 4.

Matches found at indices 1 and 4.
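The trace can be re-derived mechanically; a few lines of Python with the same base and modulus print each window's hash and whether it triggers a confirmation:

```python
def poly_hash(s, base=256, prime=101):
    """Polynomial hash using the worked example's base and modulus."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % prime
    return h

text, pattern = "xabcabc", "abc"
target = poly_hash(pattern)
for i in range(len(text) - len(pattern) + 1):
    window = text[i:i + len(pattern)]
    hit = "MATCH" if poly_hash(window) == target else ""
    print(i, window, poly_hash(window), hit)
```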

Implementation

The following implementation demonstrates Rabin-Karp with clear comments.

rabin_karp.py · PYTHON
def rabin_karp(text, pattern, base=256, prime=101):
    n, m = len(text), len(pattern)
    if m > n: return []

    # Precompute base^(m-1) mod prime for rolling hash removal
    h = pow(base, m - 1, prime)

    pattern_hash = 0
    window_hash  = 0
    for i in range(m):
        pattern_hash = (base * pattern_hash + ord(pattern[i])) % prime
        window_hash  = (base * window_hash  + ord(text[i]))    % prime

    matches = []
    for i in range(n - m + 1):
        if window_hash == pattern_hash:
            if text[i:i+m] == pattern:  # confirm to handle collisions
                matches.append(i)
        if i < n - m:
            # Roll the window: remove text[i], add text[i+m]
            window_hash = (base * (window_hash - ord(text[i]) * h) + ord(text[i+m])) % prime
            # Python's % always yields a non-negative result, so no sign fix-up is needed
    return matches

print(rabin_karp('xabcabc', 'abc'))
▶ Output
[1, 4]

🎯 Key Takeaways

  • Rabin-Karp uses a rolling hash to check text windows in O(1) per slide after O(m) setup.
  • Average case O(n+m); worst case O(n*m) when many hash collisions force full comparisons.
  • Always confirm hash matches with a full string comparison to avoid false positives.
  • Best suited for multi-pattern search — store pattern hashes in a set for O(1) lookup.
  • The rolling hash update is: h = (base·(h − old_char·base^(m−1)) + new_char) mod prime.

⚠ Common Mistakes to Avoid

    Skipping the character-by-character confirmation after a hash match, which turns spurious hits into false positives
    Forgetting to apply the modulus at every step, so intermediate values overflow in fixed-width-integer languages
    Choosing a tiny modulus, which multiplies collisions and drags performance toward the O(n·m) worst case

Interview Questions on This Topic

  • Q: What is a spurious hit in Rabin-Karp and how does the algorithm handle it?
  • Q: What is the time complexity of Rabin-Karp and what causes the worst case?
  • Q: How would you use Rabin-Karp to search for any of 100 different patterns in a text?

Frequently Asked Questions

What is a hash collision in Rabin-Karp?

A hash collision is when two different strings have the same hash value. Rabin-Karp checks hashes first (fast) and only does a full character comparison when hashes match. If a collision occurs, the full comparison will detect the mismatch and not report a false positive. The algorithm is always correct — collisions only add extra work, not wrong answers.
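For a concrete spurious hit, one hand-picked pair for base 256 and the small modulus 101 is 'abc' and 'bbp': different strings, identical hash. The confirmation step is exactly what rejects it:

```python
def poly_hash(s, base=256, prime=101):
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % prime
    return h

# Different strings, same hash under this small modulus: a spurious hit.
print(poly_hash("abc"), poly_hash("bbp"), poly_hash("abc") == poly_hash("bbp"))
# The character-by-character confirmation compares the actual strings and rejects it.
print("abc" == "bbp")
```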

When is Rabin-Karp better than KMP?

Rabin-Karp excels when searching for multiple patterns simultaneously — hash all patterns into a set, then each window comparison is O(1) lookup. KMP must be run separately for each pattern. For a single pattern, KMP or Z-algorithm are generally preferred because Rabin-Karp's hash collision overhead can be unpredictable.

Why use a prime modulus?

A prime modulus distributes hash values more uniformly across the range, reducing the probability of spurious hash collisions. Any large prime works — 101, 1000000007, etc. Choosing a prime that is larger than the alphabet size (256 for ASCII) helps avoid the most common collision patterns.
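A pigeonhole check makes the modulus-size trade-off visible: with modulus 101 there are only 101 possible hash values, so the 676 two-letter lowercase strings are forced to collide, while a large prime such as 10**9 + 7 keeps them all distinct (the 676-string universe is just an illustration):

```python
from itertools import product
from string import ascii_lowercase

def poly_hash(s, base=256, prime=101):
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % prime
    return h

pairs = ["".join(p) for p in product(ascii_lowercase, repeat=2)]  # 676 strings
small = {poly_hash(s) for s in pairs}                  # modulus 101
large = {poly_hash(s, prime=10**9 + 7) for s in pairs}
print(len(pairs), len(small), len(large))  # distinct hashes under each modulus
```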

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
