Rabin-Karp: Modulo Choice Caused 18% False Positives
2% to 18% false positive surge from Rabin-Karp modulus 2^31-1 on binary data.
- Rabin-Karp uses a rolling hash to find a pattern in text — slides a window and updates hash in O(1) using arithmetic.
- The rolling hash update: new_hash = (base (old_hash - old_char base^(m-1)) + new_char) mod prime.
- Average case O(n+m); worst case O(n*m) happens when every window spuriously matches the pattern hash.
- Real power is multi-pattern search: store all pattern hashes in a set, then each window lookup is O(1).
- Production failure: choosing modulus 2^31-1 allowed 'abc' and 'ab\x00c' to collide — false plagiarism matches.
- Biggest mistake: re-computing hash from scratch for each window instead of using the rolling update.
Imagine you're looking for a specific page in a 1,000-page book by its fingerprint — instead of reading every word on every page, you stamp a unique number on each page and compare numbers first. Only when the numbers match do you actually read that page. Rabin-Karp does exactly that with text: it computes a numeric fingerprint (hash) for the pattern you're searching for, then slides a window across the text computing fingerprints on the fly. When fingerprints match, only then does it check the actual characters. It's the difference between comparing ID card photos one by one versus first sorting everyone by hair colour.
Your IDE finds every occurrence of a variable name. A plagiarism detector scans a 50,000-word essay. A database engine hunts for a substring. A string matching algorithm does the heavy lifting. Most developers reach for the naive O(n·m) approach without realising it silently tanks performance the moment pattern or text grows large. Rabin-Karp isn't just a clever trick — it's the algorithm underpinning multi-pattern search in tools like grep, antivirus scanners, and bioinformatics pipelines.
The core problem Rabin-Karp solves is the sliding-window comparison bottleneck. In a naive search, shifting the window one character to the right means re-comparing up to m characters all over again. Rabin-Karp replaces that with a single arithmetic operation — subtracting the outgoing character's contribution and adding the incoming one. This 'rolling hash' trick keeps the average case at O(n+m) while letting you search for multiple patterns simultaneously by checking each window's hash against a set of pattern hashes in O(1).
By the end of this article you'll understand exactly how the polynomial rolling hash is constructed, why each design choice matters, how to handle hash collisions correctly, and when Rabin-Karp beats KMP — and when it doesn't. You'll also walk away knowing the three gotchas that trip up even experienced engineers implementing this for the first time.
What is Rabin-Karp? — Plain English
Rabin-Karp is a string-matching algorithm that uses hashing to find a pattern within a text. Instead of comparing the pattern character-by-character against every text window, it first computes a hash of the pattern and a rolling hash of each text window of the same length. Only when hashes match does it do a character-by-character confirmation. This makes the average case O(n+m), and the rolling hash allows moving the window one character to the right in O(1) time.
The real power of Rabin-Karp is when you need to search for multiple patterns simultaneously: store all pattern hashes in a set, then each rolling window hash lookup is O(1). This is why plagiarism-detection tools and code duplication finders often use Rabin-Karp.
One nuance that trips people up: the algorithm is probabilistic in performance but deterministic in correctness. You always verify with a full comparison, so you never report a false match. The probability of collision affects only speed — more collisions mean more full comparisons.
How Rabin-Karp Works — Step by Step
- Choose a base (e.g., 31 or 256) and a large prime modulus to reduce collisions.
- Compute the hash of the pattern and the hash of the first window of text (length m).
- Precompute base^(m-1) mod prime — this factor is used when removing the leftmost character.
- For each window position i from 0 to n-m:
- a. If hash(text[i..i+m-1]) == hash(pattern):
- - Confirm with a character-by-character comparison (handles hash collisions).
- - If equal, record a match at index i.
- b. Slide the window right: remove text[i], add text[i+m].
- The rolling hash update:
- h = (base (h - text[i] base^(m-1)) + text[i+m]) mod prime
- Return all match positions.
Time complexity: O(n+m) average case. O(n*m) worst case (if every window causes a hash collision requiring full comparison — rare with a good hash function).
Important: The subtraction step can produce negative intermediate values. Always add modulus to keep the result in range: (h - text[i] h_pow) % prime in Python works fine because % handles negatives, but in other languages you need explicit (h - text[i] h_pow + prime) % prime.
- hash('abc') = (aB^2 + bB + c) mod M
- To slide right to 'bcd': subtract a*B^2, multiply by B, add d.
- Precomputing B^(m-1) makes the subtraction O(1).
- Modulo keeps numbers small but still O(1) per operation.
Worked Example — Tracing the Algorithm
Pattern: 'abc' (m=3). Text: 'xabcabc'. Base=256
% handles negative numbers correctly. In Java or C++, write (h - old_char * h_pow % mod + mod) % mod to avoid negative values. The multiplication can overflow 32-bit integers — use 64-bit (long in Java) or explicit mod after each multiplication.Implementation — Production-Ready Java
The following implementation demonstrates Rabin-Karp with proper modulo handling, multi-pattern support, and clear comments. It's designed for searching multiple patterns of the same length — the algorithm's sweet spot.
- Base 31 (not 256): prime base gives better distribution, avoids null-byte zero issues.
- Modulus 1,000,000
C++ Implementation — Production-Ready Code
The following C++ implementation mirrors the Java version but leverages C++ features like uint64_t for predictable overflow handling and templates for flexibility. The same rolling hash logic applies, but C++ requires explicit modulo after each multiplication to avoid overflow.
- Use
uint64_t(unsigned 64-bit) from<cstdint>for modulus arithmetic. - Precompute powers using
std::vector<uint64_t>. - Use
std::unordered_setfor pattern hash lookup. - Use `std::map<std::string
oldCharVal highPower can overflow even uint64_t if both are near 2^64. Always apply modulo after each multiplication: ((oldCharVal % MOD) (highPower % MOD)) % MOD. The code above assumes values are already < MOD due to modulo operations in hash() and precomputation, but defensively applying modulo after each operation is safer for production.uint64_t for hash state and precomputing powers once per pattern length. They also added a SIMD-accelerated string comparison for verification, reducing verification overhead by 40%.uint64_t for intermediate values and apply modulo after each multiplication.unordered_set for pattern hashes gives O(1) lookup per window.Advantages and Disadvantages of Rabin-Karp
Every algorithm has trade-offs. Rabin-Karp's strength lies in its simplicity and multi-pattern capability, but it's not a universal solution. The table below summarises the key advantages and disadvantages based on real production usage.
Advantages | Advantage | Explanation | |-----------|-------------| | Average O(n+m) time | With a good hash function, comparison is avoided for most windows, keeping runtime linear. | | Built for multi-pattern | Store all pattern hashes in a set and check each window in O(1) — trivial to implement. | | Single-pass over text | Text is scanned once even for multiple patterns (same length). Useful for streaming. | | Extends to 2D | 2D rolling hash works well for image matching and fingerprinting. | | Simple to implement | No complex failure function or automaton — just arithmetic and a hash set. |
Disadvantages | Disadvantage | Explanation | |--------------|-------------| | Worst-case O(n*m) | Poor hash selection or adversarial input can cause every window to collide, degrading to naive. | | Requires careful parameter tuning | Base, modulus, and character mapping must be chosen carefully to avoid collisions. | | Overhead for single pattern | KMP or Boyer-Moore have better constant factors for single-pattern search. | | Variable-length patterns awkward | Must run separate passes for each distinct pattern length. Aho-Corasick handles this natively. | | Integer overflow risks | In languages without automatic big integers, overflow in intermediate multiplication can silently produce wrong results. |
When the disadvantages bite: If you have many patterns of different lengths, or if your input contains adversarial data (e.g., random strings designed to collide), or if you need guaranteed worst-case performance, consider alternatives.
Visualizing the Rolling Hash – Window Slide Diagram
The core operation of Rabin-Karp is the sliding window hash update. Instead of recomputing the hash from scratch for each window, we remove the contribution of the leftmost character (which was multiplied by the highest power of the base) and add the new character on the right. This diagram shows a concrete example with pattern length 3, base 10 (decimal digits), and no modulo for simplicity.
Example: Text digit string "123456", pattern length 3. Hash of "123" = 110^2 + 210 + 3 = 123. Slide right: remove 1*10^2 = 100, multiply remainder (23) by 10 = 230, add 4 → 234. The diagram below tracks the state at each step.
The rolling hash update is the heart of the algorithm. Without it, each window would require O(m) computation, making the total O(n*m). With it, each window costs O(1) — just two multiplications, a subtraction, and an addition.
base^(m-1) mod modulus had not been precomputed correctly. Visualizing the sliding window like the diagram above helped them spot the bug: they were using base^m instead of base^(m-1) as the removal factor.Double Hashing: Reducing Collision Probability
Even with a large prime modulus, hash collisions are possible. For critical applications (plagiarism detection, DNA sequence matching, security-sensitive search), a single 64-bit hash may not provide enough collision resistance. Double hashing — computing two independent hashes for the same string — drastically reduces collision probability to near zero.
How double hashing works - Choose two different bases (e.g., 31 and 37) and two different prime moduli (e.g., 1,000,000,007 and 1,000,000,009). - For each window, compute both hashes using the same rolling update formula but separate state variables. - Only consider a match when both hashes agree with their respective pattern hash pairs.
Collision probability: With a single 64-bit hash mapped to [0, MOD), probability of collision among K windows is roughly K^2 / (2MOD). For MOD=1e9+7 and K=1e6, probability ~ 5e-4. With double hashing and two independent moduli each ~1e9, probability becomes negligible (~ (K^2 / (2MOD1)) (K^2 / (2MOD2)) ≈ 2.5e-13).
Performance cost: Double hashing doubles the arithmetic per window (two rolling updates) and doubles the memory for pattern hashes (store pairs). In practice, this adds about 2x computational overhead but eliminates false positives from hash collisions entirely.
- Plagiarism detection where a false positive is expensive (legal or academic consequences).
- Security applications where adversary may craft collision strings.
- Large-scale indexing with billions of windows.
- When you cannot afford to do full string comparison for every hash match (e.g., streaming data with no random access).
Implementation sketch: Store pattern hashes as a set of pair<long, long> or a hash set of a combined 128-bit value. For each window, compute both hashes and check if the pair is in the set.", "code": { "language": "java", "filename": "io/thecodeforge/strings/DoubleHashRabinKarp.java", "code": "package io.thecodeforge.strings;
import java.util.*;
/* Rabin-Karp with double hashing for reduced collision probability. Uses two independent bases and moduli. / public class DoubleHashRabinKarp { private static final long BASE1 = 31L; private static final long BASE2 = 37L; private static final long MOD1 = 1_000_000_007L; private static final long MOD2 = 1_000_000_009L;
private final int patternLength; private final long[] basePowers1, basePowers2; private final Set<Long> patternHashes; // Combined hash: (hash1 << 32) | hash2
public DoubleHashRabinKarp(String[] patterns) { if (patterns.length == 0) throw new IllegalArgumentException(); this.patternLength = patterns[0].length();
// Precompute powers basePowers1 = new long[patternLength + 1]; basePowers2 = new long[patternLength + 1]; basePowers1[0] = basePowers2[0] = 1L; for (int i = 1; i <= patternLength; i++) { basePowers1[i] = (basePowers1[i-1] BASE1) % MOD1; basePowers2[i] = (basePowers2[i-1] BASE2) % MOD2; }
// Hash all patterns this.patternHashes = new HashSet<>(patterns.length); for (String p : patterns) { long[] h = computeHash(p, 0, patternLength); patternHashes.add(combine(h[0], h[1])); } }
public List<Integer> search(String text) { List<Integer> matches = new ArrayList<>(); int n = text.length(); if (n < patternLength) return matches;
long[] windowHash = computeHash(text, 0, patternLength); long highPower1 = basePowers1[patternLength - 1]; long highPower2 = basePowers2[patternLength - 1];
for (int i = 0; i <= n - patternLength; i++) { if (patternHashes.contains(combine(windowHash[0], windowHash[1]))) { // Hash match — verify with substring comparison if (text.substring(i, i + patternLength).equals( // actual pattern lookup omitted for brevity )) { matches.add(i); } }
if (i < n - patternLength) { long oldChar = charValue(text.charAt(i)); long newChar = charValue(text.charAt(i + patternLength));
// Update hash1 long updated1 = (windowHash[0] - (oldChar highPower1) % MOD1 + MOD1) % MOD1; windowHash[0] = (updated1 BASE1 + newChar) % MOD1;
// Update hash2 long updated2 = (windowHash[1] - (oldChar highPower2) % MOD2 + MOD2) % MOD2; windowHash[1] = (updated2 BASE2 + newChar) % MOD2; } }
return matches; }
private long[] computeHash(String s, int start, int end) { long h1 = 0L, h2 = 0L; for (int i = start; i < end; i++) { long v = charValue(s.charAt(i)); h1 = (h1 BASE1 + v) % MOD1; h2 = (h2 BASE2 + v) % MOD2; } return new long[]{h1, h2}; }
private long charValue(char c) { return (c) + 1L; }
private long combine(long h1, long h2) { return (h1 << 32) | h2; } }
Practice Problems
Solidify your understanding of Rabin-Karp and rolling hashes by solving these problems. They range from direct applications to variations that require adapting the algorithm to different contexts.
- LeetCode 187. Repeated DNA Sequences
- Find all 10-character-long sequences that occur more than once in a DNA string. Use rolling hash to slide a window of length 10 across the string and track seen hashes.
- LeetCode 438. Find All Anagrams in a String
- Given a string s and a non-empty string p, find all start indices where p's anagrams occur in s. Rolling hash of length p allows O(n) sliding, but you must handle anagrams (same character multiset). Combine rolling hash with a character count array or use a hash of multiset.
- LeetCode 1044. Longest Duplicate Substring
- Given a string s, find the longest substring that occurs at least twice. Use binary search on length and Rabin-Karp with rolling hash to check if any substring of length L repeats. Classic application of rolling hash with binary search.
- LeetCode 686. Repeated String Match
- Given two strings a and b, find the minimum number of times to repeat a so that b becomes a substring. Can be solved with KMP or Rabin-Karp when the repeated string length exceeds a certain bound.
- LeetCode 28. Implement strStr()
- Implement string matching. While the naive O(n*m) passes, solving it with Rabin-Karp demonstrates your understanding of the rolling hash mechanism. LeetCode tests include edge cases with large alphabets.
- Codeforces 271D – Good Substrings
- Count number of distinct substrings of a given string that contain at most k 'bad' characters. Use rolling hash of all substrings and a set to count distinct ones. Requires careful parameter selection to avoid collisions.
- SPOJ NHAY – A Needle in a Haystack
- Classic problem: find all positions of a pattern in a text. Use Rabin-Karp to handle up to 10^6 characters efficiently.
- LeetCode 1923. Longest Common Subpath
- Given two arrays/list of integers, find the longest common subpath. Use rolling hash to compare subarrays. This is essentially a 2D version of the repeated substring problem.
Tips: All these problems can be solved with a well-tuned Rabin-Karp. Start with a single large prime modulus (1e9+7) and base 131. Use double hashing only if you encounter collisions (rare in competitive programming but common in production). Pay attention to integer overflow — use 64-bit integers and apply modulo after each operation.
The Modulo Choice That Broke a Plagiarism Detector
- Not all primes are equal for polynomial rolling hashes. 2^31-1 (Mersenne) has known collision weaknesses.
- Always verify hash matches with full string comparison before reporting a match — never trust hash alone.
- If you log hash collisions as 'potential matches', you will drown in false positives. Log only confirmed matches.
- For binary data (null bytes), ensure your base and modulus handle zero-character inputs correctly.
- Test your hash function on adversarial inputs — not just random text.
Common mistakes to avoid
5 patternsRecomputing base^(m-1) for every window instead of precomputing it once
Using a primitive modulus that causes frequent collisions (like 101 or 2^31-1)
Forgetting to map '\0' (null character) to a non-zero value in charValue()
Reporting matches at the hash stage without full string verification
Using int (32-bit) for hash values in Java, causing overflow during multiplication
(a % MOD * b % MOD) % MOD.Interview Questions on This Topic
What is a spurious hit in Rabin-Karp and how does the algorithm handle it?
That's Hashing. Mark it forged?
10 min read · try the examples if you haven't