Boyer-Moore String Search — When Bad Character Alone Fails
A 20-char pattern took 10s on repetitive logs.
20+ years shipping performance-critical code where algorithms decide the bill. Drawn from code that ran under real load.
- Boyer-Moore compares pattern right-to-left and skips characters using two heuristics
- Bad character rule shifts pattern to align mismatched char's rightmost occurrence
- Good suffix rule shifts to align matched suffix's previous occurrence
- Performance: O(n/m) in practice for long patterns – examines ~1 in m characters
- Production insight: Implementation errors (missing good suffix) can cause O(nm) worst case on repetitive text
- Biggest mistake: Assuming bad character alone guarantees sub-linear performance (Horspool is simpler but not guaranteed O(n))
Most string search algorithms compare left to right. Boyer-Moore does something counterintuitive — it compares the pattern right to left. When it finds a mismatch, it uses two powerful rules to skip ahead by many characters at once, sometimes jumping the entire pattern length. This is why Boyer-Moore gets faster as the pattern gets longer — the opposite of what you'd expect.
Open a terminal and run: grep -c 'the' /usr/share/dict/words. Grep finds all matches faster than you can think about it. The reason: Boyer-Moore (specifically Boyer-Moore-Horspool). The longer your search pattern, the faster it runs — counterintuitively, Boyer-Moore gets faster as patterns get longer because it skips more characters per comparison.
This sub-linear behaviour is the key insight. KMP guarantees O(n+m) worst case. Boyer-Moore achieves O(n/m) in practice on natural language text — for a 10-character pattern, it examines roughly 1 in 10 characters in the text. This is why Boyer-Moore is the default in text editors, grep, and most real-world text search implementations despite KMP being theoretically simpler.
Why Boyer-Moore Skips Characters Most Algorithms Must Check
Boyer-Moore string search is a sublinear pattern-matching algorithm that skips over characters by using two heuristics: the bad-character rule and the good-suffix rule. Unlike naive or KMP scanning left-to-right, Boyer-Moore aligns the pattern with the text and compares from right to left. When a mismatch occurs, it shifts the pattern by the maximum of the two heuristic shifts, often skipping many characters at once. In practice, for patterns of length m and text of length n, it can achieve O(n/m) comparisons on average, making it the fastest general-purpose string search for large alphabets.
The core insight: by preprocessing the pattern into two tables (bad-character shift and good-suffix shift), the algorithm knows exactly how far to slide the pattern after a mismatch. The bad-character rule uses the last occurrence of the mismatched character in the pattern; the good-suffix rule uses the matched suffix to find a safe alignment. When both heuristics are used together, Boyer-Moore guarantees linear worst-case time O(n + m) while achieving sublinear average-case performance. This is critical for real-time search in large documents, log files, or network packet inspection.
Use Boyer-Moore when you need to search for a fixed pattern in a large body of text, especially with a large alphabet (e.g., ASCII, Unicode). It excels in text editors, intrusion detection systems (Snort, Suricata), and DNA sequence matching. However, for very short patterns or small alphabets (e.g., binary data), the overhead of preprocessing may outweigh the benefits — in those cases, a simpler algorithm like KMP or even naive search may be faster.
Right-to-Left Matching
Boyer-Moore aligns the pattern at the left of the text and compares right-to-left — from the last character of the pattern backward. When it finds a mismatch, it applies whichever of the two heuristics gives the larger skip.
This right-to-left comparison is key: when we find a mismatch at pattern position j (text position i+j), we know nothing about text[i+j+1..i+m-1] — but we know the characters to the right are already matched. This asymmetry is what makes skipping possible.
The Bad Character Heuristic
When a mismatch occurs at text position i+j (pattern position j), look at the mismatched text character c = text[i+j]. The bad character rule says: shift the pattern right so that the rightmost occurrence of c in the pattern aligns with position i+j.
If c doesn't appear in the pattern at all, shift the entire pattern past position i+j — a shift of j+1. If c appears in the pattern to the left of j, shift to align the rightmost such occurrence. If c only appears to the right of j, the bad character rule gives a negative shift — ignore it (use 0 or the good suffix rule).
Boyer-Moore with Bad Character Only (Simplified)
A simplified version using only the bad character heuristic is called Boyer-Moore-Horspool and is often used in practice for its simplicity. It's the default in many text editors and grep implementations because it's fast enough for general text. The trade-off is worst-case O(nm) on repetitive inputs.
The Good Suffix Heuristic
When a mismatch occurs after matching t characters (the 'good suffix'), the good suffix rule shifts the pattern to align the next occurrence of that suffix. This is more complex to implement but provides better worst-case guarantees.
- Best case: O(n/m) — each character in text examined at most once every m steps
- Worst case: O(n) when good suffix is included, O(nm) without
- Typical English text: Much faster than KMP in practice
For most interview purposes, the bad-character-only version (Horspool) is sufficient to explain and implement. But in production, if you can't guarantee input randomness, implement the good suffix rule.
Worked Example: Tracing Shifts on a Short Text
Text: AAABBBCCCABC Pattern: ABC
Step 1: Align at position 0. Compare right-to-left: text[2]='A' vs pattern[2]='C' → mismatch. Bad char 'A' is at index 0 in pattern. Shift = max(1, 2-0) = 2. Move to s=2.
Step 2: s=2. Compare right-to-left: text[4]='B' vs pattern[2]='C' → mismatch. Bad char 'B' at index 1. Shift = max(1, 2-1) = 1. Move to s=3.
Step 3: s=3. Compare: text[5]='B' vs 'C' → mismatch. Bad char 'B' at 1. Shift=1. s=4.
... continues until s=9 where ABC matches. Total: only examined ~8 positions instead of 30 naive comparisons.
This demonstrates the sub-linear behaviour: for a 3-character pattern, we skip 2 out of every 3 characters on average.
Case 1 – Mismatch Becomes Match: The Bad Character Rescue
When you hit a mismatch scanning right-to-left, the bad character heuristic doesn't just shrug. It asks: can this mismatched character help me align the pattern faster?
If the mismatched character exists somewhere else in the pattern, you shift until that occurrence aligns with the mismatched position. This turns a failure into a shorter jump — but it's still a skip that naive search would kill for.
Think of it like this: you're looking for "NEEDLE" in "NEEDED". Your right-to-left scan trips on 'D' vs 'E'. The 'D' appears at pattern index 3. Shift so that this 'D' lines up with the text's 'D'. You just skipped comparing characters that would have matched — and you didn't waste cycles on them.
The heuristic computes (last occurrence index of mismatched char) and (current mismatch position). The shift amount is simply j - badCharTable[mismatchedChar]. That's it. No backtracking, no rechecking the same ground.
This case is where Boyer-Moore earns its reputation. Most algorithms would slide by 1. You slide by the distance to the last match — often 3, 4, or more characters. Micro-optimizations add up when your text is 10 MB of logs.
j - (-1) shifts past it entirely. That's a huge skip — don't forget to cap the shift to avoid overflow.Case 2 – Pattern Moves Past the Mismatch: When No Match Exists
Now the ugly case. The mismatched character doesn't exist in the pattern at all. Naive algo slides by 1, but you're smarter.
Because that character is completely absent from the pattern, any alignment that puts it under the pattern is doomed. So you shift the entire pattern past that character — a full j + 1 positions. You just skipped an entire substring.
Example: pattern "CAT" in text "XCAT". Right-to-left scan at index 2: text 'X', expected 'T'. 'X' isn't in "CAT". Shift by 3. You skip comparing 'X' against 'C', 'A', T — because none of them match 'X' anyway.
This is the reason Boyer-Moore can achieve sublinear time on average. Case 2 triggers frequently in natural language, logs, or any text with varied character sets. Every time it fires, you jump over characters that naive search would have checked one by one.
Remember: the bad character table returns -1 for missing chars. So j - (-1) = j + 1. That's the formula. Memorize it. Implement it. Watch your searches get fast.
shift = max(1, j - badChar[textChar]). Max with 1 ensures you never shift backwards or stall.Implementation: Building the Full Boyer-Moore Engine
Theory's done. Time to wire it up. You need two preprocessing tables: bad character (256 slots, int array) and good suffix (pattern length).
Bad character is trivial — fill with -1, then store the last index of each char in the pattern. Good suffix is trickier: it tells you where to shift when a suffix of the pattern matches the text but the preceding char doesn't.
Don't overthink the good suffix. The naive implementation works: for each suffix length, compute the closest earlier occurrence of that suffix in the pattern (or the longest prefix that's also a suffix). Use that to shift.
The main loop scans right-to-left on each alignment. When mismatch occurs, compute shift candidates from both heuristics. Take the maximum. Shift. Repeat.
Edge cases matter: pattern longer than text? Bail early. Empty pattern? Return index 0. Unicode or multi-byte? Build your table with the correct range, or use a hash map.
Production code doesn't care about elegance. It cares about correctness and speed. Test against edge cases: all same chars, no match, match at start, match at end. Profile against naive search on your dataset. You'll see the difference.
Production Search Timeout Due to Boyer-Moore Worst Case
- Bad-character-only Boyer-Moore is not safe for all inputs — always evaluate worst-case pattern and text characteristics.
- If your data has repetition (logs, DNA, machine-generated text), you need the good suffix rule for guaranteed O(n) performance.
- Profile your search on production-like data before committing to a variant.
- Grep uses a tuneable variant that falls back to other algorithms when Boyer-Moore's skip performance degrades.
print(f"Start shift: {s}, comparing text[{s+j}]='{text[s+j]}' vs pattern[{j}]='{pattern[j]}'")Test with a simple pattern and text where you know the expected positions.Key takeaways
Common mistakes to avoid
5 patternsStoring first occurrence instead of rightmost in bad character table
Not handling negative shift from bad character rule
Omitting the good suffix rule for repetitive data
Incorrect shift after a full match
Using a fixed shift table for bad characters without considering alphabet size
Practice These on LeetCode
Interview Questions on This Topic
Why does Boyer-Moore compare right-to-left instead of left-to-right?
Frequently Asked Questions
20+ years shipping performance-critical code where algorithms decide the bill. Drawn from code that ran under real load.
That's Strings. Mark it forged?
6 min read · try the examples if you haven't