Boyer-Moore String Search — When Bad Character Alone Fails
A 20-char pattern took 10s on repetitive logs.
- Boyer-Moore compares pattern right-to-left and skips characters using two heuristics
- Bad character rule shifts pattern to align mismatched char's rightmost occurrence
- Good suffix rule shifts to align matched suffix's previous occurrence
- Performance: O(n/m) in practice for long patterns – examines ~1 in m characters
- Production insight: Implementation errors (missing good suffix) can cause O(nm) worst case on repetitive text
- Biggest mistake: Assuming bad character alone guarantees sub-linear performance (Horspool is simpler but not guaranteed O(n))
Most string search algorithms compare left to right. Boyer-Moore does something counterintuitive — it compares the pattern right to left. When it finds a mismatch, it uses two powerful rules to skip ahead by many characters at once, sometimes jumping the entire pattern length. This is why Boyer-Moore gets faster as the pattern gets longer — the opposite of what you'd expect.
Open a terminal and run: grep -c 'the' /usr/share/dict/words. Grep finds all matches faster than you can think about it. The reason: Boyer-Moore (specifically Boyer-Moore-Horspool). The longer your search pattern, the faster it runs — counterintuitively, Boyer-Moore gets faster as patterns get longer because it skips more characters per comparison.
This sub-linear behaviour is the key insight. KMP guarantees O(n+m) worst case. Boyer-Moore achieves O(n/m) in practice on natural language text — for a 10-character pattern, it examines roughly 1 in 10 characters in the text. This is why Boyer-Moore is the default in text editors, grep, and most real-world text search implementations despite KMP being theoretically simpler.
Right-to-Left Matching
Boyer-Moore aligns the pattern at the left of the text and compares right-to-left — from the last character of the pattern backward. When it finds a mismatch, it applies whichever of the two heuristics gives the larger skip.
This right-to-left comparison is key: when we find a mismatch at pattern position j (text position i+j), we know nothing about text[i+j+1..i+m-1] — but we know the characters to the right are already matched. This asymmetry is what makes skipping possible.
The Bad Character Heuristic
When a mismatch occurs at text position i+j (pattern position j), look at the mismatched text character c = text[i+j]. The bad character rule says: shift the pattern right so that the rightmost occurrence of c in the pattern aligns with position i+j.
If c doesn't appear in the pattern at all, shift the entire pattern past position i+j — a shift of j+1. If c appears in the pattern to the left of j, shift to align the rightmost such occurrence. If c only appears to the right of j, the bad character rule gives a negative shift — ignore it (use 0 or the good suffix rule).
Boyer-Moore with Bad Character Only (Simplified)
A simplified version using only the bad character heuristic is called Boyer-Moore-Horspool and is often used in practice for its simplicity. It's the default in many text editors and grep implementations because it's fast enough for general text. The trade-off is worst-case O(nm) on repetitive inputs.
The Good Suffix Heuristic
When a mismatch occurs after matching t characters (the 'good suffix'), the good suffix rule shifts the pattern to align the next occurrence of that suffix. This is more complex to implement but provides better worst-case guarantees.
- Best case: O(n/m) — each character in text examined at most once every m steps
- Worst case: O(n) when good suffix is included, O(nm) without
- Typical English text: Much faster than KMP in practice
For most interview purposes, the bad-character-only version (Horspool) is sufficient to explain and implement. But in production, if you can't guarantee input randomness, implement the good suffix rule.
Worked Example: Tracing Shifts on a Short Text
Text: AAABBBCCCABC Pattern: ABC
Step 1: Align at position 0. Compare right-to-left: text[2]='A' vs pattern[2]='C' → mismatch. Bad char 'A' is at index 0 in pattern. Shift = max(1, 2-0) = 2. Move to s=2.
Step 2: s=2. Compare right-to-left: text[4]='B' vs pattern[2]='C' → mismatch. Bad char 'B' at index 1. Shift = max(1, 2-1) = 1. Move to s=3.
Step 3: s=3. Compare: text[5]='B' vs 'C' → mismatch. Bad char 'B' at 1. Shift=1. s=4.
... continues until s=9 where ABC matches. Total: only examined ~8 positions instead of 30 naive comparisons.
This demonstrates the sub-linear behaviour: for a 3-character pattern, we skip 2 out of every 3 characters on average.
Production Search Timeout Due to Boyer-Moore Worst Case
- Bad-character-only Boyer-Moore is not safe for all inputs — always evaluate worst-case pattern and text characteristics.
- If your data has repetition (logs, DNA, machine-generated text), you need the good suffix rule for guaranteed O(n) performance.
- Profile your search on production-like data before committing to a variant.
- Grep uses a tuneable variant that falls back to other algorithms when Boyer-Moore's skip performance degrades.
Key takeaways
Common mistakes to avoid
5 patternsStoring first occurrence instead of rightmost in bad character table
Not handling negative shift from bad character rule
Omitting the good suffix rule for repetitive data
Incorrect shift after a full match
Using a fixed shift table for bad characters without considering alphabet size
Interview Questions on This Topic
Why does Boyer-Moore compare right-to-left instead of left-to-right?
Frequently Asked Questions
That's Strings. Mark it forged?
3 min read · try the examples if you haven't