Boyer-Moore String Search — When Bad Character Alone Fails
- Boyer-Moore compares right-to-left within the pattern — this is the key difference from KMP/Z. Right-to-left enables skipping forward in the text on mismatch.
- Bad character rule: align rightmost occurrence of mismatched text char with current pattern position. If char not in pattern, skip entire pattern length past current position.
- Boyer-Moore-Horspool (bad char only) is what most production implementations use — simpler than full Boyer-Moore, nearly as fast for typical text.
- Boyer-Moore compares pattern right-to-left and skips characters using two heuristics
- Bad character rule shifts pattern to align mismatched char's rightmost occurrence
- Good suffix rule shifts to align matched suffix's previous occurrence
- Performance: O(n/m) in practice for long patterns — examines ~1 in m characters
- Production insight: Implementation errors (missing good suffix) can cause O(nm) worst case on repetitive text
- Biggest mistake: Assuming bad character alone guarantees sub-linear performance (Horspool is simpler but not guaranteed O(n))
Boyer-Moore Implementation Quick Fixes
Pattern match missed at the beginning of text.
print(f"Start shift: {s}, comparing text[{s+j}]='{text[s+j]}' vs pattern[{j}]='{pattern[j]}'")Test with a simple pattern and text where you know the expected positions.Search runs but misses matches in the middle of text.
print(f"Bad char table for char '{c}': rightmost index {bad_char.get(c, -1)}")Verify that when a character is not found (-1), you shift by j - (-1) = j + 1 (skip entire pattern).Algorithm shifts more than expected (skips over a match).
print(f"Shift before apply: s={s}, n-m={n-m}")Add a guard: if s > n - m: breakProduction Incident
Production Debug GuideDebug why your string search is unexpectedly slow
Open a terminal and run: grep -c 'the' /usr/share/dict/words. Grep finds all matches faster than you can think about it. The reason: Boyer-Moore (specifically Boyer-Moore-Horspool). The longer your search pattern, the faster it runs — counterintuitively, Boyer-Moore gets faster as patterns get longer because it skips more characters per comparison.
This sub-linear behaviour is the key insight. KMP guarantees O(n+m) worst case. Boyer-Moore achieves O(n/m) in practice on natural language text — for a 10-character pattern, it examines roughly 1 in 10 characters in the text. This is why Boyer-Moore is the default in text editors, grep, and most real-world text search implementations despite KMP being theoretically simpler.
Right-to-Left Matching
Boyer-Moore aligns the pattern at the left of the text and compares right-to-left — from the last character of the pattern backward. When it finds a mismatch, it applies whichever of the two heuristics gives the larger skip.
This right-to-left comparison is key: when we find a mismatch at pattern position j (text position i+j), we know nothing about text[i+j+1..i+m-1] — but we know the characters to the right are already matched. This asymmetry is what makes skipping possible.
The Bad Character Heuristic
When a mismatch occurs at text position i+j (pattern position j), look at the mismatched text character c = text[i+j]. The bad character rule says: shift the pattern right so that the rightmost occurrence of c in the pattern aligns with position i+j.
If c doesn't appear in the pattern at all, shift the entire pattern past position i+j — a shift of j+1. If c appears in the pattern to the left of j, shift to align the rightmost such occurrence. If c only appears to the right of j, the bad character rule gives a negative shift — ignore it (use 0 or the good suffix rule).
def bad_char_table(pattern: str) -> dict[str, int]: """ Returns last occurrence index of each character in pattern. If char not in pattern, returns -1 (handled at call site). """ table = {} for i, ch in enumerate(pattern): table[ch] = i return table pattern = 'ABCBAB' table = bad_char_table(pattern) print(table) # {'A': 4, 'B': 5, 'C': 2}
Boyer-Moore with Bad Character Only (Simplified)
A simplified version using only the bad character heuristic is called Boyer-Moore-Horspool and is often used in practice for its simplicity. It's the default in many text editors and grep implementations because it's fast enough for general text. The trade-off is worst-case O(nm) on repetitive inputs.
def boyer_moore_horspool(text: str, pattern: str) -> list[int]: n, m = len(text), len(pattern) if m == 0 or m > n: return [] # Build bad character table bad_char = {ch: i for i, ch in enumerate(pattern)} matches = [] s = 0 # shift of the pattern w.r.t. text while s <= n - m: j = m - 1 # start from rightmost character # Move left while characters match while j >= 0 and pattern[j] == text[s + j]: j -= 1 if j < 0: matches.append(s) # full match found # Shift by 1 (or use good suffix for full Boyer-Moore) s += 1 else: # Bad character shift bc = bad_char.get(text[s + j], -1) s += max(1, j - bc) return matches print(boyer_moore_horspool('ABAAABCD', 'ABC')) # [4] print(boyer_moore_horspool('AABCAABXAAB', 'AABX')) # [4]
The Good Suffix Heuristic
When a mismatch occurs after matching t characters (the 'good suffix'), the good suffix rule shifts the pattern to align the next occurrence of that suffix. This is more complex to implement but provides better worst-case guarantees.
- Best case: O(n/m) — each character in text examined at most once every m steps
- Worst case: O(n) when good suffix is included, O(nm) without
- Typical English text: Much faster than KMP in practice
For most interview purposes, the bad-character-only version (Horspool) is sufficient to explain and implement. But in production, if you can't guarantee input randomness, implement the good suffix rule.
Worked Example: Tracing Shifts on a Short Text
Text: AAABBBCCCABC Pattern: ABC
Step 1: Align at position 0. Compare right-to-left: text[2]='A' vs pattern[2]='C' → mismatch. Bad char 'A' is at index 0 in pattern. Shift = max(1, 2-0) = 2. Move to s=2.
Step 2: s=2. Compare right-to-left: text[4]='B' vs pattern[2]='C' → mismatch. Bad char 'B' at index 1. Shift = max(1, 2-1) = 1. Move to s=3.
Step 3: s=3. Compare: text[5]='B' vs 'C' → mismatch. Bad char 'B' at 1. Shift=1. s=4.
... continues until s=9 where ABC matches. Total: only examined ~8 positions instead of 30 naive comparisons.
This demonstrates the sub-linear behaviour: for a 3-character pattern, we skip 2 out of every 3 characters on average.
| Property | Boyer-Moore-Horspool | Full Boyer-Moore | KMP |
|---|---|---|---|
| Preprocessing | O(m + σ) | O(m + σ) | O(m) |
| Search worst case | O(nm) | O(n) | O(n) |
| Search best case | O(n/m) | O(n/m) | O(n) |
| Implementation | Simple | Complex | Moderate |
| Best for | Long patterns, random text | General purpose, repetitive text | Short patterns, guaranteed worst-case |
🎯 Key Takeaways
- Boyer-Moore compares right-to-left within the pattern — this is the key difference from KMP/Z. Right-to-left enables skipping forward in the text on mismatch.
- Bad character rule: align rightmost occurrence of mismatched text char with current pattern position. If char not in pattern, skip entire pattern length past current position.
- Boyer-Moore-Horspool (bad char only) is what most production implementations use — simpler than full Boyer-Moore, nearly as fast for typical text.
- O(n/m) in practice: longer patterns = faster search. A 20-character pattern in English text typically examines fewer than n/10 characters.
- Choose KMP for: DNA sequences, binary data, repetitive patterns, guaranteed worst case. Choose Boyer-Moore for: English text, long patterns, when you can profile workload.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QWhy does Boyer-Moore compare right-to-left instead of left-to-right?Mid-levelReveal
- QExplain the bad character heuristic with an example.SeniorReveal
- QWhy is Boyer-Moore faster for longer patterns?Mid-levelReveal
- QWhat is the difference between Boyer-Moore and Boyer-Moore-Horspool?JuniorReveal
- QHow would you implement Boyer-Moore in a production system where you don't know the text distribution?SeniorReveal
Frequently Asked Questions
Which is faster in practice — KMP or Boyer-Moore?
Boyer-Moore is significantly faster in practice for long patterns in natural language text. KMP is preferred when worst-case guarantees are needed or patterns are short with lots of repetition.
Does Python's str.find() use Boyer-Moore?
CPython uses a two-way string matching algorithm combined with Boyer-Moore-Horspool heuristics for longer patterns. The exact implementation varies by Python version.
When should I implement good suffix instead of using Horspool?
Implement good suffix when your input includes any repetitive patterns – logs, DNA, machine-generated data, or when you cannot afford worst-case timeouts. If you're building a library used by unknown callers, implement the full algorithm.
Can Boyer-Moore be used for binary data?
Yes, but the alphabet is large (256 bytes), so the bad character table as an array of 256 ints is fine. However, binary data often has patterns and repetitions, so you need the good suffix rule to avoid O(nm) worst case. KMP is often preferred for binary matching due to simpler worst-case behaviour.
How does Boyer-Moore handle overlapping matches?
In the basic Horspool, after a match you shift by 1 to find overlapping matches. In full Boyer-Moore, the good suffix rule gives a larger shift. If you need overlapping matches, ensure your implementation allows them (e.g., by using the good suffix shift after a match rather than just +1).
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.