Boyer-Moore String Search Algorithm β Bad Character and Good Suffix
- Boyer-Moore compares right-to-left within the pattern β this is the key difference from KMP/Z. Right-to-left enables skipping forward in the text on mismatch.
- Bad character rule: align rightmost occurrence of mismatched text char with current pattern position. If char not in pattern, skip entire pattern length past current position.
- Boyer-Moore-Horspool (bad char only) is what most production implementations use β simpler than full Boyer-Moore, nearly as fast for typical text.
Open a terminal and run: grep -c 'the' /usr/share/dict/words. Grep finds all matches faster than you can think about it. The reason: Boyer-Moore (specifically Boyer-Moore-Horspool). The longer your search pattern, the faster it runs β counterintuitively, Boyer-Moore gets faster as patterns get longer because it skips more characters per comparison.
This sub-linear behaviour is the key insight. KMP guarantees O(n+m) worst case. Boyer-Moore achieves O(n/m) in practice on natural language text β for a 10-character pattern, it examines roughly 1 in 10 characters in the text. This is why Boyer-Moore is the default in text editors, grep, and most real-world text search implementations despite KMP being theoretically simpler.
Right-to-Left Matching
Boyer-Moore aligns the pattern at the left of the text and compares right-to-left β from the last character of the pattern backward. When it finds a mismatch, it applies whichever of the two heuristics gives the larger skip.
This right-to-left comparison is key: when we find a mismatch at pattern position j (text position i+j), we know nothing about text[i+j+1..i+m-1] β but we know the characters to the right are already matched. This asymmetry is what makes skipping possible.
The Bad Character Heuristic
When a mismatch occurs at text position i+j (pattern position j), look at the mismatched text character c = text[i+j]. The bad character rule says: shift the pattern right so that the rightmost occurrence of c in the pattern aligns with position i+j.
If c doesn't appear in the pattern at all, shift the entire pattern past position i+j β a shift of j+1. If c appears in the pattern to the left of j, shift to align the rightmost such occurrence. If c only appears to the right of j, the bad character rule gives a negative shift β ignore it (use 0 or the good suffix rule).
def bad_char_table(pattern: str) -> dict[str, int]: """ Returns last occurrence index of each character in pattern. If char not in pattern, returns -1 (handled at call site). """ table = {} for i, ch in enumerate(pattern): table[ch] = i return table pattern = 'ABCBAB' table = bad_char_table(pattern) print(table) # {'A': 4, 'B': 5, 'C': 2}
Boyer-Moore with Bad Character Only (Simplified)
A simplified version using only the bad character heuristic is called Boyer-Moore-Horspool and is often used in practice for its simplicity.
def boyer_moore_horspool(text: str, pattern: str) -> list[int]: n, m = len(text), len(pattern) if m == 0 or m > n: return [] # Build bad character table bad_char = {ch: i for i, ch in enumerate(pattern)} matches = [] s = 0 # shift of the pattern w.r.t. text while s <= n - m: j = m - 1 # start from rightmost character # Move left while characters match while j >= 0 and pattern[j] == text[s + j]: j -= 1 if j < 0: matches.append(s) # full match found # Shift by 1 (or use good suffix for full Boyer-Moore) s += 1 else: # Bad character shift bc = bad_char.get(text[s + j], -1) s += max(1, j - bc) return matches print(boyer_moore_horspool('ABAAABCD', 'ABC')) # [4] print(boyer_moore_horspool('AABCAABXAAB', 'AABX')) # [4]
[4]
The Good Suffix Heuristic
When a mismatch occurs after matching t characters (the 'good suffix'), the good suffix rule shifts the pattern to align the next occurrence of that suffix. This is more complex to implement but provides better worst-case guarantees.
The full Boyer-Moore with both heuristics achieves: - Best case: O(n/m) β each character in text examined at most once every m steps - Worst case: O(nm) for the simple version, O(n) with the full good suffix rule - Typical English text: Much faster than KMP in practice
For most interview purposes, the bad-character-only version (Horspool) is sufficient to explain and implement.
Worked Example
Text: AAABBBCCCABC Pattern: ABC
Step 1: Align at position 0. Compare right-to-left: text[2]='A' vs pattern[2]='C' β mismatch. Bad char 'A' is at index 0 in pattern. Shift = max(1, 2-0) = 2. Move to s=2.
Step 2: s=2. Compare right-to-left: text[4]='B' vs pattern[2]='C' β mismatch. Bad char 'B' at index 1. Shift = max(1, 2-1) = 1. Move to s=3.
Step 3: s=3. Compare: text[5]='B' vs 'C' β mismatch. Bad char 'B' at 1. Shift=1. s=4.
... continues until s=9 where ABC matches. Total: only examined ~8 positions instead of 30 naive comparisons.
| Property | Boyer-Moore-Horspool | Full Boyer-Moore | KMP |
|---|---|---|---|
| Preprocessing | O(m + Ο) | O(m + Ο) | O(m) |
| Search worst case | O(nm) | O(n) | O(n) |
| Search best case | O(n/m) | O(n/m) | O(n) |
| Implementation | Simple | Complex | Moderate |
| Best for | Long patterns | General purpose | Repetitive patterns |
π― Key Takeaways
- Boyer-Moore compares right-to-left within the pattern β this is the key difference from KMP/Z. Right-to-left enables skipping forward in the text on mismatch.
- Bad character rule: align rightmost occurrence of mismatched text char with current pattern position. If char not in pattern, skip entire pattern length past current position.
- Boyer-Moore-Horspool (bad char only) is what most production implementations use β simpler than full Boyer-Moore, nearly as fast for typical text.
- O(n/m) in practice: longer patterns = faster search. A 20-character pattern in English text typically examines fewer than n/10 characters.
- Choose KMP for: DNA sequences, binary data, repetitive patterns, guaranteed worst case. Choose Boyer-Moore for: English text, long patterns, when you can profile workload.
Interview Questions on This Topic
- QWhy does Boyer-Moore compare right-to-left instead of left-to-right?
- QExplain the bad character heuristic with an example.
- QWhy is Boyer-Moore faster for longer patterns?
- QWhat is the difference between Boyer-Moore and Boyer-Moore-Horspool?
Frequently Asked Questions
Which is faster in practice β KMP or Boyer-Moore?
Boyer-Moore is significantly faster in practice for long patterns in natural language text. KMP is preferred when worst-case guarantees are needed or patterns are short with lots of repetition.
Does Python's str.find() use Boyer-Moore?
CPython uses a two-way string matching algorithm combined with Boyer-Moore-Horspool heuristics for longer patterns. The exact implementation varies by Python version.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.