Senior 3 min · March 24, 2026

Boyer-Moore String Search — When Bad Character Alone Fails

A 20-char pattern took 10s on repetitive logs.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Boyer-Moore compares pattern right-to-left and skips characters using two heuristics
  • Bad character rule shifts pattern to align mismatched char's rightmost occurrence
  • Good suffix rule shifts to align matched suffix's previous occurrence
  • Performance: O(n/m) in practice for long patterns – examines ~1 in m characters
  • Production insight: Implementation errors (missing good suffix) can cause O(nm) worst case on repetitive text
  • Biggest mistake: Assuming bad character alone guarantees sub-linear performance (Horspool is simpler but not guaranteed O(n))
Plain-English First

Most string search algorithms compare left to right. Boyer-Moore does something counterintuitive — it compares the pattern right to left. When it finds a mismatch, it uses two powerful rules to skip ahead by many characters at once, sometimes jumping the entire pattern length. This is why Boyer-Moore gets faster as the pattern gets longer — the opposite of what you'd expect.

Open a terminal and run: grep -c 'the' /usr/share/dict/words. Grep finds all matches faster than you can think about it. The reason: Boyer-Moore (specifically Boyer-Moore-Horspool). The longer your search pattern, the faster it runs — counterintuitively, Boyer-Moore gets faster as patterns get longer because it skips more characters per comparison.

This sub-linear behaviour is the key insight. KMP guarantees O(n+m) worst case. Boyer-Moore achieves O(n/m) in practice on natural language text — for a 10-character pattern, it examines roughly 1 in 10 characters in the text. This is why Boyer-Moore is the default in text editors, grep, and most real-world text search implementations despite KMP being theoretically simpler.

Right-to-Left Matching

Boyer-Moore aligns the pattern at the left of the text and compares right-to-left — from the last character of the pattern backward. When it finds a mismatch, it applies whichever of the two heuristics gives the larger skip.

This right-to-left comparison is key: when we find a mismatch at pattern position j (text position i+j), we know nothing about text[i+j+1..i+m-1] — but we know the characters to the right are already matched. This asymmetry is what makes skipping possible.

Production Insight
Right-to-left matching lets us skip characters based on the mismatched text character.
Without it, you'd never get sub-linear behaviour.
Rule: if you compare left-to-right, you forfeit Bad Character skipping.
Key Takeaway
Right-to-left comparison is the foundation.
It enables skipping because mismatched text char carries information about future alignments.

The Bad Character Heuristic

When a mismatch occurs at text position i+j (pattern position j), look at the mismatched text character c = text[i+j]. The bad character rule says: shift the pattern right so that the rightmost occurrence of c in the pattern aligns with position i+j.

If c doesn't appear in the pattern at all, shift the entire pattern past position i+j — a shift of j+1. If c appears in the pattern to the left of j, shift to align the rightmost such occurrence. If c only appears to the right of j, the bad character rule gives a negative shift — ignore it (use 0 or the good suffix rule).

bad_char_table.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
def bad_char_table(pattern: str) -> dict[str, int]:
    """
    Returns last occurrence index of each character in pattern.
    If char not in pattern, returns -1 (handled at call site).
    """
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = i
    return table

pattern = 'ABCBAB'
table = bad_char_table(pattern)
print(table)  # {'A': 4, 'B': 5, 'C': 2}
Production Insight
The bad character table must store the RIGHTMOST occurrence, not the first.
A common bug: storing first occurrence leads to insufficient shifts and O(nm) behaviour.
Rule: always enumerate from left to right, overwriting previous entries.
Key Takeaway
Bad character shifts are computed as j - bc[text[i+j]].
If negative, ignore it — always use max(1, shift) to avoid infinite loops.

Boyer-Moore with Bad Character Only (Simplified)

A simplified version using only the bad character heuristic is called Boyer-Moore-Horspool and is often used in practice for its simplicity. It's the default in many text editors and grep implementations because it's fast enough for general text. The trade-off is worst-case O(nm) on repetitive inputs.

boyer_moore_simple.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
def boyer_moore_horspool(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Build bad character table
    bad_char = {ch: i for i, ch in enumerate(pattern)}
    matches = []
    s = 0  # shift of the pattern w.r.t. text

    while s <= n - m:
        j = m - 1  # start from rightmost character

        # Move left while characters match
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1

        if j < 0:
            matches.append(s)  # full match found
            # Shift by 1 (or use good suffix for full Boyer-Moore)
            s += 1
        else:
            # Bad character shift
            bc = bad_char.get(text[s + j], -1)
            s += max(1, j - bc)

    return matches

print(boyer_moore_horspool('ABAAABCD', 'ABC'))      # [4]
print(boyer_moore_horspool('AABCAABXAAB', 'AABX'))  # [4]
Production Insight
Many production systems use Horspool variant for its simplicity.
But if your data has high repetition (logs, DNA), you'll see catastrophic slowdown.
Rule: always test with worst-case repetitive patterns before deploying.
Key Takeaway
Horspool = bad char only.
Great for natural language, risky for repetitive data.
Test on real workloads.

The Good Suffix Heuristic

When a mismatch occurs after matching t characters (the 'good suffix'), the good suffix rule shifts the pattern to align the next occurrence of that suffix. This is more complex to implement but provides better worst-case guarantees.

The full Boyer-Moore with both heuristics achieves
  • Best case: O(n/m) — each character in text examined at most once every m steps
  • Worst case: O(n) when good suffix is included, O(nm) without
  • Typical English text: Much faster than KMP in practice

For most interview purposes, the bad-character-only version (Horspool) is sufficient to explain and implement. But in production, if you can't guarantee input randomness, implement the good suffix rule.

Boyer-Moore in the Real World
GNU grep uses a Boyer-Moore variant as its default algorithm. When you run 'grep pattern file', the longer your pattern, the faster grep runs — a direct consequence of Boyer-Moore's sub-linear behaviour.
Production Insight
Good suffix rule requires building two tables: one for prefix matching, one for suffix matching.
Implementation mistakes here are common and subtle.
Rule: test with pattern like 'ABAB' to verify shifts are correct after partial matches.
Key Takeaway
Good suffix ensures O(n) worst case.
Without it, you risk O(nm) on repetitive text.
Implement it if you can't control your input.

Worked Example: Tracing Shifts on a Short Text

Text: AAABBBCCCABC Pattern: ABC

Step 1: Align at position 0. Compare right-to-left: text[2]='A' vs pattern[2]='C' → mismatch. Bad char 'A' is at index 0 in pattern. Shift = max(1, 2-0) = 2. Move to s=2.

Step 2: s=2. Compare right-to-left: text[4]='B' vs pattern[2]='C' → mismatch. Bad char 'B' at index 1. Shift = max(1, 2-1) = 1. Move to s=3.

Step 3: s=3. Compare: text[5]='B' vs 'C' → mismatch. Bad char 'B' at 1. Shift=1. s=4.

... continues until s=9 where ABC matches. Total: only examined ~8 positions instead of 30 naive comparisons.

This demonstrates the sub-linear behaviour: for a 3-character pattern, we skip 2 out of every 3 characters on average.

Production Insight
Tracing shifts manually catches off-by-one errors in shift calculations.
Common mistake: forgetting that mismatched text position is s+j, not just j.
Rule: verify your shift logic on a simple example before trusting it.
Key Takeaway
Worked examples reveal algorithm correctness.
Shift calculation: s += max(1, j - bc_index).
Always test with pattern and text that have known match positions.
● Production incidentPOST-MORTEMseverity: high

Production Search Timeout Due to Boyer-Moore Worst Case

Symptom
Search for a 20-character pattern in log files with repeated 'ERROR: Connection timeout...' strings took over 10 seconds per file instead of milliseconds.
Assumption
The team assumed Boyer-Moore with only the bad character heuristic would be sufficient — it works well on English text.
Root cause
The bad-character-only variant (Horspool) degrades to O(nm) when the pattern and text have repetitive characters. In this case, the pattern 'ERROR: Connection timeout' contained many repeated letters, and the log had thousands of similar lines. Every mismatch triggered a tiny shift, making the algorithm examine every character.
Fix
Replaced the Horspool implementation with the full Boyer-Moore (including good suffix rule). The search time dropped back to <100ms per file. The good suffix rule handles repetitive patterns by shifting the entire suffix to the previous occurrence, preventing character-by-character slides.
Key lesson
  • Bad-character-only Boyer-Moore is not safe for all inputs — always evaluate worst-case pattern and text characteristics.
  • If your data has repetition (logs, DNA, machine-generated text), you need the good suffix rule for guaranteed O(n) performance.
  • Profile your search on production-like data before committing to a variant.
  • Grep uses a tuneable variant that falls back to other algorithms when Boyer-Moore's skip performance degrades.
Production debug guideDebug why your string search is unexpectedly slow5 entries
Symptom · 01
Search time grows linearly with text size regardless of pattern length.
Fix
Check if you implemented only the bad character heuristic. Add the good suffix rule or switch to Boyer-Moore-Horspool only if your data is truly random (e.g., English prose).
Symptom · 02
Search on repetitive patterns (e.g., 'AAAA', 'ABAB') takes much longer than on random patterns.
Fix
Profile with repetitive patterns. If bad-character-only, the good suffix rule is mandatory for these inputs. Implement full Boyer-Moore.
Symptom · 03
After a pattern match, the algorithm shifts by only 1 character.
Fix
You may have forgotten to handle the case where a full match occurs. In the Horspool variant, after a match, shift by 1 (or use the good suffix rule to shift further). Check your shift logic.
Symptom · 04
The bad character table produces zero or negative shifts.
Fix
Ensure the table stores the rightmost occurrence of each character. If a character appears only to the right of the mismatch position, the bad character rule gives a negative shift — in that case, ignore it and fall back to a shift of 1 (or the good suffix rule).
Symptom · 05
Implementation works on short patterns but fails on long patterns (>30 chars).
Fix
Check for integer overflow in shift calculations. The shift value can be up to pattern length, which fits in an int, but if you use unsigned types, ensure no underflow. Also verify you're not skipping beyond the text bounds.
★ Boyer-Moore Implementation Quick FixesDive into common implementation bugs and how to fix them fast.
Pattern match missed at the beginning of text.
Immediate action
Check your initial alignment: pattern should be at index 0 initially. Verify shift increment doesn't skip the first match.
Commands
print(f"Start shift: {s}, comparing text[{s+j}]='{text[s+j]}' vs pattern[{j}]='{pattern[j]}'")
Test with a simple pattern and text where you know the expected positions.
Fix now
Add logging for each shift and character comparison. Ensure you handle the case where j becomes -1 (full match) before checking s limits.
Search runs but misses matches in the middle of text.+
Immediate action
Check if the bad character table is built correctly — it should store the LAST (rightmost) index of each character in the pattern.
Commands
print(f"Bad char table for char '{c}': rightmost index {bad_char.get(c, -1)}")
Verify that when a character is not found (-1), you shift by j - (-1) = j + 1 (skip entire pattern).
Fix now
If using a dict, ensure chars not in pattern return -1 explicitly. Then shift = max(1, j - bc_index).
Algorithm shifts more than expected (skips over a match).+
Immediate action
Check if you applied the shift BEFORE comparing all characters. The shift should only happen AFTER a mismatch is found.
Commands
print(f"Shift before apply: s={s}, n-m={n-m}")
Add a guard: if s > n - m: break
Fix now
In the loop, do the inner while loop first, then check if j < 0 or apply shift based on mismatch. Don't shift at the start of each iteration.
Boyer-Moore Variants vs KMP
PropertyBoyer-Moore-HorspoolFull Boyer-MooreKMP
PreprocessingO(m + σ)O(m + σ)O(m)
Search worst caseO(nm)O(n)O(n)
Search best caseO(n/m)O(n/m)O(n)
ImplementationSimpleComplexModerate
Best forLong patterns, random textGeneral purpose, repetitive textShort patterns, guaranteed worst-case

Key takeaways

1
Boyer-Moore compares right-to-left within the pattern
this is the key difference from KMP/Z. Right-to-left enables skipping forward in the text on mismatch.
2
Bad character rule
align rightmost occurrence of mismatched text char with current pattern position. If char not in pattern, skip entire pattern length past current position.
3
Boyer-Moore-Horspool (bad char only) is what most production implementations use
simpler than full Boyer-Moore, nearly as fast for typical text.
4
O(n/m) in practice
longer patterns = faster search. A 20-character pattern in English text typically examines fewer than n/10 characters.
5
Choose KMP for
DNA sequences, binary data, repetitive patterns, guaranteed worst case. Choose Boyer-Moore for: English text, long patterns, when you can profile workload.

Common mistakes to avoid

5 patterns
×

Storing first occurrence instead of rightmost in bad character table

Symptom
Shifts are smaller than optimal; algorithm degrades to O(nm) even on random text because it never skips enough characters.
Fix
When building the table, iterate through the pattern left to right and overwrite the entry for each character. This ensures the value stored is the last (rightmost) index.
×

Not handling negative shift from bad character rule

Symptom
The computed shift (j - bc) is negative because the character appears only to the right of j. This causes the algorithm to shift left (backwards) or produce an underflow.
Fix
Use max(1, j - bc) to clamp the shift to at least 1. Or implement the good suffix rule to handle that case properly.
×

Omitting the good suffix rule for repetitive data

Symptom
Search on patterns like 'AAAA' or 'ABAB' in repetitive text takes linear time (O(nm)) instead of sub-linear.
Fix
Either implement the good suffix rule, or if you only use Horspool, ensure your application data is truly random. Otherwise, fall back to KMP when repetition is detected.
×

Incorrect shift after a full match

Symptom
After finding a match, the algorithm shifts by 1 (or worse, by 0), causing duplicate matches or infinite loops.
Fix
In Horspool, after a match, shift by 1 (s += 1). In full Boyer-Moore, you can compute a larger shift using the good suffix rule. Ensure you don't skip subsequent overlapping matches if needed.
×

Using a fixed shift table for bad characters without considering alphabet size

Symptom
If the alphabet is large (Unicode text), the bad character table can become memory-heavy. If it's small (DNA), make sure you handle all possible characters efficiently.
Fix
For large alphabets, use a hash map instead of an array. For small alphabets, a fixed-size array indexed by character code is faster. Balance memory vs speed based on your input domain.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Why does Boyer-Moore compare right-to-left instead of left-to-right?
Q02SENIOR
Explain the bad character heuristic with an example.
Q03SENIOR
Why is Boyer-Moore faster for longer patterns?
Q04JUNIOR
What is the difference between Boyer-Moore and Boyer-Moore-Horspool?
Q05SENIOR
How would you implement Boyer-Moore in a production system where you don...
Q01 of 05SENIOR

Why does Boyer-Moore compare right-to-left instead of left-to-right?

ANSWER
Right-to-left comparison allows the algorithm to use information from the mismatched character. When a mismatch occurs, we see the character in the text that caused the failure. If we compared left-to-right, we would know only that some character in the pattern mismatched, but not which one. Right-to-left gives us a concrete text character to apply the bad character heuristic.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Which is faster in practice — KMP or Boyer-Moore?
02
Does Python's str.find() use Boyer-Moore?
03
When should I implement good suffix instead of using Horspool?
04
Can Boyer-Moore be used for binary data?
05
How does Boyer-Moore handle overlapping matches?
🔥

That's Strings. Mark it forged?

3 min read · try the examples if you haven't

Previous
Z-Algorithm for Pattern Matching
4 / 8 · Strings
Next
Manacher's Algorithm — Longest Palindromic Substring