Skip to content
Home DSA Boyer-Moore String Search — When Bad Character Alone Fails

Boyer-Moore String Search — When Bad Character Alone Fails

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Strings → Topic 4 of 8
A 20-char pattern took 10s on repetitive logs.
⚙️ Intermediate — basic DSA knowledge assumed
In this tutorial, you'll learn
A 20-char pattern took 10s on repetitive logs.
  • Boyer-Moore compares right-to-left within the pattern — this is the key difference from KMP/Z. Right-to-left enables skipping forward in the text on mismatch.
  • Bad character rule: align rightmost occurrence of mismatched text char with current pattern position. If char not in pattern, skip entire pattern length past current position.
  • Boyer-Moore-Horspool (bad char only) is what most production implementations use — simpler than full Boyer-Moore, nearly as fast for typical text.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Boyer-Moore compares pattern right-to-left and skips characters using two heuristics
  • Bad character rule shifts pattern to align mismatched char's rightmost occurrence
  • Good suffix rule shifts to align matched suffix's previous occurrence
  • Performance: O(n/m) in practice for long patterns — examines ~1 in m characters
  • Production insight: Implementation errors (missing good suffix) can cause O(nm) worst case on repetitive text
  • Biggest mistake: Assuming bad character alone guarantees sub-linear performance (Horspool is simpler but not guaranteed O(n))
🚨 START HERE

Boyer-Moore Implementation Quick Fixes

Dive into common implementation bugs and how to fix them fast.
🟡

Pattern match missed at the beginning of text.

Immediate ActionCheck your initial alignment: pattern should be at index 0 initially. Verify shift increment doesn't skip the first match.
Commands
print(f"Start shift: {s}, comparing text[{s+j}]='{text[s+j]}' vs pattern[{j}]='{pattern[j]}'")
Test with a simple pattern and text where you know the expected positions.
Fix NowAdd logging for each shift and character comparison. Ensure you handle the case where j becomes -1 (full match) before checking s limits.
🟡

Search runs but misses matches in the middle of text.

Immediate ActionCheck if the bad character table is built correctly — it should store the LAST (rightmost) index of each character in the pattern.
Commands
print(f"Bad char table for char '{c}': rightmost index {bad_char.get(c, -1)}")
Verify that when a character is not found (-1), you shift by j - (-1) = j + 1 (skip entire pattern).
Fix NowIf using a dict, ensure chars not in pattern return -1 explicitly. Then shift = max(1, j - bc_index).
🟡

Algorithm shifts more than expected (skips over a match).

Immediate ActionCheck if you applied the shift BEFORE comparing all characters. The shift should only happen AFTER a mismatch is found.
Commands
print(f"Shift before apply: s={s}, n-m={n-m}")
Add a guard: if s > n - m: break
Fix NowIn the loop, do the inner while loop first, then check if j < 0 or apply shift based on mismatch. Don't shift at the start of each iteration.
Production Incident

Production Search Timeout Due to Boyer-Moore Worst Case

A log analysis tool that indexed hundreds of GB of repetitive log entries started timing out after a pattern change. The root cause: a missing good suffix rule in a custom Boyer-Moore implementation.
SymptomSearch for a 20-character pattern in log files with repeated 'ERROR: Connection timeout...' strings took over 10 seconds per file instead of milliseconds.
AssumptionThe team assumed Boyer-Moore with only the bad character heuristic would be sufficient — it works well on English text.
Root causeThe bad-character-only variant (Horspool) degrades to O(nm) when the pattern and text have repetitive characters. In this case, the pattern 'ERROR: Connection timeout' contained many repeated letters, and the log had thousands of similar lines. Every mismatch triggered a tiny shift, making the algorithm examine every character.
FixReplaced the Horspool implementation with the full Boyer-Moore (including good suffix rule). The search time dropped back to <100ms per file. The good suffix rule handles repetitive patterns by shifting the entire suffix to the previous occurrence, preventing character-by-character slides.
Key Lesson
Bad-character-only Boyer-Moore is not safe for all inputs — always evaluate worst-case pattern and text characteristics.If your data has repetition (logs, DNA, machine-generated text), you need the good suffix rule for guaranteed O(n) performance.Profile your search on production-like data before committing to a variant.Grep uses a tuneable variant that falls back to other algorithms when Boyer-Moore's skip performance degrades.
Production Debug Guide

Debug why your string search is unexpectedly slow

Search time grows linearly with text size regardless of pattern length.Check if you implemented only the bad character heuristic. Add the good suffix rule or switch to Boyer-Moore-Horspool only if your data is truly random (e.g., English prose).
Search on repetitive patterns (e.g., 'AAAA', 'ABAB') takes much longer than on random patterns.Profile with repetitive patterns. If bad-character-only, the good suffix rule is mandatory for these inputs. Implement full Boyer-Moore.
After a pattern match, the algorithm shifts by only 1 character.You may have forgotten to handle the case where a full match occurs. In the Horspool variant, after a match, shift by 1 (or use the good suffix rule to shift further). Check your shift logic.
The bad character table produces zero or negative shifts.Ensure the table stores the rightmost occurrence of each character. If a character appears only to the right of the mismatch position, the bad character rule gives a negative shift — in that case, ignore it and fall back to a shift of 1 (or the good suffix rule).
Implementation works on short patterns but fails on long patterns (>30 chars).Check for integer overflow in shift calculations. The shift value can be up to pattern length, which fits in an int, but if you use unsigned types, ensure no underflow. Also verify you're not skipping beyond the text bounds.

Open a terminal and run: grep -c 'the' /usr/share/dict/words. Grep finds all matches faster than you can think about it. The reason: Boyer-Moore (specifically Boyer-Moore-Horspool). The longer your search pattern, the faster it runs — counterintuitively, Boyer-Moore gets faster as patterns get longer because it skips more characters per comparison.

This sub-linear behaviour is the key insight. KMP guarantees O(n+m) worst case. Boyer-Moore achieves O(n/m) in practice on natural language text — for a 10-character pattern, it examines roughly 1 in 10 characters in the text. This is why Boyer-Moore is the default in text editors, grep, and most real-world text search implementations despite KMP being theoretically simpler.

Right-to-Left Matching

Boyer-Moore aligns the pattern at the left of the text and compares right-to-left — from the last character of the pattern backward. When it finds a mismatch, it applies whichever of the two heuristics gives the larger skip.

This right-to-left comparison is key: when we find a mismatch at pattern position j (text position i+j), we know nothing about text[i+j+1..i+m-1] — but we know the characters to the right are already matched. This asymmetry is what makes skipping possible.

📊 Production Insight
Right-to-left matching lets us skip characters based on the mismatched text character.
Without it, you'd never get sub-linear behaviour.
Rule: if you compare left-to-right, you forfeit Bad Character skipping.
🎯 Key Takeaway
Right-to-left comparison is the foundation.
It enables skipping because mismatched text char carries information about future alignments.

The Bad Character Heuristic

When a mismatch occurs at text position i+j (pattern position j), look at the mismatched text character c = text[i+j]. The bad character rule says: shift the pattern right so that the rightmost occurrence of c in the pattern aligns with position i+j.

If c doesn't appear in the pattern at all, shift the entire pattern past position i+j — a shift of j+1. If c appears in the pattern to the left of j, shift to align the rightmost such occurrence. If c only appears to the right of j, the bad character rule gives a negative shift — ignore it (use 0 or the good suffix rule).

bad_char_table.py · PYTHON
12345678910111213
def bad_char_table(pattern: str) -> dict[str, int]:
    """
    Returns last occurrence index of each character in pattern.
    If char not in pattern, returns -1 (handled at call site).
    """
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = i
    return table

pattern = 'ABCBAB'
table = bad_char_table(pattern)
print(table)  # {'A': 4, 'B': 5, 'C': 2}
📊 Production Insight
The bad character table must store the RIGHTMOST occurrence, not the first.
A common bug: storing first occurrence leads to insufficient shifts and O(nm) behaviour.
Rule: always enumerate from left to right, overwriting previous entries.
🎯 Key Takeaway
Bad character shifts are computed as j - bc[text[i+j]].
If negative, ignore it — always use max(1, shift) to avoid infinite loops.

Boyer-Moore with Bad Character Only (Simplified)

A simplified version using only the bad character heuristic is called Boyer-Moore-Horspool and is often used in practice for its simplicity. It's the default in many text editors and grep implementations because it's fast enough for general text. The trade-off is worst-case O(nm) on repetitive inputs.

boyer_moore_simple.py · PYTHON
123456789101112131415161718192021222324252627282930
def boyer_moore_horspool(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Build bad character table
    bad_char = {ch: i for i, ch in enumerate(pattern)}
    matches = []
    s = 0  # shift of the pattern w.r.t. text

    while s <= n - m:
        j = m - 1  # start from rightmost character

        # Move left while characters match
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1

        if j < 0:
            matches.append(s)  # full match found
            # Shift by 1 (or use good suffix for full Boyer-Moore)
            s += 1
        else:
            # Bad character shift
            bc = bad_char.get(text[s + j], -1)
            s += max(1, j - bc)

    return matches

print(boyer_moore_horspool('ABAAABCD', 'ABC'))      # [4]
print(boyer_moore_horspool('AABCAABXAAB', 'AABX'))  # [4]
📊 Production Insight
Many production systems use Horspool variant for its simplicity.
But if your data has high repetition (logs, DNA), you'll see catastrophic slowdown.
Rule: always test with worst-case repetitive patterns before deploying.
🎯 Key Takeaway
Horspool = bad char only.
Great for natural language, risky for repetitive data.
Test on real workloads.

The Good Suffix Heuristic

When a mismatch occurs after matching t characters (the 'good suffix'), the good suffix rule shifts the pattern to align the next occurrence of that suffix. This is more complex to implement but provides better worst-case guarantees.

The full Boyer-Moore with both heuristics achieves
  • Best case: O(n/m) — each character in text examined at most once every m steps
  • Worst case: O(n) when good suffix is included, O(nm) without
  • Typical English text: Much faster than KMP in practice

For most interview purposes, the bad-character-only version (Horspool) is sufficient to explain and implement. But in production, if you can't guarantee input randomness, implement the good suffix rule.

🔥Boyer-Moore in the Real World
GNU grep uses a Boyer-Moore variant as its default algorithm. When you run 'grep pattern file', the longer your pattern, the faster grep runs — a direct consequence of Boyer-Moore's sub-linear behaviour.
📊 Production Insight
Good suffix rule requires building two tables: one for prefix matching, one for suffix matching.
Implementation mistakes here are common and subtle.
Rule: test with pattern like 'ABAB' to verify shifts are correct after partial matches.
🎯 Key Takeaway
Good suffix ensures O(n) worst case.
Without it, you risk O(nm) on repetitive text.
Implement it if you can't control your input.

Worked Example: Tracing Shifts on a Short Text

Text: AAABBBCCCABC Pattern: ABC

Step 1: Align at position 0. Compare right-to-left: text[2]='A' vs pattern[2]='C' → mismatch. Bad char 'A' is at index 0 in pattern. Shift = max(1, 2-0) = 2. Move to s=2.

Step 2: s=2. Compare right-to-left: text[4]='B' vs pattern[2]='C' → mismatch. Bad char 'B' at index 1. Shift = max(1, 2-1) = 1. Move to s=3.

Step 3: s=3. Compare: text[5]='B' vs 'C' → mismatch. Bad char 'B' at 1. Shift=1. s=4.

... continues until s=9 where ABC matches. Total: only examined ~8 positions instead of 30 naive comparisons.

This demonstrates the sub-linear behaviour: for a 3-character pattern, we skip 2 out of every 3 characters on average.

📊 Production Insight
Tracing shifts manually catches off-by-one errors in shift calculations.
Common mistake: forgetting that mismatched text position is s+j, not just j.
Rule: verify your shift logic on a simple example before trusting it.
🎯 Key Takeaway
Worked examples reveal algorithm correctness.
Shift calculation: s += max(1, j - bc_index).
Always test with pattern and text that have known match positions.
🗂 Boyer-Moore Variants vs KMP
Key trade-offs between common string matching algorithms
PropertyBoyer-Moore-HorspoolFull Boyer-MooreKMP
PreprocessingO(m + σ)O(m + σ)O(m)
Search worst caseO(nm)O(n)O(n)
Search best caseO(n/m)O(n/m)O(n)
ImplementationSimpleComplexModerate
Best forLong patterns, random textGeneral purpose, repetitive textShort patterns, guaranteed worst-case

🎯 Key Takeaways

  • Boyer-Moore compares right-to-left within the pattern — this is the key difference from KMP/Z. Right-to-left enables skipping forward in the text on mismatch.
  • Bad character rule: align rightmost occurrence of mismatched text char with current pattern position. If char not in pattern, skip entire pattern length past current position.
  • Boyer-Moore-Horspool (bad char only) is what most production implementations use — simpler than full Boyer-Moore, nearly as fast for typical text.
  • O(n/m) in practice: longer patterns = faster search. A 20-character pattern in English text typically examines fewer than n/10 characters.
  • Choose KMP for: DNA sequences, binary data, repetitive patterns, guaranteed worst case. Choose Boyer-Moore for: English text, long patterns, when you can profile workload.

⚠ Common Mistakes to Avoid

    Storing first occurrence instead of rightmost in bad character table
    Symptom

    Shifts are smaller than optimal; algorithm degrades to O(nm) even on random text because it never skips enough characters.

    Fix

    When building the table, iterate through the pattern left to right and overwrite the entry for each character. This ensures the value stored is the last (rightmost) index.

    Not handling negative shift from bad character rule
    Symptom

    The computed shift (j - bc) is negative because the character appears only to the right of j. This causes the algorithm to shift left (backwards) or produce an underflow.

    Fix

    Use max(1, j - bc) to clamp the shift to at least 1. Or implement the good suffix rule to handle that case properly.

    Omitting the good suffix rule for repetitive data
    Symptom

    Search on patterns like 'AAAA' or 'ABAB' in repetitive text takes linear time (O(nm)) instead of sub-linear.

    Fix

    Either implement the good suffix rule, or if you only use Horspool, ensure your application data is truly random. Otherwise, fall back to KMP when repetition is detected.

    Incorrect shift after a full match
    Symptom

    After finding a match, the algorithm shifts by 1 (or worse, by 0), causing duplicate matches or infinite loops.

    Fix

    In Horspool, after a match, shift by 1 (s += 1). In full Boyer-Moore, you can compute a larger shift using the good suffix rule. Ensure you don't skip subsequent overlapping matches if needed.

    Using a fixed shift table for bad characters without considering alphabet size
    Symptom

    If the alphabet is large (Unicode text), the bad character table can become memory-heavy. If it's small (DNA), make sure you handle all possible characters efficiently.

    Fix

    For large alphabets, use a hash map instead of an array. For small alphabets, a fixed-size array indexed by character code is faster. Balance memory vs speed based on your input domain.

Interview Questions on This Topic

  • QWhy does Boyer-Moore compare right-to-left instead of left-to-right?Mid-levelReveal
    Right-to-left comparison allows the algorithm to use information from the mismatched character. When a mismatch occurs, we see the character in the text that caused the failure. If we compared left-to-right, we would know only that some character in the pattern mismatched, but not which one. Right-to-left gives us a concrete text character to apply the bad character heuristic.
  • QExplain the bad character heuristic with an example.SeniorReveal
    When a mismatch occurs at pattern position j and text character c = text[s+j], the bad character rule says: shift the pattern so that the rightmost occurrence of c in the pattern aligns with that position. Example: pattern 'ABCBAB', text 'ABCX...'. At s=0, compare rightmost: pattern[5]='B' vs text[5]='C' mismatch. Mismatched text char is 'C', which appears in pattern at index 2 (rightmost). j=5, bc=2, shift = max(1, 5-2)=3. So we skip 3 characters to align the 'C' in pattern with the text 'C'.
  • QWhy is Boyer-Moore faster for longer patterns?Mid-levelReveal
    The bad character rule allows shifts proportional to the pattern length. Longer patterns mean larger potential shifts. In natural language, each character mismatch can skip almost the entire pattern length. So a 20-character pattern examines roughly n/20 characters, while a 5-character pattern examines n/5. The time per comparison is constant, so longer patterns run faster in terms of total characters examined.
  • QWhat is the difference between Boyer-Moore and Boyer-Moore-Horspool?JuniorReveal
    Horspool is a simplified version that uses only the bad character heuristic. It's easier to implement and performs well on typical text. The full Boyer-Moore also includes the good suffix heuristic, which handles degenerate cases (repetitive patterns and text) and guarantees O(n) worst-case time. Horspool has O(nm) worst-case but is often 2-3x faster on real-world data due to simpler logic.
  • QHow would you implement Boyer-Moore in a production system where you don't know the text distribution?SeniorReveal
    Start with Boyer-Moore-Horspool for simplicity, but add a runtime check for pattern repetition (e.g., long runs of same character or periodic patterns). If detected, fall back to KMP or full Boyer-Moore. Also, tune the bad character table size based on the alphabet (fixed array for ASCII, dict for Unicode). Profile on representative data. If worst-case latency is critical, invest in implementing the good suffix rule.

Frequently Asked Questions

Which is faster in practice — KMP or Boyer-Moore?

Boyer-Moore is significantly faster in practice for long patterns in natural language text. KMP is preferred when worst-case guarantees are needed or patterns are short with lots of repetition.

Does Python's str.find() use Boyer-Moore?

CPython uses a two-way string matching algorithm combined with Boyer-Moore-Horspool heuristics for longer patterns. The exact implementation varies by Python version.

When should I implement good suffix instead of using Horspool?

Implement good suffix when your input includes any repetitive patterns – logs, DNA, machine-generated data, or when you cannot afford worst-case timeouts. If you're building a library used by unknown callers, implement the full algorithm.

Can Boyer-Moore be used for binary data?

Yes, but the alphabet is large (256 bytes), so the bad character table as an array of 256 ints is fine. However, binary data often has patterns and repetitions, so you need the good suffix rule to avoid O(nm) worst case. KMP is often preferred for binary matching due to simpler worst-case behaviour.

How does Boyer-Moore handle overlapping matches?

In the basic Horspool, after a match you shift by 1 to find overlapping matches. In full Boyer-Moore, the good suffix rule gives a larger shift. If you need overlapping matches, ensure your implementation allows them (e.g., by using the good suffix shift after a match rather than just +1).

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousZ-Algorithm for Pattern MatchingNext →Manacher's Algorithm — Longest Palindromic Substring
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged