Homeβ€Ί DSAβ€Ί Boyer-Moore String Search Algorithm β€” Bad Character and Good Suffix

Boyer-Moore String Search Algorithm β€” Bad Character and Good Suffix

Where developers are forged. Β· Structured learning Β· Free forever.
πŸ“ Part of: Strings β†’ Topic 4 of 8
Learn the Boyer-Moore string search algorithm.
βš™οΈ Intermediate β€” basic DSA knowledge assumed
In this tutorial, you'll learn:
  • Boyer-Moore compares right-to-left within the pattern β€” this is the key difference from KMP/Z. Right-to-left enables skipping forward in the text on mismatch.
  • Bad character rule: align rightmost occurrence of mismatched text char with current pattern position. If char not in pattern, skip entire pattern length past current position.
  • Boyer-Moore-Horspool (bad char only) is what most production implementations use β€” simpler than full Boyer-Moore, nearly as fast for typical text.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚑ Quick Answer
Most string search algorithms compare left to right. Boyer-Moore does something counterintuitive β€” it compares the pattern right to left. When it finds a mismatch, it uses two powerful rules to skip ahead by many characters at once, sometimes jumping the entire pattern length. This is why Boyer-Moore gets faster as the pattern gets longer β€” the opposite of what you'd expect.

Open a terminal and run: grep -c 'the' /usr/share/dict/words. Grep finds all matches faster than you can think about it. The reason: Boyer-Moore (specifically Boyer-Moore-Horspool). The longer your search pattern, the faster it runs β€” counterintuitively, Boyer-Moore gets faster as patterns get longer because it skips more characters per comparison.

This sub-linear behaviour is the key insight. KMP guarantees O(n+m) worst case. Boyer-Moore achieves O(n/m) in practice on natural language text β€” for a 10-character pattern, it examines roughly 1 in 10 characters in the text. This is why Boyer-Moore is the default in text editors, grep, and most real-world text search implementations despite KMP being theoretically simpler.

Right-to-Left Matching

Boyer-Moore aligns the pattern at the left of the text and compares right-to-left β€” from the last character of the pattern backward. When it finds a mismatch, it applies whichever of the two heuristics gives the larger skip.

This right-to-left comparison is key: when we find a mismatch at pattern position j (text position i+j), we know nothing about text[i+j+1..i+m-1] β€” but we know the characters to the right are already matched. This asymmetry is what makes skipping possible.

The Bad Character Heuristic

When a mismatch occurs at text position i+j (pattern position j), look at the mismatched text character c = text[i+j]. The bad character rule says: shift the pattern right so that the rightmost occurrence of c in the pattern aligns with position i+j.

If c doesn't appear in the pattern at all, shift the entire pattern past position i+j β€” a shift of j+1. If c appears in the pattern to the left of j, shift to align the rightmost such occurrence. If c only appears to the right of j, the bad character rule gives a negative shift β€” ignore it (use 0 or the good suffix rule).

bad_char_table.py Β· PYTHON
12345678910111213
def bad_char_table(pattern: str) -> dict[str, int]:
    """
    Returns last occurrence index of each character in pattern.
    If char not in pattern, returns -1 (handled at call site).
    """
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = i
    return table

pattern = 'ABCBAB'
table = bad_char_table(pattern)
print(table)  # {'A': 4, 'B': 5, 'C': 2}
β–Ά Output
{'A': 4, 'B': 5, 'C': 2}

Boyer-Moore with Bad Character Only (Simplified)

A simplified version using only the bad character heuristic is called Boyer-Moore-Horspool and is often used in practice for its simplicity.

boyer_moore_simple.py Β· PYTHON
123456789101112131415161718192021222324252627282930
def boyer_moore_horspool(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Build bad character table
    bad_char = {ch: i for i, ch in enumerate(pattern)}
    matches = []
    s = 0  # shift of the pattern w.r.t. text

    while s <= n - m:
        j = m - 1  # start from rightmost character

        # Move left while characters match
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1

        if j < 0:
            matches.append(s)  # full match found
            # Shift by 1 (or use good suffix for full Boyer-Moore)
            s += 1
        else:
            # Bad character shift
            bc = bad_char.get(text[s + j], -1)
            s += max(1, j - bc)

    return matches

print(boyer_moore_horspool('ABAAABCD', 'ABC'))      # [4]
print(boyer_moore_horspool('AABCAABXAAB', 'AABX'))  # [4]
β–Ά Output
[4]
[4]

The Good Suffix Heuristic

When a mismatch occurs after matching t characters (the 'good suffix'), the good suffix rule shifts the pattern to align the next occurrence of that suffix. This is more complex to implement but provides better worst-case guarantees.

The full Boyer-Moore with both heuristics achieves: - Best case: O(n/m) β€” each character in text examined at most once every m steps - Worst case: O(nm) for the simple version, O(n) with the full good suffix rule - Typical English text: Much faster than KMP in practice

For most interview purposes, the bad-character-only version (Horspool) is sufficient to explain and implement.

πŸ”₯
Boyer-Moore in the Real WorldGNU grep uses a Boyer-Moore variant as its default algorithm. When you run 'grep pattern file', the longer your pattern, the faster grep runs β€” a direct consequence of Boyer-Moore's sub-linear behaviour.

Worked Example

Text: AAABBBCCCABC Pattern: ABC

Step 1: Align at position 0. Compare right-to-left: text[2]='A' vs pattern[2]='C' β†’ mismatch. Bad char 'A' is at index 0 in pattern. Shift = max(1, 2-0) = 2. Move to s=2.

Step 2: s=2. Compare right-to-left: text[4]='B' vs pattern[2]='C' β†’ mismatch. Bad char 'B' at index 1. Shift = max(1, 2-1) = 1. Move to s=3.

Step 3: s=3. Compare: text[5]='B' vs 'C' β†’ mismatch. Bad char 'B' at 1. Shift=1. s=4.

... continues until s=9 where ABC matches. Total: only examined ~8 positions instead of 30 naive comparisons.

PropertyBoyer-Moore-HorspoolFull Boyer-MooreKMP
PreprocessingO(m + Οƒ)O(m + Οƒ)O(m)
Search worst caseO(nm)O(n)O(n)
Search best caseO(n/m)O(n/m)O(n)
ImplementationSimpleComplexModerate
Best forLong patternsGeneral purposeRepetitive patterns

🎯 Key Takeaways

  • Boyer-Moore compares right-to-left within the pattern β€” this is the key difference from KMP/Z. Right-to-left enables skipping forward in the text on mismatch.
  • Bad character rule: align rightmost occurrence of mismatched text char with current pattern position. If char not in pattern, skip entire pattern length past current position.
  • Boyer-Moore-Horspool (bad char only) is what most production implementations use β€” simpler than full Boyer-Moore, nearly as fast for typical text.
  • O(n/m) in practice: longer patterns = faster search. A 20-character pattern in English text typically examines fewer than n/10 characters.
  • Choose KMP for: DNA sequences, binary data, repetitive patterns, guaranteed worst case. Choose Boyer-Moore for: English text, long patterns, when you can profile workload.

Interview Questions on This Topic

  • QWhy does Boyer-Moore compare right-to-left instead of left-to-right?
  • QExplain the bad character heuristic with an example.
  • QWhy is Boyer-Moore faster for longer patterns?
  • QWhat is the difference between Boyer-Moore and Boyer-Moore-Horspool?

Frequently Asked Questions

Which is faster in practice β€” KMP or Boyer-Moore?

Boyer-Moore is significantly faster in practice for long patterns in natural language text. KMP is preferred when worst-case guarantees are needed or patterns are short with lots of repetition.

Does Python's str.find() use Boyer-Moore?

CPython uses a two-way string matching algorithm combined with Boyer-Moore-Horspool heuristics for longer patterns. The exact implementation varies by Python version.

πŸ”₯
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousZ-Algorithm for Pattern MatchingNext β†’Manacher's Algorithm β€” Longest Palindromic Substring
Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged