Senior 3 min · March 24, 2026

String Searching Algorithms — 12-Min Parser Failure on 2GB

12-minute log scan on 2GB file due to naive string search on repeated 'A's (worst-case O(nm)).

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • String searching locates all occurrences of a pattern in a text. Core operation for grep, log analysis, DNA alignment.
  • Naive: O(nm) worst-case. Fine for short patterns; breaks at scale.
  • KMP: O(n+m) guaranteed. Best for repetitive patterns or adversarial input.
  • Boyer-Moore: O(n/m) best-case. Fastest in practice for natural language.
  • Rabin-Karp: Rolling hash enables multi-pattern search in O(n+m) average.
  • Production reality: str.find() in CPython uses two-way algorithm — hand-rolling in Python is rarely needed.
Plain-English First

Imagine you have a massive book and need to find a specific word. You could read every page from start to finish (naive), or you could be smarter — use bookmarks, skip ahead when you know the word can't start here, or pre-process the word to avoid backtracking. String searching algorithms are exactly these smarter strategies applied to text.

Every production system that processes text does string searching. Log parsers, protocol handlers, search engines, DNA alignment tools, plagiarism detectors — all of them have string search at their core. The algorithm you choose matters more than most engineers realise until they hit the wrong one at scale.

The naive approach is O(nm). For a 1GB log file with n=10^9 and a 100-char pattern, that is 10^11 comparisons — 100 seconds. Boyer-Moore on the same problem: sub-linear O(n/m) in practice — under a second.

This guide maps the landscape. Up front: KMP for guaranteed linear time on repetitive patterns, Boyer-Moore for long patterns in English text (fastest in practice), Rabin-Karp for multi-pattern search, Aho-Corasick when you have thousands of patterns simultaneously.

The Naive / Brute Force Approach

The simplest approach: slide the pattern across the text one character at a time and check for a match at each position.

Time complexity: O(nm) worst case — occurs with patterns like AAAB in text AAAA...A Space complexity: O(1)

Despite being slow, the naive approach performs well in practice for typical English text because mismatches occur early.

naive_search.pyPYTHON
1
2
3
4
5
6
7
8
9
def naive_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        if text[i:i+m] == pattern:
            positions.append(i)
    return positions

print(naive_search('AABAACAADAABAABA', 'AABA'))  # [0, 9, 12]
Output
[0, 9, 12]
Production Insight
Naive search is fine for patterns < 10 chars in text < 1MB.
For log files > 100MB, even with early mismatch, repeated characters in the pattern (e.g., searching for 'TTTTTTTT') cause O(nm).
Rule: never use naive search in production code; always use a library function.
Key Takeaway
Naive is easy to implement but dangerous at scale.
Production rule: if you need guaranteed performance, don't write your own — use built-in.
Worst-case input always exists and will find your code.

Algorithm Landscape — When to Use What

Each algorithm solves the same problem but excels in different scenarios:

KMP (Knuth-Morris-Pratt): Best when the pattern has repeated internal structure. Guarantees O(n+m) with no worst-case degradation. The go-to general purpose algorithm.

Rabin-Karp: Best for multiple pattern search or plagiarism detection. Uses hashing — O(n+m) average, O(nm) worst case but rare in practice.

Boyer-Moore: Best for long patterns in English-like text. Skips characters aggressively — sub-linear O(n/m) in practice. Used in most real-world text editors (grep, vi).

Z-Algorithm: Elegant and easy to implement. O(n+m). Great for problems involving pattern matching within a string concatenation trick.

Aho-Corasick: Best when searching for many patterns simultaneously. O(n + total pattern length + matches). Used in antivirus engines and network intrusion detection.

Interview Tip
For most interview questions, KMP or Rabin-Karp are expected. Boyer-Moore is impressive to mention but rarely required to implement. Z-Algorithm appears in competitive programming.
Production Insight
Switching algorithms can give 1000x speedup: naive O(nm) vs Boyer-Moore O(n/m).
Boyer-Moore's best-case when pattern is long and text is natural language (grep).
Rule: pick Boyer-Moore for long patterns in human text; KMP for repetitive or adversarial inputs.
Key Takeaway
Know the input characteristics before choosing an algorithm.
Boyer-Moore wins on long patterns in English text.
KMP wins on repetitive patterns and guaranteed linear time.

Complexity Comparison Table

A summary of all major string searching algorithms by preprocessing time, search time, and space requirements.

Production Insight
Aho-Corasick's preprocessing O(Σm) can be heavy for large alphabets (e.g., Unicode).
Boyer-Moore space O(m+σ) includes the bad character table — for ASCII it's negligible, for Unicode it's large.
Rule: if memory is tight, prefer KMP (O(m)) or Rabin-Karp (O(1)).
Key Takeaway
Preprocessing cost matters in one-shot vs repeated searches.
For one-off searches: Boyer-Moore's preprocessing may dominate small patterns.
For repeated searches: preprocessing is amortised.

The Sliding Window Mental Model

All string searching algorithms share a sliding window mental model — a window of size m slides across the text. What differs is how each algorithm decides to slide:

  • Naive: Always slides by 1
  • KMP: Uses failure function to skip ahead — never re-examines matched characters
  • Boyer-Moore: Slides by the maximum of bad character and good suffix rules — can skip m characters at a time
  • Rabin-Karp: Computes a rolling hash so the window comparison is O(1) average
  • Z-Algorithm: Pre-computes how far each suffix matches the pattern prefix

Understanding this model makes it easy to reason about why each algorithm achieves its complexity.

Common Mistake
Using text.find() or 'in' in Python uses an optimised C implementation of Boyer-Moore-Horspool under the hood — so in production Python code you rarely need to implement these manually. They matter for interviews and understanding.
Production Insight
Sliding window mental model helps debug performance: if search is slow, ask 'how far does the window slide on a mismatch?'
Pythons's built-in uses two-way algorithm with O(n+m) — no pathological cases.
Rule: trust the runtime for production, learn algorithms for interviews.
Key Takeaway
All string search algorithms are just fancy sliding window logic.
The 'slide distance' determines performance — the larger the skip, the faster.
Know how each algorithm decides to shift.

Production-Grade String Search: Real-World Considerations

Beyond algorithm complexity, production string search faces additional challenges:

Unicode and multi-byte encodings: UTF-8 variable-length characters mean you can't just compare bytes — you need to align on code-point boundaries. Python's str handles this, but C/Java must be careful.

Memory impact: For very large texts (GBs), streaming implementations that avoid loading the entire text into memory are critical. Rabin-Karp and KMP can both be implemented as stream algorithms.

Pattern with regex features: Real-world search often needs wildcards, alternation, case-insensitivity. Dedicated regex engines (PCRE, RE2) use automata theory — beyond simple string matching.

Hardware acceleration: Modern CPUs have SIMD instructions (SSE4.2, AVX2) that can compare 16-32 bytes in parallel. libc's memmem uses micro-optimisations. In production, you're almost always better off relying on the standard library.

Production Insight
A BigQuery log pipeline using C++ memmem (SSE4.2) processed 10GB in 0.5 seconds. The same logic in Python with naive search took 3 minutes.
UTF-8 alignment mistakes caused off-by-one errors in a GDPR data redaction tool — the fix was to use ICU libraries.
Rule: for production string search, leverage the standard library; micro-optimisations are for specialised needs.
Key Takeaway
Production string search is about more than algorithm complexity.
Unicode handling, streaming, and SIMD acceleration matter more than the algorithm choice.
Always use the standard library's implementation; it's been battle-tested.
● Production incidentPOST-MORTEMseverity: high

The Log Parser That Took 12 Minutes to Scan a 2GB File

Symptom
Log parser scheduled every 5 minutes consistently took 12+ minutes to scan the current log file. Alerts fired for processing lag and backpressure built up.
Assumption
The team assumed 'modern computers can handle 2GB strings quickly' and that Python's str.find() would perform well enough. They didn't check the worst-case behaviour for their specific pattern.
Root cause
The pattern searched for 'AAA...AA' (50 As with occasional B). Naive search on repeated characters triggered the worst-case O(nm) comparisons on every alignment.
Fix
Switched to Python's built-in str.find() which uses a two-way algorithm (Breslauer-Galil) with O(n+m) worst-case. Scan time dropped from 12 minutes to 0.3 seconds.
Key lesson
  • Never assume linear runtime — worst-case inputs exist and will find your code.
  • Production string search in Python: use str.find() or re.search(). They are optimised C implementations with guaranteed performance.
  • Always profile with your actual data patterns, not random text.
Production debug guideIs your string search slow? Here's how to diagnose and fix it.4 entries
Symptom · 01
Search takes seconds on a file that should be instant
Fix
Estimate n and m. If n > 10^6 and m > 10, naive O(nm) is dangerous. Profile with timeit on a sample.
Symptom · 02
Search works locally but times out in production on specific inputs
Fix
Check pattern content: patterns with long runs of identical characters (e.g., 'AAAA...A') cause worst-case for naive and Boyer-Moore. Switch to KMP or use built-in.
Symptom · 03
Multi-pattern search is too slow with separate calls
Fix
Replace repeated single-pattern searches with a single Aho-Corasick pass. Example: antivirus signatures, keyword filtering.
Symptom · 04
Rolling hash collisions produce false positives or missed matches
Fix
Use double hashing (two independent moduli) or switch to a non-hashing algorithm like KMP when collisions cause production bugs.
★ String Search Debugging Cheat SheetQuick commands and checks for common string search performance and correctness issues in production.
Python str.find() unexpectedly slow
Immediate action
Check pattern length and character repetition. If pattern has many repeated chars, worst-case could be pathological.
Commands
python -m timeit -s "t='A'*10**6 + 'B'", p='A'*500 + 'B'" "t.find(p)"
test with random pattern to compare: python -m timeit -s "t='X'*10**6, p='X'*500" "t.find(p)"
Fix now
If slow, switch to re.search() which uses a tuned algorithm. Or preprocess: break pattern into pieces.
Rabin-Karp gives wrong answer on large texts+
Immediate action
Check for hash collisions. Increase modulus or use double hashing. Verify modulo arithmetic handles negative intermediate values.
Commands
print(hash(p) % MOD) in debug to see collisions
Implement second hash and compare both. If mismatch, it's a collision.
Fix now
Use double hashing with two primes (e.g., 109+7 and 109+9) to reduce collision probability to ~1e-18.
String Search Algorithms Comparison
AlgorithmPreprocessingSearch TimeSpaceBest For
NaiveO(1)O(nm)O(1)Short patterns, simple cases
KMPO(m)O(n)O(m)General purpose, repeated patterns
Rabin-KarpO(m)O(n+m) avgO(1)Multiple patterns, plagiarism detection
Boyer-MooreO(m+σ)O(n/m) bestO(m+σ)Long patterns, English text
Z-AlgorithmO(m)O(n+m)O(n+m)Competitive programming
Aho-CorasickO(Σm)O(n+matches)O(Σm)Multi-pattern search

Key takeaways

1
For production Python string search
use str.find() or re — optimised C implementations beat hand-rolled Python regardless of algorithm choice.
2
KMP guarantees O(n+m) even on adversarial inputs like AAAB in AAAA...A. Critical for log parsers, security-sensitive matching, bioinformatics.
3
Boyer-Moore is fastest in practice for long patterns in natural language text
why grep is fast. Longer patterns = faster search (O(n/m) best case).
4
Rabin-Karp with double hashing
right choice for multi-pattern search and plagiarism detection — O(n+m) average, simple to extend.
5
Aho-Corasick
only reasonable choice for thousands of simultaneous patterns — antivirus signatures, IDS rules, CDN keyword filtering.

Common mistakes to avoid

5 patterns
×

Implementing naive search in production

Symptom
Search takes 100x longer than expected on certain patterns. On-call gets paged for slow log processing.
Fix
Use str.find() or re.search() in Python, String.contains() in Java, .find() in C++. Never hand-roll string search in production.
×

Using Rabin-Karp without collision handling

Symptom
False positive matches in plagiarism detection or data dedup. Incorrect alerting or duplicated data.
Fix
Use double hashing (two independent moduli) or verify each candidate match by direct string comparison.
×

Assuming Boyer-Moore is always fastest

Symptom
Search is slow on short patterns (<5 chars) or on adversarial input (pattern with repeated characters).
Fix
Boyer-Moore excels only with long patterns (m > 10) in natural language. For short or repetitive patterns, use KMP.
×

Using KMP when you need multi-pattern search

Symptom
Looping KMP over 1000 patterns takes minutes. High CPU usage.
Fix
Use Aho-Corasick for multiple patterns. It processes all patterns in a single pass over the text.
×

Ignoring Unicode when searching

Symptom
Boyer-Moore bad character table breaks on multi-byte UTF-8 characters. False negatives or incorrect matches.
Fix
Use libraries that handle Unicode (ICU, std::regex with Unicode flag). For Java/Python strings, they are Unicode-safe internally.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
What is the worst-case input for the naive string search algorithm?
Q02SENIOR
How does KMP avoid backtracking in the text string?
Q03SENIOR
When would you choose Rabin-Karp over KMP?
Q04SENIOR
What is the bad character heuristic in Boyer-Moore?
Q05SENIOR
Explain the Aho-Corasick algorithm and its typical use case.
Q01 of 05JUNIOR

What is the worst-case input for the naive string search algorithm?

ANSWER
Any pattern that causes a mismatch at the last character on every alignment. Classic example: pattern = 'AAAB' and text = 'AAA...AAAAB' (many As followed by B). Each alignment compares m characters before the final mismatch, leading to O(nm) comparisons.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
Does Python's 'in' operator use a smart algorithm?
02
Which string search algorithm should I learn first?
03
Can I use Boyer-Moore for single-character patterns?
04
How do I handle very large texts (multi-gigabyte) in string search?
🔥

That's Strings. Mark it forged?

3 min read · try the examples if you haven't

Previous
Divide and Conquer Technique
1 / 8 · Strings
Next
KMP Algorithm — Knuth-Morris-Pratt String Matching