String Searching Algorithms: Overview and Comparison
Every production system that processes text does string searching. Log parsers, protocol handlers, search engines, DNA alignment tools, plagiarism detectors: all of them have string search at their core. The algorithm you choose matters more than most engineers realise until they hit the wrong one at scale.
The naive approach is O(nm). For a 1GB log file (n = 10^9) and a 100-char pattern, that is 10^11 character comparisons: at roughly 10^9 comparisons per second, on the order of 100 seconds. Boyer-Moore on the same problem is sub-linear, O(n/m) in practice: under a second.
This guide maps the landscape. Up front: KMP for guaranteed linear time on repetitive patterns, Boyer-Moore for long patterns in English text (fastest in practice), Rabin-Karp for multi-pattern search, Aho-Corasick when you have thousands of patterns simultaneously.
The Naive / Brute Force Approach
The simplest approach: slide the pattern across the text one character at a time and check for a match at each position.
Time complexity: O(nm) worst case, which occurs with patterns like AAAB in text AAAA...A.
Space complexity: O(1)
Despite being slow, the naive approach performs well in practice for typical English text because mismatches occur early.
```python
def naive_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        # Compare the pattern against the window starting at i.
        if text[i:i+m] == pattern:
            positions.append(i)
    return positions

print(naive_search('AABAACAADAABAABA', 'AABA'))  # [0, 9, 12]
```
Algorithm Landscape β When to Use What
Each algorithm solves the same problem but excels in different scenarios:
KMP (Knuth-Morris-Pratt): Best when the pattern has repeated internal structure. Guarantees O(n+m) with no worst-case degradation. The go-to general purpose algorithm.
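The failure function is the heart of KMP: for each prefix of the pattern it records the length of the longest proper prefix that is also a suffix, so a mismatch tells us exactly how far to fall back without re-reading the text. A minimal sketch for illustration, not a tuned library routine:

```python
def build_failure(pattern: str) -> list[int]:
    """failure[i] = length of the longest proper prefix of
    pattern[:i+1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back through shorter prefix-suffixes until one extends.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure

def kmp_search(text: str, pattern: str) -> list[int]:
    failure = build_failure(pattern)
    positions, k = [], 0
    for i, ch in enumerate(text):
        # Never move backwards in the text; only adjust k via the failure links.
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 1)
            k = failure[k - 1]  # keep scanning for overlapping matches
    return positions

print(kmp_search('AABAACAADAABAABA', 'AABA'))  # [0, 9, 12]
```

Note that the text index `i` only ever advances, which is exactly why the O(n+m) bound holds even on adversarial inputs.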
Rabin-Karp: Best for multiple pattern search or plagiarism detection. Uses hashing: O(n+m) average, O(nm) worst case, though the worst case is rare in practice.
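The rolling hash that makes Rabin-Karp work can be sketched as follows. This uses a single modulus for brevity and verifies every hash hit to rule out collisions; production code would typically use double hashing instead:

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    positions = []
    for i in range(n - m + 1):
        # A hash match is only a candidate: verify to rule out collisions.
        if t_hash == p_hash and text[i:i+m] == pattern:
            positions.append(i)
        if i < n - m:
            # Roll the hash: drop text[i], append text[i+m]. O(1) per step.
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return positions

print(rabin_karp('AABAACAADAABAABA', 'AABA'))  # [0, 9, 12]
```

For multi-pattern search the same idea extends naturally: hash all patterns of one length into a set and test each window hash against it.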
Boyer-Moore: Best for long patterns in English-like text. Skips characters aggressively, giving sub-linear O(n/m) performance in practice. Used in real-world search tools such as grep.
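A simplified sketch using only the bad character rule: compare right to left, and on a mismatch shift the pattern so its last occurrence of the offending text character lines up. Full Boyer-Moore adds the good suffix rule, omitted here for brevity:

```python
def boyer_moore_bad_char(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Rightmost index of each character in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    positions = []
    i = 0  # alignment of the pattern's left end within the text
    while i <= n - m:
        j = m - 1
        # Compare right to left within the current alignment.
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            positions.append(i)
            i += 1  # naive shift after a match keeps this sketch simple
        else:
            # Shift so the pattern's last occurrence of the bad character
            # aligns with it; characters absent from the pattern skip j+1.
            bad = text[i + j]
            i += max(1, j - last.get(bad, -1))
    return positions

print(boyer_moore_bad_char('AABAACAADAABAABA', 'AABA'))  # [0, 9, 12]
```

When the bad character is absent from the pattern entirely, the shift is j+1, which is how long patterns over large alphabets skip most of the text.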
Z-Algorithm: Elegant and easy to implement. O(n+m). Great for problems solved via the concatenation trick: build pattern + separator + text and read matches directly off the Z-array.
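The concatenation trick can be sketched as follows. The Z-array stores, for every position, how far the string starting there matches the string's own prefix; a Z-value equal to the pattern length inside the text portion marks a match:

```python
def z_array(s: str) -> list[int]:
    """z[i] = length of the longest substring starting at i
    that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    z[0] = n
    l = r = 0  # [l, r) is the rightmost prefix-matching segment found so far
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])  # reuse previously computed matches
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def z_search(text: str, pattern: str, sep: str = '\x00') -> list[int]:
    # sep must be a character that appears in neither string.
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]

print(z_search('AABAACAADAABAABA', 'AABA'))  # [0, 9, 12]
```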
Aho-Corasick: Best when searching for many patterns simultaneously. O(n + total pattern length + matches). Used in antivirus engines and network intrusion detection.
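A compact trie-plus-failure-links sketch of Aho-Corasick, illustrative rather than memory-optimised. The failure links play the same role as KMP's failure function, but across the whole pattern set:

```python
from collections import deque

class AhoCorasick:
    def __init__(self, patterns: list[str]):
        # Parallel arrays: child maps, failure links, and output pattern lists.
        self.children = [{}]
        self.fail = [0]
        self.out = [[]]
        for pat in patterns:
            node = 0
            for ch in pat:
                if ch not in self.children[node]:
                    self.children.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.children[node][ch] = len(self.children) - 1
                node = self.children[node][ch]
            self.out[node].append(pat)
        # BFS to set failure links: longest proper suffix that is a trie path.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0) if f or ch in self.children[0] else 0
                if self.fail[child] == child:
                    self.fail[child] = 0
                # Inherit matches that end at the failure target.
                self.out[child] += self.out[self.fail[child]]

    def search(self, text: str) -> list[tuple[int, str]]:
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for pat in self.out[node]:
                hits.append((i - len(pat) + 1, pat))
        return hits

ac = AhoCorasick(['he', 'she', 'his', 'hers'])
print(ac.search('ushers'))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

One pass over the text reports every occurrence of every pattern, which is why total cost is O(n + total pattern length + matches) rather than one full scan per pattern.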
The Sliding Window Mental Model
All string searching algorithms share a sliding window mental model: a window of size m slides across the text. What differs is how each algorithm decides to slide:
- Naive: always slides by 1
- KMP: uses the failure function to skip ahead, never re-examining matched characters
- Boyer-Moore: slides by the maximum of the bad character and good suffix rules, and can skip up to m characters at a time
- Rabin-Karp: computes a rolling hash so each window comparison is O(1) on average
- Z-Algorithm: pre-computes how far the string starting at each position matches the pattern prefix
Understanding this model makes it easy to reason about why each algorithm achieves its complexity.
Complexity Comparison Table
A summary of the major string searching algorithms by preprocessing time, search time, and space requirements (σ is the alphabet size; Σm is the total length of all patterns).
| Algorithm | Preprocessing | Search Time | Space | Best For |
|---|---|---|---|---|
| Naive | O(1) | O(nm) | O(1) | Short patterns, simple cases |
| KMP | O(m) | O(n) | O(m) | General purpose, repeated patterns |
| Rabin-Karp | O(m) | O(n+m) avg | O(1) | Multiple patterns, plagiarism detection |
| Boyer-Moore | O(m+σ) | O(n/m) best | O(m+σ) | Long patterns, English text |
| Z-Algorithm | O(m) | O(n+m) | O(n+m) | Competitive programming |
| Aho-Corasick | O(Σm) | O(n+matches) | O(Σm) | Multi-pattern search |
Key Takeaways
- For production Python string search: use str.find() or re. Optimised C implementations beat hand-rolled Python regardless of algorithm choice.
- KMP guarantees O(n+m) even on adversarial inputs like AAAB in AAAA...A. Critical for log parsers, security-sensitive matching, and bioinformatics.
- Boyer-Moore is fastest in practice for long patterns in natural language text, which is why grep is fast. Longer patterns mean faster search (O(n/m) best case).
- Rabin-Karp with double hashing is the right choice for multi-pattern search and plagiarism detection: O(n+m) average, simple to extend.
- Aho-Corasick is the only reasonable choice for thousands of simultaneous patterns: antivirus signatures, IDS rules, CDN keyword filtering.
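The first takeaway in practice looks like this. A quick sketch of the built-ins (log_line is a made-up example string):

```python
import re

log_line = "2024-05-01 12:00:01 ERROR connection reset by peer"

# str.find returns the first match index, or -1 when absent.
idx = log_line.find("ERROR")

# 'in' for a plain membership test.
has_error = "ERROR" in log_line

# re for patterns rather than literals; compile once if reused.
ts = re.compile(r"\d{4}-\d{2}-\d{2}")
match = ts.search(log_line)

print(idx, has_error, match.group() if match else None)
```

Both str.find and 'in' dispatch to the same optimised C search under the hood, so there is no performance reason to prefer one over the other.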
Interview Questions on This Topic
- Q: What is the worst-case input for the naive string search algorithm?
- Q: How does KMP avoid backtracking in the text string?
- Q: When would you choose Rabin-Karp over KMP?
- Q: What is the bad character heuristic in Boyer-Moore?
Frequently Asked Questions
Does Python's 'in' operator use a smart algorithm?
Yes. CPython's str.__contains__ uses a two-way string matching algorithm, optimised in C. For production Python code, just use 'in' or str.find(). Manual implementations are for learning and interviews.
Which string search algorithm should I learn first?
KMP: it is the most commonly asked in interviews, has a clean O(n+m) guarantee, and the failure function concept appears in many other problems.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.