Aho-Corasick Algorithm β Multi-Pattern String Matching
- Build once, search many: O(m) preprocessing (m = total pattern length), O(n + k) search. Pay the build cost once, search terabytes.
- Failure links are KMP's failure function on a trie. Same concept, different data structure. If you understand KMP failure function, Aho-Corasick failure links are immediate.
- Output links are not optional: a node for 'she' must also report 'he' if 'he' is a pattern. Missing output links gives silently incorrect results on overlapping patterns.
In production network security, Snort IDS processes multi-gigabit traffic matching against tens of thousands of attack signatures simultaneously. Running KMP once per signature per packet is physically impossible at line rate. Aho-Corasick is what makes it work: build one automaton from all patterns, run traffic through once, catch every match of every signature in a single linear pass.
Published in 1975 by Alfred Aho and Margaret Corasick at Bell Labs, the algorithm generalises KMP's failure function to a multi-pattern trie. The failure links in Aho-Corasick are KMP failure function values applied to a trie structure. Once you understand one, the other is immediate.
Building the Trie
The first step is inserting all patterns into a trie (prefix tree). Each node represents a prefix shared by one or more patterns. Nodes that complete a pattern are marked as output nodes.
from collections import deque class AhoCorasick: def __init__(self): # Each node: {'char': next_node_id}, fail link, output list self.goto = [{}] # goto[node][char] = next_node self.fail = [0] # failure link self.output = [[]] # patterns ending at this node def add_pattern(self, pattern: str, pid: int): node = 0 for ch in pattern: if ch not in self.goto[node]: self.goto.append({}) self.fail.append(0) self.output.append([]) self.goto[node][ch] = len(self.goto) - 1 node = self.goto[node][ch] self.output[node].append(pid) ac = AhoCorasick() for i, p in enumerate(['he', 'she', 'his', 'hers']): ac.add_pattern(p, i) print(f'Trie nodes: {len(ac.goto)}') # 11
Building Failure Links
Failure links are Aho-Corasick's equivalent of KMP's failure function. For each node v (reached by character c from parent p), the failure link points to the longest proper suffix of the string represented by v that is also a prefix of some pattern.
Failure links are built with BFS from the root: 1. Root's children have fail = root 2. For each node v at depth d, and each character c: if v has a child for c, that child's fail = goto(fail[v], c). Otherwise, add a goto shortcut: goto[v][c] = goto[fail[v]][c].
def build(self): """Build failure links using BFS.""" q = deque() # Root's children: fail = root for ch, child in self.goto[0].items(): self.fail[child] = 0 q.append(child) while q: node = q.popleft() for ch, child in self.goto[node].items(): # Compute fail for child f = self.fail[node] while f != 0 and ch not in self.goto[f]: f = self.fail[f] self.fail[child] = self.goto[f].get(ch, 0) if self.fail[child] == child: self.fail[child] = 0 # Merge output: child inherits from its fail link self.output[child] += self.output[self.fail[child]] q.append(child)
Searching the Text
With failure links built, searching is a single linear pass through the text β follow goto transitions, fall back via failure links on mismatch, and collect outputs.
def search(self, text: str) -> list[tuple[int,int]]: """Returns (end_position, pattern_id) for each match.""" node = 0 results = [] for i, ch in enumerate(text): while node != 0 and ch not in self.goto[node]: node = self.fail[node] node = self.goto[node].get(ch, 0) for pid in self.output[node]: results.append((i, pid)) return results # Full example ac = AhoCorasick() patterns = ['he', 'she', 'his', 'hers'] for i, p in enumerate(patterns): ac.add_pattern(p, i) ac.build() matches = ac.search('ahishers') for end, pid in matches: p = patterns[pid] print(f"'{p}' ends at index {end} (starts at {end-len(p)+1})")
'he' ends at index 5 (starts at 4)
'hers' ends at index 7 (starts at 4)
'she' ends at index 6 (starts at 4)
Complexity Analysis
Let n = text length, m = sum of all pattern lengths, k = number of matches.
Build trie: O(m) Build failure links: O(m Γ |alphabet|) using the shortcut approach, or O(m) with optimised BFS Search: O(n + k) Total: O(n + m + k)
This is optimal β you must read all input and report all matches. The number of patterns does not appear in the complexity beyond the preprocessing step. Searching 10,000 patterns takes the same time as searching 1, once built.
π― Key Takeaways
- Build once, search many: O(m) preprocessing (m = total pattern length), O(n + k) search. Pay the build cost once, search terabytes.
- Failure links are KMP's failure function on a trie. Same concept, different data structure. If you understand KMP failure function, Aho-Corasick failure links are immediate.
- Output links are not optional: a node for 'she' must also report 'he' if 'he' is a pattern. Missing output links gives silently incorrect results on overlapping patterns.
- Production use: ClamAV virus signatures, Snort/Suricata IDS, Google RE2, CDN keyword filtering. Multi-pattern at scale means Aho-Corasick.
- For Python: use pyahocorasick (C extension). Pure-Python is too slow β the interpreter overhead kills performance at search scale.
Interview Questions on This Topic
- QHow does Aho-Corasick differ from running KMP once per pattern?
- QWhat is the role of failure links in Aho-Corasick?
- QWhat are output links and why are they needed?
- QGive a real-world use case where Aho-Corasick is the right choice.
Frequently Asked Questions
How is Aho-Corasick related to KMP?
Aho-Corasick generalises KMP to multiple patterns. KMP's failure function is essentially Aho-Corasick applied to a single pattern. The failure links in Aho-Corasick serve exactly the same role β finding the next valid state on mismatch.
What if patterns overlap or one is a prefix of another?
Aho-Corasick handles this through output links β when a node is reached, it reports matches from the node itself and from all nodes reachable via failure/output links. Patterns that are prefixes of others will fire at intermediate nodes.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.