Senior 6 min · March 24, 2026

Aho-Corasick — Missing Output Links Drop 40% of Matches

One missing line in BFS construction caused an IDS to miss 40% of attacks with zero errors.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Aho-Corasick builds a single automaton from all patterns, then searches text in one pass — O(n + k) regardless of pattern count
  • Trie construction: insert every pattern into a shared prefix tree in O(m) time, where m is total pattern length
  • Failure links generalize KMP's failure function to the trie — they point to the longest suffix that's also a valid prefix of some pattern in the set
  • Output links merge overlapping matches — without them, 'she' silently swallows 'he' when both are patterns and the text contains 'she'
  • Production systems like Snort IDS and ClamAV match tens of thousands of signatures at line rate using this algorithm
  • Goto shortcuts computed during the BFS build step are what keep search O(n) — without them, adversarial input degrades to O(n × trie_depth)
  • Biggest mistake: treating it as a regex replacement — Aho-Corasick matches exact strings only, no wildcards, no character classes
Plain-English First

KMP finds one pattern efficiently. But what if you need to find 1,000 patterns simultaneously? Running KMP 1,000 times is O(1,000 × n) — which on a live network stream at 10 Gbps is simply not viable. Aho-Corasick builds a single 'search automaton' from all patterns upfront, then streams the text through it exactly once, finding every pattern at the same time. Think of it as the difference between sending 1,000 individual inspectors through a warehouse one after another versus wiring the whole building with sensors before anyone walks in. The wiring takes time. But once it's done, a single walk-through catches everything.

Running KMP once per pattern scales linearly with pattern count. For 10,000 signatures on a 1 Gbps traffic stream, that's physically impossible at line rate. Aho-Corasick solves this by building a single deterministic automaton from all patterns upfront, then running the text through it exactly once — regardless of whether you have 10 patterns or 100,000.

Published in 1975 by Alfred Aho and Margaret Corasick at Bell Labs, the algorithm generalizes KMP's failure function to a trie. The failure links in Aho-Corasick are KMP failure function values applied across a shared prefix tree. If you understand why KMP doesn't restart from position zero on a mismatch, Aho-Corasick failure links will feel immediately familiar — the mechanics are identical, just extended to a tree instead of a linear string.

The trade-off is upfront preprocessing cost and a hard constraint: exact strings only. No wildcards, no quantifiers, no character classes. That limitation is precisely why it's fast — and precisely why it's the right tool when you need raw throughput on known, fixed signatures. When you need regex semantics at scale, you reach for Hyperscan. When you need to match a fixed dictionary of strings against a stream, you reach for Aho-Corasick.

Building the Trie

The first step is inserting all patterns into a trie — a prefix tree where each node represents a character on the path from root to some suffix. Nodes that complete a full pattern are marked as output nodes, storing the pattern ID (or the pattern string itself, depending on what you need to report).

The key property of the trie is prefix sharing. If two patterns share a common prefix, they share the corresponding nodes in the tree. Patterns 'http' and 'https' share five nodes — you only pay for the single divergent node at the 's'. A collection of 10,000 URL patterns with common prefixes like '/api/v1/', '/api/v2/' might compress to 2,000 trie nodes. A collection of 10,000 random hex strings compresses to almost nothing — nearly every character is a new branch.

The implementation below uses a list of dictionaries for the goto table. Each dictionary maps a character to the next node index. This is readable and correct for moderate-sized pattern sets with ASCII input. It breaks down with large alphabets (UTF-8 codepoints as characters rather than bytes) or very large node counts — at that scale the per-dict overhead in Python becomes a memory and cache-pressure problem. The correct fix in that scenario is pyahocorasick, which uses a C-backed double-array trie where each transition is a direct array index.

io/thecodeforge/algo/aho_corasick_trie.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
# io.thecodeforge.algo.aho_corasick_trie
from collections import deque


class AhoCorasick:
    """Aho-Corasick automaton for multi-pattern exact string matching.

    Build once with add_pattern() + build(), then call search() repeatedly.
    The automaton is immutable after build() — adding patterns after build
    requires a full rebuild.
    """

    def __init__(self):
        # goto[node][char] = next_node_id
        # Node 0 is the root. All failure links eventually reach 0.
        self.goto: list[dict[str, int]] = [{}]
        self.fail: list[int] = [0]
        # output[node] = list of pattern IDs that end at this node
        # After build(), also includes IDs inherited from failure link targets
        self.output: list[list[int]] = [[]]

    def add_pattern(self, pattern: str, pid: int) -> None:
        """Insert a pattern into the trie. Call before build()."""
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                # Allocate a new node
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.output[node].append(pid)


# --- Usage example ---
ac = AhoCorasick()
patterns = ['he', 'she', 'his', 'hers']
for i, p in enumerate(patterns):
    ac.add_pattern(p, i)

print(f'Trie nodes: {len(ac.goto)}')  # 11
# 'he', 'she' share no prefix with 'his', 'hers'
# 'his' and 'hers' share 'h' — one node
# Total unique prefixes across all four patterns: 11 nodes including root
Output
Trie nodes: 11
The Trie as a Compression Machine
  • Patterns 'she' and 'share' share 'sh' — 2 nodes built once instead of twice
  • Patterns 'http' and 'https' share 4 characters — one divergent node for the 's' in 'https'
  • 10,000 URL patterns with common path prefixes might compress to 2,000 trie nodes — a 5x reduction before any searching begins
  • The compression ratio depends entirely on prefix overlap — random strings with no shared prefixes produce a flat tree with no savings
  • Memory cost is O(m) in total pattern length, but the constant factor is dominated by alphabet size and prefix overlap — profile with real data before assuming
Production Insight
Trie node count grows with alphabet size, not just pattern count. A pattern set that produces 5,000 nodes on ASCII characters can produce 50,000 nodes on raw UTF-8 bytes because multi-byte characters create three or four levels of branching per logical character. Before deploying on non-ASCII input, profile actual node count and memory with a representative sample of your real pattern data. The numbers can surprise you — a modest set of emoji-containing patterns from a content moderation system once produced a trie 12x larger than the ASCII equivalent.
Key Takeaway
Every shared prefix is a free optimization — 'http' and 'https' share 4 nodes and only diverge at the 5th character.
Prefix overlap is the difference between a 50 MB and a 500 MB automaton for the same logical pattern count.
Always profile node count with real pattern data before production deployment — theoretical estimates based on pattern count alone are unreliable.
Choosing the Right Trie Representation
IfPattern set under 1,000 patterns, ASCII only, Python prototype or internal tool
UseDict-based goto table is fine. Simple to implement, trivial to debug, fast enough for most internal workloads.
IfPattern set over 10,000 patterns, or Unicode alphabet, or throughput requirement above 100 MB/s
UseUse pyahocorasick (C extension with double-array trie). Pure-Python dict nodes cost 200-400 bytes each; C arrays cost 8. At 500,000 nodes the difference is several gigabytes.
IfPatterns have high prefix overlap — URL paths, file system paths, HTTP headers
UseTrie compression is excellent. Expect 3-5x node reduction versus raw character count. The build cost is low and the memory footprint compact.
IfPatterns are random or near-random with minimal shared prefixes — hashes, UUIDs, hex signatures
UseTrie compression is minimal. Every pattern adds nearly as many nodes as it has characters. Consider hash-set membership tests or Rabin-Karp rolling hashes instead.

Failure links are what separate Aho-Corasick from a trie with a linear scan per pattern. For each trie node v — reached by the path of characters representing some string S — the failure link points to the node representing the longest proper suffix of S that is also a valid prefix of some pattern in the set. If no such suffix exists, the failure link points to the root.

This is KMP's failure function, applied to every node in the trie simultaneously. When the search is at node v and the next character has no goto edge, following fail[v] gives you the best valid position to continue from — no characters rescanned, no match opportunity lost.

Failure links are built with BFS from the root, level by level. The ordering is non-negotiable: 1. Root's direct children all get fail = root (the root is its own failure destination) 2. For each node v at depth d, and each character c where v has a child: that child's failure link is goto(fail[v], c). This requires fail[v] to already be computed — which is why BFS level-ordering works and DFS does not. 3. The goto shortcut step: for characters not present in v's children, set goto[v][c] = goto[fail[v]][c]. This precomputes the 'where would failure take me?' step for every character, so search never needs to traverse a chain of failure links — one goto lookup is always sufficient.

The output merge happens here too, in the same BFS pass: self.output[child] += self.output[self.fail[child]]. This is covered in depth in the Output Links section, but it belongs in the build step — not in the search loop.

io/thecodeforge/algo/aho_corasick_fail.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
# io.thecodeforge.algo.aho_corasick_fail
# This method is added to the AhoCorasick class defined in aho_corasick_trie.py

    def build(self) -> None:
        """Build failure links and goto shortcuts using BFS.

        Must be called once after all patterns are added via add_pattern().
        After this call, the automaton is ready for search().
        Adding new patterns after build() requires a full rebuild.
        """
        q: deque[int] = deque()

        # Root's direct children: failure link = root (node 0)
        # Also compute goto shortcuts for root: any character not in root's
        # children loops back to root (same as KMP resetting to start)
        all_chars = set(ch for node in self.goto for ch in node)
        for ch in all_chars:
            if ch not in self.goto[0]:
                self.goto[0][ch] = 0  # Shortcut: unknown char at root -> stay at root

        for ch, child in list(self.goto[0].items()):
            if child != 0:  # Skip the shortcuts we just added
                self.fail[child] = 0
                q.append(child)

        while q:
            node = q.popleft()
            for ch, child in list(self.goto[node].items()):
                # Compute failure link for child:
                # Follow fail[node] until we find a node with a goto for ch
                # (or reach root). With root shortcuts, this is always O(1).
                f = self.fail[node]
                self.fail[child] = self.goto[f].get(ch, 0)

                # Guard against self-referential failure links
                # (can occur at root level without the shortcut above)
                if self.fail[child] == child:
                    self.fail[child] = 0

                # CRITICAL: merge output from failure link target
                # Without this line, patterns that are suffixes of other patterns
                # are silently dropped from results
                self.output[child] = self.output[child] + self.output[self.fail[child]]

                # Compute goto shortcuts for child:
                # For every character in the alphabet that child has no edge for,
                # precompute where failure would take us. This makes search O(1) per char.
                for c in all_chars:
                    if c not in self.goto[child]:
                        self.goto[child][c] = self.goto[self.fail[child]].get(c, 0)

                q.append(child)
The Output Link Merge Is Not Optional
The line self.output[child] = self.output[child] + self.output[self.fail[child]] is the single most commonly omitted step in Aho-Corasick implementations, including production ones that have been running for months before the bug surfaces. Without it, patterns that are suffixes of other patterns are silently dropped from results. The algorithm still runs. It still returns matches for patterns with no suffix relationship. It just silently omits anything inherited through failure links. In a content filter, those omissions are policy violations that leak through. In an IDS, they are attacks that go undetected. There is no error, no warning, no incorrect match — just missing data.
Production Insight
BFS construction order is a hard requirement, not a style preference. Failure link computation for a child node requires the parent's failure link to already be resolved. BFS processes nodes level by level, guaranteeing this invariant. DFS processes children before their sibling subtrees are complete, which can produce failure links pointing at nodes whose own failure links haven't been computed yet. The symptom is subtle — some inputs produce correct results, others produce incorrect results or hang. Always use deque-based BFS. Never use recursion for failure link construction.
Key Takeaway
Failure links are KMP's failure function applied to every node in the trie — same mechanics, different data structure.
BFS is the required construction order: parent failure links must be resolved before children, without exception.
Goto shortcuts during build transform search from O(n × depth) to O(n) — they are what make the algorithm usable on adversarial input.
Failure Link Debugging Decision Tree
IfSearch enters an infinite loop or hangs on specific input
UseFailure links form a cycle. The self.fail[child] == child guard is missing, or root shortcuts weren't properly initialized. Add the guard and verify root's goto table has entries for all observed characters.
IfSome patterns never match even though they appear in the trie
UseOutput links aren't merged. Add self.output[child] += self.output[self.fail[child]] in the BFS loop, immediately after computing fail[child]. Verify with the 'she'/'he' smoke test.
IfSearch is correct but slower than expected on repetitive or adversarial input
UseGoto shortcuts weren't computed during build. The search loop is falling back through failure chains instead of taking O(1) transitions. Add the shortcut precomputation step in BFS.
IfBuild produces correct results locally but incorrect results in production
UseCheck whether patterns are being added after build() is called. The automaton is immutable after build. New patterns require a full rebuild.

Searching the Text

With failure links and goto shortcuts built, searching is a single linear pass through the text. For each character you follow a goto edge — which is O(1) because shortcuts were precomputed during build — land on the next node, and check its output list. The output list at each node already contains matches inherited from failure link targets, because the build step merged them. There is no chain to follow during search, no backtracking, no lookahead.

The search loop is simple enough to fit in a few lines. That simplicity is the result of doing all the hard work during build — the automaton encodes every mismatch-recovery decision as a precomputed edge, so the search just follows edges and collects outputs.

io/thecodeforge/algo/aho_corasick_search.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# io.thecodeforge.algo.aho_corasick_search
# This method is added to the AhoCorasick class

    def search(self, text: str) -> list[tuple[int, int]]:
        """Search text for all patterns. Returns (end_pos, pattern_id) pairs.

        end_pos is the index of the last character of the match.
        start_pos = end_pos - len(patterns[pattern_id]) + 1

        Requires build() to have been called first.
        Time complexity: O(n + k) where n = len(text), k = match count.
        """
        node = 0
        results: list[tuple[int, int]] = []
        for i, ch in enumerate(text):
            # With goto shortcuts, this is always a single dict lookup — O(1)
            # No failure link chain traversal needed during search
            node = self.goto[node].get(ch, 0)
            # output[node] already includes matches from failure link targets
            # — the build step merged them. No additional chain traversal here.
            for pid in self.output[node]:
                results.append((i, pid))
        return results


# --- Full example ---
ac = AhoCorasick()
patterns = ['he', 'she', 'his', 'hers']
for i, p in enumerate(patterns):
    ac.add_pattern(p, i)
ac.build()

matches = ac.search('ahishers')
for end, pid in matches:
    p = patterns[pid]
    print(f"'{p}' ends at index {end} (starts at {end - len(p) + 1})")
Output
'his' ends at index 3 (starts at 1)
'he' ends at index 5 (starts at 4)
'hers' ends at index 7 (starts at 4)
'she' ends at index 6 (starts at 4)
Why One Pass Is Enough
  • Each character in the text is examined exactly once — the input pointer never moves backward
  • Failure links handle the 'could a match start here?' question at every position without rescanning
  • Goto shortcuts make each character transition O(1) — no chain traversal during search, only during build
  • Output lists at each node already contain inherited matches from failure link targets — the build step paid that cost upfront
  • The automaton state after processing character i encodes the complete match context — no buffer, no lookback needed
Production Insight
If your search implementation contains a while loop inside the character loop — traversing failure links during search — then goto shortcuts were not computed during build. This is the most common performance bug in hand-rolled implementations. It produces correct results on typical input but degrades significantly on adversarial or highly repetitive text, where failure link chains can be O(trie_depth) per character. Verify goto shortcuts by checking that every node's goto dict has entries for all characters in the observed alphabet, not just the characters present in its pattern children.
Key Takeaway
One pass through the text. Every match at every position reported. No rescanning.
The O(n + k) search complexity is only achievable when goto shortcuts are precomputed during build — otherwise it degrades on adversarial input.
The preprocessing cost is O(m) — you pay it once and search arbitrarily large text volumes against the same pattern set.

Complexity Analysis

Let n = text length, m = sum of all pattern lengths, k = total number of matches found.

Trie construction: O(m) — insert each character of each pattern once Failure link construction: O(m × |Σ|) with goto shortcut precomputation, where |Σ| is the alphabet size. For ASCII (|Σ| = 128) this is O(128m), which is O(m) with a constant factor. For full Unicode codepoints (|Σ| ≈ 140,000) this factor becomes significant — byte-level tries (|Σ| = 256) are the standard production choice for Unicode text. Search: O(n + k) — each character examined once, each match reported once Total: O(n + m + k)

This is provably optimal — you must read all input and report all matches. The critical observation is that pattern count does not appear in the search complexity. Searching with 10,000 patterns takes the same time as searching with 1, once the automaton is built. That is the entire value proposition of the algorithm.

The alphabet size factor in failure link construction is worth understanding before deployment. A byte-level trie (|Σ| = 256) has a constant factor roughly 10x larger than an ASCII lowercase trie (|Σ| = 26). For most production pattern sets this is absorbed — the build runs once and the search runs millions of times. But for very large pattern sets (millions of patterns) with byte-level alphabets, the build step can take tens of seconds. Profile it before assuming it's negligible.

Where This Is Actually Used in Production
Snort and Suricata (open-source network IDS/IPS) use Aho-Corasick variants to match tens of thousands of attack signatures against packet payloads at multi-gigabit throughput — a single-pass inspection that would be physically impossible with per-pattern KMP. ClamAV uses it for virus signature databases containing hundreds of thousands of patterns. Google's RE2 uses Aho-Corasick as one component in its literal extraction pipeline — before running an NFA, RE2 extracts required literal substrings from the regex and uses Aho-Corasick to pre-filter documents that can't possibly match. CloudFlare has documented using Aho-Corasick for HTTP request inspection at their edge. The algorithm is 50 years old and still the standard answer when the requirement is exact-string multi-pattern matching at scale.
Production Insight
Build cost is O(m) — pay it once, search indefinitely. But 'indefinitely' assumes the pattern set is static. Dynamic pattern sets require full rebuilds because adding a single pattern can change failure links throughout the trie. The production approach is to queue pattern updates and rebuild on a schedule — typically every 5 minutes or every N new patterns, whichever comes first. For zero-downtime updates, build the new automaton on a background thread and atomically swap the reference. A C-backed implementation rebuilds 100,000 patterns in well under a second.
Key Takeaway
O(n + k) search — pattern count does not appear in the complexity after the automaton is built.
Building on byte-level alphabets (|Σ| = 256) costs 2-10x more than ASCII-only builds — profile before assuming the build step is negligible at scale.
Batch pattern updates and rebuild on a schedule — per-pattern rebuilds are operationally unsustainable above a few thousand patterns.
When Aho-Corasick Is the Right Choice
IfNeed to match 100+ fixed exact patterns against a text stream, throughput matters
UseAho-Corasick. It's the only algorithm where search time is independent of pattern count. This is the canonical use case.
IfNeed to match a single pattern efficiently
UseKMP or Boyer-Moore. Aho-Corasick's trie construction overhead is not justified for one pattern.
IfNeed wildcards, character classes, quantifiers, or any regex feature
UseRegex engine — RE2 for linear-time safety, Hyperscan for hardware-accelerated throughput. Aho-Corasick matches exact strings only and has no mechanism for pattern variables.
IfText is very short (under 200 characters) and pattern set changes frequently
UseHash set membership or naive search. Automaton build cost dominates when text is short and rebuild frequency is high.
IfNeed regex speed but can extract required literals from your regex patterns
UseAho-Corasick as a pre-filter, regex for confirmation. This is what RE2 does internally — skip the regex entirely if no required literal substrings match.

Output links deserve their own section because they are the single most common source of production bugs in Aho-Corasick implementations, and because the failure mode is the worst kind: silent, correct-looking, and only detectable by knowing what should have been found.

The problem arises with overlapping patterns. Consider patterns 'she' and 'he'. In the trie, 'she' is represented by the path root → s → h → e. The node at the end of this path has fail pointing to the node representing 'he' (root → h → e), because 'he' is the longest suffix of 'she' that is also a pattern prefix. When the search reaches the 'she' terminal node, it should report both 'she' and 'he' — 'she' because that pattern ends here, and 'he' because the text at this position also ends a valid match for 'he'.

Without output merging, only 'she' is reported. The failure link is structurally correct — it points to the right node. But the output list at the 'she' terminal node only contains 'she'. The output at the 'he' node was never propagated.

The fix is one line in the BFS build loop: self.output[child] += self.output[self.fail[child]]. This propagates the output list from the failure link target into the current node at build time, so the search loop never needs to traverse failure links to collect outputs — it just reads the output list directly.

This is not a fringe case. Any pattern set that contains a pattern which is a suffix of another pattern exhibits this relationship. In URL pattern sets, '/api' is a suffix of '/internal/api'. In signature databases, short attack signatures are frequently substrings of longer ones. In content moderation, short slur variants are often substrings of longer ones. The overlap is the norm, not the exception.

io/thecodeforge/algo/output_links_demo.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
# io.thecodeforge.algo.output_links_demo
# Side-by-side demonstration: broken vs correct output merging
# Both versions build failure links correctly — only output handling differs

from collections import deque


def build_broken(ac) -> None:
    """BROKEN: failure links correct, output merging missing.
    The automaton navigates correctly but silently drops suffix-pattern matches.
    """
    q: deque[int] = deque()
    for ch, child in ac.goto[0].items():
        ac.fail[child] = 0
        q.append(child)
    while q:
        node = q.popleft()
        for ch, child in ac.goto[node].items():
            f = ac.fail[node]
            while f != 0 and ch not in ac.goto[f]:
                f = ac.fail[f]
            ac.fail[child] = ac.goto[f].get(ch, 0)
            if ac.fail[child] == child:
                ac.fail[child] = 0
            # BUG: the following line is missing
            # ac.output[child] += ac.output[ac.fail[child]]
            q.append(child)


def build_correct(ac) -> None:
    """CORRECT: failure links and output merging both present."""
    q: deque[int] = deque()
    for ch, child in ac.goto[0].items():
        ac.fail[child] = 0
        q.append(child)
    while q:
        node = q.popleft()
        for ch, child in ac.goto[node].items():
            f = ac.fail[node]
            while f != 0 and ch not in ac.goto[f]:
                f = ac.fail[f]
            ac.fail[child] = ac.goto[f].get(ch, 0)
            if ac.fail[child] == child:
                ac.fail[child] = 0
            # CORRECT: propagate outputs from failure link target
            ac.output[child] = ac.output[child] + ac.output[ac.fail[child]]
            q.append(child)


# Test: patterns ['she', 'he'], text 'she'
# Expected matches: both 'she' (ends at index 2) and 'he' (ends at index 2)
# Broken result:  [(2, 'she')]          — 'he' silently dropped
# Correct result: [(2, 'she'), (2, 'he')] — both reported
Output
# Broken: [(2, 'she')]
# Correct: [(2, 'she'), (2, 'he')]
Silent Data Loss Is Worse Than a Crash
A crash tells you immediately that something is wrong. Silent data loss hides the problem until the consequences surface — and in security systems, the consequence is undetected attacks. Missing output links produce silently incomplete results: the algorithm runs at full speed, returns results, passes most unit tests, and looks completely normal in monitoring dashboards. It just doesn't return all the matches. The only way to catch this is to test explicitly with overlapping patterns and assert that the total match count equals the expected count. Always. Before every deployment.
Production Insight
Output link bugs are invisible in unit tests that use non-overlapping patterns. A test suite with patterns ['apple', 'banana', 'cherry'] in text 'I eat apple and banana' will pass with or without output merging, because none of these patterns is a suffix of another. The bug only manifests when the pattern set contains suffix relationships — which is common in real pattern sets and rare in hand-crafted tests. The mandatory test case: patterns = ['she', 'he', 'abc', 'bc']; text = 'she xbc'; expected_match_count = 4. If you get fewer than 4, output merging is broken.
Key Takeaway
Output links are a correctness requirement, not an optimisation — without them the algorithm produces silently wrong results.
The bug is invisible in tests with non-overlapping patterns — you must test specifically with suffix relationships.
One line in the BFS build loop fixes it: self.output[child] = self.output[child] + self.output[self.fail[child]].
● Production incidentPOST-MORTEMseverity: high

The IDS That Missed 40% of Attack Signatures

Symptom
IDS alert volume dropped 40% immediately after deploying an updated attack signature database. No build errors, no runtime exceptions, no unusual CPU or memory metrics. Traffic analysis confirmed attacks were present and traversing the inspection path — the system just wasn't producing alerts for them. The alerting pipeline, pattern file, and network tap were all ruled out within the first hour.
Assumption
The team assumed the new pattern database contained errors — malformed signatures, truncated pattern strings, or encoding issues introduced during the database export. Two days were spent auditing pattern definitions against the vendor feed, diffing old and new pattern files, and verifying that the trie was being populated correctly. All patterns were confirmed present in the trie. The automaton appeared structurally correct.
Root cause
The build function was missing a single line: self.output[child] += self.output[self.fail[child]]. During a refactor two weeks prior, output link merging had been accidentally removed from the BFS construction loop. Failure links were computed correctly — the automaton navigated state transitions without errors. But the output lists at each node were never populated with matches inherited from failure link targets. Patterns that existed as suffixes of other patterns — 'he' inside 'she', 'http' inside 'https' — were registered in the trie but their pattern IDs were never merged into the output lists of the nodes that would actually be visited when the longer pattern matched. The automaton ran at full throughput. It returned results. It just didn't return all of them, and there was no signal anywhere in the system indicating anything was wrong.
Fix
The immediate fix was restoring the output merging line in the BFS loop and redeploying. The permanent fix was adding a regression test that constructs a test string guaranteed to contain every pattern in the set, runs the automaton over it, and asserts that the count of distinct matched pattern IDs equals the total number of patterns. This test would have caught the regression in CI before it reached production. Additionally, the team added a startup assertion that validates output list sizes against expected minimums derived from known suffix relationships in the pattern set.
Key lesson
  • Never deploy an Aho-Corasick automaton without output link merging in the build step — it is a correctness requirement, not an optimisation
  • The failure is entirely silent: no errors, no exceptions, no degraded metrics except the one you forgot to monitor — match coverage rate
  • Test with overlapping patterns specifically ('she'/'he', 'http'/'https', 'abc'/'bc') before every deployment; non-overlapping test cases will not catch this bug
  • Add a regression test that asserts every pattern in the set is matched at least once in a purpose-built test string containing all patterns
  • Monitor match coverage rate as a metric alongside alert volume — a drop in coverage with stable volume is the earliest signal of an output link failure
Production debug guideSymptom-driven diagnostics for automaton-based pattern matching systems5 entries
Symptom · 01
Automaton returns fewer matches than expected for known overlapping patterns
Fix
The output link merge is almost certainly missing or incomplete. Verify that self.output[child] includes self.output[self.fail[child]] and that this assignment happens after fail[child] is computed, not before. Write a minimal reproducer: patterns ['she', 'he'], text 'she', expected matches 2. If you get 1, the merge is broken. If you're using pyahocorasick, verify you called make_automaton() before iterating — without it the failure links and output merges are not applied.
Symptom · 02
Search latency spikes on specific text inputs, especially highly repetitive text
Fix
Profile whether goto shortcuts were computed during the build phase. Without shortcuts, each character transition on a mismatch traverses the full failure link chain, which can be O(trie_depth) per character on adversarial input. Verify that the BFS build step is populating goto shortcuts for characters not present at the current node: goto[node][ch] = goto[fail[node]][ch]. If shortcuts are present, check whether the input triggering the spike has unusually high failure link traversal depth by instrumenting the search loop with a traversal counter.
Symptom · 03
Memory usage exceeds expected trie size by 5x or more
Fix
Profile the effective alphabet size first. UTF-8 multi-byte characters create branches at the byte level that multiply node count dramatically — a single Chinese character requires 3 bytes of branching in a byte-level trie. Check whether patterns are being deduplicated before insertion: duplicate patterns add nothing but they do expand output lists at terminal nodes. For large sets, count unique prefixes with sort -u patterns.txt | python -c 'import sys; print(sum(len(l.rstrip()) for l in sys.stdin))' to get the theoretical minimum node count and compare against actual.
Symptom · 04
Automaton build time is slow despite a pattern set that should be manageable
Fix
Verify BFS is using collections.deque, not list with pop(0). list.pop(0) is O(n) per call — on a trie with 500,000 nodes this makes the build O(n²). Check for duplicate patterns causing redundant trie traversal. Profile the alphabet size: large Unicode pattern sets slow failure link computation significantly because the shortcut step iterates over all observed characters at each node. If patterns share a large common prefix (URL paths, file system paths), verify that the shared prefix nodes are created once, not duplicated per pattern.
Symptom · 05
Match positions are correct in unit tests but wrong in production with UTF-8 text
Fix
Verify byte-offset versus character-offset alignment. If you decode bytes to Python str before searching, match positions are character indices into the Unicode string. Network protocols, file offsets, and most log formats operate in bytes. A single emoji or CJK character is 3-4 bytes but one character index. Decide at the system boundary: either work with raw bytes throughout (build the trie on bytes, search bytes, report byte offsets), or decode to str and accept that you need a separate byte-to-char mapping table for position reporting.
★ Aho-Corasick Quick Debug Cheat SheetFast diagnostics for production pattern matching failures. Commands assume Python with pyahocorasick or a custom implementation.
Missing matches on overlapping patterns
Immediate action
Verify output link inheritance is applied in the BFS build step
Commands
python -c "import ahocorasick; A = ahocorasick.Automaton(); [A.add_word(p, p) for p in ['she','he']]; A.make_automaton(); print(list(A.iter('she')))"
grep -n 'output\[' aho_corasick_build.py
Fix now
Add self.output[child] += self.output[self.fail[child]] immediately after computing fail[child] in the BFS loop. Confirm the line appears after the fail assignment, not before. Re-run the two-pattern smoke test before deploying.
Automaton consuming excessive memory+
Immediate action
Profile node count, alphabet size, and whether patterns are deduplicated
Commands
python -c "import sys; print(f'Nodes: {len(ac.goto)}, Approx bytes: {sum(sys.getsizeof(d) for d in ac.goto)}')"
sort -u patterns.txt | wc -l && wc -l patterns.txt
Fix now
Deduplicate patterns before trie insertion. For ASCII-only sets, verify alphabet is bounded to 128 not 65536. For large sets (100K+ patterns), switch to pyahocorasick (C extension with double-array trie) — pure-Python dict nodes use 200-400 bytes each versus 8 bytes in a packed C array.
Search throughput below expected line rate+
Immediate action
Establish baseline throughput and confirm goto shortcuts are present
Commands
python -c "import timeit; t = timeit.timeit(lambda: ac.search(text_1mb), number=100); print(f'{len(text_1mb)*100/t/1e6:.1f} MB/s')"
py-spy top --pid $(pgrep -f uvicorn) --rate 100
Fix now
Switch from pure-Python to pyahocorasick (C extension) — expect 20-50x throughput improvement. Verify goto shortcuts are computed during build: every node should have entries for all characters in the alphabet, not just the characters present in its children. Pre-compute shortcuts to eliminate failure link traversals during search.
Build time exceeds acceptable window for dynamic pattern updates+
Immediate action
Separate trie construction time from failure link computation time to find the bottleneck
Commands
python -c "import time; t0=time.time(); [ac.add_pattern(p,i) for i,p in enumerate(patterns)]; print(f'Trie: {time.time()-t0:.3f}s'); t0=time.time(); ac.build(); print(f'Fail links: {time.time()-t0:.3f}s')"
python -c "print(f'Trie nodes: {len(ac.goto)}, Unique prefixes: {len(set(p[:i] for p in patterns for i in range(len(p)+1)))}')"
Fix now
Batch pattern additions — queue updates and rebuild on a 5-minute schedule rather than per-pattern-add. For zero-downtime: build new automaton in a background thread, then atomically swap the module-level reference. pyahocorasick rebuilds 100,000 patterns in under 200ms — if build time is your bottleneck, the pure-Python implementation is the problem.
Multi-Pattern String Matching Algorithms Compared
AlgorithmPreprocessingSearch TimePattern Count ImpactRegex SupportBest For
Aho-CorasickO(m) trie + failure links + goto shortcutsO(n + k)None — same search time for 1 or 100,000 patterns after buildNo — exact strings onlyHigh-throughput multi-pattern exact matching: IDS, antivirus, content moderation, CDN keyword filtering
KMPO(m) per pattern — must repeat for each patternO(n) per patternLinear — 10,000 patterns means 10,000 passes through the textNo — exact strings onlySingle-pattern matching where simplicity and zero-dependency matter more than multi-pattern support
Rabin-KarpO(m) per pattern with rolling hash precomputationO(n + k) average, O(nm) worst case on hash collisionLinear — pattern count multiplies passes, though batching reduces overhead somewhatNo — exact strings onlyPlagiarism detection, fuzzy duplicate finding, cases where you want to tune hash sensitivity
Hyperscan (Intel)Variable — depends on regex complexity and compilation flagsO(n) with hardware SIMD acceleration — near line rateMinimal — designed for large compiled rule setsYes — full PCRE-compatible regex with linear-time execution guarantees for a defined subsetProduction regex at line rate: commercial IDS/IPS, WAF, DPI — requires Linux x86_64 with SSE4.2+
Regex (RE2 / NFA)O(m) regex compile — DFA construction can be exponential in worst case, mitigated by lazy evaluationO(n) with RE2 (linear guaranteed), O(n × m) worst case with PCRE backtrackingDepends on pattern complexity — RE2 compiles a combined NFA, PCRE per-patternYes — full regex semantics with RE2's linear-time safety guaranteeComplex pattern matching where regex flexibility is required and you can accept the throughput trade-off versus Aho-Corasick

Key takeaways

1
Build once, search indefinitely
O(m) preprocessing where m is total pattern length, O(n + k) search where n is text length and k is match count. Pay the build cost once, then search terabytes without rebuilding.
2
Failure links are KMP's failure function applied to every node in the trie simultaneously. Same mechanics
longest valid suffix on mismatch, no rescanning — different data structure. If you understand KMP, failure links are immediate.
3
Output link merging is a correctness requirement, not an optimisation. A node for 'she' must also report 'he' if 'he' is a pattern and exists as a suffix. Missing this line produces silently incomplete results on every overlapping pattern pair.
4
Goto shortcuts computed during the BFS build step are what deliver O(n) search. Without them, adversarial or repetitive input degrades to O(n × trie_depth)
a meaningful difference on real workloads.
5
Production deployments
ClamAV virus signatures, Snort/Suricata IDS, Google RE2's literal pre-filter, CloudFlare HTTP inspection, CDN keyword filtering. Multi-pattern exact matching at scale means Aho-Corasick.
6
For Python
use pyahocorasick (C extension). Pure-Python dict-based implementations are useful for learning and for small pattern sets, but the interpreter overhead makes them impractical above a few hundred patterns at any meaningful throughput.
7
Aho-Corasick matches exact strings only. No wildcards, no character classes, no quantifiers. That constraint is the source of its performance. When you need regex semantics at scale, reach for Hyperscan or RE2
optionally with Aho-Corasick as a pre-filter to eliminate non-candidates before the regex runs.

Common mistakes to avoid

6 patterns
×

Forgetting output link merging in the BFS build step

Symptom
Patterns that are suffixes of other patterns are silently never reported. In an IDS, shorter attack signatures go undetected whenever a longer signature containing them fires. In a content filter, suffix-variant blocked terms leak through. No error is raised — just missing matches that are impossible to notice without knowing the expected output.
Fix
Add self.output[child] = self.output[child] + self.output[self.fail[child]] immediately after computing fail[child] in the BFS loop. Use list concatenation rather than += to avoid mutating the shared list reference from the failure link node. Add a mandatory regression test: patterns ['she', 'he', 'abc', 'bc'], text containing all four, assert match count == 4.
×

Using Aho-Corasick as a regex replacement for patterns containing wildcards or character classes

Symptom
Pattern '192.168.' matches nothing — the '' is treated as a literal asterisk. Pattern '[Ss]ql' matches the literal string '[Ss]ql' rather than both 'Sql' and 'sql'. Developers are confused because the algorithm appears to work on simple patterns but misses the intended matches.
Fix
Aho-Corasick matches exact character sequences with no interpretation of special characters. For wildcard or regex semantics, use Hyperscan (Intel's production regex engine with hardware acceleration) or RE2 (Google's linear-time regex engine). A common and effective pattern is Aho-Corasick as a pre-filter: extract required literal substrings from your regex, use Aho-Corasick to quickly rule out documents with no possible match, then apply the full regex only on candidates. This can cut expensive regex evaluations by 90%+ on typical workloads.
×

Rebuilding the automaton on every request or every query instead of caching it

Symptom
API latency jumps from 2ms to 200-500ms whenever pattern matching is involved. Profiling shows 95%+ of time spent in the build step. The search step itself is negligible.
Fix
The automaton is immutable after build. Build it once at application startup or when the pattern set changes, store it as a module-level singleton or injected dependency, and reuse it across all requests. If patterns can change at runtime, rebuild on a background thread and atomically swap the reference — a C-backed implementation rebuilds 50,000 patterns in under 100ms, which is typically acceptable for a background operation.
×

Mixing bytes and Unicode string characters — searching str text with a byte-level trie or vice versa

Symptom
Match positions are off by 2-4 characters for non-ASCII input. Patterns that should match fail entirely on multi-byte UTF-8 sequences. Position reporting is correct for ASCII but broken for any text containing accented characters, CJK, or emoji.
Fix
Decide at the system boundary and be consistent. For network protocols, HTTP bodies, and file content: work with raw bytes throughout. Build the trie on bytes, search bytes, report byte offsets. For user-facing text processing where character-level position reporting matters: decode to Python str before building and searching, accept that positions are Unicode codepoint indices, and document that byte offsets require a separate char-to-byte mapping.
×

Using DFS instead of BFS for failure link construction, or processing nodes in insertion order

Symptom
Search produces incorrect results on some inputs and correct results on others. Some failure links point to nodes deeper than they should, creating cycles or invalid state transitions. The bug is non-deterministic and depends on input order.
Fix
Always use deque-based BFS. Process nodes strictly level by level — root's children first, then their children, and so on. This guarantees that when you compute fail[child], fail[parent] is already correct. Never use recursion (which is DFS) for failure link construction. Never process nodes in insertion order without verifying that insertion order happens to produce a level-order traversal (it usually does not).
×

Not deduplicating patterns before building the trie

Symptom
Automaton is functionally correct but uses more memory than expected. Output lists at terminal nodes contain the same pattern ID multiple times. Match count reports are inflated — the same match is reported twice for the same position.
Fix
Deduplicate the pattern list before any calls to add_pattern. Use dict.fromkeys(patterns) or a set to filter duplicates while preserving order if order matters. Duplicate patterns produce duplicate entries in the terminal node's output list, which propagate through output merging and produce redundant matches in every search result.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does Aho-Corasick differ from running KMP once per pattern?
Q02SENIOR
What is the role of failure links in Aho-Corasick?
Q03SENIOR
What are output links and why are they needed?
Q04SENIOR
Give a real-world use case where Aho-Corasick is the right choice and ex...
Q05SENIOR
What happens if you forget to merge output links during the build step?
Q06SENIOR
How would you handle dynamic pattern updates in a production Aho-Corasic...
Q01 of 06SENIOR

How does Aho-Corasick differ from running KMP once per pattern?

ANSWER
KMP processes one pattern at a time. Running it against 10,000 patterns means 10,000 independent passes through the text — O(10,000 × n) total search time. For a 1 Gbps network stream that needs to check every packet against 10,000 attack signatures, this is physically impossible at line rate. Aho-Corasick collapses all patterns into a single automaton during preprocessing, then searches the text exactly once in O(n + k) time — where k is the number of matches. Pattern count doesn't appear in the search complexity. The key insight is that Aho-Corasick generalizes KMP's failure function to a trie: instead of one failure function for one pattern, you build failure links across the entire pattern set's shared prefix tree. On a mismatch at any trie node, the failure link tells you the deepest valid state to continue from — exactly what KMP's failure function does for a single pattern, but now covering all patterns simultaneously. The trade-off is upfront build cost — O(m) for trie construction plus O(m × |Σ|) for failure links with goto shortcuts, where m is total pattern length and |Σ| is alphabet size. You pay this once and then search arbitrarily large text volumes against the same pattern set.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How is Aho-Corasick related to KMP?
02
What if patterns overlap or one is a prefix of another?
03
Can Aho-Corasick handle regex patterns?
04
What's the memory footprint of an Aho-Corasick automaton?
05
How do you update the automaton when patterns change?
🔥

That's Strings. Mark it forged?

6 min read · try the examples if you haven't

Previous
Manacher's Algorithm — Longest Palindromic Substring
6 / 8 · Strings
Next
Suffix Array — Construction and Applications