Homeβ€Ί DSAβ€Ί Aho-Corasick Algorithm β€” Multi-Pattern String Matching

Aho-Corasick Algorithm β€” Multi-Pattern String Matching

Where developers are forged. Β· Structured learning Β· Free forever.
πŸ“ Part of: Strings β†’ Topic 6 of 8
Learn the Aho-Corasick algorithm for simultaneous multi-pattern string matching.
πŸ”₯ Advanced β€” solid DSA foundation required
In this tutorial, you'll learn:
  • Build once, search many: O(m) preprocessing (m = total pattern length), O(n + k) search. Pay the build cost once, search terabytes.
  • Failure links are KMP's failure function on a trie. Same concept, different data structure. If you understand KMP failure function, Aho-Corasick failure links are immediate.
  • Output links are not optional: a node for 'she' must also report 'he' if 'he' is a pattern. Missing output links gives silently incorrect results on overlapping patterns.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚑ Quick Answer
KMP finds one pattern efficiently. But what if you need to find 1000 patterns simultaneously? Running KMP 1000 times is O(1000n). Aho-Corasick builds a single 'search automaton' from all patterns β€” then searches the text once, finding all patterns at the same time. It's like having a single highly parallel scanner instead of 1000 sequential ones.

In production network security, Snort IDS processes multi-gigabit traffic matching against tens of thousands of attack signatures simultaneously. Running KMP once per signature per packet is physically impossible at line rate. Aho-Corasick is what makes it work: build one automaton from all patterns, run traffic through once, catch every match of every signature in a single linear pass.

Published in 1975 by Alfred Aho and Margaret Corasick at Bell Labs, the algorithm generalises KMP's failure function to a multi-pattern trie. The failure links in Aho-Corasick are KMP failure function values applied to a trie structure. Once you understand one, the other is immediate.

Building the Trie

The first step is inserting all patterns into a trie (prefix tree). Each node represents a prefix shared by one or more patterns. Nodes that complete a pattern are marked as output nodes.

aho_corasick_trie.py Β· PYTHON
123456789101112131415161718192021222324
from collections import deque

class AhoCorasick:
    def __init__(self):
        # Each node: {'char': next_node_id}, fail link, output list
        self.goto = [{}]   # goto[node][char] = next_node
        self.fail = [0]    # failure link
        self.output = [[]] # patterns ending at this node

    def add_pattern(self, pattern: str, pid: int):
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.output[node].append(pid)

ac = AhoCorasick()
for i, p in enumerate(['he', 'she', 'his', 'hers']):
    ac.add_pattern(p, i)
print(f'Trie nodes: {len(ac.goto)}')  # 11
β–Ά Output
Trie nodes: 11

Failure links are Aho-Corasick's equivalent of KMP's failure function. For each node v (reached by character c from parent p), the failure link points to the longest proper suffix of the string represented by v that is also a prefix of some pattern.

Failure links are built with BFS from the root: 1. Root's children have fail = root 2. For each node v at depth d, and each character c: if v has a child for c, that child's fail = goto(fail[v], c). Otherwise, add a goto shortcut: goto[v][c] = goto[fail[v]][c].

aho_corasick_fail.py Β· PYTHON
123456789101112131415161718192021
    def build(self):
        """Build failure links using BFS."""
        q = deque()
        # Root's children: fail = root
        for ch, child in self.goto[0].items():
            self.fail[child] = 0
            q.append(child)

        while q:
            node = q.popleft()
            for ch, child in self.goto[node].items():
                # Compute fail for child
                f = self.fail[node]
                while f != 0 and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                if self.fail[child] == child:
                    self.fail[child] = 0
                # Merge output: child inherits from its fail link
                self.output[child] += self.output[self.fail[child]]
                q.append(child)

Searching the Text

With failure links built, searching is a single linear pass through the text β€” follow goto transitions, fall back via failure links on mismatch, and collect outputs.

aho_corasick_search.py Β· PYTHON
1234567891011121314151617181920212223
    def search(self, text: str) -> list[tuple[int,int]]:
        """Returns (end_position, pattern_id) for each match."""
        node = 0
        results = []
        for i, ch in enumerate(text):
            while node != 0 and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pid in self.output[node]:
                results.append((i, pid))
        return results

# Full example
ac = AhoCorasick()
patterns = ['he', 'she', 'his', 'hers']
for i, p in enumerate(patterns):
    ac.add_pattern(p, i)
ac.build()

matches = ac.search('ahishers')
for end, pid in matches:
    p = patterns[pid]
    print(f"'{p}' ends at index {end} (starts at {end-len(p)+1})")
β–Ά Output
'his' ends at index 3 (starts at 1)
'he' ends at index 5 (starts at 4)
'hers' ends at index 7 (starts at 4)
'she' ends at index 6 (starts at 4)

Complexity Analysis

Let n = text length, m = sum of all pattern lengths, k = number of matches.

Build trie: O(m) Build failure links: O(m Γ— |alphabet|) using the shortcut approach, or O(m) with optimised BFS Search: O(n + k) Total: O(n + m + k)

This is optimal β€” you must read all input and report all matches. The number of patterns does not appear in the complexity beyond the preprocessing step. Searching 10,000 patterns takes the same time as searching 1, once built.

πŸ”₯
Real World UsageSnort (open source network IDS) uses Aho-Corasick to match thousands of attack signatures against network packets simultaneously. ClamAV antivirus uses it for virus signature matching. The algorithm is specifically designed for this 'many patterns, one text' use case.

🎯 Key Takeaways

  • Build once, search many: O(m) preprocessing (m = total pattern length), O(n + k) search. Pay the build cost once, search terabytes.
  • Failure links are KMP's failure function on a trie. Same concept, different data structure. If you understand KMP failure function, Aho-Corasick failure links are immediate.
  • Output links are not optional: a node for 'she' must also report 'he' if 'he' is a pattern. Missing output links gives silently incorrect results on overlapping patterns.
  • Production use: ClamAV virus signatures, Snort/Suricata IDS, Google RE2, CDN keyword filtering. Multi-pattern at scale means Aho-Corasick.
  • For Python: use pyahocorasick (C extension). Pure-Python is too slow β€” the interpreter overhead kills performance at search scale.

Interview Questions on This Topic

  • QHow does Aho-Corasick differ from running KMP once per pattern?
  • QWhat is the role of failure links in Aho-Corasick?
  • QWhat are output links and why are they needed?
  • QGive a real-world use case where Aho-Corasick is the right choice.

Frequently Asked Questions

How is Aho-Corasick related to KMP?

Aho-Corasick generalises KMP to multiple patterns. KMP's failure function is essentially Aho-Corasick applied to a single pattern. The failure links in Aho-Corasick serve exactly the same role β€” finding the next valid state on mismatch.

What if patterns overlap or one is a prefix of another?

Aho-Corasick handles this through output links β€” when a node is reached, it reports matches from the node itself and from all nodes reachable via failure/output links. Patterns that are prefixes of others will fire at intermediate nodes.

πŸ”₯
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousManacher's Algorithm β€” Longest Palindromic SubstringNext β†’Suffix Array β€” Construction and Applications
Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged