Z-Algorithm — Direct Match Lengths vs KMP
Replace KMP's failure function with Z-algorithm's Z-array for exact match lengths — see O(n+m) pattern matching in Python with clear examples..
20+ years shipping performance-critical code where algorithms decide the bill. Notes here come from systems that actually shipped.
- Z-Algorithm computes a Z-array where Z[i] = longest prefix match starting at position i, built in O(n) time.
- Pattern matching concatenate pattern + '$' + text, compute Z-array; any Z[i] == len(pattern) is a match.
- Key difference from KMP Z-array gives direct match lengths per position; KMP's failure function tracks fallback positions for shifts.
- Z-box trick maintain interval [l, r] of rightmost prefix match; reuse Z[i-l] to avoid re-scanning, giving linear runtime.
- Use when you need the full match-length profile, multiple pattern queries, or simpler implementation under time pressure.
Imagine a string like 'aabxaabaabx'. The Z-algorithm asks: at each position i, how long is the longest substring starting at i that matches the very beginning of the string? Position 4 gives 'aabx' which matches the start — so Z[4]=4. This single idea, computed efficiently for every position, is enough to find any pattern in any text.
The Z-algorithm is competitive programming's go-to string tool. It solves the same problem as KMP in O(n+m), but the Z-array is directly interpretable — you see match lengths explicitly rather than reasoning about failure functions. Engineers who learned through competitive programming reach for Z-algorithm first because it is memorisable under interview pressure.
The Z-box insight that makes it linear is the exact same structural trick as KMP's failure function: maintain a rightmost matched boundary, initialise from the mirror position, only expand what has not yet been proven. Once you see this pattern, you recognise it in KMP, Manacher, and suffix array construction.
Z-Algorithm — Direct Match Lengths vs KMP
The Z-algorithm precomputes, for each position in a string, the length of the longest substring starting at that position which is also a prefix of the string. This array, called the Z-array, is built in O(n) time using a single linear scan with a sliding window that reuses previously computed matches. Unlike KMP which computes failure functions for pattern shifts, the Z-algorithm directly exposes the longest common prefix between any suffix and the whole string — a more general primitive.
To build the Z-array, maintain an interval [l, r] that represents the rightmost prefix match found so far. For each i > 0, if i ≤ r, you can initialize Z[i] from Z[i - l] (capped by r - i + 1), then extend it by character comparisons. This reuse is what gives O(n) total comparisons. The key property: Z[i] tells you exactly how many characters match the prefix starting at i, which makes substring search trivial by concatenating pattern + '$' + text and scanning for Z-values equal to pattern length.
Use the Z-algorithm when you need exact pattern matching in O(n + m) time with minimal overhead — it's simpler to implement correctly than KMP and doesn't require computing a failure table. It's especially useful in bioinformatics for exact substring search in long genomic sequences, and in text editors for find operations. The tradeoff: it uses O(n) extra space for the Z-array, while KMP uses O(m) for the failure function. For most real-world string matching, Z-algorithm's direct semantics make it easier to debug and extend.
How the Z-Algorithm Works — Plain English and Steps
The Z-algorithm computes a Z-array for a string S where Z[i] is the length of the longest substring starting at S[i] that is also a prefix of S. For example, in S='aabxaa', Z[4]=2 because S[4..5]='aa' matches the prefix 'aa'.
For pattern matching, concatenate pattern + '$' + text (the '$' is a separator not in either string). Compute the Z-array. Any position i in the text portion where Z[i] == len(pattern) is a match.
Algorithm to build the Z-array in O(n): 1. Z[0] is undefined (or set to n by convention). Maintain a window [l, r] — the rightmost Z-box found so far. 2. For each i from 1 to n-1: a. If i > r: compute Z[i] naively by matching S[i..] against S[0..]. Update l=i, r=i+Z[i]-1. b. Else (i is inside the current Z-box [l,r]): - Let k = i - l (position of i within the box, mirrored in prefix). - If Z[k] < r - i + 1: Z[i] = Z[k] (the match is entirely inside the box). - Else: extend naively from r+1 and update l and r.
The Z-box [l,r] allows us to reuse previous match results — this is what gives O(n) time instead of O(n^2).
Worked Example — Z-Array for 'aabcaab'
String S = 'aabcaab' (indices 0-6)
Z[0] = 7 by convention (entire string matches itself).
i=1: S[1..]='abcaab'. Match against prefix 'aabcaab'. S[1]='a'==S[0]='a', S[2]='b'==S[1]='a'? No. Z[1]=1. l=1,r=1. i=2: S[2..]='bcaab'. S[2]='b'!=S[0]='a'. Z[2]=0. i=3: S[3..]='caab'. S[3]='c'!=S[0]='a'. Z[3]=0. i=4: i>r(=1). S[4..]='aab'. S[4]='a'==S[0], S[5]='a'==S[1], S[6]='b'==S[2]? Yes! Z[4]=3. l=4,r=6. i=5: i=5 is inside [4,6]. k=5-4=1. Z[k]=Z[1]=1. Check: r-i+1=6-5+1=2. Z[1]=1 < 2. Z[5]=Z[1]=1. i=6: i=6 is inside [4,6]. k=6-4=2. Z[k]=Z[2]=0. Z[2]=0 < r-i+1=1. Z[6]=0.
Z-array: [7, 1, 0, 0, 3, 1, 0]
Pattern matching example: find 'aa' in text 'aabcaab'. Concatenated: 'aa$aabcaab' (length 10) Z-array: [10,1,0,2,1,0,0,3,1,0] Pattern length = 2. Look for Z[i] == 2 in the text portion (indices >= 3). Z[3]=2 -> match at text index 3-3=0. Z[7]=3 >= 2 -> match at text index 7-3=4. Matches at indices 0 and 4 in the original text.
Z-Algorithm Implementation
The implementation maintains the rightmost Z-box [l, r] to avoid redundant comparisons.
Why the Z-Array Calculation Doesn't Need a PhD
The naive way to build a Z-array is O(n²). Compare each position against the prefix character by character. Works fine for a 50-character string. Useless for production data where strings hit millions of characters.
The Z-algorithm avoids this by exploiting previously computed matches. It maintains a window [left, right] — the rightmost substring that matches the prefix. When you're inside that window, you can copy Z-values from previous positions instead of re-comparing. This is the 'why' behind the linear runtime.
Here's the trick: if the current index i is within the window, Z[i] starts at min(Z[i - left], right - i + 1). You then extend it character by character. Only when you hit a mismatch or exceed the window do you update the window. This means each character gets compared at most twice — once when entering the window, once when exiting. That's how you get O(n+m) total.
The implementation below shows this using a window. No recursion. No magic. Just pointer arithmetic and a while loop.
Real-World Applications Where Z Algorithm Saves Your Neck
Pattern matching is the obvious use case, but Z-algorithm shines in places where naive approaches would melt your server.
Genome Sequence Matching: DNA strings are gigabytes long. Searching for a gene sequence (the pattern) across a genome (the text) with Z-algorithm gives you linear time. The separator character trick works because biological data has a limited alphabet — pick something that never appears naturally.
Spam Detection: Filters maintain blacklists of known spam phrases. Instead of scanning every email with a regex engine, concatenate the blacklist patterns with separators and run Z-algorithm once per email. I've seen this reduce scan time from seconds to milliseconds in production.
Plagiarism Checkers: Compare a student's submission against a corpus. Build a combined string with pattern == document, text == corpus. Z-algorithm finds exact substring matches in O(total characters). Much faster than edit distance for the initial filter.
Intrusion Detection Systems: Network packets get checked against known attack signatures. Z-algorithm handles large signature databases efficiently because it's single-pass and doesn't backtrack.
Where it fails: If your pattern or text changes frequently, rebuilding the Z-array costs O(n) each time. For dynamic data, consider suffix arrays or Aho-Corasick for multiple patterns.