
KMP Algorithm β€” Knuth-Morris-Pratt String Matching Explained

πŸ“ Part of: Strings β†’ Topic 2 of 8
Master the KMP string matching algorithm.
βš™οΈ Intermediate β€” basic DSA knowledge assumed
In this tutorial, you'll learn:
  • Why the text pointer i never moves left, and how that single invariant makes the search phase O(n)
  • What the LPS array stores, and why j = lps[j-1] (not 0) is the correct fallback on a mismatch
  • How to build the LPS array itself in O(m) with a two-pointer pass
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
Imagine you're reading a novel looking for the phrase 'ABCABD'. You start matching: A-B-C-A-B... then the next letter is wrong. A naive approach restarts from scratch. KMP says: wait — you already matched 'AB' at the start of the pattern, and the last two letters you matched were also 'AB'. So skip back to position 2, not position 0. That intelligence — knowing how much of the pattern is a prefix that's also a suffix — is the entire insight behind KMP.
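That fallback amount, the longest proper prefix of the matched part ('ABCAB') that is also its suffix, can be checked by brute force. A quick hand-verification sketch (the longest_border helper is my name for it, not standard KMP vocabulary):

```python
def longest_border(s: str) -> int:
    # Longest k < len(s) such that the first k chars equal the last k chars.
    # Brute force: fine for a hand check, not how KMP computes it.
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return k
    return 0

matched = 'ABCAB'  # what was matched before the wrong letter appeared
print(longest_border(matched))  # 2 -> resume comparing at pattern index 2
```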

KMP is one of those algorithms that every senior engineer has encountered in a 1am incident. You're processing a 500 MB log file looking for a pattern — maybe a specific error signature or a token format — and your naive substring search is burning through CPU. KMP is the fix, and understanding why it works is what separates engineers who reach for the right tool from engineers who reach for regex and hope for the best.

Published in 1977 by Knuth, Morris, and Pratt, KMP solves string pattern matching in O(n+m) time — guaranteed, not amortised, not average-case. It works by never re-examining a character in the text. Once the text pointer moves right, it never goes left. This single invariant is why the algorithm runs in linear time regardless of what adversarial pattern you throw at it. The mechanism that enforces this invariant is the LPS array — and once you truly understand LPS, you'll see it hidden inside the Z-algorithm, Aho-Corasick, and even suffix arrays.

The Core Insight — The LPS Array

The LPS (Longest Proper Prefix which is also Suffix) array is where most people get lost the first time. Let me give you the mental model that actually sticks.

Every prefix of the pattern has a measurable amount of self-overlap: the longest proper prefix of it that is also a suffix. That overlap length is what lps[i] stores for the prefix ending at index i.

For pattern AABAAB:
  • lps[0] = 0 — single char, no proper prefix
  • lps[1] = 1 — "AA" → prefix "A" = suffix "A"
  • lps[2] = 0 — "AAB" → nothing matches
  • lps[3] = 1 — "AABA" → prefix "A" = suffix "A"
  • lps[4] = 2 — "AABAA" → prefix "AA" = suffix "AA"
  • lps[5] = 3 — "AABAAB" → prefix "AAB" = suffix "AAB"

So lps = [0, 1, 0, 1, 2, 3]
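Before looking at the linear-time construction below, it helps to confirm that table against the raw definition. A deliberately slow brute-force cross-check (the name lps_bruteforce is illustrative):

```python
def lps_bruteforce(pattern: str) -> list[int]:
    # For each prefix pattern[:i+1], find the longest PROPER prefix
    # that is also a suffix, by direct slice comparison. Roughly cubic,
    # but unambiguous about what LPS means.
    lps = []
    for i in range(len(pattern)):
        p = pattern[:i + 1]
        best = 0
        for k in range(1, len(p)):  # proper: k < len(p)
            if p[:k] == p[-k:]:
                best = k
        lps.append(best)
    return lps

print(lps_bruteforce('AABAAB'))  # [0, 1, 0, 1, 2, 3]
```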

Here is why this matters in practice: you're at text position i, pattern position j. Characters match — advance both. Mismatch at j? Instead of resetting j to 0 (the naive approach — it throws away all previous work), set j = lps[j-1]. You're saying: "I know the last lps[j-1] characters of what I matched still form a valid prefix — start comparing from there."

This is the O(1) reuse of previously computed information. Same idea as caching, except it's baked into the algorithm structure rather than a data store.

kmp_lps.py · PYTHON
def build_lps(pattern: str) -> list[int]:
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of previous longest prefix suffix
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                # Fall back — don't increment i
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

print(build_lps('AABAAB'))   # [0, 1, 0, 1, 2, 3]
print(build_lps('ABCABD'))   # [0, 0, 0, 1, 2, 0]
print(build_lps('AABAABAAA')) # [0, 1, 0, 1, 2, 3, 4, 5, 2]
▶ Output
[0, 1, 0, 1, 2, 3]
[0, 0, 0, 1, 2, 0]
[0, 1, 0, 1, 2, 3, 4, 5, 2]

The KMP Search Algorithm

With the LPS array built, the search phase is straightforward. Two pointers i (text) and j (pattern) advance together. On a match, both advance. On a mismatch, j falls back to lps[j-1] — but i never goes backwards. This guarantees O(n) search time.

kmp_search.py · PYTHON
def kmp_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0:
        return []

    lps = build_lps(pattern)
    matches = []
    i = j = 0  # i = text index, j = pattern index

    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
        if j == m:
            matches.append(i - j)  # match found at index i-j
            j = lps[j - 1]         # look for next match
        elif i < n and text[i] != pattern[j]:
            if j != 0:
                j = lps[j - 1]     # fall back using LPS
            else:
                i += 1             # no fallback possible
    return matches

text    = 'AABAACAADAABAABA'
pattern = 'AABA'
print(kmp_search(text, pattern))  # [0, 9, 12]
▶ Output
[0, 9, 12]

Tracing Through a Mismatch

Let's trace KMP on text='AABAABAAB' and pattern='AABAAB' (lps=[0,1,0,1,2,3]):

  1. i=0,j=0: A==A → match, i=1,j=1
  2. i=1,j=1: A==A → match, i=2,j=2
  3. i=2,j=2: B==B → match, i=3,j=3
  4. i=3,j=3: A==A → match, i=4,j=4
  5. i=4,j=4: A==A → match, i=5,j=5
  6. i=5,j=5: B==B → match, i=6,j=6 → j==m → MATCH at index 0. j=lps[5]=3
  7. i=6,j=3: A==A → match, i=7,j=4
  8. i=7,j=4: A==A → match, i=8,j=5
  9. i=8,j=5: B==B → match, i=9,j=6 → j==m → MATCH at index 3

Critical observation: after the first match, i did not reset to 1. It continued from 6. KMP never goes backwards in the text.

🔥 Why i never decrements
The LPS array encodes all the information needed to resume matching. When we fall back j using lps[j-1], we're saying 'the last lps[j-1] characters I matched in the pattern are still valid — I don't need to re-examine them in the text.'

Complexity Analysis

Build LPS: O(m) time, O(m) space

Search: O(n) time — i only ever increments, and j can only decrement as many times as it has incremented. So the total number of operations is bounded by 2n.

Total: O(n+m) time, O(m) space

Comparison with naive:
  • Naive worst case: O(nm) — pattern 'AAAB' in text 'AAAA...A'
  • KMP worst case: O(n+m) — always
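One way to see the 2n bound concretely is to count character comparisons for both algorithms on that adversarial case. A self-contained sketch with counters added purely for illustration (they are not part of either algorithm):

```python
def naive_count(text: str, pattern: str) -> int:
    # Character comparisons made by the naive sliding-window search.
    comps = 0
    for start in range(len(text) - len(pattern) + 1):
        for j in range(len(pattern)):
            comps += 1
            if text[start + j] != pattern[j]:
                break
    return comps

def kmp_count(text: str, pattern: str) -> int:
    # Character comparisons made by the KMP search phase
    # (LPS construction is not counted here).
    m = len(pattern)
    lps = [0] * m
    length, i = 0, 1
    while i < m:              # build LPS, same logic as build_lps above
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length:
            length = lps[length - 1]
        else:
            i += 1
    comps = i = j = 0
    while i < len(text):
        comps += 1            # exactly one comparison per loop turn
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                j = lps[j - 1]
        elif j:
            j = lps[j - 1]    # fall back; i stays put
        else:
            i += 1
    return comps

text, pattern = 'A' * 1000, 'AAAB'
print(naive_count(text, pattern))  # 3988: every window scans the whole pattern
print(kmp_count(text, pattern))    # 1997: under the 2n = 2000 bound
```

The naive count grows like n·m, while the KMP count stays below 2n no matter how pathological the pattern is.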

KMP's guarantee matters most on adversarial or highly repetitive inputs, such as DNA sequences in bioinformatics.

Practical Applications

KMP appears in real-world systems in several forms:

Text editors: find/replace typically uses a Boyer-Moore variant such as Boyer-Moore-Horspool (faster in practice on typical text), but KMP is the theoretical baseline.

Bioinformatics: DNA sequence search — long patterns over a tiny alphabet (A, T, C, G) produce many partial matches, which makes KMP's worst-case guarantee valuable.

Network intrusion detection: Snort IDS uses Aho-Corasick (which is built on the same failure function idea as KMP) to match thousands of attack signatures simultaneously.

Python's str.find(): CPython uses a hybrid fast-search algorithm rather than textbook KMP, but the same principle of reusing partial-match information underlies it.

Competitive programming: KMP appears in problems involving periodicity, string compression, and checking if one string is a rotation of another.

When you actually need to implement this yourself (not use a library): custom protocol parsers, embedded systems with no stdlib, log stream processors that need guaranteed latency, and of course — coding interviews. Knowing KMP cold, without hesitation, is the difference between "knows string algorithms" and "can implement them".

kmp_applications.py · PYTHON
# Classic interview application: is s2 a rotation of s1?
def is_rotation(s1: str, s2: str) -> bool:
    if len(s1) != len(s2):
        return False
    # s2 is a rotation of s1 iff s2 appears in s1+s1
    return bool(kmp_search(s1 + s1, s2))

print(is_rotation('abcde', 'cdeab'))  # True
print(is_rotation('abcde', 'abced'))  # False

# Find period of a string using LPS
def string_period(s: str) -> int:
    lps = build_lps(s)
    n = len(s)
    period = n - lps[-1]
    if n % period == 0:
        return period
    return n  # no repeated period

print(string_period('abababab'))  # 2
print(string_period('abcabc'))    # 3
print(string_period('abcdef'))    # 6 (no period)
▶ Output
True
False
2
3
6
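The guaranteed-latency, streaming use case mentioned earlier amounts to running the KMP search loop one character at a time, keeping only the pattern index as state between characters. A minimal sketch (the StreamMatcher class and its feed method are illustrative names, not from any library):

```python
class StreamMatcher:
    """Feed characters one at a time; O(1) amortised work per character."""

    def __init__(self, pattern: str):
        self.pattern = pattern
        self.lps = self._build_lps(pattern)
        self.j = 0    # current position in the pattern
        self.pos = 0  # total characters consumed so far

    @staticmethod
    def _build_lps(pattern: str) -> list[int]:
        # Same construction as build_lps above.
        lps = [0] * len(pattern)
        length, i = 0, 1
        while i < len(pattern):
            if pattern[i] == pattern[length]:
                length += 1
                lps[i] = length
                i += 1
            elif length:
                length = lps[length - 1]
            else:
                i += 1
        return lps

    def feed(self, ch: str):
        """Consume one character; return a match's start index, or None."""
        while self.j and ch != self.pattern[self.j]:
            self.j = self.lps[self.j - 1]   # fall back without re-reading text
        if ch == self.pattern[self.j]:
            self.j += 1
        self.pos += 1
        if self.j == len(self.pattern):
            start = self.pos - self.j
            self.j = self.lps[self.j - 1]   # keep looking for overlapping matches
            return start
        return None

matcher = StreamMatcher('AABA')
hits = []
for ch in 'AABAACAADAABAABA':   # same text as the kmp_search example
    start = matcher.feed(ch)
    if start is not None:
        hits.append(start)
print(hits)  # [0, 9, 12]
```

Because feed never looks at any character other than the one it was handed, per-character latency is bounded regardless of the input stream.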

🎯 Key Takeaways

  • The text pointer i never moves left — this single invariant is why KMP is O(n) in the search phase, not O(nm). If your 'KMP' has i decrementing, it's not KMP.
  • lps[j-1] is the fallback position on mismatch — not 0. Resetting to 0 on every mismatch is the naive algorithm. The difference between these two lines is the entire algorithm.
  • Build LPS in O(m) with the same two-pointer technique: a length pointer and an i pointer, with lps[length-1] as the fallback when they don't match.
  • In production, use Python's str.find() or the re module — they use optimised C implementations. KMP matters when you're implementing search in a constrained environment (embedded, custom protocol parser, interview).
  • KMP is the foundation for Aho-Corasick (which extends it to multiple patterns) and appears in problems about string periodicity: the minimum period of s is len(s) - lps[-1] (when that value divides len(s)).

⚠ Common Mistakes to Avoid

  • ✕ Setting j=0 on every mismatch — this is the O(nm) naive algorithm. You need j = lps[j-1] when j != 0, and should only increment i when j == 0.
  • ✕ Using lps[j] instead of lps[j-1] on mismatch — an off-by-one that causes incorrect fallback and subtle bugs that only manifest on patterns with heavy repetition.
  • ✕ Not handling the case where j=0 on mismatch — here you cannot fall back further, so only i increments. Missing this gives an infinite loop.
  • ✕ Forgetting that lps is built from the PATTERN, not the text — a very common confusion when first reading the algorithm.

Interview Questions on This Topic

  • Q: Explain what the LPS array stores and how it is computed.
  • Q: Why does KMP run in O(n+m) rather than O(nm)?
  • Q: How would you use KMP to check if one string is a rotation of another?
  • Q: What is the LPS array for the pattern 'ABACABA'?

Frequently Asked Questions

What does LPS stand for?

Longest Proper Prefix which is also a Suffix. 'Proper' means the prefix/suffix cannot be the entire string itself.

Is KMP always faster than naive search?

Not always in practice. For short patterns or typical English text, the naive algorithm often performs well because mismatches occur early. KMP's advantage shows on adversarial inputs with lots of partial matches.

What is the difference between KMP and Z-Algorithm?

Both solve the same problem in O(n+m). Z-algorithm computes the length of the longest substring starting at each position that matches a prefix of the string. KMP's failure function is an older, more battle-tested formulation. In an interview: KMP is more commonly expected and shows understanding of failure functions which generalise to Aho-Corasick. In competitive programming: Z-algorithm is often preferred because it's easier to code from memory and the array is more directly interpretable. Pick one, understand it deeply, know that the other exists.
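For a concrete feel of the contrast, here is a minimal Z-function sketch, using the usual 'pattern + separator + text' trick for matching (the separator '#' is assumed not to occur in either string):

```python
def z_function(s: str) -> list[int]:
    # z[i] = length of the longest common prefix of s and s[i:]
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n  # convention: the whole string matches itself
    l = r = 0     # rightmost Z-box seen so far, covering s[l:r]
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])  # reuse work from the Z-box
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

# Pattern matching via Z: compute Z over pattern + separator + text.
pattern, text = 'AABA', 'AABAACAADAABAABA'
m = len(pattern)
z = z_function(pattern + '#' + text)
matches = [i - m - 1 for i, zi in enumerate(z) if i > m and zi >= m]
print(matches)  # [0, 9, 12]
```

Note how directly interpretable the array is: a match starts wherever a z-value in the text region reaches the pattern length.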

Naren, Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
