Home DSA Suffix Array — Construction and Applications

Suffix Array — Construction and Applications

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Strings → Topic 7 of 8
Learn suffix arrays — how to build them, compute the LCP array, and use them for pattern matching, longest repeated substring, and string problems in O(n log n) time.
🔥 Advanced — solid DSA foundation required
In this tutorial, you'll learn:
  • Suffix array = sorted starting positions of all suffixes — O(n log n) to build.
  • Pattern matching: binary search in O(m log n) per query.
  • LCP array enables longest repeated substring, distinct substring count in O(n).
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
A suffix array is simply all suffixes of a string sorted alphabetically, with their starting positions stored. It sounds trivial — but this sorted structure enables you to binary-search for any pattern in O(m log n) time, find the longest repeated substring, count distinct substrings, and solve dozens of hard string problems efficiently.

In 2001, bioinformatics needed to align 3 billion base pairs (the human genome) against itself to find all repeated regions. A suffix tree would use 45GB of RAM. The suffix array solved the same problem in 4GB — and the algorithms to build and query it are cleaner to implement. That is the practical origin story of suffix arrays: a space-efficient replacement for suffix trees that turned bioinformatics from theoretically possible to computationally tractable.

Today, every modern genome aligner (BWA, Bowtie2, HISAT2) uses a suffix array variant called the FM-index. Competitive programming hard problems regularly involve suffix arrays. And the underlying ideas transfer directly to understanding database index structures and substring search engines.

What is a Suffix Array?

For string S = 'banana', the suffixes are: - 0: banana - 1: anana - 2: nana - 3: ana - 4: na - 5: a

Sorted lexicographically: - 5: a - 3: ana - 1: anana - 0: banana - 4: na - 2: nana

Suffix Array SA = [5, 3, 1, 0, 4, 2]

suffix_array_naive.py · PYTHON
12345678
def suffix_array_naive(s: str) -> list[int]:
    """O(n² log n) — simple but correct."""
    n = len(s)
    suffixes = sorted(range(n), key=lambda i: s[i:])
    return suffixes

print(suffix_array_naive('banana'))  # [5, 3, 1, 0, 4, 2]
print(suffix_array_naive('mississippi'))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
▶ Output
[5, 3, 1, 0, 4, 2]
[10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]

O(n log n) Construction — Prefix Doubling

The efficient construction uses prefix doubling (Manber-Myers algorithm): sort suffixes by their first 2^k characters, doubling k each round. Each round uses the ranking from the previous round, enabling O(n log n) total.

suffix_array_nlogn.py · PYTHON
123456789101112131415161718192021
def suffix_array(s: str) -> list[int]:
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    tmp = [0] * n
    k = 1
    while k < n:
        def key(i):
            return (rank[i], rank[i+k] if i+k < n else -1)
        sa.sort(key=key)
        tmp[sa[0]] = 0
        for i in range(1, n):
            tmp[sa[i]] = tmp[sa[i-1]] + (1 if key(sa[i]) != key(sa[i-1]) else 0)
        rank = tmp[:]
        if rank[sa[-1]] == n - 1:
            break
        k *= 2
    return sa

print(suffix_array('banana'))       # [5, 3, 1, 0, 4, 2]
print(suffix_array('abracadabra'))  # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
▶ Output
[5, 3, 1, 0, 4, 2]
[10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]

Pattern Matching with Suffix Array

Because SA is sorted, binary search finds all occurrences of pattern P in O(m log n) time — compare P against SA[mid] using the first m characters.

sa_search.py · PYTHON
123456789101112131415
import bisect

def sa_search(s: str, sa: list[int], pattern: str) -> list[int]:
    m = len(pattern)
    # Find leftmost position where suffix >= pattern
    lo = bisect.bisect_left(sa, 0, key=lambda i: s[i:i+m] >= pattern)
    hi = bisect.bisect_right(sa, 0, key=lambda i: s[i:i+m] > pattern)
    # Simpler approach:
    suffixes = [s[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + chr(255))
    return sorted(sa[lo:hi])

sa = suffix_array('banana')
print(sa_search('banana', sa, 'ana'))  # [1, 3]
▶ Output
[1, 3]

The LCP Array

LCP[i] = length of the longest common prefix between SA[i] and SA[i-1]. Combined with SA, LCP enables answering many string queries in O(1) with O(n) preprocessing.

Applications: - Longest repeated substring: max(LCP) - Number of distinct substrings: n*(n+1)/2 - sum(LCP) - Longest common substring of two strings: concatenate with sentinel and find max LCP across boundary

🎯 Key Takeaways

  • Suffix array = sorted starting positions of all suffixes — O(n log n) to build.
  • Pattern matching: binary search in O(m log n) per query.
  • LCP array enables longest repeated substring, distinct substring count in O(n).
  • Suffix arrays are the practical alternative to suffix trees — same power, less memory.
  • Kasai's algorithm computes the LCP array in O(n) from the suffix array.

Interview Questions on This Topic

  • QWhat is a suffix array and how is it different from a suffix tree?
  • QHow do you find the longest repeated substring using a suffix array?
  • QWhat is the LCP array and what problems does it enable?
  • QWhat is the time complexity of building a suffix array using prefix doubling?

Frequently Asked Questions

When should I use a suffix array vs a suffix tree?

Suffix arrays are preferred in practice — they use O(n) space vs O(n) for suffix trees but with much smaller constants, and are easier to implement. Suffix trees have better theoretical query times for some problems but are rarely worth the implementation complexity.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousAho-Corasick Multi-Pattern String MatchingNext →Rolling Hash and Polynomial String Hashing
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged