Senior 3 min · March 24, 2026

Suffix Array—Why Suffix Trees Crashed 64GB Genome Aligner

Suffix trees consumed 45GB for 3.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Suffix array: sorted starting indices of all suffixes of a string
  • Built in O(n log n) via prefix doubling or O(n) with SA-IS
  • Pattern matching: binary search over suffixes in O(m log n)
  • LCP array adds longest repeated substring and distinct substring count in O(n)
  • Production insight: suffix arrays use 4x less memory than suffix trees for genome-scale data
Plain-English First

A suffix array is simply all suffixes of a string sorted alphabetically, with their starting positions stored. It sounds trivial — but this sorted structure enables you to binary-search for any pattern in O(m log n) time, find the longest repeated substring, count distinct substrings, and solve dozens of hard string problems efficiently.

In 2001, bioinformatics needed to align 3 billion base pairs (the human genome) against itself to find all repeated regions. A suffix tree would use 45GB of RAM. The suffix array solved the same problem in 4GB — and the algorithms to build and query it are cleaner to implement. That is the practical origin story of suffix arrays: a space-efficient replacement for suffix trees that turned bioinformatics from theoretically possible to computationally tractable.

Today, every modern genome aligner (BWA, Bowtie2, HISAT2) uses a suffix array variant called the FM-index. Competitive programming hard problems regularly involve suffix arrays. And the underlying ideas transfer directly to understanding database index structures and substring search engines.

What is a Suffix Array?

For string S = 'banana', the suffixes are
  • 0: banana
  • 1: anana
  • 2: nana
  • 3: ana
  • 4: na
  • 5: a
Sorted lexicographically
  • 5: a
  • 3: ana
  • 1: anana
  • 0: banana
  • 4: na
  • 2: nana

Suffix Array SA = [5, 3, 1, 0, 4, 2]

suffix_array_naive.pyPYTHON
1
2
3
4
5
6
7
8
def suffix_array_naive(s: str) -> list[int]:
    """O(n² log n) — simple but correct."""
    n = len(s)
    suffixes = sorted(range(n), key=lambda i: s[i:])
    return suffixes

print(suffix_array_naive('banana'))  # [5, 3, 1, 0, 4, 2]
print(suffix_array_naive('mississippi'))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
Output
[5, 3, 1, 0, 4, 2]
[10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
Production Insight
The naive O(n² log n) construction is fine for n < 10⁴.
For DNA sequences (n ~ 10⁸), this will never finish.
Rule: always estimate n before choosing construction algorithm.
Key Takeaway
Suffix array = sorted starting positions of all suffixes.
Naive O(n² log n) is a correctness baseline only.
Use prefix doubling for real problems.

O(n log n) Construction — Prefix Doubling

The efficient construction uses prefix doubling (Manber-Myers algorithm): sort suffixes by their first 2^k characters, doubling k each round. Each round uses the ranking from the previous round, enabling O(n log n) total.

suffix_array_nlogn.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
def suffix_array(s: str) -> list[int]:
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    tmp = [0] * n
    k = 1
    while k < n:
        def key(i):
            return (rank[i], rank[i+k] if i+k < n else -1)
        sa.sort(key=key)
        tmp[sa[0]] = 0
        for i in range(1, n):
            tmp[sa[i]] = tmp[sa[i-1]] + (1 if key(sa[i]) != key(sa[i-1]) else 0)
        rank = tmp[:]
        if rank[sa[-1]] == n - 1:
            break
        k *= 2
    return sa

print(suffix_array('banana'))       # [5, 3, 1, 0, 4, 2]
print(suffix_array('abracadabra'))  # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
Output
[5, 3, 1, 0, 4, 2]
[10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
Production Insight
Sorting by (rank[i], rank[i+k]) is O(log n) rounds of O(n log n) sorting.
In Python, built-in sort with tuple keys adds overhead — use radix sort for large n.
Tip: early break when all ranks are distinct saves rounds.
Key Takeaway
Prefix doubling: O(n log n) time, O(n) space.
Each round sorts suffixes by first 2^k characters.
Break early when ranks are unique — no further sorting needed.

Pattern Matching with Suffix Array

Because SA is sorted, binary search finds all occurrences of pattern P in O(m log n) time — compare P against SA[mid] using the first m characters.

sa_search.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import bisect

def sa_search(s: str, sa: list[int], pattern: str) -> list[int]:
    m = len(pattern)
    # Find leftmost position where suffix >= pattern
    lo = bisect.bisect_left(sa, 0, key=lambda i: s[i:i+m] >= pattern)
    hi = bisect.bisect_right(sa, 0, key=lambda i: s[i:i+m] > pattern)
    # Simpler approach:
    suffixes = [s[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + chr(255))
    return sorted(sa[lo:hi])

sa = suffix_array('banana')
print(sa_search('banana', sa, 'ana'))  # [1, 3]
Output
[1, 3]
Production Insight
Using Python's bisect with string comparison creates O(m) per comparison — O(m log n) total.
For large patterns, use LCP-based binary search to reduce to O(m + log n).
Common bug: pattern longer than suffix at mid leads to index error — always take min(m, n - SA[mid]) characters.
Key Takeaway
Binary search over suffix array finds all pattern occurrences in O(m log n).
The substring comparison s[i:i+m] is the bottleneck.
LCP-based search speeds this to O(m + log n).

The LCP Array

LCP[i] = length of the longest common prefix between SA[i] and SA[i-1]. Combined with SA, LCP enables answering many string queries in O(1) with O(n) preprocessing.

Applications
  • Longest repeated substring: max(LCP)
  • Number of distinct substrings: n*(n+1)/2
  • sum(LCP)
  • Longest common substring of two strings: concatenate with sentinel and find max LCP across boundary
Production Insight
Kasai's algorithm computes LCP in O(n) using the rank array.
If LCP array has negative values, you swapped SA indices with ranks.
In production, store LCP in an array of integers — 1 byte per integer is enough for most strings (max value < 255).
Key Takeaway
LCP array = longest common prefix between adjacent suffixes in SA.
Kasai's algorithm computes it in O(n).
LCP unlocks longest repeated substring, distinct substring count, and more.

Building the LCP Array — Kasai's Algorithm

Kasai's algorithm computes the LCP array given the suffix array and the string in O(n) time. For each suffix in original order, it finds the LCP with its adjacent suffix in SA and uses the property that LCP of next suffix is at least previous LCP - 1.

kasai_lcp.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
def lcp_array(s: str, sa: list[int]) -> list[int]:
    n = len(s)
    rank = [0] * n
    for i, pos in enumerate(sa):
        rank[pos] = i
    lcp = [0] * n  # lcp[i] = LCP between SA[i] and SA[i-1]; lcp[0] = 0
    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]
        while i + k < n and j + k < n and s[i+k] == s[j+k]:
            k += 1
        lcp[rank[i]] = k
        if k > 0:
            k -= 1
    return lcp

sa = suffix_array('banana')
lcp = lcp_array('banana', sa)
print(lcp)  # [0, 1, 3, 0, 0, 2]
Output
[0, 1, 3, 0, 0, 2]
Production Insight
Kasai's algorithm is O(n) but each while loop can execute O(n) total across all iterations.
If you forget to decrement k when a mismatch occurs, the algorithm becomes O(n²).
Memory: lcp array of ints is fine; but if n > 2³¹, use 64-bit integers.
Key Takeaway
Kasai's algorithm builds LCP in O(n) using the rank array.
The key invariant: LCP of next suffix in original order is at least previous LCP - 1.
Decrement k on mismatch to maintain correctness.

Suffix Array vs Suffix Tree: When to Use Which

Suffix trees and suffix arrays are both data structures for substring queries. Suffix trees are built in O(n) time and offer O(m) pattern matching, but have high memory overhead (20–30 bytes per character). Suffix arrays have lower constants, are simpler to implement, and can be built in O(n) with the SA-IS algorithm. In practice, suffix arrays are preferred for large strings (e.g., genomes), while suffix trees are used when O(m) query time is critical and memory is abundant.

Production Insight
For genome-sized strings (n = 3e9), suffix trees are often impractical — memory alone kills them.
Suffix arrays are the default in bioinformatics tools.
If you need O(m) queries on a static string with less memory, consider suffix arrays with LCP + RMQ for constant-time LCP queries.
Key Takeaway
Suffix arrays win on memory and simplicity.
Suffix trees win on asymptotic query time.
For modern large-scale problems, start with suffix arrays.
● Production incidentPOST-MORTEMseverity: high

Genome Aligner Runs Out of Memory on 3 Billion Bases

Symptom
Out-of-memory errors during genome indexing on a 64GB machine, even after swapping. The process was killed by the OOM killer.
Assumption
The team assumed suffix trees were the only efficient way to find all maximal repeats, inheriting a textbook algorithm without questioning memory bounds.
Root cause
Suffix trees store pointers for each character in the tree structure, which is roughly 20–30 bytes per node — for a string of length n, the tree has O(n) nodes but with high constants. With n=3.5 billion, the tree used ~45GB.
Fix
Replaced the suffix tree with a suffix array built via prefix doubling (O(n log n) time, O(n) space). The suffix array stored only 4-byte integers for each suffix, using ~14GB. Combined with the LCP array for repeat finding, total memory was under 16GB.
Key lesson
  • Suffix arrays are almost always the right choice for memory-constrained large DNA sequences.
  • Know your memory constants: suffix trees can be 3–5x heavier than suffix arrays for the same n.
  • Always benchmark space before committing to an algorithm at genome scale.
Production debug guideSymptom → Action4 entries
Symptom · 01
Suffix array is not sorted lexicographically (correctness test fails)
Fix
Check the key function in prefix doubling: ensure rank pairs are compared as tuples. Use a simple O(n² log n) implementation to verify against naive sort.
Symptom · 02
Binary search on suffix array returns wrong or missing pattern occurrences
Fix
Verify that the suffix array is built on the entire string (including sentinel). Test binary search with a small pattern where you know exact occurrences. Ensure comparison function uses string slicing not counting excessive characters.
Symptom · 03
LCP array values are incorrect (e.g., longest repeated substring gives wrong length)
Fix
Run Kasai's algorithm on a small string and manually compute LCP values. Common mistake: LCP[0] should be 0 (or undefined). Ensure Kasai's algorithm uses the rank array from the same suffix array.
Symptom · 04
Pattern matching performance degrades to O(n log n) per query
Fix
Check if binary search is comparing entire strings each time instead of using LCP to skip characters. Use the LCP-based binary search optimization for O(m + log n) queries.
★ Quick Debug Cheat Sheet for Suffix Array IssuesFive common failure modes and their immediate fixes.
Suffix array order is not lexicographic for strings with repeated patterns
Immediate action
Test with a known string like 'banana' and compare expected SA = [5,3,1,0,4,2]
Commands
print(suffix_array_naive('banana')) # compare with expected
# Verify rank updates in prefix doubling: print(rank after each iteration)
Fix now
Ensure your key function uses tuple (rank[i], rank[i+k]) correctly; off-by-one errors in k doubling are common.
LCP array has negative values or exceeds n+
Immediate action
Check the rank array: for each suffix index, rank[SA[i]] must be i
Commands
assert all(rank[SA[i]] == i for i in range(n))
Re-run Kasai's algorithm step by step on 'aba' which has LCP = [?, 1, 0] expected
Fix now
The LCP array length is n, with LCP[0] undefined (usually set to 0). Kasai's algorithm uses ranks, not SA indices.
Binary search for pattern finds zero occurrences even though pattern exists+
Immediate action
Check the base string: is the pattern present? Print all suffixes and manually find matches.
Commands
suffixes = [s[SA[i]:] for i in range(n)]; print(suffixes)
Check binary search bounds: ensure lo=0, hi=n , pattern compare uses first m chars
Fix now
If pattern is empty, binary search fails. Also ensure your comparison is s[SA[mid]:SA[mid]+m] >= pattern, not >.
Memory usage spikes during construction for large strings (n > 1e5)+
Immediate action
Use Python's __slots__ or prefer array('I') to avoid Python list overhead
Commands
from array import array; sa = array('I', range(n)) # 4-byte per element
Check intermediate data structures: do you store all suffixes as strings? Replace with integer indices and lazy comparisons.
Fix now
Prefix doubling only needs two integer arrays of size n (rank and temp rank). Drop the naive string-based suffix generation.
Construction takes more than O(n log n) time (e.g., O(n²))+
Immediate action
Profile: identify which step is slow. Often the sorting step (O(n log n)) is fine; the issue is comparison function.
Commands
import time; start = time.time(); result = suffix_array(s); print('Time:', time.time()-start)
Compare with naive O(n² log n) for small n; ensure prefix doubling uses radix sort for true O(n log n).
Fix now
Replace Python's built-in sort with counting sort on ranks (radix sort) to achieve O(n log n) with low constants.

Key takeaways

1
Suffix array = sorted starting positions of all suffixes
O(n log n) to build.
2
Pattern matching
binary search in O(m log n) per query.
3
LCP array enables longest repeated substring, distinct substring count in O(n).
4
Suffix arrays are the practical alternative to suffix trees
same power, less memory.
5
Kasai's algorithm computes the LCP array in O(n) from the suffix array.

Common mistakes to avoid

5 patterns
×

Using naive O(n² log n) construction for large n

Symptom
Construction takes hours or runs out of memory for strings > 10⁶ characters.
Fix
Implement prefix doubling or use an existing library (e.g., SA-IS in C). Always check n beforehand.
×

Forgetting to handle sentinel characters when concatenating strings

Symptom
Longest common substring between two strings finds false positives because suffixes cross the concatenation boundary without separator.
Fix
Insert a unique sentinel character (e.g., '$') between strings that is smaller than any real character. The sentinel ensures LCP across strings does not pollute results.
×

Assuming binary search on suffix array is O(log n) per comparison

Symptom
Pattern matching is slower than expected — each comparison may take O(m) time, leading to O(m log n) total.
Fix
Use LCP-based binary search to avoid full string comparisons in each step. Or if n is small, use TST (ternary search tree) for faster prefix search.
×

Confusing LCP array index with suffix array index

Symptom
LCP[i] is interpreted as the LCP of suffix at position i in original string instead of suffix at SA[i] and SA[i-1].
Fix
Remember: LCP[rank[i]] is the LCP of suffix starting at i with its predecessor in SA. Always use rank array to map original index to SA position.
×

Using too large data type for suffix array indices

Symptom
Memory blow-up for large n because each index stored as 8 bytes (Python int overhead) instead of 4 bytes.
Fix
Use array('I') or numpy array to store 32-bit integers. In Python, native ints are objects with significant overhead.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
What is a suffix array and how is it different from a suffix tree?
Q02SENIOR
How do you find the longest repeated substring using a suffix array?
Q03JUNIOR
What is the LCP array and what problems does it enable?
Q04JUNIOR
What is the time complexity of building a suffix array using prefix doub...
Q05SENIOR
Explain how Kasai's algorithm computes the LCP array in O(n). Describe t...
Q01 of 05SENIOR

What is a suffix array and how is it different from a suffix tree?

ANSWER
A suffix array is an array of integers giving the starting positions of all suffixes of a string, sorted in lexicographic order. A suffix tree is a compressed trie of all suffixes. The main differences: suffix arrays use O(n) space with small constants (4 bytes per character), while suffix trees use 20–30 bytes per character. Suffix trees support O(m) pattern matching, suffix arrays need O(m log n) with binary search (or O(m + log n) with LCP). Construction: suffix arrays can be built in O(n log n) easily, suffix trees require complex O(n) algorithms. In practice, suffix arrays are preferred for large strings due to memory and implementation simplicity.
FAQ · 3 QUESTIONS

Frequently Asked Questions

01
When should I use a suffix array vs a suffix tree?
02
Can suffix arrays be built in O(n) time?
03
How do I use suffix arrays for multiple pattern matching?
🔥

That's Strings. Mark it forged?

3 min read · try the examples if you haven't

Previous
Aho-Corasick Multi-Pattern String Matching
7 / 8 · Strings
Next
Rolling Hash and Polynomial String Hashing