Skip to content
Home DSA Suffix Array—Why Suffix Trees Crashed 64GB Genome Aligner

Suffix Array—Why Suffix Trees Crashed 64GB Genome Aligner

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Strings → Topic 7 of 8
Suffix trees consumed 45GB for 3.
🔥 Advanced — solid DSA foundation required
In this tutorial, you'll learn
Suffix trees consumed 45GB for 3.
  • Suffix array = sorted starting positions of all suffixes — O(n log n) to build.
  • Pattern matching: binary search in O(m log n) per query.
  • LCP array enables longest repeated substring, distinct substring count in O(n).
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Suffix array: sorted starting indices of all suffixes of a string
  • Built in O(n log n) via prefix doubling or O(n) with SA-IS
  • Pattern matching: binary search over suffixes in O(m log n)
  • LCP array adds longest repeated substring and distinct substring count in O(n)
  • Production insight: suffix arrays use 4x less memory than suffix trees for genome-scale data
🚨 START HERE

Quick Debug Cheat Sheet for Suffix Array Issues

Five common failure modes and their immediate fixes.
🟡

Suffix array order is not lexicographic for strings with repeated patterns

Immediate ActionTest with a known string like 'banana' and compare expected SA = [5,3,1,0,4,2]
Commands
print(suffix_array_naive('banana')) # compare with expected
# Verify rank updates in prefix doubling: print(rank after each iteration)
Fix NowEnsure your key function uses tuple (rank[i], rank[i+k]) correctly; off-by-one errors in k doubling are common.
🟡

LCP array has negative values or exceeds n

Immediate ActionCheck the rank array: for each suffix index, rank[SA[i]] must be i
Commands
assert all(rank[SA[i]] == i for i in range(n))
Re-run Kasai's algorithm step by step on 'aba' which has LCP = [?, 1, 0] expected
Fix NowThe LCP array length is n, with LCP[0] undefined (usually set to 0). Kasai's algorithm uses ranks, not SA indices.
🟡

Binary search for pattern finds zero occurrences even though pattern exists

Immediate ActionCheck the base string: is the pattern present? Print all suffixes and manually find matches.
Commands
suffixes = [s[SA[i]:] for i in range(n)]; print(suffixes)
Check binary search bounds: ensure lo=0, hi=n , pattern compare uses first m chars
Fix NowIf pattern is empty, binary search fails. Also ensure your comparison is s[SA[mid]:SA[mid]+m] >= pattern, not >.
🟠

Memory usage spikes during construction for large strings (n > 1e5)

Immediate ActionUse Python's __slots__ or prefer array('I') to avoid Python list overhead
Commands
from array import array; sa = array('I', range(n)) # 4-byte per element
Check intermediate data structures: do you store all suffixes as strings? Replace with integer indices and lazy comparisons.
Fix NowPrefix doubling only needs two integer arrays of size n (rank and temp rank). Drop the naive string-based suffix generation.
🟡

Construction takes more than O(n log n) time (e.g., O(n²))

Immediate ActionProfile: identify which step is slow. Often the sorting step (O(n log n)) is fine; the issue is comparison function.
Commands
import time; start = time.time(); result = suffix_array(s); print('Time:', time.time()-start)
Compare with naive O(n² log n) for small n; ensure prefix doubling uses radix sort for true O(n log n).
Fix NowReplace Python's built-in sort with counting sort on ranks (radix sort) to achieve O(n log n) with low constants.
Production Incident

Genome Aligner Runs Out of Memory on 3 Billion Bases

A bioinformatics team's suffix tree implementation consumed 45GB RAM for the human genome, crashing on the cluster.
SymptomOut-of-memory errors during genome indexing on a 64GB machine, even after swapping. The process was killed by the OOM killer.
AssumptionThe team assumed suffix trees were the only efficient way to find all maximal repeats, inheriting a textbook algorithm without questioning memory bounds.
Root causeSuffix trees store pointers for each character in the tree structure, which is roughly 20–30 bytes per node — for a string of length n, the tree has O(n) nodes but with high constants. With n=3.5 billion, the tree used ~45GB.
FixReplaced the suffix tree with a suffix array built via prefix doubling (O(n log n) time, O(n) space). The suffix array stored only 4-byte integers for each suffix, using ~14GB. Combined with the LCP array for repeat finding, total memory was under 16GB.
Key Lesson
Suffix arrays are almost always the right choice for memory-constrained large DNA sequences.Know your memory constants: suffix trees can be 3–5x heavier than suffix arrays for the same n.Always benchmark space before committing to an algorithm at genome scale.
Production Debug Guide

Symptom → Action

Suffix array is not sorted lexicographically (correctness test fails)Check the key function in prefix doubling: ensure rank pairs are compared as tuples. Use a simple O(n² log n) implementation to verify against naive sort.
Binary search on suffix array returns wrong or missing pattern occurrencesVerify that the suffix array is built on the entire string (including sentinel). Test binary search with a small pattern where you know exact occurrences. Ensure comparison function uses string slicing not counting excessive characters.
LCP array values are incorrect (e.g., longest repeated substring gives wrong length)Run Kasai's algorithm on a small string and manually compute LCP values. Common mistake: LCP[0] should be 0 (or undefined). Ensure Kasai's algorithm uses the rank array from the same suffix array.
Pattern matching performance degrades to O(n log n) per queryCheck if binary search is comparing entire strings each time instead of using LCP to skip characters. Use the LCP-based binary search optimization for O(m + log n) queries.

In 2001, bioinformatics needed to align 3 billion base pairs (the human genome) against itself to find all repeated regions. A suffix tree would use 45GB of RAM. The suffix array solved the same problem in 4GB — and the algorithms to build and query it are cleaner to implement. That is the practical origin story of suffix arrays: a space-efficient replacement for suffix trees that turned bioinformatics from theoretically possible to computationally tractable.

Today, every modern genome aligner (BWA, Bowtie2, HISAT2) uses a suffix array variant called the FM-index. Competitive programming hard problems regularly involve suffix arrays. And the underlying ideas transfer directly to understanding database index structures and substring search engines.

What is a Suffix Array?

For string S = 'banana', the suffixes are
  • 0: banana
  • 1: anana
  • 2: nana
  • 3: ana
  • 4: na
  • 5: a
Sorted lexicographically
  • 5: a
  • 3: ana
  • 1: anana
  • 0: banana
  • 4: na
  • 2: nana

Suffix Array SA = [5, 3, 1, 0, 4, 2]

suffix_array_naive.py · PYTHON
12345678
def suffix_array_naive(s: str) -> list[int]:
    """O(n² log n) — simple but correct."""
    n = len(s)
    suffixes = sorted(range(n), key=lambda i: s[i:])
    return suffixes

print(suffix_array_naive('banana'))  # [5, 3, 1, 0, 4, 2]
print(suffix_array_naive('mississippi'))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
▶ Output
[5, 3, 1, 0, 4, 2]
[10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
📊 Production Insight
The naive O(n² log n) construction is fine for n < 10⁴.
For DNA sequences (n ~ 10⁸), this will never finish.
Rule: always estimate n before choosing construction algorithm.
🎯 Key Takeaway
Suffix array = sorted starting positions of all suffixes.
Naive O(n² log n) is a correctness baseline only.
Use prefix doubling for real problems.

O(n log n) Construction — Prefix Doubling

The efficient construction uses prefix doubling (Manber-Myers algorithm): sort suffixes by their first 2^k characters, doubling k each round. Each round uses the ranking from the previous round, enabling O(n log n) total.

suffix_array_nlogn.py · PYTHON
123456789101112131415161718192021
def suffix_array(s: str) -> list[int]:
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    tmp = [0] * n
    k = 1
    while k < n:
        def key(i):
            return (rank[i], rank[i+k] if i+k < n else -1)
        sa.sort(key=key)
        tmp[sa[0]] = 0
        for i in range(1, n):
            tmp[sa[i]] = tmp[sa[i-1]] + (1 if key(sa[i]) != key(sa[i-1]) else 0)
        rank = tmp[:]
        if rank[sa[-1]] == n - 1:
            break
        k *= 2
    return sa

print(suffix_array('banana'))       # [5, 3, 1, 0, 4, 2]
print(suffix_array('abracadabra'))  # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
▶ Output
[5, 3, 1, 0, 4, 2]
[10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
📊 Production Insight
Sorting by (rank[i], rank[i+k]) is O(log n) rounds of O(n log n) sorting.
In Python, built-in sort with tuple keys adds overhead — use radix sort for large n.
Tip: early break when all ranks are distinct saves rounds.
🎯 Key Takeaway
Prefix doubling: O(n log n) time, O(n) space.
Each round sorts suffixes by first 2^k characters.
Break early when ranks are unique — no further sorting needed.

Pattern Matching with Suffix Array

Because SA is sorted, binary search finds all occurrences of pattern P in O(m log n) time — compare P against SA[mid] using the first m characters.

sa_search.py · PYTHON
123456789101112131415
import bisect

def sa_search(s: str, sa: list[int], pattern: str) -> list[int]:
    m = len(pattern)
    # Find leftmost position where suffix >= pattern
    lo = bisect.bisect_left(sa, 0, key=lambda i: s[i:i+m] >= pattern)
    hi = bisect.bisect_right(sa, 0, key=lambda i: s[i:i+m] > pattern)
    # Simpler approach:
    suffixes = [s[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + chr(255))
    return sorted(sa[lo:hi])

sa = suffix_array('banana')
print(sa_search('banana', sa, 'ana'))  # [1, 3]
▶ Output
[1, 3]
📊 Production Insight
Using Python's bisect with string comparison creates O(m) per comparison — O(m log n) total.
For large patterns, use LCP-based binary search to reduce to O(m + log n).
Common bug: pattern longer than suffix at mid leads to index error — always take min(m, n - SA[mid]) characters.
🎯 Key Takeaway
Binary search over suffix array finds all pattern occurrences in O(m log n).
The substring comparison s[i:i+m] is the bottleneck.
LCP-based search speeds this to O(m + log n).

The LCP Array

LCP[i] = length of the longest common prefix between SA[i] and SA[i-1]. Combined with SA, LCP enables answering many string queries in O(1) with O(n) preprocessing.

Applications
  • Longest repeated substring: max(LCP)
  • Number of distinct substrings: n*(n+1)/2
  • sum(LCP)
  • Longest common substring of two strings: concatenate with sentinel and find max LCP across boundary
📊 Production Insight
Kasai's algorithm computes LCP in O(n) using the rank array.
If LCP array has negative values, you swapped SA indices with ranks.
In production, store LCP in an array of integers — 1 byte per integer is enough for most strings (max value < 255).
🎯 Key Takeaway
LCP array = longest common prefix between adjacent suffixes in SA.
Kasai's algorithm computes it in O(n).
LCP unlocks longest repeated substring, distinct substring count, and more.

Building the LCP Array — Kasai's Algorithm

Kasai's algorithm computes the LCP array given the suffix array and the string in O(n) time. For each suffix in original order, it finds the LCP with its adjacent suffix in SA and uses the property that LCP of next suffix is at least previous LCP - 1.

kasai_lcp.py · PYTHON
12345678910111213141516171819202122
def lcp_array(s: str, sa: list[int]) -> list[int]:
    n = len(s)
    rank = [0] * n
    for i, pos in enumerate(sa):
        rank[pos] = i
    lcp = [0] * n  # lcp[i] = LCP between SA[i] and SA[i-1]; lcp[0] = 0
    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]
        while i + k < n and j + k < n and s[i+k] == s[j+k]:
            k += 1
        lcp[rank[i]] = k
        if k > 0:
            k -= 1
    return lcp

sa = suffix_array('banana')
lcp = lcp_array('banana', sa)
print(lcp)  # [0, 1, 3, 0, 0, 2]
▶ Output
[0, 1, 3, 0, 0, 2]
📊 Production Insight
Kasai's algorithm is O(n) but each while loop can execute O(n) total across all iterations.
If you forget to decrement k when a mismatch occurs, the algorithm becomes O(n²).
Memory: lcp array of ints is fine; but if n > 2³¹, use 64-bit integers.
🎯 Key Takeaway
Kasai's algorithm builds LCP in O(n) using the rank array.
The key invariant: LCP of next suffix in original order is at least previous LCP - 1.
Decrement k on mismatch to maintain correctness.

Suffix Array vs Suffix Tree: When to Use Which

Suffix trees and suffix arrays are both data structures for substring queries. Suffix trees are built in O(n) time and offer O(m) pattern matching, but have high memory overhead (20–30 bytes per character). Suffix arrays have lower constants, are simpler to implement, and can be built in O(n) with the SA-IS algorithm. In practice, suffix arrays are preferred for large strings (e.g., genomes), while suffix trees are used when O(m) query time is critical and memory is abundant.

📊 Production Insight
For genome-sized strings (n = 3e9), suffix trees are often impractical — memory alone kills them.
Suffix arrays are the default in bioinformatics tools.
If you need O(m) queries on a static string with less memory, consider suffix arrays with LCP + RMQ for constant-time LCP queries.
🎯 Key Takeaway
Suffix arrays win on memory and simplicity.
Suffix trees win on asymptotic query time.
For modern large-scale problems, start with suffix arrays.

🎯 Key Takeaways

  • Suffix array = sorted starting positions of all suffixes — O(n log n) to build.
  • Pattern matching: binary search in O(m log n) per query.
  • LCP array enables longest repeated substring, distinct substring count in O(n).
  • Suffix arrays are the practical alternative to suffix trees — same power, less memory.
  • Kasai's algorithm computes the LCP array in O(n) from the suffix array.

⚠ Common Mistakes to Avoid

    Using naive O(n² log n) construction for large n
    Symptom

    Construction takes hours or runs out of memory for strings > 10⁶ characters.

    Fix

    Implement prefix doubling or use an existing library (e.g., SA-IS in C). Always check n beforehand.

    Forgetting to handle sentinel characters when concatenating strings
    Symptom

    Longest common substring between two strings finds false positives because suffixes cross the concatenation boundary without separator.

    Fix

    Insert a unique sentinel character (e.g., '$') between strings that is smaller than any real character. The sentinel ensures LCP across strings does not pollute results.

    Assuming binary search on suffix array is O(log n) per comparison
    Symptom

    Pattern matching is slower than expected — each comparison may take O(m) time, leading to O(m log n) total.

    Fix

    Use LCP-based binary search to avoid full string comparisons in each step. Or if n is small, use TST (ternary search tree) for faster prefix search.

    Confusing LCP array index with suffix array index
    Symptom

    LCP[i] is interpreted as the LCP of suffix at position i in original string instead of suffix at SA[i] and SA[i-1].

    Fix

    Remember: LCP[rank[i]] is the LCP of suffix starting at i with its predecessor in SA. Always use rank array to map original index to SA position.

    Using too large data type for suffix array indices
    Symptom

    Memory blow-up for large n because each index stored as 8 bytes (Python int overhead) instead of 4 bytes.

    Fix

    Use array('I') or numpy array to store 32-bit integers. In Python, native ints are objects with significant overhead.

Interview Questions on This Topic

  • QWhat is a suffix array and how is it different from a suffix tree?Mid-levelReveal
    A suffix array is an array of integers giving the starting positions of all suffixes of a string, sorted in lexicographic order. A suffix tree is a compressed trie of all suffixes. The main differences: suffix arrays use O(n) space with small constants (4 bytes per character), while suffix trees use 20–30 bytes per character. Suffix trees support O(m) pattern matching, suffix arrays need O(m log n) with binary search (or O(m + log n) with LCP). Construction: suffix arrays can be built in O(n log n) easily, suffix trees require complex O(n) algorithms. In practice, suffix arrays are preferred for large strings due to memory and implementation simplicity.
  • QHow do you find the longest repeated substring using a suffix array?Mid-levelReveal
    Build the suffix array and LCP array. The longest repeated substring corresponds to the maximum value in the LCP array. Each LCP[i] gives the length of the longest common prefix between SA[i] and SA[i-1] — the two suffixes that share that prefix. The substring itself is s[SA[i]:SA[i]+LCP[i]]. This runs in O(n) after construction.
  • QWhat is the LCP array and what problems does it enable?JuniorReveal
    The LCP (Longest Common Prefix) array is an array where LCP[i] = length of LCP between suffix at SA[i] and suffix at SA[i-1]. It enables: longest repeated substring (max LCP), number of distinct substrings (n*(n+1)/2 - sum(LCP)), longest common substring of two strings (using sentinel concatenation), and efficient prefix matching with RMQ.
  • QWhat is the time complexity of building a suffix array using prefix doubling?JuniorReveal
    O(n log n) worst-case. Each doubling step sorts n elements by a pair of integer ranks using O(n log n) comparison-based sort, leading to O(n log² n) if naive. With radix sort, each step is O(n), total O(n log n). Space: O(n) for the suffix array and rank arrays.
  • QExplain how Kasai's algorithm computes the LCP array in O(n). Describe the key invariant.SeniorReveal
    Kasai's algorithm uses the rank array (inverse of SA). It iterates through original string positions i = 0 to n-1. For each i, it finds the previous suffix in SA using rank[i] - 1, computes LCP by comparing characters starting from current k (previously computed LCP minus one). Key invariant: LCP[i+1] >= LCP[i] - 1. This ensures the total number of character comparisons is O(n).

Frequently Asked Questions

When should I use a suffix array vs a suffix tree?

Suffix arrays are preferred in practice — they use O(n) space vs O(n) for suffix trees but with much smaller constants, and are easier to implement. Suffix trees have better theoretical query times for some problems but are rarely worth the implementation complexity.

Can suffix arrays be built in O(n) time?

Yes, the SA-IS algorithm (Suffix Array Induced Sorting) can build a suffix array in O(n) time and O(n) space. However, it is more complex to implement. For most applications, prefix doubling with O(n log n) is sufficient.

How do I use suffix arrays for multiple pattern matching?

For multiple patterns, you can perform binary search for each pattern — O(k m log n) total. If you have many patterns, consider using an Aho-Corasick automaton on the text or build a suffix array and use the LCP to quickly skip common prefixes.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousAho-Corasick Multi-Pattern String MatchingNext →Rolling Hash and Polynomial String Hashing
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged