Suffix Array—Why Suffix Trees Crashed 64GB Genome Aligner
Suffix trees consumed 45GB for 3.
- Suffix array: sorted starting indices of all suffixes of a string
- Built in O(n log n) via prefix doubling or O(n) with SA-IS
- Pattern matching: binary search over suffixes in O(m log n)
- LCP array adds longest repeated substring and distinct substring count in O(n)
- Production insight: suffix arrays use 4x less memory than suffix trees for genome-scale data
A suffix array is simply all suffixes of a string sorted alphabetically, with their starting positions stored. It sounds trivial — but this sorted structure enables you to binary-search for any pattern in O(m log n) time, find the longest repeated substring, count distinct substrings, and solve dozens of hard string problems efficiently.
In 2001, bioinformatics needed to align 3 billion base pairs (the human genome) against itself to find all repeated regions. A suffix tree would use 45GB of RAM. The suffix array solved the same problem in 4GB — and the algorithms to build and query it are cleaner to implement. That is the practical origin story of suffix arrays: a space-efficient replacement for suffix trees that turned bioinformatics from theoretically possible to computationally tractable.
Today, every modern genome aligner (BWA, Bowtie2, HISAT2) uses a suffix array variant called the FM-index. Competitive programming hard problems regularly involve suffix arrays. And the underlying ideas transfer directly to understanding database index structures and substring search engines.
What is a Suffix Array?
- 0: banana
- 1: anana
- 2: nana
- 3: ana
- 4: na
- 5: a
- 5: a
- 3: ana
- 1: anana
- 0: banana
- 4: na
- 2: nana
Suffix Array SA = [5, 3, 1, 0, 4, 2]
O(n log n) Construction — Prefix Doubling
The efficient construction uses prefix doubling (Manber-Myers algorithm): sort suffixes by their first 2^k characters, doubling k each round. Each round uses the ranking from the previous round, enabling O(n log n) total.
Pattern Matching with Suffix Array
Because SA is sorted, binary search finds all occurrences of pattern P in O(m log n) time — compare P against SA[mid] using the first m characters.
The LCP Array
LCP[i] = length of the longest common prefix between SA[i] and SA[i-1]. Combined with SA, LCP enables answering many string queries in O(1) with O(n) preprocessing.
- Longest repeated substring: max(LCP)
- Number of distinct substrings: n*(n+1)/2
- sum(LCP)
- Longest common substring of two strings: concatenate with sentinel and find max LCP across boundary
Building the LCP Array — Kasai's Algorithm
Kasai's algorithm computes the LCP array given the suffix array and the string in O(n) time. For each suffix in original order, it finds the LCP with its adjacent suffix in SA and uses the property that LCP of next suffix is at least previous LCP - 1.
Suffix Array vs Suffix Tree: When to Use Which
Suffix trees and suffix arrays are both data structures for substring queries. Suffix trees are built in O(n) time and offer O(m) pattern matching, but have high memory overhead (20–30 bytes per character). Suffix arrays have lower constants, are simpler to implement, and can be built in O(n) with the SA-IS algorithm. In practice, suffix arrays are preferred for large strings (e.g., genomes), while suffix trees are used when O(m) query time is critical and memory is abundant.
Genome Aligner Runs Out of Memory on 3 Billion Bases
- Suffix arrays are almost always the right choice for memory-constrained large DNA sequences.
- Know your memory constants: suffix trees can be 3–5x heavier than suffix arrays for the same n.
- Always benchmark space before committing to an algorithm at genome scale.
Key takeaways
Common mistakes to avoid
5 patternsUsing naive O(n² log n) construction for large n
Forgetting to handle sentinel characters when concatenating strings
Assuming binary search on suffix array is O(log n) per comparison
Confusing LCP array index with suffix array index
Using too large data type for suffix array indices
Interview Questions on This Topic
What is a suffix array and how is it different from a suffix tree?
Frequently Asked Questions
That's Strings. Mark it forged?
3 min read · try the examples if you haven't