Suffix Arrays — Why Suffix Trees Crashed a 64GB Genome Aligner
- Suffix array: sorted starting indices of all suffixes of a string
- Built in O(n log n) via prefix doubling or O(n) with SA-IS
- Pattern matching: binary search over suffixes in O(m log n)
- LCP array adds longest repeated substring and distinct substring count in O(n)
- Production insight: suffix arrays use 4x less memory than suffix trees for genome-scale data
Quick Debug Cheat Sheet for Suffix Array Issues
Symptom → Check → Action:
- Suffix array order is not lexicographic for strings with repeated patterns.
  Check: `print(suffix_array_naive('banana'))` and compare with the expected output.
  Action: verify rank updates in prefix doubling by printing `rank` after each iteration.
- LCP array has negative values or exceeds n.
  Check: `assert all(rank[SA[i]] == i for i in range(n))`.
  Action: re-run Kasai's algorithm step by step on 'aba', which has expected LCP = [0, 1, 0].
- Binary search for a pattern finds zero occurrences even though the pattern exists.
  Check: `suffixes = [s[SA[i]:] for i in range(n)]; print(suffixes)`.
  Action: check the binary search bounds: ensure lo=0, hi=n, and that the comparison uses only the first m characters of each suffix.
- Memory usage spikes during construction for large strings (n > 1e5).
  Check: `from array import array; sa = array('I', range(n))` — 4 bytes per element.
  Action: inspect intermediate data structures: if you store all suffixes as strings, replace them with integer indices and lazy comparisons.
- Construction takes more than O(n log n) time (e.g., O(n²)).
  Check: `import time; start = time.time(); result = suffix_array(s); print('Time:', time.time() - start)`.
  Action: compare with the naive O(n² log n) version for small n; ensure prefix doubling uses radix sort for true O(n log n).
Production Incident
In 2001, bioinformatics needed to align 3 billion base pairs (the human genome) against itself to find all repeated regions. A suffix tree would use 45GB of RAM. The suffix array solved the same problem in 4GB — and the algorithms to build and query it are cleaner to implement. That is the practical origin story of suffix arrays: a space-efficient replacement for suffix trees that turned bioinformatics from theoretically possible to computationally tractable.
Today, every modern genome aligner (BWA, Bowtie2, HISAT2) uses a suffix array variant called the FM-index. Competitive programming hard problems regularly involve suffix arrays. And the underlying ideas transfer directly to understanding database index structures and substring search engines.
What is a Suffix Array?
Take s = 'banana'. Its suffixes, listed by starting index:
- 0: banana
- 1: anana
- 2: nana
- 3: ana
- 4: na
- 5: a
The same suffixes sorted lexicographically:
- 5: a
- 3: ana
- 1: anana
- 0: banana
- 4: na
- 2: nana
Suffix Array SA = [5, 3, 1, 0, 4, 2]
```python
def suffix_array_naive(s: str) -> list[int]:
    """O(n² log n) — simple but correct."""
    n = len(s)
    suffixes = sorted(range(n), key=lambda i: s[i:])
    return suffixes

print(suffix_array_naive('banana'))       # [5, 3, 1, 0, 4, 2]
print(suffix_array_naive('mississippi'))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```
O(n log n) Construction — Prefix Doubling
The efficient construction uses prefix doubling (the Manber-Myers algorithm): sort suffixes by their first 2^k characters, doubling k each round. Each round reuses the ranking from the previous round, so a suffix's sort key is just a pair of integer ranks. With a comparison sort per round this is O(n log² n) overall; replacing it with radix sort gives the true O(n log n).
```python
def suffix_array(s: str) -> list[int]:
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]   # initial ranks: single-character codes
    tmp = [0] * n
    k = 1
    while k < n:
        # Sort key: (rank of first k chars, rank of the next k chars).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)   # comparison sort: O(n log² n) total; use radix sort for O(n log n)
        tmp[sa[0]] = 0
        for i in range(1, n):
            tmp[sa[i]] = tmp[sa[i - 1]] + (1 if key(sa[i]) != key(sa[i - 1]) else 0)
        rank = tmp[:]
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            break
        k *= 2
    return sa

print(suffix_array('banana'))       # [5, 3, 1, 0, 4, 2]
print(suffix_array('abracadabra'))  # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
```
Pattern Matching with Suffix Array
Because SA is sorted, binary search finds all occurrences of pattern P in O(m log n) time — compare P against SA[mid] using the first m characters.
```python
import bisect

def sa_search(s: str, sa: list[int], pattern: str) -> list[int]:
    m = len(pattern)
    # Compare the pattern against only the first m characters of each suffix.
    # bisect's key= parameter requires Python 3.10+.
    # (Materializing all suffixes as [s[i:] for i in sa] also works,
    #  but costs O(n²) memory.)
    lo = bisect.bisect_left(sa, pattern, key=lambda i: s[i:i + m])
    hi = bisect.bisect_right(sa, pattern, key=lambda i: s[i:i + m])
    return sorted(sa[lo:hi])   # occurrence positions in text order

sa = suffix_array('banana')
print(sa_search('banana', sa, 'ana'))  # [1, 3]
```
The LCP Array
LCP[i] = length of the longest common prefix between SA[i] and SA[i-1]. Combined with SA, LCP enables answering many string queries in O(1) with O(n) preprocessing.
- Longest repeated substring: max(LCP)
- Number of distinct substrings: n*(n+1)/2 - sum(LCP)
- Longest common substring of two strings: concatenate with sentinel and find max LCP across boundary
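The first two bullets can be demonstrated directly. This is a minimal sketch that reuses the naive construction for brevity (Kasai's algorithm in the next section builds the LCP array in O(n)); the variable names are illustrative:

```python
def suffix_array_naive(s: str) -> list[int]:
    # O(n² log n) construction, fine for a small demo
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s: str, sa: list[int]) -> list[int]:
    # Kasai's algorithm: lcp[i] = LCP of suffixes SA[i] and SA[i-1]
    n = len(s)
    rank = [0] * n
    for i, pos in enumerate(sa):
        rank[pos] = i
    lcp, k = [0] * n, 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        lcp[rank[i]] = k
        k = max(k - 1, 0)
    return lcp

s = 'banana'
sa = suffix_array_naive(s)
lcp = lcp_array(s, sa)
n = len(s)

# Longest repeated substring: read it off at the position of maximum LCP.
best = max(range(n), key=lambda i: lcp[i])
longest = s[sa[best]:sa[best] + lcp[best]]
print(longest)  # ana

# Distinct substrings: suffix SA[i] contributes len(suffix) - LCP[i] new
# substrings, which telescopes to n*(n+1)/2 - sum(LCP).
distinct = n * (n + 1) // 2 - sum(lcp)
print(distinct)  # 15
```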
Building the LCP Array — Kasai's Algorithm
Kasai's algorithm computes the LCP array from the suffix array and the string in O(n) time. It processes the suffixes in text order (position 0, 1, 2, …), computing each one's LCP with the suffix that precedes it in SA, and exploits the key invariant that when we move from suffix i to suffix i+1, the LCP can decrease by at most 1 — so each comparison picks up where the last one left off.
```python
def lcp_array(s: str, sa: list[int]) -> list[int]:
    n = len(s)
    rank = [0] * n
    for i, pos in enumerate(sa):
        rank[pos] = i
    lcp = [0] * n  # lcp[i] = LCP between SA[i] and SA[i-1]; lcp[0] = 0
    k = 0
    for i in range(n):         # suffixes in text order
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]    # suffix preceding suffix i in SA order
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        lcp[rank[i]] = k
        if k > 0:
            k -= 1             # invariant: next LCP is at least k - 1
    return lcp

sa = suffix_array('banana')
lcp = lcp_array('banana', sa)
print(lcp)  # [0, 1, 3, 0, 0, 2]
```
Suffix Array vs Suffix Tree: When to Use Which
Suffix trees and suffix arrays are both data structures for substring queries. Suffix trees are built in O(n) time and offer O(m) pattern matching, but have high memory overhead (20–30 bytes per character). Suffix arrays have lower constants, are simpler to implement, and can be built in O(n) with the SA-IS algorithm. In practice, suffix arrays are preferred for large strings (e.g., genomes), while suffix trees are used when O(m) query time is critical and memory is abundant.
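The memory gap can be sanity-checked with a rough back-of-envelope. The per-character costs below are assumptions, not measurements (4-byte suffix array entries, ~15 bytes per character for a pointer-based suffix tree), and compressed FM-index variants like those in BWA and Bowtie2 shrink the array side far further:

```python
# Back-of-envelope only: assumed costs of 4 bytes per SA entry and
# ~15 bytes/char of overhead for a pointer-based suffix tree.
n = 3_000_000_000  # human genome, base pairs
GiB = 2 ** 30
sa_gib = 4 * n / GiB
tree_gib = 15 * n / GiB
print(f"suffix array: {sa_gib:.1f} GiB")  # ~11.2 GiB
print(f"suffix tree:  {tree_gib:.1f} GiB")  # ~41.9 GiB
```

Under these assumptions the raw tree lands right around the 45GB figure from the production incident, while the raw array fits on commodity hardware.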
🎯 Key Takeaways
- Suffix array = sorted starting positions of all suffixes — O(n log n) to build.
- Pattern matching: binary search in O(m log n) per query.
- LCP array enables longest repeated substring, distinct substring count in O(n).
- Suffix arrays are the practical alternative to suffix trees — same power, less memory.
- Kasai's algorithm computes the LCP array in O(n) from the suffix array.
Interview Questions on This Topic
- Q: What is a suffix array and how is it different from a suffix tree? (Mid-level)
- Q: How do you find the longest repeated substring using a suffix array? (Mid-level)
- Q: What is the LCP array and what problems does it enable? (Junior)
- Q: What is the time complexity of building a suffix array using prefix doubling? (Junior)
- Q: Explain how Kasai's algorithm computes the LCP array in O(n). Describe the key invariant. (Senior)
Frequently Asked Questions
When should I use a suffix array vs a suffix tree?
Suffix arrays are preferred in practice: both structures take O(n) space, but suffix arrays have much smaller constants (roughly 4–8 bytes per character versus 20–30 for a suffix tree) and are easier to implement. Suffix trees have better theoretical query times for some problems but are rarely worth the implementation complexity.
Can suffix arrays be built in O(n) time?
Yes, the SA-IS algorithm (Suffix Array Induced Sorting) can build a suffix array in O(n) time and O(n) space. However, it is more complex to implement. For most applications, prefix doubling with O(n log n) is sufficient.
How do I use suffix arrays for multiple pattern matching?
For multiple patterns, you can perform binary search for each pattern — O(k m log n) total. If you have many patterns, consider using an Aho-Corasick automaton on the text or build a suffix array and use the LCP to quickly skip common prefixes.
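The repeated-binary-search approach above can be sketched as follows. This is a minimal, version-agnostic sketch using explicit binary searches (the `bisect` module's `key=` parameter would also work but needs Python 3.10+); `text` and the pattern list are illustrative:

```python
def suffix_array_naive(s: str) -> list[int]:
    # O(n² log n) construction, fine for a small demo
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_search(s: str, sa: list[int], pattern: str) -> list[int]:
    """All occurrence positions of pattern: two binary searches, O(m log n)."""
    m, n = len(pattern), len(sa)
    lo, hi = 0, n
    while lo < hi:                       # leftmost suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, n
    while lo < hi:                       # leftmost suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])

text = 'abracadabra'
sa = suffix_array_naive(text)            # build once, O(n log n) or better
for p in ['abra', 'cad', 'zzz']:         # then O(m log n) per pattern
    print(p, sa_search(text, sa, p))     # [0, 7], [4], []
```

Building the array once and amortizing it over all k patterns is what makes this competitive with per-pattern algorithms like KMP when k is large.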
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.