Suffix Array — Construction and Applications
- Suffix array = sorted starting positions of all suffixes — O(n log n) to build.
- Pattern matching: binary search in O(m log n) per query.
- LCP array enables longest repeated substring, distinct substring count in O(n).
In 2001, bioinformatics needed to align 3 billion base pairs (the human genome) against itself to find all repeated regions. A suffix tree would use 45GB of RAM. The suffix array solved the same problem in 4GB — and the algorithms to build and query it are cleaner to implement. That is the practical origin story of suffix arrays: a space-efficient replacement for suffix trees that turned bioinformatics from theoretically possible to computationally tractable.
Today, every modern genome aligner (BWA, Bowtie2, HISAT2) uses a suffix array variant called the FM-index. Competitive programming hard problems regularly involve suffix arrays. And the underlying ideas transfer directly to understanding database index structures and substring search engines.
What is a Suffix Array?
For string S = 'banana', the suffixes are: - 0: banana - 1: anana - 2: nana - 3: ana - 4: na - 5: a
Sorted lexicographically: - 5: a - 3: ana - 1: anana - 0: banana - 4: na - 2: nana
Suffix Array SA = [5, 3, 1, 0, 4, 2]
def suffix_array_naive(s: str) -> list[int]: """O(n² log n) — simple but correct.""" n = len(s) suffixes = sorted(range(n), key=lambda i: s[i:]) return suffixes print(suffix_array_naive('banana')) # [5, 3, 1, 0, 4, 2] print(suffix_array_naive('mississippi')) # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
[10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
O(n log n) Construction — Prefix Doubling
The efficient construction uses prefix doubling (Manber-Myers algorithm): sort suffixes by their first 2^k characters, doubling k each round. Each round uses the ranking from the previous round, enabling O(n log n) total.
def suffix_array(s: str) -> list[int]: n = len(s) sa = list(range(n)) rank = [ord(c) for c in s] tmp = [0] * n k = 1 while k < n: def key(i): return (rank[i], rank[i+k] if i+k < n else -1) sa.sort(key=key) tmp[sa[0]] = 0 for i in range(1, n): tmp[sa[i]] = tmp[sa[i-1]] + (1 if key(sa[i]) != key(sa[i-1]) else 0) rank = tmp[:] if rank[sa[-1]] == n - 1: break k *= 2 return sa print(suffix_array('banana')) # [5, 3, 1, 0, 4, 2] print(suffix_array('abracadabra')) # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
[10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
Pattern Matching with Suffix Array
Because SA is sorted, binary search finds all occurrences of pattern P in O(m log n) time — compare P against SA[mid] using the first m characters.
import bisect def sa_search(s: str, sa: list[int], pattern: str) -> list[int]: m = len(pattern) # Find leftmost position where suffix >= pattern lo = bisect.bisect_left(sa, 0, key=lambda i: s[i:i+m] >= pattern) hi = bisect.bisect_right(sa, 0, key=lambda i: s[i:i+m] > pattern) # Simpler approach: suffixes = [s[i:] for i in sa] lo = bisect.bisect_left(suffixes, pattern) hi = bisect.bisect_right(suffixes, pattern + chr(255)) return sorted(sa[lo:hi]) sa = suffix_array('banana') print(sa_search('banana', sa, 'ana')) # [1, 3]
The LCP Array
LCP[i] = length of the longest common prefix between SA[i] and SA[i-1]. Combined with SA, LCP enables answering many string queries in O(1) with O(n) preprocessing.
Applications: - Longest repeated substring: max(LCP) - Number of distinct substrings: n*(n+1)/2 - sum(LCP) - Longest common substring of two strings: concatenate with sentinel and find max LCP across boundary
🎯 Key Takeaways
- Suffix array = sorted starting positions of all suffixes — O(n log n) to build.
- Pattern matching: binary search in O(m log n) per query.
- LCP array enables longest repeated substring, distinct substring count in O(n).
- Suffix arrays are the practical alternative to suffix trees — same power, less memory.
- Kasai's algorithm computes the LCP array in O(n) from the suffix array.
Interview Questions on This Topic
- QWhat is a suffix array and how is it different from a suffix tree?
- QHow do you find the longest repeated substring using a suffix array?
- QWhat is the LCP array and what problems does it enable?
- QWhat is the time complexity of building a suffix array using prefix doubling?
Frequently Asked Questions
When should I use a suffix array vs a suffix tree?
Suffix arrays are preferred in practice — they use O(n) space vs O(n) for suffix trees but with much smaller constants, and are easier to implement. Suffix trees have better theoretical query times for some problems but are rarely worth the implementation complexity.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.