Burrows-Wheeler Transform β Reversible Rearrangement for Compression
- BWT rearranges text so similar contexts are adjacent β grouping characters before the same suffix.
- Forward: sort all rotations, take last column. Output last column + original row index.
- Inverse: recoverable because (first_col, last_col) pairs reconstruct the full rotation table.
bzip2 regularly outcompresses gzip by 10-30% on text data. The reason is the Burrows-Wheeler Transform β a reversible rearrangement that clusters similar contexts together, making subsequent compression dramatically more effective. Michael Burrows and David Wheeler published it in 1994 as a DEC Systems Research Center technical report, and it became the core of bzip2.
The BWT also has a second life in bioinformatics. The FM-index β a compressed data structure built on BWT β allows searching a 3-billion-base genome for any short read in milliseconds while using only 1-2 GB of RAM. The entire field of short-read genome alignment (BWA, Bowtie2, HISAT2 β the tools that process most DNA sequencing data) is built on Burrows-Wheeler.
Forward BWT
- Form all rotations of the string (append a unique end-of-string marker $)
- Sort all rotations lexicographically
- Output the last column (L) of the sorted matrix and the row index of the original string
def bwt(s: str) -> tuple[str, int]: """Compute BWT of s. Returns (transformed, original_row_index).""" s = s + '$' # append end marker (lexicographically smallest) n = len(s) rotations = sorted(s[i:] + s[:i] for i in range(n)) last_col = ''.join(r[-1] for r in rotations) original_idx = rotations.index(s) return last_col, original_idx result, idx = bwt('banana') print(f'BWT(banana) = {result!r}, original row = {idx}') result2, idx2 = bwt('mississippi') print(f'BWT(mississippi) = {result2!r}, original row = {idx2}')
BWT(mississippi) = 'ipssm$pissii', original row = 2
Why BWT Helps Compression
Notice BWT(banana) = 'annb$aa' β the 'a's and 'n's are grouped. This happens because rotations starting with 'an' are sorted together, and those all have the character preceding 'an' in their last column.
In English text, the character before 'an' is usually 'b' (ban), 'c' (can), 'm' (man), 'p' (pan), etc. β similar characters cluster. bzip2 achieves 10-30% better ratio than gzip on English text because of this clustering.
Inverse BWT
Recovery uses the key insight: sorting the last column L gives the first column F. Since each row is a rotation, (F[i], L[i]) are consecutive characters.
def ibwt(last: str, orig_idx: int) -> str: """Reconstruct original string from BWT.""" n = len(last) # Build first column by sorting last first = sorted(last) # LF mapping: last[i] is the character just before first[i] # Count occurrences to build precise mapping from collections import defaultdict table = [''] * n for _ in range(n): table = [last[i] + table[i] for i in range(n)] table = sorted(table) return table[orig_idx].rstrip('$') print(ibwt('annb$aa', 3)) # banana print(ibwt('ipssm$pissii', 2)) # mississippi
mississippi
BWT Pipeline β bzip2
bzip2 uses BWT as the first stage of a multi-step pipeline: 1. BWT: Rearrange for character clustering 2. Move-to-Front (MTF): Encode characters by their position in a recently-seen list β clusters become runs of small numbers 3. RLE: Compress runs of zeros (from MTF) 4. Huffman coding: Entropy-code the result
This pipeline achieves better compression than gzip (DEFLATE) on typical text at the cost of more memory and slower speed.
π― Key Takeaways
- BWT rearranges text so similar contexts are adjacent β grouping characters before the same suffix.
- Forward: sort all rotations, take last column. Output last column + original row index.
- Inverse: recoverable because (first_col, last_col) pairs reconstruct the full rotation table.
- Used in bzip2 pipeline: BWT β MTF β RLE β Huffman. Beats gzip on text by 10-30%.
- BWT is also used in bioinformatics for genome indexing (FM-index, BWA aligner).
Interview Questions on This Topic
- QWhy does BWT improve compression compared to compressing the original string directly?
- QHow is the BWT reversed β what information do you need?
- QWhat is the Move-to-Front transform and why does it follow BWT in bzip2?
- QHow is BWT used in genomic sequence alignment?
Frequently Asked Questions
Is bzip2 better than gzip?
On typical text, bzip2 compresses 10-30% better than gzip. However, gzip (DEFLATE) is 3-5x faster and uses less memory. For maximum ratio: bzip2 or xz (LZMA). For speed+ratio balance: zstd. For compatibility: gzip.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.