DSA Advanced

Burrows-Wheeler Transform — Reversible Rearrangement for Compression

📅 March 24, 2026 ⏱ 8 min read 🎯 Advanced

Where developers are forged. · Structured learning · Free forever.

📍 Part of: Data Compression → Topic 3 of 4

Learn the Burrows-Wheeler Transform (BWT) used in bzip2.

🔥 Advanced — solid DSA foundation required

In this tutorial, you'll learn:

BWT rearranges text so similar contexts are adjacent — grouping characters before the same suffix.
Forward: sort all rotations, take last column. Output last column + original row index.
Inverse: recoverable because (first_col, last_col) pairs reconstruct the full rotation table.

✦ Plain-English analogy ✦ Real code with output ✦ Interview questions

⚡ Quick Answer

The Burrows-Wheeler Transform is a magic trick: take a string, rearrange it in a specific way, and the result looks completely different but can be perfectly reversed. More importantly, the rearranged version has runs of similar characters — making it much easier to compress. It is the secret behind bzip2, which often beats gzip on text.

bzip2 regularly outcompresses gzip by 10-30% on text data. The reason is the Burrows-Wheeler Transform — a reversible rearrangement that clusters similar contexts together, making subsequent compression dramatically more effective. Michael Burrows and David Wheeler published it in 1994 as a DEC Systems Research Center technical report, and it became the core of bzip2.

The BWT also has a second life in bioinformatics. The FM-index — a compressed data structure built on BWT — allows searching a 3-billion-base genome for any short read in milliseconds while using only 1-2 GB of RAM. The entire field of short-read genome alignment (BWA, Bowtie2, HISAT2 — the tools that process most DNA sequencing data) is built on Burrows-Wheeler.

Forward BWT

Form all rotations of the string (append a unique end-of-string marker $)
Sort all rotations lexicographically
Output the last column (L) of the sorted matrix and the row index of the original string

bwt_forward.py · PYTHON

12345678910111213

def bwt(s: str) -> tuple[str, int]:
    """Compute BWT of s. Returns (transformed, original_row_index)."""
    s = s + '$'  # append end marker (lexicographically smallest)
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    last_col = ''.join(r[-1] for r in rotations)
    original_idx = rotations.index(s)
    return last_col, original_idx

result, idx = bwt('banana')
print(f'BWT(banana) = {result!r}, original row = {idx}')
result2, idx2 = bwt('mississippi')
print(f'BWT(mississippi) = {result2!r}, original row = {idx2}')

▶ Output

BWT(banana) = 'annb$aa', original row = 3
BWT(mississippi) = 'ipssm$pissii', original row = 2

Why BWT Helps Compression

Notice BWT(banana) = 'annb$aa' — the 'a's and 'n's are grouped. This happens because rotations starting with 'an' are sorted together, and those all have the character preceding 'an' in their last column.

In English text, the character before 'an' is usually 'b' (ban), 'c' (can), 'm' (man), 'p' (pan), etc. — similar characters cluster. bzip2 achieves 10-30% better ratio than gzip on English text because of this clustering.

Inverse BWT

Recovery uses the key insight: sorting the last column L gives the first column F. Since each row is a rotation, (F[i], L[i]) are consecutive characters.

bwt_inverse.py · PYTHON

12345678910111213141516

def ibwt(last: str, orig_idx: int) -> str:
    """Reconstruct original string from BWT."""
    n = len(last)
    # Build first column by sorting last
    first = sorted(last)
    # LF mapping: last[i] is the character just before first[i]
    # Count occurrences to build precise mapping
    from collections import defaultdict
    table = [''] * n
    for _ in range(n):
        table = [last[i] + table[i] for i in range(n)]
        table = sorted(table)
    return table[orig_idx].rstrip('$')

print(ibwt('annb$aa', 3))        # banana
print(ibwt('ipssm$pissii', 2))  # mississippi

▶ Output

banana
mississippi

BWT Pipeline — bzip2

bzip2 uses BWT as the first stage of a multi-step pipeline: 1. BWT: Rearrange for character clustering 2. Move-to-Front (MTF): Encode characters by their position in a recently-seen list — clusters become runs of small numbers 3. RLE: Compress runs of zeros (from MTF) 4. Huffman coding: Entropy-code the result

This pipeline achieves better compression than gzip (DEFLATE) on typical text at the cost of more memory and slower speed.

🎯 Key Takeaways

BWT rearranges text so similar contexts are adjacent — grouping characters before the same suffix.
Forward: sort all rotations, take last column. Output last column + original row index.
Inverse: recoverable because (first_col, last_col) pairs reconstruct the full rotation table.
Used in bzip2 pipeline: BWT → MTF → RLE → Huffman. Beats gzip on text by 10-30%.
BWT is also used in bioinformatics for genome indexing (FM-index, BWA aligner).

Interview Questions on This Topic

QWhy does BWT improve compression compared to compressing the original string directly?
QHow is the BWT reversed — what information do you need?
QWhat is the Move-to-Front transform and why does it follow BWT in bzip2?
QHow is BWT used in genomic sequence alignment?

Frequently Asked Questions

Is bzip2 better than gzip?

On typical text, bzip2 compresses 10-30% better than gzip. However, gzip (DEFLATE) is 3-5x faster and uses less memory. For maximum ratio: bzip2 or xz (LZMA). For speed+ratio balance: zstd. For compatibility: gzip.

🔥

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

About Naren Get in touch

Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged