Homeβ€Ί DSAβ€Ί Burrows-Wheeler Transform β€” Reversible Rearrangement for Compression

Burrows-Wheeler Transform β€” Reversible Rearrangement for Compression

Where developers are forged. Β· Structured learning Β· Free forever.
πŸ“ Part of: Data Compression β†’ Topic 3 of 4
Learn the Burrows-Wheeler Transform (BWT) used in bzip2.
πŸ”₯ Advanced β€” solid DSA foundation required
In this tutorial, you'll learn:
  • BWT rearranges text so similar contexts are adjacent β€” grouping characters before the same suffix.
  • Forward: sort all rotations, take last column. Output last column + original row index.
  • Inverse: recoverable because (first_col, last_col) pairs reconstruct the full rotation table.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚑ Quick Answer
The Burrows-Wheeler Transform is a magic trick: take a string, rearrange it in a specific way, and the result looks completely different but can be perfectly reversed. More importantly, the rearranged version has runs of similar characters β€” making it much easier to compress. It is the secret behind bzip2, which often beats gzip on text.

bzip2 regularly outcompresses gzip by 10-30% on text data. The reason is the Burrows-Wheeler Transform β€” a reversible rearrangement that clusters similar contexts together, making subsequent compression dramatically more effective. Michael Burrows and David Wheeler published it in 1994 as a DEC Systems Research Center technical report, and it became the core of bzip2.

The BWT also has a second life in bioinformatics. The FM-index β€” a compressed data structure built on BWT β€” allows searching a 3-billion-base genome for any short read in milliseconds while using only 1-2 GB of RAM. The entire field of short-read genome alignment (BWA, Bowtie2, HISAT2 β€” the tools that process most DNA sequencing data) is built on Burrows-Wheeler.

Forward BWT

  1. Form all rotations of the string (append a unique end-of-string marker $)
  2. Sort all rotations lexicographically
  3. Output the last column (L) of the sorted matrix and the row index of the original string
bwt_forward.py Β· PYTHON
12345678910111213
def bwt(s: str) -> tuple[str, int]:
    """Compute BWT of s. Returns (transformed, original_row_index)."""
    s = s + '$'  # append end marker (lexicographically smallest)
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    last_col = ''.join(r[-1] for r in rotations)
    original_idx = rotations.index(s)
    return last_col, original_idx

result, idx = bwt('banana')
print(f'BWT(banana) = {result!r}, original row = {idx}')
result2, idx2 = bwt('mississippi')
print(f'BWT(mississippi) = {result2!r}, original row = {idx2}')
β–Ά Output
BWT(banana) = 'annb$aa', original row = 3
BWT(mississippi) = 'ipssm$pissii', original row = 2

Why BWT Helps Compression

Notice BWT(banana) = 'annb$aa' β€” the 'a's and 'n's are grouped. This happens because rotations starting with 'an' are sorted together, and those all have the character preceding 'an' in their last column.

In English text, the character before 'an' is usually 'b' (ban), 'c' (can), 'm' (man), 'p' (pan), etc. β€” similar characters cluster. bzip2 achieves 10-30% better ratio than gzip on English text because of this clustering.

Inverse BWT

Recovery uses the key insight: sorting the last column L gives the first column F. Since each row is a rotation, (F[i], L[i]) are consecutive characters.

bwt_inverse.py Β· PYTHON
12345678910111213141516
def ibwt(last: str, orig_idx: int) -> str:
    """Reconstruct original string from BWT."""
    n = len(last)
    # Build first column by sorting last
    first = sorted(last)
    # LF mapping: last[i] is the character just before first[i]
    # Count occurrences to build precise mapping
    from collections import defaultdict
    table = [''] * n
    for _ in range(n):
        table = [last[i] + table[i] for i in range(n)]
        table = sorted(table)
    return table[orig_idx].rstrip('$')

print(ibwt('annb$aa', 3))        # banana
print(ibwt('ipssm$pissii', 2))  # mississippi
β–Ά Output
banana
mississippi

BWT Pipeline β€” bzip2

bzip2 uses BWT as the first stage of a multi-step pipeline: 1. BWT: Rearrange for character clustering 2. Move-to-Front (MTF): Encode characters by their position in a recently-seen list β€” clusters become runs of small numbers 3. RLE: Compress runs of zeros (from MTF) 4. Huffman coding: Entropy-code the result

This pipeline achieves better compression than gzip (DEFLATE) on typical text at the cost of more memory and slower speed.

🎯 Key Takeaways

  • BWT rearranges text so similar contexts are adjacent β€” grouping characters before the same suffix.
  • Forward: sort all rotations, take last column. Output last column + original row index.
  • Inverse: recoverable because (first_col, last_col) pairs reconstruct the full rotation table.
  • Used in bzip2 pipeline: BWT β†’ MTF β†’ RLE β†’ Huffman. Beats gzip on text by 10-30%.
  • BWT is also used in bioinformatics for genome indexing (FM-index, BWA aligner).

Interview Questions on This Topic

  • QWhy does BWT improve compression compared to compressing the original string directly?
  • QHow is the BWT reversed β€” what information do you need?
  • QWhat is the Move-to-Front transform and why does it follow BWT in bzip2?
  • QHow is BWT used in genomic sequence alignment?

Frequently Asked Questions

Is bzip2 better than gzip?

On typical text, bzip2 compresses 10-30% better than gzip. However, gzip (DEFLATE) is 3-5x faster and uses less memory. For maximum ratio: bzip2 or xz (LZMA). For speed+ratio balance: zstd. For compatibility: gzip.

πŸ”₯
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousLZ77 and LZ78 β€” Dictionary Compression AlgorithmsNext β†’Arithmetic Coding β€” Beyond Huffman
Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged