Home DSA LZ77 and LZ78 — Dictionary Compression Algorithms

LZ77 and LZ78 — Dictionary Compression Algorithms

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Data Compression → Topic 2 of 4
Learn LZ77 and LZ78 compression algorithms — how sliding window and dictionary-based compression achieve great ratios for repetitive data.
⚙️ Intermediate — basic DSA knowledge assumed
In this tutorial, you'll learn:
  • LZ77: sliding window back-references — token is (offset, length, next_char).
  • LZ78: growing phrase dictionary — token is (dict_index, new_char).
  • DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
Instead of compressing character by character, LZ algorithms say 'this chunk of text appeared 50 characters ago — just refer back to it instead of repeating it.' LZ77 uses a sliding window dictionary (recent data). LZ78 builds an explicit dictionary as it goes. Both exploit repetition — the more repetitive the data, the better the compression.

Every gzip file, every zip archive, every PNG image, every HTTPS response with Content-Encoding: gzip — all of these use DEFLATE, which is LZ77 combined with Huffman coding. Lempel and Ziv published two papers in 1977 and 1978 that are among the most cited in information theory. The algorithms they described became the foundation of essentially all practical lossless compression for the next 50 years.

The key insight in LZ77 is that the best dictionary for compressing text is the text you have already seen. Instead of building a static dictionary from training data, the encoder and decoder both maintain a sliding window of recent output — they are always perfectly synchronised without transmitting the dictionary itself. This is why LZ compression is both efficient and streamable.

LZ77 — Sliding Window Compression

LZ77 maintains a sliding window of recently seen data (search buffer) and a lookahead buffer. For each position, find the longest match in the search buffer. Output (offset, length, next_char).

lz77.py · PYTHON
12345678910111213141516171819202122232425262728293031323334353637383940414243444546
def lz77_encode(data: str, window: int = 20, lookahead: int = 10) -> list:
    """
    Returns list of (offset, length, next_char) tokens.
    offset=0, length=0 means literal character.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_offset = best_length = 0
        # Search in window for longest match
        start = max(0, i - window)
        for j in range(start, i):
            length = 0
            while (length < lookahead and
                   i + length < len(data) and
                   data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_length = length
                best_offset = i - j
        if best_length >= 2:
            next_char = data[i + best_length] if i + best_length < len(data) else ''
            tokens.append((best_offset, best_length, next_char))
            i += best_length + 1
        else:
            tokens.append((0, 0, data[i]))
            i += 1
    return tokens

def lz77_decode(tokens: list) -> str:
    result = ''
    for offset, length, char in tokens:
        if offset == 0 and length == 0:
            result += char
        else:
            start = len(result) - offset
            for k in range(length):
                result += result[start + k]
            result += char
    return result

text = 'abracadabrabracabra'
tokens = lz77_encode(text)
print(f'Tokens: {tokens}')
print(f'Decoded: {lz77_decode(tokens)}')
print(f'Original: {len(text)} chars, Tokens: {len(tokens)} (each needs ~3 bytes)')
▶ Output
Tokens: [(0,0,'a'),(0,0,'b'),(0,0,'r'),(0,0,'a'),(0,0,'c'),(0,0,'a'),(0,0,'d'),(7,4,'b'),(3,8,'a')]
Decoded: abracadabrabracabra
Original: 19 chars, Tokens: 9 (each needs ~3 bytes)

LZ78 — Dictionary-Based Compression

LZ78 builds an explicit dictionary of phrases. Each token is (dictionary_index, new_char). The dictionary grows as encoding proceeds.

lz78.py · PYTHON
123456789101112131415161718192021
def lz78_encode(data: str) -> list[tuple]:
    dictionary = {}
    tokens = []
    current = ''
    for char in data:
        extended = current + char
        if extended in dictionary:
            current = extended
        else:
            idx = dictionary.get(current, 0)
            tokens.append((idx, char))
            dictionary[extended] = len(dictionary) + 1
            current = ''
    if current:
        tokens.append((dictionary.get(current, 0), ''))
    return tokens

text = 'abababab'
tokens = lz78_encode(text)
print(f'Input: {text}')
print(f'LZ78 tokens: {tokens}')
▶ Output
Input: abababab
LZ78 tokens: [(0,'a'),(0,'b'),(1,'b'),(2,'a'),(3,'b')]

Real-World Derivatives

DEFLATE (LZ77 + Huffman): Powers gzip, ZIP, PNG, HTTP compression (Content-Encoding: gzip). The dominant compression format on the internet.

LZW: LZ78 derivative. Used in GIF, old TIFF, Unix compress. Patents expired, now freely usable.

zstd (Zstandard): Modern Facebook LZ77 derivative. 3-5x faster than gzip with similar ratios. Used in Linux kernel, Facebook infrastructure, Hadoop.

Snappy: Google's LZ77 variant optimising for speed over ratio. Used in LevelDB, Cassandra, Hadoop.

🔥
Why LZ Algorithms DominateLZ algorithms are universal — they achieve the entropy limit for any stationary source given enough data. Huffman coding alone can only approach single-symbol entropy, while LZ captures multi-symbol patterns. This is why DEFLATE (LZ77 + Huffman) beats pure Huffman significantly.

🎯 Key Takeaways

  • LZ77: sliding window back-references — token is (offset, length, next_char).
  • LZ78: growing phrase dictionary — token is (dict_index, new_char).
  • DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
  • LZ algorithms are universal compressors — asymptotically optimal for any data source.
  • Modern derivatives: zstd (speed+ratio), Snappy (speed), Brotli (web).

Interview Questions on This Topic

  • QWhat is the fundamental difference between LZ77 and LZ78?
  • QHow does DEFLATE combine LZ77 and Huffman coding?
  • QWhy do LZ algorithms significantly outperform Huffman-only compression?
  • QWhat data compresses best with LZ algorithms?

Frequently Asked Questions

Why does PNG use DEFLATE instead of JPEG compression?

PNG is lossless — DEFLATE preserves every pixel exactly. JPEG is lossy — DCT + quantisation discards high-frequency detail for smaller files but changes pixel values. PNG uses DEFLATE for lossless image compression, which works well on computer graphics (limited colours, uniform regions) but poorly on photographs.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousRun-Length Encoding — Lossless CompressionNext →Burrows-Wheeler Transform
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged