LZ77 and LZ78 — Dictionary Compression Algorithms
- LZ77: sliding window back-references — token is (offset, length, next_char).
- LZ78: growing phrase dictionary — token is (dict_index, new_char).
- DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
Every gzip file, every zip archive, every PNG image, every HTTPS response with Content-Encoding: gzip — all of these use DEFLATE, which is LZ77 combined with Huffman coding. Lempel and Ziv published two papers in 1977 and 1978 that are among the most cited in information theory. The algorithms they described became the foundation of essentially all practical lossless compression for the next 50 years.
The key insight in LZ77 is that the best dictionary for compressing text is the text you have already seen. Instead of building a static dictionary from training data, the encoder and decoder both maintain a sliding window of recent output — they are always perfectly synchronised without transmitting the dictionary itself. This is why LZ compression is both efficient and streamable.
LZ77 — Sliding Window Compression
LZ77 maintains a sliding window of recently seen data (search buffer) and a lookahead buffer. For each position, find the longest match in the search buffer. Output (offset, length, next_char).
def lz77_encode(data: str, window: int = 20, lookahead: int = 10) -> list: """ Returns list of (offset, length, next_char) tokens. offset=0, length=0 means literal character. """ tokens = [] i = 0 while i < len(data): best_offset = best_length = 0 # Search in window for longest match start = max(0, i - window) for j in range(start, i): length = 0 while (length < lookahead and i + length < len(data) and data[j + length] == data[i + length]): length += 1 if length > best_length: best_length = length best_offset = i - j if best_length >= 2: next_char = data[i + best_length] if i + best_length < len(data) else '' tokens.append((best_offset, best_length, next_char)) i += best_length + 1 else: tokens.append((0, 0, data[i])) i += 1 return tokens def lz77_decode(tokens: list) -> str: result = '' for offset, length, char in tokens: if offset == 0 and length == 0: result += char else: start = len(result) - offset for k in range(length): result += result[start + k] result += char return result text = 'abracadabrabracabra' tokens = lz77_encode(text) print(f'Tokens: {tokens}') print(f'Decoded: {lz77_decode(tokens)}') print(f'Original: {len(text)} chars, Tokens: {len(tokens)} (each needs ~3 bytes)')
Decoded: abracadabrabracabra
Original: 19 chars, Tokens: 9 (each needs ~3 bytes)
LZ78 — Dictionary-Based Compression
LZ78 builds an explicit dictionary of phrases. Each token is (dictionary_index, new_char). The dictionary grows as encoding proceeds.
def lz78_encode(data: str) -> list[tuple]: dictionary = {} tokens = [] current = '' for char in data: extended = current + char if extended in dictionary: current = extended else: idx = dictionary.get(current, 0) tokens.append((idx, char)) dictionary[extended] = len(dictionary) + 1 current = '' if current: tokens.append((dictionary.get(current, 0), '')) return tokens text = 'abababab' tokens = lz78_encode(text) print(f'Input: {text}') print(f'LZ78 tokens: {tokens}')
LZ78 tokens: [(0,'a'),(0,'b'),(1,'b'),(2,'a'),(3,'b')]
Real-World Derivatives
DEFLATE (LZ77 + Huffman): Powers gzip, ZIP, PNG, HTTP compression (Content-Encoding: gzip). The dominant compression format on the internet.
LZW: LZ78 derivative. Used in GIF, old TIFF, Unix compress. Patents expired, now freely usable.
zstd (Zstandard): Modern Facebook LZ77 derivative. 3-5x faster than gzip with similar ratios. Used in Linux kernel, Facebook infrastructure, Hadoop.
Snappy: Google's LZ77 variant optimising for speed over ratio. Used in LevelDB, Cassandra, Hadoop.
🎯 Key Takeaways
- LZ77: sliding window back-references — token is (offset, length, next_char).
- LZ78: growing phrase dictionary — token is (dict_index, new_char).
- DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
- LZ algorithms are universal compressors — asymptotically optimal for any data source.
- Modern derivatives: zstd (speed+ratio), Snappy (speed), Brotli (web).
Interview Questions on This Topic
- QWhat is the fundamental difference between LZ77 and LZ78?
- QHow does DEFLATE combine LZ77 and Huffman coding?
- QWhy do LZ algorithms significantly outperform Huffman-only compression?
- QWhat data compresses best with LZ algorithms?
Frequently Asked Questions
Why does PNG use DEFLATE instead of JPEG compression?
PNG is lossless — DEFLATE preserves every pixel exactly. JPEG is lossy — DCT + quantisation discards high-frequency detail for smaller files but changes pixel values. PNG uses DEFLATE for lossless image compression, which works well on computer graphics (limited colours, uniform regions) but poorly on photographs.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.