gzip Compression — Binary Data Expansion with LZ77
- LZ77: sliding window back-references — token is (offset, length, next_char).
- LZ78: growing phrase dictionary — token is (dict_index, new_char).
- DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
- LZ77 uses a sliding window of recent data; output is (offset, length, next_char)
- LZ78 builds an explicit phrase dictionary; output is (dictionary_index, next_char)
- Both are universal: they can compress any stationary source asymptotically optimally
- DEFLATE = LZ77 + Huffman coding powers gzip, ZIP, PNG, and HTTP compression
- Modern variants: zstd (better ratios, faster), Snappy (speed over ratio), Brotli (web focus)
- Performance: LZ77 compresses better with larger windows (e.g., 32KB) but uses more memory
Quick Debug Cheat Sheet: Dictionary Compression
gzip output larger than input
file data.bingzip -l data.gz # shows compression ratioPNG file is huge (not compressing well)
identify -verbose image.png | grep Typepngcheck -v image.png # shows filter strategy and chunk sizesCustom LZ77 implementation gives wrong decompression
hexdump -C compressed.bin | head -20python -c "import io.thecodeforge.lz77; print(list(io.thecodeforge.lz77.lz77_decode(...)))"Production Incident
Production Debug GuideSymptoms, root causes, and actions for common LZ77/LZ78 failures
Every gzip file, every zip archive, every PNG image, every HTTPS response with Content-Encoding: gzip — all of these use DEFLATE, which is LZ77 combined with Huffman coding. Lempel and Ziv published two papers in 1977 and 1978 that are among the most cited in information theory. The algorithms they described became the foundation of essentially all practical lossless compression for the next 50 years.
The key insight in LZ77 is that the best dictionary for compressing text is the text you have already seen. Instead of building a static dictionary from training data, the encoder and decoder both maintain a sliding window of recent output — they are always perfectly synchronised without transmitting the dictionary itself. This is why LZ compression is both efficient and streamable.
LZ77 — Sliding Window Compression
LZ77 maintains a sliding window of recently seen data (search buffer) and a lookahead buffer. For each position, it finds the longest match in the search buffer. The output is a triple: (offset, length, next_char). Offset is the distance back to the match; length is how many characters match; next_char is the literal following the match (or empty if at end).
The window size is a key parameter. Larger windows catch longer repetitions but increase search time and memory. In practice, DEFLATE uses 32KB maximum window. LZ77 is the core of all modern ZIP, gzip, and PNG compression.
def lz77_encode(data: str, window: int = 20, lookahead: int = 10) -> list: """ Returns list of (offset, length, next_char) tokens. offset=0, length=0 means literal character. """ tokens = [] i = 0 while i < len(data): best_offset = best_length = 0 # Search in window for longest match start = max(0, i - window) for j in range(start, i): length = 0 while (length < lookahead and i + length < len(data) and data[j + length] == data[i + length]): length += 1 if length > best_length: best_length = length best_offset = i - j if best_length >= 2: next_char = data[i + best_length] if i + best_length < len(data) else '' tokens.append((best_offset, best_length, next_char)) i += best_length + 1 else: tokens.append((0, 0, data[i])) i += 1 return tokens def lz77_decode(tokens: list) -> str: result = '' for offset, length, char in tokens: if offset == 0 and length == 0: result += char else: start = len(result) - offset for k in range(length): result += result[start + k] result += char return result text = 'abracadabrabracabra' tokens = lz77_encode(text) print(f'Tokens: {tokens}') print(f'Decoded: {lz77_decode(tokens)}') print(f'Original: {len(text)} chars, Tokens: {len(tokens)} (each needs ~3 bytes)')
Decoded: abracadabrabracabra
Original: 19 chars, Tokens: 9 (each needs ~3 bytes)
LZ78 — Dictionary-Based Compression
LZ78 builds an explicit dictionary of phrases. Instead of a sliding window, the encoder maintains a growing table of previously seen substrings. Each token is (dictionary_index, new_char). The index refers to a previously added phrase; new_char extends it. The decoder builds the same dictionary as it reads tokens — no side-channel transmission needed.
The key difference from LZ77: LZ78 can refer to any past phrase, even very old ones, as long as they are in the dictionary. However, the dictionary has a fixed maximum size in practice. LZ78 is the basis for GIF (via LZW), TIFF, and Unix compress.
Practical implementations limit dictionary size (e.g., 12-bit indices = 4096 entries). When full, the dictionary is either frozen or reset.
def lz78_encode(data: str) -> list[tuple]: dictionary = {} tokens = [] current = '' for char in data: extended = current + char if extended in dictionary: current = extended else: idx = dictionary.get(current, 0) tokens.append((idx, char)) dictionary[extended] = len(dictionary) + 1 current = '' if current: tokens.append((dictionary.get(current, 0), '')) return tokens text = 'abababab' tokens = lz78_encode(text) print(f'Input: {text}') print(f'LZ78 tokens: {tokens}')
LZ78 tokens: [(0,'a'),(0,'b'),(1,'b'),(2,'a'),(3,'b')]
Real-World Derivatives
DEFLATE (LZ77 + Huffman): Powers gzip, ZIP, PNG, HTTP compression (Content-Encoding: gzip). The dominant compression format on the internet.
LZW: LZ78 derivative. Used in GIF, old TIFF, Unix compress. Patents expired, now freely usable.
zstd (Zstandard): Modern Facebook LZ77 derivative. 3-5x faster than gzip with similar ratios. Used in Linux kernel, Facebook infrastructure, Hadoop.
Snappy: Google's LZ77 variant optimising for speed over ratio. Used in LevelDB, Cassandra, Hadoop.
LZ77 vs LZ78 — Key Differences and Trade-offs
Both algorithms belong to the same family, but they differ fundamentally in how they encode repetition. LZ77 stores references to a fixed-size sliding window; LZ78 stores references to an ever-growing dictionary of phrases. This leads to different strengths:
- LZ77 produces variable-length tokens depending on offset range, requiring entropy coding (Huffman) afterwards for efficiency.
- LZ78 tokens are fixed-length (dictionary index bit width), simplifying the output but limiting the dictionary size.
- LZ77 decompression is minimal overhead; LZ78 requires rebuilding the dictionary.
- LZ78 can match patterns across very long distances (as long as they entered the dictionary), while LZ77 is limited by window size.
In practice, LZ77 dominates (DEFLATE, zstd, brotli, LZMA). LZ78 appears in legacy GIF.
Compression in Practice: Choosing the Right Algorithm
Selecting between LZ variants depends on the data, constraints, and use case. Here's a pragmatic guide:
- Web delivery: Use Brotli (LZ77 + custom entropy) for text assets; gzip fallback.
- File archives: Use zstd for speed+ratio; LZMA for maximum ratio but slower.
- Databases (LSM trees): Use Snappy or LZ4 for per-block compression — low CPU overhead.
- Network compression (e.g., SSH): Use zlib (DEFLATE) for balanced compression.
- Image formats: PNG uses DEFLATE; GIF uses LZW; both are LZ-based.
When implementing custom compression: start with an LZ77 variant and add entropy coding (Huffman, arithmetic) if needed.
- LZ77: Looks back a fixed distance — like remembering the last few sentences.
- LZ78: Builds a phrase book — once a pattern is added, it's accessible forever.
- LZ77 is simpler to implement and stream.
- LZ78 can become memory-heavy and less effective when dictionary saturates.
| Feature | LZ77 | LZ78 |
|---|---|---|
| Dictionary mechanism | Sliding window (fixed size) | Explicit growing table |
| Output token | (offset, length, next_char) | (dictionary_index, new_char) |
| Token length | Variable (offset range, length range) | Fixed (index bits) |
| Distance of repetition | Limited by window size | Unlimited (until dictionary full) |
| Memory on decoder | Only sliding window buffer | Dictionary rebuild table |
| Stream-friendly | Yes (no state beyond window) | Yes (but dictionary grows unbounded) |
| Typical implementations | DEFLATE, zstd, Brotli, Snappy | LZW (GIF, compress) |
| Compression ratio potential | Higher with large windows + entropy coding | Good for highly repetitive short sequences |
| Speed | Fast (window search) | Medium (dictionary lookup) |
| Popularity today | Dominant (all web, archives) | Diminishing (legacy only) |
🎯 Key Takeaways
- LZ77: sliding window back-references — token is (offset, length, next_char).
- LZ78: growing phrase dictionary — token is (dict_index, new_char).
- DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
- LZ algorithms are universal compressors — asymptotically optimal for any data source.
- Modern derivatives: zstd (speed+ratio), Snappy (speed), Brotli (web).
- In production: test on real data; never compress encrypted or already compressed data.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QWhat is the fundamental difference between LZ77 and LZ78?SeniorReveal
- QHow does DEFLATE combine LZ77 and Huffman coding?Mid-levelReveal
- QWhy do LZ algorithms significantly outperform Huffman-only compression?SeniorReveal
- QWhat data compresses best with LZ algorithms?Mid-levelReveal
Frequently Asked Questions
Why does PNG use DEFLATE instead of JPEG compression?
PNG is lossless — DEFLATE preserves every pixel exactly. JPEG is lossy — DCT + quantisation discards high-frequency detail for smaller files but changes pixel values. PNG uses DEFLATE for lossless image compression, which works well on computer graphics (limited colours, uniform regions) but poorly on photographs.
Can LZ77 compression be parallelised?
Yes, but not trivially. Each thread could process a separate block with its own sliding window. This is how pigz (parallel gzip) works. The blocks are independent; however, compression ratio drops because windows are smaller. zstd offers multi-threaded mode that divides the data into independent jobs and re-encodes with a larger window across blocks.
What is the maximum window size in standard LZ77 implementations?
DEFLATE (gzip, ZIP, PNG) uses a 32KB window. zstd supports up to 128MB window. LZMA (7-Zip) can use up to 1GB. Larger windows increase memory and speed trade-offs, but can significantly improve ratio on data with long-range repetition.
Why is the LZW patent expired now?
LZW was patented by Unisys in 1985. The patents expired in 2003 (US) and 2004 (worldwide). Since then, LZW can be used freely. This is why GIF is now unencumbered, though PNG is preferred for its better compression and colour support.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.