gzip Compression — Binary Data Expansion with LZ77
LZ77’s sliding window can exceed 1:1 ratio on binary data, silently inflating files.
- LZ77 uses a sliding window of recent data; output is (offset, length, next_char)
- LZ78 builds an explicit phrase dictionary; output is (dictionary_index, next_char)
- Both are universal: they can compress any stationary source asymptotically optimally
- DEFLATE = LZ77 + Huffman coding powers gzip, ZIP, PNG, and HTTP compression
- Modern variants: zstd (better ratios, faster), Snappy (speed over ratio), Brotli (web focus)
- Performance: LZ77 compresses better with larger windows (e.g., 32KB) but uses more memory
Every gzip file, every zip archive, every PNG image, every HTTPS response with Content-Encoding: gzip — all of these use DEFLATE, which is LZ77 combined with Huffman coding. Lempel and Ziv published two papers in 1977 and 1978 that are among the most cited in information theory. The algorithms they described became the foundation of essentially all practical lossless compression for the next 50 years.
The key insight in LZ77 is that the best dictionary for compressing text is the text you have already seen. Instead of building a static dictionary from training data, the encoder and decoder both maintain a sliding window of recent output — they are always perfectly synchronised without transmitting the dictionary itself. This is why LZ compression is both efficient and streamable.
LZ77 — Sliding Window Compression
LZ77 maintains a sliding window of recently seen data (search buffer) and a lookahead buffer. For each position, it finds the longest match in the search buffer. The output is a triple: (offset, length, next_char). Offset is the distance back to the match; length is how many characters match; next_char is the literal following the match (or empty if at end).
The window size is a key parameter. Larger windows catch longer repetitions but increase search time and memory. In practice, DEFLATE uses 32KB maximum window. LZ77 is the core of all modern ZIP, gzip, and PNG compression.
LZ78 — Dictionary-Based Compression
LZ78 builds an explicit dictionary of phrases. Instead of a sliding window, the encoder maintains a growing table of previously seen substrings. Each token is (dictionary_index, new_char). The index refers to a previously added phrase; new_char extends it. The decoder builds the same dictionary as it reads tokens — no side-channel transmission needed.
The key difference from LZ77: LZ78 can refer to any past phrase, even very old ones, as long as they are in the dictionary. However, the dictionary has a fixed maximum size in practice. LZ78 is the basis for GIF (via LZW), TIFF, and Unix compress.
Practical implementations limit dictionary size (e.g., 12-bit indices = 4096 entries). When full, the dictionary is either frozen or reset.
Real-World Derivatives
DEFLATE (LZ77 + Huffman): Powers gzip, ZIP, PNG, HTTP compression (Content-Encoding: gzip). The dominant compression format on the internet.
LZW: LZ78 derivative. Used in GIF, old TIFF, Unix compress. Patents expired, now freely usable.
zstd (Zstandard): Modern Facebook LZ77 derivative. 3-5x faster than gzip with similar ratios. Used in Linux kernel, Facebook infrastructure, Hadoop.
Snappy: Google's LZ77 variant optimising for speed over ratio. Used in LevelDB, Cassandra, Hadoop.
LZ77 vs LZ78 — Key Differences and Trade-offs
Both algorithms belong to the same family, but they differ fundamentally in how they encode repetition. LZ77 stores references to a fixed-size sliding window; LZ78 stores references to an ever-growing dictionary of phrases. This leads to different strengths:
- LZ77 produces variable-length tokens depending on offset range, requiring entropy coding (Huffman) afterwards for efficiency.
- LZ78 tokens are fixed-length (dictionary index bit width), simplifying the output but limiting the dictionary size.
- LZ77 decompression is minimal overhead; LZ78 requires rebuilding the dictionary.
- LZ78 can match patterns across very long distances (as long as they entered the dictionary), while LZ77 is limited by window size.
In practice, LZ77 dominates (DEFLATE, zstd, brotli, LZMA). LZ78 appears in legacy GIF.
Compression in Practice: Choosing the Right Algorithm
Selecting between LZ variants depends on the data, constraints, and use case. Here's a pragmatic guide:
- Web delivery: Use Brotli (LZ77 + custom entropy) for text assets; gzip fallback.
- File archives: Use zstd for speed+ratio; LZMA for maximum ratio but slower.
- Databases (LSM trees): Use Snappy or LZ4 for per-block compression — low CPU overhead.
- Network compression (e.g., SSH): Use zlib (DEFLATE) for balanced compression.
- Image formats: PNG uses DEFLATE; GIF uses LZW; both are LZ-based.
When implementing custom compression: start with an LZ77 variant and add entropy coding (Huffman, arithmetic) if needed.
| Feature | LZ77 | LZ78 |
|---|---|---|
| Dictionary mechanism | Sliding window (fixed size) | Explicit growing table |
| Output token | (offset, length, next_char) | (dictionary_index, new_char) |
| Token length | Variable (offset range, length range) | Fixed (index bits) |
| Distance of repetition | Limited by window size | Unlimited (until dictionary full) |
| Memory on decoder | Only sliding window buffer | Dictionary rebuild table |
| Stream-friendly | Yes (no state beyond window) | Yes (but dictionary grows unbounded) |
| Typical implementations | DEFLATE, zstd, Brotli, Snappy | LZW (GIF, compress) |
| Compression ratio potential | Higher with large windows + entropy coding | Good for highly repetitive short sequences |
| Speed | Fast (window search) | Medium (dictionary lookup) |
| Popularity today | Dominant (all web, archives) | Diminishing (legacy only) |
Key Takeaways
- LZ77: sliding window back-references — token is (offset, length, next_char).
- LZ78: growing phrase dictionary — token is (dict_index, new_char).
- DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
- LZ algorithms are universal compressors — asymptotically optimal for any data source.
- Modern derivatives: zstd (speed+ratio), Snappy (speed), Brotli (web).
- In production: test on real data; never compress encrypted or already compressed data.
Common Mistakes to Avoid
- Using gzip on already compressed data
Symptom: Compressed file is larger than original, CPU wasted
Fix: Check entropy before compressing. Usefilecommand to detect if data is already compressed. Skip compression if ratio would be <1.0. - Claiming LZ77 compresses anything because it's 'universal'
Symptom: Incompressible data (random, encrypted) bloats instead of shrinks
Fix: LZ algorithms exploit repetition only. Always test a sample before committing to compression. - Forgetting to reset LZW dictionary after it fills
Symptom: Compression ratio drops as dictionary stops growing, then stays flat
Fix: Monitor dictionary usage and reset periodically (e.g., every 10KB of input). Unix compress does this automatically. - Choosing window size too small for data
Symptom: Compression ratio far below expected (e.g., 1.2x instead of 3x)
Fix: Increase window size. For text files, 32KB is minimum; for log files, 64KB–256KB often helps. - Implementing LZ77 without minimum match length check
Symptom: Too many short matches (length=2) overhead exceeds saving, inflating output
Fix: Enforce minimum match length of at least 2 (or 3 for some implementations).
Interview Questions on This Topic
- QWhat is the fundamental difference between LZ77 and LZ78?SeniorReveal
- QHow does DEFLATE combine LZ77 and Huffman coding?Mid-levelReveal
- QWhy do LZ algorithms significantly outperform Huffman-only compression?SeniorReveal
- QWhat data compresses best with LZ algorithms?Mid-levelReveal
Frequently Asked Questions
Why does PNG use DEFLATE instead of JPEG compression?
PNG is lossless — DEFLATE preserves every pixel exactly. JPEG is lossy — DCT + quantisation discards high-frequency detail for smaller files but changes pixel values. PNG uses DEFLATE for lossless image compression, which works well on computer graphics (limited colours, uniform regions) but poorly on photographs.
Can LZ77 compression be parallelised?
Yes, but not trivially. Each thread could process a separate block with its own sliding window. This is how pigz (parallel gzip) works. The blocks are independent; however, compression ratio drops because windows are smaller. zstd offers multi-threaded mode that divides the data into independent jobs and re-encodes with a larger window across blocks.
What is the maximum window size in standard LZ77 implementations?
DEFLATE (gzip, ZIP, PNG) uses a 32KB window. zstd supports up to 128MB window. LZMA (7-Zip) can use up to 1GB. Larger windows increase memory and speed trade-offs, but can significantly improve ratio on data with long-range repetition.
Why is the LZW patent expired now?
LZW was patented by Unisys in 1985. The patents expired in 2003 (US) and 2004 (worldwide). Since then, LZW can be used freely. This is why GIF is now unencumbered, though PNG is preferred for its better compression and colour support.
That's Data Compression. Mark it forged?
3 min read · try the examples if you haven't