gzip Compression — Binary Data Expansion with LZ77
LZ77’s sliding window can exceed 1:1 ratio on binary data, silently inflating files.
20+ years shipping performance-critical code where algorithms decide the bill. Lessons pulled from things that broke in production.
- LZ77 uses a sliding window of recent data; output is (offset, length, next_char)
- LZ78 builds an explicit phrase dictionary; output is (dictionary_index, next_char)
- Both are universal: they can compress any stationary source asymptotically optimally
- DEFLATE = LZ77 + Huffman coding powers gzip, ZIP, PNG, and HTTP compression
- Modern variants: zstd (better ratios, faster), Snappy (speed over ratio), Brotli (web focus)
- Performance: LZ77 compresses better with larger windows (e.g., 32KB) but uses more memory
Instead of compressing character by character, LZ algorithms say 'this chunk of text appeared 50 characters ago — just refer back to it instead of repeating it.' LZ77 uses a sliding window dictionary (recent data). LZ78 builds an explicit dictionary as it goes. Both exploit repetition — the more repetitive the data, the better the compression.
Every gzip file, every zip archive, every PNG image, every HTTPS response with Content-Encoding: gzip — all of these use DEFLATE, which is LZ77 combined with Huffman coding. Lempel and Ziv published two papers in 1977 and 1978 that are among the most cited in information theory. The algorithms they described became the foundation of essentially all practical lossless compression for the next 50 years.
The key insight in LZ77 is that the best dictionary for compressing text is the text you have already seen. Instead of building a static dictionary from training data, the encoder and decoder both maintain a sliding window of recent output — they are always perfectly synchronised without transmitting the dictionary itself. This is why LZ compression is both efficient and streamable.
How LZ77 Compression Actually Works — Sliding Windows and Back-References
LZ77 and LZ78 are dictionary-based compression algorithms that replace repeated sequences with back-references. LZ77 uses a sliding window over already-seen data: it searches for the longest match between the current input and the window, then emits a (length, distance) pair instead of the raw bytes. LZ78 builds an explicit dictionary of phrases as it goes, emitting (index, next character) pairs. Both achieve compression by exploiting redundancy, but LZ77's window size directly limits how far back it can look — typically 32KB in gzip.
In practice, LZ77 is the core of gzip (RFC 1951). It operates in O(n) time with a hash-chain or binary-tree lookup for matches, but worst-case O(n²) if poorly implemented. The key property: compression ratio improves with larger windows and more redundancy, but memory usage scales with window size. LZ78's dictionary grows unbounded unless pruned, making it less common in modern formats.
Use LZ77 when you need a fast, streaming compressor with bounded memory — HTTP compression, file archives, and network protocols. It's not ideal for small payloads (under a few hundred bytes) where the overhead of back-references can actually expand data. Understanding this trade-off is critical: gzip can inflate tiny JSON responses by 20-30% due to header overhead and failed matches.
LZ77 — Sliding Window Compression
LZ77 maintains a sliding window of recently seen data (search buffer) and a lookahead buffer. For each position, it finds the longest match in the search buffer. The output is a triple: (offset, length, next_char). Offset is the distance back to the match; length is how many characters match; next_char is the literal following the match (or empty if at end).
The window size is a key parameter. Larger windows catch longer repetitions but increase search time and memory. In practice, DEFLATE uses 32KB maximum window. LZ77 is the core of all modern ZIP, gzip, and PNG compression.
LZ78 — Dictionary-Based Compression
LZ78 builds an explicit dictionary of phrases. Instead of a sliding window, the encoder maintains a growing table of previously seen substrings. Each token is (dictionary_index, new_char). The index refers to a previously added phrase; new_char extends it. The decoder builds the same dictionary as it reads tokens — no side-channel transmission needed.
The key difference from LZ77: LZ78 can refer to any past phrase, even very old ones, as long as they are in the dictionary. However, the dictionary has a fixed maximum size in practice. LZ78 is the basis for GIF (via LZW), TIFF, and Unix compress.
Practical implementations limit dictionary size (e.g., 12-bit indices = 4096 entries). When full, the dictionary is either frozen or reset.
Real-World Derivatives
DEFLATE (LZ77 + Huffman): Powers gzip, ZIP, PNG, HTTP compression (Content-Encoding: gzip). The dominant compression format on the internet.
LZW: LZ78 derivative. Used in GIF, old TIFF, Unix compress. Patents expired, now freely usable.
zstd (Zstandard): Modern Facebook LZ77 derivative. 3-5x faster than gzip with similar ratios. Used in Linux kernel, Facebook infrastructure, Hadoop.
Snappy: Google's LZ77 variant optimising for speed over ratio. Used in LevelDB, Cassandra, Hadoop.
LZ77 vs LZ78 — Key Differences and Trade-offs
Both algorithms belong to the same family, but they differ fundamentally in how they encode repetition. LZ77 stores references to a fixed-size sliding window; LZ78 stores references to an ever-growing dictionary of phrases. This leads to different strengths:
- LZ77 produces variable-length tokens depending on offset range, requiring entropy coding (Huffman) afterwards for efficiency.
- LZ78 tokens are fixed-length (dictionary index bit width), simplifying the output but limiting the dictionary size.
- LZ77 decompression is minimal overhead; LZ78 requires rebuilding the dictionary.
- LZ78 can match patterns across very long distances (as long as they entered the dictionary), while LZ77 is limited by window size.
In practice, LZ77 dominates (DEFLATE, zstd, brotli, LZMA). LZ78 appears in legacy GIF.
Compression in Practice: Choosing the Right Algorithm
Selecting between LZ variants depends on the data, constraints, and use case. Here's a pragmatic guide:
- Web delivery: Use Brotli (LZ77 + custom entropy) for text assets; gzip fallback.
- File archives: Use zstd for speed+ratio; LZMA for maximum ratio but slower.
- Databases (LSM trees): Use Snappy or LZ4 for per-block compression — low CPU overhead.
- Network compression (e.g., SSH): Use zlib (DEFLATE) for balanced compression.
- Image formats: PNG uses DEFLATE; GIF uses LZW; both are LZ-based.
When implementing custom compression: start with an LZ77 variant and add entropy coding (Huffman, arithmetic) if needed.
- LZ77: Looks back a fixed distance — like remembering the last few sentences.
- LZ78: Builds a phrase book — once a pattern is added, it's accessible forever.
- LZ77 is simpler to implement and stream.
- LZ78 can become memory-heavy and less effective when dictionary saturates.
The Hidden Cost: Why LZ77 and LZ78 Can Bite You in Production
Everyone loves compression ratios until the CPU spins at 100% for 30 seconds on a 10MB file. The sliding window in LZ77 and the growing dictionary in LZ78 have real costs that junior engineers ignore until pager duty calls.
LZ77's sliding window means you're doing string comparisons against every position in the buffer for every input byte. That's O(n * m) in the worst case, where m is your window size. Dial the window too large — say 32KB for a 1MB file — and you'll watch your throughput crater. Production systems like nginx and Apache tune this constantly.
LZ78's dictionary has its own pain point: memory. Every new substring adds an entry. On repetitive data like logs or configuration files, that dictionary balloons. You'll hit the OOM killer before you hit your compression target. Real-world derivatives like GIF limit the dictionary to 4096 entries for exactly this reason — they know the cost.
The trade-off isn't just compression ratio vs speed. It's predictability. LZ77 gives you bounded memory and unbounded CPU. LZ78 gives you bounded CPU but unbounded memory. Pick your poison based on the production environment, not the textbook.
Why Your Compressor Needs a Minimum Match Length
The textbooks show LZ77 outputting (offset, length, nextChar) for every match. They never tell you that matching a 2-byte sequence costs more than just emitting the raw bytes. This is the hidden gotcha that separates a working prototype from a production compressor.
Every back-reference requires you to encode, store, and decode an offset and a length. On a modern CPU, that's not free. If you emit a reference for a 2-byte match, you've just spent 2-3 bytes of metadata to save 1 byte of data. Net loss. Real compressors like zlib and LZMA enforce a minimum match length — typically 2 or 3 bytes. Anything shorter gets emitted as a literal.
The impact is measurable. On random data, minimum match length of 3 means most output is literals. That's fine — random data doesn't compress anyway. On structured data like JSON or XML, the minimum match threshold filters out noise matches that would bloat your compressed stream. I've seen 15% compression ratio improvements just by tuning this value from 1 to 3.
Production tip: profile your data. If your compressor produces more back-references than literals on common input, your minimum match length is probably too low. Raise it. Watch your ratio improve. Every byte counts when you're shipping firmware over cellular.
gzip compression silently inflates file size on binary data
- LZ-family algorithms rely on repetition; they can inflate incompressible data.
- Always test compression ratio on representative data before production.
- Use compression level or ratio thresholds to avoid wasted CPU and storage.
file data.bingzip -l data.gz # shows compression ratioKey takeaways
Common mistakes to avoid
5 patternsUsing gzip on already compressed data
file command to detect if data is already compressed. Skip compression if ratio would be <1.0.Claiming LZ77 compresses anything because it's 'universal'
Forgetting to reset LZW dictionary after it fills
Choosing window size too small for data
Implementing LZ77 without minimum match length check
Practice These on LeetCode
Interview Questions on This Topic
What is the fundamental difference between LZ77 and LZ78?
Frequently Asked Questions
20+ years shipping performance-critical code where algorithms decide the bill. Lessons pulled from things that broke in production.
That's Data Compression. Mark it forged?
6 min read · try the examples if you haven't