Intermediate 3 min · March 24, 2026

gzip Compression — Binary Data Expansion with LZ77

LZ77’s sliding window can exceed 1:1 ratio on binary data, silently inflating files.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
Quick Answer
  • LZ77 uses a sliding window of recent data; output is (offset, length, next_char)
  • LZ78 builds an explicit phrase dictionary; output is (dictionary_index, next_char)
  • Both are universal: they can compress any stationary source asymptotically optimally
  • DEFLATE = LZ77 + Huffman coding powers gzip, ZIP, PNG, and HTTP compression
  • Modern variants: zstd (better ratios, faster), Snappy (speed over ratio), Brotli (web focus)
  • Performance: LZ77 compresses better with larger windows (e.g., 32KB) but uses more memory

Every gzip file, every zip archive, every PNG image, every HTTPS response with Content-Encoding: gzip — all of these use DEFLATE, which is LZ77 combined with Huffman coding. Lempel and Ziv published two papers in 1977 and 1978 that are among the most cited in information theory. The algorithms they described became the foundation of essentially all practical lossless compression for the next 50 years.

The key insight in LZ77 is that the best dictionary for compressing text is the text you have already seen. Instead of building a static dictionary from training data, the encoder and decoder both maintain a sliding window of recent output — they are always perfectly synchronised without transmitting the dictionary itself. This is why LZ compression is both efficient and streamable.

LZ77 — Sliding Window Compression

LZ77 maintains a sliding window of recently seen data (search buffer) and a lookahead buffer. For each position, it finds the longest match in the search buffer. The output is a triple: (offset, length, next_char). Offset is the distance back to the match; length is how many characters match; next_char is the literal following the match (or empty if at end).

The window size is a key parameter. Larger windows catch longer repetitions but increase search time and memory. In practice, DEFLATE uses 32KB maximum window. LZ77 is the core of all modern ZIP, gzip, and PNG compression.

LZ78 — Dictionary-Based Compression

LZ78 builds an explicit dictionary of phrases. Instead of a sliding window, the encoder maintains a growing table of previously seen substrings. Each token is (dictionary_index, new_char). The index refers to a previously added phrase; new_char extends it. The decoder builds the same dictionary as it reads tokens — no side-channel transmission needed.

The key difference from LZ77: LZ78 can refer to any past phrase, even very old ones, as long as they are in the dictionary. However, the dictionary has a fixed maximum size in practice. LZ78 is the basis for GIF (via LZW), TIFF, and Unix compress.

Practical implementations limit dictionary size (e.g., 12-bit indices = 4096 entries). When full, the dictionary is either frozen or reset.

Real-World Derivatives

DEFLATE (LZ77 + Huffman): Powers gzip, ZIP, PNG, HTTP compression (Content-Encoding: gzip). The dominant compression format on the internet.

LZW: LZ78 derivative. Used in GIF, old TIFF, Unix compress. Patents expired, now freely usable.

zstd (Zstandard): Modern Facebook LZ77 derivative. 3-5x faster than gzip with similar ratios. Used in Linux kernel, Facebook infrastructure, Hadoop.

Snappy: Google's LZ77 variant optimising for speed over ratio. Used in LevelDB, Cassandra, Hadoop.

LZ77 vs LZ78 — Key Differences and Trade-offs

Both algorithms belong to the same family, but they differ fundamentally in how they encode repetition. LZ77 stores references to a fixed-size sliding window; LZ78 stores references to an ever-growing dictionary of phrases. This leads to different strengths:

  • LZ77 produces variable-length tokens depending on offset range, requiring entropy coding (Huffman) afterwards for efficiency.
  • LZ78 tokens are fixed-length (dictionary index bit width), simplifying the output but limiting the dictionary size.
  • LZ77 decompression is minimal overhead; LZ78 requires rebuilding the dictionary.
  • LZ78 can match patterns across very long distances (as long as they entered the dictionary), while LZ77 is limited by window size.

In practice, LZ77 dominates (DEFLATE, zstd, brotli, LZMA). LZ78 appears in legacy GIF.

Compression in Practice: Choosing the Right Algorithm

Selecting between LZ variants depends on the data, constraints, and use case. Here's a pragmatic guide:

  • Web delivery: Use Brotli (LZ77 + custom entropy) for text assets; gzip fallback.
  • File archives: Use zstd for speed+ratio; LZMA for maximum ratio but slower.
  • Databases (LSM trees): Use Snappy or LZ4 for per-block compression — low CPU overhead.
  • Network compression (e.g., SSH): Use zlib (DEFLATE) for balanced compression.
  • Image formats: PNG uses DEFLATE; GIF uses LZW; both are LZ-based.

When implementing custom compression: start with an LZ77 variant and add entropy coding (Huffman, arithmetic) if needed.

LZ77 vs LZ78 Comparison
FeatureLZ77LZ78
Dictionary mechanismSliding window (fixed size)Explicit growing table
Output token(offset, length, next_char)(dictionary_index, new_char)
Token lengthVariable (offset range, length range)Fixed (index bits)
Distance of repetitionLimited by window sizeUnlimited (until dictionary full)
Memory on decoderOnly sliding window bufferDictionary rebuild table
Stream-friendlyYes (no state beyond window)Yes (but dictionary grows unbounded)
Typical implementationsDEFLATE, zstd, Brotli, SnappyLZW (GIF, compress)
Compression ratio potentialHigher with large windows + entropy codingGood for highly repetitive short sequences
SpeedFast (window search)Medium (dictionary lookup)
Popularity todayDominant (all web, archives)Diminishing (legacy only)

Key Takeaways

  • LZ77: sliding window back-references — token is (offset, length, next_char).
  • LZ78: growing phrase dictionary — token is (dict_index, new_char).
  • DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
  • LZ algorithms are universal compressors — asymptotically optimal for any data source.
  • Modern derivatives: zstd (speed+ratio), Snappy (speed), Brotli (web).
  • In production: test on real data; never compress encrypted or already compressed data.

Common Mistakes to Avoid

  • Using gzip on already compressed data
    Symptom: Compressed file is larger than original, CPU wasted
    Fix: Check entropy before compressing. Use file command to detect if data is already compressed. Skip compression if ratio would be <1.0.
  • Claiming LZ77 compresses anything because it's 'universal'
    Symptom: Incompressible data (random, encrypted) bloats instead of shrinks
    Fix: LZ algorithms exploit repetition only. Always test a sample before committing to compression.
  • Forgetting to reset LZW dictionary after it fills
    Symptom: Compression ratio drops as dictionary stops growing, then stays flat
    Fix: Monitor dictionary usage and reset periodically (e.g., every 10KB of input). Unix compress does this automatically.
  • Choosing window size too small for data
    Symptom: Compression ratio far below expected (e.g., 1.2x instead of 3x)
    Fix: Increase window size. For text files, 32KB is minimum; for log files, 64KB–256KB often helps.
  • Implementing LZ77 without minimum match length check
    Symptom: Too many short matches (length=2) overhead exceeds saving, inflating output
    Fix: Enforce minimum match length of at least 2 (or 3 for some implementations).

Interview Questions on This Topic

  • QWhat is the fundamental difference between LZ77 and LZ78?SeniorReveal
    LZ77 uses a sliding window of recently seen data. The encoder finds the longest match in a fixed-size buffer and outputs (offset, length, next_char). The window is the implicit dictionary — no explicit table is transmitted. LZ78 builds an explicit dictionary of phrases as it proceeds. Each token is (dictionary_index, new_char). The dictionary grows over time and is rebuilt identically on the decoder side. LZ77's memory is bounded by window size; LZ78's dictionary grows until it reaches a preset maximum.
  • QHow does DEFLATE combine LZ77 and Huffman coding?Mid-levelReveal
    DEFLATE first applies LZ77 to find repeated sequences and produces a stream of literals (uncompressed bytes) and length/distance pairs. Then Huffman coding is applied on that stream: literals and lengths share one Huffman tree, distances use another. Huffman coding reduces the variable-length LZ77 tokens to even fewer bits by assigning shorter codes to more frequent tokens. This two-stage approach is what gives DEFLATE its strong compression ratio.
  • QWhy do LZ algorithms significantly outperform Huffman-only compression?SeniorReveal
    Huffman coding alone compresses each symbol independently based on its frequency. It cannot capture patterns longer than one character. LZ algorithms identify repeated multi-character sequences and encode them with a single reference. For example, the word 'the' might appear many times in text. Huffman just gives 't', 'h', 'e' each a short code. LZ77 will encode the whole 'the' once and then refer back to it with a short (offset, length) token. This captures cross-symbol redundancy that Huffman misses. Combined with Huffman on the LZ output, you approach the entropy limit for many data sources.
  • QWhat data compresses best with LZ algorithms?Mid-levelReveal
    Highly repetitive data with patterns that repeat within the window size (LZ77) or within the dictionary (LZ78). Examples: source code (variable names, boilerplate), logs (repeated timestamps, messages), natural language text (common words, phrases), images with large uniform areas (PNG screenshots). Data that contains no repetition — random noise, encrypted data, already compressed files — will not compress or may even expand. LZ algorithms are 'universal' in theory but rely on stationary sources with redundancy.

Frequently Asked Questions

Why does PNG use DEFLATE instead of JPEG compression?

PNG is lossless — DEFLATE preserves every pixel exactly. JPEG is lossy — DCT + quantisation discards high-frequency detail for smaller files but changes pixel values. PNG uses DEFLATE for lossless image compression, which works well on computer graphics (limited colours, uniform regions) but poorly on photographs.

Can LZ77 compression be parallelised?

Yes, but not trivially. Each thread could process a separate block with its own sliding window. This is how pigz (parallel gzip) works. The blocks are independent; however, compression ratio drops because windows are smaller. zstd offers multi-threaded mode that divides the data into independent jobs and re-encodes with a larger window across blocks.

What is the maximum window size in standard LZ77 implementations?

DEFLATE (gzip, ZIP, PNG) uses a 32KB window. zstd supports up to 128MB window. LZMA (7-Zip) can use up to 1GB. Larger windows increase memory and speed trade-offs, but can significantly improve ratio on data with long-range repetition.

Why is the LZW patent expired now?

LZW was patented by Unisys in 1985. The patents expired in 2003 (US) and 2004 (worldwide). Since then, LZW can be used freely. This is why GIF is now unencumbered, though PNG is preferred for its better compression and colour support.

🔥

That's Data Compression. Mark it forged?

3 min read · try the examples if you haven't

Previous
Run-Length Encoding — Lossless Compression
2 / 4 · Data Compression
Next
Burrows-Wheeler Transform