
gzip Compression — Binary Data Expansion with LZ77

📍 Part of: Data Compression → Topic 2 of 4
LZ77’s sliding window can exceed 1:1 ratio on binary data, silently inflating files.
⚙️ Intermediate — basic DSA knowledge assumed
In this tutorial, you'll learn
  • LZ77: sliding window back-references — token is (offset, length, next_char).
  • LZ78: growing phrase dictionary — token is (dict_index, new_char).
  • DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • LZ77 uses a sliding window of recent data; output is (offset, length, next_char)
  • LZ78 builds an explicit phrase dictionary; output is (dictionary_index, next_char)
  • Both are universal: they asymptotically approach the entropy rate of any stationary ergodic source
  • DEFLATE = LZ77 + Huffman coding powers gzip, ZIP, PNG, and HTTP compression
  • Modern variants: zstd (better ratios, faster), Snappy (speed over ratio), Brotli (web focus)
  • Performance: LZ77 compresses better with larger windows (e.g., 32KB) but uses more memory
🚨 START HERE

Quick Debug Cheat Sheet: Dictionary Compression

When your compression doesn't behave as expected, these commands and checks will pinpoint the issue.
🟡

gzip output larger than input

Immediate Action: Check whether the data is already compressed or encrypted
Commands
file data.bin
gzip -l data.gz # shows compressed/uncompressed sizes and ratio
Fix Now: Skip compression for that stage of the pipeline, or use zstd, which emits incompressible blocks as raw data with minimal overhead
🟡

PNG file is huge (not compressing well)

Immediate Action: Determine whether the image is photographic (better served by JPEG) or synthetic (PNG is fine)
Commands
identify -verbose image.png | grep Type
pngcheck -v image.png # shows chunk sizes and per-chunk details
Fix Now: Use pngcrush -brute image.png output.png to try all filter strategies
🟡

Custom LZ77 implementation gives wrong decompression

Immediate Action: Verify the output token format and that offset/length stay within window bounds
Commands
hexdump -C compressed.bin | head -20
python -c "from lz77 import lz77_encode, lz77_decode; t='abracadabra'; assert lz77_decode(lz77_encode(t)) == t"
Fix Now: Check that the offset never exceeds the current position; enforce the minimum match length (2 in the implementation below)
Production Incident

gzip compression silently inflates file size on binary data

A data pipeline using gzip on 50GB of sensor logs ended up with 80GB compressed files instead of smaller ones.
Symptom: Compressed output is larger than the original input — compression ratio (compressed ÷ original) above 1.0, i.e., expansion.
Assumption: gzip always reduces file size.
Root cause: The binary sensor data had no repetitive patterns — it was essentially random noise. LZ77's sliding window found no matches longer than 2 bytes, so the ~3-byte offset/length tokens cost more than they saved. (Standard DEFLATE can fall back to stored blocks, which caps expansion to a fraction of a percent; inflation this large usually points to a naive token encoder or per-record compression headers in the pipeline.)
Fix: Detect incompressible data up front — probe a sample and skip compression when it does not shrink — or accept that compression is not beneficial for this data.
Key Lesson
  • LZ-family algorithms rely on repetition; they can inflate incompressible data.
  • Always test compression ratio on representative data before production.
  • Use ratio thresholds to avoid wasted CPU and storage.
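The "probe before you compress" advice is easy to implement: compress a small sample with zlib and only proceed if it actually shrinks. A minimal sketch — the function name and 64KB sample size are illustrative choices, not a standard API:

```python
import os
import zlib

def worth_compressing(data: bytes, sample_size: int = 65536) -> bool:
    """Probe compressibility: compress a sample and check that it shrinks."""
    sample = data[:sample_size]
    # Level 1 is fast and good enough for a yes/no probe
    return len(zlib.compress(sample, 1)) < len(sample)

random_data = os.urandom(200_000)                 # looks like encrypted/compressed data
log_data = b'ERROR timeout on sensor 42\n' * 10_000

print(worth_compressing(random_data))             # False -> skip compression
print(worth_compressing(log_data))                # True  -> compress
```

Random bytes fail the probe because DEFLATE's stored-block fallback still adds header overhead, so the sample never shrinks; repetitive logs pass easily.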
Production Debug Guide

Symptoms, root causes, and actions for common LZ77/LZ78 failures

  • Compressed file is larger than original — Check data entropy. If the data is random or encrypted, compression will inflate it. Skip compression or use a different algorithm.
  • Decompression produces garbage or crashes — Verify compressed stream integrity. Recompute checksums. Check for truncated data or wrong algorithm parameters (e.g., window size mismatch).
  • Compression ratio is lower than expected — Increase the sliding window size (e.g., from 32KB to 64KB), or enable lazy matching as DEFLATE does at its higher levels.
  • Memory usage spikes during compression — Large windows consume memory. Limit to 32KB in constrained environments. Use streaming mode to avoid buffering the entire file.

Every gzip file, every zip archive, every PNG image, every HTTPS response with Content-Encoding: gzip — all of these use DEFLATE, which is LZ77 combined with Huffman coding. Lempel and Ziv published two papers in 1977 and 1978 that are among the most cited in information theory. The algorithms they described became the foundation of essentially all practical lossless compression for the next 50 years.

The key insight in LZ77 is that the best dictionary for compressing text is the text you have already seen. Instead of building a static dictionary from training data, the encoder and decoder both maintain a sliding window of recent output — they are always perfectly synchronised without transmitting the dictionary itself. This is why LZ compression is both efficient and streamable.

LZ77 — Sliding Window Compression

LZ77 maintains a sliding window of recently seen data (search buffer) and a lookahead buffer. For each position, it finds the longest match in the search buffer. The output is a triple: (offset, length, next_char). Offset is the distance back to the match; length is how many characters match; next_char is the literal following the match (or empty if at end).

The window size is a key parameter. Larger windows catch longer repetitions but increase search time and memory. In practice, DEFLATE uses 32KB maximum window. LZ77 is the core of all modern ZIP, gzip, and PNG compression.

lz77.py · PYTHON
def lz77_encode(data: str, window: int = 20, lookahead: int = 10) -> list:
    """
    Returns list of (offset, length, next_char) tokens.
    offset=0, length=0 means literal character.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_offset = best_length = 0
        # Search in window for longest match
        start = max(0, i - window)
        for j in range(start, i):
            length = 0
            while (length < lookahead and
                   i + length < len(data) and
                   data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_length = length
                best_offset = i - j
        if best_length >= 2:
            next_char = data[i + best_length] if i + best_length < len(data) else ''
            tokens.append((best_offset, best_length, next_char))
            i += best_length + 1
        else:
            tokens.append((0, 0, data[i]))
            i += 1
    return tokens

def lz77_decode(tokens: list) -> str:
    result = ''
    for offset, length, char in tokens:
        if offset == 0 and length == 0:
            result += char
        else:
            start = len(result) - offset
            for k in range(length):
                result += result[start + k]
            result += char
    return result

text = 'abracadabrabracabra'
tokens = lz77_encode(text)
print(f'Tokens: {tokens}')
print(f'Decoded: {lz77_decode(tokens)}')
print(f'Original: {len(text)} chars, Tokens: {len(tokens)} (each needs ~3 bytes)')
▶ Output
Tokens: [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'r'), (0, 0, 'a'), (0, 0, 'c'), (0, 0, 'a'), (0, 0, 'd'), (7, 4, 'b'), (10, 4, 'b'), (15, 2, '')]
Decoded: abracadabrabracabra
Original: 19 chars, Tokens: 10 (each needs ~3 bytes)
📊 Production Insight
Window size directly impacts compression ratio. DEFLATE's 32KB limit means long-range repetitions (across files) are not caught.
In a real incident, gzipping a VM disk image failed to shrink it because patterns spanned gigabytes.
Rule: Match window size to expected repetition distance in your data.
🎯 Key Takeaway
LZ77 output format: (offset, length, next_char).
The sliding window avoids transmitting the dictionary.
Success depends on repetition frequency within window size.
Choosing LZ77 Parameters
If: Data has repetitions within 32KB
Use: DEFLATE (gzip/PKZIP) — good balance of speed and ratio
If: Data has very long-range repetitions (>32KB)
Use: zstd with a large window (long mode, up to 128MB) or LZMA with a dictionary up to 1GB
If: Compression speed is critical, ratio secondary
Use: Snappy or LZ4 — minimal match-search overhead
If: Streaming data, memory constrained
Use: A fixed small window (e.g., 16KB) and a short lookahead buffer
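The window-size trade-off above can be observed directly with Python's zlib, which caps DEFLATE's window via the wbits parameter (9–15, i.e., 512B–32KB). In this sketch the only compressible structure is a random 4KB block whose repeat lies about 12KB back, so only the full 32KB window can exploit it:

```python
import os
import zlib

rnd = os.urandom(4096)                    # locally incompressible block
data = rnd + bytes(8192) + rnd            # the repeat lies ~12 KB back

for wbits in (12, 15):                    # 4 KB window vs 32 KB window
    comp = zlib.compressobj(9, zlib.DEFLATED, wbits)
    out = comp.compress(data) + comp.flush()
    print(f'window = {1 << wbits:6d} B -> {len(out)} bytes')
```

With the 4KB window the second copy of rnd is out of reach and is emitted as literals (~8KB output); with the 32KB window it collapses into back-references (~4KB output).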

LZ78 — Dictionary-Based Compression

LZ78 builds an explicit dictionary of phrases. Instead of a sliding window, the encoder maintains a growing table of previously seen substrings. Each token is (dictionary_index, new_char). The index refers to a previously added phrase; new_char extends it. The decoder builds the same dictionary as it reads tokens — no side-channel transmission needed.

The key difference from LZ77: LZ78 can refer to any past phrase, even very old ones, as long as they are in the dictionary. However, the dictionary has a fixed maximum size in practice. LZ78 is the basis for GIF (via LZW), TIFF, and Unix compress.

Practical implementations limit dictionary size (e.g., 12-bit indices = 4096 entries). When full, the dictionary is either frozen or reset.

lz78.py · PYTHON
def lz78_encode(data: str) -> list[tuple]:
    dictionary = {}
    tokens = []
    current = ''
    for char in data:
        extended = current + char
        if extended in dictionary:
            current = extended
        else:
            idx = dictionary.get(current, 0)
            tokens.append((idx, char))
            dictionary[extended] = len(dictionary) + 1
            current = ''
    if current:
        tokens.append((dictionary.get(current, 0), ''))
    return tokens

text = 'abababab'
tokens = lz78_encode(text)
print(f'Input: {text}')
print(f'LZ78 tokens: {tokens}')
▶ Output
Input: abababab
LZ78 tokens: [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]
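The decoder side, omitted above, rebuilds exactly the same phrase dictionary from the token stream. A minimal sketch matching the encoder's indexing (the first phrase gets index 1; index 0 means the empty phrase, i.e., a pure literal):

```python
def lz78_decode(tokens: list) -> str:
    dictionary = {0: ''}                   # index 0 = empty phrase (pure literal)
    parts = []
    for idx, char in tokens:
        phrase = dictionary[idx] + char    # extend a previously seen phrase
        parts.append(phrase)
        dictionary[len(dictionary)] = phrase
    return ''.join(parts)

# Token stream produced by running lz78_encode on 'abababab'
print(lz78_decode([(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]))
```

This prints abababab — the round trip is lossless, and no dictionary was ever transmitted.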
📊 Production Insight
LZ78's dictionary can fill up fast. In a production archive of 10,000 similar PDF files, the LZW dictionary saturated after 4096 entries, causing a drop in compression ratio.
The fix was to reset the dictionary periodically — a common trick in GIF and Unix compress.
Rule: For large datasets, reset or prune the dictionary to maintain ratio.
🎯 Key Takeaway
LZ78 builds a growing phrase dictionary.
Output is (index, char) – index 0 means literal.
Dictionary size limit is a real bottleneck in practice.
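To make the dictionary-size limit concrete, here is a hedged sketch of the reset strategy: a variant of the encoder above with a bounded dictionary. The name lz78_encode_capped, the max_entries value, and the reset-when-full policy are illustrative — real LZW instead grows its code width and emits an explicit clear code:

```python
def lz78_encode_capped(data: str, max_entries: int = 4096) -> list:
    """LZ78 with a bounded dictionary that resets when full."""
    dictionary, tokens, current = {}, [], ''
    for char in data:
        extended = current + char
        if extended in dictionary:
            current = extended
        else:
            tokens.append((dictionary.get(current, 0), char))
            if len(dictionary) >= max_entries:
                dictionary = {}            # reset; the decoder mirrors this rule
            dictionary[extended] = len(dictionary) + 1
            current = ''
    if current:
        tokens.append((dictionary.get(current, 0), ''))
    return tokens
```

Because the reset is triggered deterministically by dictionary size, the decoder can reproduce it at exactly the same point without any side channel.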

Real-World Derivatives

DEFLATE (LZ77 + Huffman): Powers gzip, ZIP, PNG, HTTP compression (Content-Encoding: gzip). The dominant compression format on the internet.

LZW: LZ78 derivative. Used in GIF, old TIFF, Unix compress. Patents expired, now freely usable.

zstd (Zstandard): Modern LZ77 derivative developed at Facebook (Meta). 3–5x faster than gzip at similar ratios. Used in the Linux kernel, Facebook infrastructure, Hadoop.

Snappy: Google's LZ77 variant optimising for speed over ratio. Used in LevelDB, Cassandra, Hadoop.

🔥Why LZ Algorithms Dominate
LZ algorithms are universal — they achieve the entropy limit for any stationary source given enough data. Huffman coding alone can only approach single-symbol entropy, while LZ captures multi-symbol patterns. This is why DEFLATE (LZ77 + Huffman) beats pure Huffman significantly.
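This gap is easy to quantify: take data whose byte frequencies are perfectly uniform (so a symbol-by-symbol Huffman coder can do no better than the order-0 entropy) but which consists of one repeating multi-byte phrase. A sketch comparing the Huffman-only floor against actual DEFLATE output:

```python
import math
import zlib
from collections import Counter

data = b'abcdefgh' * 1000   # uniform byte frequencies, one repeating 8-byte phrase

# Order-0 entropy bound: the best any symbol-by-symbol coder (Huffman) can approach
n = len(data)
entropy_bits = -sum(c / n * math.log2(c / n) for c in Counter(data).values())
huffman_floor = math.ceil(entropy_bits * n / 8)   # 3 bits/byte -> 3000 bytes

deflate_size = len(zlib.compress(data, 9))
print(f'Huffman-only floor:      {huffman_floor} bytes')
print(f'DEFLATE (LZ77+Huffman):  {deflate_size} bytes')
```

Huffman alone cannot go below 3000 bytes here, while DEFLATE's back-references collapse the repetition to a few dozen bytes.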
📊 Production Insight
zstd is the preferred choice today for new applications: it offers multiple compression levels, large windows, and dictionary training for small messages.
For web delivery, gzip (DEFLATE) and Brotli remain the defaults thanks to universal browser support.
Rule: When starting a new project, choose zstd over gzip for better speed and ratio.
🎯 Key Takeaway
DEFLATE = LZ77 + Huffman is the internet standard.
zstd is faster and better than gzip for most workloads.
LZW is legacy but still present in GIF and old file formats.

LZ77 vs LZ78 — Key Differences and Trade-offs

Both algorithms belong to the same family, but they differ fundamentally in how they encode repetition. LZ77 stores references to a fixed-size sliding window; LZ78 stores references to an ever-growing dictionary of phrases. This leads to different strengths:

  • LZ77 produces variable-length tokens depending on offset range, requiring entropy coding (Huffman) afterwards for efficiency.
  • LZ78 tokens are fixed-length (dictionary index bit width), simplifying the output but limiting the dictionary size.
  • LZ77 decompression is minimal overhead; LZ78 requires rebuilding the dictionary.
  • LZ78 can match patterns across very long distances (as long as they entered the dictionary), while LZ77 is limited by window size.

In practice, LZ77 dominates (DEFLATE, zstd, Brotli, LZMA). LZ78 survives mainly in legacy formats such as GIF (via LZW) and Unix compress.

📊 Production Insight
Choosing the wrong LZ variant for your data can waste 30% compression ratio. For logs with repetitive timestamps and error messages, LZ77 with large window (64KB) outperforms LZ78. For highly repetitive but short sequences (like binary headers), LZ78's phrase dictionary might give better results.
In a real migration from compress (LZW) to gzip (DEFLATE), a company saw 20% better compression on source code archives.
Rule: Profile both on your data before committing to a format.
🎯 Key Takeaway
LZ77 = sliding window; good for general data.
LZ78 = phrase dictionary; good for highly repetitive short patterns.
Modern standards mostly use LZ77 due to flexibility.
LZ77 vs LZ78: When to Use Which
If: Streaming data, limited memory
Use: LZ77 with a small window (e.g., DEFLATE's 32KB)
If: Data contains many short, highly repetitive patterns
Use: LZ78/LZW can be effective (e.g., indexed images, line-based text)
If: Need the highest compression ratio
Use: LZ77 variants with large windows and entropy coding (zstd, LZMA)
If: Legacy compatibility (GIF, Unix compress)
Use: LZ78/LZW is required — LZ77 is not an option

Compression in Practice: Choosing the Right Algorithm

Selecting between LZ variants depends on the data, constraints, and use case. Here's a pragmatic guide:

  • Web delivery: Use Brotli (LZ77 + custom entropy) for text assets; gzip fallback.
  • File archives: Use zstd for speed+ratio; LZMA for maximum ratio but slower.
  • Databases (LSM trees): Use Snappy or LZ4 for per-block compression — low CPU overhead.
  • Network compression (e.g., SSH): Use zlib (DEFLATE) for balanced compression.
  • Image formats: PNG uses DEFLATE; GIF uses LZW; both are LZ-based.

When implementing custom compression: start with an LZ77 variant and add entropy coding (Huffman, arithmetic) if needed.
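Python's standard library makes it easy to benchmark candidate algorithms on representative data before committing. A quick sketch using synthetic log lines (the log format here is made up for illustration):

```python
import bz2
import lzma
import zlib

# Synthetic log data: repetitive timestamps and messages, like the use cases above
data = b''.join(
    b'2024-01-01T00:00:%02d INFO request ok id=%04d\n' % (i % 60, i)
    for i in range(2000)
)

for name, compress in (('zlib/DEFLATE', lambda d: zlib.compress(d, 6)),
                       ('bz2/BWT', bz2.compress),
                       ('lzma/LZMA', lzma.compress)):
    print(f'{name:12s} {len(data)} -> {len(compress(data))} bytes')
```

Relative rankings vary by data shape and size, which is exactly why a probe on your own data beats any rule of thumb.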

Mental Model
The Local vs Global Dictionary Trade-off
Think of LZ77 as a short-term memory and LZ78 as a long-term vocabulary.
  • LZ77: Looks back a fixed distance — like remembering the last few sentences.
  • LZ78: Builds a phrase book — once a pattern is added, it's accessible forever.
  • LZ77 is simpler to implement and stream.
  • LZ78 can become memory-heavy and less effective when dictionary saturates.
📊 Production Insight
A common production mistake: using gzip to compress database pages that already have internal compression. This inflates size because the data looks random to LZ77.
Rule: Check if the input is already compressed (e.g., JPEG, encrypted data) before applying LZ.
🎯 Key Takeaway
Match the LZ variant to your data's repetition characteristics.
Always test with representative data before production.
Modern formats like zstd offer best compromise.
🗂 LZ77 vs LZ78 Comparison
Core differences in dictionary strategy, output format, and real-world use
Feature | LZ77 | LZ78
Dictionary mechanism | Sliding window (fixed size) | Explicit growing table
Output token | (offset, length, next_char) | (dictionary_index, new_char)
Token length | Variable (offset/length ranges) | Fixed (index bit width)
Repetition distance | Limited by window size | Unlimited (until dictionary fills)
Decoder memory | Sliding window buffer only | Rebuilt dictionary table
Stream-friendly | Yes (no state beyond window) | Yes (but dictionary grows until reset)
Typical implementations | DEFLATE, zstd, Brotli, Snappy | LZW (GIF, compress)
Compression ratio potential | Higher with large windows + entropy coding | Good for short, highly repetitive sequences
Speed | Fast (window search) | Medium (dictionary lookup)
Popularity today | Dominant (web, archives) | Diminishing (legacy only)

🎯 Key Takeaways

  • LZ77: sliding window back-references — token is (offset, length, next_char).
  • LZ78: growing phrase dictionary — token is (dict_index, new_char).
  • DEFLATE = LZ77 + Huffman — powers gzip, ZIP, PNG, HTTP compression.
  • LZ algorithms are universal compressors — asymptotically optimal for any stationary ergodic source.
  • Modern derivatives: zstd (speed+ratio), Snappy (speed), Brotli (web).
  • In production: test on real data; never compress encrypted or already compressed data.

⚠ Common Mistakes to Avoid

    Using gzip on already compressed data
    Symptom

    Compressed file is larger than original, CPU wasted

    Fix

    Check before compressing: the file command identifies already compressed formats (JPEG, ZIP, encrypted containers), or compress a sample and skip if it doesn't shrink.

    Claiming LZ77 compresses anything because it's 'universal'
    Symptom

    Incompressible data (random, encrypted) bloats instead of shrinks

    Fix

    LZ algorithms exploit repetition only. Always test a sample before committing to compression.

    Forgetting to reset LZW dictionary after it fills
    Symptom

    Compression ratio drops as dictionary stops growing, then stays flat

    Fix

    Monitor dictionary usage and reset when it fills or when the ratio degrades — Unix compress does this automatically.

    Choosing window size too small for data
    Symptom

    Compression ratio far below expected (e.g., 1.2x instead of 3x)

    Fix

    Increase window size. For text files, 32KB is minimum; for log files, 64KB–256KB often helps.

    Implementing LZ77 without a minimum match length check
    Symptom

    Many length-2 matches whose ~3-byte tokens cost more than they save, inflating output

    Fix

    Enforce a minimum match length of at least 2; DEFLATE uses 3.

Interview Questions on This Topic

  • Q: What is the fundamental difference between LZ77 and LZ78? (Senior)
    LZ77 uses a sliding window of recently seen data. The encoder finds the longest match in a fixed-size buffer and outputs (offset, length, next_char). The window is the implicit dictionary — no explicit table is transmitted. LZ78 builds an explicit dictionary of phrases as it proceeds. Each token is (dictionary_index, new_char). The dictionary grows over time and is rebuilt identically on the decoder side. LZ77's memory is bounded by window size; LZ78's dictionary grows until it reaches a preset maximum.
  • Q: How does DEFLATE combine LZ77 and Huffman coding? (Mid-level)
    DEFLATE first applies LZ77 to find repeated sequences and produces a stream of literals (uncompressed bytes) and length/distance pairs. Then Huffman coding is applied on that stream: literals and lengths share one Huffman tree, distances use another. Huffman coding reduces the variable-length LZ77 tokens to even fewer bits by assigning shorter codes to more frequent tokens. This two-stage approach is what gives DEFLATE its strong compression ratio.
  • Q: Why do LZ algorithms significantly outperform Huffman-only compression? (Senior)
    Huffman coding alone compresses each symbol independently based on its frequency. It cannot capture patterns longer than one character. LZ algorithms identify repeated multi-character sequences and encode them with a single reference. For example, the word 'the' might appear many times in text. Huffman just gives 't', 'h', 'e' each a short code. LZ77 will encode the whole 'the' once and then refer back to it with a short (offset, length) token. This captures cross-symbol redundancy that Huffman misses. Combined with Huffman on the LZ output, you approach the entropy limit for many data sources.
  • Q: What data compresses best with LZ algorithms? (Mid-level)
    Highly repetitive data with patterns that repeat within the window size (LZ77) or within the dictionary (LZ78). Examples: source code (variable names, boilerplate), logs (repeated timestamps, messages), natural language text (common words, phrases), images with large uniform areas (PNG screenshots). Data that contains no repetition — random noise, encrypted data, already compressed files — will not compress or may even expand. LZ algorithms are 'universal' in theory but rely on stationary sources with redundancy.

Frequently Asked Questions

Why does PNG use DEFLATE instead of JPEG compression?

PNG is lossless — DEFLATE (together with PNG's pre-filters) preserves every pixel exactly. JPEG is lossy — DCT plus quantisation discards high-frequency detail for smaller files but changes pixel values. DEFLATE works well on computer graphics (limited colours, uniform regions) but poorly on photographs.

Can LZ77 compression be parallelised?

Yes, but not trivially. Each thread can process a separate block with its own sliding window — this is how pigz (parallel gzip) works. The blocks are independent, so compression ratio drops slightly because matches cannot cross block boundaries. zstd's multi-threaded mode likewise splits the input into independent jobs, overlapping their windows to limit the ratio loss.
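The block-independence cost is visible in a quick sketch: compress one repetitive stream whole, then in independent 64KB chunks as a parallel compressor would. Concatenated gzip members still form a valid gzip stream, so the chunked output remains decompressible:

```python
import gzip

data = b'GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n' * 4000

whole = len(gzip.compress(data))
chunked = sum(len(gzip.compress(data[i:i + 65536]))
              for i in range(0, len(data), 65536))
print(f'whole: {whole} bytes, chunked: {chunked} bytes')
```

The chunked total is slightly larger: each chunk pays for its own gzip header, Huffman trees, and a cold window.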

What is the maximum window size in standard LZ77 implementations?

DEFLATE (gzip, ZIP, PNG) uses a 32KB window. zstd supports up to a 128MB window in long mode. LZMA (7-Zip) can use a dictionary up to 1GB. Larger windows cost more memory and search time, but can significantly improve the ratio on data with long-range repetition.

Why is LZW no longer patent-encumbered?

LZW was patented by Unisys in 1985. The patents expired in 2003 (US) and 2004 (worldwide). Since then, LZW can be used freely. This is why GIF is now unencumbered, though PNG is preferred for its better compression and colour support.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
