
Arithmetic Coding — Beyond Huffman Compression

Learn arithmetic coding — how it achieves near-entropy compression by encoding entire messages as single fractions, surpassing Huffman coding for skewed probability distributions.
🔥 Advanced — solid DSA foundation required
In this tutorial, you'll learn:
  • Arithmetic coding encodes an entire message as a single fraction — no per-symbol bit boundaries.
  • Achieves near-entropy compression — significantly better than Huffman for skewed distributions.
  • Huffman wastes bits when symbol probabilities aren't exact powers of 1/2.
⚡ Quick Answer
Huffman coding assigns whole bits to each symbol — you can't assign 1.5 bits to a symbol that deserves it. Arithmetic coding encodes the entire message as a single number between 0 and 1, achieving compression arbitrarily close to the theoretical entropy limit — even for skewed distributions where one symbol has 99% probability.

Huffman coding assigns whole bits to symbols. But what if a symbol has probability 0.99? It deserves about 0.015 bits (that is, −log₂ 0.99), not a full bit. This mismatch between symbol probability and assigned bit length is the redundancy that arithmetic coding eliminates: for a message that is 99% one symbol, arithmetic coding uses roughly 12x fewer bits than Huffman (the arithmetic is worked out below).

Arithmetic coding is now in every modern video codec. H.264 uses CABAC (Context-Adaptive Binary Arithmetic Coding) — it is why H.264 achieves better quality than MPEG-2 at the same bitrate. H.265/HEVC and AV1 also use arithmetic coding at their core. Every Netflix stream, YouTube video, and Zoom call is decoded using arithmetic coding. The patent issues that delayed adoption for decades have expired, and it is now the standard.

The Core Idea

Divide [0, 1) into sub-intervals proportional to the symbol probabilities. For each input symbol, narrow the working interval to that symbol's sub-interval. The final interval uniquely identifies the message — transmit any number inside it.
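
A quick trace with the probabilities used in the code below (a: 0.5, b: 0.25, c: 0.25) and the message 'aab': the cumulative sub-intervals are a → [0, 0.5), b → [0.5, 0.75), c → [0.75, 1). Encoding starts at [0, 1); the first 'a' narrows it to [0, 0.5), the second 'a' to [0, 0.25), and 'b' to [0.125, 0.1875). Any number in that final interval (the midpoint 0.15625, say) identifies 'aab'.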

arithmetic_coding.py · PYTHON
from fractions import Fraction

def build_cumulative(probs: dict) -> dict:
    """Map each symbol to (interval start, probability) as exact fractions."""
    table, total = {}, Fraction(0)
    for s in sorted(probs):
        p = Fraction(probs[s]).limit_denominator(1000)  # exact rational, no float drift
        table[s] = (total, p)
        total += p
    return table

def arithmetic_encode(message: str, probs: dict) -> Fraction:
    """Encode message as a fraction in [0,1)."""
    table = build_cumulative(probs)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        start, p = table[symbol]
        range_size = high - low
        high = low + range_size * (start + p)  # narrow to the symbol's sub-interval
        low  = low + range_size * start
    return (low + high) / 2  # midpoint of final interval

def arithmetic_decode(code: Fraction, probs: dict, length: int) -> str:
    table = build_cumulative(probs)
    message = ''
    for _ in range(length):
        # Find the symbol whose sub-interval contains the code...
        for s in sorted(table, reverse=True):
            start, p = table[s]
            if code >= start:
                message += s
                code = (code - start) / p  # ...then undo that narrowing step
                break
    return message

probs = {'a': 0.5, 'b': 0.25, 'c': 0.25}
msg = 'aab'
code = arithmetic_encode(msg, probs)
print(f'Encoded {msg!r} → {float(code):.6f}')
print(f'Decoded → {arithmetic_decode(code, probs, len(msg))!r}')
▶ Output
Encoded 'aab' → 0.156250
Decoded → 'aab'
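
The final interval's width is the product of the symbol probabilities: here P(a) · P(a) · P(b) = 0.5 × 0.5 × 0.25 = 1/16, so about log₂(16) = 4 bits suffice to name a number inside it. That is exactly the information content of 'aab', which is why arithmetic coding tracks entropy so closely.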

Why Arithmetic Beats Huffman

For symbol a with probability 0.99 and b with 0.01 (verified in the snippet after this list):

  • Huffman must assign at least 1 bit to each symbol: a → 0, b → 1.
  • For a 100-symbol message that is 99 a's, Huffman uses 100 bits.
  • Entropy: H = −0.99 log₂(0.99) − 0.01 log₂(0.01) ≈ 0.081 bits/symbol.
  • Arithmetic coding: 100 × 0.081 ≈ 8 bits for the same message.
  • Arithmetic is ~12x better here.
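
These numbers are easy to verify; a quick check of the arithmetic, not production code:

entropy_check.py · PYTHON
import math

# Skewed two-symbol source from the example above
p = {'a': 0.99, 'b': 0.01}
entropy = -sum(q * math.log2(q) for q in p.values())
print(f'entropy: {entropy:.4f} bits/symbol')
print(f'100 symbols: arithmetic ≈ {100 * entropy:.1f} bits, Huffman = 100 bits')
print(f'advantage: {1 / entropy:.1f}x')
▶ Output
entropy: 0.0808 bits/symbol
100 symbols: arithmetic ≈ 8.1 bits, Huffman = 100 bits
advantage: 12.4x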

Huffman is only optimal when every symbol probability is an exact negative power of 2 (0.5, 0.25, 0.125, ...). For example, with P(a)=0.5, P(b)=P(c)=0.25, the Huffman code a → 0, b → 10, c → 11 averages exactly 1.5 bits/symbol, which equals the entropy, so arithmetic coding gains nothing there.

🔥 Practical Implementation

Real arithmetic coders use integer arithmetic with renormalisation (rescaling the interval whenever it narrows below a threshold) to avoid unbounded precision. Range coding is the best-known such variant, and careful carry handling gives 'carry-less' implementations. ANS (Asymmetric Numeral Systems), introduced by Duda in 2009, achieves similar compression with faster throughput and is used in Zstandard and modern codecs.
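
To make renormalisation concrete, here is a minimal sketch of the classic Witten–Neal–Cleary bit-by-bit scheme; the 16-bit width, names, and list-based bit output are illustrative assumptions, not taken from any particular codec.

renormalise.py · PYTHON
PRECISION = 16
HALF    = 1 << (PRECISION - 1)   # integer stand-in for 0.5
QUARTER = 1 << (PRECISION - 2)   # integer stand-in for 0.25

def renormalise(low: int, high: int, bits: list, pending: int):
    """Rescale the half-open interval [low, high) while its position
    pins down output bits, doubling it back to a workable width."""
    while True:
        if high < HALF:                      # entirely in [0, 0.5): next bit is 0
            bits.append(0)
            bits.extend([1] * pending)
            pending = 0
        elif low >= HALF:                    # entirely in [0.5, 1): next bit is 1
            bits.append(1)
            bits.extend([0] * pending)
            pending = 0
            low -= HALF
            high -= HALF
        elif low >= QUARTER and high < 3 * QUARTER:
            pending += 1                     # straddles 0.5: bit undecided, defer it
            low -= QUARTER                   # (this deferral is the carry handling)
            high -= QUARTER
        else:
            return low, high, pending        # interval wide enough to continue
        low *= 2                             # zoom back out: double the interval
        high *= 2

The encoder would call this after narrowing the interval for each symbol; the decoder runs the mirror-image loop, shifting input bits into its code register as the interval doubles.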

Applications

JPEG2000: Replaces JPEG's Huffman with arithmetic coding for better quality at same bitrate.

HEVC/H.265: CABAC (Context-Adaptive Binary Arithmetic Coding) — every modern video stream uses arithmetic coding.

LZMA (xz/7-zip): Combines LZ77 with range coding (arithmetic coding variant) for best-in-class ratios.

PPM (Prediction by Partial Matching): Combines context modelling with arithmetic coding — near-theoretical compression on text.

🎯 Key Takeaways

  • Arithmetic coding encodes an entire message as a single fraction — no per-symbol bit boundaries.
  • Achieves near-entropy compression — significantly better than Huffman for skewed distributions.
  • Huffman wastes bits when symbol probabilities aren't exact powers of 1/2.
  • Used in HEVC video (CABAC), JPEG2000, LZMA/xz, and modern compressors.
  • ANS (Asymmetric Numeral Systems) is the modern practical variant — faster than traditional arithmetic coding (a minimal sketch follows this list).
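
Since ANS comes up twice above, here is a minimal unbounded-precision rANS sketch. This is illustrative code under the assumption of integer symbol frequencies that sum to M; real implementations such as Zstandard's add renormalisation and lookup tables.

rans_sketch.py · PYTHON
def cumulative(freqs: dict) -> dict:
    """Map each symbol to its cumulative frequency start."""
    cum, t = {}, 0
    for s in sorted(freqs):
        cum[s] = t
        t += freqs[s]
    return cum

def rans_encode(message: str, freqs: dict, M: int) -> int:
    cum = cumulative(freqs)
    x = 1                                   # arbitrary start state
    for s in reversed(message):             # rANS encodes back-to-front
        f = freqs[s]
        x = (x // f) * M + cum[s] + (x % f)
    return x

def rans_decode(x: int, freqs: dict, M: int, length: int) -> str:
    cum = cumulative(freqs)
    out = []
    for _ in range(length):
        slot = x % M                        # which sub-range the state falls in
        s = next(sym for sym in sorted(freqs)
                 if cum[sym] <= slot < cum[sym] + freqs[sym])
        x = freqs[s] * (x // M) + slot - cum[s]
        out.append(s)
    return ''.join(out)

freqs = {'a': 2, 'b': 1, 'c': 1}            # same model as before, scaled to M = 4
x = rans_encode('aab', freqs, 4)
print(x, rans_decode(x, freqs, 4, 3))
▶ Output
24 aab

The whole message lives in the single integer state x (here 24, about log₂ 24 ≈ 4.6 bits), and encoding is just integer divides and multiplies, which is a large part of why ANS outruns classic interval-based arithmetic coding.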

Interview Questions on This Topic

  • Q: Why does arithmetic coding outperform Huffman coding for skewed distributions?
  • Q: What is the entropy of a source with P(a)=0.99, P(b)=0.01? How many bits does Huffman use per symbol?
  • Q: Explain the renormalisation step in practical arithmetic coders.
  • Q: What is the connection between arithmetic coding and ANS?

Frequently Asked Questions

Why isn't arithmetic coding used everywhere if it's better than Huffman?

Patent issues historically slowed adoption (IBM held key patents until the 2000s). Today, range coding (a patent-free variant) and ANS have replaced traditional arithmetic coding in new implementations. Huffman remains in DEFLATE/gzip for compatibility and speed.

