Arithmetic Coding — Beyond Huffman Compression
Huffman coding assigns a whole number of bits to each symbol. But what if a symbol has probability 0.99 — information theory says it deserves about 0.0145 bits, not 1? This mismatch between symbol probability and assigned bit length is the redundancy that arithmetic coding eliminates. For a message that is 99% one symbol, arithmetic coding compresses roughly 12x better than Huffman.
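The mismatch is easy to quantify: information theory assigns a symbol of probability p an ideal code length of -log2(p) bits. A quick check in Python:

```python
import math

p = 0.99
ideal = -math.log2(p)  # ideal code length for a symbol with probability 0.99
print(f'ideal:   {ideal:.4f} bits')  # ≈ 0.0145 bits
print('Huffman: 1 bit (the minimum any prefix code can assign)')
```

The gap between 0.0145 and 1 is pure redundancy — nearly 70x more bits than the symbol's information content warrants.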
Arithmetic coding is now in every modern video codec. H.264 uses CABAC (Context-Adaptive Binary Arithmetic Coding), one of the main reasons H.264 achieves better quality than MPEG-2 at the same bitrate. H.265/HEVC and AV1 also use arithmetic coding at their core. Virtually every Netflix stream, YouTube video, and Zoom call is decoded using arithmetic coding. The patent issues that delayed adoption for decades have expired, and it is now the standard.
The Core Idea
Divide [0,1) into sub-intervals proportional to symbol probabilities. For each input symbol, zoom into the sub-interval of the current symbol. The final interval uniquely identifies the message — transmit any number in that interval.
```python
from fractions import Fraction

def arithmetic_encode(message: str, probs: dict) -> Fraction:
    """Encode message as a fraction in [0, 1)."""
    # Build cumulative probability table
    symbols = sorted(probs)
    cum = {}
    total = Fraction(0)
    for s in symbols:
        cum[s] = total
        total += Fraction(probs[s]).limit_denominator(1000)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        range_size = high - low
        high = low + range_size * (cum[symbol] + Fraction(probs[symbol]).limit_denominator(1000))
        low = low + range_size * cum[symbol]
    return (low + high) / 2  # midpoint of final interval

def arithmetic_decode(code: Fraction, probs: dict, length: int) -> str:
    symbols = sorted(probs)
    cum = {}
    total = Fraction(0)
    for s in symbols:
        cum[s] = total
        total += Fraction(probs[s]).limit_denominator(1000)
    message = ''
    for _ in range(length):
        # Find the sub-interval containing the code, then rescale
        for s in reversed(symbols):
            if code >= cum[s]:
                message += s
                p = Fraction(probs[s]).limit_denominator(1000)
                code = (code - cum[s]) / p
                break
    return message

probs = {'a': 0.5, 'b': 0.25, 'c': 0.25}
msg = 'aab'
code = arithmetic_encode(msg, probs)
print(f'Encoded {msg!r} → {float(code):.6f}')
print(f'Decoded → {arithmetic_decode(code, probs, len(msg))!r}')
```

Output:

```
Encoded 'aab' → 0.156250
Decoded → 'aab'
```
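A useful way to see why this works: each symbol shrinks the interval by a factor of its probability, so the final width is the product of the symbol probabilities, and pinning down a number inside an interval of width w takes about -log2(w) bits (plus a bit or two of overhead in a real coder). A sketch for the 'aab' example:

```python
import math

probs = {'a': 0.5, 'b': 0.25, 'c': 0.25}
msg = 'aab'

# Each symbol multiplies the interval width by its probability.
width = math.prod(probs[s] for s in msg)  # 0.5 * 0.5 * 0.25 = 0.0625
bits = math.ceil(-math.log2(width))
print(bits)  # → 4 bits to identify the interval for 'aab'
```

This is exactly the entropy of the message under the model — fractional per-symbol costs accumulate into the interval width instead of being rounded up symbol by symbol.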
Why Arithmetic Beats Huffman
For symbol a with probability 0.99 and b with 0.01:
- Huffman must assign at least 1 bit to each symbol: a→0, b→1
- For a 100-symbol message that is 99 a's, Huffman uses 100 bits
- Entropy: H = -0.99·log2(0.99) - 0.01·log2(0.01) ≈ 0.081 bits/symbol
- Arithmetic coding: 100 × 0.081 ≈ 8 bits for the same message
- Arithmetic is ~12x better here
Huffman is optimal among prefix codes for any distribution, but it matches the entropy exactly only when all probabilities are negative powers of 2 (0.5, 0.25, 0.125...).
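The numbers above can be verified directly from the entropy formula:

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values())

h = entropy({'a': 0.99, 'b': 0.01})
print(f'entropy: {h:.4f} bits/symbol')                 # ≈ 0.0808
print(f'100 symbols: arithmetic ~{100 * h:.0f} bits, Huffman 100 bits')
print(f'ratio: ~{1 / h:.1f}x')                          # ≈ 12.4x
```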
Applications
JPEG2000: Replaces JPEG's Huffman with arithmetic coding for better quality at same bitrate.
HEVC/H.265: CABAC (Context-Adaptive Binary Arithmetic Coding) — every modern video stream uses arithmetic coding.
LZMA (xz/7-zip): Combines LZ77 with range coding (arithmetic coding variant) for best-in-class ratios.
PPM (Prediction by Partial Matching): Combines context modelling with arithmetic coding — near-theoretical compression on text.
🎯 Key Takeaways
- Arithmetic coding encodes entire message as single fraction — no per-symbol bit boundary.
- Achieves near-entropy compression — significantly better than Huffman for skewed distributions.
- Huffman wastes bits when symbol probabilities aren't powers of 1/2.
- Used in HEVC video (CABAC), JPEG2000, LZMA/xz, and modern compressors.
- ANS (Asymmetric Numeral Systems) is the modern practical variant — faster than traditional arithmetic coding.
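To make the ANS connection concrete, here is a minimal rANS sketch (hypothetical helper names; no renormalisation — Python's big integers let the state grow freely, whereas a production coder streams out bytes to keep the state bounded). Frequencies are scaled so they sum to a power of 2:

```python
def rans_encode(msg, freqs, scale_bits=8):
    """Fold a message into a single integer state. rANS encodes in
    reverse so the decoder can emit symbols in forward order."""
    total = 1 << scale_bits
    assert sum(freqs.values()) == total
    cum, c = {}, 0
    for s in sorted(freqs):
        cum[s], c = c, c + freqs[s]
    x = total  # any starting state >= total works
    for s in reversed(msg):
        f = freqs[s]
        # Each step grows x by roughly total/f, i.e. 1/p(s)
        x = (x // f) * total + (x % f) + cum[s]
    return x

def rans_decode(x, freqs, length, scale_bits=8):
    total = 1 << scale_bits
    cum, c = {}, 0
    for s in sorted(freqs):
        cum[s], c = c, c + freqs[s]
    out = []
    for _ in range(length):
        slot = x % total                      # low bits select the symbol
        s = max(s for s in cum if cum[s] <= slot)
        out.append(s)
        x = freqs[s] * (x // total) + slot - cum[s]
    return ''.join(out)

freqs = {'a': 128, 'b': 64, 'c': 64}  # probabilities 0.5/0.25/0.25 scaled to 256
x = rans_encode('aab', freqs)
print(rans_decode(x, freqs, 3))  # → aab
```

Like arithmetic coding, each encoding step effectively multiplies the state by 1/p(s); unlike arithmetic coding, encode and decode are each a single integer operation rather than an interval update, which is a large part of why ANS is faster in practice.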
Interview Questions on This Topic
- Why does arithmetic coding outperform Huffman coding for skewed distributions?
- What is the entropy of a source with P(a)=0.99, P(b)=0.01? How many bits does Huffman use per symbol?
- Explain the renormalisation step in practical arithmetic coders.
- What is the connection between arithmetic coding and ANS?
Frequently Asked Questions
Why isn't arithmetic coding used everywhere if it's better than Huffman?
Patent issues historically slowed adoption (IBM held key patents until the 2000s). Today, range coding (a patent-free variant) and ANS have replaced traditional arithmetic coding in new implementations. Huffman remains in DEFLATE/gzip for compatibility and speed.