Arithmetic Coding — Why Frame 500 Decodes Wrong
- Arithmetic coding encodes entire message as a single fraction in [0,1) — no per-symbol bit boundaries
- How it works: recursively subdivide interval by symbol probabilities; final interval uniquely identifies message
- Performance: For P(a)=0.99, P(b)=0.01, Huffman uses 1 bit/symbol, entropy is 0.081 bits/symbol — arithmetic uses ~8 bits per 100 symbols, Huffman uses 100 bits (12x better)
- Production reality: Used in H.264/H.265 (CABAC), JPEG2000, LZMA (range coding variant)
- Biggest mistake: Implementing with floating point — precision fails on long messages. Real coders use integer math with renormalization.
Arithmetic Coding Debug Cheat Sheet
Production debug guide: symptom → action mapping for compression implementation failures.

Decoder diverges after N symbols
- `python -c "import struct; import sys; print('float precision:', sys.float_info.dig)"`
- For integer math, check the range minimum: if range < 256, renormalize (shift).

Compression ratio lower than expected
- `echo -n "message" | entropy -b`
- `wc -c original.bin compressed.bin`

Output differs between platforms (x86 vs ARM)
- `grep -r 'float\|double' src/`
- `grep -r '\/ 2\^' src/; grep 'LIMIT' src/`

Carry overflow — low + range > 2^32
- `grep -n 'low.*range' src/rangecoder.c`
- Carry propagation usually requires an output buffer.

Renormalization loops too many times (slow)
- `grep -n 'while (range < ' src/`
- `perf stat -e cycles ./encoder largefile`

Production Incident
- The range >> 24 check: when range < 2^24, shift left until range >= 2^24 and output the top bits.
- When low + range wraps past 2^32, you need carry tracking: implement the range coder with carry propagation, or use 64-bit intermediates. A sketch of the carry check follows below.
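To make the carry condition concrete, here is a minimal sketch, assuming a 32-bit state and an illustrative 2^24 renormalization threshold (the names `MASK32`, `TOP`, and `encode_step` are ours, not from any particular codec):

```python
# Minimal sketch of the carry check in a 32-bit integer range coder.
# Constants and names are illustrative; real coders (e.g. LZMA) also buffer
# output bytes so a late carry can still ripple through them.
MASK32 = (1 << 32) - 1
TOP = 1 << 24                         # renormalize when range < 2^24

def encode_step(low, rng, cum, freq, total):
    """Narrow [low, low + rng) to a symbol's sub-interval and detect carry."""
    step = rng // total
    new_low = (low + step * cum) & MASK32
    carry = new_low < low             # wrapped past 2^32: carry must propagate
    rng = step * freq
    while rng < TOP:                  # renormalization: scale the interval up
        new_low = (new_low << 8) & MASK32   # (a real coder emits the top byte here)
        rng <<= 8
    return new_low, rng, carry

# Symbol with cum=3, freq=1, total=4 (P = 0.25), near the top of the range:
print(encode_step(low=0xFFF00000, rng=0x01000000, cum=3, freq=1, total=4))
# carry=True: low wrapped past 2^32, so a carry must be propagated
```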
Huffman coding assigns whole bits to symbols. But what if a symbol has probability 0.99 — it deserves 0.015 bits, not 1? This mismatch between symbol probability and assigned bit length is the redundancy that arithmetic coding eliminates. For a message consisting of 99% one symbol, arithmetic coding compresses it about 12x better than Huffman.
Arithmetic coding is now in every modern video codec. H.264 uses CABAC (Context-Adaptive Binary Arithmetic Coding) — a key reason H.264 achieves better quality than MPEG-2 at the same bitrate. H.265/HEVC and AV1 also use arithmetic coding at their core. Every Netflix stream, YouTube video, and Zoom call is decoded using arithmetic coding. The patents that delayed adoption for decades have expired, and it is now the standard.
By the end you'll understand why Huffman fails for skewed distributions, how arithmetic coding recursively divides the number line, the renormalization trick that makes integer implementation possible, and how CABAC uses context to adapt probabilities on the fly.
The Core Idea
Arithmetic coding treats the message as a single number in the interval [0,1). The encoder recursively divides this interval according to symbol probabilities. After processing the entire message, any number in the final interval uniquely identifies it. The decoder works backwards: given the code, it finds which symbol's interval contains it, updates the interval, and repeats.
For a message "aab" with probabilities P(a)=0.5, P(b)=0.25, P(c)=0.25: - Start with [0,1). For 'a' (0 to 0.5), new interval [0, 0.5). - For second 'a' (0 to 0.5), subdivide [0,0.5) into [0,0.25), [0.25,0.375), [0.375,0.5). Second 'a' yields [0,0.25). - For 'b' in last symbol: within [0,0.25), 'b' covers [0.125,0.1875). - Final code = 0.15625 (midpoint). The decoder reverse-engineers this.
```python
from fractions import Fraction

def arithmetic_encode(message: str, probs: dict) -> Fraction:
    """Encode message as a fraction in [0,1)."""
    # Build cumulative probability table
    symbols = sorted(probs)
    cum = {}
    total = Fraction(0)
    for s in symbols:
        cum[s] = total
        total += Fraction(probs[s]).limit_denominator(1000)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        # Narrow the interval to this symbol's proportional sub-interval
        range_size = high - low
        high = low + range_size * (cum[symbol] + Fraction(probs[symbol]).limit_denominator(1000))
        low = low + range_size * cum[symbol]
    return (low + high) / 2  # midpoint of final interval

def arithmetic_decode(code: Fraction, probs: dict, length: int) -> str:
    symbols = sorted(probs)
    cum = {}
    total = Fraction(0)
    for s in symbols:
        cum[s] = total
        total += Fraction(probs[s]).limit_denominator(1000)
    message = ''
    for _ in range(length):
        # Find the symbol whose sub-interval contains the code, then zoom out
        for s in reversed(symbols):
            if code >= cum[s]:
                message += s
                p = Fraction(probs[s]).limit_denominator(1000)
                code = (code - cum[s]) / p
                break
    return message

probs = {'a': 0.5, 'b': 0.25, 'c': 0.25}
msg = 'aab'
code = arithmetic_encode(msg, probs)
print(f'Encoded {msg!r} → {float(code):.6f}')
print(f'Decoded → {arithmetic_decode(code, probs, len(msg))!r}')
```
- Start with the unit interval [0, 1). Each symbol has a fraction of this line proportional to its probability.
- For each symbol in the message, narrow the current interval to that symbol's sub-interval, keeping the same proportion.
- After the last symbol, pick any number in the final interval — conventionally the midpoint.
- Decoding: given the code number, find which symbol's interval contains it, then 'zoom out' by that symbol's probability to prepare for the next symbol.
- The longer the message, the more precise the number needed. This is why renormalization (scaling the interval back up) is required for finite precision; see the sketch below.
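That precision ceiling is easy to demonstrate. Below is a small sketch (the helpers `float_encode`/`float_decode` and the three-symbol alphabet are ours) that reruns the interval arithmetic from the example above in float64 instead of exact fractions; once the message needs more than the ~53 mantissa bits a double provides, decoding silently diverges:

```python
# Demonstration: a float64 version of the coder above silently fails once
# the interval needs more precision than a double's ~53-bit mantissa.
probs = {'a': 0.5, 'b': 0.25, 'c': 0.25}
cum = {'a': 0.0, 'b': 0.5, 'c': 0.75}

def float_encode(message):
    low, high = 0.0, 1.0
    for s in message:
        rng = high - low
        high = low + rng * (cum[s] + probs[s])
        low = low + rng * cum[s]
    return (low + high) / 2

def float_decode(code, length):
    out = []
    for _ in range(length):
        # Largest cumulative probability still below the code
        s = max((t for t in cum if code >= cum[t]), key=cum.get)
        out.append(s)
        code = (code - cum[s]) / probs[s]   # zoom back out
    return ''.join(out)

msg = 'abc' * 20                 # 60 symbols need ~100 bits of precision
decoded = float_decode(float_encode(msg), len(msg))
print(decoded == msg)            # False: the decoder diverges partway through
```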
Why Arithmetic Beats Huffman
- Huffman must assign at least 1 bit to each symbol: a→0, b→1
- For a 100-symbol message that is 99 a's, Huffman uses 100 bits
- Entropy: H = -0.99 × log2(0.99) - 0.01 × log2(0.01) ≈ 0.081 bits/symbol
- Arithmetic coding: 100 × 0.081 ≈ 8 bits for the same message
- Arithmetic is ~12x better here
Huffman is only optimal when all probabilities are exact negative powers of 2 (0.5, 0.25, 0.125...). For any other distribution, Huffman's per-symbol integer-bit constraint wastes space. The redundancy is less than 1 bit per symbol, but it adds up, and it is worst for highly skewed sources: a symbol with probability 0.99 carries only ~0.015 bits of information yet still costs a full bit. For near-uniform sources, such as 30 symbols each with probability ~1/30, the waste is small (a few hundredths of a bit per symbol). Arithmetic coding's fractional-bit efficiency makes it ideal for contexts where probabilities are not powers of 2 — which is almost every real-world source.
```python
import math
import heapq

# Huffman simulation (expected code length via the merge identity:
# each merge adds one bit to every symbol in the two merged subtrees,
# so the expected length is the sum of all merged probabilities)
def huffman_expected_length(probs):
    heap = [p for p in probs if p > 0]
    heapq.heapify(heap)
    expected = 0.0
    while len(heap) > 1:
        p1 = heapq.heappop(heap)
        p2 = heapq.heappop(heap)
        expected += p1 + p2
        heapq.heappush(heap, p1 + p2)
    return expected

# Entropy calculation
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

probabilities = [0.99, 0.01]
huff_len = huffman_expected_length(probabilities)
ent = entropy(probabilities)
print(f"Huffman expected length: {huff_len:.3f} bits/symbol")
print(f"Entropy: {ent:.3f} bits/symbol")
print(f"Redundancy: {huff_len - ent:.3f} bits/symbol")
print(f"For 1000 symbols: Huffman {1000 * huff_len:.0f} bits, optimal {1000 * ent:.0f} bits")
print(f"Arithmetic coding can (in theory) get within ~0.001 bits/symbol of entropy")

# Multiple-symbol scenario
probs_multi = [0.4, 0.3, 0.2, 0.1]
huff_multi = huffman_expected_length(probs_multi)
ent_multi = entropy(probs_multi)
print(f"\n4 symbols (0.4,0.3,0.2,0.1): Huffman {huff_multi:.2f}, Entropy {ent_multi:.2f}")
print(f"Waste: {huff_multi - ent_multi:.2f} bits/symbol, {1000 * (huff_multi - ent_multi):.0f} bits per 1000 symbols")
```
Applications
JPEG2000: Replaces JPEG's Huffman with arithmetic coding for better quality at same bitrate. JPEG2000 uses EBCOT (Embedded Block Coding with Optimized Truncation) which works with arithmetic coding to achieve progressive decoding.
HEVC/H.265: CABAC (Context-Adaptive Binary Arithmetic Coding) — every modern video stream uses arithmetic coding. CABAC uses 399 different context models (for syntax elements like motion vectors and coefficients) with adaptive probability updates per symbol, and it is one of the reasons H.264/HEVC achieve roughly 50% bitrate reduction over MPEG-2 at the same quality.
LZMA (xz/7-zip): Combines LZ77 with range coding (an arithmetic-coding variant) for best-in-class ratios. LZMA's range coder is byte-oriented, renormalizing its 32-bit range register one byte at a time.
PPM (Prediction by Partial Matching): Combines context modelling with arithmetic coding — near-theoretical compression on text. PPM uses context length (order-n) to predict the next symbol's probability, feeding that to the arithmetic coder (a toy sketch follows after this list).
Modern codecs: AV1 (next-gen video) uses arithmetic coding with symbol-to-symbol probability adaptation. Daala (Xiph.org) used range coding. Zstandard (Facebook) uses ANS for better speed.
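As a toy illustration of the PPM idea (a hedged order-1 sketch with Laplace smoothing; real PPM adds escape symbols and blends deeper contexts), the model below accumulates the conditional counts that an arithmetic coder would consume as probabilities:

```python
# Hedged order-1 context model sketch: adaptive per-context counts give
# the probabilities an arithmetic coder would use for the next symbol.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(lambda: 1))  # Laplace-smoothed counts

def next_symbol_probs(prev: str) -> dict:
    """Probability of each seen symbol given the previous one (order-1 context)."""
    ctx = counts[prev]
    total = sum(ctx.values())
    return {s: c / total for s, c in ctx.items()}

text = 'abracadabra'
for prev, cur in zip(text, text[1:]):
    counts[prev][cur] += 1            # update the model as we encode

print(next_symbol_probs('a'))         # 'b' is the most likely successor of 'a'
```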
```c
// Simplified CABAC-style renormalization in integer arithmetic.
// From the H.264 standard: range stays in [2^8, 2^9) after renormalization
// (here scaled by an extra 2^16).
#include <stdint.h>

// Encoder state (as in CABAC)
typedef struct {
    uint32_t low;     // current interval low point (scaled integer)
    uint32_t range;   // current interval range (>= 256 << 16)
    uint8_t pending;  // number of pending bytes with an unresolved carry
} CabacEncoder;

// Renormalization: called when range < 256 << 16
void cabac_renorm(CabacEncoder *enc, void (*output)(uint8_t byte)) {
    while (enc->range < (256 << 16)) {
        if (enc->low < (1 << 24)) {
            // No carry to propagate
            output(enc->low >> 24);                  // output highest byte
            for (int i = 0; i < enc->pending; i++)
                output(0xFF);                        // emit pending 0xFFs
            enc->pending = 0;
        } else if (enc->low >= (3 << 24)) {          // 3 << 24 = 50331648
            output(((enc->low >> 24) + 1) & 0xFF);   // propagate carry
            for (int i = 0; i < enc->pending; i++)
                output(0x00);                        // pending bytes become 0x00
            enc->pending = 0;
        } else {
            // Unresolved carry: low in [1 << 24, 3 << 24)
            enc->pending++;
            enc->low -= (1 << 24);
        }
        enc->low <<= 8;
        enc->range <<= 8;
    }
}

// Probability update (LPS = less probable symbol, MPS = more probable symbol)
typedef struct {
    uint16_t probLPS;  // probability of LPS (0..32767, scaled to 2^15)
    uint8_t mps;       // more probable symbol value (0 or 1)
} CabacContext;

// Update the estimate after encoding a symbol
void cabac_update(CabacContext *ctx, uint8_t encoded_symbol) {
    if (encoded_symbol == ctx->mps) {
        // MPS occurred: decay probLPS by 31/32 (i.e. alpha ~= 1/32)
        ctx->probLPS -= ctx->probLPS >> 5;
    } else {
        // LPS occurred: move probLPS toward 0.5
        ctx->probLPS += (32768 - ctx->probLPS) >> 5;
        if (ctx->probLPS > 16384) {   // crossed 0.5: swap MPS and LPS
            ctx->mps ^= 1;
            ctx->probLPS = 32768 - ctx->probLPS;
        }
    }
}
```
| Aspect | Huffman Coding | Arithmetic Coding | ANS (Asymmetric Numeral Systems) |
|---|---|---|---|
| Symbol encoding | Variable-length codes, whole bits per symbol | Fractional bits per symbol — whole message as single number | Table-based state machine, fractional bits per symbol |
| Compression ratio near entropy | Within 1 bit of entropy per symbol | Within 0.001 bit per symbol, asymptotically perfect | Within 0.01 bit per symbol, near-entropy |
| Speed (software decode) | Very fast — table lookup per symbol | Slow — renormalization, branching, bit-level loops | Fast — table-driven, batch-friendly |
| Memory usage | Low — codebook plus frequency table | Low — only low/range register plus probabilities | Medium — lookup tables for decoding |
| Probability adaptation | Hard — need to rebuild codebook periodically | Easy — update probabilities per symbol | Possible but less common than arithmetic |
| Best-in-class usage | DEFLATE (gzip, PNG), JPEG, MP3 (Huffman part) | H.264/HEVC CABAC, JPEG2000, LZMA (range-coding variant) | Zstandard (Facebook), Apple's LZFSE |
| Patent status | Expired (all), free to use | Expired (core patents), free to use now | Open, unencumbered (Duda, 2009) |
| Ideal for | Small alphabets, speed-critical, simple probability distributions | Skewed distributions, adaptive coding, video/audio | High-speed general compression with near-entropy ratio |
🎯 Key Takeaways
- Arithmetic coding encodes entire message as single fraction — no per-symbol bit boundary.
- Achieves near-entropy compression — significantly better than Huffman for skewed distributions.
- Huffman wastes bits when symbol probabilities aren't powers of 1/2.
- Used in HEVC video (CABAC), JPEG2000, LZMA/xz, and modern compressors.
- ANS (Asymmetric Numeral Systems) is the modern practical variant — faster than traditional arithmetic coding.
⚠ Common Mistakes to Avoid
- Implementing with floating point: precision silently fails on long messages; use integer math with renormalization.
- Ignoring carry propagation: when low + range wraps past 2^32, the carry must ripple through already-buffered output (or use 64-bit intermediates).
- Letting encoder and decoder probability models drift apart: in adaptive coding, both sides must apply the same updates in the same order after every symbol.
Interview Questions on This Topic
- Q: Why does arithmetic coding outperform Huffman coding for skewed distributions? (Mid-level)
- Q: What is the renormalization step in practical arithmetic coders, and why is it necessary? (Senior)
- Q: Explain the connection between arithmetic coding and ANS (Asymmetric Numeral Systems). (Senior)
- Q: What is CABAC in video coding, and how does it use arithmetic coding with context adaptation? (Senior)
Frequently Asked Questions
Why isn't arithmetic coding used everywhere if it's better than Huffman?
Patent issues historically slowed adoption (IBM held key patents until the 2000s). Today, range coding (a patent-free variant) and ANS have replaced traditional arithmetic coding in new implementations. Huffman remains in DEFLATE/gzip for compatibility and speed.
Aren't there faster alternatives to arithmetic coding?
Yes — ANS (Asymmetric Numeral Systems) is 2-4x faster in software while maintaining a similar compression ratio. Facebook's Zstandard uses ANS; Apple's LZFSE also uses a variant. For hardware-accelerated video decoding, CABAC (H.264/HEVC) is implemented in silicon, so speed is not an issue. For general-purpose software compression, newer codecs such as Zstandard prefer ANS over traditional arithmetic coding for speed.
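To make the connection between arithmetic coding and ANS concrete, here is a minimal rANS sketch (our own toy: a small alphabet with frequencies summing to a power of two, an unbounded integer state, and no streaming renormalization, all simplifications over production ANS):

```python
# Toy rANS: like arithmetic coding, each symbol costs ~log2(1/p) bits,
# but the state is a single growing integer and decoding is LIFO.
freqs = {'a': 2, 'b': 1, 'c': 1}               # ~ P(a)=0.5, P(b)=P(c)=0.25
M = sum(freqs.values())                        # total = 4 (a power of two)
cum, acc = {}, 0
for s in sorted(freqs):
    cum[s] = acc
    acc += freqs[s]

def rans_encode(message: str) -> int:
    x = 1                                      # initial state
    for s in reversed(message):                # decoding pops LIFO, so encode reversed
        x = (x // freqs[s]) * M + cum[s] + (x % freqs[s])
    return x

def rans_decode(x: int, length: int) -> str:
    out = []
    for _ in range(length):
        slot = x % M                           # which sub-range the state falls in
        s = next(t for t in freqs if cum[t] <= slot < cum[t] + freqs[t])
        out.append(s)
        x = freqs[s] * (x // M) + slot - cum[s]
    return ''.join(out)

x = rans_encode('aab')
print(x, rans_decode(x, 3))                    # round-trips to 'aab'
```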
How does adaptive arithmetic coding work in practice?
Start with a frequency table initialised with uniform distribution or prior knowledge. After encoding a symbol, increment that symbol's count. Recompute cumulative frequencies after each symbol or after N symbols. In binary arithmetic coding (CABAC style), maintain a probability estimate (e.g., 12-bit integer for probLPS, range 0..4095). Update probLPS using a moving average: probLPS = probLPS - (probLPS >> N) for MPS occurrence, and probLPS = probLPS + ((full_range - probLPS) >> N) for LPS occurrence. This converges to actual symbol statistics without storing large tables.
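A short sketch of exactly that update rule (the 12-bit scale and shift N=5 are illustrative choices; the MPS/LPS swap shown in the C example earlier is omitted for brevity):

```python
# CABAC-style adaptive probability estimate via an exponential moving average.
FULL = 1 << 12                  # 12-bit probability scale (0..4095)
N = 5                           # adaptation shift: larger N adapts more slowly

def update(prob_lps: int, coded_mps: bool) -> int:
    """Move the LPS probability estimate after coding one binary symbol."""
    if coded_mps:
        return prob_lps - (prob_lps >> N)           # decay toward 0
    return prob_lps + ((FULL - prob_lps) >> N)      # grow toward FULL

prob = FULL // 2                # start from a uniform estimate
for bit in [1, 1, 1, 1, 0, 1, 1, 1]:                # MPS = 1 in this stream
    prob = update(prob, coded_mps=(bit == 1))
print(prob / FULL)              # drifts toward the observed LPS frequency
```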