
Arithmetic Coding — Why Frame 500 Decodes Wrong

📍 Part of: Data Compression → Topic 4 of 4
At 100 symbols, floating-point range hits 2^-100 and underflows.
🔥 Advanced — solid DSA foundation required
In this tutorial, you'll learn
  • Arithmetic coding encodes entire message as single fraction — no per-symbol bit boundary.
  • Achieves near-entropy compression — significantly better than Huffman for skewed distributions.
  • Huffman wastes bits when symbol probabilities aren't powers of 1/2.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Arithmetic coding encodes entire message as a single fraction in [0,1) — no per-symbol bit boundaries
  • How it works: recursively subdivide interval by symbol probabilities; final interval uniquely identifies message
  • Performance: For P(a)=0.99, P(b)=0.01, Huffman uses 1 bit/symbol, entropy is 0.081 bits/symbol — arithmetic uses ~8 bits per 100 symbols, Huffman uses 100 bits (12x better)
  • Production reality: Used in H.264/H.265 (CABAC), JPEG2000, LZMA (range coding variant)
  • Biggest mistake: Implementing with floating point — precision fails on long messages. Real coders use integer math with renormalization.
🚨 START HERE

Arithmetic Coding Debug Cheat Sheet

Fast diagnostics for compression implementation issues. Run these commands at the first sign of coding failure.
🟡

Decoder diverges after N symbols

Immediate Action: Check precision — log the interval range after each symbol
Commands
python -c "import sys; print('float precision:', sys.float_info.dig)"
echo 'For integer math, check range min: if range < 256: shift'
Fix Now: Switch to a 32-bit integer range coder with renormalization. Use `range = (high - low) + 1` in the integer domain. Output bits when range < 2^24.
🟡

Compression ratio lower than expected

Immediate Action: Measure the entropy of your source and compare it to the actual output size
Commands
echo -n "message" | entropy -b
wc -c original.bin compressed.bin
Fix Now: If the compressed size is more than 10% above entropy, the probability model is inaccurate. Use adaptive probabilities: update symbol counts after each coded symbol (or every N symbols for speed).
🟡

Output differs between platforms (x86 vs ARM)

Immediate Action: Convert all floating-point to fixed-point integers
Commands
grep -r 'float\|double' src/
grep -r '\/ 2\^' src/; grep 'LIMIT' src/
Fix Now: Replace probability fractions (e.g., 0.75) with integers scaled to 2^15: `prob = 24576; total = 32768`. Compute sub-ranges with an integer multiply and shift, e.g. `(range >> 15) * prob`, instead of a floating-point multiply.
🟡

Carry overflow — low + range > 2^32

Immediate Action: Detect when low + range crosses the maximum value
Commands
grep -n 'low.*range' src/rangecoder.c
echo 'carry propagation usually requires a buffer'
Fix Now: Check whether the top byte of low is 0xFF (`(low >> 24) == 0xFF`); such a byte cannot be written out yet because a later carry could still bump it. Implement a carry buffer: hold the last unwritten byte plus a count of pending 0xFF bytes, and increment them when a carry occurs (as LZMA's range coder does).
🟠

Renormalization loops too many times (slow)

Immediate Action: Check the threshold — renormalize when range < (1 << 24), not (1 << 8)
Commands
grep -n 'while (range < ' src/
perf stat -e cycles ./encoder largefile
Fix Now: Increase the renormalization threshold from 2^8 to 2^24 and output a whole byte per 8-bit shift instead of a single bit per shift. Byte-oriented range coders are 4x faster than bit loops.
Production Incident

The Floating Point Precision Bug That Corrupted Video Frames

An H.264 encoder prototype used floating-point arithmetic coding. After 1000 frames, precision loss desynchronized encoder and decoder — random macroblocks appeared. The fix required 32-bit integer renormalization.
Symptom: The first 50 frames decoded perfectly. By frame 500, blocky artifacts appeared in random locations. By frame 1000, the decoder completely lost sync and output garbage. The same bitstream decoded differently on x86 vs ARM due to floating-point rounding differences.
Assumption: The team assumed double precision (53-bit mantissa) was sufficient for any message length. They didn't realise that each symbol multiplies the interval range, which narrows exponentially. After 1000 symbols with an average range-reduction factor of 0.5, the interval width is 2^-1000 — far below double's precision.
Root cause: Floating-point arithmetic coding keeps the low and high bounds as floating-point numbers. Each symbol multiplies the range by a probability factor. After enough symbols, the range becomes so small that adding a probability-scaled offset to low no longer changes the value — underflow. Double precision has a 53-bit mantissa (~15-16 decimal digits). After 100 steps with 0.5 probability, the range is 2^-100, and further updates to low fall below its representable spacing. The encoder and decoder diverged because floating-point rounding is not deterministic across hardware or compilers.
Fix: Replaced floating-point with integer arithmetic using a 32-bit register scaled to the range [0, 2^31 - 1]. Added renormalization: when the range drops below 2^24, scale both low and range by 2 (left shift) and output a bit. This maintains precision indefinitely; CABAC in H.264 is built on the same renormalisation idea. After the fix, the encoder produced deterministic output across all platforms and decoded correctly for 1 million+ frames.
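To see the failure mode concretely, here is a minimal sketch (illustrative, not the team's code): low sits somewhere in [0,1), each symbol halves the interval width, and once the width falls below the float spacing around low, the coder stops making progress:

# Minimal sketch of the underflow: 'low' is a double, each symbol halves the interval width.
low, width = 0.6, 0.4          # pretend the first symbol left the interval at [0.6, 1.0)
symbols = 1
while low + width / 2 != low:  # can the next halving still move 'low'?
    width /= 2
    symbols += 1
print(f"after ~{symbols} symbols the update no longer changes low (width = {width:.3g})")
# With 64-bit doubles this happens after roughly 50-55 symbols, matching the 53-bit mantissa.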
Key Lesson
  • Never implement arithmetic coding with floating-point for production. Precision loss causes desynchronization after enough symbols.
  • Use integer arithmetic with a fixed-width register (32-bit or 64-bit) and renormalization (scaling up when the range gets too small).
  • Renormalization is the key trick: output bits as you shift, not at the end. This keeps precision bounded.
  • Test determinism across platforms. Floating-point rounding differs between x86, ARM, and compilers. Integer arithmetic is identical everywhere.
Production Debug Guide

Symptom → Action mapping for compression implementation failures

Decoder produces garbage after many symbols — earlier symbols decode correctly → Precision loss. Check whether the code uses floating-point. Replace with integer arithmetic (32-bit uint) plus renormalization. Add a range check: when range < 2^24, shift left until range >= 2^24 and output bits.
Different outputs on x86 vs ARM for the same input → Floating-point non-determinism. Your encoder or decoder depends on floating-point rounding (likely probability updates or interval arithmetic). Replace all floats with fixed-point integers scaled to 2^15 or 2^20.
Compression ratio worse than Huffman for large files → Probability model may be inaccurate or adaptive probabilities not updating. Implement context modelling: use different probability tables for different contexts (already seen symbols, position, etc.). CABAC uses up to 399 context models.
Decoded file mismatches original — data loss → Range coder may have a carry overflow bug. In integer range coders, when low + range wraps around 2^32, you need carry tracking. Implement a range coder with carry propagation or use a 64-bit intermediate.
Slow encoding/decoding (MB/s range, not GB/s) → Bit-by-bit renormalisation is slow. Use byte-oriented range coding: output whole bytes when range drops below 2^24. This reduces branching. For high speed, use ANS (Asymmetric Numeral Systems) instead — up to 4x faster in software.

Huffman coding assigns whole bits to symbols. But what if a symbol has probability 0.99 — it deserves about 0.015 bits, not 1? This mismatch between symbol probability and assigned bit length is the redundancy that arithmetic coding eliminates. For a binary message that is 99% one symbol, arithmetic coding compresses roughly 12x better than Huffman.
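As a quick sanity check on that "0.015 bits" figure, the ideal code length for a symbol with probability p is -log2(p):

import math

# Ideal (entropy) code length for a symbol with probability 0.99
print(f"{-math.log2(0.99):.4f} bits")   # prints 0.0145 bits, vs the 1 full bit Huffman must spend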

Arithmetic coding is now in every modern video codec. H.264 uses CABAC (Context-Adaptive Binary Arithmetic Coding) — one reason it delivers better quality than MPEG-2 at the same bitrate. H.265/HEVC and AV1 also use arithmetic coding at their core. Every Netflix stream, YouTube video, and Zoom call is decoded using arithmetic coding. The patents that delayed adoption for decades have expired, and it is now the standard approach.

By the end you'll understand why Huffman fails for skewed distributions, how arithmetic coding recursively divides the number line, the renormalization trick that makes integer implementation possible, and how CABAC uses context to adapt probabilities on the fly.

The Core Idea

Arithmetic coding treats the message as a single number in the interval [0,1). The encoder recursively divides this interval according to symbol probabilities. After processing the entire message, any number in the final interval uniquely identifies it. The decoder works backwards: given the code, it finds which symbol's interval contains it, updates the interval, and repeats.

For a message "aab" with probabilities P(a)=0.5, P(b)=0.25, P(c)=0.25: - Start with [0,1). For 'a' (0 to 0.5), new interval [0, 0.5). - For second 'a' (0 to 0.5), subdivide [0,0.5) into [0,0.25), [0.25,0.375), [0.375,0.5). Second 'a' yields [0,0.25). - For 'b' in last symbol: within [0,0.25), 'b' covers [0.125,0.1875). - Final code = 0.15625 (midpoint). The decoder reverse-engineers this.

io/thecodeforge/compression/arithmetic_coding.py · PYTHON
from fractions import Fraction

def arithmetic_encode(message: str, probs: dict) -> Fraction:
    """Encode message as a fraction in [0,1)."""
    # Build cumulative probability table
    symbols = sorted(probs)
    cum = {}
    total = Fraction(0)
    for s in symbols:
        cum[s] = total
        total += Fraction(probs[s]).limit_denominator(1000)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        range_size = high - low
        high = low + range_size * (cum[symbol] + Fraction(probs[symbol]).limit_denominator(1000))
        low  = low + range_size * cum[symbol]
    return (low + high) / 2  # midpoint of final interval

def arithmetic_decode(code: Fraction, probs: dict, length: int) -> str:
    symbols = sorted(probs)
    cum = {}
    total = Fraction(0)
    for s in symbols:
        cum[s] = total
        total += Fraction(probs[s]).limit_denominator(1000)
    message = ''
    for _ in range(length):
        for s in reversed(symbols):
            if code >= cum[s]:
                message += s
                p = Fraction(probs[s]).limit_denominator(1000)
                code = (code - cum[s]) / p
                break
    return message

probs = {'a': 0.5, 'b': 0.25, 'c': 0.25}
msg = 'aab'
code = arithmetic_encode(msg, probs)
print(f'Encoded {msg!r} → {float(code):.6f}')
print(f'Decoded → {arithmetic_decode(code, probs, len(msg))!r}')
Mental Model: The Number Line Waterfall
Arithmetic coding is like repeatedly zooming into a number line. Each symbol shrinks your view to a smaller subinterval. The final tiny interval is the entire message.
  • Start with the unit interval [0, 1). Each symbol has a fraction of this line proportional to its probability.
  • For each symbol in the message, narrow the current interval to that symbol's sub-interval, keeping the same proportion.
  • After the last symbol, pick any number in the final interval — conventionally the midpoint.
  • Decoding: given the code number, find which symbol's interval contains it, then 'zoom out' by that symbol's probability to prepare for the next symbol.
  • The longer the message, the more precise the number needed. This is why renormalisation (scaling up) is required for finite precision.
📊 Production Insight
Floating-point arithmetic coding fails after enough symbols because interval width drops below representable precision.
The renormalisation trick: when range < threshold, multiply range and low by 2 and output a bit, keeping precision bounded.
Rule: Never use floating-point for arithmetic coding in production. Integer with renormalisation is the only safe path.
🎯 Key Takeaway
Arithmetic coding encodes the entire message as a single fraction. No per-symbol bit boundaries.
The longer the message, the more precision needed — solved by integer renormalisation.
Rule: For large messages, integer range coder (32-bit) is the standard, not floating-point or infinite precision.
Arithmetic Coding Implementation Path
If: Symbol probabilities are static and known in advance
Use: Static arithmetic coding with a precomputed cumulative frequency table. Context: offline compression (archiving, dataset storage).
If: Probabilities change over time (video, audio, adaptive compression)
Use: Adaptive arithmetic coding. Update symbol counts after each symbol (or after N symbols). CABAC in video codecs uses this.
If: Maximum message length < 100 symbols, high speed not required
Use: A floating-point implementation may work in controlled environments (research, prototyping). In production, still prefer integer for determinism.
If: Messages are long (>1000 symbols), deterministic output required
Use: Integer range coder with a 32-bit register, renormalization when range < 2^24. Byte-oriented output for speed.
If: Maximum speed required (GB/s), near-entropy compression needed
Use: ANS (Asymmetric Numeral Systems) instead of arithmetic coding. ANS is 2-4x faster and used in Zstandard, Facebook's compression.

Why Arithmetic Beats Huffman

For symbol a with probability 0.99 and b with 0.01
  • Huffman must assign at least 1 bit to each symbol: a→0, b→1
  • For a 100-symbol message that is 99 a's, Huffman uses 100 bits
  • Entropy: H = -0.99·log2(0.99) - 0.01·log2(0.01) ≈ 0.081 bits/symbol
  • Arithmetic coding: 100 × 0.081 ≈ 8 bits for the same message
  • Arithmetic is ~12x better here

Huffman only matches the entropy when all probabilities are exact negative powers of 2 (0.5, 0.25, 0.125...). For any other distribution, Huffman's per-symbol integer-bit constraint wastes space. The redundancy is at most 1 bit per symbol, but it adds up: for a source with 30 roughly equiprobable symbols the waste is only a few hundredths of a bit per symbol, while for highly skewed sources (one dominant symbol) the waste approaches a full bit per symbol, which can be many times the entropy itself. Arithmetic coding's fractional-bit efficiency makes it ideal wherever probabilities are not powers of 2 — which is almost every real-world source.

io/thecodeforge/compression/huffman_vs_arithmetic.py · PYTHON
import math
import heapq
from fractions import Fraction

# Huffman simulation (code length calculation)
def huffman_expected_length(probs):
    """Expected bits/symbol of a Huffman code for the given probability list."""
    # Heap entries: (probability, tie-breaker, list of [symbol index, code bits]).
    heap = [(p, i, [[i, []]]) for i, p in enumerate(probs) if p > 0]
    heapq.heapify(heap)
    counter = len(probs)                     # unique tie-breaker for merged nodes
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)  # merge the two least probable nodes
        p2, _, group2 = heapq.heappop(heap)
        for _, code in group1:               # extend every code in each group by one bit
            code.append('0')
        for _, code in group2:
            code.append('1')
        heapq.heappush(heap, (p1 + p2, counter, group1 + group2))
        counter += 1
    lengths = {i: len(code) for i, code in heap[0][2]}
    return sum(p * lengths[i] for i, p in enumerate(probs) if p > 0)

# Entropy calculation
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

probabilities = [0.99, 0.01]
huff_len = huffman_expected_length(probabilities)
ent = entropy(probabilities)
print(f"Huffman expected length: {huff_len:.3f} bits/symbol")
print(f"Entropy: {ent:.3f} bits/symbol")
print(f"Redundancy: {huff_len - ent:.3f} bits/symbol")
print(f"For 1000 symbols: Huffman {1000 * huff_len} bits, optimal {1000*ent:.0f} bits")
print(f"Arithmetic coding (theoretically) can get within 0.001 bits/symbol of entropy")

# Multiple symbols scenario
probs_multi = [0.4, 0.3, 0.2, 0.1]
huff_multi = huffman_expected_length(probs_multi)
ent_multi = entropy(probs_multi)
print(f"\n4 symbols (0.4,0.3,0.2,0.1): Huffman {huff_multi:.2f}, Entropy {ent_multi:.2f}")
print(f"Waste: {huff_multi - ent_multi:.2f} bits/symbol, {1000 * (huff_multi - ent_multi):.0f} bits per KB")
🔥Practical Implementation
Real arithmetic coders use integer arithmetic with renormalisation (scaling up intervals as they narrow below a threshold) to avoid infinite precision. This is called 'carry-less' or 'range coding'. ANS (Asymmetric Numeral Systems) by Duda (2009) achieves similar compression with faster throughput — used in Zstandard and modern codecs.
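To make renormalisation concrete, here is a compact sketch of a textbook-style (Witten-Neal-Cleary) integer coder in Python: a fixed-width interval, bit-at-a-time output, pending bits for the straddling case, and an explicit flush. It is illustrative rather than a production byte-oriented range coder, and the symbol counts below are assumptions for the demo.

# Sketch of integer arithmetic coding with renormalization (Witten-Neal-Cleary style).
PRECISION = 32
FULL    = 1 << PRECISION
HALF    = FULL >> 1
QUARTER = FULL >> 2

def build_cum(counts):
    """Cumulative count ranges per symbol, e.g. {'a': (0, 2), 'b': (2, 3)}, plus the total."""
    cum, total = {}, 0
    for s in sorted(counts):
        cum[s] = (total, total + counts[s])
        total += counts[s]
    return cum, total

def encode(message, counts):
    cum, total = build_cum(counts)
    low, high, pending, bits = 0, FULL - 1, 0, []

    def emit(bit):
        nonlocal pending
        bits.append(bit)
        bits.extend([1 - bit] * pending)        # resolve bits deferred while straddling
        pending = 0

    for s in message:
        span = high - low + 1
        c_lo, c_hi = cum[s]
        high = low + span * c_hi // total - 1   # narrow to the symbol's sub-interval
        low  = low + span * c_lo // total
        while True:                             # renormalize: keep the interval wide
            if high < HALF:                     # entirely in lower half: emit 0
                emit(0)
            elif low >= HALF:                   # entirely in upper half: emit 1
                emit(1); low -= HALF; high -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                pending += 1                    # straddles the midpoint: defer the bit
                low -= QUARTER; high -= QUARTER
            else:
                break
            low, high = 2 * low, 2 * high + 1
    pending += 1                                # flush: pin down the final interval
    emit(0 if low < QUARTER else 1)
    return bits

def decode(bits, counts, length):
    cum, total = build_cum(counts)
    stream = iter(bits)
    read = lambda: next(stream, 0)              # pad with zeros once real bits run out
    value = 0
    for _ in range(PRECISION):
        value = (value << 1) | read()
    low, high, out = 0, FULL - 1, []
    for _ in range(length):
        span = high - low + 1
        target = ((value - low + 1) * total - 1) // span
        s = next(sym for sym, (a, b) in cum.items() if a <= target < b)
        out.append(s)
        c_lo, c_hi = cum[s]
        high = low + span * c_hi // total - 1
        low  = low + span * c_lo // total
        while True:                             # mirror the encoder's renormalization
            if high < HALF:
                pass
            elif low >= HALF:
                low -= HALF; high -= HALF; value -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                low -= QUARTER; high -= QUARTER; value -= QUARTER
            else:
                break
            low, high = 2 * low, 2 * high + 1
            value = (value << 1) | read()
    return ''.join(out)

counts = {'a': 2, 'b': 1, 'c': 1}               # same 0.5 / 0.25 / 0.25 model as above
encoded = encode('aab', counts)
print(encoded)                                  # the whole message as one short bit string
print(decode(encoded, counts, length=3))        # prints 'aab'

Note the flush at the end of encoding: one extra disambiguating bit (plus any pending bits) must be emitted after the last symbol, which is exactly the "forgetting to flush" mistake listed later.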
📊 Production Insight
For real-world text, Huffman's redundancy is 5-10% extra bits compared to entropy. Arithmetic coding closes that gap.
For skewed distributions (binary with p=0.99), Huffman's output is roughly 12x larger than the entropy limit. That's the difference between 100 KB and 8 KB for the same data.
Rule: If your symbol probabilities are not powers of 1/2, arithmetic coding (or ANS) will outperform Huffman. The gain is proportional to how skewed the distribution is.
🎯 Key Takeaway
Huffman's per-symbol integer-bit constraint wastes space when probabilities aren't powers of 1/2.
Arithmetic coding's fractional bits eliminate that waste, achieving near-entropy compression.
Rule: For skewed distributions (binary, run-length, one dominant symbol), arithmetic coding is dramatically better — often 10x+ improvement.
Choosing Between Huffman and Arithmetic
If: Symbol probabilities are close to powers of 1/2 (0.5, 0.25, 0.125, etc.)
Use: Huffman is near-optimal and simpler. Example: binary codes with equal probability, symbol counts a power of two.
If: Distribution is highly skewed (one symbol dominates >90%)
Use: Arithmetic coding wins dramatically (10x+ better compression). Use for binary data with run-length patterns.
If: Maximum speed needed, compression ratio secondary
Use: Huffman can be faster (table lookup). Use for low-latency streaming where a few % ratio loss is acceptable.
If: Probability model changes over time (adaptive coding)
Use: Arithmetic coding handles adaptive probability updates more naturally. Used in CABAC where context probabilities evolve per symbol.
If: You need a patent-free algorithm for high compression
Use: ANS (Asymmetric Numeral Systems) — modern, fast, near-entropy, and patent-unencumbered. Used in Zstandard (Facebook).

Applications

JPEG2000: Replaces JPEG's Huffman with arithmetic coding for better quality at same bitrate. JPEG2000 uses EBCOT (Embedded Block Coding with Optimized Truncation) which works with arithmetic coding to achieve progressive decoding.

HEVC/H.265: CABAC (Context-Adaptive Binary Arithmetic Coding) — every modern video stream uses arithmetic coding. CABAC uses up to 399 context models (for syntax elements like motion vectors, coefficients) with adaptive probability updates per symbol. This adaptive entropy stage is part of why H.264 achieves roughly 50% bitrate reduction over MPEG-2 for the same quality, with HEVC roughly halving the bitrate again.

LZMA (xz/7-zip): Combines LZ77 with range coding (an arithmetic coding variant) for best-in-class ratios. LZMA's range coder is byte-oriented, keeping a 32-bit range register and renormalising a byte at a time.

PPM (Prediction by Partial Matching): Combines context modelling with arithmetic coding — near-theoretical compression on text. PPM uses context length (order-n) to predict next symbol probability, feeding that to arithmetic coding.

Modern codecs: AV1 (next-gen video) uses arithmetic coding with symbol-to-symbol probability adaptation. Daala (Xiph.org) used range coding. Zstandard (Facebook) uses ANS for better speed.

io/thecodeforge/compression/cabac_renorm.c · C
// Simplified CABAC-style renormalization in integer arithmetic
// From the H.264 standard: range is kept in [2^8, 2^9) after renormalization

#include <stdint.h>

// Encoder state (as in CABAC)
typedef struct {
    uint32_t low;               // current interval low point (scaled integer, 10 bits)
    uint32_t range;             // current interval width, kept in [256, 512)
    uint32_t bits_outstanding;  // bits whose value depends on a possible future carry
} CabacEncoder;

// Emit one bit, then resolve any outstanding bits as its complement
static void put_bit(CabacEncoder *enc, int b, void (*output)(int bit)) {
    output(b);
    while (enc->bits_outstanding > 0) {
        output(1 - b);
        enc->bits_outstanding--;
    }
}

// Renormalisation: called after each symbol while range < 256
void cabac_renorm(CabacEncoder *enc, void (*output)(int bit)) {
    while (enc->range < 256) {
        if (enc->low < 256) {            // interval entirely in the lower half: emit 0
            put_bit(enc, 0, output);
        } else if (enc->low >= 512) {    // entirely in the upper half: emit 1
            enc->low -= 512;
            put_bit(enc, 1, output);
        } else {                         // straddles the midpoint: defer the bit
            enc->low -= 256;
            enc->bits_outstanding++;
        }
        enc->low <<= 1;                  // double low and range (zoom back out)
        enc->range <<= 1;
    }
}

// Probability update (LPS = less probable symbol, MPS = more probable symbol)
typedef struct {
    uint16_t probLPS;  // Probability of LPS (0..32767, scaled to 2^15)
    uint8_t  mps;      // More probable symbol value (0 or 1)
} CabacContext;

// Update probability after encoding a symbol
void cabac_update(CabacContext *ctx, uint8_t encoded_symbol) {
    if (encoded_symbol == ctx->mps) {
        // MPS occurred: decay probLPS, i.e. probLPS *= 31/32 (alpha ~ 0.03)
        ctx->probLPS -= ((ctx->probLPS) >> 5);
    } else {
        // LPS occurred: raise probLPS; the swap below keeps it <= 0.5 (16384)
        ctx->probLPS += ((32768 - ctx->probLPS) >> 5);
        if (ctx->probLPS > 16384) {  // > 0.5
            ctx->mps ^= 1;  // swap MPS/LPS
            ctx->probLPS = 32768 - ctx->probLPS;
        }
    }
}
💡CABAC's Secret: Context Adaptation
CABAC doesn't use fixed probabilities. It maintains 399 different probability models for different syntax elements (motion vectors, residuals, etc.). Each model updates after coding a symbol, so probabilities adapt to the actual data. This is why CABAC outperforms simpler arithmetic coding with static probabilities.
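As a toy illustration of why context adaptation pays off (a sketch, not CABAC's actual state tables), keep one adaptive estimate per context, with the previous bit serving as the context:

# Toy context modelling: one adaptive probability estimate per context (the previous bit).
# The update mirrors the exponential decay used in cabac_update above (shift by 5 ~ 1/32).
import random

FULL, SHIFT = 32768, 5
prob_one = {0: FULL // 2, 1: FULL // 2}     # per-context estimate of P(next bit = 1)
random.seed(0)
prev = 0
for _ in range(5000):
    bit = prev if random.random() < 0.9 else 1 - prev   # source with memory: bits tend to repeat
    est = prob_one[prev]
    prob_one[prev] = est + ((FULL - est) >> SHIFT) if bit else est - (est >> SHIFT)
    prev = bit

print({ctx: round(p / FULL, 2) for ctx, p in prob_one.items()})
# roughly {0: 0.1, 1: 0.9}: each context learns its own statistics, whereas a single
# context-free estimate would sit uselessly near 0.5 for this source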
📊 Production Insight
JPEG2000 never displaced baseline JPEG in mainstream imaging: arithmetic coding carried patent uncertainty at the time, and simpler, faster alternatives won on speed.
H.264 succeeded because the compression gain (50% bitrate reduction) outweighed the added complexity of CABAC.
Rule: For interactive applications (video chat), speed matters. Arithmetic coding can be 2-5x slower than Huffman, so use ANS for a better trade-off.
🎯 Key Takeaway
Arithmetic coding powers H.264/H.265 video codecs (CABAC), JPEG2000, and LZMA.
The core patents have expired, but speed remains a trade-off — ANS (Zstandard) is faster for software decoding.
Rule: Use arithmetic coding when compression ratio is critical (video, archival). Use ANS or Huffman when speed is the priority.
Deployment Scenario — Which Coding to Use
If: Image compression, quality > speed, patent concerns not relevant
Use: JPEG2000 with arithmetic coding. Use for archival, medical imaging, and satellite imagery.
If: Video compression, high compression ratio required, hardware acceleration available
Use: CABAC (arithmetic coding) in H.265/HEVC. All video hardware has fast CABAC decoders.
If: General-purpose compression (file archiving, network packets)
Use: Zstandard (ANS) or LZMA (range coding) — good compression, fast decode, patent-free.
If: Bounded memory, embedded system with 8-bit CPU
Use: Static Huffman is simpler. Arithmetic coding's renormalisation overhead may be too high.
If: Real-time streaming with latency < 1ms
Use: Huffman or ANS (Zstandard's fast mode). Full arithmetic coding adds ~4-8 cycles per bit, which is too much.
🗂 Huffman Coding vs Arithmetic Coding vs ANS
Trade-offs in compression ratio, speed, and complexity — choose based on your workload
Aspect · Huffman Coding · Arithmetic Coding · ANS (Asymmetric Numeral Systems)
Symbol encoding · Huffman: variable-length codes, whole bits per symbol · Arithmetic: fractional bits per symbol — whole message as a single number · ANS: table-based state machine, fractional bits per symbol
Compression ratio near entropy · Huffman: within 1 bit of entropy per symbol · Arithmetic: within 0.001 bit per symbol, asymptotically perfect · ANS: within 0.01 bit per symbol, near-entropy
Speed (software decode) · Huffman: very fast — table lookup per symbol · Arithmetic: slow — renormalisation, branching, bit-level loops · ANS: fast — table-driven, batch-friendly
Memory usage · Huffman: low — codebook plus frequency table · Arithmetic: low — only low/range register plus probabilities · ANS: medium — lookup tables for decoding
Probability adaptation · Huffman: hard — need to rebuild the codebook periodically · Arithmetic: easy — update probabilities per symbol · ANS: possible but less common than arithmetic
Best-in-class usage · Huffman: DEFLATE (gzip, PNG), JPEG, MP3 (Huffman part) · Arithmetic: H.264/HEVC CABAC, JPEG2000, LZMA range coding variant · ANS: Zstandard, Facebook's compression, Apple's LZFSE
Patent status · Huffman: expired, free to use · Arithmetic: core patents expired, free to use now · ANS: open, unencumbered (Duda, 2009)
Ideal for · Huffman: small alphabets, speed-critical, simple probability distributions · Arithmetic: skewed distributions, adaptive coding, video/audio · ANS: high-speed general compression with near-entropy ratio

🎯 Key Takeaways

  • Arithmetic coding encodes entire message as single fraction — no per-symbol bit boundary.
  • Achieves near-entropy compression — significantly better than Huffman for skewed distributions.
  • Huffman wastes bits when symbol probabilities aren't powers of 1/2.
  • Used in HEVC video (CABAC), JPEG2000, LZMA/xz, and modern compressors.
  • ANS (Asymmetric Numeral Systems) is the modern practical variant — faster than traditional arithmetic coding.

⚠ Common Mistakes to Avoid

    Implementing arithmetic coding with floating-point arithmetic
    Symptom

    Encoder and decoder desynchronise after enough symbols. Output becomes garbage. Different results on x86 vs ARM.

    Fix

    Use integer arithmetic with 32-bit registers. Keep range as integer scaled to (1 << 31). Implement renormalisation: when range < (1 << 24), shift left and output bits. Never use floating-point for production arithmetic coding.

    Using static probabilities for non-stationary data
    Symptom

    Compression ratio good at start but degrades over time. The data distribution changes, but probabilities remain fixed.

    Fix

    Use adaptive probability updates. Start with uniform distribution (or some prior) and update symbol counts after each coded symbol. For binary arithmetic coding, use moving average (e.g., probLPS ← probLPS * 31/32 + (1 - probLPS) / 32).

    Forgetting to flush the encoder after final symbol
    Symptom

    Decoder cannot finish — last bits missing. Code may be missing final few bytes, causing incomplete decode.

    Fix

    After last symbol, output remaining bits in low register. Flush by writing low >> 24 and remaining pending bytes. LZMA does 5-byte flush, CABAC writes up to 4 extra bytes.

    Using too many context models (overfitting)
    Symptom

    Compression ratio suffers because models don't get enough data to converge. Memory usage skyrockets without benefit.

    Fix

    Use fewer contexts with shared probability tables. CABAC balances 399 models — large but carefully tuned. For custom code, start with 1-4 contexts and add only when you measure improvement.

    Decoding without validating final interval containment
    Symptom

    Decoder may accept invalid codes that decode to wrong message or crash on out-of-bounds access.

    Fix

    After decoding all symbols, ensure that the final code lies within [low, high) of the final interval. If not, signal error (malformed bitstream). Always use bounds checking in decoder.

Interview Questions on This Topic

  • Q: Why does arithmetic coding outperform Huffman coding for skewed distributions? (Mid-level)
    Huffman coding assigns whole bits per symbol. For a symbol with probability 0.99, entropy is 0.0144 bits, but Huffman assigns at least 1 bit — about 70x waste. Arithmetic coding encodes the entire message as a single fraction, not per-symbol bits. It can allocate fractional bits across the whole message. For 100 symbols (99 of symbol a, 1 of b), Huffman uses 100 bits (1 bit per symbol). Arithmetic coding uses -log2(0.99^99 * 0.01) ≈ 8 bits total — a 12x improvement. As the distribution becomes more skewed, arithmetic coding's advantage grows. When probability of a is 0.999, Huffman still uses 1 bit per symbol, arithmetic uses -log2(0.999) ≈ 0.00144 bits per symbol — 700x improvement. The theoretical limit is the Shannon entropy, and arithmetic coding approaches it arbitrarily closely; Huffman's integer-bit constraint creates the gap.
  • Q: What is the renormalisation step in practical arithmetic coders, and why is it necessary? (Senior)
    Renormalisation solves the precision problem. With floating-point, after many symbols, the interval [low, high) becomes so tiny that adding a probability to low no longer changes the value — underflow. Renormalisation in integer coders: when the range drops below a threshold, you scale both low and range up (left shift) and output the leading bits or bytes of low that are now fixed and cannot be changed by future symbols. This keeps the range above the threshold and low within the register width. In H.264's CABAC, the range register is only 9 bits: renormalisation runs whenever range < 256, shifting range and low left one bit at a time and emitting a bit (or recording an 'outstanding' bit while a carry is still possible). Byte-oriented range coders such as LZMA's instead renormalise when range < 2^24 and output a whole byte per 8-bit shift; variants that avoid carry propagation entirely are called 'carry-less' range coders. Without renormalisation, you'd need infinite-precision numbers. With it, a 32-bit register is sufficient for gigabytes of data.
  • Q: Explain the connection between arithmetic coding and ANS (Asymmetric Numeral Systems). (Senior)
    ANS by Jarek Duda (2009) is a modern alternative to arithmetic coding that achieves near-entropy compression but with significantly higher speed. Arithmetic coding treats the message as a fraction in [0,1). ANS treats the message as a single integer and uses a base-2 number system with operations that approximate multiplication by probabilities. There are two variants: rANS (range ANS) which uses a 32-bit state and table lookups, and tANS (table ANS) which uses finite state machines and is even faster but requires precomputed tables. ANS is used in Zstandard (Facebook's compression algorithm) and in Apple's LZFSE. The main advantage: ANS processes bytes (not bits) and has lower branching, making it 2-4x faster in software than arithmetic coding. The downside: ANS is more complex to implement correctly, especially with probability adaptation. For fixed distributions, tANS can be extremely fast (approaching Huffman speed) while maintaining arithmetic coding's compression ratio. The meta insight: ANS replaces arithmetic coding's interval subdivision with a state machine that encodes the same information, but more computationally efficiently.
  • Q: What is CABAC in video coding, and how does it use arithmetic coding with context adaptation? (Senior)
    CABAC (Context-Adaptive Binary Arithmetic Coding) is the entropy coding scheme in H.264/HEVC. It has three steps: binarisation (convert non-binary syntax elements to binary strings), context modelling (select a probability model for each bit), and binary arithmetic coding (encode bits using that model). CABAC uses up to 399 different context models. Each model tracks which value is the less probable symbol (LPS) and an estimate of its probability; the standard packs this estimate into a small probability-state index, but conceptually it behaves like the scaled integer in the simplified code above. After encoding a symbol, the model adapts: if the symbol matches the MPS (more probable symbol), the LPS probability decays (in the simplified model, probLPS is decreased by probLPS >> 5, a multiply by 31/32); if the LPS occurred, probLPS is increased, and if it would exceed 0.5 the LPS and MPS swap roles. This adaptive update adjusts probabilities to the actual data statistics without needing to signal new tables. The arithmetic coding itself uses range and low registers with renormalisation that emits bits as the interval narrows. This adaptive, near-optimal entropy coding is part of why H.264 achieves roughly a 50% bitrate reduction over MPEG-2 at the same quality.

Frequently Asked Questions

Why isn't arithmetic coding used everywhere if it's better than Huffman?

Patent issues historically slowed adoption (IBM held key patents until the 2000s). Today, range coding (a patent-free variant) and ANS have replaced traditional arithmetic coding in new implementations. Huffman remains in DEFLATE/gzip for compatibility and speed.

Aren't there faster alternatives to arithmetic coding?

Yes, ANS (Asymmetric Numeral Systems) is 2-4x faster in software while maintaining similar compression ratio. Facebook's Zstandard uses ANS; Apple's LZFSE also uses a variant. For hardware-accelerated video decoding, CABAC (H.264/HEVC) is implemented in silicon, so speed is not an issue. For general-purpose software compression, new codecs (Zstandard, LZ4 with entropy) prefer ANS over traditional arithmetic coding for speed.

How does adaptive arithmetic coding work in practice?

Start with a frequency table initialised with uniform distribution or prior knowledge. After encoding a symbol, increment that symbol's count. Recompute cumulative frequencies after each symbol or after N symbols. In binary arithmetic coding (CABAC style), maintain a probability estimate (e.g., 12-bit integer for probLPS, range 0..4095). Update probLPS using a moving average: probLPS = probLPS - (probLPS >> N) for MPS occurrence, and probLPS = probLPS + ((full_range - probLPS) >> N) for LPS occurrence. This converges to actual symbol statistics without storing large tables.
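A minimal sketch of the count-based variant described above (the message and add-one prior are illustrative assumptions): code each symbol at its currently estimated probability, then update the counts:

# Adaptive frequency-table model: start from a uniform prior, update counts after each symbol.
# The running sum of -log2(P) is the ideal arithmetic-coding cost under this adaptive model.
import math

message = 'a' * 90 + 'b' * 10
counts = {'a': 1, 'b': 1}                     # add-one (Laplace) prior
bits = 0.0
for s in message:
    total = sum(counts.values())
    bits += -math.log2(counts[s] / total)     # cost of coding s with the current estimate
    counts[s] += 1                            # adapt: no side table needs to be transmitted
true_entropy = -(0.9 * math.log2(0.9) + 0.1 * math.log2(0.1))
print(f"adaptive model: {bits:.1f} bits for {len(message)} symbols "
      f"(true source entropy: {100 * true_entropy:.1f} bits)")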
