Huffman Coding — Invalid Table Triggers SIGSEGV in Decoder
Decoded output garbled? Huffman table corruption causes memory writes outside node array, causing SIGSEGV.
20+ years shipping performance-critical code where algorithms decide the bill. Drawn from code that ran under real load.
- Huffman coding assigns shorter bit patterns to more frequent characters, minimising expected encoded length.
- Build a min-heap of (frequency, character) pairs, repeatedly merge the two lowest-frequency nodes into a parent.
- The path from root to leaf (0 for left, 1 for right) gives the code; no code is a prefix of another.
- Encoding creates a code table: O(n log n) for building, O(1) per character during encoding.
- Decoding walks the tree bit by bit: O(L) where L = encoded length; no lookahead needed.
- In production, code tables must be transmitted with compressed data, consuming ~1–3% of savings.
How Huffman Coding Achieves Optimal Prefix-Free Compression
Huffman coding is a lossless data compression algorithm that assigns variable-length codes to input symbols based on their frequency. The core mechanic: more frequent symbols get shorter codes, less frequent symbols get longer codes, and no code is a prefix of any other — this prefix-free property ensures unambiguous decoding without delimiters.
The algorithm builds a binary tree by repeatedly merging the two least-frequent nodes into a parent whose weight is the sum of its children. This greedy approach guarantees an optimal prefix code for a given frequency distribution — a proof of optimality that holds when frequencies are known and static. In practice, the tree is traversed to assign bit strings: left edges get 0, right edges get 1. The resulting code table maps each symbol to its unique bit sequence.
Use Huffman coding when you need lossless compression with known symbol frequencies — it's the final stage in JPEG, PNG, and DEFLATE (gzip). Its O(n log n) construction cost is acceptable for most workloads, but the real win is that it compresses near the entropy limit for arbitrary data, making it a go-to for file formats and network protocols where every byte matters.
Building the Huffman Tree
The algorithm builds a binary tree from the bottom up. Start with each character as a leaf node with its weight (frequency). Insert all nodes into a min-heap. While more than one node remains, pop the two smallest-weight nodes, create a parent node whose weight is the sum, attach the two popped nodes as children, and push the parent back. The last node left is the root of the Huffman tree. The structure is a full binary tree — every internal node has exactly two children.
C++ Implementation Using priority_queue
The same algorithm can be implemented in C++ using std::priority_queue with a custom comparator to create a min‑heap. The node structure holds a frequency and an optional character; internal nodes have character set to '\0'. The heap ordering is by frequency. After building the tree, we traverse to generate codes.
8-Step Tree Construction Visual Guide
The following diagram shows the state of the min‑heap after each operation during the construction of the Huffman tree for the example 'aabbbccddddeeeeee' (frequencies: a=2, b=3, c=2, d=4, e=6). Each step highlights the two smallest elements that are popped and merged. The parent node’s frequency is the sum.
Generating Codes and Encoding
Once the tree is built, assign codes by traversing from the root to each leaf. For each edge, append a '0' for a left move and '1' for a right move. The accumulated binary string is the code for that character. Since every character sits at a leaf, no code is a prefix of another — optimal prefix-free property. Encoding a message becomes a simple lookup: replace each character with its code from the precomputed table.
Compression Ratio Table
The table below summarises the compression achieved for each character in our example. The original size uses 8 bits per character; the compressed size uses the Huffman code length. The last column shows the number of bits saved by using the variable-length code.
| Character | Frequency | Code | Original bits (freq×8) | Compressed bits (freq×code_len) | Bits saved |
|---|---|---|---|---|---|
| 'e' | 6 | 0 | 48 | 6 | 42 |
| 'd' | 4 | 01 | 32 | 8 | 24 |
| 'b' | 3 | 111 | 24 | 9 | 15 |
| 'a' | 2 | 110 | 16 | 6 | 10 |
| 'c' | 2 | 100 | 16 | 6 | 10 |
| Total | 17 | 136 | 35 | 101 |
This illustrates how characters with higher frequency get shorter codes (e.g., 'e' with 6 occurrences uses only 1 bit), while less frequent characters use more bits, but overall savings are >74%.
Decoding Huffman Codes
Decoding a Huffman-encoded bitstream requires the same tree used for encoding. Start at the root and read bits one by one. For each '0', go to the left child; for '1', go to the right. When you reach a leaf, output the character stored there and reset to the root. This works because the code is prefix-free — no bit sequence leads to an address that is also a prefix of another character’s code. The process is deterministic and requires no lookahead or symbol table at decode time (though a lookup table with variable-length entries can accelerate it in high‑throughput systems).
Compression Ratio Analysis
The compression ratio achieved by Huffman coding depends on the skewness of the input distribution. For highly skewed data (e.g., letter 'e' dominating English text), savings are large. For uniform distributions, Huffman approaches a fixed-length code and offers minimal gain. Shannon’s source coding theorem gives the lower bound: entropy of the source. Huffman achieves this bound to within one bit per symbol (for integer code lengths). In practice, the code table overhead and block alignment (padding to byte boundaries) reduce effective savings. The example with 5 letters achieved 25.7% of original size (~74% compression). A real English text file might compress to ~40-50% of original ASCII size with Huffman alone.
Huffman vs Run-Length Encoding: Advantages and Disadvantages
Both algorithms are lossless, but they suit different data patterns.
Advantages of Huffman over RLE: - Works on any symbol distribution, not just repeated runs. - Adapts to frequency — high‑frequency symbols get short codes; RLE treats all runs equally. - Effective on text, executables, and mixed data where runs are rare. - Provably optimal among prefix‑free codes for a given distribution.
Disadvantages of Huffman compared to RLE: - Requires storing a code table (or code lengths) — overhead for small data. - Building the tree requires a frequency pass (two‑pass algorithm) or a precomputed model. - Not efficient for data with long runs (e.g., bitmap images with large solid areas) — RLE can represent a run of 100 zeros in ~2 bytes, while Huffman would need ~100 bits. - More complex to implement and debug.
When to use which: - Use Huffman (or Huffman+LZ77) for general‑purpose compression of text, logs, source code, etc. - Use RLE for simple image formats (BMP, PCX), fax, or any data with consecutive identical symbols. - In practice, modern compressors (gzip, bzip2, PNG) combine both approaches.
Huffman in Production: DEFLATE, JPEG, and More
The most widespread use of Huffman is inside DEFLATE, used by ZIP, gzip, PNG, and HTTP’s Content-Encoding. DEFLATE combines LZ77 (back-references to repeated strings) with Huffman coding of the literals, lengths, and distances. JPEG uses Huffman on quantised DCT coefficient runs. MP3 uses a modified Huffman for scale factors and spectral data. In most real compressors, Huffman tables are not transmitted in full — instead, code lengths are coded compactly using canonical Huffman (e.g., in DEFLATE the code lengths themselves are Huffman-coded). This reduces table overhead substantially.
Practice Problems
Sharpen your Huffman skills with these problems:
- Optimal Merge Pattern (GeeksforGeeks): Given n files with sizes, find the minimum cost to merge them into one file. This is identical to Huffman — merge the two smallest repeatedly.
- [Practice](https://practice.geeksforgeeks.org/problems/optimal-merge-pattern/0)
- File Compression on LeetCode (LeetCode 1087): Implement a simple compression using Huffman-like prefix codes. Covers tree building and encoding.
- [Practice](https://leetcode.com/problems/huffman-coding/)
- Huffman Decoding (HackerRank): Decode a binary string given a Huffman tree. Tests traversal and prefix‑free understanding.
- [Practice](https://www.hackerrank.com/challenges/huffman-decoding/problem)
- Huffman Encoding (HackerRank): Build a Huffman tree from frequencies and print the codes. Good for tree construction practice.
- [Practice](https://www.hackerrank.com/challenges/tree-huffman-decoding/problem)
- Minimum Cost to Connect Sticks (LeetCode 1167): You have sticks of lengths; the cost to connect two is their sum. Same as Huffman merging.
- [Practice](https://leetcode.com/problems/minimum-cost-to-connect-sticks/)
- Burst Balloons (LeetCode 312): Not directly Huffman, but uses optimal merging logic (DP variant). Good for advanced thinking.
- [Practice](https://leetcode.com/problems/burst-balloons/)
long long in C++ or int with checks in Python.Huffman Coding Complexity: Why It Scales
You're compressing a megabyte of text. Is Huffman going to choke? No. Here's why.
The algorithm has two phases: building the tree and traversing it. Building uses a priority queue. You pop two nodes, push one back, repeat until one node remains. With n unique characters, that's O(n log n) — each priority queue operation is O(log n), and you do it n-1 times. For a typical ASCII file where n ≤ 128, this is instant.
Traversing the tree to assign codes is O(n) because you walk each path exactly once. Decoding is even better: O(m) where m is the length of the compressed bitstream. You traverse the tree bit by bit, no backtracking. That's the prefix property paying off.
Memory is the real concern. The tree nodes each store a character, frequency, and two pointers. With n nodes, that's O(n) memory. For a 1000-character alphabet, you're looking at a few kilobytes. Nothing modern hardware can't handle.
The bottom line: Huffman coding scales linearly with input size during decode, and log-linearly during encode. It's not the fastest compressor (check LZ4 for speed), but it's guaranteed optimal for symbol-by-symbol encoding. That's the trade-off you accept.
Decoding the Code: How It Actually Works
You've encoded your file. Now the hard part: reading that bitstream back without corruption. Decoding Huffman is simple — if you have the tree. Here's the reality: do you store the tree in the compressed file? If yes, you lose some compression gain. If no, you need a pre-agreed table.
Production systems store the canonical Huffman tree — a compact encoding of the code lengths, not the full tree structure. DEFLATE uses this. You send a table of {character: bit_length} pairs, then reconstruct the tree deterministically on the decode side. Saves space, prevents ambiguity.
The decode loop is a state machine: start at root, read one bit, go left on 0, right on 1. Hit a leaf? Emit the character, reset to root. Continue until you exhaust the bitstream. That's it. If the bitstream is malformed (e.g., truncated), you'll walk off the tree — always validate the input length.
Edge case: what if your file ends mid-byte? Standard practice: pad the final byte with zero bits and write the number of valid bits in the file header. Ignore the rest during decode. Every production implementation does this.
Code example: a Java decoder that reads a canonical Huffman table and decodes a bitstream. Notice the bit-by-bit navigation — no recursion, no call stack blowup.
Huffman Coding Algorithm: The Core Loop You'll See Everywhere
You've seen the tree. You've seen the complexity. Now here's the raw algorithm that every implementation follows. Memorize it. You'll use this pattern in everything from DEFLATE to custom compression protocols.
Step 1: Build frequency map for each symbol. Step 2: Insert each symbol as a leaf node into a min-heap (priority queue keyed on frequency). Step 3: While heap has more than 1 node: - Extract two minimum frequency nodes. - Create a new internal node whose frequency = sum of the two. - Make those two nodes its children. - Insert the new node back into the heap. Step 4: The remaining node is the root of your Huffman tree. Step 5: Traverse the tree. Assign '0' for left edges, '1' for right edges. Record the bitstring when you hit a leaf.
That's the greedy part: you always combine the two smallest frequencies. Proof: optimal substructure. If you pick any other pairing, total path length increases. Don't argue with math.
Real-world note: never build the tree from scratch if you can cache it. DEFLATE builds a new tree per block (max 64KB), but static Huffman tables exist for common alphabets. JPEG uses fixed tables for DC coefficients. Pick your trade-off: dynamic tables adapt better, static tables decode faster.
Corrupted Huffman Table in PNG Decoder Leads to Memory Corruption
- Never trust externally provided Huffman tables — validate canonical constraints before building the tree.
- Always bounds-check array indices when reconstructing Huffman trees from code lengths.
- Add a safe fallback mode: if decoded table is invalid, use a pre-agreed default table instead of crashing.
diff <(xxd original) <(xxd decoded) | head -20printf '%s\n' "$ORIGINAL_HASH" "$DECODED_HASH" | md5sumKey takeaways
Common mistakes to avoid
4 patternsUsing a static code table that doesn't match the input distribution
Forgetting to handle the single-character case
prefix or '0' when node.char is not None and prefix is empty.Not aligning compressed output to byte boundaries
Using a mutable default argument in Python code generation
codes dict in get_codes(node, prefix='', codes={}) accumulates across calls if the default is reused — typical Python gotcha.None as default and create a new dict inside the function, or pass an empty dict explicitly each time.Practice These on LeetCode
Interview Questions on This Topic
How do you build a Huffman tree?
Frequently Asked Questions
20+ years shipping performance-critical code where algorithms decide the bill. Drawn from code that ran under real load.
That's Greedy & Backtracking. Mark it forged?
9 min read · try the examples if you haven't