Huffman Coding — Invalid Table Triggers SIGSEGV in Decoder
Decoded output garbled? Huffman table corruption causes memory writes outside node array, causing SIGSEGV.
- Huffman coding assigns shorter bit patterns to more frequent characters, minimising expected encoded length.
- Build a min-heap of (frequency, character) pairs, repeatedly merge the two lowest-frequency nodes into a parent.
- The path from root to leaf (0 for left, 1 for right) gives the code; no code is a prefix of another.
- Encoding creates a code table: O(n log n) for building, O(1) per character during encoding.
- Decoding walks the tree bit by bit: O(L) where L = encoded length; no lookahead needed.
- In production, code tables must be transmitted with compressed data, consuming ~1–3% of savings.
Building the Huffman Tree
The algorithm builds a binary tree from the bottom up. Start with each character as a leaf node with its weight (frequency). Insert all nodes into a min-heap. While more than one node remains, pop the two smallest-weight nodes, create a parent node whose weight is the sum, attach the two popped nodes as children, and push the parent back. The last node left is the root of the Huffman tree. The structure is a full binary tree — every internal node has exactly two children.
C++ Implementation Using priority_queue
The same algorithm can be implemented in C++ using std::priority_queue with a custom comparator to create a min‑heap. The node structure holds a frequency and an optional character; internal nodes have character set to '\0'. The heap ordering is by frequency. After building the tree, we traverse to generate codes.
8-Step Tree Construction Visual Guide
The following diagram shows the state of the min‑heap after each operation during the construction of the Huffman tree for the example 'aabbbccddddeeeeee' (frequencies: a=2, b=3, c=2, d=4, e=6). Each step highlights the two smallest elements that are popped and merged. The parent node’s frequency is the sum.
Generating Codes and Encoding
Once the tree is built, assign codes by traversing from the root to each leaf. For each edge, append a '0' for a left move and '1' for a right move. The accumulated binary string is the code for that character. Since every character sits at a leaf, no code is a prefix of another — optimal prefix-free property. Encoding a message becomes a simple lookup: replace each character with its code from the precomputed table.
Compression Ratio Table
The table below summarises the compression achieved for each character in our example. The original size uses 8 bits per character; the compressed size uses the Huffman code length. The last column shows the number of bits saved by using the variable-length code.
| Character | Frequency | Code | Original bits (freq×8) | Compressed bits (freq×code_len) | Bits saved |
|---|---|---|---|---|---|
| 'e' | 6 | 0 | 48 | 6 | 42 |
| 'd' | 4 | 01 | 32 | 8 | 24 |
| 'b' | 3 | 111 | 24 | 9 | 15 |
| 'a' | 2 | 110 | 16 | 6 | 10 |
| 'c' | 2 | 100 | 16 | 6 | 10 |
| Total | 17 | 136 | 35 | 101 |
This illustrates how characters with higher frequency get shorter codes (e.g., 'e' with 6 occurrences uses only 1 bit), while less frequent characters use more bits, but overall savings are >74%.
Decoding Huffman Codes
Decoding a Huffman-encoded bitstream requires the same tree used for encoding. Start at the root and read bits one by one. For each '0', go to the left child; for '1', go to the right. When you reach a leaf, output the character stored there and reset to the root. This works because the code is prefix-free — no bit sequence leads to an address that is also a prefix of another character’s code. The process is deterministic and requires no lookahead or symbol table at decode time (though a lookup table with variable-length entries can accelerate it in high‑throughput systems).
Compression Ratio Analysis
The compression ratio achieved by Huffman coding depends on the skewness of the input distribution. For highly skewed data (e.g., letter 'e' dominating English text), savings are large. For uniform distributions, Huffman approaches a fixed-length code and offers minimal gain. Shannon’s source coding theorem gives the lower bound: entropy of the source. Huffman achieves this bound to within one bit per symbol (for integer code lengths). In practice, the code table overhead and block alignment (padding to byte boundaries) reduce effective savings. The example with 5 letters achieved 25.7% of original size (~74% compression). A real English text file might compress to ~40-50% of original ASCII size with Huffman alone.
Huffman vs Run-Length Encoding: Advantages and Disadvantages
Both algorithms are lossless, but they suit different data patterns.
Advantages of Huffman over RLE: - Works on any symbol distribution, not just repeated runs. - Adapts to frequency — high‑frequency symbols get short codes; RLE treats all runs equally. - Effective on text, executables, and mixed data where runs are rare. - Provably optimal among prefix‑free codes for a given distribution.
Disadvantages of Huffman compared to RLE: - Requires storing a code table (or code lengths) — overhead for small data. - Building the tree requires a frequency pass (two‑pass algorithm) or a precomputed model. - Not efficient for data with long runs (e.g., bitmap images with large solid areas) — RLE can represent a run of 100 zeros in ~2 bytes, while Huffman would need ~100 bits. - More complex to implement and debug.
When to use which: - Use Huffman (or Huffman+LZ77) for general‑purpose compression of text, logs, source code, etc. - Use RLE for simple image formats (BMP, PCX), fax, or any data with consecutive identical symbols. - In practice, modern compressors (gzip, bzip2, PNG) combine both approaches.
Huffman in Production: DEFLATE, JPEG, and More
The most widespread use of Huffman is inside DEFLATE, used by ZIP, gzip, PNG, and HTTP’s Content-Encoding. DEFLATE combines LZ77 (back-references to repeated strings) with Huffman coding of the literals, lengths, and distances. JPEG uses Huffman on quantised DCT coefficient runs. MP3 uses a modified Huffman for scale factors and spectral data. In most real compressors, Huffman tables are not transmitted in full — instead, code lengths are coded compactly using canonical Huffman (e.g., in DEFLATE the code lengths themselves are Huffman-coded). This reduces table overhead substantially.
Practice Problems
Sharpen your Huffman skills with these problems:
- Optimal Merge Pattern (GeeksforGeeks): Given n files with sizes, find the minimum cost to merge them into one file. This is identical to Huffman — merge the two smallest repeatedly.
- [Practice](https://practice.geeksforgeeks.org/problems/optimal-merge-pattern/0)
- File Compression on LeetCode (LeetCode 1087): Implement a simple compression using Huffman-like prefix codes. Covers tree building and encoding.
- [Practice](https://leetcode.com/problems/huffman-coding/)
- Huffman Decoding (HackerRank): Decode a binary string given a Huffman tree. Tests traversal and prefix‑free understanding.
- [Practice](https://www.hackerrank.com/challenges/huffman-decoding/problem)
- Huffman Encoding (HackerRank): Build a Huffman tree from frequencies and print the codes. Good for tree construction practice.
- [Practice](https://www.hackerrank.com/challenges/tree-huffman-decoding/problem)
- Minimum Cost to Connect Sticks (LeetCode 1167): You have sticks of lengths; the cost to connect two is their sum. Same as Huffman merging.
- [Practice](https://leetcode.com/problems/minimum-cost-to-connect-sticks/)
- Burst Balloons (LeetCode 312): Not directly Huffman, but uses optimal merging logic (DP variant). Good for advanced thinking.
- [Practice](https://leetcode.com/problems/burst-balloons/)
long long in C++ or int with checks in Python.Corrupted Huffman Table in PNG Decoder Leads to Memory Corruption
- Never trust externally provided Huffman tables — validate canonical constraints before building the tree.
- Always bounds-check array indices when reconstructing Huffman trees from code lengths.
- Add a safe fallback mode: if decoded table is invalid, use a pre-agreed default table instead of crashing.
Key takeaways
Common mistakes to avoid
4 patternsUsing a static code table that doesn't match the input distribution
Forgetting to handle the single-character case
prefix or '0' when node.char is not None and prefix is empty.Not aligning compressed output to byte boundaries
Using a mutable default argument in Python code generation
codes dict in get_codes(node, prefix='', codes={}) accumulates across calls if the default is reused — typical Python gotcha.None as default and create a new dict inside the function, or pass an empty dict explicitly each time.Interview Questions on This Topic
How do you build a Huffman tree?
Frequently Asked Questions
That's Greedy & Backtracking. Mark it forged?
5 min read · try the examples if you haven't