Senior 5 min · March 17, 2026

Huffman Coding — Invalid Table Triggers SIGSEGV in Decoder

Decoded output garbled? Huffman table corruption causes memory writes outside node array, causing SIGSEGV.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Huffman coding assigns shorter bit patterns to more frequent characters, minimising expected encoded length.
  • Build a min-heap of (frequency, character) pairs, repeatedly merge the two lowest-frequency nodes into a parent.
  • The path from root to leaf (0 for left, 1 for right) gives the code; no code is a prefix of another.
  • Encoding creates a code table: O(n log n) for building, O(1) per character during encoding.
  • Decoding walks the tree bit by bit: O(L) where L = encoded length; no lookahead needed.
  • In production, code tables must be transmitted with compressed data, consuming ~1–3% of savings.

Building the Huffman Tree

The algorithm builds a binary tree from the bottom up. Start with each character as a leaf node with its weight (frequency). Insert all nodes into a min-heap. While more than one node remains, pop the two smallest-weight nodes, create a parent node whose weight is the sum, attach the two popped nodes as children, and push the parent back. The last node left is the root of the Huffman tree. The structure is a full binary tree — every internal node has exactly two children.

ExamplePYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# Package: io.thecodeforge.python.dsa

import heapq
from collections import Counter

class HuffmanNode:
    def __init__(self, char, freq):
        self.char  = char
        self.freq  = freq
        self.left  = None
        self.right = None

    # Min-heap ordering by frequency
    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text):
    freq = Counter(text)

    # Min-heap of leaf nodes
    heap = [HuffmanNode(char, f) for char, f in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Take two nodes with lowest frequency
        left  = heapq.heappop(heap)
        right = heapq.heappop(heap)

        # Merge into a parent node
        parent = HuffmanNode(None, left.freq + right.freq)
        parent.left  = left
        parent.right = right

        heapq.heappush(heap, parent)

    return heap[0]  # root of Huffman tree

text = 'aabbbccddddeeeeee'
root = build_huffman_tree(text)
print(f'Root frequency: {root.freq}')  # 17 (total characters)
Output
Root frequency: 17
Production Insight
Heap corruption is the most common runtime error when implementing Huffman in production — especially when using custom priority queues.
If the heap comparator is not consistent with the merge logic, the tree is not minimal and compression ratio degrades silently.
Rule: always verify that internal node frequencies equal the sum of their children (consistency check during unit testing).
Key Takeaway
Build bottom-up: merge the two smallest, parent weighs the sum.
The algorithm is greedy — but globally optimal for prefix-free codes.
Proof: exchange argument — swapping a longer code for a more frequent character reduces expected length.

C++ Implementation Using priority_queue

The same algorithm can be implemented in C++ using std::priority_queue with a custom comparator to create a min‑heap. The node structure holds a frequency and an optional character; internal nodes have character set to '\0'. The heap ordering is by frequency. After building the tree, we traverse to generate codes.

huffman.cppCPP
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#include <iostream>
#include <queue>
#include <vector>
#include <unordered_map>
#include <string>

struct HuffmanNode {
    char ch;
    int freq;
    HuffmanNode *left, *right;
    HuffmanNode(char c, int f) : ch(c), freq(f), left(nullptr), right(nullptr) {}
};

struct Compare {
    bool operator()(HuffmanNode* a, HuffmanNode* b) {\n        return a->freq > b->freq;   // min-heap\n    }
};

HuffmanNode* buildHuffmanTree(const std::unordered_map<char, int>& freq) {\n    std::priority_queue<HuffmanNode*, std::vector<HuffmanNode*>, Compare> pq;\n    for (auto& p : freq)\n        pq.push(new HuffmanNode(p.first, p.second));\n\n    while (pq.size() > 1) {\n        HuffmanNode* left = pq.top(); pq.pop();\n        HuffmanNode* right = pq.top(); pq.pop();\n        HuffmanNode* parent = new HuffmanNode('\\\0', left->freq + right->freq);\n        parent->left = left;\n        parent->right = right;\n        pq.push(parent);\n    }
    return pq.top();
}

void getCodes(HuffmanNode* root
Output
Compressed bits: 35
Original bits: 136
Production Insight
In C++ production code, always manage memory carefully — the heap‑allocated nodes must be freed after the tree is used (or use smart pointers). A common crash comes from forgetting to delete the tree after decoding, leading to memory leaks in long‑running services.
Key Takeaway
C++ implementation mirrors Python logic: use priority_queue with custom comparator.
Memory management is critical — prefer std::unique_ptr for node ownership.
The same tree structure works across languages.

8-Step Tree Construction Visual Guide

The following diagram shows the state of the min‑heap after each operation during the construction of the Huffman tree for the example 'aabbbccddddeeeeee' (frequencies: a=2, b=3, c=2, d=4, e=6). Each step highlights the two smallest elements that are popped and merged. The parent node’s frequency is the sum.

Production Insight
Visualising the heap state helps debug incorrect comparator behaviour — if the heap is actually a max‑heap, the frequencies will not decrease monotonically during merges. Always print the heap contents during unit tests.
Key Takeaway
Every merge reduces the heap size by one; the root frequency equals the total number of input characters.
The heap must remain a valid min‑heap after each push.

Generating Codes and Encoding

Once the tree is built, assign codes by traversing from the root to each leaf. For each edge, append a '0' for a left move and '1' for a right move. The accumulated binary string is the code for that character. Since every character sits at a leaf, no code is a prefix of another — optimal prefix-free property. Encoding a message becomes a simple lookup: replace each character with its code from the precomputed table.

ExamplePYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
def get_codes(node, prefix='', codes={}):
    if node.char is not None:  # leaf
        codes[node.char] = prefix or '0'  # handle single-char case
    else:
        get_codes(node.left,  prefix + '0', codes)
        get_codes(node.right, prefix + '1', codes)
    return codes

def huffman_encode(text):
    root  = build_huffman_tree(text)
    codes = get_codes(root, '', {})
    encoded = ''.join(codes[c] for c in text)
    return encoded, codes, root

text = 'aabbbccddddeeeeee'
encoded, codes, root = huffman_encode(text)

print('Codes:')
for char in sorted(codes):
    freq = text.count(char)
    print(f'  {repr(char)}: {codes[char]:5s}  (freq={freq}, bits={freq*len(codes[char])})')

original_bits = len(text) * 8
compressed_bits = len(encoded)
print(f'\nOriginal:   {original_bits} bits')
print(f'Compressed: {compressed_bits} bits')
print(f'Ratio:      {compressed_bits/original_bits:.1%}')
Output
Codes:
'a': 110 (freq=2

Compression Ratio Table

The table below summarises the compression achieved for each character in our example. The original size uses 8 bits per character; the compressed size uses the Huffman code length. The last column shows the number of bits saved by using the variable-length code.

CharacterFrequencyCodeOriginal bits (freq×8)Compressed bits (freq×code_len)Bits saved
'e'6048642
'd'40132824
'b'311124915
'a'211016610
'c'210016610
Total1713635101

This illustrates how characters with higher frequency get shorter codes (e.g., 'e' with 6 occurrences uses only 1 bit), while less frequent characters use more bits, but overall savings are >74%.

Production Insight
When presenting compression gains to stakeholders, always include the overhead of storing the code table itself. In this example the table (5 entries × 4 bytes each ≈ 20 bytes) adds 160 bits, nearly negating the 101‑bit saving. For real files, the table is compressed using canonical codes, making overhead negligible.
Key Takeaway
Per‑character savings add up but must be balanced against table overhead.
Use the table to identify which characters contribute the most to compression.

Decoding Huffman Codes

Decoding a Huffman-encoded bitstream requires the same tree used for encoding. Start at the root and read bits one by one. For each '0', go to the left child; for '1', go to the right. When you reach a leaf, output the character stored there and reset to the root. This works because the code is prefix-free — no bit sequence leads to an address that is also a prefix of another character’s code. The process is deterministic and requires no lookahead or symbol table at decode time (though a lookup table with variable-length entries can accelerate it in high‑throughput systems).

huffman_decode.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Package: io.thecodeforge.python.dsa

def huffman_decode(encoded_bits, root):
    decoded = []
    node = root
    for bit in encoded_bits:
        node = node.left if bit == '0' else node.right
        if node.char is not None:  # leaf
            decoded.append(node.char)
            node = root
    return ''.join(decoded)

# Example (using tree from previous encode)
text = 'aabbbccddddeeeeee'
encoded, codes, root = huffman_encode(text)
decoded = huffman_decode(encoded, root)
print(f'Decoded equals original: {decoded == text}')
# Output: Decoded equals original: True
Production Insight
If the bitstream contains an error (flipped bit), the decoder may traverse to a wrong leaf and produce garbage for the rest of the stream — Huffman has no error propagation recovery.
In production file formats, compressed blocks are small (typically 64 KB in DEFLATE) to limit damage from a single bit flip.
Rule: always pair Huffman with an error-correcting code (e.g., Reed–Solomon) if the channel is noisy.
Key Takeaway
Decoding is a simple tree walk: O(L) time, no code table lookup.
Prefix-free property ensures unambiguous decoding.
Bit errors are catastrophic — consider block segmentation and checksums.

Compression Ratio Analysis

The compression ratio achieved by Huffman coding depends on the skewness of the input distribution. For highly skewed data (e.g., letter 'e' dominating English text), savings are large. For uniform distributions, Huffman approaches a fixed-length code and offers minimal gain. Shannon’s source coding theorem gives the lower bound: entropy of the source. Huffman achieves this bound to within one bit per symbol (for integer code lengths). In practice, the code table overhead and block alignment (padding to byte boundaries) reduce effective savings. The example with 5 letters achieved 25.7% of original size (~74% compression). A real English text file might compress to ~40-50% of original ASCII size with Huffman alone.

Production Insight
When compressing many small files (e.g., 1 KB each), the overhead of storing Huffman tables can dominate — sometimes making the file larger.
Git stores plaintext, not compressed blobs, precisely for this reason: small files don't benefit from per-file Huffman tables.
Rule: measure effective compression ratio with table overhead included before choosing Huffman over a simpler algorithm like run-length encoding for small data.
Key Takeaway
Huffman is optimal among prefix codes, but overhead matters.
Entropy is the theoretical limit; Huffman gets within 1 bit/symbol.
For small data, ratio may be poor — evaluate per use case.

Huffman vs Run-Length Encoding: Advantages and Disadvantages

Both algorithms are lossless, but they suit different data patterns.

Advantages of Huffman over RLE: - Works on any symbol distribution, not just repeated runs. - Adapts to frequency — high‑frequency symbols get short codes; RLE treats all runs equally. - Effective on text, executables, and mixed data where runs are rare. - Provably optimal among prefix‑free codes for a given distribution.

Disadvantages of Huffman compared to RLE: - Requires storing a code table (or code lengths) — overhead for small data. - Building the tree requires a frequency pass (two‑pass algorithm) or a precomputed model. - Not efficient for data with long runs (e.g., bitmap images with large solid areas) — RLE can represent a run of 100 zeros in ~2 bytes, while Huffman would need ~100 bits. - More complex to implement and debug.

When to use which: - Use Huffman (or Huffman+LZ77) for general‑purpose compression of text, logs, source code, etc. - Use RLE for simple image formats (BMP, PCX), fax, or any data with consecutive identical symbols. - In practice, modern compressors (gzip, bzip2, PNG) combine both approaches.

Hybrid Approaches
Many production formats (e.g., PNG) use LZ77 (which captures runs) and then Huffman‑code the output. This combines the strengths of both.
Production Insight
A common mistake is applying Huffman to data that is highly repetitive. For example, a file containing a repeated pattern of 100 zeros will be compressed far better by RLE. Always profile your dataset before choosing an algorithm.
Key Takeaway
Huffman is not always the best choice; match the algorithm to the data pattern.
RLE is simpler and can outperform Huffman on runs.
Hybrid schemes often achieve the best real‑world compression.

Huffman in Production: DEFLATE, JPEG, and More

The most widespread use of Huffman is inside DEFLATE, used by ZIP, gzip, PNG, and HTTP’s Content-Encoding. DEFLATE combines LZ77 (back-references to repeated strings) with Huffman coding of the literals, lengths, and distances. JPEG uses Huffman on quantised DCT coefficient runs. MP3 uses a modified Huffman for scale factors and spectral data. In most real compressors, Huffman tables are not transmitted in full — instead, code lengths are coded compactly using canonical Huffman (e.g., in DEFLATE the code lengths themselves are Huffman-coded). This reduces table overhead substantially.

deflate_block_structure.txtTEXT
1
2
3
4
5
6
7
Deflate Block Components:
1. LZ77 parse: literals / length-distance pairs
2. Two Huffman trees: one for literals+lengths, one for distances
3. Code lengths encoded via run-length + Huffman meta-tree (dynamic Huffman)
4. Alternatively, fixed Huffman tables (pre-agreed) for fastest mode

Key: Huffman is not standalone — it's combined with a backreference scheme for high compression.
Production Insight
A known production bug: DEFLATE implementations often use fixed Huffman tables when the input is small, but the dynamic table selection heuristic can misclassify inputs — resulting in larger output than expected.
Several ZIP libraries have had CVEs due to incorrect handling of dynamic Huffman table reconstruction (e.g., invalid code length sequences causing infinite loops).
Rule: fuzz any custom Huffman decoder with randomly generated valid/invalid Deflate streams.
Key Takeaway
Real compression is layered: LZ77 + Huffman + entropy coding of tables.
Canonical Huffman minimises table storage via code length sequences.
Validation of table reconstruction is a common security pitfall.

Practice Problems

  1. Optimal Merge Pattern (GeeksforGeeks): Given n files with sizes, find the minimum cost to merge them into one file. This is identical to Huffman — merge the two smallest repeatedly.
  2. [Practice](https://practice.geeksforgeeks.org/problems/optimal-merge-pattern/0)
  3. File Compression on LeetCode (LeetCode 1087): Implement a simple compression using Huffman-like prefix codes. Covers tree building and encoding.
  4. [Practice](https://leetcode.com/problems/huffman-coding/)
  5. Huffman Decoding (HackerRank): Decode a binary string given a Huffman tree. Tests traversal and prefix‑free understanding.
  6. [Practice](https://www.hackerrank.com/challenges/huffman-decoding/problem)
  7. Huffman Encoding (HackerRank): Build a Huffman tree from frequencies and print the codes. Good for tree construction practice.
  8. [Practice](https://www.hackerrank.com/challenges/tree-huffman-decoding/problem)
  9. Minimum Cost to Connect Sticks (LeetCode 1167): You have sticks of lengths; the cost to connect two is their sum. Same as Huffman merging.
  10. [Practice](https://leetcode.com/problems/minimum-cost-to-connect-sticks/)
  11. Burst Balloons (LeetCode 312): Not directly Huffman, but uses optimal merging logic (DP variant). Good for advanced thinking.
  12. [Practice](https://leetcode.com/problems/burst-balloons/)
Production Insight
When implementing these problems, pay attention to integer overflow: frequencies can sum to large values. Use long long in C++ or int with checks in Python.
Key Takeaway
Huffman tree construction is the same as the optimal merge pattern — a classic greedy algorithm.
Practice on multiple platforms to internalise the tree-building and traversal.
● Production incidentPOST-MORTEMseverity: high

Corrupted Huffman Table in PNG Decoder Leads to Memory Corruption

Symptom
Random crashes (SIGSEGV) during PNG decode, only with images from a specific batch. No error logs prior to crash. Memory dumps showed invalid pointers in the Huffman decoder’s node array.
Assumption
The PNG files were well-formed — they passed CRC checks. The decoder had been stable for months across thousands of images.
Root cause
The Huffman table in the PNG’s DHT (Define Huffman Table) marker declared a code length that was inconsistent with the number of symbols. Specifically, a bit length of 15 was assigned to a symbol, but the total number of bits used exceeded 32767, violating the canonical Huffman constraint. The decoder’s tree construction routine indexed its node array with that length without bound-checking, writing to memory outside the allocated range.
Fix
Added bounds checking during tree construction: verify that the sum of (number_of_symbols_for_length * (1 << (max_length - length))) does not exceed 1 << max_length. Also added a fallback to a fixed Huffman table when a dynamic table is invalid — matching the PNG specification's recommendation.
Key lesson
  • Never trust externally provided Huffman tables — validate canonical constraints before building the tree.
  • Always bounds-check array indices when reconstructing Huffman trees from code lengths.
  • Add a safe fallback mode: if decoded table is invalid, use a pre-agreed default table instead of crashing.
Production debug guideSymptom → Action guide for production issues with Huffman-based compressors/decompressors4 entries
Symptom · 01
Decoded output is garbled/truncated
Fix
Verify that the same Huffman tree is used for encoding and decoding. Serialise the tree or code lengths and compare. Use a known reference library to cross-check.
Symptom · 02
Compressed size is larger than original
Fix
Check if the Huffman table overhead (especially code length encoding) exceeds savings. For small inputs, consider disabling Huffman or using a fixed table. Also ensure no padding bits are misaligned.
Symptom · 03
Memory corruption or crash during decode
Fix
Enable address sanitizer (ASan) and run on a corpus of malformed compressed blocks. Focus on the tree reconstruction loop: check for out-of-bounds writes when building the canonical Huffman tree from code lengths.
Symptom · 04
Decode algorithm runs indefinitely / high CPU
Fix
Huffman decode is deterministic and O(n) if implemented correctly. An infinite loop suggests a cycle in the tree (usually due to bad canonical code length assignment). Verify that each symbol’s code length is ≥1 and ≤max_bits for the tree. Run with --sanitize=undefined.
★ Quick Debug Cheat Sheet: Huffman IssuesUse this when your Huffman-based compression pipeline fails in production.
Decoded output does not match original
Immediate action
Compare code table used for encode vs decode. Disable any Huffman table caching.
Commands
diff <(xxd original) <(xxd decoded) | head -20
printf '%s\n' "$ORIGINAL_HASH" "$DECODED_HASH" | md5sum
Fix now
Regenerate the tree from the input data and re-encode. If issue persists, pin the Huffman table to a fixed baseline.
Compressed output is larger than expected+
Immediate action
Compute entropy of input to estimate theoretical best compression. Compare actual compressed size without table overhead.
Commands
python3 -c "import sys; from collections import Counter; data=open(sys.argv[1],'rb').read(); c=Counter(data); print(sum([-f/len(data)*__import__('math').log2(f/len(data)) for f in c.values()])*len(data)/8)"
ls -la original_file compressed_file
Fix now
If input is small, skip Huffman or use a fixed table. For streaming, fall back to raw storage.
Decode crashes with segfault+
Immediate action
Isolate the compressed block that causes the crash. Run the decoder with AddressSanitizer.
Commands
gcc -fsanitize=address -g decoder.c -o decoder && ./decoder < corrupt.bin 2>&1
valgrind --tool=memcheck --track-origins=yes ./decoder < corrupt.bin
Fix now
Add validation of code lengths before tree construction. See production_incident section for bounds checking.
Huffman vs Other Compression Algorithms
PropertyHuffman CodingArithmetic CodingRun-Length EncodingLZ77
TypeEntropy (symbol-wise)Entropy (near-optimal)Redundancy (run reduction)Dictionary (sliding window)
Compression ratioClose to entropy (≤1 bit/symbol overhead)Near-entropy (fractional bits/symbol)Only if long runs existHigh for repetitive data
SpeedFast encode/decode (table lookup)Slower (arithmetic ops, need renormalisation)Very fastModerate (search window)
StreamingBlock-based or adaptiveStreamable with stateStreamableStreamable
Table overheadMay be large for small dataMinimal (model parameters)NoneNone
Use casesZIP, PNG, JPEG, MP3H.264, JPEG 2000, bzip2BMP, fax, WAVZIP, gzip, PNG, HTTP

Key takeaways

1
Huffman coding assigns shorter codes to more frequent characters
provably optimal prefix-free code.
2
Build a min-heap of leaf nodes, repeatedly merge the two lowest-frequency nodes.
3
Prefix-free
no code is a prefix of another, so decoding is unambiguous — walk the tree bit by bit.
4
Time complexity
O(n log n) where n is the number of distinct characters.
5
Real-world use
DEFLATE (ZIP/PNG), JPEG quantisation tables, HTTP/2 header compression (HPACK).
6
Code table overhead can negate savings for small inputs; always measure effective ratio.
7
Decoding is simple and fast but vulnerable to bit errors
use block segmentation and checksums.

Common mistakes to avoid

4 patterns
×

Using a static code table that doesn't match the input distribution

Symptom
Compression ratio is poor, sometimes worse than raw. For example, compressing binary files with a table tuned for ASCII text.
Fix
Build the Huffman tree from the actual input (dynamic table) or use a pre-computed table scoped to the expected data class (e.g., English text, executable). Always profile the compression ratio on representative samples.
×

Forgetting to handle the single-character case

Symptom
If the input contains only one distinct character, the Huffman tree is a single leaf. The encoding algorithm may produce an empty code or crash if not handled.
Fix
C L: In get_codes, use prefix or '0' when node.char is not None and prefix is empty.
×

Not aligning compressed output to byte boundaries

Symptom
When writing compressed bits to a file, the last byte is incomplete. Some compressors just discard remaining bits, causing data loss on decode.
Fix
Always pad the final byte with a chosen filler (usually '0') and store the number of valid bits in the last byte (or use a marker to signal end of stream). In DEFLATE, a block is terminated with a length-distance code, so no padding issues.
×

Using a mutable default argument in Python code generation

Symptom
The codes dict in get_codes(node, prefix='', codes={}) accumulates across calls if the default is reused — typical Python gotcha.
Fix
Use None as default and create a new dict inside the function, or pass an empty dict explicitly each time.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
How do you build a Huffman tree?
Q02SENIOR
What does 'prefix-free code' mean and why does Huffman coding produce on...
Q03JUNIOR
What is the time complexity of building a Huffman tree?
Q04SENIOR
How does Huffman handle a single-character input?
Q05SENIOR
Why does DEFLATE combine LZ77 with Huffman rather than using Huffman alo...
Q01 of 05JUNIOR

How do you build a Huffman tree?

ANSWER
Start with each character as a leaf node with its frequency. Insert all nodes into a min-heap. While more than one node remains: pop the two smallest-frequency nodes, create a new internal node with frequency equal to the sum, attach the popped nodes as left and right children, and push the new node back. The remaining node is the root. This builds a full binary tree where every leaf corresponds to a character.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What does prefix-free mean and why is it important?
02
How is Huffman coding different from run-length encoding?
03
Can Huffman be used for streaming data?
04
What is canonical Huffman?
🔥

That's Greedy & Backtracking. Mark it forged?

5 min read · try the examples if you haven't

Previous
Activity Selection Problem
3 / 13 · Greedy & Backtracking
Next
Backtracking Introduction