Senior 9 min · March 17, 2026

Huffman Coding — Invalid Table Triggers SIGSEGV in Decoder

Decoded output garbled? Huffman table corruption causes memory writes outside node array, causing SIGSEGV.

N
Naren Founder & Principal Engineer

20+ years shipping performance-critical code where algorithms decide the bill. Drawn from code that ran under real load.

Follow
Production
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Huffman coding assigns shorter bit patterns to more frequent characters, minimising expected encoded length.
  • Build a min-heap of (frequency, character) pairs, repeatedly merge the two lowest-frequency nodes into a parent.
  • The path from root to leaf (0 for left, 1 for right) gives the code; no code is a prefix of another.
  • Encoding creates a code table: O(n log n) for building, O(1) per character during encoding.
  • Decoding walks the tree bit by bit: O(L) where L = encoded length; no lookahead needed.
  • In production, code tables must be transmitted with compressed data, consuming ~1–3% of savings.
✦ Definition~90s read
What is Huffman Coding?

Huffman coding is a lossless data compression algorithm that assigns variable-length codes to input symbols based on their frequencies, with shorter codes for more frequent symbols. It produces a prefix-free code—no codeword is a prefix of another—which guarantees unambiguous decoding without needing delimiters.

The algorithm constructs an optimal binary tree by repeatedly merging the two least-frequent nodes, ensuring the resulting code minimizes the total weighted path length. This optimality proof, established by David Huffman in 1952, makes it a foundational technique in compression, used in formats like JPEG, PNG, MP3, and DEFLATE (the basis of gzip and zlib).

In practice, Huffman coding is not always the best choice. It assumes a static, known symbol distribution and performs poorly when frequencies are uniform or when symbol counts are small, where simpler fixed-length codes suffice. For streaming or adaptive scenarios, variants like adaptive Huffman coding or arithmetic coding (which achieves better compression for skewed distributions) are preferred.

The algorithm's O(n log n) complexity from the priority queue construction is acceptable for most use cases, but the decoder must handle malformed or invalid tables gracefully—a corrupted table can cause out-of-bounds memory access, leading to SIGSEGV crashes in C++ implementations that lack bounds checking.

The core vulnerability in Huffman decoders arises from the implicit trust in the code table. A valid Huffman tree ensures that every possible bit sequence eventually terminates at a leaf, but an invalid table—with missing leaves, cycles, or codes exceeding the maximum bit length—can cause the decoder to traverse beyond allocated memory.

Real-world exploits have targeted this in image libraries (libjpeg-turbo, libpng) and archive tools (zlib). Mitigations include validating the table against the Kraft inequality, enforcing maximum code length limits (e.g., 15 bits in DEFLATE), and using safe array access patterns with explicit bounds checks.

How Huffman Coding Achieves Optimal Prefix-Free Compression

Huffman coding is a lossless data compression algorithm that assigns variable-length codes to input symbols based on their frequency. The core mechanic: more frequent symbols get shorter codes, less frequent symbols get longer codes, and no code is a prefix of any other — this prefix-free property ensures unambiguous decoding without delimiters.

The algorithm builds a binary tree by repeatedly merging the two least-frequent nodes into a parent whose weight is the sum of its children. This greedy approach guarantees an optimal prefix code for a given frequency distribution — a proof of optimality that holds when frequencies are known and static. In practice, the tree is traversed to assign bit strings: left edges get 0, right edges get 1. The resulting code table maps each symbol to its unique bit sequence.

Use Huffman coding when you need lossless compression with known symbol frequencies — it's the final stage in JPEG, PNG, and DEFLATE (gzip). Its O(n log n) construction cost is acceptable for most workloads, but the real win is that it compresses near the entropy limit for arbitrary data, making it a go-to for file formats and network protocols where every byte matters.

Prefix-Free ≠ Unique Decodability
Huffman codes are prefix-free, which is stronger than unique decodability — but a malformed tree (e.g., missing leaf) breaks both, causing decoder crashes.
Production Insight
Teams using custom Huffman decoders for embedded telemetry often ship a code table that omits a low-frequency symbol. The decoder reads a bit sequence that doesn't match any leaf, walks off the tree into null, and SIGSEGV. Rule: always validate that every possible bit sequence terminates at a valid leaf — or pad the tree with a sentinel.
Key Takeaway
Huffman coding is optimal only for a known, static frequency distribution — dynamic data needs adaptive variants.
The prefix-free property is enforced by the tree structure; a single missing leaf makes the entire code table invalid.
Decoders must handle undefined bit sequences gracefully — crash-on-invalid is a design bug, not a feature.
Huffman Decoder SIGSEGV on Invalid Table THECODEFORGE.IO Huffman Decoder SIGSEGV on Invalid Table Flow from tree construction to decoding crash Frequency Table Input Character frequencies for encoding Build Huffman Tree Using priority_queue for min-heap Generate Prefix Codes Traverse tree to assign bit strings Encode Data Replace characters with codes Decode with Table Traverse tree using bits SIGSEGV on Invalid Table Malformed table causes null pointer ⚠ Invalid Huffman table triggers null pointer dereference Validate table before decoding to avoid SIGSEGV THECODEFORGE.IO
thecodeforge.io
Huffman Decoder SIGSEGV on Invalid Table
Huffman Coding

Building the Huffman Tree

The algorithm builds a binary tree from the bottom up. Start with each character as a leaf node with its weight (frequency). Insert all nodes into a min-heap. While more than one node remains, pop the two smallest-weight nodes, create a parent node whose weight is the sum, attach the two popped nodes as children, and push the parent back. The last node left is the root of the Huffman tree. The structure is a full binary tree — every internal node has exactly two children.

ExamplePYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# Package: io.thecodeforge.python.dsa

import heapq
from collections import Counter

class HuffmanNode:
    def __init__(self, char, freq):
        self.char  = char
        self.freq  = freq
        self.left  = None
        self.right = None

    # Min-heap ordering by frequency
    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text):
    freq = Counter(text)

    # Min-heap of leaf nodes
    heap = [HuffmanNode(char, f) for char, f in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Take two nodes with lowest frequency
        left  = heapq.heappop(heap)
        right = heapq.heappop(heap)

        # Merge into a parent node
        parent = HuffmanNode(None, left.freq + right.freq)
        parent.left  = left
        parent.right = right

        heapq.heappush(heap, parent)

    return heap[0]  # root of Huffman tree

text = 'aabbbccddddeeeeee'
root = build_huffman_tree(text)
print(f'Root frequency: {root.freq}')  # 17 (total characters)
Output
Root frequency: 17
Production Insight
Heap corruption is the most common runtime error when implementing Huffman in production — especially when using custom priority queues.
If the heap comparator is not consistent with the merge logic, the tree is not minimal and compression ratio degrades silently.
Rule: always verify that internal node frequencies equal the sum of their children (consistency check during unit testing).
Key Takeaway
Build bottom-up: merge the two smallest, parent weighs the sum.
The algorithm is greedy — but globally optimal for prefix-free codes.
Proof: exchange argument — swapping a longer code for a more frequent character reduces expected length.

C++ Implementation Using priority_queue

The same algorithm can be implemented in C++ using std::priority_queue with a custom comparator to create a min‑heap. The node structure holds a frequency and an optional character; internal nodes have character set to '\0'. The heap ordering is by frequency. After building the tree, we traverse to generate codes.

huffman.cppCPP
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#include <iostream>
#include <queue>
#include <vector>
#include <unordered_map>
#include <string>

struct HuffmanNode {
    char ch;
    int freq;
    HuffmanNode *left, *right;
    HuffmanNode(char c, int f) : ch(c), freq(f), left(nullptr), right(nullptr) {}
};

struct Compare {
    bool operator()(HuffmanNode* a, HuffmanNode* b) {\n        return a->freq > b->freq;   // min-heap\n    }
};

HuffmanNode* buildHuffmanTree(const std::unordered_map<char, int>& freq) {\n    std::priority_queue<HuffmanNode*, std::vector<HuffmanNode*>, Compare> pq;\n    for (auto& p : freq)\n        pq.push(new HuffmanNode(p.first, p.second));\n\n    while (pq.size() > 1) {\n        HuffmanNode* left = pq.top(); pq.pop();\n        HuffmanNode* right = pq.top(); pq.pop();\n        HuffmanNode* parent = new HuffmanNode('\\\0', left->freq + right->freq);\n        parent->left = left;\n        parent->right = right;\n        pq.push(parent);\n    }
    return pq.top();
}

void getCodes(HuffmanNode* root
Output
Compressed bits: 35
Original bits: 136
Production Insight
In C++ production code, always manage memory carefully — the heap‑allocated nodes must be freed after the tree is used (or use smart pointers). A common crash comes from forgetting to delete the tree after decoding, leading to memory leaks in long‑running services.
Key Takeaway
C++ implementation mirrors Python logic: use priority_queue with custom comparator.
Memory management is critical — prefer std::unique_ptr for node ownership.
The same tree structure works across languages.

8-Step Tree Construction Visual Guide

The following diagram shows the state of the min‑heap after each operation during the construction of the Huffman tree for the example 'aabbbccddddeeeeee' (frequencies: a=2, b=3, c=2, d=4, e=6). Each step highlights the two smallest elements that are popped and merged. The parent node’s frequency is the sum.

Production Insight
Visualising the heap state helps debug incorrect comparator behaviour — if the heap is actually a max‑heap, the frequencies will not decrease monotonically during merges. Always print the heap contents during unit tests.
Key Takeaway
Every merge reduces the heap size by one; the root frequency equals the total number of input characters.
The heap must remain a valid min‑heap after each push.
Huffman Tree Construction — Heap States
Step 1
2
2
3
4
6
Initial heap: [2,2,3,4,6] (a=2, c=2, b=3, d=4, e=6)
Step 2
3
4
4
6
Pop a & c, merge → parent 4; heap: [3,4,4,6] (b=3, d=4, new=4, e=6)
Step 3
4
4
6
7
Merge b(3) & d(4) → parent 7; heap: [4,4,6,7] (new=4, e=6, parent=7)
Step 4
6
7
8
Merge two 4s → parent 8; heap: [6,7,8] (e=6, parent=7, parent=8)
Step 5
8
13
Merge e(6) & parent(7) → parent 13; heap: [8,13]
Step 6
21
Merge last two (8 & 13) → root 21; heap has one node
Step 7
21
Tree complete. Root frequency = 21 (total characters).

Generating Codes and Encoding

Once the tree is built, assign codes by traversing from the root to each leaf. For each edge, append a '0' for a left move and '1' for a right move. The accumulated binary string is the code for that character. Since every character sits at a leaf, no code is a prefix of another — optimal prefix-free property. Encoding a message becomes a simple lookup: replace each character with its code from the precomputed table.

ExamplePYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
def get_codes(node, prefix='', codes={}):
    if node.char is not None:  # leaf
        codes[node.char] = prefix or '0'  # handle single-char case
    else:
        get_codes(node.left,  prefix + '0', codes)
        get_codes(node.right, prefix + '1', codes)
    return codes

def huffman_encode(text):
    root  = build_huffman_tree(text)
    codes = get_codes(root, '', {})
    encoded = ''.join(codes[c] for c in text)
    return encoded, codes, root

text = 'aabbbccddddeeeeee'
encoded, codes, root = huffman_encode(text)

print('Codes:')
for char in sorted(codes):
    freq = text.count(char)
    print(f'  {repr(char)}: {codes[char]:5s}  (freq={freq}, bits={freq*len(codes[char])})')

original_bits = len(text) * 8
compressed_bits = len(encoded)
print(f'\nOriginal:   {original_bits} bits')
print(f'Compressed: {compressed_bits} bits')
print(f'Ratio:      {compressed_bits/original_bits:.1%}')
Output
Codes:
'a': 110 (freq=2

Compression Ratio Table

The table below summarises the compression achieved for each character in our example. The original size uses 8 bits per character; the compressed size uses the Huffman code length. The last column shows the number of bits saved by using the variable-length code.

CharacterFrequencyCodeOriginal bits (freq×8)Compressed bits (freq×code_len)Bits saved
'e'6048642
'd'40132824
'b'311124915
'a'211016610
'c'210016610
Total1713635101

This illustrates how characters with higher frequency get shorter codes (e.g., 'e' with 6 occurrences uses only 1 bit), while less frequent characters use more bits, but overall savings are >74%.

Production Insight
When presenting compression gains to stakeholders, always include the overhead of storing the code table itself. In this example the table (5 entries × 4 bytes each ≈ 20 bytes) adds 160 bits, nearly negating the 101‑bit saving. For real files, the table is compressed using canonical codes, making overhead negligible.
Key Takeaway
Per‑character savings add up but must be balanced against table overhead.
Use the table to identify which characters contribute the most to compression.

Decoding Huffman Codes

Decoding a Huffman-encoded bitstream requires the same tree used for encoding. Start at the root and read bits one by one. For each '0', go to the left child; for '1', go to the right. When you reach a leaf, output the character stored there and reset to the root. This works because the code is prefix-free — no bit sequence leads to an address that is also a prefix of another character’s code. The process is deterministic and requires no lookahead or symbol table at decode time (though a lookup table with variable-length entries can accelerate it in high‑throughput systems).

huffman_decode.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Package: io.thecodeforge.python.dsa

def huffman_decode(encoded_bits, root):
    decoded = []
    node = root
    for bit in encoded_bits:
        node = node.left if bit == '0' else node.right
        if node.char is not None:  # leaf
            decoded.append(node.char)
            node = root
    return ''.join(decoded)

# Example (using tree from previous encode)
text = 'aabbbccddddeeeeee'
encoded, codes, root = huffman_encode(text)
decoded = huffman_decode(encoded, root)
print(f'Decoded equals original: {decoded == text}')
# Output: Decoded equals original: True
Production Insight
If the bitstream contains an error (flipped bit), the decoder may traverse to a wrong leaf and produce garbage for the rest of the stream — Huffman has no error propagation recovery.
In production file formats, compressed blocks are small (typically 64 KB in DEFLATE) to limit damage from a single bit flip.
Rule: always pair Huffman with an error-correcting code (e.g., Reed–Solomon) if the channel is noisy.
Key Takeaway
Decoding is a simple tree walk: O(L) time, no code table lookup.
Prefix-free property ensures unambiguous decoding.
Bit errors are catastrophic — consider block segmentation and checksums.

Compression Ratio Analysis

The compression ratio achieved by Huffman coding depends on the skewness of the input distribution. For highly skewed data (e.g., letter 'e' dominating English text), savings are large. For uniform distributions, Huffman approaches a fixed-length code and offers minimal gain. Shannon’s source coding theorem gives the lower bound: entropy of the source. Huffman achieves this bound to within one bit per symbol (for integer code lengths). In practice, the code table overhead and block alignment (padding to byte boundaries) reduce effective savings. The example with 5 letters achieved 25.7% of original size (~74% compression). A real English text file might compress to ~40-50% of original ASCII size with Huffman alone.

Production Insight
When compressing many small files (e.g., 1 KB each), the overhead of storing Huffman tables can dominate — sometimes making the file larger.
Git stores plaintext, not compressed blobs, precisely for this reason: small files don't benefit from per-file Huffman tables.
Rule: measure effective compression ratio with table overhead included before choosing Huffman over a simpler algorithm like run-length encoding for small data.
Key Takeaway
Huffman is optimal among prefix codes, but overhead matters.
Entropy is the theoretical limit; Huffman gets within 1 bit/symbol.
For small data, ratio may be poor — evaluate per use case.

Huffman vs Run-Length Encoding: Advantages and Disadvantages

Both algorithms are lossless, but they suit different data patterns.

Advantages of Huffman over RLE: - Works on any symbol distribution, not just repeated runs. - Adapts to frequency — high‑frequency symbols get short codes; RLE treats all runs equally. - Effective on text, executables, and mixed data where runs are rare. - Provably optimal among prefix‑free codes for a given distribution.

Disadvantages of Huffman compared to RLE: - Requires storing a code table (or code lengths) — overhead for small data. - Building the tree requires a frequency pass (two‑pass algorithm) or a precomputed model. - Not efficient for data with long runs (e.g., bitmap images with large solid areas) — RLE can represent a run of 100 zeros in ~2 bytes, while Huffman would need ~100 bits. - More complex to implement and debug.

When to use which: - Use Huffman (or Huffman+LZ77) for general‑purpose compression of text, logs, source code, etc. - Use RLE for simple image formats (BMP, PCX), fax, or any data with consecutive identical symbols. - In practice, modern compressors (gzip, bzip2, PNG) combine both approaches.

Hybrid Approaches
Many production formats (e.g., PNG) use LZ77 (which captures runs) and then Huffman‑code the output. This combines the strengths of both.
Production Insight
A common mistake is applying Huffman to data that is highly repetitive. For example, a file containing a repeated pattern of 100 zeros will be compressed far better by RLE. Always profile your dataset before choosing an algorithm.
Key Takeaway
Huffman is not always the best choice; match the algorithm to the data pattern.
RLE is simpler and can outperform Huffman on runs.
Hybrid schemes often achieve the best real‑world compression.

Huffman in Production: DEFLATE, JPEG, and More

The most widespread use of Huffman is inside DEFLATE, used by ZIP, gzip, PNG, and HTTP’s Content-Encoding. DEFLATE combines LZ77 (back-references to repeated strings) with Huffman coding of the literals, lengths, and distances. JPEG uses Huffman on quantised DCT coefficient runs. MP3 uses a modified Huffman for scale factors and spectral data. In most real compressors, Huffman tables are not transmitted in full — instead, code lengths are coded compactly using canonical Huffman (e.g., in DEFLATE the code lengths themselves are Huffman-coded). This reduces table overhead substantially.

deflate_block_structure.txtTEXT
1
2
3
4
5
6
7
Deflate Block Components:
1. LZ77 parse: literals / length-distance pairs
2. Two Huffman trees: one for literals+lengths, one for distances
3. Code lengths encoded via run-length + Huffman meta-tree (dynamic Huffman)
4. Alternatively, fixed Huffman tables (pre-agreed) for fastest mode

Key: Huffman is not standalone — it's combined with a backreference scheme for high compression.
Production Insight
A known production bug: DEFLATE implementations often use fixed Huffman tables when the input is small, but the dynamic table selection heuristic can misclassify inputs — resulting in larger output than expected.
Several ZIP libraries have had CVEs due to incorrect handling of dynamic Huffman table reconstruction (e.g., invalid code length sequences causing infinite loops).
Rule: fuzz any custom Huffman decoder with randomly generated valid/invalid Deflate streams.
Key Takeaway
Real compression is layered: LZ77 + Huffman + entropy coding of tables.
Canonical Huffman minimises table storage via code length sequences.
Validation of table reconstruction is a common security pitfall.

Practice Problems

  1. Optimal Merge Pattern (GeeksforGeeks): Given n files with sizes, find the minimum cost to merge them into one file. This is identical to Huffman — merge the two smallest repeatedly.
  2. [Practice](https://practice.geeksforgeeks.org/problems/optimal-merge-pattern/0)
  3. File Compression on LeetCode (LeetCode 1087): Implement a simple compression using Huffman-like prefix codes. Covers tree building and encoding.
  4. [Practice](https://leetcode.com/problems/huffman-coding/)
  5. Huffman Decoding (HackerRank): Decode a binary string given a Huffman tree. Tests traversal and prefix‑free understanding.
  6. [Practice](https://www.hackerrank.com/challenges/huffman-decoding/problem)
  7. Huffman Encoding (HackerRank): Build a Huffman tree from frequencies and print the codes. Good for tree construction practice.
  8. [Practice](https://www.hackerrank.com/challenges/tree-huffman-decoding/problem)
  9. Minimum Cost to Connect Sticks (LeetCode 1167): You have sticks of lengths; the cost to connect two is their sum. Same as Huffman merging.
  10. [Practice](https://leetcode.com/problems/minimum-cost-to-connect-sticks/)
  11. Burst Balloons (LeetCode 312): Not directly Huffman, but uses optimal merging logic (DP variant). Good for advanced thinking.
  12. [Practice](https://leetcode.com/problems/burst-balloons/)
Production Insight
When implementing these problems, pay attention to integer overflow: frequencies can sum to large values. Use long long in C++ or int with checks in Python.
Key Takeaway
Huffman tree construction is the same as the optimal merge pattern — a classic greedy algorithm.
Practice on multiple platforms to internalise the tree-building and traversal.

Huffman Coding Complexity: Why It Scales

You're compressing a megabyte of text. Is Huffman going to choke? No. Here's why.

The algorithm has two phases: building the tree and traversing it. Building uses a priority queue. You pop two nodes, push one back, repeat until one node remains. With n unique characters, that's O(n log n) — each priority queue operation is O(log n), and you do it n-1 times. For a typical ASCII file where n ≤ 128, this is instant.

Traversing the tree to assign codes is O(n) because you walk each path exactly once. Decoding is even better: O(m) where m is the length of the compressed bitstream. You traverse the tree bit by bit, no backtracking. That's the prefix property paying off.

Memory is the real concern. The tree nodes each store a character, frequency, and two pointers. With n nodes, that's O(n) memory. For a 1000-character alphabet, you're looking at a few kilobytes. Nothing modern hardware can't handle.

The bottom line: Huffman coding scales linearly with input size during decode, and log-linearly during encode. It's not the fastest compressor (check LZ4 for speed), but it's guaranteed optimal for symbol-by-symbol encoding. That's the trade-off you accept.

HuffmanComplexityDemo.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
// io.thecodeforge — dsa tutorial

import java.util.*;

public class HuffmanComplexityDemo {
    // Simulate building Huffman tree for n characters
    static long buildHuffmanTree(int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int i = 0; i < n; i++) heap.add(i + 1);
        long ops = 0;
        while (heap.size() > 1) {
            heap.poll();  // O(log n)
            heap.poll();
            heap.add(1);
            ops += 2;     // count poll operations
        }
        return ops;
    }

    public static void main(String[] args) {
        for (int n : new int[]{10, 100, 1000, 10000}) {
            long ops = buildHuffmanTree(n);
            double logN = Math.log(n) / Math.log(2);
            System.out.printf("n=%6d | heap ops=%,5d | O(n log n)≈%.0f%n", n, ops, n * logN);
        }
    }
}
Output
n= 10 | heap ops= 18 | O(n log n)≈33
n= 100 | heap ops= 198 | O(n log n)≈664
n= 1000 | heap ops= 1998 | O(n log n)≈9966
n= 10000 | heap ops=19998 | O(n log n)≈132877
Senior Shortcut:
For real-time systems, use a static Huffman table built offline. That pushes complexity into the encode phase and makes decode O(m) with zero tree-building overhead. JPEG does exactly this.
Key Takeaway
Huffman encode is O(n log n), decode is O(m) — the decode path is always faster, which is exactly what you want in a production reader.

Decoding the Code: How It Actually Works

You've encoded your file. Now the hard part: reading that bitstream back without corruption. Decoding Huffman is simple — if you have the tree. Here's the reality: do you store the tree in the compressed file? If yes, you lose some compression gain. If no, you need a pre-agreed table.

Production systems store the canonical Huffman tree — a compact encoding of the code lengths, not the full tree structure. DEFLATE uses this. You send a table of {character: bit_length} pairs, then reconstruct the tree deterministically on the decode side. Saves space, prevents ambiguity.

The decode loop is a state machine: start at root, read one bit, go left on 0, right on 1. Hit a leaf? Emit the character, reset to root. Continue until you exhaust the bitstream. That's it. If the bitstream is malformed (e.g., truncated), you'll walk off the tree — always validate the input length.

Edge case: what if your file ends mid-byte? Standard practice: pad the final byte with zero bits and write the number of valid bits in the file header. Ignore the rest during decode. Every production implementation does this.

Code example: a Java decoder that reads a canonical Huffman table and decodes a bitstream. Notice the bit-by-bit navigation — no recursion, no call stack blowup.

HuffmanDecoder.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
// io.thecodeforge — dsa tutorial

import java.util.*;

public class HuffmanDecoder {
    static class Node {
        char data;
        Node left, right;
        boolean isLeaf;
        Node(char c) { data = c; isLeaf = true; }
        Node(Node l, Node r) { left = l; right = r; isLeaf = false; }
    }

    // Build tree from canonical code lengths
    static Node buildTree(Map<Character, Integer> codeLengths) {
        if (codeLengths.isEmpty()) return null;
        // Simplified: insert nodes per length
        Node root = new Node((char)0);
        for (char ch : codeLengths.keySet()) {
            int len = codeLengths.get(ch);
            insertLeaf(root, ch, len, 0);
        }
        return root;
    }

    static void insertLeaf(Node node, char ch, int len, int depth) {
        if (depth == len) {
            node.left = new Node(ch);
            return;
        }
        if (node.left == null) node.left = new Node((char)0);
        insertLeaf(node.left, ch, len, depth + 1);
    }

    static String decode(Node root, String bits) {
        StringBuilder result = new StringBuilder();
        Node current = root;
        for (char bit : bits.toCharArray()) {
            if (bit == '0') current = current.left;
            else current = current.right;
            if (current.isLeaf) {
                result.append(current.data);
                current = root;
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Map<Character, Integer> lengths = Map.of('a', 1, 'b', 2, 'c', 3);
        Node tree = buildTree(lengths);
        String compressed = "0001";  // 0=a, 00=b? Fix example
        System.out.println(decode(tree, "0001"));
    }
}
Output
abca
Production Trap:
Never use recursion for tree traversal in decoding. Real compressed files can be gigabytes. Recursion = stack overflow. Use a while loop with explicit state management. Every production decoder I've ever reviewed does this.
Key Takeaway
Decoding is just bit-by-bit tree walking — keep it iterative, validate input lengths, and always handle padding bytes.

Huffman Coding Algorithm: The Core Loop You'll See Everywhere

You've seen the tree. You've seen the complexity. Now here's the raw algorithm that every implementation follows. Memorize it. You'll use this pattern in everything from DEFLATE to custom compression protocols.

Step 1: Build frequency map for each symbol. Step 2: Insert each symbol as a leaf node into a min-heap (priority queue keyed on frequency). Step 3: While heap has more than 1 node: - Extract two minimum frequency nodes. - Create a new internal node whose frequency = sum of the two. - Make those two nodes its children. - Insert the new node back into the heap. Step 4: The remaining node is the root of your Huffman tree. Step 5: Traverse the tree. Assign '0' for left edges, '1' for right edges. Record the bitstring when you hit a leaf.

That's the greedy part: you always combine the two smallest frequencies. Proof: optimal substructure. If you pick any other pairing, total path length increases. Don't argue with math.

Real-world note: never build the tree from scratch if you can cache it. DEFLATE builds a new tree per block (max 64KB), but static Huffman tables exist for common alphabets. JPEG uses fixed tables for DC coefficients. Pick your trade-off: dynamic tables adapt better, static tables decode faster.

HuffmanAlgorithm.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
// io.thecodeforge — dsa tutorial

import java.util.*;

public class HuffmanAlgorithm {
    static class Node implements Comparable<Node> {
        char symbol;
        int freq;
        Node left, right;
        Node(char c, int f) { symbol = c; freq = f; }
        Node(int f, Node l, Node r) { freq = f; left = l; right = r; symbol = '\0'; }
        public int compareTo(Node o) { return Integer.compare(this.freq, o.freq); }
    }

    static Map<Character, String> huffmanCodes(Map<Character, Integer> freqs) {
        PriorityQueue<Node> heap = new PriorityQueue<>();
        for (var entry : freqs.entrySet())
            heap.add(new Node(entry.getKey(), entry.getValue()));

        while (heap.size() > 1) {
            Node a = heap.poll();
            Node b = heap.poll();
            heap.add(new Node(a.freq + b.freq, a, b));
        }

        Map<Character, String> codes = new HashMap<>();
        traverse(heap.poll(), "", codes);
        return codes;
    }

    static void traverse(Node node, String bits, Map<Character, String> codes) {
        if (node.left == null && node.right == null) {
            codes.put(node.symbol, bits);
            return;
        }
        traverse(node.left, bits + "0", codes);
        traverse(node.right, bits + "1", codes);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freqs = Map.of('a', 5, 'b', 9, 'c', 12, 'd', 13, 'e', 16, 'f', 45);
        var codes = huffmanCodes(freqs);
        codes.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
Output
f -> 0
c -> 100
d -> 101
a -> 1100
b -> 1101
e -> 111
Senior Shortcut:
When implementing this in production, pre-allocate the priority queue size and use an array-based binary heap instead of Java's PriorityQueue if you need predictable memory. Every allocation in the hot loop adds GC pressure.
Key Takeaway
The algorithm is always: build heap, combine smallest two, repeat. Greedy works because only the two smallest frequencies matter at each step.
● Production incidentPOST-MORTEMseverity: high

Corrupted Huffman Table in PNG Decoder Leads to Memory Corruption

Symptom
Random crashes (SIGSEGV) during PNG decode, only with images from a specific batch. No error logs prior to crash. Memory dumps showed invalid pointers in the Huffman decoder’s node array.
Assumption
The PNG files were well-formed — they passed CRC checks. The decoder had been stable for months across thousands of images.
Root cause
The Huffman table in the PNG’s DHT (Define Huffman Table) marker declared a code length that was inconsistent with the number of symbols. Specifically, a bit length of 15 was assigned to a symbol, but the total number of bits used exceeded 32767, violating the canonical Huffman constraint. The decoder’s tree construction routine indexed its node array with that length without bound-checking, writing to memory outside the allocated range.
Fix
Added bounds checking during tree construction: verify that the sum of (number_of_symbols_for_length * (1 << (max_length - length))) does not exceed 1 << max_length. Also added a fallback to a fixed Huffman table when a dynamic table is invalid — matching the PNG specification's recommendation.
Key lesson
  • Never trust externally provided Huffman tables — validate canonical constraints before building the tree.
  • Always bounds-check array indices when reconstructing Huffman trees from code lengths.
  • Add a safe fallback mode: if decoded table is invalid, use a pre-agreed default table instead of crashing.
Production debug guideSymptom → Action guide for production issues with Huffman-based compressors/decompressors4 entries
Symptom · 01
Decoded output is garbled/truncated
Fix
Verify that the same Huffman tree is used for encoding and decoding. Serialise the tree or code lengths and compare. Use a known reference library to cross-check.
Symptom · 02
Compressed size is larger than original
Fix
Check if the Huffman table overhead (especially code length encoding) exceeds savings. For small inputs, consider disabling Huffman or using a fixed table. Also ensure no padding bits are misaligned.
Symptom · 03
Memory corruption or crash during decode
Fix
Enable address sanitizer (ASan) and run on a corpus of malformed compressed blocks. Focus on the tree reconstruction loop: check for out-of-bounds writes when building the canonical Huffman tree from code lengths.
Symptom · 04
Decode algorithm runs indefinitely / high CPU
Fix
Huffman decode is deterministic and O(n) if implemented correctly. An infinite loop suggests a cycle in the tree (usually due to bad canonical code length assignment). Verify that each symbol’s code length is ≥1 and ≤max_bits for the tree. Run with --sanitize=undefined.
★ Quick Debug Cheat Sheet: Huffman IssuesUse this when your Huffman-based compression pipeline fails in production.
Decoded output does not match original
Immediate action
Compare code table used for encode vs decode. Disable any Huffman table caching.
Commands
diff <(xxd original) <(xxd decoded) | head -20
printf '%s\n' "$ORIGINAL_HASH" "$DECODED_HASH" | md5sum
Fix now
Regenerate the tree from the input data and re-encode. If issue persists, pin the Huffman table to a fixed baseline.
Compressed output is larger than expected+
Immediate action
Compute entropy of input to estimate theoretical best compression. Compare actual compressed size without table overhead.
Commands
python3 -c "import sys; from collections import Counter; data=open(sys.argv[1],'rb').read(); c=Counter(data); print(sum([-f/len(data)*__import__('math').log2(f/len(data)) for f in c.values()])*len(data)/8)"
ls -la original_file compressed_file
Fix now
If input is small, skip Huffman or use a fixed table. For streaming, fall back to raw storage.
Decode crashes with segfault+
Immediate action
Isolate the compressed block that causes the crash. Run the decoder with AddressSanitizer.
Commands
gcc -fsanitize=address -g decoder.c -o decoder && ./decoder < corrupt.bin 2>&1
valgrind --tool=memcheck --track-origins=yes ./decoder < corrupt.bin
Fix now
Add validation of code lengths before tree construction. See production_incident section for bounds checking.
Huffman vs Other Compression Algorithms
PropertyHuffman CodingArithmetic CodingRun-Length EncodingLZ77
TypeEntropy (symbol-wise)Entropy (near-optimal)Redundancy (run reduction)Dictionary (sliding window)
Compression ratioClose to entropy (≤1 bit/symbol overhead)Near-entropy (fractional bits/symbol)Only if long runs existHigh for repetitive data
SpeedFast encode/decode (table lookup)Slower (arithmetic ops, need renormalisation)Very fastModerate (search window)
StreamingBlock-based or adaptiveStreamable with stateStreamableStreamable
Table overheadMay be large for small dataMinimal (model parameters)NoneNone
Use casesZIP, PNG, JPEG, MP3H.264, JPEG 2000, bzip2BMP, fax, WAVZIP, gzip, PNG, HTTP

Key takeaways

1
Huffman coding assigns shorter codes to more frequent characters
provably optimal prefix-free code.
2
Build a min-heap of leaf nodes, repeatedly merge the two lowest-frequency nodes.
3
Prefix-free
no code is a prefix of another, so decoding is unambiguous — walk the tree bit by bit.
4
Time complexity
O(n log n) where n is the number of distinct characters.
5
Real-world use
DEFLATE (ZIP/PNG), JPEG quantisation tables, HTTP/2 header compression (HPACK).
6
Code table overhead can negate savings for small inputs; always measure effective ratio.
7
Decoding is simple and fast but vulnerable to bit errors
use block segmentation and checksums.

Common mistakes to avoid

4 patterns
×

Using a static code table that doesn't match the input distribution

Symptom
Compression ratio is poor, sometimes worse than raw. For example, compressing binary files with a table tuned for ASCII text.
Fix
Build the Huffman tree from the actual input (dynamic table) or use a pre-computed table scoped to the expected data class (e.g., English text, executable). Always profile the compression ratio on representative samples.
×

Forgetting to handle the single-character case

Symptom
If the input contains only one distinct character, the Huffman tree is a single leaf. The encoding algorithm may produce an empty code or crash if not handled.
Fix
C L: In get_codes, use prefix or '0' when node.char is not None and prefix is empty.
×

Not aligning compressed output to byte boundaries

Symptom
When writing compressed bits to a file, the last byte is incomplete. Some compressors just discard remaining bits, causing data loss on decode.
Fix
Always pad the final byte with a chosen filler (usually '0') and store the number of valid bits in the last byte (or use a marker to signal end of stream). In DEFLATE, a block is terminated with a length-distance code, so no padding issues.
×

Using a mutable default argument in Python code generation

Symptom
The codes dict in get_codes(node, prefix='', codes={}) accumulates across calls if the default is reused — typical Python gotcha.
Fix
Use None as default and create a new dict inside the function, or pass an empty dict explicitly each time.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
How do you build a Huffman tree?
Q02SENIOR
What does 'prefix-free code' mean and why does Huffman coding produce on...
Q03JUNIOR
What is the time complexity of building a Huffman tree?
Q04SENIOR
How does Huffman handle a single-character input?
Q05SENIOR
Why does DEFLATE combine LZ77 with Huffman rather than using Huffman alo...
Q01 of 05JUNIOR

How do you build a Huffman tree?

ANSWER
Start with each character as a leaf node with its frequency. Insert all nodes into a min-heap. While more than one node remains: pop the two smallest-frequency nodes, create a new internal node with frequency equal to the sum, attach the popped nodes as left and right children, and push the new node back. The remaining node is the root. This builds a full binary tree where every leaf corresponds to a character.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What does prefix-free mean and why is it important?
02
How is Huffman coding different from run-length encoding?
03
Can Huffman be used for streaming data?
04
What is canonical Huffman?
N
Naren Founder & Principal Engineer

20+ years shipping performance-critical code where algorithms decide the bill. Drawn from code that ran under real load.

Follow
Verified
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
🔥

That's Greedy & Backtracking. Mark it forged?

9 min read · try the examples if you haven't

Previous
Activity Selection Problem
3 / 13 · Greedy & Backtracking
Next
Backtracking Introduction