Huffman Encoding — Greedy Data Compression Algorithm
Every time you compress a file with gzip, zip, or bzip2, Huffman coding is running somewhere in the pipeline. JPEG uses Huffman for its DCT coefficients. MP3 uses a Huffman variant for quantised audio data. The algorithm dates from 1952 and is still in active production use in 2026 because it produces optimal prefix-free codes: among codes that assign each symbol a fixed, whole number of bits, you cannot do better.
David Huffman invented this as a term paper assignment at MIT. His professor (Robert Fano) had worked on an earlier approach (Shannon-Fano coding) and assigned the problem thinking it was unsolvable. Huffman's insight — build the tree bottom-up from the least frequent symbols using a min-heap — was simpler and more elegant than Fano's top-down approach and provably optimal.
The Greedy Strategy — Min-Heap Merging
Build a min-heap from (frequency, character) pairs. Repeatedly extract the two minimum nodes and merge them into a new node with combined frequency. Repeat until one node remains — the root.
```python
import heapq
from collections import Counter

class HNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = self.right = None

    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text: str) -> HNode:
    freq = Counter(text)
    heap = [HNode(ch, f) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = HNode(None, left.freq + right.freq)
        merged.left, merged.right = left, right
        heapq.heappush(heap, merged)
    return heap[0]

def get_codes(root: HNode) -> dict[str, str]:
    codes = {}
    def dfs(node, code):
        if node.char is not None:
            codes[node.char] = code or '0'  # single-symbol edge case
            return
        dfs(node.left, code + '0')
        dfs(node.right, code + '1')
    dfs(root, '')
    return codes

text = 'huffman encoding'
tree = build_huffman_tree(text)
codes = get_codes(tree)
for ch, code in sorted(codes.items()):
    print(f'{repr(ch)}: {code}')
```
One possible output (nodes with equal frequency can be merged in any order, so the exact codes vary with heap tie-breaking; the code lengths and total encoded size are what the algorithm guarantees):

```
' ': 1000
'a': 0111
'c': 1010
'd': 1100
'e': 1001
'f': 001
'g': 000
'h': 0100
'i': 1101
'm': 0110
'n': 111
'o': 1011
'u': 0101
```
Encoding and Decoding
Encoding: replace each character with its Huffman code. Decoding: traverse the tree bit by bit.
```python
def huffman_encode(text: str, codes: dict) -> str:
    return ''.join(codes[ch] for ch in text)

def huffman_decode(encoded: str, root: HNode) -> str:
    result = []
    node = root
    for bit in encoded:
        node = node.left if bit == '0' else node.right
        if node.char is not None:  # reached a leaf: emit and restart at root
            result.append(node.char)
            node = root
    return ''.join(result)

encoded = huffman_encode('huffman', codes)
print(f'Encoded: {encoded}')
print(f'Bits: {len(encoded)} vs naive {len("huffman")*8}')
print(f'Decoded: {huffman_decode(encoded, tree)}')
```
Output (using the codes from one possible tie-breaking; the exact bit pattern and count shift slightly under another, but the decoded text always round-trips):

```
Encoded: 0100010100100101100111111
Bits: 25 vs naive 56
Decoded: huffman
```
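The encoder above produces a Python string of `'0'`/`'1'` characters, which is eight times larger than the data it represents; a real compressor packs those bits into bytes. A minimal sketch (`pack_bits` is a hypothetical helper name, and a real container format would also record the bit count so the decoder can ignore the padding):

```python
def pack_bits(bitstring: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, zero-padding the
    final byte. (Hypothetical helper: a real format must also store the
    bit count so the decoder knows where the data ends.)"""
    padded = bitstring + '0' * (-len(bitstring) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

print(pack_bits('00000001'))  # b'\x01'
print(pack_bits('1'))         # b'\x80': '1' padded to '10000000'
```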
Why Greedy is Optimal
Huffman's algorithm produces the optimal prefix-free code. Proof sketch (exchange argument): if x and y are the two lowest-frequency characters, some optimal tree has x and y as siblings at the deepest level; if they are not already there, swapping them with the deepest pair cannot increase the total cost. Merging them first therefore preserves optimality at every step.
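For small alphabets the claim is easy to sanity-check by brute force: enumerate every leaf-depth profile of a full binary tree, try every assignment of frequencies to leaves, and compare the best total cost against the greedy result. A sketch (the function names here are illustrative, not from any library):

```python
import heapq
from itertools import permutations

def huffman_cost(freqs):
    """Total encoded length = sum of every merged pair's combined weight."""
    heap = list(freqs)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        cost += a + b
        heapq.heappush(heap, a + b)
    return cost

def depth_profiles(n):
    """All leaf-depth multisets of full binary trees with n leaves."""
    if n == 1:
        return {(0,)}
    out = set()
    for k in range(1, n):
        for l in depth_profiles(k):
            for r in depth_profiles(n - k):
                out.add(tuple(sorted(d + 1 for d in l + r)))
    return out

def best_prefix_cost(freqs):
    """Exhaustive minimum over every tree shape and leaf assignment."""
    return min(sum(f * d for f, d in zip(p, ds))
               for ds in depth_profiles(len(freqs))
               for p in permutations(freqs))

freqs = [5, 3, 2, 1, 1]
print(huffman_cost(freqs), best_prefix_cost(freqs))  # both 25
```

The exhaustive search is exponential, so it only works for toy inputs, but it agrees with the greedy answer on every frequency set you feed it.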
Complexity
- Build tree: O(n log n), where n is the number of distinct symbols (each of the n − 1 merges costs O(log n) heap work)
- Encode: O(m), where m is the text length
- Decode: O(m)
The compression ratio depends on the entropy of the source. For English text, Huffman reaches roughly 4.5 bits/character versus 8-bit ASCII, a size reduction of about 44%.
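The entropy bound can be checked directly on this article's example string. The total encoded length is computed here with the standard trick that it equals the sum of each merged pair's combined frequency, so no tree is needed:

```python
import heapq
import math
from collections import Counter

text = 'huffman encoding'
freqs = list(Counter(text).values())
total = sum(freqs)

# Shannon entropy: the lower bound on average bits per symbol.
entropy = -sum(f / total * math.log2(f / total) for f in freqs)

# Total Huffman-encoded length = sum of each merge's combined frequency.
heap = freqs[:]
heapq.heapify(heap)
bits = 0
while len(heap) > 1:
    a, b = heapq.heappop(heap), heapq.heappop(heap)
    bits += a + b
    heapq.heappush(heap, a + b)

print(f'entropy: {entropy:.3f} bits/char')   # ~3.578
print(f'huffman: {bits / total:.3f} bits/char')  # 3.625, within 1 bit of entropy
```

Huffman's average length always lands in [H, H + 1), where H is the entropy; arithmetic coding closes that last sub-bit gap.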
🎯 Key Takeaways
- Always merge the two lowest-frequency nodes first — the min-heap handles this in O(log n) per merge, O(n log n) total. This greedy choice is provably optimal by the exchange argument.
- The resulting tree is prefix-free by construction — characters are only at leaves, so no code is a prefix of another. This is what enables unambiguous decoding without separators.
- Huffman codes are optimal among prefix-free codes but not optimal overall. Arithmetic coding can do better for skewed distributions by assigning fractional bits.
- Production usage: DEFLATE (gzip/zip) uses Huffman after LZ77. JPEG uses Huffman after DCT. The compressed data stream you send over HTTPS contains Huffman-coded data.
- Canonical Huffman codes (as used in DEFLATE) are transmitted as code lengths, not the full tree — a standard optimisation to reduce header overhead.
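As a sketch of that last idea: given only a code length per symbol, canonical assignment rebuilds a valid prefix-free code by handing out consecutive integers in (length, symbol) order, left-shifting whenever the length grows. This mirrors the scheme DEFLATE uses (RFC 1951), though the version below is simplified:

```python
def canonical_codes(lengths: dict[str, int]) -> dict[str, str]:
    """Rebuild a prefix-free code from code lengths alone: sort symbols
    by (length, symbol), assign consecutive integers, and shift left
    each time the code length increases."""
    order = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for s in order:
        code <<= lengths[s] - prev_len
        codes[s] = format(code, f'0{lengths[s]}b')
        code += 1
        prev_len = lengths[s]
    return codes

print(canonical_codes({'a': 1, 'b': 2, 'c': 3, 'd': 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Because both sides can reproduce the codes from the lengths alone, the compressor never ships the tree itself.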
Interview Questions on This Topic
- Q: Why does always merging the two least-frequent nodes produce an optimal code?
- Q: What is a prefix-free code and why does Huffman produce one?
- Q: What is the time complexity of building a Huffman tree?
- Q: How would you decode a Huffman-encoded string given only the encoded bits and the tree?
Frequently Asked Questions
What if all characters have the same frequency?
Huffman produces a balanced binary tree. When n is a power of two, every code has exactly length log₂ n; otherwise the lengths are a mix of ⌊log₂ n⌋ and ⌈log₂ n⌉. This is still optimal: any prefix-free code for n equiprobable symbols needs at least log₂ n bits on average.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.