Huffman Encoding — Greedy Data Compression Algorithm
Every time you compress a file with gzip, zip, or bzip2, Huffman coding is running somewhere in the pipeline. JPEG uses Huffman for its DCT coefficients. MP3 uses a Huffman variant for quantised audio data. The algorithm dates from 1952 and is still in active production use in 2026 because it produces optimal prefix-free codes: among codes that assign each symbol a fixed, whole number of bits, you cannot do better.
David Huffman invented this as a term paper assignment at MIT. His professor (Robert Fano) had worked on an earlier approach (Shannon-Fano coding) and assigned the problem thinking it was unsolvable. Huffman's insight — build the tree bottom-up from the least frequent symbols using a min-heap — was simpler and more elegant than Fano's top-down approach and provably optimal.
The Greedy Strategy — Min-Heap Merging
Build a min-heap from (frequency, character) pairs. Repeatedly extract the two minimum nodes and merge them into a new node with combined frequency. Repeat until one node remains — the root.
```python
import heapq
from collections import Counter

class HNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = self.right = None

    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text: str) -> HNode:
    freq = Counter(text)
    heap = [HNode(ch, f) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = HNode(None, left.freq + right.freq)
        merged.left, merged.right = left, right
        heapq.heappush(heap, merged)
    return heap[0]

def get_codes(root: HNode) -> dict[str, str]:
    codes = {}
    def dfs(node, code):
        if node.char is not None:
            codes[node.char] = code or '0'  # single-symbol edge case
            return
        dfs(node.left, code + '0')
        dfs(node.right, code + '1')
    dfs(root, '')
    return codes

text = 'huffman encoding'
tree = build_huffman_tree(text)
codes = get_codes(tree)
for ch, code in sorted(codes.items()):
    print(f'{repr(ch)}: {code}')
```
One possible output (nodes with equal frequency can be merged in any order, so the exact codes vary with heap tie-breaking; the code lengths and total encoded size are what the algorithm guarantees):

```
' ': 1000
'a': 0111
'c': 1010
'd': 1100
'e': 1001
'f': 001
'g': 000
'h': 0100
'i': 1101
'm': 0110
'n': 111
'o': 1011
'u': 0101
```
Encoding and Decoding
Encoding: replace each character with its Huffman code. Decoding: traverse the tree bit by bit.
```python
def huffman_encode(text: str, codes: dict) -> str:
    return ''.join(codes[ch] for ch in text)

def huffman_decode(encoded: str, root: HNode) -> str:
    result = []
    node = root
    for bit in encoded:
        node = node.left if bit == '0' else node.right
        if node.char is not None:  # reached a leaf: emit and restart at root
            result.append(node.char)
            node = root
    return ''.join(result)

encoded = huffman_encode('huffman', codes)
print(f'Encoded: {encoded}')
print(f'Bits: {len(encoded)} vs naive {len("huffman")*8}')
print(f'Decoded: {huffman_decode(encoded, tree)}')
```
Output (using the codes from one possible tie-breaking; the exact bit pattern and count shift slightly under another, but the decoded text always round-trips):

```
Encoded: 0100010100100101100111111
Bits: 25 vs naive 56
Decoded: huffman
```
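The encoder above produces a Python string of `'0'`/`'1'` characters, which is eight times larger than the data it represents; a real compressor packs those bits into bytes. A minimal sketch (`pack_bits` is a hypothetical helper name, and a real container format would also record the bit count so the decoder can ignore the padding):

```python
def pack_bits(bitstring: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, zero-padding the
    final byte. (Hypothetical helper: a real format must also store the
    bit count so the decoder knows where the data ends.)"""
    padded = bitstring + '0' * (-len(bitstring) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

print(pack_bits('00000001'))  # b'\x01'
print(pack_bits('1'))         # b'\x80': '1' padded to '10000000'
```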
Why Greedy is Optimal
Huffman's algorithm produces the optimal prefix-free code. Proof sketch (exchange argument): if x and y are the two lowest-frequency characters, some optimal tree has x and y as siblings at the deepest level; if they are not already there, swapping them with the deepest pair cannot increase the total cost. Merging them first therefore preserves optimality at every step.
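For small alphabets the claim is easy to sanity-check by brute force: enumerate every leaf-depth profile of a full binary tree, try every assignment of frequencies to leaves, and compare the best total cost against the greedy result. A sketch (the function names here are illustrative, not from any library):

```python
import heapq
from itertools import permutations

def huffman_cost(freqs):
    """Total encoded length = sum of every merged pair's combined weight."""
    heap = list(freqs)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        cost += a + b
        heapq.heappush(heap, a + b)
    return cost

def depth_profiles(n):
    """All leaf-depth multisets of full binary trees with n leaves."""
    if n == 1:
        return {(0,)}
    out = set()
    for k in range(1, n):
        for l in depth_profiles(k):
            for r in depth_profiles(n - k):
                out.add(tuple(sorted(d + 1 for d in l + r)))
    return out

def best_prefix_cost(freqs):
    """Exhaustive minimum over every tree shape and leaf assignment."""
    return min(sum(f * d for f, d in zip(p, ds))
               for ds in depth_profiles(len(freqs))
               for p in permutations(freqs))

freqs = [5, 3, 2, 1, 1]
print(huffman_cost(freqs), best_prefix_cost(freqs))  # both 25
```

The exhaustive search is exponential, so it only works for toy inputs, but it agrees with the greedy answer on every frequency set you feed it.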
Complexity
- Build tree: O(n log n), where n is the number of distinct symbols (each of the n − 1 merges costs O(log n) heap work)
- Encode: O(m), where m is the text length
- Decode: O(m)
The compression ratio depends on the entropy of the source. For English text, Huffman reaches roughly 4.5 bits/character versus 8-bit ASCII, a size reduction of about 44%.
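The entropy bound can be checked directly on this article's example string. The total encoded length is computed here with the standard trick that it equals the sum of each merged pair's combined frequency, so no tree is needed:

```python
import heapq
import math
from collections import Counter

text = 'huffman encoding'
freqs = list(Counter(text).values())
total = sum(freqs)

# Shannon entropy: the lower bound on average bits per symbol.
entropy = -sum(f / total * math.log2(f / total) for f in freqs)

# Total Huffman-encoded length = sum of each merge's combined frequency.
heap = freqs[:]
heapq.heapify(heap)
bits = 0
while len(heap) > 1:
    a, b = heapq.heappop(heap), heapq.heappop(heap)
    bits += a + b
    heapq.heappush(heap, a + b)

print(f'entropy: {entropy:.3f} bits/char')   # ~3.578
print(f'huffman: {bits / total:.3f} bits/char')  # 3.625, within 1 bit of entropy
```

Huffman's average length always lands in [H, H + 1), where H is the entropy; arithmetic coding closes that last sub-bit gap.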
🎯 Key Takeaways
- Always merge the two lowest-frequency nodes first — the min-heap handles this in O(log n) per merge, O(n log n) total. This greedy choice is provably optimal by the exchange argument.
- The resulting tree is prefix-free by construction — characters are only at leaves, so no code is a prefix of another. This is what enables unambiguous decoding without separators.
- Huffman codes are optimal among prefix-free codes but not optimal overall. Arithmetic coding can do better for skewed distributions by assigning fractional bits.
- Production usage: DEFLATE (gzip/zip) uses Huffman after LZ77. JPEG uses Huffman after DCT. The compressed data stream you send over HTTPS contains Huffman-coded data.
- Canonical Huffman codes (as used in DEFLATE) are transmitted as code lengths, not the full tree — a standard optimisation to reduce header overhead.
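As a sketch of that last idea: given only a code length per symbol, canonical assignment rebuilds a valid prefix-free code by handing out consecutive integers in (length, symbol) order, left-shifting whenever the length grows. This mirrors the scheme DEFLATE uses (RFC 1951), though the version below is simplified:

```python
def canonical_codes(lengths: dict[str, int]) -> dict[str, str]:
    """Rebuild a prefix-free code from code lengths alone: sort symbols
    by (length, symbol), assign consecutive integers, and shift left
    each time the code length increases."""
    order = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for s in order:
        code <<= lengths[s] - prev_len
        codes[s] = format(code, f'0{lengths[s]}b')
        code += 1
        prev_len = lengths[s]
    return codes

print(canonical_codes({'a': 1, 'b': 2, 'c': 3, 'd': 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Because both sides can reproduce the codes from the lengths alone, the compressor never ships the tree itself.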
Interview Questions on This Topic
- Q: Why does always merging the two least-frequent nodes produce an optimal code?
- Q: What is a prefix-free code and why does Huffman produce one?
- Q: What is the time complexity of building a Huffman tree?
- Q: How would you decode a Huffman-encoded string given only the encoded bits and the tree?
Frequently Asked Questions
What if all characters have the same frequency?
Huffman produces a balanced binary tree. When n is a power of two, every code has exactly length log₂ n; otherwise the lengths are a mix of ⌊log₂ n⌋ and ⌈log₂ n⌉. This is still optimal: any prefix-free code for n equiprobable symbols needs at least log₂ n bits on average.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.