
Huffman Encoding — Greedy Data Compression Algorithm

📍 Part of: Greedy & Backtracking → Topic 11 of 13
Learn Huffman encoding — how it builds a variable-length prefix code using a greedy min-heap approach.
⚙️ Intermediate — basic DSA knowledge assumed
In this tutorial, you'll learn:
  • Why greedily merging the two lowest-frequency nodes is provably optimal, and how a min-heap makes each merge O(log n)
  • How placing characters only at leaves makes the code prefix-free, so it decodes unambiguously without separators
  • Where Huffman sits in the wider landscape: optimal among prefix-free codes, but beatable by arithmetic coding on skewed distributions
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
Morse code uses shorter signals for common letters (E=dot, T=dash) and longer ones for rare letters. Huffman encoding does the same thing for binary data — assigns shorter bit strings to more frequent characters and longer ones to rare ones. The greedy strategy always merges the two least-frequent items first, ensuring optimal prefix codes.

Every time you compress a file with gzip, zip, or bzip2, Huffman coding is running somewhere in the pipeline. JPEG uses Huffman for its DCT coefficients. MP3 uses a Huffman variant for quantised audio data. The algorithm dates from 1952 and is still in active production use today because it achieves optimal prefix-free codes: no code that assigns a whole number of bits to each symbol can do better.

David Huffman invented this as a term paper assignment at MIT. His professor (Robert Fano) had worked on an earlier approach (Shannon-Fano coding) and assigned the problem thinking it was unsolvable. Huffman's insight — build the tree bottom-up from the least frequent symbols using a min-heap — was simpler and more elegant than Fano's top-down approach and provably optimal.

The Greedy Strategy — Min-Heap Merging

Build a min-heap from (frequency, character) pairs. Repeatedly extract the two minimum nodes and merge them into a new node with combined frequency. Repeat until one node remains — the root.

huffman.py · PYTHON
import heapq
from collections import Counter

class HNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = self.right = None
    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text: str) -> HNode:
    freq = Counter(text)
    heap = [HNode(ch, f) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Greedy step: always merge the two lowest-frequency nodes.
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = HNode(None, left.freq + right.freq)  # internal node, no character
        merged.left, merged.right = left, right
        heapq.heappush(heap, merged)
    return heap[0]  # the last remaining node is the root

def get_codes(root: HNode) -> dict[str, str]:
    codes = {}
    def dfs(node, code):
        if node.char is not None:
            codes[node.char] = code or '0'  # leaf; '0' covers a one-symbol alphabet
            return
        dfs(node.left,  code + '0')
        dfs(node.right, code + '1')
    dfs(root, '')
    return codes

text = 'huffman encoding'
tree = build_huffman_tree(text)
codes = get_codes(tree)
for ch, code in sorted(codes.items()):
    print(f'{repr(ch)}: {code}')
▶ Output (one valid optimal code; the exact codes depend on how the heap breaks ties between equal frequencies, but every variant uses the same total bit count)
' ': 010
'a': 0110
'c': 0111
'd': 1000
'e': 1001
'f': 000
'g': 1010
'h': 1011
'i': 1100
'm': 1101
'n': 001
'o': 1110
'u': 1111

Encoding and Decoding

Encoding: replace each character with its Huffman code. Decoding: traverse the tree bit by bit, restarting at the root after each leaf.

huffman_codec.py · PYTHON
def huffman_encode(text: str, codes: dict) -> str:
    return ''.join(codes[ch] for ch in text)

def huffman_decode(encoded: str, root: HNode) -> str:
    result = []
    node = root
    for bit in encoded:
        node = node.left if bit == '0' else node.right
        if node.char is not None:  # reached a leaf: emit and restart at the root
            result.append(node.char)
            node = root
    return ''.join(result)

encoded = huffman_encode('huffman', codes)
print(f'Encoded: {encoded}')
print(f'Bits: {len(encoded)} vs naive {len("huffman")*8}')
print(f'Decoded: {huffman_decode(encoded, tree)}')
▶ Output (with the codes above)
Encoded: 1011111100000011010110001
Bits: 25 vs naive 56
Decoded: huffman
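Note that the encoded value above is a Python string of '0' and '1' characters, which occupies one byte per bit. A real compressor packs the bits into bytes; here is a minimal sketch (pack_bits and unpack_bits are illustrative names, not part of any standard library):

```python
def pack_bits(bits: str) -> tuple[bytes, int]:
    """Pack a '0'/'1' string into bytes, zero-padding the final byte."""
    padding = (-len(bits)) % 8
    padded = bits + '0' * padding
    data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return data, padding  # the pad count must be stored alongside the data

def unpack_bits(data: bytes, padding: int) -> str:
    """Recover the original bit string, dropping the pad bits."""
    bits = ''.join(format(byte, '08b') for byte in data)
    return bits[:len(bits) - padding] if padding else bits

data, pad = pack_bits('1011001')
print(len(data), pad)          # 1 1 (one byte, one pad bit)
print(unpack_bits(data, pad))  # 1011001
```

The pad count travels with the data because a trailing run of zero bits is otherwise indistinguishable from real codes.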

Why Greedy is Optimal

Huffman's algorithm produces the optimal prefix-free code. Proof sketch (a classic exchange argument): if x and y are the two lowest-frequency characters, there exists an optimal code in which x and y are siblings at the deepest level, so merging them first is always safe; repeating the argument at each step shows the greedy choice never loses optimality.
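You can sanity-check optimality on a small input by comparing the greedy cost against an exhaustive search that tries every possible merge order. A sketch, assuming the classic six-frequency example from CLRS (huffman_cost and exhaustive_cost are illustrative names):

```python
import heapq
from functools import lru_cache

def huffman_cost(freqs: tuple) -> int:
    """Total encoded bits = sum of all merged-node frequencies (greedy)."""
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

@lru_cache(maxsize=None)
def exhaustive_cost(freqs: tuple) -> int:
    """Try every pair merge at every step -- exponential, small inputs only."""
    if len(freqs) <= 1:
        return 0
    best = float('inf')
    for i in range(len(freqs)):
        for j in range(i + 1, len(freqs)):
            rest = tuple(f for k, f in enumerate(freqs) if k not in (i, j))
            merged = tuple(sorted(rest + (freqs[i] + freqs[j],)))
            best = min(best, freqs[i] + freqs[j] + exhaustive_cost(merged))
    return best

freqs = (5, 9, 12, 13, 16, 45)
print(huffman_cost(freqs), exhaustive_cost(freqs))  # 224 224
```

Both agree: greedily merging the two minima never does worse than any other merge order.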

🔥
Prefix-Free Property: Huffman codes are prefix-free, meaning no code is a prefix of another. This guarantees unambiguous decoding without separators. The trie structure naturally enforces this, since characters appear only at leaves.

Complexity

Build tree: O(n log n), where n is the number of distinct characters (about n heap operations at O(log n) each). Encode: O(m), where m is the text length. Decode: O(m).

Compression ratio depends on entropy of the source. For English text, Huffman achieves ~4.5 bits/character vs 8 bits ASCII — about 44% compression.
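The entropy bound can be checked directly: Huffman's average code length always lies between the Shannon entropy H and H + 1 bits per symbol. A small sketch using this tutorial's own sample string (huffman_total_bits is an illustrative helper):

```python
import heapq
import math
from collections import Counter

def huffman_total_bits(freqs: list[int]) -> int:
    """Total bits to encode the source = sum of all merge sums."""
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

text = 'huffman encoding'
freqs = list(Counter(text).values())
n = len(text)
avg = huffman_total_bits(freqs) / n
entropy = -sum(f / n * math.log2(f / n) for f in freqs)
print(f'entropy: {entropy:.3f} bits/char')
print(f'huffman: {avg:.3f} bits/char')  # entropy <= avg < entropy + 1
```

For this input the averages are close (about 3.58 vs 3.63 bits/char), which is typical when no single symbol dominates.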

🎯 Key Takeaways

  • Always merge the two lowest-frequency nodes first — the min-heap handles this in O(log n) per merge, O(n log n) total. This greedy choice is provably optimal by the exchange argument.
  • The resulting tree is prefix-free by construction — characters are only at leaves, so no code is a prefix of another. This is what enables unambiguous decoding without separators.
  • Huffman codes are optimal among prefix-free codes but not optimal overall. Arithmetic coding can do better for skewed distributions by assigning fractional bits.
  • Production usage: DEFLATE (gzip/zip) uses Huffman after LZ77. JPEG uses Huffman after DCT. The compressed data stream you send over HTTPS contains Huffman-coded data.
  • Canonical Huffman codes (as used in DEFLATE) are transmitted as code lengths, not the full tree — a standard optimisation to reduce header overhead.
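The canonical form mentioned above can be rebuilt from code lengths alone: sort symbols by (length, symbol), hand out consecutive integers, and left-shift whenever the length grows. A minimal sketch (canonical_codes is an illustrative name, not the DEFLATE implementation):

```python
def canonical_codes(lengths: dict[str, int]) -> dict[str, str]:
    """Reconstruct canonical prefix codes from code lengths alone."""
    codes = {}
    code = 0
    prev_len = 0
    for length, sym in sorted((l, s) for s, l in lengths.items()):
        code <<= (length - prev_len)  # widen the code when the length increases
        codes[sym] = format(code, f'0{length}b')
        code += 1
        prev_len = length
    return codes

print(canonical_codes({'a': 1, 'b': 2, 'c': 3, 'd': 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Because the assignment is fully determined by the lengths, a decoder only needs one small integer per symbol to rebuild the exact same codes.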

Interview Questions on This Topic

  • Q: Why does always merging the two least-frequent nodes produce an optimal code?
  • Q: What is a prefix-free code and why does Huffman produce one?
  • Q: What is the time complexity of building a Huffman tree?
  • Q: How would you decode a Huffman-encoded string given only the encoded bits and the tree?

Frequently Asked Questions

What if all characters have the same frequency?

When n is a power of two, Huffman produces a balanced binary tree where every code has length log₂ n. Otherwise the code lengths differ by at most one bit (⌊log₂ n⌋ or ⌈log₂ n⌉). Either way the result is still optimal: no prefix-free code for n equiprobable symbols can achieve a smaller average length.
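You can confirm this with a quick sketch that tracks code lengths during the merge (code_lengths is an illustrative helper; the counter exists only to break frequency ties in the heap):

```python
import heapq
from itertools import count

def code_lengths(freqs: dict[str, int]) -> dict[str, int]:
    """Huffman code length per symbol, without building an explicit tree."""
    tie = count()  # tiebreaker so equal frequencies never compare the lists
    depth = {s: 0 for s in freqs}
    heap = [(f, next(tie), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # every symbol in the merged subtree sinks one level
            depth[s] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), s1 + s2))
    return depth

lengths = code_lengths({ch: 1 for ch in 'abcdefgh'})
print(lengths)  # all eight equiprobable symbols get 3-bit codes
```

With eight equal frequencies, every symbol participates in exactly three merges, so every code is exactly log₂ 8 = 3 bits.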

Naren, Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Job Sequencing Problem with Deadlines | Next: Interval Scheduling and Meeting Rooms Problem →
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged