Huffman Coding
- Huffman coding assigns shorter codes to more frequent characters — provably optimal prefix-free code.
- Build a min-heap of leaf nodes, repeatedly merge the two lowest-frequency nodes.
- Prefix-free: no code is a prefix of another, so decoding is unambiguous — walk the tree bit by bit.
Huffman coding assigns shorter bit patterns to more frequent characters and longer patterns to rarer ones. Build a min-heap of (frequency, character) pairs, repeatedly merge the two lowest-frequency nodes into a parent, until one tree remains. The path from root to each leaf (0 for left, 1 for right) gives the code for that character.
Building the Huffman Tree
# Package: io.thecodeforge.python.dsa import heapq from collections import Counter class HuffmanNode: def __init__(self, char, freq): self.char = char self.freq = freq self.left = None self.right = None # Min-heap ordering by frequency def __lt__(self, other): return self.freq < other.freq def build_huffman_tree(text): freq = Counter(text) # Min-heap of leaf nodes heap = [HuffmanNode(char, f) for char, f in freq.items()] heapq.heapify(heap) while len(heap) > 1: # Take two nodes with lowest frequency left = heapq.heappop(heap) right = heapq.heappop(heap) # Merge into a parent node parent = HuffmanNode(None, left.freq + right.freq) parent.left = left parent.right = right heapq.heappush(heap, parent) return heap[0] # root of Huffman tree text = 'aabbbccddddeeeeee' root = build_huffman_tree(text) print(f'Root frequency: {root.freq}') # 17 (total characters)
Generating Codes and Encoding
def get_codes(node, prefix='', codes={}): if node.char is not None: # leaf codes[node.char] = prefix or '0' # handle single-char case else: get_codes(node.left, prefix + '0', codes) get_codes(node.right, prefix + '1', codes) return codes def huffman_encode(text): root = build_huffman_tree(text) codes = get_codes(root, '', {}) encoded = ''.join(codes[c] for c in text) return encoded, codes, root text = 'aabbbccddddeeeeee' encoded, codes, root = huffman_encode(text) print('Codes:') for char in sorted(codes): freq = text.count(char) print(f' {repr(char)}: {codes[char]:5s} (freq={freq}, bits={freq*len(codes[char])})') original_bits = len(text) * 8 compressed_bits = len(encoded) print(f'\nOriginal: {original_bits} bits') print(f'Compressed: {compressed_bits} bits') print(f'Ratio: {compressed_bits/original_bits:.1%}')
'a': 110 (freq=2, bits=6)
'b': 111 (freq=3, bits=9)
'c': 100 (freq=2, bits=6)
'd': 01 (freq=4, bits=8)
'e': 0 (freq=6, bits=6)
Original: 136 bits
Compressed: 35 bits
Ratio: 25.7%
🎯 Key Takeaways
- Huffman coding assigns shorter codes to more frequent characters — provably optimal prefix-free code.
- Build a min-heap of leaf nodes, repeatedly merge the two lowest-frequency nodes.
- Prefix-free: no code is a prefix of another, so decoding is unambiguous — walk the tree bit by bit.
- Time complexity: O(n log n) where n is the number of distinct characters.
- Real-world use: DEFLATE (ZIP/PNG), JPEG quantisation tables, HTTP/2 header compression (HPACK).
Interview Questions on This Topic
- QHow do you build a Huffman tree?
- QWhat does 'prefix-free code' mean and why does Huffman coding produce one?
- QWhat is the time complexity of building a Huffman tree?
Frequently Asked Questions
What does prefix-free mean and why is it important?
Prefix-free means no code is a prefix of any other code. If 'a' is '10' and 'ab' is '101', then seeing '10' in the bit stream is ambiguous — is it 'a' or the start of 'ab'? Huffman codes are inherently prefix-free because they are derived from leaf nodes in a tree — no leaf is an ancestor of another.
How is Huffman coding different from run-length encoding?
Run-length encoding replaces runs of identical characters with (character, count) pairs — efficient for data with long runs (like simple images). Huffman coding works on frequency of individual characters regardless of position. Real compression formats use both: DEFLATE (used in ZIP and PNG) combines Huffman coding with LZ77 sliding window compression.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.