Senior 3 min · March 24, 2026

Hamming Code — Double-Bit Error Mis-Correction

A stuck bit in ECC RAM is silently corrected; a second cosmic-ray flip causes Hamming mis-correction and undetected corruption.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Hamming code adds parity bits at power-of-2 positions to detect and correct single-bit errors.
  • Each parity bit covers a specific set of data bits; XOR checks produce a syndrome that points to the error location.
  • Minimum Hamming distance of 3 allows 1-bit correction; distance d corrects ⌊(d-1)/2⌋ errors.
  • Real-world use: ECC RAM corrects memory bit flips silently, preventing crashes.
  • Common mistake: assuming Hamming(7,4) handles burst errors — it cannot; it only fixes one bit per block.
Plain-English First

When data travels over a network or is stored on disk, bits can flip. Hamming code adds a few extra 'parity' bits that together act like a GPS for errors — they can pinpoint exactly which bit flipped and correct it. Richard Hamming invented this in 1950 after getting frustrated with punch card machines that crashed on single-bit errors.

In 1947, Richard Hamming was working at Bell Labs on the Model V relay computer. Every time the computer hit an error while he was away for the weekend, it would simply stop and wait for an operator to fix it — losing his entire Saturday batch job. Frustrated, he thought: 'If the machine can detect an error, why can't it correct it?' Over the next two years he developed Hamming codes — the first error-correcting codes.

Today, ECC (Error Correcting Code) RAM uses Hamming codes on every memory operation. Server-grade machines will not pass certification without ECC RAM. Every time you access memory on a server, a Hamming code is silently verifying and potentially correcting the read. The DRAM inside your laptop likely performs billions of Hamming-protected reads per second.

Hamming Distance

The Hamming distance between two binary strings is the number of positions where they differ. Hamming(7,4) has minimum distance 3 — any two valid codewords differ in at least 3 positions. This means: - Any 1-bit error produces an invalid codeword (detectable) - The invalid codeword is closest to exactly one valid codeword (correctable)

General rule: minimum distance d allows detecting d-1 errors and correcting ⌊(d-1)/2⌋ errors.

Think of codewords as points in space
  • Distance 3 means no two codewords share a nearest neighbour for single-bit errors.
  • If a received word has one flipped bit, it's still closer to the original than to any other codeword.
  • Two-bit errors can be detected (land halfway between two lights) but not corrected.
Production Insight
Real-world systems use distance 3 as the minimum for SECDED.
Larger distances (5, 7) enable multiple-bit correction but add overhead.
Rule: design for the dominant error mode: single-bit in memory, burst in wireless.
Key Takeaway
Minimum distance d = 2t+1 for t-bit correction.
Distance 3 enables 1-bit correction and 2-bit detection.
Punchline: never use a code without knowing its minimum distance.
Choose Minimum Distance Based on Error Model
IfExpected error rate: rare single-bit flips (e.g., DRAM)
UseUse distance 3 (Hamming code). SEC overhead ~12.5% for 64-bit words.
IfExpected errors: two-bit bursts (e.g., flash storage)
UseUse distance 5 (BCH code) or interleaved Hamming to spread burst across multiple blocks.
IfExpected errors: high random bit flip rate (e.g., space communication)
UseUse distance 7+ (Reed-Solomon, LDPC). Higher overhead but can correct 3+ errors per block.

Hamming(7,4) Encoding

Place data bits at positions 3,5,6,7 (non-powers-of-2). Parity bits at positions 1,2,4 (powers of 2). Each parity bit covers a specific set of positions. The encoding process uses XOR to compute parity values that make the total number of 1s in each parity's coverage set even (even parity).

hamming.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def hamming_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into 7-bit Hamming code."""
    # Positions 1-7 (1-indexed): p1,p2,d1,p4,d2,d3,d4
    d = data  # [d1, d2, d3, d4]
    # Build 7-bit codeword (0-indexed internally)
    code = [0] * 7
    code[2] = d[0]  # position 3
    code[4] = d[1]  # position 5
    code[5] = d[2]  # position 6
    code[6] = d[3]  # position 7
    # Compute parity bits
    # p1 (pos 1): covers positions 1,3,5,7
    code[0] = code[2] ^ code[4] ^ code[6]
    # p2 (pos 2): covers positions 2,3,6,7
    code[1] = code[2] ^ code[5] ^ code[6]
    # p4 (pos 4): covers positions 4,5,6,7
    code[3] = code[4] ^ code[5] ^ code[6]
    return code

def hamming_decode(code: list[int]) -> tuple[list[int], int]:
    """Decode and correct single-bit errors. Returns (data, error_pos)."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]  # p1 check
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]  # p2 check
    s4 = code[3] ^ code[4] ^ code[5] ^ code[6]  # p4 check
    syndrome = s4 * 4 + s2 * 2 + s1             # error position (1-indexed)
    if syndrome:
        code[syndrome - 1] ^= 1  # flip the erroneous bit
    data = [code[2], code[4], code[5], code[6]]
    return data, syndrome

# Encode 1011
original = [1, 0, 1, 1]
codeword = hamming_encode(original)
print(f'Data: {original} → Codeword: {codeword}')

# Introduce error at position 5
erroneous = codeword[:]
erroneous[4] ^= 1
print(f'Received (error at pos 5): {erroneous}')
decoded, err_pos = hamming_decode(erroneous)
print(f'Corrected: {decoded}, error at position {err_pos}')
Output
Data: [1, 0, 1, 1] → Codeword: [0, 1, 1, 1, 0, 1, 1]
Received (error at pos 5): [0, 1, 1, 1, 1, 1, 1]
Corrected: [1, 0, 1, 1], error at position 5
Production Insight
The syndrome directly gives the bit position in hardware — no lookup table needed.
Modern ECC controllers compute syndrome in one clock cycle using XOR trees.
Rule: syndrome arithmetic is the fastest way to locate errors in silicon.
Key Takeaway
Parity bits at power-of-2 positions each cover a unique subset.
Syndrome (s4,s2,s1) forms the binary index of the erroneous bit.
Punchline: the syndrome = GPS coordinates for the flipped bit.

Syndrome Decoding and Error Correction

Syndrome decoding is the core of Hamming code correction. After receiving a codeword, we recompute the parity checks. If all three parity checks pass (syndrome = 0), no single-bit error occurred. If one or more fail, the binary number formed by the failed parity checks (p4,p2,p1) points directly to the bit position (1-indexed) that flipped.

For example, if parity check for p1 fails (s1=1), p2 passes (s2=0), p4 passes (s4=0), then syndrome = 001 binary = 1 -> bit 1 (the p1 bit itself) is erroneous. If s1=1, s2=1, s4=0 -> 011 = 3 -> bit 3 (the first data bit) flipped.

This works because each data bit is covered by exactly three of the parity checks — the pattern of 'which parity checks cover this bit' is unique for each bit position. A single-bit error causes the parity checks it participates in to fail, forming a unique syndrome.

Why syndrome works without a lookup
The parity coverage sets are designed so that the binary pattern of 'which parities cover a given bit' equals the bit's position number. Bit 3 (011) is covered by p1 and p2 (bits 1 and 2). When bit 3 flips, p1 and p2 fail → syndrome 011 = 3. This property is not coincidental; it's by construction.
Production Insight
Syndrome decoding is stateless — you don't need to know the original codeword.
That's why ECC can correct errors transparently without performance penalty.
Rule: syndrome = error location; apply XOR flip and move on.
Key Takeaway
Syndrome is a binary number formed by recomputed parity checks.
Zero syndrome = no error; non-zero = bit position to correct.
Punchline: the parity bits are designed so that the syndrome is the error address.

Generalization to Larger Codes

Hamming codes exist for any r ≥ 2: Hamming(2^r - 1, 2^r - 1 - r). For 64-bit words (common in modern memory), we need r such that 2^r ≥ 64 + r + 1. r = 7 gives 2^7 = 128 ≥ 64 + 7 + 1 = 72, so Hamming(127,120) is possible, but in practice memory controllers use a smaller block size with interleaving. Common variant: (72,64) SECDED code with 8 parity bits per 64-bit word — that uses a Hamming(127,120) derived code with an extra parity bit for double-error detection.

The generalization follows the same parity coverage pattern: parity bit i covers all positions where the i-th bit of the position's binary representation is 1. This allows systematic construction for any block size.

Production Insight
Real ECC doesn't use raw Hamming(7,4). It uses shortened codes: e.g., (72,64) SECDED.
The extra parity bit adds double-error detection at the cost of one more bit.
Rule: production systems always use SECDED, never plain Hamming — double-bit errors are catastrophic.
Key Takeaway
Hamming codes exist for any r: block length = 2^r - 1, data = 2^r - 1 - r.
Shortened codes tailor the block to exact word sizes.
Punchline: choose r such that 2^r ≥ n + r + 1, then shorten to fit your data width.

Applications

ECC RAM: Error Correcting Code memory adds Hamming-like parity bits to each 64-bit word. Used in servers, workstations, and anywhere data corruption is unacceptable.

RAID-2: Uses Hamming codes across disk drives (largely superseded by RAID-5/6).

QR Codes: Use Reed-Solomon codes (a generalisation) allowing correction of up to 30% data loss.

Satellite/Deep Space: Voyager probes use Golay codes (24,12) — correcting 3 errors per 24-bit block across billions of kilometres.

Production Insight
ECC RAM adds ~5-10% overhead in memory cost but reduces crash rate by orders of magnitude.
Without ECC, one bit flip in a critical memory region causes kernel panic every few months.
Rule: any machine handling financial or health data must have ECC RAM.
Key Takeaway
ECC RAM uses Hamming-derived codes to correct single-bit errors per memory word.
Without ECC, cosmic rays cause about 1 bit flip per 256 GB per month.
Punchline: if your data matters, your RAM needs ECC.

ECC Memory Internals

Modern ECC memory controllers implement a (72,64) SECDED code. For each 64-bit data word, 8 check bits are stored. The extra bit enables detection of two-bit errors (but correction of single-bit only). The controller computes the syndrome on every read; if the syndrome is non-zero and the extra parity check fails, a double-bit error is flagged.

The memory subsystem is organized into multiple banks and ranks; errors may affect one chip (chipkill) or multiple chips. Advanced features like Chipkill Correct and Memory Scrubbing use Hamming-based codes to handle partial failures.

ecc_memory_sim.pyPYTHON
1
2
3
4
5
6
7
8
9
10
# Simulating (72,64) SECDED syndrome decoding
# In hardware, this happens in parallel over all 64+8 bits

def check_syndromes(code_bits, parity_bits):
    # Simplified: compute 8 parity checks for 64-bit data
    # For demo, we show only 3 checks for Hamming(7,4) style
    pass

# Real implementation uses XOR trees and banks of parity checkers.
# Latency is typically 1-2 CPU cycles.
Production Insight
Memory scrubbing reads all rows periodically to correct single-bit errors before they accumulate.
Without scrubbing, multiple single-bit errors in the same word become uncorrectable double-bit errors.
Rule: enable memory scrubbing in BIOS; scrub interval ≤ 8 hours for large memory systems.
Key Takeaway
ECC corrects single-bit errors silently; scrubbing prevents accumulation.
Double-bit errors are detected but not corrected — they cause machine check exceptions.
Punchline: ECC buys you time, scrubbing buys you peace.
● Production incidentPOST-MORTEMseverity: high

Silent Data Corruption from ECC-Protected Memory Failure

Symptom
Application logs showed occasional checksum mismatches in replicated data. Database consistency checks flagged random rows with incorrect values but no disk errors were reported.
Assumption
The team assumed ECC RAM guaranteed data integrity; the hardware monitoring dashboard showed no ECC error counters above zero.
Root cause
A failing DIMM developed a 'stuck bit' that toggled every read. ECC corrected the single-bit error silently, so the OS never logged a corrected error. When another bit flipped due to cosmic radiation, the Hamming code could not correct the double-bit error — instead, it mis-corrected, producing undetected corruption.
Fix
Replaced the faulty DIMM. Enabled BIOS logging for Correctable Error Correcting Code (CECC) events. Configured the kernel to log all corrected ECC errors via EDAC (Error Detection and Correction) driver. Set up Prometheus alerts on corrected error rate exceeding 1 per hour per DIMM.
Key lesson
  • ECC corrects single-bit errors but cannot correct double-bit errors; it only detects them (if the code has distance 3 or more).
  • Silent correction hides failing hardware — monitor corrected error counts even if they're 'harmless'.
  • Use SECDED (Single Error Correction, Double Error Detection) variants (e.g., Hamming with extra parity) for production systems.
Production debug guideDiagnosing single-bit and multi-bit errors on Linux4 entries
Symptom · 01
Kernel logs show 'EDAC MC0: CE page 0x...' messages
Fix
Check the corrected error count: 'grep -o 'CE count [0-9]*' /var/log/kern.log | tail -1'. If rate > 1/hour, schedule DIMM replacement.
Symptom · 02
Memory-intensive application crashes with SIGBUS or segfault on a machine with ECC RAM
Fix
Run 'mcelog --client' to check for uncorrected errors. If found, use dmidecode -t memory to locate the failing DIMM slot.
Symptom · 03
Comparing two identical copies of a file reveals bit flips that CRC32 does not detect
Fix
Verify DMA integrity: check 'ethtool -S eth0 | grep crc' for NIC errors. Also test RAM with memtester or stressapptest.
Symptom · 04
ECC error counters are zero but data corruption still occurs
Fix
Double-bit errors may not be logged if the memory controller cannot determine which DIMM failed. Run 'edac-util --report=all' and inspect the CPER records via 'ras-mc-ctl --errors'.
★ ECC RAM Diagnostic CommandsQuick commands to check ECC status and locate failing DIMMs on Linux systems.
Check if ECC is enabled on the system
Immediate action
Run 'dmidecode -t memory | grep -i ecc'
Commands
sudo dmidecode -t memory | grep -E 'Type:|Error Correction|Supported'
dmesg | grep -i edac
Fix now
If ECC not enabled, enable in BIOS and reboot. Cannot be changed without reboot.
High corrected error count on one DIMM+
Immediate action
Identify the affected DIMM using EDAC sysfs
Commands
sudo edac-util --report=all
grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count
Fix now
Replace the DIMM with the highest ce_count. Use 'sudo dmidecode -t memory' to find the slot label.
Uncorrectable memory error (UE) reported+
Immediate action
Isolate the failing memory region and replace DIMM immediately
Commands
sudo mcelog --client --record | grep -i 'uncorrected'
sudo ras-mc-ctl --errors
Fix now
Reboot with one DIMM removed to identify faulty module. Use memtest86+ during next maintenance window.
Error Correction Code Types
CodeBlock Size (data+parity)Min DistanceCorrectsDetectsOverhead
Hamming(7,4)7 bits31 bit2 bits75% data, 25% parity
Extended Hamming (SECDED)8 bits (7+1)41 bit2 bits~14% overhead on 64-bit
BCH (63,51)63 bits52 bits3+ bits~23% overhead
Reed-Solomon (255,223)255 bytes3316 bytes32 bytes~14% overhead at byte level

Key takeaways

1
Hamming distance
bits that differ between two strings — minimum distance 3 enables 1-bit correction.
2
Parity bits at power-of-2 positions each cover a specific subset of data positions.
3
Syndrome decoding
XOR parity checks form a binary number pointing to the error position.
4
Hamming(7,4)
encode 4 bits in 7 — 3 parity bits overhead for single-bit correction.
5
ECC RAM, QR codes, RAID, satellite comm all use Hamming or generalised codes (Reed-Solomon).
6
Production systems use SECDED (Hamming with extra parity) to detect but not correct double-bit errors.
7
Monitor corrected error counts
they reveal hardware failing before catastrophic crashes.

Common mistakes to avoid

4 patterns
×

Using Hamming(7,4) for burst error correction

Symptom
A burst of 2 consecutive bit flips corrupts two adjacent bits in the same block — Hamming code may correct to the wrong codeword (mis-correction) or detect an uncorrectable error.
Fix
Use interleaving: arrange multiple Hamming blocks so that a burst is spread across different blocks. Or switch to a code designed for bursts, like Fire code or CRC+ARQ.
×

Assuming ECC RAM prevents all data corruption

Symptom
After a double-bit error, the memory controller may silently return corrupted data if the code cannot detect the second error (plain Hamming without extra parity).
Fix
Always use SECDED (Hamming extended with a global parity bit) in production. Plain Hamming(7,4) is only for education.
×

Miscalculating required parity bits for a given data size

Symptom
The code cannot correct errors because 2^r < n + r + 1; some errors remain undetectable.
Fix
Solve 2^r ≥ n + r + 1. For n=64, r=7 (2^7=128 ≥ 64+7+1=72). Use at least 7 parity bits; in practice 8 for SECDED.
×

Ignoring corrected error counters in production

Symptom
A DIMM with high corrected error rate is going bad. If you don't monitor CE counts, it will eventually cause an uncorrectable error and crash.
Fix
Set up alerts on corrected error rate (e.g., >1 per hour per DIMM). Use EDAC or mcelog to report failures.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
How many parity bits are needed to correct a single-bit error in an n-bi...
Q02SENIOR
Explain syndrome decoding — how does the syndrome identify the error pos...
Q03SENIOR
What is the relationship between minimum Hamming distance and error corr...
Q04SENIOR
How does ECC RAM differ from regular RAM?
Q05SENIOR
Can Hamming code correct a two-bit error?
Q01 of 05JUNIOR

How many parity bits are needed to correct a single-bit error in an n-bit codeword?

ANSWER
You need r parity bits where 2^r ≥ n + r + 1. For example, for n=4, r=3 because 2^3=8 ≥ 4+3+1=8. In general, solve for smallest integer r satisfying that inequality. This ensures each bit position (including parity bits) has a unique syndrome pattern when it flips.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How many parity bits does Hamming code need for n data bits?
02
What is the difference between parity check and Hamming code?
03
Why is ECC RAM slower than regular RAM?
04
Can I use Hamming code for error correction in wireless communication?
05
What happens if the syndrome points to a parity bit?
🔥

That's Hashing. Mark it forged?

3 min read · try the examples if you haven't

Previous
Universal Hashing and Perfect Hashing
11 / 11 · Hashing
Next
Run-Length Encoding — Lossless Compression