Skip to content
Home DSA Hamming Code — Double-Bit Error Mis-Correction

Hamming Code — Double-Bit Error Mis-Correction

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Hashing → Topic 11 of 11
A stuck bit in ECC RAM is silently corrected; a second cosmic-ray flip causes Hamming mis-correction and undetected corruption.
⚙️ Intermediate — basic DSA knowledge assumed
In this tutorial, you'll learn
A stuck bit in ECC RAM is silently corrected; a second cosmic-ray flip causes Hamming mis-correction and undetected corruption.
  • Hamming distance: bits that differ between two strings — minimum distance 3 enables 1-bit correction.
  • Parity bits at power-of-2 positions each cover a specific subset of data positions.
  • Syndrome decoding: XOR parity checks form a binary number pointing to the error position.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Hamming code adds parity bits at power-of-2 positions to detect and correct single-bit errors.
  • Each parity bit covers a specific set of data bits; XOR checks produce a syndrome that points to the error location.
  • Minimum Hamming distance of 3 allows 1-bit correction; distance d corrects ⌊(d-1)/2⌋ errors.
  • Real-world use: ECC RAM corrects memory bit flips silently, preventing crashes.
  • Common mistake: assuming Hamming(7,4) handles burst errors — it cannot; it only fixes one bit per block.
🚨 START HERE

ECC RAM Diagnostic Commands

Quick commands to check ECC status and locate failing DIMMs on Linux systems.
🟡

Check if ECC is enabled on the system

Immediate ActionRun 'dmidecode -t memory | grep -i ecc'
Commands
sudo dmidecode -t memory | grep -E 'Type:|Error Correction|Supported'
dmesg | grep -i edac
Fix NowIf ECC not enabled, enable in BIOS and reboot. Cannot be changed without reboot.
🟡

High corrected error count on one DIMM

Immediate ActionIdentify the affected DIMM using EDAC sysfs
Commands
sudo edac-util --report=all
grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count
Fix NowReplace the DIMM with the highest ce_count. Use 'sudo dmidecode -t memory' to find the slot label.
🟡

Uncorrectable memory error (UE) reported

Immediate ActionIsolate the failing memory region and replace DIMM immediately
Commands
sudo mcelog --client --record | grep -i 'uncorrected'
sudo ras-mc-ctl --errors
Fix NowReboot with one DIMM removed to identify faulty module. Use memtest86+ during next maintenance window.
Production Incident

Silent Data Corruption from ECC-Protected Memory Failure

A production database server began returning corrupt query results intermittently. No hardware alarms were triggered because ECC corrected every single-bit error — until a second bit failed in the same word.
SymptomApplication logs showed occasional checksum mismatches in replicated data. Database consistency checks flagged random rows with incorrect values but no disk errors were reported.
AssumptionThe team assumed ECC RAM guaranteed data integrity; the hardware monitoring dashboard showed no ECC error counters above zero.
Root causeA failing DIMM developed a 'stuck bit' that toggled every read. ECC corrected the single-bit error silently, so the OS never logged a corrected error. When another bit flipped due to cosmic radiation, the Hamming code could not correct the double-bit error — instead, it mis-corrected, producing undetected corruption.
FixReplaced the faulty DIMM. Enabled BIOS logging for Correctable Error Correcting Code (CECC) events. Configured the kernel to log all corrected ECC errors via EDAC (Error Detection and Correction) driver. Set up Prometheus alerts on corrected error rate exceeding 1 per hour per DIMM.
Key Lesson
ECC corrects single-bit errors but cannot correct double-bit errors; it only detects them (if the code has distance 3 or more).Silent correction hides failing hardware — monitor corrected error counts even if they're 'harmless'.Use SECDED (Single Error Correction, Double Error Detection) variants (e.g., Hamming with extra parity) for production systems.
Production Debug Guide

Diagnosing single-bit and multi-bit errors on Linux

Kernel logs show 'EDAC MC0: CE page 0x...' messagesCheck the corrected error count: 'grep -o 'CE count [0-9]*' /var/log/kern.log | tail -1'. If rate > 1/hour, schedule DIMM replacement.
Memory-intensive application crashes with SIGBUS or segfault on a machine with ECC RAMRun 'mcelog --client' to check for uncorrected errors. If found, use dmidecode -t memory to locate the failing DIMM slot.
Comparing two identical copies of a file reveals bit flips that CRC32 does not detectVerify DMA integrity: check 'ethtool -S eth0 | grep crc' for NIC errors. Also test RAM with memtester or stressapptest.
ECC error counters are zero but data corruption still occursDouble-bit errors may not be logged if the memory controller cannot determine which DIMM failed. Run 'edac-util --report=all' and inspect the CPER records via 'ras-mc-ctl --errors'.

In 1947, Richard Hamming was working at Bell Labs on the Model V relay computer. Every time the computer hit an error while he was away for the weekend, it would simply stop and wait for an operator to fix it — losing his entire Saturday batch job. Frustrated, he thought: 'If the machine can detect an error, why can't it correct it?' Over the next two years he developed Hamming codes — the first error-correcting codes.

Today, ECC (Error Correcting Code) RAM uses Hamming codes on every memory operation. Server-grade machines will not pass certification without ECC RAM. Every time you access memory on a server, a Hamming code is silently verifying and potentially correcting the read. The DRAM inside your laptop likely performs billions of Hamming-protected reads per second.

Hamming Distance

The Hamming distance between two binary strings is the number of positions where they differ. Hamming(7,4) has minimum distance 3 — any two valid codewords differ in at least 3 positions. This means: - Any 1-bit error produces an invalid codeword (detectable) - The invalid codeword is closest to exactly one valid codeword (correctable)

General rule: minimum distance d allows detecting d-1 errors and correcting ⌊(d-1)/2⌋ errors.

Mental Model
Think of codewords as points in space
Valid codewords are like streetlights: each light is far enough from its neighbours that you can always tell which one you're closest to.
  • Distance 3 means no two codewords share a nearest neighbour for single-bit errors.
  • If a received word has one flipped bit, it's still closer to the original than to any other codeword.
  • Two-bit errors can be detected (land halfway between two lights) but not corrected.
📊 Production Insight
Real-world systems use distance 3 as the minimum for SECDED.
Larger distances (5, 7) enable multiple-bit correction but add overhead.
Rule: design for the dominant error mode: single-bit in memory, burst in wireless.
🎯 Key Takeaway
Minimum distance d = 2t+1 for t-bit correction.
Distance 3 enables 1-bit correction and 2-bit detection.
Punchline: never use a code without knowing its minimum distance.
Choose Minimum Distance Based on Error Model
IfExpected error rate: rare single-bit flips (e.g., DRAM)
UseUse distance 3 (Hamming code). SEC overhead ~12.5% for 64-bit words.
IfExpected errors: two-bit bursts (e.g., flash storage)
UseUse distance 5 (BCH code) or interleaved Hamming to spread burst across multiple blocks.
IfExpected errors: high random bit flip rate (e.g., space communication)
UseUse distance 7+ (Reed-Solomon, LDPC). Higher overhead but can correct 3+ errors per block.

Hamming(7,4) Encoding

Place data bits at positions 3,5,6,7 (non-powers-of-2). Parity bits at positions 1,2,4 (powers of 2). Each parity bit covers a specific set of positions. The encoding process uses XOR to compute parity values that make the total number of 1s in each parity's coverage set even (even parity).

hamming.py · PYTHON
1234567891011121314151617181920212223242526272829303132333435363738394041
def hamming_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into 7-bit Hamming code."""
    # Positions 1-7 (1-indexed): p1,p2,d1,p4,d2,d3,d4
    d = data  # [d1, d2, d3, d4]
    # Build 7-bit codeword (0-indexed internally)
    code = [0] * 7
    code[2] = d[0]  # position 3
    code[4] = d[1]  # position 5
    code[5] = d[2]  # position 6
    code[6] = d[3]  # position 7
    # Compute parity bits
    # p1 (pos 1): covers positions 1,3,5,7
    code[0] = code[2] ^ code[4] ^ code[6]
    # p2 (pos 2): covers positions 2,3,6,7
    code[1] = code[2] ^ code[5] ^ code[6]
    # p4 (pos 4): covers positions 4,5,6,7
    code[3] = code[4] ^ code[5] ^ code[6]
    return code

def hamming_decode(code: list[int]) -> tuple[list[int], int]:
    """Decode and correct single-bit errors. Returns (data, error_pos)."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]  # p1 check
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]  # p2 check
    s4 = code[3] ^ code[4] ^ code[5] ^ code[6]  # p4 check
    syndrome = s4 * 4 + s2 * 2 + s1             # error position (1-indexed)
    if syndrome:
        code[syndrome - 1] ^= 1  # flip the erroneous bit
    data = [code[2], code[4], code[5], code[6]]
    return data, syndrome

# Encode 1011
original = [1, 0, 1, 1]
codeword = hamming_encode(original)
print(f'Data: {original} → Codeword: {codeword}')

# Introduce error at position 5
erroneous = codeword[:]
erroneous[4] ^= 1
print(f'Received (error at pos 5): {erroneous}')
decoded, err_pos = hamming_decode(erroneous)
print(f'Corrected: {decoded}, error at position {err_pos}')
▶ Output
Data: [1, 0, 1, 1] → Codeword: [0, 1, 1, 1, 0, 1, 1]
Received (error at pos 5): [0, 1, 1, 1, 1, 1, 1]
Corrected: [1, 0, 1, 1], error at position 5
📊 Production Insight
The syndrome directly gives the bit position in hardware — no lookup table needed.
Modern ECC controllers compute syndrome in one clock cycle using XOR trees.
Rule: syndrome arithmetic is the fastest way to locate errors in silicon.
🎯 Key Takeaway
Parity bits at power-of-2 positions each cover a unique subset.
Syndrome (s4,s2,s1) forms the binary index of the erroneous bit.
Punchline: the syndrome = GPS coordinates for the flipped bit.

Syndrome Decoding and Error Correction

Syndrome decoding is the core of Hamming code correction. After receiving a codeword, we recompute the parity checks. If all three parity checks pass (syndrome = 0), no single-bit error occurred. If one or more fail, the binary number formed by the failed parity checks (p4,p2,p1) points directly to the bit position (1-indexed) that flipped.

For example, if parity check for p1 fails (s1=1), p2 passes (s2=0), p4 passes (s4=0), then syndrome = 001 binary = 1 -> bit 1 (the p1 bit itself) is erroneous. If s1=1, s2=1, s4=0 -> 011 = 3 -> bit 3 (the first data bit) flipped.

This works because each data bit is covered by exactly three of the parity checks — the pattern of 'which parity checks cover this bit' is unique for each bit position. A single-bit error causes the parity checks it participates in to fail, forming a unique syndrome.

🔥Why syndrome works without a lookup
The parity coverage sets are designed so that the binary pattern of 'which parities cover a given bit' equals the bit's position number. Bit 3 (011) is covered by p1 and p2 (bits 1 and 2). When bit 3 flips, p1 and p2 fail → syndrome 011 = 3. This property is not coincidental; it's by construction.
📊 Production Insight
Syndrome decoding is stateless — you don't need to know the original codeword.
That's why ECC can correct errors transparently without performance penalty.
Rule: syndrome = error location; apply XOR flip and move on.
🎯 Key Takeaway
Syndrome is a binary number formed by recomputed parity checks.
Zero syndrome = no error; non-zero = bit position to correct.
Punchline: the parity bits are designed so that the syndrome is the error address.

Generalization to Larger Codes

Hamming codes exist for any r ≥ 2: Hamming(2^r - 1, 2^r - 1 - r). For 64-bit words (common in modern memory), we need r such that 2^r ≥ 64 + r + 1. r = 7 gives 2^7 = 128 ≥ 64 + 7 + 1 = 72, so Hamming(127,120) is possible, but in practice memory controllers use a smaller block size with interleaving. Common variant: (72,64) SECDED code with 8 parity bits per 64-bit word — that uses a Hamming(127,120) derived code with an extra parity bit for double-error detection.

The generalization follows the same parity coverage pattern: parity bit i covers all positions where the i-th bit of the position's binary representation is 1. This allows systematic construction for any block size.

📊 Production Insight
Real ECC doesn't use raw Hamming(7,4). It uses shortened codes: e.g., (72,64) SECDED.
The extra parity bit adds double-error detection at the cost of one more bit.
Rule: production systems always use SECDED, never plain Hamming — double-bit errors are catastrophic.
🎯 Key Takeaway
Hamming codes exist for any r: block length = 2^r - 1, data = 2^r - 1 - r.
Shortened codes tailor the block to exact word sizes.
Punchline: choose r such that 2^r ≥ n + r + 1, then shorten to fit your data width.

Applications

ECC RAM: Error Correcting Code memory adds Hamming-like parity bits to each 64-bit word. Used in servers, workstations, and anywhere data corruption is unacceptable.

RAID-2: Uses Hamming codes across disk drives (largely superseded by RAID-5/6).

QR Codes: Use Reed-Solomon codes (a generalisation) allowing correction of up to 30% data loss.

Satellite/Deep Space: Voyager probes use Golay codes (24,12) — correcting 3 errors per 24-bit block across billions of kilometres.

📊 Production Insight
ECC RAM adds ~5-10% overhead in memory cost but reduces crash rate by orders of magnitude.
Without ECC, one bit flip in a critical memory region causes kernel panic every few months.
Rule: any machine handling financial or health data must have ECC RAM.
🎯 Key Takeaway
ECC RAM uses Hamming-derived codes to correct single-bit errors per memory word.
Without ECC, cosmic rays cause about 1 bit flip per 256 GB per month.
Punchline: if your data matters, your RAM needs ECC.

ECC Memory Internals

Modern ECC memory controllers implement a (72,64) SECDED code. For each 64-bit data word, 8 check bits are stored. The extra bit enables detection of two-bit errors (but correction of single-bit only). The controller computes the syndrome on every read; if the syndrome is non-zero and the extra parity check fails, a double-bit error is flagged.

The memory subsystem is organized into multiple banks and ranks; errors may affect one chip (chipkill) or multiple chips. Advanced features like Chipkill Correct and Memory Scrubbing use Hamming-based codes to handle partial failures.

ecc_memory_sim.py · PYTHON
12345678910
# Simulating (72,64) SECDED syndrome decoding
# In hardware, this happens in parallel over all 64+8 bits

def check_syndromes(code_bits, parity_bits):
    # Simplified: compute 8 parity checks for 64-bit data
    # For demo, we show only 3 checks for Hamming(7,4) style
    pass

# Real implementation uses XOR trees and banks of parity checkers.
# Latency is typically 1-2 CPU cycles.
📊 Production Insight
Memory scrubbing reads all rows periodically to correct single-bit errors before they accumulate.
Without scrubbing, multiple single-bit errors in the same word become uncorrectable double-bit errors.
Rule: enable memory scrubbing in BIOS; scrub interval ≤ 8 hours for large memory systems.
🎯 Key Takeaway
ECC corrects single-bit errors silently; scrubbing prevents accumulation.
Double-bit errors are detected but not corrected — they cause machine check exceptions.
Punchline: ECC buys you time, scrubbing buys you peace.
🗂 Error Correction Code Types
Popular ECC schemes and their capabilities
CodeBlock Size (data+parity)Min DistanceCorrectsDetectsOverhead
Hamming(7,4)7 bits31 bit2 bits75% data, 25% parity
Extended Hamming (SECDED)8 bits (7+1)41 bit2 bits~14% overhead on 64-bit
BCH (63,51)63 bits52 bits3+ bits~23% overhead
Reed-Solomon (255,223)255 bytes3316 bytes32 bytes~14% overhead at byte level

🎯 Key Takeaways

  • Hamming distance: bits that differ between two strings — minimum distance 3 enables 1-bit correction.
  • Parity bits at power-of-2 positions each cover a specific subset of data positions.
  • Syndrome decoding: XOR parity checks form a binary number pointing to the error position.
  • Hamming(7,4): encode 4 bits in 7 — 3 parity bits overhead for single-bit correction.
  • ECC RAM, QR codes, RAID, satellite comm all use Hamming or generalised codes (Reed-Solomon).
  • Production systems use SECDED (Hamming with extra parity) to detect but not correct double-bit errors.
  • Monitor corrected error counts — they reveal hardware failing before catastrophic crashes.

⚠ Common Mistakes to Avoid

    Using Hamming(7,4) for burst error correction
    Symptom

    A burst of 2 consecutive bit flips corrupts two adjacent bits in the same block — Hamming code may correct to the wrong codeword (mis-correction) or detect an uncorrectable error.

    Fix

    Use interleaving: arrange multiple Hamming blocks so that a burst is spread across different blocks. Or switch to a code designed for bursts, like Fire code or CRC+ARQ.

    Assuming ECC RAM prevents all data corruption
    Symptom

    After a double-bit error, the memory controller may silently return corrupted data if the code cannot detect the second error (plain Hamming without extra parity).

    Fix

    Always use SECDED (Hamming extended with a global parity bit) in production. Plain Hamming(7,4) is only for education.

    Miscalculating required parity bits for a given data size
    Symptom

    The code cannot correct errors because 2^r < n + r + 1; some errors remain undetectable.

    Fix

    Solve 2^r ≥ n + r + 1. For n=64, r=7 (2^7=128 ≥ 64+7+1=72). Use at least 7 parity bits; in practice 8 for SECDED.

    Ignoring corrected error counters in production
    Symptom

    A DIMM with high corrected error rate is going bad. If you don't monitor CE counts, it will eventually cause an uncorrectable error and crash.

    Fix

    Set up alerts on corrected error rate (e.g., >1 per hour per DIMM). Use EDAC or mcelog to report failures.

Interview Questions on This Topic

  • QHow many parity bits are needed to correct a single-bit error in an n-bit codeword?JuniorReveal
    You need r parity bits where 2^r ≥ n + r + 1. For example, for n=4, r=3 because 2^3=8 ≥ 4+3+1=8. In general, solve for smallest integer r satisfying that inequality. This ensures each bit position (including parity bits) has a unique syndrome pattern when it flips.
  • QExplain syndrome decoding — how does the syndrome identify the error position?Mid-levelReveal
    Syndrome decoding recomputes the parity checks at the receiver. For Hamming(7,4), we compute s1, s2, s4 by XORing the received bits in the same coverage sets as the encoding parity. If all are zero, no single-bit error exists. If not, the binary number (s4 s2 s1) gives the 1-indexed position of the erroneous bit. This works because each bit is covered by a unique combination of parity bits. The syndrome directly points to the bit to flip.
  • QWhat is the relationship between minimum Hamming distance and error correction capability?SeniorReveal
    A code with minimum distance d can detect up to d-1 errors and correct up to ⌊(d-1)/2⌋ errors. For Hamming(7,4), d=3, so it corrects 1 error and detects 2 errors (if using extended Hamming with extra parity, d=4 corrects 1 and detects 2). The distance defines a sphere around each valid codeword; spheres must not overlap for correction to be unambiguous.
  • QHow does ECC RAM differ from regular RAM?Mid-levelReveal
    ECC RAM stores additional parity bits alongside data (e.g., 8 parity bits per 64-bit word in a 72-bit memory module). The memory controller computes syndrome on every read and corrects single-bit errors before sending data to the CPU. Non-ECC RAM has no parity; any bit flip causes silent data corruption. ECC RAM is mandatory in servers and typically costs 10–15% more. It also introduces ~2-5ns additional latency per read due to syndrome computation.
  • QCan Hamming code correct a two-bit error?SeniorReveal
    Standard Hamming(7,4) with minimum distance 3 cannot correct a two-bit error; it will either miscorrect to a wrong codeword (if the error pattern is within correction distance of another codeword) or detect an invalid syndrome (if the error pattern does not match any valid error syndrome). To correct 2-bit errors, you need a code with minimum distance at least 5 (e.g., BCH or Reed-Solomon). The extended Hamming code with distance 4 can detect 2-bit errors but cannot correct them.

Frequently Asked Questions

How many parity bits does Hamming code need for n data bits?

You need r parity bits where 2^r ≥ n + r + 1. For 4 data bits: 2^3=8 ≥ 4+3+1=8 → 3 parity bits (Hamming(7,4)). For 8 data bits: 2^4=16 ≥ 8+4+1=13 → 4 parity bits (Hamming(12,8)). For 64 data bits: 2^7=128 ≥ 64+7+1=72 → 7 parity bits but used in ECC as (72,64) with 8 bits for SECDED.

What is the difference between parity check and Hamming code?

A single parity bit can only detect an odd number of errors; it cannot correct. Hamming code uses multiple parity bits, each covering a different subset of bits, to both detect and locate a single-bit error. The parity coverage sets are designed so that the syndrome uniquely identifies the error position.

Why is ECC RAM slower than regular RAM?

ECC memory controllers require extra cycles to compute syndrome and, if necessary, correct the data before delivering it to the CPU. The overhead is usually 1-2 additional CPU cycles per read (~2-5ns added latency). Some high-end controllers pipeline this to hide the latency. The trade-off is acceptable for data integrity.

Can I use Hamming code for error correction in wireless communication?

Wireless channels typically experience burst errors (multiple consecutive bits flipped). Hamming code is not ideal for bursts because it assumes random single-bit errors. Interleaving can spread bursts across multiple Hamming blocks, but convolutional codes or Reed-Solomon are more efficient for wireless.

What happens if the syndrome points to a parity bit?

If the syndrome points to a parity bit (e.g., position 1, 2, or 4 in Hamming(7,4)), then that parity bit itself was corrupted. The correction flips that parity bit, which does not affect the data bits. The data is read as correct — the parity bit was only used for detection/correction.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousUniversal Hashing and Perfect Hashing
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged