2026-05-03
You already know how SRAM and DRAM store bits. But what happens when a cosmic ray flips one? In systems where reliability matters (servers, spacecraft, automotive ECUs), a single bit flip can corrupt data or crash the system. Error-Correcting Codes (ECC) are how hardware detects and fixes these errors without any software involvement.
The simplest error detection is a parity bit: XOR all data bits together and append the result. If any single bit flips, the parity check fails. But parity only detects: it can't tell you which bit flipped, so it can't correct anything. It also misses any even number of flips, because two flips cancel each other out in the XOR.
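To make the parity idea concrete, here is a minimal Python sketch. The function names `parity_bit` and `parity_ok` are illustrative, and the sketch assumes even parity (the stored bit is the plain XOR of the data bits).

```python
def parity_bit(data: int) -> int:
    """Return the XOR of all bits in `data` (even parity)."""
    p = 0
    while data:
        p ^= data & 1   # fold the lowest bit into the running XOR
        data >>= 1
    return p


def parity_ok(data: int, stored_parity: int) -> bool:
    """Recompute parity on read and compare it with the stored bit."""
    return parity_bit(data) == stored_parity


word = 0b10110010            # four 1-bits, so the parity bit is 0
stored = parity_bit(word)

one_flip = word ^ 0b00000100   # one bit flipped
two_flips = word ^ 0b00000110  # two bits flipped

print(parity_ok(word, stored))       # True
print(parity_ok(one_flip, stored))   # False: the error is detected
print(parity_ok(two_flips, stored))  # True: the double flip goes unnoticed
```

The last line shows the limitation: a single parity bit can only tell an odd number of flips from an even number.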
Hamming codes solve this by using multiple parity bits, each covering a different subset of data bits. The key insight: place parity bits at positions that are powers of 2 (positions 1, 2, 4, 8, ...). Each parity bit covers all positions whose binary representation has a 1 in the corresponding bit position.
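Concrete example: Hamming(12,8) encodes an 8-bit byte into a 12-bit codeword, with parity bits at positions 1, 2, 4, and 8 and data bits at positions 3, 5, 6, 7, 9, 10, 11, and 12.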
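A minimal sketch of Hamming(12,8) encoding in Python. The function name `hamming12_8_encode` is illustrative, and the sketch assumes the data bits are placed most significant bit first and that even parity is used. Real hardware may order the bits differently.

```python
PARITY_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)


def hamming12_8_encode(byte: int) -> list:
    """Encode an 8-bit value into a 12-bit Hamming codeword (list of bits)."""
    code = [0] * 13  # index 0 is unused so indices match 1-based positions

    # Place the data bits, MSB first (an assumed ordering).
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (byte >> (7 - i)) & 1

    # Each parity bit p is the XOR of every other position whose
    # binary representation has bit p set.
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 13):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity

    return code[1:]  # positions 1..12


print(hamming12_8_encode(0b10110010))
# [1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0]
```

For this input the parity bits come out as P1 = 1, P2 = 0, P4 = 0, P8 = 1. For example, P4 covers positions 5, 6, 7, and 12, which hold 0, 1, 1, 0.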
On read, you recompute each parity check. The failing checks form a binary number called the syndrome, which (assuming at most one bit flipped) directly gives the position of the flipped bit. If the syndrome is 0, there is no error. If it's nonzero, flip the bit at that position: correction done, entirely in hardware, typically within a single clock cycle.
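The read-side check can be sketched the same way. This is an illustration rather than a description of any particular controller: `hamming12_8_correct` is a hypothetical name, the codeword is a list of 12 bits with index 0 holding position 1, and the sample codeword assumes even parity with MSB-first data placement.

```python
PARITY_POSITIONS = (1, 2, 4, 8)


def hamming12_8_correct(code):
    """Return (corrected_codeword, syndrome), assuming at most one flipped bit."""
    syndrome = 0
    for p in PARITY_POSITIONS:
        check = 0
        for pos in range(1, 13):
            if pos & p:              # position is covered by parity bit p
                check ^= code[pos - 1]
        if check:                    # this parity check fails
            syndrome |= p

    if syndrome > 12:
        # No such position in a 12-bit codeword: more than one bit flipped.
        raise ValueError("uncorrectable multi-bit error")

    fixed = list(code)
    if syndrome:
        fixed[syndrome - 1] ^= 1     # flip the bit the syndrome points at
    return fixed, syndrome


good = [1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0]  # a valid codeword
bad = list(good)
bad[6 - 1] ^= 1                              # flip position 6

fixed, syndrome = hamming12_8_correct(bad)
print(syndrome)       # 6
print(fixed == good)  # True
```

With position 6 flipped, the checks for parity bits 2 and 4 fail and the others pass, so the syndrome is 2 + 4 = 6.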
Rule of thumb for parity bit count: to protect m data bits, you need r parity bits where 2^r ≥ m + r + 1. For 64-bit data (a typical cache line word), r = 7 gives 2^7 = 128 ≥ 72. That's only 10.9% overhead.
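The rule of thumb is easy to check numerically. A small sketch, with `min_parity_bits` as an illustrative helper name, finds the smallest r that satisfies 2^r ≥ m + r + 1:

```python
def min_parity_bits(m: int) -> int:
    """Smallest r such that 2**r >= m + r + 1."""
    r = 0
    while 2 ** r < m + r + 1:
        r += 1
    return r


for m in (8, 64):
    r = min_parity_bits(m)
    print(m, r, f"{100 * r / m:.1f}%")
# 8 4 50.0%
# 64 7 10.9%
```

The relative overhead shrinks as the word gets wider, which is why ECC is usually applied to wide words rather than to individual bytes.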
In the real world, SECDED (Single Error Correct, Double Error Detect) adds one extra parity bit over the whole word, so a 64-bit word carries 8 check bits (12.5% overhead). ECC DIMMs are sized for exactly this: 72-bit buses (64 data + 8 check bits) instead of the 64-bit buses on non-ECC DIMMs. The memory controller has dedicated XOR trees that compute the syndrome on every single read, so the check adds little or no visible performance penalty.
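The SECDED decision logic can be sketched on the smaller Hamming(12,8) code by appending one overall parity bit, giving a 13-bit word. This layout and the name `secded_status` are assumptions made for illustration, and the classification is only valid if at most two bits have flipped.

```python
def secded_status(word):
    """Classify a 13-bit word: 12 Hamming bits followed by one overall parity bit."""
    code = word[:12]

    # XOR of the positions of all set bits equals the Hamming syndrome
    # (the number formed by the failing parity checks).
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos

    overall_ok = sum(word) % 2 == 0  # even parity across all 13 bits

    if syndrome == 0 and overall_ok:
        return "no error"
    if not overall_ok:
        # Odd number of flips: treated as one. The syndrome gives its
        # position, or 0 if the overall parity bit itself flipped.
        return "single-bit error (correctable)"
    # Nonzero syndrome but overall parity still matches: two flips.
    return "double-bit error (detected, not correctable)"


clean = [1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0]

one = list(clean)
one[5] ^= 1            # flip position 6

two = list(one)
two[10] ^= 1           # also flip position 11

print(secded_status(clean))  # no error
print(secded_status(one))    # single-bit error (correctable)
print(secded_status(two))    # double-bit error (detected, not correctable)
```

Without the extra bit, the double flip would produce a nonzero syndrome that looks like a single-bit error, and the decoder would "correct" the wrong bit.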
In hardware, the syndrome calculator is just a tree of XOR gates: cheap, fast, and purely combinational. This is why ECC is so widely used where reliability matters: the logic costs almost nothing in area or timing, yet it turns what could have been a hard crash into a silently corrected non-event.
