2026-05-03
When a network call fails, your first instinct is to retry it immediately. This is almost always wrong. If a service is struggling under load and 500 clients all retry instantly, you've just doubled the traffic hitting an already-sick system. This is called a retry storm, and it can turn a minor blip into a full outage.
The fix is exponential backoff: each retry waits longer than the last. A common formula is delay = base * 2^attempt, with attempt counted from 0. With a 1-second base, the waits before successive retries are 1s, 2s, 4s, 8s, and 16s. This gives the downstream service breathing room to recover.
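To make the formula concrete, here is a minimal sketch in Python. The function name backoff_delay and its parameters are made up for illustration, and it assumes attempts are counted from 0.

```python
def backoff_delay(attempt: int, base: float = 1.0) -> float:
    """Seconds to wait before retry number `attempt` (0-indexed): base * 2^attempt."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return base * (2 ** attempt)


# Waits before the first five retries with a 1-second base.
delays = [backoff_delay(n) for n in range(5)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0]
```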
But there's a subtlety. If 500 clients all start at the same moment, they'll all retry at 1s, then all at 2s — still synchronized. This is where jitter comes in. You randomize each delay so clients spread their retries across time. The two common approaches:
delay = random(0, base * 2^attempt): simple and effective
delay = random(base, previous_delay * 3): spreads retries even further
Here's a real-world example. Say your payment service calls a bank API that occasionally returns 503. Without backoff, a 200ms blip causes a retry storm that extends the outage to 30 seconds. With exponential backoff plus jitter, each client independently backs off, the bank recovers in under a second, and users barely notice.
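The two jitter formulas above could be sketched like this. The function names are hypothetical, the second formula is assumed to start with previous_delay equal to base, and the seeded generator is there only so the demo prints the same thing on every run.

```python
import random
from typing import Optional

_default_rng = random.Random()


def full_jitter(attempt: int, base: float = 1.0,
                rng: Optional[random.Random] = None) -> float:
    """delay = random(0, base * 2^attempt), with attempt counted from 0."""
    rng = rng or _default_rng
    return rng.uniform(0.0, base * (2 ** attempt))


def decorrelated_jitter(previous_delay: float, base: float = 1.0,
                        rng: Optional[random.Random] = None) -> float:
    """delay = random(base, previous_delay * 3); seed previous_delay with base."""
    rng = rng or _default_rng
    return rng.uniform(base, previous_delay * 3)


# Three simulated clients that all fail at the same moment.
rng = random.Random(42)  # seeded only so the demo is repeatable
for client in range(3):
    previous = 1.0  # assumption: the first "previous delay" is the base
    schedule = []
    for attempt in range(5):
        previous = decorrelated_jitter(previous, base=1.0, rng=rng)
        schedule.append(round(previous, 2))
    print(f"client {client} waits: {schedule}")
```

Each client ends up with its own schedule, so retries that started in lockstep drift apart after the first attempt. Note that the second formula as written has no upper bound, so in practice you would likely clamp it to a maximum delay.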
Rule of thumb for max retries: calculate the total worst-case delay before giving up. With base=1s, 5 retries give a maximum total wait of 1+2+4+8+16 = 31 seconds, not counting the time the requests themselves take. For user-facing requests, 3 retries (about 7s in total) is usually the limit before you should fail and show an error. For async background jobs, 5-8 retries with a cap of 60s between attempts is reasonable.
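Here is a rough sketch of that budget arithmetic, plus a retry loop that respects a retry limit and a per-attempt cap. The names worst_case_total and call_with_retries are invented for this example. The loop retries on any exception purely to keep the sketch short; real code would normally retry only errors it knows to be transient, such as a 503.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def worst_case_total(retries: int, base: float = 1.0,
                     cap: float = float("inf")) -> float:
    """Sum of the un-jittered waits base * 2^n for n in 0..retries-1, each capped."""
    return sum(min(cap, base * (2 ** n)) for n in range(retries))


def call_with_retries(operation: Callable[[], T], max_retries: int = 3,
                      base: float = 1.0, cap: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying failures with capped, fully jittered backoff."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:  # assumption: every failure is treated as retryable
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the error to the caller
            ceiling = min(cap, base * (2 ** attempt))
            sleep(random.uniform(0.0, ceiling))
    raise AssertionError("unreachable")


print(worst_case_total(5))            # 31.0
print(worst_case_total(3))            # 7.0
print(worst_case_total(8, cap=60.0))  # 183.0
```

Because the loop draws each wait from between 0 and the capped ceiling, the totals printed above are upper bounds on the time spent sleeping, not typical values. The sleep parameter is injectable so the loop can be tested without actually waiting.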
Critical details people forget:
Most HTTP client libraries and cloud SDKs have built-in retry with backoff — AWS SDK, gRPC, and Axios retry plugins all support it. Don't hand-roll this unless you have a reason to. Configure what's already there.
