Daily Software Engineering: The Retry Pattern: Exponential Backoff and Jitter

The Retry Pattern: Exponential Backoff and Jitter

2026-05-03

When a network call fails, your first instinct is to retry it immediately. This is almost always wrong. If a service is struggling under load and 500 clients all retry instantly, you've just doubled the traffic hitting an already-sick system. This is called a retry storm, and it can turn a minor blip into a full outage.

The fix is exponential backoff: each retry waits longer than the last. A common formula is delay = base * 2^attempt. With a 1-second base, your retries fire at 1s, 2s, 4s, 8s, 16s. This gives the downstream service breathing room to recover.

But there's a subtlety. If 500 clients all start at the same moment, they'll all retry at 1s, then all at 2s — still synchronized. This is where jitter comes in. You randomize each delay so clients spread their retries across time. The two common approaches:

Full jitter: delay = random(0, base * 2^attempt) — simple and effective
Decorrelated jitter: delay = random(base, previous_delay * 3) — spreads retries even further

Here's a real-world example. Say your payment service calls a bank API that occasionally returns 503. Without backoff, a 200ms blip causes a retry storm that extends the outage to 30 seconds. With exponential backoff plus jitter, each client independently backs off, the bank recovers in under a second, and users barely notice.

Rule of thumb for max retries: calculate the total worst-case delay before giving up. With base=1s, 5 retries gives a max wait of 1+2+4+8+16 = 31 seconds. For user-facing requests, 3 retries (max ~7s) is usually the limit before you should fail and show an error. For async background jobs, 5-8 retries with a cap of 60s between attempts is reasonable.

Critical details people forget:

Only retry on transient errors. A 503 or timeout is worth retrying. A 400 Bad Request will fail every time — retrying it is wasteful and potentially dangerous.
Make the operation idempotent. If your retry causes a double charge, backoff won't save you. Use idempotency keys.
Set a retry budget. Some systems cap total retries across all requests (e.g., "no more than 10% of traffic should be retries"). This prevents cascading failures even when individual clients behave correctly.
Combine with circuit breakers. After enough failures, stop retrying entirely and fail fast. Let the circuit breaker decide when to probe again.

Most HTTP client libraries and cloud SDKs have built-in retry with backoff — AWS SDK, gRPC, and Axios retry plugins all support it. Don't hand-roll this unless you have a reason to. Configure what's already there.

See it in action: Check out Understanding Exponential Backoff: A Comprehensive Guide to Efficient Retries by Muhammad Daif to see this theory applied.

Key Takeaway: Always pair retries with exponential backoff and jitter — immediate retries on failure feel intuitive but amplify outages instead of recovering from them.

All newsletters