Daily Software Engineering: Rate Limiting and Backpressure: Protecting Your Systems from Themselves

2026-04-25

Every system has a breaking point. Rate limiting and backpressure are how you avoid reaching it, and how you degrade gracefully when you get close.

Rate limiting caps how many requests a client can make in a time window. Backpressure is the broader concept: when a downstream system is overwhelmed, it signals upstream to slow down. They're complementary — rate limiting is one tool for applying backpressure.

The token bucket algorithm is one of the most practical rate limiting approaches. Imagine a bucket that holds up to N tokens and refills at a steady rate. Each request costs one token. If the bucket is empty, the request is rejected (or queued). This naturally allows short bursts of up to N requests while enforcing the refill rate as the long-run average.
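
To make the mechanics concrete, here is a minimal single-threaded sketch of a token bucket. The class name, method names, and the injectable clock parameter are illustrative assumptions rather than any particular library's API, and a production version would need locking or an atomic shared store.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens and refills at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_rate = float(refill_rate)
        self._clock = clock              # injectable so the bucket can be tested without sleeping
        self._tokens = float(capacity)   # start full so an initial burst is allowed
        self._last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, never exceeding capacity.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)

    def try_acquire(self, cost=1.0):
        """Return True and spend `cost` tokens if available, otherwise return False."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Refilling lazily on each call avoids a background timer: the bucket only needs to know how much time has passed since it was last asked.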

Rule of thumb for sizing: set your burst size to 2-3x your expected per-second rate, and your refill rate to your sustainable throughput. If your API can handle 100 req/s sustainably, allow bursts of 200-300 but refill at 100/s. This accommodates real user behavior (bursty) without letting anyone hammer you.
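
The sizing rule can be checked with a little arithmetic. The sketch below uses hypothetical helper names and a simple fluid approximation (requests arrive at a constant rate, tokens are treated as continuous) to estimate how long a client can exceed the refill rate before the bucket runs dry.

```python
def bucket_params(sustainable_rps, burst_multiplier=2.0):
    """Rule-of-thumb sizing: refill at the sustainable rate, burst at 2-3x that rate."""
    return {
        "capacity": sustainable_rps * burst_multiplier,
        "refill_rate": float(sustainable_rps),
    }


def seconds_until_throttled(capacity, refill_rate, arrival_rate):
    """Approximate time a full bucket lasts under a constant arrival rate.

    The bucket drains at (arrival_rate - refill_rate) tokens per second,
    so it never empties if arrivals stay at or below the refill rate.
    """
    if arrival_rate <= refill_rate:
        return float("inf")
    return capacity / (arrival_rate - refill_rate)


params = bucket_params(100, burst_multiplier=3)
burst_window = seconds_until_throttled(params["capacity"], params["refill_rate"], 500)
```

Under these assumptions, a client sending 500 req/s against a 300-token bucket refilling at 100/s gets about 0.75 seconds of burst before it is held to 100 req/s.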

Real-world example: You run an order processing service. Your database can handle 500 writes/second. Without backpressure, a flash sale sends 5,000 req/s, your DB connection pool exhausts, queries start timing out, and every user gets errors, including the 500 per second who would otherwise have succeeded. With a rate limiter at the API gateway capping at 400 req/s (leaving headroom), about 400 requests per second succeed immediately, the remaining 4,600 or so get a 429 Too Many Requests with a Retry-After header, and nobody sees a 500 error.
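
The flash-sale numbers can be reproduced with a small, framework-free sketch. The function names are hypothetical, the simulation ignores refill within the second for simplicity, and the Retry-After value is an integer number of seconds derived from the token deficit.

```python
import math


def admit(tokens_available, refill_rate, cost=1.0):
    """Return (status, headers) for one request given the current bucket state."""
    if tokens_available >= cost:
        return 200, {}
    # Tell the client roughly how long until enough tokens have been refilled.
    deficit = cost - tokens_available
    retry_after = max(1, math.ceil(deficit / refill_rate))
    return 429, {"Retry-After": str(retry_after)}


def simulate_second(arrivals, limit):
    """Count outcomes for `arrivals` requests hitting a bucket holding `limit` tokens."""
    tokens = float(limit)
    counts = {200: 0, 429: 0}
    for _ in range(arrivals):
        status, _headers = admit(tokens, refill_rate=limit)
        if status == 200:
            tokens -= 1.0
        counts[status] += 1
    return counts


result = simulate_second(arrivals=5000, limit=400)  # {200: 400, 429: 4600}
```

The point of the sketch is that the rejected requests fail fast and cheaply at the gateway, so the database never sees more than the capped rate.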

Where to apply rate limiting:

Per-client — prevent any single consumer from monopolizing resources. Key by API key, user ID, or IP.
Per-endpoint — expensive operations (search, reports) get tighter limits than cheap ones (health checks).
Global — protect the system's overall capacity regardless of who's calling.
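
One way to combine the three scopes is to check a global, a per-client, and a per-endpoint bucket together and spend tokens only if all of them allow the request. The sketch below is single-threaded; the class names and the (capacity, refill_rate) tuples are illustrative assumptions, and the per-endpoint bucket is assumed to be shared across all clients.

```python
import time


class Bucket:
    """Compact token bucket with separate check and spend steps."""

    def __init__(self, capacity, rate, clock):
        self.capacity, self.rate, self.clock = float(capacity), float(rate), clock
        self.tokens, self.last = float(capacity), clock()

    def available(self):
        # Refill lazily based on elapsed time, then report the current token count.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        return self.tokens

    def take(self):
        self.tokens -= 1.0


class LayeredLimiter:
    def __init__(self, global_limit, client_limit, endpoint_limits, clock=time.monotonic):
        self.clock = clock
        self.global_bucket = Bucket(*global_limit, clock)
        self.client_limit = client_limit          # applied to each client key separately
        self.endpoint_limits = endpoint_limits    # only listed endpoints get their own bucket
        self.client_buckets = {}                  # grows per key; real code would evict idle entries
        self.endpoint_buckets = {}

    def allow(self, client_id, endpoint):
        buckets = [self.global_bucket]
        if client_id not in self.client_buckets:
            self.client_buckets[client_id] = Bucket(*self.client_limit, self.clock)
        buckets.append(self.client_buckets[client_id])
        if endpoint in self.endpoint_limits:
            if endpoint not in self.endpoint_buckets:
                self.endpoint_buckets[endpoint] = Bucket(*self.endpoint_limits[endpoint], self.clock)
            buckets.append(self.endpoint_buckets[endpoint])
        # Check every scope first so a rejected request consumes nothing anywhere.
        if all(b.available() >= 1.0 for b in buckets):
            for b in buckets:
                b.take()
            return True
        return False
```

Checking before spending matters: if the global token were consumed before the per-client check failed, one abusive client could drain capacity that belongs to everyone else.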

Backpressure patterns beyond rate limiting:

Queue depth limits — reject work when queues exceed a threshold rather than letting memory grow unbounded.
Load shedding — under extreme load, drop low-priority requests entirely to preserve capacity for critical ones.
Adaptive concurrency — dynamically adjust the number of in-flight requests based on observed latency (libraries like Netflix's concurrency-limits do this automatically).
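
These patterns can be sketched in a few lines each. The first class below combines a queue depth limit with priority-based load shedding; the second adjusts a concurrency limit with a simple additive-increase, multiplicative-decrease rule. All names and thresholds are hypothetical, and the AIMD rule is a stand-in for illustration, not the algorithm used by Netflix's concurrency-limits library.

```python
import queue


class BoundedWorkQueue:
    """Rejects work instead of growing without bound; sheds non-critical work early."""

    def __init__(self, max_depth, shed_threshold):
        self._q = queue.Queue(maxsize=max_depth)
        self._shed_threshold = shed_threshold

    def submit(self, item, critical=False):
        # Load shedding: past the threshold, only critical work is admitted.
        if not critical and self._q.qsize() >= self._shed_threshold:
            return False
        try:
            self._q.put_nowait(item)   # raises queue.Full at max_depth
            return True
        except queue.Full:
            return False

    def depth(self):
        return self._q.qsize()


class AimdLimit:
    """Adjusts an in-flight request limit from observed latency samples."""

    def __init__(self, initial, min_limit, max_limit, latency_target):
        self.limit = initial
        self.min_limit, self.max_limit = min_limit, max_limit
        self.latency_target = latency_target  # seconds

    def on_sample(self, latency):
        if latency > self.latency_target:
            # Slow response: halve the limit to back off quickly.
            self.limit = max(self.min_limit, int(self.limit * 0.5))
        else:
            # Healthy response: probe for more capacity one slot at a time.
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit
```

In both cases the caller gets an immediate answer (accepted or not, current limit) and can turn a refusal into a fast, explicit rejection rather than a slow timeout.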

Common mistakes: The first is implementing rate limiting only at the edge and not between internal services, which means Service A can still crush Service B during a retry storm. The second is forgetting to return meaningful responses: a 429 with Retry-After: 2 lets well-behaved clients self-throttle, while a connection timeout causes thundering herd retries that make everything worse.
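
On the client side, being well-behaved mostly means computing a sensible wait before retrying. The sketch below is one possible policy, with hypothetical parameter names and defaults: honor Retry-After when it is given in seconds, and otherwise fall back to capped exponential backoff with full jitter so that clients do not retry in lockstep.

```python
import random


def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Return how many seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After as a number of seconds, e.g. "2".
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch; fall through to backoff
    # Exponential backoff capped at `cap`, with full jitter to spread retries out.
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

For example, `retry_delay(0, "2")` waits the 2 seconds the server asked for, while a response with no header on the fourth attempt waits a random time of up to 4 seconds.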

See it in action: Check out 🔥 How Rate Limiting and Throttling Saves Your API Server From CRASHING! by ByteMonk to see this theory applied.

Key Takeaway: Under overload, a system that rejects excess load cleanly will outperform one that accepts everything and collapses. Design for overload from day one, not after your first outage.
