2026-06-04
Plain round robin assumes every backend is identical. That assumption breaks the moment you have a mixed fleet: a 16-core box next to an 8-core box, a fresh server next to one running a memory-hungry sidecar, or a canary that should only see 5% of traffic. Weighted round robin (WRR) fixes this by assigning each backend an integer weight and distributing requests in proportion.
The naive implementation expands the list: weights {A:3, B:2, C:1} become the cycle [A, A, A, B, B, C]. Simple, but bursty: A gets three requests in a row before B sees anything. At 1000 RPS of short requests that's fine, because each burst is over within a few milliseconds; for long-lived connections or expensive requests, the bursts cause uneven queue depths.
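To make the burstiness concrete, here is a minimal sketch of the list-expansion approach. The function name expanded_cycle is made up for illustration; the weights are the ones from the paragraph above.

```python
from itertools import cycle, islice


def expanded_cycle(weights):
    """Repeat each backend name `weight` times, in insertion order."""
    return [name for name, weight in weights.items() for _ in range(weight)]


weights = {"A": 3, "B": 2, "C": 1}
slots = expanded_cycle(weights)

# Walk the expanded list forever; take the first 8 picks to show the wrap-around.
first_eight = list(islice(cycle(slots), 8))
print(first_eight)  # ['A', 'A', 'A', 'B', 'B', 'C', 'A', 'A']
```

Each backend's picks arrive as one contiguous block, which is exactly the burst pattern described above.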
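The fix is smooth weighted round robin, the algorithm Nginx uses. Each server tracks a current weight that starts at 0. On each request:
1. Add each server's configured weight to its current weight.
2. Pick the server with the highest current weight.
3. Subtract the total of all configured weights from the picked server's current weight.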
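With weights {A:5, B:1, C:1} (total 7), the picks over 7 requests are: A, A, B, A, C, A, A. The ratio is the same 5:1:1, but B and C are served at requests 3 and 5 instead of waiting behind a block of five A's. A still gets back-to-back picks, which is unavoidable when it owns 5 of every 7 slots, but its turns are spread across the cycle, and every current weight is back at 0 after the seventh request.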
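A minimal sketch of the smooth variant, assuming the usual formulation: on every request, add each server's configured weight to its current weight, pick the highest, then subtract the total weight from the winner. The class name SmoothWRR and the tie-breaking rule (earliest backend in insertion order wins) are choices made for this sketch, not something taken from a particular load balancer's source.

```python
class SmoothWRR:
    """Smooth weighted round robin over a fixed set of backends."""

    def __init__(self, weights):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be a non-empty mapping of positive integers")
        self.weights = dict(weights)
        self.total = sum(self.weights.values())
        self.current = {name: 0 for name in self.weights}

    def pick(self):
        # Step 1: every backend earns its configured weight.
        for name, weight in self.weights.items():
            self.current[name] += weight
        # Step 2: highest current weight wins; max() keeps the first
        # maximal key, so ties go to the earlier backend.
        chosen = max(self.current, key=self.current.get)
        # Step 3: the winner pays back the total weight.
        self.current[chosen] -= self.total
        return chosen


balancer = SmoothWRR({"A": 5, "B": 1, "C": 1})
sequence = [balancer.pick() for _ in range(7)]
print(sequence)  # ['A', 'A', 'B', 'A', 'C', 'A', 'A']
```

The only state is one integer per backend, so the picker stays cheap even for large fleets. A real deployment would also need locking or per-worker instances, which this sketch leaves out.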
Real example. A team I worked with had three API pods: two on m5.2xlarge (8 vCPU) and one canary on m5.large (2 vCPU). Plain round robin pushed the canary to 90% CPU while the big boxes idled at 30%. Setting weights to {big1:4, big2:4, canary:1} dropped canary CPU to 60% and lifted the others to 55% — utilization within 5 points across the fleet.
Rule of thumb for picking weights. Start with the ratio of usable capacity, not raw specs. If pod A handles 400 RPS at p99 SLO and pod B handles 100 RPS, use 4:1 — not 2:1 just because A has twice the cores. Measure under load; CPU count lies when you have I/O-bound work, GC pauses, or noisy neighbors.
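One way to turn measured capacity into weights is to round each measurement to a coarse step and reduce to the smallest integer ratio. This is only a sketch: the function name weights_from_capacity and the step parameter (rounding granularity in RPS) are hypothetical, and the input numbers are the ones from the paragraph above.

```python
from functools import reduce
from math import gcd


def weights_from_capacity(rps_at_slo, step=50):
    """Round each measured capacity to the nearest `step` RPS,
    then divide by the GCD to get the smallest integer ratio."""
    units = {name: max(1, round(rps / step)) for name, rps in rps_at_slo.items()}
    divisor = reduce(gcd, units.values())
    return {name: u // divisor for name, u in units.items()}


print(weights_from_capacity({"A": 400, "B": 100}))  # {'A': 4, 'B': 1}
```

Rounding first keeps noisy measurements such as 410 and 95 RPS from producing unwieldy weights like 82:19; with the default step they also reduce to 4:1.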
What WRR doesn't solve. The weights are static, so WRR is oblivious to actual load: it gets no feedback from the backends. A server stuck on a slow query keeps getting its full share until something removes it from rotation. Pair WRR with health checks to catch crashed nodes, and consider weighted least connections when request cost varies widely. WRR is the right tool when capacity differs but per-request cost is roughly uniform, not when one endpoint takes 10 ms and another takes 10 s.
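For contrast, here is a rough sketch of weighted least connections in one common formulation: pick the backend with the fewest in-flight requests per unit of weight. The function name and the connection counts are invented for illustration, and real load balancers differ in how they break ties and track connections.

```python
def pick_weighted_least_conn(active, weights):
    """Return the backend with the lowest active-connections-to-weight ratio.
    min() keeps the first minimal key, so ties go to the earlier backend."""
    return min(weights, key=lambda name: active.get(name, 0) / weights[name])


# Hypothetical snapshot: big1 is bogged down with slow requests.
active = {"big1": 12, "big2": 3, "canary": 2}
weights = {"big1": 4, "big2": 4, "canary": 1}

choice = pick_weighted_least_conn(active, weights)
print(choice)  # big2
```

Unlike WRR, the decision here depends on live connection counts, so a backend that is slow to finish requests naturally receives fewer new ones.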
