ArXiv Paper Digest: Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

2026-05-01

Authors: Jin Xin Ng, Ori Livneh, Richard O'Grady, Josh Don

Here's a problem hiding in plain sight on every big server: modern CPUs have dozens or hundreds of cores, often split across multiple chiplets, each with its own chunk of cache. When you run several programs on the same machine, the Linux scheduler tries to keep all cores busy by spreading work around. That sounds reasonable — but it quietly destroys performance. Every time a program's threads get scattered across distant cores, they lose the warm data sitting in nearby caches, pollute each other's branch predictors, and trigger expensive cross-chip memory traffic. The CPU is "busy," but it's busy waiting on data.

Affinity Tailor tackles this by dynamically corralling each workload onto a compact set of nearby cores — keeping threads close to each other and to the data they've recently touched. The key insight is that this isn't a static assignment problem. Workloads grow, shrink, and compete for resources in real time, so the system needs to continuously adjust which cores each workload "owns" without introducing scheduling latency or starvation.
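To make "corralling" concrete, here is a minimal user-space sketch of restricting a process to a compact set of cores on Linux. This is an illustration only: the digest does not say how Affinity Tailor enforces placement, and the helper names parse_cpulist and pin_to_cpus are made up for this example. The only real APIs used are Python's os.sched_getaffinity and os.sched_setaffinity, which exist on Linux but not on other platforms.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to_cpus(pid, cpus):
    """Restrict `pid` (0 means the caller) to `cpus`, limited to CPUs it may already use.

    Hypothetical helper for illustration; Linux-only.
    """
    allowed = os.sched_getaffinity(pid)
    target = allowed & set(cpus)
    if not target:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(pid, target)
    return sorted(target)
```

On a Linux machine, pin_to_cpus(0, parse_cpulist("0-7")) would confine the calling process to CPUs 0 through 7, assuming it is allowed to run there. A static pin like this is exactly what the paper argues is not enough on its own: the set has to change as demand changes.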

The approach works in three parts:

Locality-aware grouping: The scheduler identifies which cores sit close together in the hardware topology (sharing a cache on the same chiplet, or memory on the same NUMA node) and tries to pack each workload's threads within one of these groups, maximizing cache reuse.
Dynamic resizing: As a workload's CPU demand changes, its core allocation expands or contracts, but always in units that respect the hardware topology — you grow into a neighboring cache domain, not a random core on the other side of the chip.
Fairness under contention: When multiple workloads compete, the system balances locality gains against fair CPU time, preventing any single workload from monopolizing a chiplet.
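The three parts above can be sketched as a toy allocator. This is not the paper's algorithm, just one simple way to combine the ideas, under stated assumptions: every cache domain has the same number of CPUs, the domains list is ordered so that neighbors in the list are neighbors in hardware, and fairness means handing out one domain per workload per round. The function name allocate_domains and its arguments are hypothetical.

```python
import math


def allocate_domains(domains, demand):
    """Assign whole cache domains to workloads (illustrative sketch only).

    domains: list of lists of CPU ids, one inner list per cache domain,
             assumed equal in size and ordered by physical proximity.
    demand:  dict of workload name -> number of CPUs it currently wants.
    Returns: dict of workload name -> list of CPU ids it "owns".
    """
    size = len(domains[0])
    # Grouping: round each demand up to whole cache domains.
    want = {w: math.ceil(d / size) for w, d in demand.items()}
    grant = {w: 0 for w in want}
    free = len(domains)

    # Fairness: one domain per hungry workload per pass, until demand
    # is met or the machine is full (max-min fair in domain units).
    while free > 0:
        hungry = [w for w in sorted(want) if grant[w] < want[w]]
        if not hungry:
            break
        for w in hungry:
            if free == 0:
                break
            grant[w] += 1
            free -= 1

    # Locality: give each workload a contiguous run of domains.
    owned, start = {}, 0
    for w in sorted(grant):
        chunk = domains[start:start + grant[w]]
        owned[w] = [cpu for dom in chunk for cpu in dom]
        start += grant[w]
    return owned


# Four 4-CPU cache domains on a hypothetical 16-CPU machine.
DOMAINS = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
light = allocate_domains(DOMAINS, {"a": 6, "b": 3})
heavy = allocate_domains(DOMAINS, {"a": 16, "b": 16})
```

Dynamic resizing here is just calling the function again with new demand figures. A real system would also try to keep each workload on the domains it already holds, so that resizing does not throw away warm caches; this sketch ignores that.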

The results are striking. On production-scale Google workloads running on large multi-chiplet machines, Affinity Tailor reduces cache misses and cross-chip traffic substantially, translating into meaningful throughput improvements — all without any application-level changes. The programs don't know anything changed; they just run faster because the scheduler stopped scattering their threads across the chip.

What makes this paper particularly compelling is that it addresses a problem that gets worse with every new CPU generation. As core counts climb and chiplet architectures become the norm (AMD's EPYC, Intel's tile-based Xeons), naive load balancing becomes an increasingly expensive default. This work shows that topology-aware scheduling isn't just a nice-to-have; it's becoming essential for extracting the performance you're already paying for.

Why it matters: As CPUs pack more cores across multiple chiplets, this work demonstrates that smarter, topology-aware thread scheduling can recover significant performance that today's default Linux scheduler silently wastes — without changing a single line of application code.
