2026-05-01
Here's a problem hiding in plain sight on every big server: modern CPUs have dozens or hundreds of cores, often split across multiple chiplets, each with its own chunk of cache. When you run several programs on the same machine, the Linux scheduler tries to keep all cores busy by spreading work around. That sounds reasonable, but it can quietly cost a lot of performance. Every time a program's threads get scattered across distant cores, they lose the warm data sitting in nearby caches, pollute each other's branch predictors, and trigger expensive cross-chip memory traffic. The CPU is "busy," but it's busy waiting on data.
Affinity Tailor tackles this by dynamically corralling each workload onto a compact set of nearby cores — keeping threads close to each other and to the data they've recently touched. The key insight is that this isn't a static assignment problem. Workloads grow, shrink, and compete for resources in real time, so the system needs to continuously adjust which cores each workload "owns" without introducing scheduling latency or starvation.
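To make "a compact set of nearby cores" concrete, here is a minimal Python sketch of the static version of the idea: group CPUs by the L3 cache they share, then pick cores for a workload while touching as few cache groups as possible. This is not the paper's mechanism, which adjusts the sets dynamically. The helper names (`parse_cpu_list`, `read_l3_groups`, `pick_compact_cores`, `pin_current_process`) are made up for illustration, and the sketch assumes a Linux machine where cache `index3` in sysfs is the shared L3, which is typical but not guaranteed.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_l3_groups():
    """Return one list of CPU ids per distinct shared cache (assumed to be L3).

    Assumption: cache/index3 is the L3 on this machine. Returns [] if the
    sysfs files are absent (for example, on non-Linux systems).
    """
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as f:
            groups.add(tuple(parse_cpu_list(f.read())))
    return [list(g) for g in sorted(groups)]


def pick_compact_cores(cache_groups, wanted):
    """Choose up to `wanted` CPUs while touching as few cache groups as possible.

    Greedy heuristic: take whole groups, largest first, and stop once
    enough CPUs have been collected.
    """
    chosen = []
    for group in sorted(cache_groups, key=len, reverse=True):
        if len(chosen) >= wanted:
            break
        chosen.extend(group[: wanted - len(chosen)])
    return chosen


def pin_current_process(cpus):
    """Restrict the calling process to `cpus` (Linux only)."""
    os.sched_setaffinity(0, set(cpus))
```

On a Linux box, `pin_current_process(pick_compact_cores(read_l3_groups(), 8))` would confine the current process to eight cores drawn from as few L3 domains as the heuristic can manage. A fixed pin like this is exactly the static assignment the paragraph above says is not enough on its own; it only illustrates what "nearby" means in terms of shared cache.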
The approach works in three parts:
The results are striking. On production-scale Google workloads running on large multi-chiplet machines, Affinity Tailor reduces cache misses and cross-chip traffic substantially, translating into meaningful throughput improvements — all without any application-level changes. The programs don't know anything changed; they just run faster because the scheduler stopped scattering their threads across the chip.
What makes this paper particularly compelling is that it addresses a problem that gets worse with every new CPU generation. As core counts climb and chiplet architectures become the norm (AMD's EPYC, Intel's tile-based Xeon designs), naive load balancing becomes an increasingly expensive default. This work shows that topology-aware scheduling isn't just a nice-to-have; it's becoming essential for extracting the performance you're already paying for.
