2026-05-03
Modern servers have dozens or even hundreds of CPU cores, and those cores aren't all created equal. They're grouped into clusters that share caches, memory controllers, and interconnects — think of it like offices on different floors of a building. Working with someone on your floor is fast; walking to another floor takes time. CPUs have the same problem: when a program's threads get scattered across distant cores, they lose the benefit of shared caches and start stepping on each other's toes.
Linux's default scheduler (CFS, succeeded by EEVDF in kernel 6.6) is work-conserving: it tries to keep every core busy. When a core goes idle, it pulls runnable threads from busier cores, wherever it can find some. This is great for utilization but terrible for locality, the property that a thread keeps running on cores near where its data already lives in cache. Every time a thread hops to a distant core, it pays a penalty: it has to warm up a cold cache, retrain the new core's branch predictor, and compete with whatever was already running there.
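To make the mechanism concrete: Linux gives every thread an affinity mask, and the scheduler only migrates a thread among the cores in that mask. The sketch below uses Python's `os.sched_getaffinity` and `os.sched_setaffinity` (Linux-only) to narrow the calling process to a chosen core set. The helper name `restrict_to` is made up for illustration; this shows the generic kernel interface, not necessarily how the paper applies its constraints.

```python
import os


def restrict_to(cpus):
    """Pin the calling process to `cpus` (hypothetical helper, Linux only).

    Only cores that are already in the current mask are kept, so this can
    narrow the set of cores the scheduler may use but never widen it.
    Returns the mask that is in effect afterwards.
    """
    allowed = os.sched_getaffinity(0)      # 0 means "the calling process"
    target = allowed & set(cpus)
    if not target:
        raise ValueError("none of the requested cores are currently allowed")
    os.sched_setaffinity(0, target)        # scheduler now stays within `target`
    return os.sched_getaffinity(0)
```

Once the mask is narrowed, idle cores outside it can no longer pull this process's threads, which trades some utilization for locality.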
Affinity Tailor tackles this by dynamically constraining where each workload is allowed to run. Instead of letting the scheduler scatter threads everywhere, it assigns each workload an affinity group — a subset of cores that are physically close together. The key insight is that this grouping isn't static. The system continuously monitors each workload's CPU demand and reshuffles the assignments as load changes, expanding a workload's core set when it needs more throughput and shrinking it when it doesn't.
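The resize-and-place loop can be sketched in a few lines. This is a toy model under stated assumptions, not the paper's algorithm: the 25% headroom factor, the greedy largest-first best-fit placement, and the names `size_for_demand` and `assign_groups` are all invented for illustration. Demand is measured in "cores' worth of CPU time", and `clusters` lists the core IDs that share a cache.

```python
import math
from typing import Dict, List


def size_for_demand(demand_cores: float, headroom: float = 1.25) -> int:
    """Cores to grant: observed demand plus headroom (assumed 25%), at least 1."""
    return max(1, math.ceil(demand_cores * headroom))


def assign_groups(demands: Dict[str, float],
                  clusters: List[List[int]]) -> Dict[str, List[int]]:
    """Give each workload a core set that spans as few clusters as possible."""
    free = [list(c) for c in clusters]          # remaining free cores per cluster
    groups: Dict[str, List[int]] = {}
    # Place the hungriest workloads first so they get the intact clusters.
    for name, demand in sorted(demands.items(), key=lambda kv: -kv[1]):
        need = size_for_demand(demand)
        picked: List[int] = []
        fitting = [c for c in free if len(c) >= need]
        if fitting:
            # Best fit: the tightest cluster that still holds the whole group.
            best = min(fitting, key=len)
            picked = best[:need]
            del best[:need]
        else:
            # No single cluster fits; spill across the emptiest clusters.
            for c in sorted(free, key=len, reverse=True):
                take = c[:need - len(picked)]
                picked.extend(take)
                del c[:len(take)]
                if len(picked) == need:
                    break
        groups[name] = sorted(picked)           # may be short if cores run out
    return groups
```

Calling `assign_groups` again with fresh demand measurements is what makes the grouping dynamic: with two four-core clusters and demands of 3.0 and 1.5 cores, the larger workload gets one whole cluster and the smaller gets two cores of the other; when the larger workload's demand falls, its group shrinks on the next call. A real system would also need to limit how often groups move, which this sketch ignores.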
The results are striking:
What makes this paper particularly compelling is that it comes from engineers working at scale (the author affiliations suggest Google-scale infrastructure). This isn't a simulation on a four-core laptop — it's about machines with hundreds of cores running mixed production workloads. The approach works with the existing Linux scheduler rather than replacing it, which makes it far more deployable than academic schedulers that require kernel rewrites.
The deeper lesson here is that as CPUs get wider — more cores, more chiplets, more NUMA domains — the scheduler's job shifts from "keep cores busy" to "keep work local." Raw utilization is no longer the bottleneck; memory hierarchy is.
