ArXiv Paper Digest: Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

2026-05-03

Authors: Jin Xin Ng, Ori Livneh, Richard O'Grady, Josh Don

Modern servers have dozens or even hundreds of CPU cores, and those cores aren't all created equal. They're grouped into clusters that share caches, memory controllers, and interconnects — think of it like offices on different floors of a building. Working with someone on your floor is fast; walking to another floor takes time. CPUs have the same problem: when a program's threads get scattered across distant cores, they lose the benefit of shared caches and start stepping on each other's toes.
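To make the "floors of a building" picture concrete, here is a minimal sketch of how one might discover which cores share a cache on a Linux machine. It is not taken from the paper. It assumes the kernel exposes cache topology under /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list and that index3 is the last-level cache, which is common but not guaranteed. The helper names (parse_cpulist, group_by_shared_cache, read_l3_lists) are made up for illustration.

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def group_by_shared_cache(shared_lists):
    """Collapse a {cpu: cpulist string} mapping into distinct cache-sharing groups.

    Returns a sorted list of sorted CPU-id lists, one per group.
    """
    groups = {frozenset(parse_cpulist(s)) for s in shared_lists.values()}
    return sorted(sorted(g) for g in groups)


def read_l3_lists(root="/sys/devices/system/cpu"):
    """Read each CPU's shared_cpu_list for cache index3 (assumed here to be L3)."""
    out = {}
    for p in Path(root).glob("cpu[0-9]*/cache/index3/shared_cpu_list"):
        cpu = int(p.parts[-4][3:])  # directory name "cpu12" -> 12
        out[cpu] = p.read_text()
    return out


if __name__ == "__main__":
    # On a Linux host this prints one list of CPU ids per shared cache.
    print(group_by_shared_cache(read_l3_lists()))
```

Each group that comes back is one "floor": threads placed within a single group can share that cache, while threads split across groups cannot.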

The standard Linux scheduler (CFS) is designed to keep every core busy. When a core goes idle, it pulls work from wherever it can find some. This is great for utilization but terrible for locality, the property that a thread keeps running on cores near where its data already lives in cache. Every time a thread hops to a distant core, the hop carries a cost: the thread has to warm up a cold cache, it pollutes the new core's branch predictor, and it competes with whatever was already running there.

Affinity Tailor tackles this by dynamically constraining where each workload is allowed to run. Instead of letting the scheduler scatter threads everywhere, it assigns each workload an affinity group — a subset of cores that are physically close together. The key insight is that this grouping isn't static. The system continuously monitors each workload's CPU demand and reshuffles the assignments as load changes, expanding a workload's core set when it needs more throughput and shrinking it when it doesn't.
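The expand-and-shrink idea can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions, not the paper's algorithm. It assumes demand is measured in "cores' worth" of CPU time, uses a made-up 25% headroom factor, and fills one locality domain completely before spilling into the next. The function names are hypothetical. The only real system call involved is os.sched_setaffinity, which is Linux-only.

```python
import math
import os


def target_cores(demand, headroom=1.25):
    """Turn measured CPU demand (in cores' worth of runtime) into a core count.

    The headroom factor is an illustrative assumption, not a value from the paper.
    """
    return max(1, math.ceil(demand * headroom))


def pick_cores(domains, n):
    """Choose up to n cores, filling each locality domain before using the next.

    domains: list of lists of CPU ids; each inner list shares a cache.
    """
    chosen = []
    for dom in domains:
        for cpu in dom:
            if len(chosen) == n:
                return chosen
            chosen.append(cpu)
    return chosen


def apply_affinity(pid, cpus):
    """Restrict a process to the given cores (Linux only)."""
    os.sched_setaffinity(pid, set(cpus))


def retailor(pid, domains, demand):
    """One control-loop step: resize the workload's core set to match demand."""
    cpus = pick_cores(domains, target_cores(demand))
    apply_affinity(pid, cpus)
    return cpus
```

A monitoring loop would call retailor periodically with a fresh demand estimate. With two four-core domains, a demand of 3.0 yields four cores, all in the first domain, and the set only spills into the second domain once demand exceeds what one domain can supply.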

The reported benefits fall into three areas:

Cache hit rates go up because threads stay near their data instead of bouncing around the chip.
Cross-chiplet traffic drops, which matters enormously on modern AMD and Intel server processors, where crossing a chiplet boundary carries a steep latency cost.
Tail latency improves because workloads interfere with each other less — each one mostly stays in its own neighborhood.

What makes this paper particularly compelling is that it comes from engineers working at scale (the author affiliations suggest Google-scale infrastructure). This isn't a simulation on a four-core laptop — it's about machines with hundreds of cores running mixed production workloads. The approach works with the existing Linux scheduler rather than replacing it, which makes it far more deployable than academic schedulers that require kernel rewrites.

The deeper lesson here is that as CPUs get wider — more cores, more chiplets, more NUMA domains — the scheduler's job shifts from "keep cores busy" to "keep work local." Raw utilization is no longer the bottleneck; memory hierarchy is.

Why it matters: As server chips grow ever wider and more fragmented across chiplets, this work shows that dynamically pinning workloads to nearby cores can recover significant performance lost to cache thrashing and cross-chip traffic — without replacing the Linux scheduler.
