ArXiv Paper Digest: Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

2026-05-02

Authors: Jin Xin Ng, Ori Livneh, Richard O'Grady, Josh Don

Here's the problem: modern servers have many CPU cores, often spread across multiple chiplets on a single processor. When you run several programs on such a machine, the Linux scheduler tries to be helpful by spreading work across the available cores to keep them busy. But this "helpfulness" backfires. Every time a program's threads get scattered to distant cores, they lose the warm state they had built up in nearby caches, branch predictors, and prefetchers. Worse, on chiplet-based designs (think AMD EPYC or recent Intel Xeons), hopping between chiplets means crossing last-level cache (LLC) boundaries: the data a thread had cached now sits on a different piece of silicon.

Affinity Tailor attacks this by dynamically constraining where each workload runs. Instead of letting the scheduler fling threads wherever there's a free core, it groups each workload's threads onto a compact set of cores that share cache hierarchy. The key ideas:

Dynamic, not static: Rather than manually pinning processes to cores (which is brittle and doesn't adapt to changing load), Affinity Tailor continuously monitors workload behavior and adjusts core assignments on the fly.
Locality-aware grouping: It understands the physical topology (which cores share an L3 cache, which sit on the same chiplet) and uses that knowledge to keep related threads close together in the memory hierarchy.
Scalability: This isn't a research prototype for 8-core laptops. The paper targets large-scale production systems where dozens or hundreds of cores are shared by multiple services, which is exactly the environment where cache thrashing and cross-chiplet interference hurt most.
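
To make the locality-aware grouping idea concrete, here is a minimal Python sketch. It is not the paper's algorithm, which this digest does not spell out. It assumes you already have, for each CPU, the list of CPUs sharing its L3 cache, and it greedily picks cores for a workload so that they span as few LLC domains as possible. The function names (parse_cpu_list, group_by_llc, pick_compact_cores) and the best-fit policy are hypothetical, chosen only for illustration.

```python
def parse_cpu_list(text):
    """Parse a Linux-style CPU list such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_llc(shared_cpu_lists):
    """Collapse per-CPU 'shares an L3 with' strings into distinct LLC domains.

    shared_cpu_lists: dict mapping cpu id -> CPU list string for that CPU's L3.
    Returns a sorted list of domains, each a sorted list of cpu ids.
    """
    domains = {tuple(parse_cpu_list(text)) for text in shared_cpu_lists.values()}
    return sorted(list(d) for d in domains)


def pick_compact_cores(domains, free, want):
    """Pick up to `want` free CPUs while touching as few LLC domains as possible.

    Hypothetical policy: if one domain can hold the whole request, use the
    tightest such domain (best fit); otherwise fill the emptiest domains first.
    """
    avail = [[c for c in d if c in free] for d in domains]
    fitting = [a for a in avail if len(a) >= want]
    if fitting:
        return min(fitting, key=len)[:want]
    chosen = []
    for a in sorted(avail, key=len, reverse=True):
        chosen.extend(a[: want - len(chosen)])
        if len(chosen) == want:
            break
    return chosen


# Toy topology: two LLC domains of four CPUs each (an assumption for the demo).
shared = {cpu: ("0-3" if cpu < 4 else "4-7") for cpu in range(8)}
domains = group_by_llc(shared)
free = {1, 2, 4, 5, 6, 7}
print(pick_compact_cores(domains, free, 3))
```

On a real Linux machine, the per-CPU strings would typically come from /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list (the level file in the same directory confirms which index is the L3), and the chosen set could be applied with os.sched_setaffinity(pid, cpus). A production system would more likely work through cgroup cpusets or the scheduler itself.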

The practical impact is straightforward: workloads get better cache hit rates, less cross-chiplet traffic, and reduced interference from neighbors. This translates directly to lower latency and higher throughput without adding any hardware. You're just using what you already have more intelligently.

What makes this paper stand out from prior work on NUMA-aware or topology-aware scheduling is the emphasis on being dynamic at production scale. Static affinity masks are a well-known trick, but they break down when workloads fluctuate or when you're running mixed services on shared infrastructure (which is basically every cloud and hyperscale environment today). Affinity Tailor bridges the gap between "let the OS figure it out" and "manually tune everything," which is exactly the kind of systems work that quietly makes everything faster without anyone noticing.
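
The difference between a static mask and a dynamic one can be sketched in a few lines of Python. This is purely illustrative and not taken from the paper: the function name next_core_count, the utilization thresholds, and the 25% step size are all assumptions. The idea is that a controller periodically looks at how busy a workload's current core set is and grows or shrinks it, with a dead band in the middle so the mask does not flap.

```python
def next_core_count(current, utilization,
                    grow_above=0.85, shrink_below=0.50,
                    min_cores=1, max_cores=64):
    """Return the core count for the next interval.

    current:     number of cores the workload is confined to now
    utilization: average busy fraction (0.0 to 1.0) of those cores
    All thresholds and limits are hypothetical defaults.
    """
    step = max(1, current // 4)  # move by roughly 25%, at least one core
    if utilization > grow_above:
        proposed = current + step
    elif utilization < shrink_below:
        proposed = current - step
    else:
        proposed = current  # inside the dead band: leave the mask alone
    return max(min_cores, min(max_cores, proposed))


# A workload on 8 cores that becomes busy, then idle.
print(next_core_count(8, 0.95))
print(next_core_count(8, 0.30))
```

A static mask is the special case where this function always returns current. The new count would then feed a topology-aware placement step that decides which cores to add or release.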

Why it matters: As servers grow wider with more cores and chiplets, naive scheduling leaves significant performance on the table. Affinity Tailor shows how dynamic, topology-aware thread placement can reclaim that lost efficiency at production scale without hardware changes.