2026-05-02
Here's the problem: modern servers have tons of CPU cores, often spread across multiple chiplets on a single processor. When you run several programs on such a machine, the Linux scheduler tries to be helpful by spreading work across all available cores to keep everything busy. But this "helpfulness" backfires. Every time a program's threads get scattered to distant cores, they leave behind the warm state built up in the caches, branch predictors, and prefetchers of the cores they were running on. Worse, on chiplet-based designs (think AMD EPYC or recent Intel Xeons), hopping between chiplets means crossing last-level cache (LLC) boundaries: your carefully cached data is now on a completely different chunk of silicon.
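To make the LLC boundaries concrete, here is a minimal sketch of how you could discover which cores share a last-level cache on a Linux box. It reads the kernel's `shared_cpu_list` files under sysfs. The helper names (`parse_cpu_list`, `read_llc_siblings`, `llc_domains`) are made up for this post, and the sketch assumes that cache `index3` is the L3/LLC, which is common on x86 servers but not guaranteed. None of this comes from the paper.

```python
import os

SYSFS_CPU = "/sys/devices/system/cpu"


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single CPU like '5' has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def read_llc_siblings(cpu):
    """Return the raw shared_cpu_list for this CPU's L3, or None if unreadable.

    Assumption: index3 is the last-level cache on this machine.
    """
    path = f"{SYSFS_CPU}/cpu{cpu}/cache/index3/shared_cpu_list"
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


def llc_domains(cpus, reader=read_llc_siblings):
    """Group CPUs into LLC domains, returned as a sorted list of CPU lists."""
    domains = set()
    for cpu in cpus:
        raw = reader(cpu)
        if raw is not None:
            domains.add(tuple(parse_cpu_list(raw)))
    return sorted(list(d) for d in domains)


if __name__ == "__main__":
    # Prints nothing on systems without the sysfs cache topology files.
    for i, dom in enumerate(llc_domains(range(os.cpu_count() or 1))):
        print(f"LLC domain {i}: {dom}")
```

On a chiplet-based part you would expect several domains, one per group of cores sharing an L3; a thread that migrates from one domain to another is the expensive case described above.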
Affinity Tailor attacks this by dynamically constraining where each workload runs. Instead of letting the scheduler fling threads wherever there's a free core, it groups each workload's threads onto a compact set of cores that share a cache hierarchy. The key ideas:
The practical impact is straightforward: workloads get better cache hit rates, less cross-chiplet traffic, and reduced interference from neighbors. This translates directly to lower latency and higher throughput without adding any hardware; you're just using what you already have more intelligently.
What makes this paper stand out from prior work on NUMA-aware or topology-aware scheduling is the emphasis on being dynamic at production scale. Static affinity masks are a well-known trick, but they break down when workloads fluctuate or when you're running mixed services on shared infrastructure (which is basically every cloud and hyperscale environment today). Affinity Tailor bridges the gap between "let the OS figure it out" and "manually tune everything," which is exactly the kind of systems work that quietly makes everything faster without anyone noticing.
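To see why a static mask falls short, here is a rough sketch of recomputing a compact core set as a workload's demand changes. This is a simple greedy heuristic written for illustration, not the paper's algorithm: `pick_compact_cpus` and `apply_affinity` are hypothetical names, the LLC domains are passed in as plain lists, and the sketch ignores cores already claimed by other workloads. The only real API used is Python's `os.sched_setaffinity`, which is Linux-only.

```python
import os


def pick_compact_cpus(domains, demand):
    """Choose `demand` CPUs while touching as few LLC domains as possible.

    Greedy illustration: if one domain can hold the whole demand, use the
    smallest such domain; otherwise fill the largest domains first.
    """
    if demand <= 0:
        return []
    fitting = [d for d in domains if len(d) >= demand]
    if fitting:
        return sorted(min(fitting, key=len))[:demand]
    chosen = []
    for dom in sorted(domains, key=len, reverse=True):
        need = demand - len(chosen)
        if need <= 0:
            break
        chosen.extend(sorted(dom)[:need])
    return sorted(chosen)


def apply_affinity(pid, cpus):
    """Pin a process to `cpus`; return False where this is not possible."""
    if not cpus or not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(pid, set(cpus))
    return True


if __name__ == "__main__":
    # Two hypothetical LLC domains of four cores each.
    domains = [[0, 1, 2, 3], [4, 5, 6, 7]]
    for demand in (2, 4, 6):
        print(demand, pick_compact_cpus(domains, demand))
```

A static mask amounts to calling `apply_affinity` once and walking away. The dynamic version re-runs the selection whenever demand shifts, so a workload stays inside one LLC domain while it fits and only spills into a second one when it has to.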
