2026-06-02
Authors: Jianru Ding, Ryien Hosseini, Pouya Mahdi Gholami, Mingyuan Xiang
ArXiv: 2606.01839v1
When you ask an AI agent to do something — book a flight, debug code, research a topic — it doesn't answer in one shot. It thinks, calls a tool, reads the result, thinks again, calls another tool, and so on. Each of those "turns" is a separate trip through the language model, and the server has to schedule them alongside thousands of other users' turns.
Modern LLM servers split each turn into two phases that have very different performance profiles. Prefill is reading the prompt — compute-heavy, bursty. Decode is generating tokens one at a time — memory-bandwidth-heavy, long-running. A trick called disaggregation puts these phases on different GPUs so they don't step on each other. The catch: deciding whether to disaggregate a given turn requires knowing how long the decode will be, how much memory its context will eat, and whether the agent is about to call a tool. None of that is knowable when the turn arrives.
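To make that dependency concrete, here is a minimal sketch of a per-turn placement rule. It is not the paper's scheduler or any real server's API: the `TurnEstimate` fields, the two cost constants, and the `should_disaggregate` rule are hypothetical, chosen only to show that the decision consumes exactly the quantities that are unknown when the turn arrives.

```python
from dataclasses import dataclass


@dataclass
class TurnEstimate:
    """Inputs a placement decision would need (all names are hypothetical)."""
    decode_tokens: int        # length of the decode; unknown at arrival time
    context_tokens: int       # size of the context whose KV cache must be held
    tool_call_expected: bool  # whether the turn is expected to end in a tool call


# Illustrative constants only; not measurements from the paper or any system.
DECODE_MS_PER_TOKEN = 20.0        # assumed latency to generate one token
TRANSFER_MS_PER_CTX_TOKEN = 0.05  # assumed cost to move KV cache between GPUs


def should_disaggregate(est: TurnEstimate) -> bool:
    """Toy rule: send decode to a separate GPU only when the decode is long
    enough to amortize the cost of moving the context's KV cache."""
    # Assumption of this sketch: a turn that is about to pause for a tool
    # call is kept local, since it will stop generating shortly anyway.
    if est.tool_call_expected:
        return False
    transfer_ms = est.context_tokens * TRANSFER_MS_PER_CTX_TOKEN
    decode_ms = est.decode_tokens * DECODE_MS_PER_TOKEN
    return decode_ms > transfer_ms
```

Every field of `TurnEstimate` is something the server only learns after the turn has run, which is why a per-turn scheduler has to fill them in with guesses.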
So today's systems guess. They train predictors on past traffic to estimate these quantities. When the predictor is wrong — and with agentic workloads, it often is — the scheduler makes bad placement decisions and throughput suffers.
This paper proposes a refreshingly simple alternative: stop predicting, start observing. Instead of treating each turn as an isolated scheduling decision, treat the whole multi-turn conversation as the unit. By the time turn 3 arrives, you've already observed how turns 1 and 2 behaved for this specific conversation — their decode lengths, their tool-call patterns, their memory footprint. That observed history is a far better signal than any general-purpose predictor.
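The idea of using a conversation's own history as the signal can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the `ConversationHistory` class, its method names, the simple averaging, and the fallback default of 256 tokens are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnRecord:
    """What was actually observed for one finished turn (hypothetical schema)."""
    decode_tokens: int
    context_tokens: int
    called_tool: bool


class ConversationHistory:
    """Keeps per-conversation observations and summarizes them for the next turn."""

    def __init__(self, default_decode_tokens: int = 256):
        self._turns = defaultdict(list)
        # Fallback for a conversation's first turn, when nothing has been observed.
        self._default = default_decode_tokens

    def observe(self, conv_id: str, record: TurnRecord) -> None:
        """Record a finished turn for the given conversation."""
        self._turns[conv_id].append(record)

    def expected_decode_tokens(self, conv_id: str) -> float:
        """Average decode length of this conversation's earlier turns."""
        turns = self._turns.get(conv_id)
        if not turns:
            return float(self._default)
        return float(mean(t.decode_tokens for t in turns))

    def tool_call_rate(self, conv_id: str) -> float:
        """Fraction of this conversation's earlier turns that called a tool."""
        turns = self._turns.get(conv_id)
        if not turns:
            return 0.0
        return sum(t.called_tool for t in turns) / len(turns)


history = ConversationHistory()
history.observe("conv-1", TurnRecord(decode_tokens=100, context_tokens=2000, called_tool=True))
history.observe("conv-1", TurnRecord(decode_tokens=300, context_tokens=2600, called_tool=False))
```

When turn 3 of `conv-1` arrives, the scheduler can read `history.expected_decode_tokens("conv-1")` and `history.tool_call_rate("conv-1")` instead of asking a predictor trained on other users' traffic. A plain mean is the simplest possible summary; a real system could weight recent turns more heavily.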
It's a "stop trying to be clever, just look at what's happening" result — the kind of paper that makes you wonder why everyone was predicting in the first place.
