ArXiv Paper Digest: Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving

2026-06-02

Authors: Jianru Ding, Ryien Hosseini, Pouya Mahdi Gholami, Mingyuan Xiang

When you ask an AI agent to do something — book a flight, debug code, research a topic — it doesn't answer in one shot. It thinks, calls a tool, reads the result, thinks again, calls another tool, and so on. Each of those "turns" is a separate trip through the language model, and the server has to schedule them alongside thousands of other users' turns.

Modern LLM servers split each turn into two phases that have very different performance profiles. Prefill processes the whole prompt in one pass; it is compute-heavy and bursty. Decode generates output tokens one at a time; it is memory-bandwidth-heavy and long-running. A trick called disaggregation puts these phases on different GPUs so they don't step on each other. The catch: deciding whether to disaggregate a given turn requires knowing how long the decode will run, how much memory its context will eat, and whether the agent is about to call a tool. None of that is known when the turn arrives.
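To give a feel for "how much memory its context will eat", here is a back-of-the-envelope sketch of KV-cache size for a single sequence. The formula (one key and one value vector per token, per layer, per KV head) is the standard one for transformer decoders; the default model dimensions below are illustrative assumptions, not numbers from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in bytes for one sequence.

    The leading factor of 2 covers the key and the value stored for
    every token, layer, and KV head. All defaults are assumed values
    for a hypothetical model with 16-bit cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


per_token = kv_cache_bytes(1)        # 131072 bytes, i.e. 128 KiB per token
context_8k = kv_cache_bytes(8192)    # 2**30 bytes, i.e. 1 GiB for an 8K context
print(per_token, context_8k)
```

Under these assumed dimensions, an agent whose context grows by a few thousand tokens per tool result adds hundreds of megabytes of cache per turn, which is why the footprint matters for placement.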

So today's systems guess. They train predictors on past traffic to estimate these quantities. When the predictor is wrong — and with agentic workloads, it often is — the scheduler makes bad placement decisions and throughput suffers.

This paper proposes a refreshingly simple alternative: stop predicting, start observing. Instead of treating each turn as an isolated scheduling decision, treat the whole multi-turn conversation as the unit. By the time turn 3 arrives, you've already observed how turns 1 and 2 behaved for this specific conversation — their decode lengths, their tool-call patterns, their memory footprint. That observed history is a far better signal than any general-purpose predictor.
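A minimal sketch of what "observing" a conversation could look like in practice: keep a small record per finished turn and summarize it when the next turn arrives. The class names, fields, and choice of statistics here are hypothetical illustrations; the digest does not say which statistics the paper actually tracks.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TurnRecord:
    """What was observed for one completed turn (hypothetical fields)."""
    decode_tokens: int      # tokens generated in this turn
    context_tokens: int     # total context length at the end of the turn
    made_tool_call: bool    # whether the turn ended in a tool call


@dataclass
class ConversationHistory:
    """Running record of one conversation's past turns."""
    turns: list = field(default_factory=list)

    def observe(self, record):
        self.turns.append(record)

    def mean_decode_tokens(self):
        # None signals "nothing observed yet", e.g. on the first turn.
        if not self.turns:
            return None
        return sum(t.decode_tokens for t in self.turns) / len(self.turns)

    def tool_call_rate(self):
        if not self.turns:
            return None
        return sum(t.made_tool_call for t in self.turns) / len(self.turns)

    def latest_context_tokens(self):
        return self.turns[-1].context_tokens if self.turns else None


# One history per conversation id; ids are assumed to be supplied by the client.
histories = defaultdict(ConversationHistory)
histories["conv-1"].observe(TurnRecord(100, 1000, True))
histories["conv-1"].observe(TurnRecord(300, 1500, False))
print(histories["conv-1"].mean_decode_tokens())  # 200.0
print(histories["conv-1"].tool_call_rate())      # 0.5
```

By turn 3, the scheduler can read these numbers directly instead of asking a predictor trained on other users' traffic.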

The key insights:

Agentic workloads are sticky. A conversation that has been making lots of tool calls will probably keep doing so. A conversation with growing context tends to keep growing.
Observation beats prediction when the future resembles the recent past. The paper shows that simple per-conversation statistics outperform learned predictors because they capture the actual workload, not a population average.
Scheduling at the conversation level lets the system pin a whole conversation's KV cache to the right pool of GPUs once, instead of relitigating placement every turn.
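The pinning idea in the last point can be sketched as a toy scheduler: observe a conversation for a couple of turns, decide once, then reuse that decision. Everything specific here is an assumption made for illustration, including the pool names, the warm-up length, and the "long decodes go to the disaggregated pool" threshold rule; the paper's actual policy is not described in this digest.

```python
class ConversationScheduler:
    """Toy conversation-level placement (hypothetical policy)."""

    def __init__(self, decode_threshold=256, warmup_turns=2):
        self.decode_threshold = decode_threshold  # assumed cutoff, in tokens
        self.warmup_turns = warmup_turns          # turns to observe before pinning
        self.observed = {}                        # conv_id -> list of decode lengths
        self.pinned = {}                          # conv_id -> pool name

    def record_turn(self, conv_id, decode_tokens):
        """Store what actually happened once a turn finishes."""
        self.observed.setdefault(conv_id, []).append(decode_tokens)

    def place(self, conv_id):
        """Return the pool for this conversation's next turn."""
        # Once pinned, the decision is reused rather than revisited each turn.
        if conv_id in self.pinned:
            return self.pinned[conv_id]

        history = self.observed.get(conv_id, [])
        if len(history) < self.warmup_turns:
            # Not enough observations yet: use a default pool, unpinned.
            return "colocated"

        mean_decode = sum(history) / len(history)
        pool = "disaggregated" if mean_decode >= self.decode_threshold else "colocated"
        self.pinned[conv_id] = pool
        return pool


sched = ConversationScheduler()
first = sched.place("conv-1")       # "colocated": nothing observed yet
sched.record_turn("conv-1", 500)
sched.record_turn("conv-1", 700)
third = sched.place("conv-1")       # "disaggregated": observed mean is 600
sched.record_turn("conv-1", 10)
fourth = sched.place("conv-1")      # still "disaggregated": the pin holds
print(first, third, fourth)
```

A real system would also need a way to unpin when a conversation's behavior shifts; this sketch leaves that out to keep the pin-once idea visible.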

It's a "stop trying to be clever, just look at what's happening" result — the kind of paper that makes you wonder why everyone was predicting in the first place.

Why it matters: As AI agents move into production, the systems serving them are wasting compute on bad predictions when the data they need is already sitting in the conversation history.
