2026-05-11
When you ask an AI agent to "send this to my boss," the model has to pick the right tool — maybe send_email, maybe send_slack, maybe schedule_meeting. If it picks wrong, you don't find out until the email has already landed in the wrong inbox. By then it's too late. The researchers wanted to know: can we tell which tool the model is about to pick before it actually picks it?
Their answer, after probing 12 open-source instruction-tuned models ranging from tiny (270M parameters) to fairly large (27B) across the Gemma, Qwen, and Llama families: yes, and trivially so. The identity of the tool the model is going to call is encoded in its internal activations in a way so simple that a basic linear classifier can read it out. No fancy interpretability machinery required.
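To make "a basic linear classifier can read it out" concrete, here is a minimal sketch on synthetic data. It is not the paper's probe or its data: the activations are random vectors clustered around one made-up center per tool, the hidden size DIM is hypothetical, and the probe is a nearest-class-mean classifier, which is just one simple kind of linear classifier. In a real setup the vectors would be hidden states captured from the model just before it emits the tool call.

```python
import random

TOOLS = ["send_email", "send_slack", "schedule_meeting"]
DIM = 16  # hypothetical hidden size; real models use thousands of dimensions


def make_activations(rng, centers, n_per_tool, noise=0.5):
    """Synthetic stand-in for hidden states captured just before a tool call."""
    data = []
    for label, center in enumerate(centers):
        for _ in range(n_per_tool):
            data.append(([c + rng.gauss(0.0, noise) for c in center], label))
    return data


def fit_mean_probe(data, n_classes):
    """Fit a linear probe: score_k(x) = w_k . x + b_k, with w_k the class mean.

    With b_k = -|w_k|^2 / 2, taking the argmax of the scores is the same as
    picking the nearest class mean in Euclidean distance.
    """
    sums = [[0.0] * DIM for _ in range(n_classes)]
    counts = [0] * n_classes
    for x, y in data:
        counts[y] += 1
        for i, v in enumerate(x):
            sums[y][i] += v
    weights = [[s / counts[k] for s in sums[k]] for k in range(n_classes)]
    biases = [-0.5 * sum(w * w for w in wk) for wk in weights]
    return weights, biases


def predict(x, weights, biases):
    """Return the index of the tool with the highest linear score."""
    scores = [sum(w * v for w, v in zip(wk, x)) + b for wk, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
# One made-up cluster center per tool (an assumption of this toy setup).
centers = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in TOOLS]
train = make_activations(rng, centers, n_per_tool=100)
test = make_activations(rng, centers, n_per_tool=50)

weights, biases = fit_mean_probe(train, len(TOOLS))
accuracy = sum(predict(x, weights, biases) == y for x, y in test) / len(test)
print(f"probe accuracy on held-out synthetic activations: {accuracy:.2f}")
```

The point of the sketch is only the shape of the method: collect labeled activations, fit one weight vector per tool, and check accuracy on held-out examples. High accuracy here says nothing about real models, because the synthetic clusters were built to be separable.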
Even more striking, they can steer the choice. Here's the trick: take the average internal activation pattern when the model picks tool A, take the average when it picks tool B, and subtract the first from the second to get a difference vector (B minus A). Now add that difference vector to the model's internal state on a fresh prompt, and the model flips from picking A to picking B. The flip works 77 to 100% of the time on simple single-turn prompts, and 93%+ of the time in some setups. You are literally nudging the model's "mind" toward a different tool by adding a single vector.
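The recipe above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the paper's setup: the "model" is a hypothetical linear readout (READOUT) that turns a hidden state into one logit per tool, the hidden states are synthetic, and DIM, pick_tool, and sample_hidden are names invented for the example. With a real model, the vector would be added to the residual stream at a chosen layer during the forward pass.

```python
import random

TOOLS = ["send_email", "send_slack"]
DIM = 8  # hypothetical hidden size for the toy example

# Hypothetical readout: tool k's logit is the dot product of the hidden state with row k.
READOUT = [[1.0 if i == k else 0.0 for i in range(DIM)] for k in range(len(TOOLS))]


def pick_tool(hidden):
    """Toy stand-in for the model's tool choice: argmax over linear tool logits."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in READOUT]
    return max(range(len(logits)), key=logits.__getitem__)


def sample_hidden(rng, tool, scale=3.0, noise=0.5):
    """Synthetic hidden state generated so that the toy model favors `tool`."""
    return [scale * READOUT[tool][i] + rng.gauss(0.0, noise) for i in range(DIM)]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


rng = random.Random(0)
A, B = 0, 1  # steer from send_email (A) to send_slack (B)

# 1. Average the synthetic activations generated for tool A and for tool B.
mean_a = mean_vector([sample_hidden(rng, A) for _ in range(200)])
mean_b = mean_vector([sample_hidden(rng, B) for _ in range(200)])

# 2. The steering vector is the difference of the two means (B minus A).
steer = [b - a for a, b in zip(mean_a, mean_b)]

# 3. Add it to fresh hidden states on which the toy model currently picks A.
fresh = [h for h in (sample_hidden(rng, A) for _ in range(200)) if pick_tool(h) == A]
steered = [[h_i + s_i for h_i, s_i in zip(h, steer)] for h in fresh]

# 4. Count how often the choice flips from A to B.
flip_rate = sum(pick_tool(h) == B for h in steered) / len(fresh)
print(f"flipped {TOOLS[A]} -> {TOOLS[B]} on {flip_rate:.0%} of fresh prompts")
```

The toy flips almost every time because its readout is linear and its clusters are clean by construction. Real models are not guaranteed to behave this way, which is why the measured flip rates are the interesting part of the result.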
Why does this matter? A few reasons:
delete_user" before the call is materialized.
The flip side is sobering: if you can steer tool choice this easily with white-box access, so can an attacker who controls activations (think: a malicious fine-tune, a compromised inference server, or an adversarial prompt that nudges the residual stream). The same property that enables safety enables manipulation.
The deeper insight is conceptual. Tool calling has often been treated as a black-box emergent behavior. This paper says it's more like a switch — a direction in activation space — and switches can be inspected, monitored, and flipped.
