ArXiv Paper Digest: Tool Calling is Linearly Readable and Steerable in Language Models

2026-05-11

Authors: Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang

When you ask an AI agent to "send this to my boss," the model has to pick the right tool — maybe send_email, maybe send_slack, maybe schedule_meeting. If it picks wrong, you don't find out until the email has already landed in the wrong inbox. By then it's too late. The researchers wanted to know: can we tell which tool the model is about to pick before it actually picks it?

Their answer, after probing 12 open-source instruction-tuned models ranging from tiny (270M parameters) to fairly large (27B) across the Gemma, Qwen, and Llama families: yes, and trivially so. The identity of the tool the model is going to call is encoded in its internal activations in a way so simple that a basic linear classifier can read it out. No fancy interpretability machinery required.
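To make "a basic linear classifier can read it out" concrete, here is a minimal sketch of a linear probe. It is not the authors' code: the activations are synthetic Gaussian clusters standing in for real hidden states, the tool names are borrowed from the example above, and the probe is a plain least-squares fit to one-hot labels (one of several reasonable ways to train a linear readout).

```python
import numpy as np

TOOLS = ["send_email", "send_slack", "schedule_meeting"]


def make_synthetic_activations(n_per_tool=200, dim=32, seed=0):
    """Assumption: one Gaussian cluster per tool stands in for real hidden states."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(len(TOOLS), dim))
    X = np.concatenate(
        [c + 0.5 * rng.normal(size=(n_per_tool, dim)) for c in centers]
    )
    y = np.repeat(np.arange(len(TOOLS)), n_per_tool)
    return X, y


def fit_linear_probe(X, y, n_classes):
    """Least-squares fit from activations (plus a bias column) to one-hot labels."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])
    targets = np.eye(n_classes)[y]
    W, *_ = np.linalg.lstsq(X_aug, targets, rcond=None)
    return W


def probe_predict(W, X):
    """Predict the tool index with the highest linear score."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])
    return (X_aug @ W).argmax(axis=1)


X, y = make_synthetic_activations()

# Shuffle, then hold out a quarter of the examples for evaluation.
order = np.random.default_rng(1).permutation(len(X))
train_idx, test_idx = order[:450], order[450:]

W = fit_linear_probe(X[train_idx], y[train_idx], len(TOOLS))
probe_accuracy = (probe_predict(W, X[test_idx]) == y[test_idx]).mean()
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

On real models, `X` would be hidden states captured at some layer just before the tool name is generated, and `y` the tool the model went on to call. Which layer and token position work best is an empirical question this sketch does not address.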

Even more striking, they can steer the choice. Here's the trick: take the average internal activation pattern over prompts where the model picks tool A, take the average over prompts where it picks tool B, and subtract the first from the second. Add that difference vector to the model's internal state on a fresh prompt, and the model flips from picking A to picking B, at 77 to 100% accuracy on simple single-turn prompts, and 93%+ in some setups. You are literally nudging the model's "mind" toward a different tool by adding a single vector.
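The difference-of-means recipe can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the activations are synthetic, and `choose_tool` is a hypothetical nearest-center linear readout standing in for everything the model does downstream of the steered layer.

```python
import numpy as np


def steering_vector(acts_a, acts_b):
    """Mean activation for tool B minus mean activation for tool A."""
    return acts_b.mean(axis=0) - acts_a.mean(axis=0)


def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to every hidden state."""
    return hidden + alpha * vector


rng = np.random.default_rng(0)
dim = 32

# Assumption: each tool's activations form a Gaussian cluster around a center.
center_a, center_b = rng.normal(size=(2, dim))
acts_a = center_a + 0.5 * rng.normal(size=(200, dim))
acts_b = center_b + 0.5 * rng.normal(size=(200, dim))

# Hypothetical stand-in for the model's downstream choice: a fixed linear
# readout that picks whichever tool center the hidden state is closest to.
readout = np.stack([center_a, center_b])


def choose_tool(hidden):
    logits = hidden @ readout.T - 0.5 * (readout ** 2).sum(axis=1)
    return logits.argmax(axis=-1)  # 0 = tool A, 1 = tool B


vector = steering_vector(acts_a, acts_b)

# Fresh "prompts" that would normally lead to tool A.
fresh = center_a + 0.5 * rng.normal(size=(100, dim))
before = choose_tool(fresh)
after = choose_tool(steer(fresh, vector))

flip_rate = ((before == 0) & (after == 1)).mean()
print(f"flipped from A to B: {flip_rate:.0%}")
```

In a real model the addition would happen inside the forward pass at a chosen layer, and the scale `alpha` would likely need tuning. The toy readout here is separable by construction, so its flip rate says nothing about the rates reported in the paper.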

Why does this matter? A few reasons:

Pre-execution safety checks become possible. Instead of waiting for the model to emit a tool call and then validating it, a host system could read the upcoming choice from activations and intervene before any irreversible action fires.
Cheap, fast guardrails. Linear probes are essentially free to run compared to running a second LLM as a judge. You could imagine a production agent that flags "this looks like it's about to call delete_user" before the call is materialized.
It works across model families and sizes. This isn't a quirk of one architecture: it holds across the Gemma, Qwen, and Llama models tested, from 270M to 27B parameters. That points to a general property of how instruction-tuned models represent tool choice, and perhaps something fundamental about how these models organize their "intent."
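A pre-execution check built on such a probe might look like the sketch below. Everything here is hypothetical: the tool list, the `RISKY_TOOLS` deny-list, the threshold, and the hand-written probe weights are placeholders for a probe trained on real activations.

```python
import numpy as np

RISKY_TOOLS = {"delete_user"}  # hypothetical deny-list chosen by the host system


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def flag_risky_call(hidden, probe_weights, probe_bias, tool_names, threshold=0.5):
    """Return risky tools whose probe probability meets the threshold."""
    probs = softmax(hidden @ probe_weights + probe_bias)
    return {
        name: float(p)
        for name, p in zip(tool_names, probs)
        if name in RISKY_TOOLS and p >= threshold
    }


# Toy probe: each of three hidden dimensions votes for one tool.
tool_names = ["get_user", "update_user", "delete_user"]
probe_weights = np.eye(3)
probe_bias = np.zeros(3)

benign_hidden = np.array([4.0, 0.0, 0.0])  # leans toward get_user
risky_hidden = np.array([0.0, 0.0, 4.0])   # leans toward delete_user

benign_flags = flag_risky_call(benign_hidden, probe_weights, probe_bias, tool_names)
risky_flags = flag_risky_call(risky_hidden, probe_weights, probe_bias, tool_names)

print("benign:", benign_flags)
print("risky:", risky_flags)
```

The check costs one matrix-vector product per decision, which is why it is so much cheaper than calling a second LLM as a judge. What the host does with a flag (block, ask for confirmation, log) is a policy choice outside this sketch.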

The flip side is sobering: if you can steer tool choice this easily with white-box access, so can an attacker who can influence activations (think: a malicious fine-tune, a compromised inference server, or, less directly, an adversarial prompt that nudges the residual stream). The same property that enables safety enables manipulation.

The deeper insight is conceptual. Tool calling has often been treated as a black-box emergent behavior. This paper says it's more like a switch — a direction in activation space — and switches can be inspected, monitored, and flipped.

Why it matters: Agent tool choice is far less opaque than assumed, opening a path to cheap pre-execution safety checks but also a new attack surface for anyone with white-box access.
