ArXiv Paper Digest: SWE-chat: Coding Agent Interactions From Real Users in the Wild

SWE-chat: Coding Agent Interactions From Real Users in the Wild

2026-04-23

Authors: Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang

ArXiv: 2604.20779v1

AI coding agents — tools that can read your codebase, run commands, edit files, and iterate on solutions — are exploding in popularity. But here's the thing: almost everything we know about how well they work comes from benchmarks, not from watching real people use them on real projects. SWE-chat changes that.

The researchers built the first large-scale dataset of actual coding agent sessions collected from open-source developers using these tools in their day-to-day work. We're talking 6,000 sessions, over 63,000 user prompts, and 355,000 tool calls. This isn't a lab experiment — it's the messy reality of how developers interact with AI agents in the wild.
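To get a rough feel for what those totals mean per session, here is a back-of-the-envelope sketch. It assumes the three figures quoted above are rounded totals (the prompt count is stated as a lower bound), so the averages are approximate, and they are not statistics reported by the paper. The constant and function names are made up for illustration.

```python
# Rounded totals as quoted in this digest (assumption: treated as exact here).
SESSIONS = 6_000
USER_PROMPTS = 63_000   # the digest says "over 63,000", so this is a lower bound
TOOL_CALLS = 355_000


def per_unit(total: int, units: int) -> float:
    """Average of `total` spread over `units`, rounded to one decimal place."""
    return round(total / units, 1)


prompts_per_session = per_unit(USER_PROMPTS, SESSIONS)
tool_calls_per_session = per_unit(TOOL_CALLS, SESSIONS)
tool_calls_per_prompt = per_unit(TOOL_CALLS, USER_PROMPTS)

print(f"prompts per session:    ~{prompts_per_session}")
print(f"tool calls per session: ~{tool_calls_per_session}")
print(f"tool calls per prompt:  ~{tool_calls_per_prompt}")
```

On these rounded numbers, a typical session runs to roughly ten user prompts and several dozen tool calls. These are averages only; they say nothing about how session lengths are distributed.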

A few findings stand out:

People don't just ask simple questions. Real sessions involve multi-turn conversations where developers refine instructions, push back on suggestions, and guide the agent through complex tasks. The gap between "ask a question, get an answer" benchmarks and real usage is enormous.
A lot of agent output gets thrown away. Not everything the agent produces is useful, and the dataset captures which outputs developers actually accepted versus rejected. This is gold for understanding where current agents fall short.
The dataset is alive. The authors' collection pipeline continuously discovers and adds new sessions, so it grows over time rather than being a static snapshot.

The key insight is a familiar one in software engineering research: lab performance and field performance are different things. An agent might ace a benchmark where the task is clearly specified and self-contained, but struggle when a real developer gives vague instructions, changes their mind mid-task, or works in a messy codebase with unusual conventions. By studying what actually happens in practice, researchers and tool builders can focus on the problems that matter most to real users.

This is analogous to how web search engines improved dramatically once researchers started studying real query logs instead of just running information retrieval experiments. You can't fix what you can't see, and until now, we couldn't see how people actually use coding agents.

Why it matters: SWE-chat captures how real developers actually use AI coding agents, not just how those agents score on benchmarks. That gives the research community ground truth for building tools that work in practice, not just in the lab.
