2026-04-23
Authors: Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang
ArXiv: 2604.20779v1
AI coding agents — tools that can read your codebase, run commands, edit files, and iterate on solutions — are exploding in popularity. But here's the thing: almost everything we know about how well they work comes from benchmarks, not from watching real people use them on real projects. SWE-chat changes that.
The researchers built the first large-scale dataset of actual coding agent sessions collected from open-source developers using these tools in their day-to-day work. We're talking 6,000 sessions, over 63,000 user prompts, and 355,000 tool calls. This isn't a lab experiment — it's the messy reality of how developers interact with AI agents in the wild.
A few findings stand out:
The key insight is a familiar one in software engineering research: lab performance and field performance are different things. An agent might ace a benchmark where the task is clearly specified and self-contained, but struggle when a real developer gives vague instructions, changes their mind mid-task, or works in a messy codebase with unusual conventions. By studying what actually happens in practice, researchers and tool builders can focus on the problems that matter most to real users.
This is analogous to how web search engines improved dramatically once researchers started studying real query logs instead of just running information retrieval experiments. You can't fix what you can't see, and until now, we couldn't see how people actually use coding agents.
