SWE-chat: Coding Agent Interactions From Real Users in the Wild

2026-04-23

Authors: Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang

ArXiv: 2604.20779v1

AI coding agents — tools that can read your codebase, run commands, edit files, and iterate on solutions — are exploding in popularity. But here's the thing: almost everything we know about how well they work comes from benchmarks, not from watching real people use them on real projects. SWE-chat changes that.

The researchers built the first large-scale dataset of actual coding agent sessions collected from open-source developers using these tools in their day-to-day work. We're talking 6,000 sessions, over 63,000 user prompts, and 355,000 tool calls. This isn't a lab experiment — it's the messy reality of how developers interact with AI agents in the wild.
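To get a feel for the scale, here is a quick back-of-the-envelope sketch of what those totals imply per session. It uses only the rounded figures quoted above (the prompt count is given as "over 63,000"), so treat the ratios as rough estimates rather than statistics reported by the paper. The variable names are just for illustration.

```python
# Rounded totals quoted in the summary above. These are approximations,
# so the derived ratios are ballpark figures, not values from the paper.
sessions = 6_000
user_prompts = 63_000
tool_calls = 355_000

# Simple averages derived from the totals.
prompts_per_session = user_prompts / sessions
tool_calls_per_session = tool_calls / sessions
tool_calls_per_prompt = tool_calls / user_prompts

print(f"prompts per session:    {prompts_per_session:.1f}")     # 10.5
print(f"tool calls per session: {tool_calls_per_session:.1f}")  # 59.2
print(f"tool calls per prompt:  {tool_calls_per_prompt:.1f}")   # 5.6
```

In other words, a typical session in this rough picture involves about ten user prompts, with the agent making five or six tool calls for each one. These are averages over the whole dataset and say nothing about how widely individual sessions vary.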

A few findings stand out:

The key insight is a familiar one in software engineering research: lab performance and field performance are different things. An agent might ace a benchmark where the task is clearly specified and self-contained, but struggle when a real developer gives vague instructions, changes their mind mid-task, or works in a messy codebase with unusual conventions. By studying what actually happens in practice, researchers and tool builders can focus on the problems that matter most to real users.

This is analogous to how web search engines improved dramatically once researchers started studying real query logs instead of just running information retrieval experiments. You can't fix what you can't see, and until now, we couldn't see how people actually use coding agents.

Why it matters: By capturing how real developers actually use AI coding agents — not just how they perform on benchmarks — SWE-chat gives the research community ground truth for building tools that work in practice, not just in the lab.
