2026-05-23
Authors: Sien Reeve O. Peralta, Fumika Hoshi, Hironori Washizaki, Naoyasu Ubayashi
ArXiv: 2605.22534v1
AI coding agents are now opening pull requests against real open-source projects on their own. The obvious way to grade them is the same way we grade humans: did the PR get merged or rejected? This paper argues that scoreboard is misleading, and digs through the actual review conversations to find out why each PR was merged or rejected.
The researchers analyzed 11,048 closed agentic pull requests, narrowed to 9,799 that humans actually reviewed, and then hand-inspected 717 representative cases to reconstruct the reasoning behind each decision. That's a lot of manual labor, but it's the only way to get past the binary outcome and into the why.
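To put that funnel in proportion, here is a quick arithmetic sketch in Python. The three counts come from the summary above; the variable names, the `share` helper, and the derived percentages are only illustrative and are not figures quoted from the paper.

```python
# Counts as reported in the summary above.
closed_prs = 11_048       # closed agentic pull requests analyzed
human_reviewed = 9_799    # subset that humans actually reviewed
hand_inspected = 717      # representative cases inspected by hand


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Derived values (computed here, not quoted from the paper).
dropped_in_narrowing = closed_prs - human_reviewed
reviewed_share = share(human_reviewed, closed_prs)
inspected_share = share(hand_inspected, human_reviewed)

print(f"dropped when narrowing to human-reviewed PRs: {dropped_in_narrowing}")
print(f"human-reviewed share of closed PRs: {reviewed_share}%")
print(f"hand-inspected share of human-reviewed PRs: {inspected_share}%")
```

So roughly nine in ten closed PRs had a human review, and the hand-inspected sample covers about one in fourteen of those.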
Here's the key insight: merge and rejection labels are noisy signals of agent quality. A PR can get merged because a human reviewer cleaned it up extensively, or because the bar was low. It can get rejected for reasons completely unrelated to the agent's work — duplicate of an existing PR, project shifted direction, maintainer didn't have bandwidth, scope mismatch. Treating "merged = good agent, rejected = bad agent" lumps all these together and gives you a number that doesn't really measure capability.
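A small Python sketch can make the noise concrete. Everything in it is an assumption for illustration: the `Rationale` labels loosely paraphrase the reasons listed above (they are not the paper's taxonomy), `REJECTED_FOR_QUALITY` and `MERGED_AS_SUBMITTED` are added as the contrasting "clean" cases, and the sample data is made up.

```python
from dataclasses import dataclass
from enum import Enum


class Rationale(Enum):
    # Hypothetical labels, loosely following the reasons listed above.
    MERGED_AS_SUBMITTED = "merged as submitted"
    MERGED_AFTER_HUMAN_CLEANUP = "merged after extensive human cleanup"
    MERGED_LOW_BAR = "merged because the bar was low"
    REJECTED_FOR_QUALITY = "rejected because of the agent's work"
    REJECTED_DUPLICATE = "duplicate of an existing PR"
    REJECTED_DIRECTION_CHANGE = "project shifted direction"
    REJECTED_NO_BANDWIDTH = "maintainer had no bandwidth"
    REJECTED_SCOPE_MISMATCH = "scope mismatch"


# Rationales where the merge/reject bit says little about the agent itself.
NOISY = {
    Rationale.MERGED_AFTER_HUMAN_CLEANUP,
    Rationale.MERGED_LOW_BAR,
    Rationale.REJECTED_DUPLICATE,
    Rationale.REJECTED_DIRECTION_CHANGE,
    Rationale.REJECTED_NO_BANDWIDTH,
    Rationale.REJECTED_SCOPE_MISMATCH,
}


@dataclass(frozen=True)
class ClosedPR:
    merged: bool
    rationale: Rationale


def merge_rate(prs):
    """Fraction of PRs that were merged (the binary scoreboard)."""
    return sum(p.merged for p in prs) / len(prs)


def noisy_label_share(prs):
    """Fraction of PRs whose outcome is explained by a noisy rationale."""
    return sum(p.rationale in NOISY for p in prs) / len(prs)


# Made-up sample, not data from the paper.
sample = [
    ClosedPR(True, Rationale.MERGED_AS_SUBMITTED),
    ClosedPR(True, Rationale.MERGED_AFTER_HUMAN_CLEANUP),
    ClosedPR(True, Rationale.MERGED_LOW_BAR),
    ClosedPR(False, Rationale.REJECTED_DUPLICATE),
    ClosedPR(False, Rationale.REJECTED_NO_BANDWIDTH),
    ClosedPR(False, Rationale.REJECTED_FOR_QUALITY),
]

naive = merge_rate(sample)
noisy = noisy_label_share(sample)
print(f"merge rate: {naive:.2f}, share of noisy labels: {noisy:.2f}")
```

In this toy sample the merge rate is one half, yet four of the six outcomes are driven by something other than the quality of the agent's submission, which is exactly the kind of gap a binary score cannot show.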
By categorizing the actual decision rationales, the paper surfaces the recurring patterns behind merges and rejections that the binary label hides.
Why does this matter to anyone outside academia? Two reasons. First, if you're building or buying coding agents, the benchmarks you trust are probably overcounting wins and undercounting partial credit. The field needs evaluation methods that incorporate review interactions — what reviewers actually said, how much rework was required, whether the agent could iterate productively — not just the final merge bit.
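As a rough illustration of what "more than the final merge bit" could look like, the Python sketch below reports a few review-interaction signals next to the merge rate. The `ReviewedPR` fields, the `summarize` function, and the sample values are all hypothetical; the paper does not prescribe this record format or these metrics.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ReviewedPR:
    # Hypothetical fields standing in for review-interaction signals.
    merged: bool
    reviewer_comments: int     # how much reviewers had to say
    human_rework_commits: int  # commits a human pushed to fix the PR
    agent_followups: int       # agent revisions made in response to feedback


def summarize(prs):
    """Report the merge rate alongside simple review-interaction signals."""
    total = len(prs)
    merged = [p for p in prs if p.merged]
    return {
        "merge_rate": len(merged) / total,
        # Merges that needed no human rework: a stricter notion of a win.
        "merged_without_human_rework": sum(
            p.human_rework_commits == 0 for p in merged
        ) / total,
        "median_reviewer_comments": median(p.reviewer_comments for p in prs),
    }


# Made-up sample, not data from the paper.
sample_prs = [
    ReviewedPR(merged=True, reviewer_comments=0, human_rework_commits=0, agent_followups=0),
    ReviewedPR(merged=True, reviewer_comments=5, human_rework_commits=3, agent_followups=1),
    ReviewedPR(merged=False, reviewer_comments=2, human_rework_commits=0, agent_followups=1),
    ReviewedPR(merged=True, reviewer_comments=1, human_rework_commits=0, agent_followups=1),
]

summary = summarize(sample_prs)
print(summary)
```

Here three of four PRs merged, but only two merged without a human rewriting part of the change, so the stricter number is noticeably lower than the headline merge rate.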
Second, this is one of the first large-scale looks at agents as collaborators in a human review process, not just code generators in a sandbox. Real software development is a conversation, and agents that can't participate in that conversation — can't take feedback, can't scope their changes appropriately, can't follow project norms — will keep losing PRs for reasons their benchmarks never measured.
