ArXiv Paper Digest: Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study

2026-05-23

Authors: Sien Reeve O. Peralta, Fumika Hoshi, Hironori Washizaki, Naoyasu Ubayashi

AI coding agents are now opening pull requests against real open-source projects on their own. The obvious way to grade them is the same way we grade humans: did the PR get merged or rejected? This paper argues that scoreboard is misleading, and goes digging through the actual conversations to find out what's really going on.

The researchers analyzed 11,048 closed agentic pull requests, narrowed to 9,799 that humans actually reviewed, and then hand-inspected 717 representative cases to reconstruct the reasoning behind each decision. That's a lot of manual labor, but it's the only way to get past the binary outcome and into the why.

Here's the key insight: merge and rejection labels are noisy signals of agent quality. A PR can get merged because a human reviewer cleaned it up extensively, or because the project's bar for acceptance was low. It can get rejected for reasons completely unrelated to the agent's work: it duplicated an existing PR, the project shifted direction, the maintainer didn't have bandwidth, or the scope didn't match. Treating "merged = good agent, rejected = bad agent" lumps all of these together and gives you a number that doesn't really measure capability.
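To make the noise concrete, here is a minimal sketch, assuming each closed PR carries a hand-assigned rationale label. The `ClosedPR` class, the label strings, and both functions are hypothetical illustrations, not the paper's taxonomy or tooling, and the sample data is invented.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for rejections that say nothing about the agent's work,
# loosely based on the reasons named in the text above.
NOT_ABOUT_AGENT = {"duplicate", "direction_change", "no_bandwidth", "scope_mismatch"}


@dataclass(frozen=True)
class ClosedPR:
    merged: bool
    rationale: str  # hand-assigned label, invented for this sketch


def raw_merge_rate(prs):
    """The binary scoreboard: fraction of closed PRs that were merged."""
    return sum(pr.merged for pr in prs) / len(prs)


def outcome_breakdown(prs):
    """Split the same PRs by why the decision was made, not just what it was."""
    counts = Counter()
    for pr in prs:
        if pr.merged and pr.rationale == "heavy_human_rewrite":
            counts["merged_after_rewrite"] += 1
        elif pr.merged:
            counts["merged_as_submitted"] += 1
        elif pr.rationale in NOT_ABOUT_AGENT:
            counts["rejected_unrelated"] += 1
        else:
            counts["rejected_on_merits"] += 1
    return counts


# Invented sample data, for illustration only.
sample = [
    ClosedPR(merged=True, rationale="accepted"),
    ClosedPR(merged=True, rationale="heavy_human_rewrite"),
    ClosedPR(merged=False, rationale="duplicate"),
    ClosedPR(merged=False, rationale="wrong_problem"),
]

print(raw_merge_rate(sample))
print(dict(outcome_breakdown(sample)))
```

In this toy sample the binary score reports half the PRs as wins, while the breakdown shows that only one merge landed as submitted and only one rejection was about the agent's work.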

By categorizing the actual decision rationales, the paper surfaces patterns like:

PRs that solved the wrong problem: the code works, but the agent misunderstood the issue
PRs merged only after substantial human rewriting, which count as successes but overstate what the agent contributed
PRs rejected for process reasons (project conventions, contribution guidelines) rather than code defects
Recurring failure modes around tests, scope creep, and engagement with reviewer feedback

Why does this matter to anyone outside academia? Two reasons. First, if you're building or buying coding agents, the benchmarks you trust are probably overcounting wins and undercounting partial credit. The field needs evaluation methods that incorporate review interactions — what reviewers actually said, how much rework was required, whether the agent could iterate productively — not just the final merge bit.
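As a rough illustration of what a review-aware record could look like, here is a minimal sketch, assuming you can count human-rewritten lines and addressed review comments per PR. The `ReviewedPR` fields and both metrics are hypothetical; the paper does not prescribe these formulas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewedPR:
    # All fields are assumptions made for this sketch.
    merged: bool
    lines_in_final_diff: int
    lines_rewritten_by_humans: int
    review_comments: int
    comments_addressed_by_agent: int


def rework_ratio(pr: ReviewedPR) -> float:
    """Share of the final diff that humans had to rewrite (0.0 if the diff is empty)."""
    if pr.lines_in_final_diff == 0:
        return 0.0
    return pr.lines_rewritten_by_humans / pr.lines_in_final_diff


def feedback_uptake(pr: ReviewedPR) -> Optional[float]:
    """Share of review comments the agent addressed, or None if nobody commented."""
    if pr.review_comments == 0:
        return None
    return pr.comments_addressed_by_agent / pr.review_comments


def summarize(pr: ReviewedPR) -> dict:
    """Report the merge bit alongside the review signals instead of replacing it."""
    return {
        "merged": pr.merged,
        "rework_ratio": rework_ratio(pr),
        "feedback_uptake": feedback_uptake(pr),
    }


# Invented example: merged, but humans rewrote most of it.
example = ReviewedPR(
    merged=True,
    lines_in_final_diff=200,
    lines_rewritten_by_humans=150,
    review_comments=4,
    comments_addressed_by_agent=1,
)
print(summarize(example))
```

Keeping the signals side by side, rather than collapsing them into one score, avoids inventing a weighting the source does not justify.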

Second, this is one of the first large-scale looks at agents as collaborators in a human review process, not just code generators in a sandbox. Real software development is a conversation, and agents that can't participate in that conversation — can't take feedback, can't scope their changes appropriately, can't follow project norms — will keep losing PRs for reasons their benchmarks never measured.

Why it matters: The merge/reject score we use to rank coding agents hides most of what actually determines whether their work is useful, and we need evaluation frameworks that read the review conversation, not just the outcome.