ArXiv Paper Digest: "Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution

2026-05-24

Authors: Zhao Tian, Zifan Zhang, Tao Xiao, Dong Wang

Imagine you ask a contractor to fix a leaky faucet. They fix it — but they also rearrange your kitchen cabinets, repaint the bathroom, and replace the light fixtures in the hallway. Sure, some of that might be improvements, but now you can't easily tell whether the original leak is actually fixed, and your house looks different in ways you never asked for. This paper is about the software-engineering version of that problem.

The researchers studied coding agents — AI systems that take a bug report or feature request and try to fix the codebase automatically. They looked at 3,691 real patches from Multi-SWE-bench, a benchmark that tests these agents on actual GitHub issues. What they found is a pattern they call "refactoring runaway": the agent does the requested fix, but also throws in a bunch of unrelated code reorganization — renaming things, restructuring functions, tidying up code that had nothing to do with the bug.

Why does this happen? Two reasons:

Training data bias. These agents learn from open-source repositories, where human developers naturally bundle small refactorings into their bug fixes ("while I'm here, I might as well clean this up"). The agents picked up that habit.
No sense of scope discipline. Humans usually know when to stop. Agents don't have that instinct — they'll keep "improving" things well past what the ticket asked for.

The problem isn't that refactoring is bad. It's that tangled refactoring makes pull requests hard to review, hard to revert, and risky to merge. A patch that bundles a 50-line bug fix with 200 lines of unrelated cleanup is far riskier than two separate, focused changes, because reviewers can't tell which lines are load-bearing.

The paper categorizes the kinds of tangled refactorings that show up most often, measures how widespread the problem is, and proposes mitigation strategies — essentially teaching agents to recognize when they're drifting off-task and to keep their changes minimal and focused on the original issue.
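
To make "drifting off-task" concrete, here is a minimal Python sketch of one possible scope check. It is not the paper's method. The names scope_report, ScopeReport, and drift_ratio, along with the sample file paths, are hypothetical, and the sketch assumes a git-style unified diff plus a known set of files the issue is about. It counts changed lines per file and reports the share that falls outside that set.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ScopeReport:
    in_scope: int      # changed lines in files the issue is about
    out_of_scope: int  # changed lines in every other file

    @property
    def drift_ratio(self) -> float:
        """Share of changed lines that fall outside the expected files."""
        total = self.in_scope + self.out_of_scope
        return self.out_of_scope / total if total else 0.0


def scope_report(diff_text: str, expected_files: set[str]) -> ScopeReport:
    """Split the added/removed lines of a git-style unified diff by scope.

    Assumption: every file section starts with a "diff --git" line, as in
    the output of `git diff`. Plain `diff -u` output is not handled.
    """
    in_scope = out_of_scope = 0
    current_file = None
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            # New file section: header lines follow until the first hunk.
            in_hunk = False
            current_file = None
        elif not in_hunk and line.startswith("+++ "):
            # Header such as "+++ b/src/parser.py"; drop git's "b/" prefix.
            path = line[4:].strip()
            current_file = path[2:] if path.startswith("b/") else path
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and current_file and line.startswith(("+", "-")):
            if current_file in expected_files:
                in_scope += 1
            else:
                out_of_scope += 1
    return ScopeReport(in_scope, out_of_scope)


# Hypothetical patch: a one-line fix in parser.py plus an unrelated
# rename in utils.py.
SAMPLE_DIFF = "\n".join([
    "diff --git a/src/parser.py b/src/parser.py",
    "--- a/src/parser.py",
    "+++ b/src/parser.py",
    "@@ -10,3 +10,3 @@",
    " def parse(x):",
    '-    return x.split(",")',
    '+    return x.strip().split(",")',
    "diff --git a/src/utils.py b/src/utils.py",
    "--- a/src/utils.py",
    "+++ b/src/utils.py",
    "@@ -1,4 +1,4 @@",
    "-def hlp(a):",
    "+def helper(a):",
    "     return a",
    "-hlp(1)",
    "+helper(1)",
])

report = scope_report(SAMPLE_DIFF, {"src/parser.py"})
print(report.in_scope, report.out_of_scope)  # 2 4
print(round(report.drift_ratio, 2))          # 0.67
```

File-level scope is a coarse proxy: a rename inside the same file as the fix would still count as in scope, so a practical check would need finer-grained signals than this sketch uses.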

The key insight is uncomfortable but important: making AI coders behave more like humans isn't always a good thing. Humans have bad habits too, and an agent that faithfully imitates those habits inherits them at scale. "Just fix the bug" turns out to be a non-trivial discipline to instill.

Why it matters: As coding agents move into real engineering workflows, scope discipline — not just correctness — becomes a first-class quality metric, and this paper is one of the first to measure the problem empirically.
