AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

2026-05-05

Authors: Yuecai Zhu, Nikolaos Tsantalis, Peter C. Rigby

ArXiv: 2605.02741v1


When we talk about AI coding assistants, the conversation usually centers on one question: does the code work? Benchmarks measure pass rates, test coverage, and whether the function returns the right answer. But there's a quieter question that matters far more for anyone who has to live with the code six months later: is it any good? Will a human be able to read it, change it, and not break three other things in the process?

This paper takes a hard look at that second question. The authors performed a systematic audit of "technical debt" in AI-generated software: the accumulated mess that makes codebases harder to maintain over time. Think of technical debt as the difference between a kitchen where everything is in its proper drawer and one where every utensil is dumped on the counter. You can cook dinner in both. Only one is sustainable.

The researchers analyzed AI output at two scales: the code that LLMs generate, and the architecture of the systems that agents build.

The headline finding: AI doesn't just produce code with the same flaws humans do. It produces a distinct "machine signature" of defects — recognizable patterns of bad design that show up consistently across AI outputs but rarely in human-written code. In other words, AI doesn't merely inherit our bad habits; it has its own.

The deeper insight is what they call a fundamental "Reasoning" gap. Functional correctness is a local property — does this snippet do the right thing? But maintainability is a global property — does this code fit sensibly into a larger system, with clear boundaries, sensible abstractions, and minimal duplication? LLMs, optimized to make the next token plausible, tend to win at the local game and lose at the global one. Agents that build whole systems compound the problem: each step looks reasonable, but the resulting architecture often resembles a house where every room was designed by a different contractor who never spoke to the others.
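To make the local-versus-global distinction concrete, here is a minimal sketch. It is a hypothetical illustration, not an example from the paper: the function names and the email-normalization rule are assumptions chosen for brevity. Both versions return the right answers, but only one states the rule once.

```python
# Hypothetical illustration; none of these names come from the paper.

# Version A: each function is locally correct, but the normalization
# rule is copied into both, so any future change must be made twice.
def register_user(email: str) -> str:
    return email.strip().lower()


def lookup_user(email: str) -> str:
    return email.strip().lower()


# Version B: identical behavior, with the rule stated in one place.
def normalize_email(email: str) -> str:
    """Single source of truth for how emails are compared."""
    return email.strip().lower()


def register_user_shared(email: str) -> str:
    return normalize_email(email)


def lookup_user_shared(email: str) -> str:
    return normalize_email(email)


# A pass/fail check cannot tell the two versions apart.
sample = "  Alice@Example.COM "
assert register_user(sample) == register_user_shared(sample) == "alice@example.com"
assert lookup_user(sample) == lookup_user_shared(sample) == "alice@example.com"
```

The assertions pass for both versions, which is the point: a functional test sees no difference, while a maintainer who later has to change the rule does.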

Practically, this matters because the industry is racing to ship AI-generated code into production. Passing tests today says nothing about how hard that code will be to debug at 2 a.m. next year. The paper's audit gives concrete categories of "AI smells" that teams can look for in code review, and suggests that future evaluation benchmarks need to move beyond pass/fail and start measuring whether the code is something a human would actually want to inherit.

Why it matters: AI-generated code introduces its own recognizable family of design flaws, meaning "it passes the tests" is a dangerously incomplete measure of whether AI-written software is fit for production.
