2026-05-27
If you've ever asked an LLM to "build me a webpage," you've probably noticed something strange: the page looks great in the initial screenshot, but the moment you actually try to use it — scroll, click a button, resize the window, play the little game it generated — something falls apart. A dropdown doesn't open. A modal traps you. The layout collapses on mobile. The page looked fine; it just didn't work fine.
That gap between "looks correct" and "behaves correctly" is the problem HTMLCure tackles. The authors point out that most evaluation pipelines for AI-generated HTML judge a page from a single screenshot, so bugs that only surface during interaction slip through. Worse, when a page does fail the screenshot test, it is often thrown out entirely, even though many of those pages are almost right and could be fixed with a small repair.
HTMLCure's idea is to evaluate and repair HTML the way a real user would experience it. Specifically, it:
The "state-guided" part is the key insight. A static screenshot tells you what a page is; a recorded interaction trace tells you what a page does and where it fails. By treating the browser itself as the source of truth, and turning real interaction history into structured feedback an LLM can act on, HTMLCure can fix pages that screenshot-based evaluators would have silently discarded.
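To make "structured feedback from an interaction trace" concrete, here is a minimal Python sketch. It is not HTMLCure's implementation: the `Step` record, its fields, and `build_repair_feedback` are hypothetical names invented for illustration, and the trace is written by hand rather than captured from a real browser.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One recorded user action and what the page did in response (hypothetical schema)."""
    action: str                   # e.g. "load", "click", "scroll", "resize"
    target: str                   # selector or description of the element acted on
    expected: str                 # what a user would expect to happen
    observed: str                 # what was actually observed after the action
    error: Optional[str] = None   # console error captured during the step, if any

    @property
    def failed(self) -> bool:
        # A step fails if it raised an error or the outcome differs from the expectation.
        return self.error is not None or self.expected != self.observed


def build_repair_feedback(trace: List[Step]) -> str:
    """Summarise only the failing steps as text that could be placed in a repair prompt."""
    failures = [(i, s) for i, s in enumerate(trace, start=1) if s.failed]
    if not failures:
        return "No interaction failures recorded."
    lines = [f"{len(failures)} of {len(trace)} interaction steps failed:"]
    for i, s in failures:
        lines.append(f"- step {i}: {s.action} on {s.target}")
        lines.append(f"  expected: {s.expected}")
        lines.append(f"  observed: {s.observed}")
        if s.error:
            lines.append(f"  console error: {s.error}")
    return "\n".join(lines)


# Hand-written example trace: the page loads fine, but the menu button does nothing.
trace = [
    Step("load", "page", "hero section visible", "hero section visible"),
    Step("click", "#menu-toggle", "dropdown opens", "nothing changed",
         error="TypeError: menu is null"),
]
print(build_repair_feedback(trace))
```

In this sketch a first-frame check would only ever see the successful `load` step. The failing `click` appears only because the trace records what happened after the page was used, and that is the part a repair model would need to see.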
This matters more than it might sound. As LLMs increasingly generate real applications instead of static mockups, the bottleneck isn't generation quality — it's verification under use. Tools that only score the first frame will keep approving pages that fail on the second click. HTMLCure points at a different evaluation paradigm: judge code by interacting with it, and use that interaction itself as the repair signal.
