ArXiv Paper Digest: HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML

2026-05-27

Authors: Jiajun Wu, Jian Yang, Tuney Zheng, Wei Zhang

If you've ever asked an LLM to "build me a webpage," you've probably noticed something strange: the page looks great in the initial screenshot, but the moment you actually try to use it — scroll, click a button, resize the window, play the little game it generated — something falls apart. A dropdown doesn't open. A modal traps you. The layout collapses on mobile. The page looked fine; it just didn't work fine.

That gap between "looks correct" and "behaves correctly" is the problem HTMLCure tackles. The authors point out that most evaluation pipelines for AI-generated HTML judge pages from a single screenshot, so bugs that only surface during interaction slip through. Worse, when a page does fail the screenshot test, it often gets thrown out entirely, even though many of those pages are almost right and could be fixed with a small repair.

HTMLCure's idea is to evaluate and repair HTML the way a real user would experience it. Specifically, it:

1. Actually drives the browser. It loads the page across multiple viewports (desktop, mobile, etc.) and exercises it through realistic interactions: scrolling, hovering, clicking, resizing, even playing through gameplay states.
2. Records what happens at each step. Instead of just capturing one screenshot at the end, it logs deterministic browser traces: what the DOM looked like, what events fired, what visual state the page was in at each interaction.
3. Feeds those traces back as repair signals. When something breaks (say, a button visually appears but does nothing on click), the trace gives an LLM precise, state-grounded evidence of the failure, not just a vague "this page looks wrong."
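
To make the record-then-repair loop concrete, here is a minimal Python sketch. It is not the paper's implementation: the `TraceStep` fields, the "dead click" heuristic, and the prompt wording are all hypothetical, and the trace is written by hand rather than captured from a real browser.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TraceStep:
    """One recorded interaction. All field names here are hypothetical."""
    viewport: str        # e.g. "desktop" or "mobile"
    action: str          # e.g. "click #menu"
    dom_before: str      # serialized DOM (or a hash of it) before the action
    dom_after: str       # serialized DOM after the action
    events: list[str] = field(default_factory=list)          # events seen during the action
    console_errors: list[str] = field(default_factory=list)  # errors logged during the action


def find_failures(trace: list[TraceStep]) -> list[str]:
    """Turn a trace into state-grounded failure descriptions (assumed heuristics)."""
    failures = []
    for i, step in enumerate(trace):
        where = f"step {i} [{step.viewport}] {step.action}"
        if step.console_errors:
            failures.append(f"{where}: console error: {step.console_errors[0]}")
        elif (step.action.startswith("click")
              and step.dom_before == step.dom_after
              and not step.events):
            # The element was clickable, but nothing observable happened.
            failures.append(f"{where}: no DOM change and no events fired")
    return failures


def build_repair_prompt(html: str, failures: list[str]) -> str:
    """Package the failures as evidence for an LLM repair request."""
    evidence = "\n".join(f"- {f}" for f in failures)
    return (
        "Fix the page so these interactions work:\n"
        f"{evidence}\n\nCurrent HTML:\n{html}"
    )


# Hand-written example trace: the menu opens on desktop but is dead on mobile.
trace = [
    TraceStep("desktop", "click #menu", "<ul hidden>", "<ul>", events=["click"]),
    TraceStep("mobile", "click #menu", "<ul hidden>", "<ul hidden>"),
]
failures = find_failures(trace)
prompt = build_repair_prompt("<button id='menu'>Menu</button>", failures)
```

The point of the sketch is the shape of the feedback: each failure names the viewport, the action, and what was observed, which is a narrower target for a repair model than "this page looks wrong."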

The "state-guided" part is the key insight. A static screenshot tells you what a page is; a recorded interaction trace tells you what a page does and fails to do. By treating the browser itself as the source of truth — and turning real interaction history into structured feedback an LLM can act on — HTMLCure can fix pages that screenshot-based evaluators would have silently discarded.
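
The contrast between a first-frame check and an interaction check can be shown with a toy model. The sketch below is purely illustrative: `FakePage` is a hypothetical in-memory stand-in for a rendered page, not a real browser API, and both checks are simplified assumptions about what an evaluator might test.

```python
class FakePage:
    """Hypothetical stand-in for a page with a button and a dropdown."""

    def __init__(self, handler_wired: bool):
        self.handler_wired = handler_wired  # whether the click handler is attached
        self.dropdown_open = False

    def screenshot(self) -> str:
        # A textual "screenshot": the button is always drawn, wired or not.
        state = "open" if self.dropdown_open else "closed"
        return f"button:visible dropdown:{state}"

    def click_button(self) -> None:
        # A missing handler means the click is silently ignored.
        if self.handler_wired:
            self.dropdown_open = not self.dropdown_open


def first_frame_check(page: FakePage) -> bool:
    """Screenshot-style check: does the initial render contain the button?"""
    return "button:visible" in page.screenshot()


def interaction_check(page: FakePage) -> bool:
    """Interaction-style check: does clicking change the observed state?"""
    before = page.screenshot()
    page.click_button()
    after = page.screenshot()
    return before != after


broken = FakePage(handler_wired=False)
working = FakePage(handler_wired=True)

broken_first = first_frame_check(broken)        # passes: the page looks right
broken_interact = interaction_check(broken)     # fails: the click does nothing
working_interact = interaction_check(working)   # passes: the dropdown opens
```

Both pages are indistinguishable at the first frame; only the before/after comparison around the click separates them.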

As LLMs increasingly generate real applications instead of static mockups, the bottleneck shifts from generating a plausible page to verifying it under use. Tools that only score the first frame will keep approving pages that fail on the second click. HTMLCure points toward a different evaluation paradigm: judge code by interacting with it, and use the record of that interaction as the repair signal.

Why it matters: As AI-generated interfaces move from demos to deployment, evaluation has to shift from "does it render?" to "does it behave?" — and HTMLCure shows that browser interaction traces can do both the judging and the fixing.
