ArXiv Paper Digest: Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

2026-05-30

Authors: Nhat-Minh Nguyen

Most "AI writes code" stories are vibes-based: someone tries Copilot or Claude for a weekend, declares it amazing (or useless), and writes a blog post. This paper does something rarer — it's a careful, instrumented case study of one physicist using Claude Code over 12 working days and 57 sessions to build a real piece of scientific software, then counting exactly when and how the human had to step in.

The software in question is CLAX-PT, a "differentiable one-loop perturbation theory module in JAX." In plain English: a chunk of math used in theoretical physics (computing the first layer of corrections on top of a leading-order prediction), rewritten so that a computer can also compute its derivatives automatically, which is what gradient-based optimization and machine learning need. It's the kind of code where a single wrong sign or misplaced factor of 2π silently corrupts your physics results.

Nguyen tracked every moment supervision was needed, logging 15 "intervention events" in total, and sorted them into three groups by how heavily the human had to step in:

10 events: The AI resolved on its own by iterating against oracle tests — pre-written checks that say "for this input, the answer must equal X." Tight feedback loops carried the work.
2 events: Required the physicist's specialized domain knowledge to unblock — judgment calls about which physical convention to follow, or recognizing a subtle conceptual error the tests didn't catch.
3 events: The AI could not resolve, period. These are the interesting failures — places where no amount of iteration substituted for deep understanding.
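
To make "oracle test" concrete, here is a toy sketch in plain Python. It is not code from the paper or from CLAX-PT: the Gaussian integral, the function names, and the tolerance are all assumptions picked for illustration. The idea is a table of inputs paired with answers known independently (here, the analytic result σ·√(2π)), plus a check that an implementation reproduces them. The second function carries a deliberately misplaced factor of √(2π), the kind of slip that returns plausible numbers and never raises an error.

```python
import math

def gaussian_integral(sigma, n=20001, half_width=10.0):
    """Trapezoid-rule estimate of the integral of exp(-x^2 / (2 sigma^2)) over the real line."""
    a = -half_width * sigma                      # integrate over [-10 sigma, +10 sigma]
    h = 2.0 * half_width * sigma / (n - 1)       # grid spacing
    total = 0.0
    for i in range(n):
        x = a + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * math.exp(-x * x / (2.0 * sigma * sigma))
    return total * h

def gaussian_integral_buggy(sigma):
    """Same quantity with a deliberately misplaced factor of sqrt(2*pi) (hypothetical bug)."""
    return gaussian_integral(sigma) / math.sqrt(2.0 * math.pi)

# Oracle table: (input, value the answer must equal), from the analytic result sigma * sqrt(2*pi).
ORACLE = [(s, s * math.sqrt(2.0 * math.pi)) for s in (0.5, 1.0, 3.0)]

def passes_oracle(fn, rel_tol=1e-8):
    """True if fn reproduces every oracle value to the given relative tolerance."""
    return all(math.isclose(fn(s), expected, rel_tol=rel_tol) for s, expected in ORACLE)

print(passes_oracle(gaussian_integral))        # correct implementation
print(passes_oracle(gaussian_integral_buggy))  # implementation with the misplaced factor
```

The correct version passes and the buggy one fails, which is the feedback loop the first group of events relied on. What a table like this cannot do is tell you whether σ·√(2π) was the right convention to target in the first place; that judgment is where the other two groups come in.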

The key insight is in the framing question: Are AI agents tools, co-authors, or researchers? The data suggests "powerful tool, sometimes co-author, not yet researcher." The agent is remarkably good at converging on correct code when the success criterion is machine-checkable — that's what oracle tests provide. It struggles exactly where physics-as-a-discipline struggles: choosing the right formulation, knowing which approximation is valid in which regime, recognizing when an answer is "technically correct but physically wrong."

This matters because scientific software is a domain where bugs don't crash — they publish. A subtle error in a perturbation theory module can produce plausible-looking numbers that mislead an entire subfield for years. Nguyen's taxonomy gives a useful template: build oracle tests aggressively (the AI will use them), but don't outsource the parts that require knowing why the math is the math.

Why it matters: A rigorous, quantified look at where AI coding agents actually break down on real scientific software — turning AI-pair-programming hype into an evidence-based picture of which tasks need a human expert and which don't.
