Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

2026-05-30

Authors: Nhat-Minh Nguyen

ArXiv: 2605.30353v1

Most "AI writes code" stories are vibes-based: someone tries Copilot or Claude for a weekend, declares it amazing (or useless), and writes a blog post. This paper does something rarer — it's a careful, instrumented case study of one physicist using Claude Code over 12 working days and 57 sessions to build a real piece of scientific software, then counting exactly when and how the human had to step in.

The software in question is CLAX-PT, a "differentiable one-loop perturbation theory module in JAX." In plain English: a chunk of math used in theoretical physics (calculating the first round of small corrections beyond a leading-order prediction), rewritten so that a computer can also compute its derivatives automatically, which is what gradient-based optimization and machine learning require. It's the kind of code where a single wrong sign or misplaced factor of 2π silently corrupts your physics results.
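To make "automatically compute its derivatives" concrete, here is a minimal sketch of forward-mode automatic differentiation in plain Python. It is not JAX's API and not code from CLAX-PT: the `Dual` class, `dual_exp`, `derivative`, and the toy function `toy_spectrum` are all hypothetical names invented for illustration. JAX achieves the same effect by tracing ordinary numerical code.

```python
import math


class Dual:
    """A number that carries its own derivative alongside its value."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (f + g)' = f' + g'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (f * g)' = f' * g + f * g'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def dual_exp(x):
    """exp for Dual numbers, using the chain rule: (e^f)' = e^f * f'."""
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)


def toy_spectrum(k):
    """Hypothetical stand-in for a physics formula: f(k) = k^2 * exp(-k)."""
    return k * k * dual_exp(-1.0 * k)


def derivative(f, x):
    """Evaluate df/dx at x by seeding the input derivative with 1."""
    return f(Dual(x, 1.0)).deriv


slope = derivative(toy_spectrum, 0.5)
# Hand-derived check: f'(k) = (2k - k^2) * exp(-k)
exact = (2 * 0.5 - 0.5 ** 2) * math.exp(-0.5)
```

The derivative falls out of the same code path that computes the value, so a sign error in the formula corrupts both at once. That is part of why this kind of software needs careful checking.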

Nguyen logged every moment that required supervision, 15 "intervention events" in all, and graded each by how heavily the human had to step in.

The key insight is in the framing question: Are AI agents tools, co-authors, or researchers? The data suggests "powerful tool, sometimes co-author, not yet researcher." The agent is remarkably good at converging on correct code when the success criterion is machine-checkable — that's what oracle tests provide. It struggles exactly where physics-as-a-discipline struggles: choosing the right formulation, knowing which approximation is valid in which regime, recognizing when an answer is "technically correct but physically wrong."
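As a rough illustration of what a machine-checkable success criterion looks like, the sketch below compares a candidate implementation against a trusted reference (the "oracle") over a set of inputs. The example problem (the normalization of a Gaussian) and every function name here are assumptions chosen for illustration, not taken from the paper or from CLAX-PT.

```python
import math


def oracle_gaussian_norm(sigma):
    """Trusted reference: closed form of the integral of
    exp(-x^2 / (2 sigma^2)) over the real line, sigma * sqrt(2 pi)."""
    return sigma * math.sqrt(2.0 * math.pi)


def candidate_gaussian_norm(sigma, n=2001, half_width=12.0):
    """Candidate under test: the same integral by trapezoidal quadrature
    on the interval [-half_width * sigma, +half_width * sigma]."""
    a = -half_width * sigma
    h = 2.0 * half_width * sigma / (n - 1)
    total = 0.0
    for i in range(n):
        x = a + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0  # endpoints count half
        total += weight * math.exp(-x * x / (2.0 * sigma * sigma))
    return total * h


def buggy_gaussian_norm(sigma):
    """Deliberately wrong candidate: drops pi inside the square root."""
    return sigma * math.sqrt(2.0)


def failures_against_oracle(candidate, oracle, inputs, rtol=1e-8):
    """Return the inputs where candidate and oracle disagree beyond rtol."""
    return [x for x in inputs
            if not math.isclose(candidate(x), oracle(x), rel_tol=rtol)]


SIGMAS = [0.1, 1.0, 3.5]
good_failures = failures_against_oracle(
    candidate_gaussian_norm, oracle_gaussian_norm, SIGMAS)
bad_failures = failures_against_oracle(
    buggy_gaussian_norm, oracle_gaussian_norm, SIGMAS)
```

The buggy version returns plausible-looking positive numbers and never raises an error; only the comparison against the oracle exposes it. A test like this can confirm agreement with the reference, but it cannot tell you whether the reference itself is the right formulation for the physics.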

This matters because scientific software is a domain where bugs don't crash — they publish. A subtle error in a perturbation theory module can produce plausible-looking numbers that mislead an entire subfield for years. Nguyen's taxonomy gives a useful template: build oracle tests aggressively (the AI will use them), but don't outsource the parts that require knowing why the math is the math.

Why it matters: A rigorous, quantified look at where AI coding agents actually break down on real scientific software — turning AI-pair-programming hype into an evidence-based picture of which tasks need a human expert and which don't.
