2026-04-26
Here's the problem: we've been testing how good language models are at math by giving them the same fixed set of problems over and over. The top models now ace these benchmarks, scoring so close together that the tests can no longer tell us which model is better. It's like running a spelling bee where every contestant gets a perfect score: the test has hit its ceiling.
MathDuels flips the script with a beautifully simple idea: instead of just solving problems, make the models write problems for each other. Think of it like a round-robin tournament where every player is both a quiz-master and a contestant. Each model crafts math problems designed to be tricky (but still solvable), and then every other model tries to solve them. Your final score depends on both how well you stump your opponents and how well you handle theirs.
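To make the dual-role scoring concrete, here is a minimal sketch in Python. It is not the MathDuels implementation: the function name `duel_scores`, the shape of the `solved` dictionary, the equal weighting of the two roles, and all the outcomes below are assumptions invented for illustration. It also assumes every problem has already been checked as valid and solvable.

```python
def duel_scores(solved):
    """Score each model as both solver and problem author.

    `solved[(author, solver)]` is a list of booleans, one per problem the
    author wrote, marking whether the solver got it right. This layout is
    a hypothetical convenience, not a format taken from MathDuels.
    """
    models = sorted({name for pair in solved for name in pair})
    scores = {}
    for model in models:
        # Outcomes on problems this model attempted (written by others).
        as_solver = [ok for (author, solver), outcomes in solved.items()
                     if solver == model for ok in outcomes]
        # Outcomes on problems this model wrote; a failed attempt is a "stump".
        as_author = [not ok for (author, solver), outcomes in solved.items()
                     if author == model for ok in outcomes]
        solve_rate = sum(as_solver) / len(as_solver)
        stump_rate = sum(as_author) / len(as_author)
        scores[model] = {
            "solve": solve_rate,
            "stump": stump_rate,
            # Assumption: both roles count equally toward the final score.
            "total": (solve_rate + stump_rate) / 2,
        }
    return scores


# Made-up outcomes for three models that each wrote two problems.
solved = {
    ("A", "B"): [True, False],
    ("A", "C"): [False, False],
    ("B", "A"): [True, True],
    ("B", "C"): [True, False],
    ("C", "A"): [True, False],
    ("C", "B"): [True, True],
}

scores = duel_scores(solved)
ranking = sorted(scores, key=lambda m: scores[m]["total"], reverse=True)
print(ranking)
for model in ranking:
    print(model, scores[model])
```

In this toy data, A and B solve the same share of problems (0.75 each), but A's problems stump opponents far more often (0.75 versus 0.25), so A ranks first. A solver-only benchmark would have called it a tie.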
The problem-generation pipeline has three stages to keep things fair:
The key insight is that creating a hard-but-valid math problem is a fundamentally different skill from solving one, and testing both gives you a much richer picture of mathematical reasoning. A model that can craft a devious algebra problem understands the structure of algebra differently from one that can merely follow solution patterns. It's the difference between a student who can answer test questions and a professor who can write them.
This also addresses the contamination problem that plagues static benchmarks. Since the problems are freshly generated each round, a model is very unlikely to have seen them during training. The benchmark essentially refreshes itself every time you run it.
The dual-role framing produces a natural ranking system that's more discriminating than pass/fail on a fixed test. Even when two models solve the same percentage of standard problems, one might generate significantly harder or more creative challenges, revealing a deeper level of mathematical understanding that static benchmarks completely miss.
