ArXiv Paper Digest: MathDuels: Evaluating LLMs as Problem Posers and Solvers

MathDuels: Evaluating LLMs as Problem Posers and Solvers

2026-04-26

Authors: Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik

ArXiv: 2604.21916v1

Here's the problem: we've been testing how good language models are at math by giving them the same fixed set of problems over and over. The top models now ace these benchmarks, scoring so close together that the tests can't really tell us which model is actually better. It's like a spelling bee where every contestant gets a perfect score: the test has hit its ceiling.

MathDuels flips the script with a beautifully simple idea: instead of just solving problems, make the models write problems for each other. Think of it like a round-robin tournament where every player is both a quiz-master and a contestant. Each model crafts math problems designed to be tricky (but still solvable), and then every other model tries to solve them. Your final score depends on both how well you stump your opponents and how well you handle theirs.
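To make the dual-role scoring concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's scoring rule: the `duel_scores` function, the equal weighting of the two roles, and the numbers in `solve_rate` are all assumptions made up for this example.

```python
from statistics import mean


def duel_scores(solve_rate):
    """Score each model as both poser and solver.

    solve_rate[poser][solver] is the fraction of the poser's problems
    that the solver answered correctly (hypothetical input format).
    """
    models = list(solve_rate)
    scores = {}
    for m in models:
        others = [o for o in models if o != m]
        # Solver skill: how often m solves problems written by the others.
        solver = mean(solve_rate[o][m] for o in others)
        # Poser skill: how often the others fail on problems written by m.
        poser = mean(1.0 - solve_rate[m][o] for o in others)
        # Assumption: both roles count equally in the combined score.
        scores[m] = {
            "solver": solver,
            "poser": poser,
            "combined": (solver + poser) / 2,
        }
    return scores


# Made-up round-robin results for three hypothetical models.
solve_rate = {
    "A": {"B": 0.5, "C": 0.25},
    "B": {"A": 1.0, "C": 0.75},
    "C": {"A": 1.0, "B": 0.5},
}
scores = duel_scores(solve_rate)
for name, s in scores.items():
    print(name, s)
```

In this toy data, B and C tie as solvers (0.5 each), but C's problems stump opponents more often (0.25 versus 0.125), so C edges ahead on the combined score. A solve-only benchmark would not separate the two.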

The problem-generation pipeline has three stages to keep things fair:

1. Models generate candidate problems under adversarial prompting (they're encouraged to be clever).
2. Problems are filtered for quality: they must have verifiable solutions and meet difficulty criteria.
3. The surviving problems get served to all participants in a solve-off.
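The three stages above can be sketched roughly as follows. This is a toy outline under stated assumptions, not the authors' pipeline: the `Problem` fields, the `min_difficulty` threshold, and the stub solvers are hypothetical, and stage 1 is replaced by hand-written candidates because it would require calling a real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Problem:
    author: str
    text: str
    answer: Optional[int]  # None means no verifiable reference answer
    difficulty: float      # hypothetical difficulty estimate in [0, 1]


def filter_problems(candidates: List[Problem],
                    min_difficulty: float = 0.5) -> List[Problem]:
    """Stage 2: keep problems that are verifiable and hard enough."""
    return [
        p for p in candidates
        if p.answer is not None and p.difficulty >= min_difficulty
    ]


def solve_off(problems: List[Problem],
              solvers: Dict[str, Callable[[str], Optional[int]]]
              ) -> Dict[Tuple[str, str, str], bool]:
    """Stage 3: every model attempts every surviving problem.

    Assumption: a model does not attempt its own problems.
    """
    results = {}
    for p in problems:
        for name, solve in solvers.items():
            if name == p.author:
                continue
            results[(p.author, name, p.text)] = solve(p.text) == p.answer
    return results


# Stage 1 (stubbed): candidates a real system would get from the models.
candidates = [
    Problem("A", "2 + 2", 4, 0.1),                    # too easy, dropped
    Problem("A", "17 * 23", 391, 0.7),
    Problem("B", "unsolved conjecture", None, 0.9),   # unverifiable, dropped
    Problem("B", "2 ** 10", 1024, 0.6),
]

# Stub solvers: lookup tables standing in for model calls.
solvers = {
    "A": {"2 ** 10": 1024}.get,
    "B": {"17 * 23": 390}.get,  # B gets this one wrong
}

kept = filter_problems(candidates)
results = solve_off(kept, solvers)
print(len(kept), results)
```

Here two of the four candidates survive filtering, and the solve-off records that B fails A's problem while A solves B's.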

The key insight is that creating a hard-but-valid math problem is a fundamentally different skill from solving one, and testing both gives you a much richer picture of mathematical reasoning. A model that can craft a devious algebra problem understands the structure of algebra differently from one that can merely follow solution patterns. It's the difference between a student who can answer test questions and a professor who can write them.

This also addresses the contamination problem that plagues static benchmarks. Because the problems are freshly generated each round, a model is very unlikely to have seen them verbatim during training. The benchmark essentially refreshes itself every time you run it.

The dual-role framing produces a natural ranking system that's more discriminating than pass/fail on a fixed test. Even when two models solve the same percentage of standard problems, one might generate significantly harder or more creative challenges, revealing a deeper level of mathematical understanding that static benchmarks completely miss.

Why it matters: By making LLMs both authors and solvers of math problems in a competitive setting, MathDuels creates a self-renewing benchmark that can differentiate models long after static tests have hit their ceiling — and reveals that problem creation is a powerful, underexplored measure of reasoning ability.
