ArXiv Paper Digest: Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

2026-05-31

Authors: Xiang Liu, Sa Song, Zhaowei Zhang, Huiying Lan

Imagine you and four friends are trying to agree on where to eat dinner, except you're all texting from different cities, some of your phones randomly die, and a couple of you might be lying about what messages you received. Somehow, you all need to end up at the same restaurant. That's the job of a consensus protocol — the algorithm that lets a bunch of computers agree on something even when machines crash, networks drop messages, or nodes behave badly. These protocols (Raft, Paxos, PBFT, and their cousins) sit underneath basically everything: databases, cloud storage, blockchains, and the financial systems that move trillions of dollars.

Here's the scary part: bugs in these protocols can cause data corruption, lost transactions, or split-brain scenarios where two halves of a system both think they're in charge. And they're notoriously hard to find. The bugs that matter most aren't typos; they're logic bugs that only surface in specific sequences of events, like "node A crashes, then node B becomes leader, then a stale message from A arrives mid-election." Human reviewers often miss these. Traditional testing tools often miss them too. And it turns out LLMs also miss them when you just throw code at them and ask "any bugs?"
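To make "only surfaces in specific sequences of events" concrete, here is a toy sketch in Python. It is not taken from the paper or from any real implementation: `ToyFollower`, `Heartbeat`, and `run` are hypothetical names, and the protocol is reduced to a single follower. The only real-world fact it leans on is that Raft-style protocols reject messages carrying a term older than the receiver's current term. The buggy variant skips that check.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    sender: str
    term: int


class ToyFollower:
    """A single follower; check_terms=False models the buggy variant."""

    def __init__(self, check_terms: bool):
        self.check_terms = check_terms
        self.term = 0
        self.leader = None

    def on_heartbeat(self, msg: Heartbeat) -> None:
        # A correct node ignores messages from an older term.
        if self.check_terms and msg.term < self.term:
            return
        self.term = max(self.term, msg.term)
        self.leader = msg.sender

    def start_election(self) -> None:
        # Entering a new term means the old leader is no longer trusted.
        self.term += 1
        self.leader = None


def run(order, check_terms):
    """Replay events in the given order and return (leader, term)."""
    node = ToyFollower(check_terms)
    node.on_heartbeat(Heartbeat("A", 1))  # A is the leader in term 1
    stale = Heartbeat("A", 1)             # a delayed copy of A's heartbeat
    for event in order:
        if event == "election":
            node.start_election()
        elif event == "stale_heartbeat":
            node.on_heartbeat(stale)
    return node.leader, node.term
```

With the order `["stale_heartbeat", "election"]`, the correct and buggy followers end in the same state, so a test using that order would pass. Only `["election", "stale_heartbeat"]` makes the buggy follower re-adopt the old leader A inside the new term.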

The authors built Agora, a system of cooperating LLM agents that tries to find these deep bugs autonomously. The key insight: instead of asking an LLM to spot bugs by reading code (which it's bad at for stateful, multi-stage logic), Agora gets the LLM to generate hypotheses about what could go wrong (for example, "what if a leader election happens during log replication?") and then systematically tests each hypothesis against the actual protocol implementation.
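The generate-then-test loop described above can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not Agora's actual code: `propose_hypotheses` is a hard-coded stand-in for an LLM call, `ToyVoter` is a made-up three-line voting rule rather than a real protocol, and the invariant checked (at most one leader per term) is simply a standard election-safety property.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    events: list  # list of (candidate, term) vote requests, in order


class ToyVoter:
    """remember_votes=False models a bug: the voter forgets earlier votes."""

    def __init__(self, remember_votes: bool):
        self.remember_votes = remember_votes
        self.voted_for = {}  # term -> candidate

    def request_vote(self, candidate: str, term: int) -> bool:
        if self.remember_votes and term in self.voted_for:
            return self.voted_for[term] == candidate
        self.voted_for[term] = candidate
        return True


def propose_hypotheses():
    # Stand-in for the LLM: a fixed list of "what could go wrong" scenarios.
    return [
        Hypothesis("single candidate runs unopposed", [("A", 1)]),
        Hypothesis("two candidates compete in the same term", [("A", 1), ("B", 1)]),
        Hypothesis("candidates run in different terms", [("A", 1), ("B", 2)]),
    ]


def run_hypothesis(h: Hypothesis, remember_votes: bool, n_voters: int = 3):
    """Replay the events and return the terms that ended up with >1 leader."""
    voters = [ToyVoter(remember_votes) for _ in range(n_voters)]
    winners = {}  # term -> set of candidates that reached a majority
    for candidate, term in h.events:
        granted = sum(v.request_vote(candidate, term) for v in voters)
        if granted > n_voters // 2:
            winners.setdefault(term, set()).add(candidate)
    return [term for term, who in winners.items() if len(who) > 1]


def find_bugs(remember_votes: bool):
    """Return the descriptions of hypotheses that exposed a violation."""
    return [
        h.description
        for h in propose_hypotheses()
        if run_hypothesis(h, remember_votes)
    ]
```

Calling `find_bugs(True)` on the correct voter reports nothing, while `find_bugs(False)` flags only the "two candidates compete in the same term" hypothesis. The point of the design is that the model's job is to suggest scenarios; the verdict comes from executing them.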

This is a meaningful shift in how to use LLMs for hard verification work:

Hypothesis-driven, not pattern-matching. The LLM acts more like a curious systems engineer brainstorming failure modes than a linter scanning for known smells.
Multi-agent decomposition. Different agents handle different parts of the workflow — proposing scenarios, building test cases, analyzing results — so no single LLM call has to hold the whole problem in its head.
Domain-aware. Agora encodes what consensus protocols actually do (leaders, terms, logs, quorums), so its hypotheses are grounded in the real failure surface instead of generic "what if null?" guesses.
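One way to picture the three ideas in this list working together is a small staged pipeline. Everything here is hypothetical and invented for illustration: the event vocabulary in `DOMAIN_EVENTS`, the stage names, and `fake_run`, which stands in for executing a test against a real cluster. The digest does not describe Agora's actual interfaces, so treat this as a sketch of the general shape only.

```python
from dataclasses import dataclass
from itertools import permutations

# Hypothetical domain vocabulary: scenarios are built from consensus events,
# not from generic guesses such as "what if this value is null?".
DOMAIN_EVENTS = ("leader_crash", "election_start", "stale_append", "log_replication")


@dataclass(frozen=True)
class Scenario:
    events: tuple


@dataclass(frozen=True)
class HarnessCase:
    scenario: Scenario
    steps: tuple  # commands a test harness would execute


@dataclass(frozen=True)
class Finding:
    scenario: Scenario
    suspicious: bool


def proposer_agent(max_len: int = 2):
    """Stage 1 (LLM stand-in): propose ordered event sequences."""
    return [Scenario(p) for p in permutations(DOMAIN_EVENTS, max_len)]


def builder_agent(scenario: Scenario) -> HarnessCase:
    """Stage 2: translate a scenario into concrete harness steps."""
    return HarnessCase(scenario, tuple(f"inject:{e}" for e in scenario.events))


def analyzer_agent(case: HarnessCase, run) -> Finding:
    """Stage 3: execute the case and interpret the outcome."""
    return Finding(case.scenario, suspicious=not run(case.steps))


def fake_run(steps) -> bool:
    # Stand-in for a real cluster run: pretend the system misbehaves only
    # when a stale append arrives right after an election starts.
    return steps != ("inject:election_start", "inject:stale_append")


def pipeline():
    findings = [analyzer_agent(builder_agent(s), fake_run) for s in proposer_agent()]
    return [f.scenario.events for f in findings if f.suspicious]
```

Each stage takes and returns a small typed value, so no stage needs to see the whole problem, and swapping the stand-ins for real model calls or a real harness would not change the overall shape.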

The result is a tool that can find protocol-level bugs in production-grade consensus implementations — the kind of bugs that historically required PhD-level distributed systems experts staring at TLA+ specs for weeks.

Why it matters: Consensus protocol bugs can silently corrupt critical infrastructure. Agora suggests that LLM agents, when structured around hypothesis-driven testing rather than code-reading, can autonomously catch the kind of subtle, state-dependent bugs that have historically demanded scarce distributed systems expertise.
