ArXiv Paper Digest: Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

2026-05-15

Authors: Xinran Zheng, Alfredo Pesoli, Marco Valleri, Suman Jana

Imagine you're a security researcher handed a compiled program — a binary blob of machine code — and asked to find dangerous bugs like buffer overflows or use-after-free errors that attackers love to exploit. You don't have the original source code. All the helpful names, comments, and type information that programmers wrote are gone. What's left looks like assembly soup. This is the everyday reality for people who audit closed-source software, firmware, and malware.

The problem: Traditional tools for finding memory corruption bugs in binaries are either fast-but-shallow (pattern matching that misses real bugs and flags lots of false ones) or deep-but-brittle (symbolic execution that explodes computationally on real programs). Recent attempts to throw a large language model at the problem help with "understanding" what code does, but LLMs hallucinate confidently, miss subtle interprocedural bugs (where the flaw spans multiple functions), and can't tell whether a suspicious-looking pattern is actually reachable at runtime.

What Veritas does: The authors built an agentic framework — meaning a coordinated team of LLM-driven components that each handle a piece of the job — anchored by two ideas that keep it from drifting into fantasy:

Semantic grounding: A static slicer first carves out only the program slices relevant to memory operations (allocations, frees, pointer arithmetic, buffer accesses), recovering object semantics from the stripped binary. The LLM reasons over this distilled view rather than raw assembly.
Runtime feasibility evidence: Instead of trusting the LLM's claim that "this looks like a heap overflow," Veritas tries to produce a concrete trigger — an input or execution trace that actually exercises the bug. If it can't, the finding gets downgraded.

The agentic part matters because memory bugs aren't local. A vulnerability often involves one function allocating an object, another using it, and a third freeing it under some weird condition. Veritas's agents propagate information across these boundaries, mimicking how a human reverse engineer hops around in IDA or Ghidra building a mental model.

The key insight: LLMs are good at pattern recognition and bad at proof. The right design isn't "LLM replaces the analyzer" — it's "LLM steers a classical analyzer toward the suspicious parts, and the classical analyzer keeps the LLM honest." Veritas treats the LLM as a strong but unreliable collaborator and pairs it with the kinds of evidence (slices, runtime triggers) that don't lie.

Why it matters: As more critical infrastructure runs on closed-source binaries and firmware, scalable binary vulnerability detection that combines LLM reasoning with grounded program analysis could meaningfully shift the offense-defense balance toward defenders.

All newsletters