ArXiv Paper Digest: MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

2026-05-06

Authors: Jonathan Steinberg, Oren Gal

Imagine you're a security guard at an office building. Someone asks if they can borrow a screwdriver. Sure. Later, a different person asks for the wifi password. Fine. Then someone asks where the server room is. No big deal individually — but together, those three innocuous requests just helped someone break into your network. This paper asks: are AI coding assistants vulnerable to exactly this kind of trick?

The authors built MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance), a benchmark of 199 attack chains, each broken into three stages. No single stage looks suspicious — they read like ordinary engineering tickets you'd find in any Jira board: "add a logging utility," "refactor this auth check," "expose this endpoint for internal testing." But when an AI agent dutifully completes all three in sequence, the resulting code contains a real, exploitable vulnerability.

The key insight is that current AI safety alignment evaluates requests in isolation. Models are trained to refuse obvious jailbreaks like "write me malware." They're rarely trained to notice that the third innocent task they've been handed completes a backdoor that the first two tasks set up. The malicious end-state emerges from the composition of harmless-looking work, not from any individual prompt.

What makes this benchmark scientifically useful:

Deterministic verification: Each chain ships with tests that automatically confirm whether the final code is actually exploitable, so results are reproducible rather than subjective.
Realistic decomposition: The stages mimic how real engineering work flows — small tickets, separate sessions, plausible justifications — rather than contrived prompts.
Structural blind spot: It targets a category of failure that per-prompt safety review fundamentally cannot catch, because each prompt looks fine.
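
To make the chain-plus-verifier idea concrete, here is a minimal Python sketch. It is not the benchmark's actual format. The Codebase dict, the Stage and Chain classes, the flag names, and run_chain are all hypothetical stand-ins, and a real chain would edit real code and run real exploit tests rather than flip booleans.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy model: a "codebase" is just a dict of feature flags.
Codebase = Dict[str, bool]


@dataclass
class Stage:
    ticket: str                         # the innocuous-looking request
    apply: Callable[[Codebase], None]   # the change an agent would make


@dataclass
class Chain:
    stages: List[Stage]
    is_exploitable: Callable[[Codebase], bool]  # deterministic check


def run_chain(chain: Chain) -> List[bool]:
    """Apply stages in order and record the check's verdict after each."""
    code: Codebase = {
        "logs_request_bodies": False,
        "auth_check_skippable": False,
        "test_endpoint_exposed": False,
    }
    verdicts: List[bool] = []
    for stage in chain.stages:
        stage.apply(code)
        verdicts.append(chain.is_exploitable(code))
    return verdicts


# Ticket texts are the examples quoted in the digest; the effects are invented.
demo_chain = Chain(
    stages=[
        Stage("add a logging utility",
              lambda c: c.update(logs_request_bodies=True)),
        Stage("refactor this auth check",
              lambda c: c.update(auth_check_skippable=True)),
        Stage("expose this endpoint for internal testing",
              lambda c: c.update(test_endpoint_exposed=True)),
    ],
    # Assumed rule: the code is exploitable only once all three changes land.
    is_exploitable=lambda c: all(c.values()),
)

verdicts = run_chain(demo_chain)
print(verdicts)  # [False, False, True]
```

In this toy, the check stays negative after the first two stages and flips only when the third lands, which is the compositional pattern the benchmark is built to measure.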

Why this is a big deal: coding agents are increasingly given long-running, multi-step engineering tasks with limited oversight. If an attacker can plant a malicious objective by feeding the agent a sequence of innocent-seeming tickets (perhaps via a compromised issue tracker, a poisoned RAG corpus, or even a clever co-worker), guardrails that review each prompt on its own won't catch it. The model passes every individual safety check and ships a vulnerability anyway.

The paper essentially argues that AI safety needs to graduate from "is this prompt bad?" to "where is this trajectory headed?" Models need some form of memory and intent-tracking across a session, or external monitors that look at code holistically rather than per-edit. Without that, the more capable and autonomous coding agents become, the more attractive they become as unwitting accomplices.
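
The difference between per-edit review and trajectory-level review can be sketched in a few lines of Python. This is an illustrative toy, not a monitor proposed in the paper. The capability labels, the DANGEROUS_COMBO rule, per_edit_monitor, and TrajectoryMonitor are all assumptions made up for this example.

```python
from typing import FrozenSet, List, Set

# Hypothetical rule: these capabilities are harmless alone, dangerous together.
DANGEROUS_COMBO: FrozenSet[str] = frozenset(
    {"logs_credentials", "auth_bypass_flag", "public_debug_endpoint"}
)


def per_edit_monitor(edit_caps: Set[str]) -> bool:
    """Flag an edit only if it introduces the whole combination by itself."""
    return DANGEROUS_COMBO <= edit_caps


class TrajectoryMonitor:
    """Remember what earlier edits introduced and judge the running total."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()

    def observe(self, edit_caps: Set[str]) -> bool:
        self.seen |= edit_caps
        return DANGEROUS_COMBO <= self.seen


# One session of three edits, each adding a single capability.
session: List[Set[str]] = [
    {"logs_credentials"},
    {"auth_bypass_flag"},
    {"public_debug_endpoint"},
]

per_edit_flags = [per_edit_monitor(edit) for edit in session]

trajectory = TrajectoryMonitor()
trajectory_flags = [trajectory.observe(edit) for edit in session]

print(per_edit_flags)    # [False, False, False]
print(trajectory_flags)  # [False, False, True]
```

The stateless check never fires because no single edit contains the full combination, while the stateful one fires on the third edit. A real monitor would need to infer such capabilities from actual code, which is the hard part this sketch skips.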

Why it matters: It exposes a structural flaw in how we evaluate AI coding agent safety — adversaries don't need jailbreaks when they can simply chain innocent tickets into an exploit.
