ArXiv Paper Digest: CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis

2026-04-24

Authors: Arunabh Majumdar

ArXiv: 2604.21917v1

Imagine two people each carrying half of a key. Neither one can open the vault alone, so a guard checking them individually waves them both through. But once they meet up inside, the vault is wide open. That's essentially the class of software vulnerability this paper tackles.

Most security scanning tools — the automated sentries that watch over codebases — work by examining each code change (commit) in isolation. If a single commit doesn't introduce an obvious flaw, it gets a clean bill of health. But what if a vulnerability only emerges from the combination of two or more commits, each harmless on its own? This paper calls these cross-commit vulnerabilities, and argues they represent a serious blind spot in current security tooling.

The author curated a benchmark of 15 real-world Python vulnerabilities, all with official CVE identifiers, meaning each one is a publicly disclosed flaw catalogued in the CVE list. For each one:

The exploitable condition was introduced across multiple commits, not a single bad change.
Each individual commit looks benign to standard static analysis tools.
Only when you consider the commits together does the security flaw become apparent.

To validate the blind spot, the author ran two popular Python security scanners, Semgrep and Bandit, in two modes: scanning each commit individually (the normal workflow), and scanning the cumulative codebase after all contributing commits landed. The per-commit scans missed the vulnerabilities, confirming that tools run this way do not catch threats that build up incrementally.
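
The difference between the two modes can be sketched with a toy model. Everything here is an assumption made for illustration: `toy_rule` is a stand-in that is far simpler than any real Semgrep or Bandit rule, the commit contents are invented, and "per-commit" is modelled as seeing only the files that one commit touched. The digest does not describe the paper's actual harness, nor how the real tools fared in cumulative mode.

```python
from typing import Callable, Dict, List

# Hypothetical model: a commit maps each touched path to its new file text.
Commit = Dict[str, str]
Rule = Callable[[Dict[str, str]], bool]


def toy_rule(files: Dict[str, str]) -> bool:
    """Stand-in rule: fires only if a risky call AND a disabled guard are both visible."""
    text = "\n".join(files.values())
    return "open(user_path)" in text and "VALIDATE = False" in text


def scan_per_commit(commits: List[Commit], rule: Rule) -> List[bool]:
    # Each scan sees only what a single commit touched.
    return [rule(commit) for commit in commits]


def scan_cumulative(commits: List[Commit], rule: Rule) -> bool:
    # Replay all commits into one snapshot, then scan the snapshot once.
    snapshot: Dict[str, str] = {}
    for commit in commits:
        snapshot.update(commit)
    return rule(snapshot)


# Two invented commits, each unremarkable on its own.
commits = [
    {"files.py": "def read(user_path):\n    return open(user_path).read()\n"},
    {"settings.py": "VALIDATE = False\n"},
]

per_commit_hits = scan_per_commit(commits, toy_rule)  # [False, False]
cumulative_hit = scan_cumulative(commits, toy_rule)   # True
```

In this toy, neither commit contains both halves of the pattern, so both per-commit scans come back clean while the replayed snapshot triggers the rule.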

Each CVE in the benchmark is annotated with the full chain of contributing commits and a structured explanation of why each commit dodges per-commit detection. This makes the dataset useful not just as a test suite, but as a teaching tool for understanding how vulnerabilities can be smuggled in piecemeal — whether accidentally through normal development, or deliberately by a sophisticated attacker.
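
One way to picture such an annotation is as a small record type. This is only a sketch: the class names, field names, and placeholder values below are hypothetical and are not taken from the dataset, whose real schema is not given in this digest.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContributingCommit:
    sha: str         # commit hash (placeholder values below)
    summary: str     # what the commit changed
    why_missed: str  # why a per-commit scan sees nothing wrong


@dataclass
class BenchmarkEntry:
    cve_id: str
    commits: List[ContributingCommit] = field(default_factory=list)

    def is_cross_commit(self) -> bool:
        # The benchmark's defining property: more than one contributing commit.
        return len(self.commits) >= 2


# Invented entry with placeholder identifiers, not a real CVE.
entry = BenchmarkEntry(
    cve_id="CVE-XXXX-XXXXX",
    commits=[
        ContributingCommit("aaaaaaa", "adds file helper", "caller still validates input"),
        ContributingCommit("bbbbbbb", "changes default config", "touches no risky call"),
    ],
)
```

Keeping the per-commit explanation next to each hash is what would let a reader walk the chain in order and see why each step looked harmless.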

The key insight is deceptively simple: security is a property of the whole system, not of individual changes. A function added in January might be perfectly safe until a configuration change in March removes the guard rail that kept it harmless. A scanner that looks at either change in isolation has nothing to flag, yet together they create an exploit path. This has implications for supply-chain security, where malicious contributors could theoretically spread an attack across innocent-looking pull requests.
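
The January and March scenario can be made concrete with a small, invented path-traversal example. None of this code comes from the benchmark: the function names, the `validate` flag, and the paths are all assumptions chosen to show how two harmless-looking changes combine.

```python
import posixpath


def resolve(root: str, rel: str) -> str:
    """'January' change: a helper that trusts its caller to have validated `rel`."""
    return posixpath.normpath(posixpath.join(root, rel))


def is_safe(rel: str) -> bool:
    """The guard rail that exists when the helper is added."""
    return not rel.startswith("/") and ".." not in rel.split("/")


def handle_request(root: str, rel: str, validate: bool) -> str:
    # 'March' change (hypothetical): a config edit flips `validate` to False.
    if validate and not is_safe(rel):
        raise PermissionError(rel)
    return resolve(root, rel)


# With the guard on, the traversal attempt is rejected.
try:
    handle_request("/srv/files", "../../etc/passwd", validate=True)
    blocked = False
except PermissionError:
    blocked = True

# With the guard off, the same request escapes the root directory.
escaped = handle_request("/srv/files", "../../etc/passwd", validate=False)
# escaped == "/etc/passwd"
```

Read separately, the helper is an ordinary path join and the config edit is a one-line boolean change. The exploit path only exists in the state where both are present.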

Why it matters: This benchmark exposes a fundamental limitation in how the industry scans code for security flaws — one commit at a time — and provides a concrete dataset to drive development of tools that can reason about vulnerabilities emerging across multiple changes.