ArXiv Paper Digest: Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

2026-05-08

Authors: Chengjie Wang, Jingzheng Wu, Xiang Ling, Tianyue Luo

When you ask an LLM to write Python code, it doesn't just produce logic — it also tells you which third-party libraries to install, often pinned to specific version numbers like requests==2.25.0 or numpy==1.19.2. Those version numbers seem like a minor detail, but they're effectively security and compatibility decisions. This paper is the first big systematic look at whether LLMs are picking good versions, and the answer is: not really.

The researchers built a benchmark called PinTrace — 1,000 real Stack Overflow-style coding tasks — and ran 10 different LLMs on it, then inspected every library version those models suggested. They checked each pinned version against public vulnerability databases and against actual Python ecosystem rules (does this version even exist? Is it compatible with the other libraries the model picked?).
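To make that audit step concrete, here is a minimal sketch of checking pinned versions for existence and known vulnerabilities. It is not the paper's tooling: the KNOWN_RELEASES and KNOWN_VULNERABLE tables, the package names, and the advisory ID are all invented placeholders standing in for a live package index and advisory database.

```python
import re

# Hypothetical stand-ins for a package index and an advisory database.
# Every entry here is invented for illustration only.
KNOWN_RELEASES = {
    "examplelib": {"1.0.0", "1.1.0", "2.0.0"},
    "otherlib": {"0.9.0", "0.9.1"},
}
KNOWN_VULNERABLE = {
    ("examplelib", "1.0.0"): ["EXAMPLE-ADVISORY-0001"],
}

# Matches simple exact pins of the form name==version.
PIN_PATTERN = re.compile(
    r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([A-Za-z0-9.!+*-]+)\s*$"
)


def parse_pins(requirements_text):
    """Return (name, version) pairs for every `name==version` line."""
    pins = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = PIN_PATTERN.match(line)
        if match:
            pins.append((match.group(1).lower(), match.group(2)))
    return pins


def audit_pin(name, version):
    """Classify one pin as 'nonexistent', 'vulnerable', or 'ok'."""
    # Unknown packages and unknown versions are both treated as nonexistent.
    if version not in KNOWN_RELEASES.get(name, set()):
        return "nonexistent"
    if (name, version) in KNOWN_VULNERABLE:
        return "vulnerable"
    return "ok"


def audit_requirements(requirements_text):
    """Map each pinned requirement string to its classification."""
    return {
        f"{name}=={version}": audit_pin(name, version)
        for name, version in parse_pins(requirements_text)
    }


if __name__ == "__main__":
    manifest = "examplelib==1.0.0\notherlib==0.9.1  # fine\nexamplelib==3.5.0\n"
    for pin, status in audit_requirements(manifest).items():
        print(pin, status)
```

The sketch only handles exact `==` pins. Range specifiers and cross-library compatibility checks would need a proper resolver.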

The headline finding: even when an LLM produces code that works perfectly, the dependencies it specifies frequently come with known security holes or are simply incompatible with each other. The model's training data is frozen at some past date, so it tends to recommend versions that were popular at that time — many of which have since had CVEs disclosed against them. Worse, models sometimes confidently produce version numbers that never existed at all, a kind of dependency hallucination.
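The staleness mechanism can be sketched with a toy timeline. This assumes, as a simplification not taken from the paper, that a model pins the newest release it could have seen before its training cutoff. The release dates, advisory, and cutoff below are all invented.

```python
from datetime import date

# Hypothetical release history and advisory list for an invented package.
RELEASES = {
    "1.0.0": date(2021, 3, 1),
    "1.1.0": date(2022, 6, 15),
    "2.0.0": date(2024, 1, 10),
}
ADVISORIES = [
    {"id": "EXAMPLE-0001", "affected": {"1.0.0", "1.1.0"}, "disclosed": date(2023, 9, 5)},
]


def newest_release_before(cutoff):
    """Return the latest version released on or before the cutoff, or None."""
    candidates = [
        (released, version)
        for version, released in RELEASES.items()
        if released <= cutoff
    ]
    return max(candidates)[1] if candidates else None


def advisories_after_cutoff(version, cutoff):
    """List advisories affecting `version` that were disclosed after the cutoff."""
    return [
        advisory["id"]
        for advisory in ADVISORIES
        if version in advisory["affected"] and advisory["disclosed"] > cutoff
    ]


if __name__ == "__main__":
    cutoff = date(2023, 1, 1)  # assumed training cutoff
    pinned = newest_release_before(cutoff)
    print(pinned, advisories_after_cutoff(pinned, cutoff))
```

With this toy data, the version that looked current at the cutoff is affected by an advisory the model could not have seen.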

Why this matters in practice:

- A developer copy-pasting an LLM's pip install command is potentially pulling in vulnerable code paths that a security scanner will flag the moment it runs.
- "The code works on my machine" hides the issue: functional correctness and dependency safety are completely orthogonal.
- Most code-generation benchmarks (HumanEval, MBPP, etc.) only check whether the function returns the right answer. They never look at the requirements.txt the model would write.

The key insight is conceptual as much as empirical: an LLM coding assistant is implicitly making supply-chain decisions every time it names a library or pins a version, and those decisions inherit all the messiness of the real package ecosystem (versioning, deprecation, vulnerability disclosure, dependency conflicts), none of which the model has any live awareness of. The training cutoff isn't just a knowledge gap about new APIs; it's a security liability.

The paper essentially argues we need a new evaluation axis for code-generating models. Beyond "does the code run?" we should be asking "would a security scanner approve the dependency manifest this model just wrote?" That's a much harder bar, and current models clear it much less often than their pass@1 scores would suggest.
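One way to picture that extra axis is to score each generated sample on two independent flags and report both rates. This is an illustrative sketch only: the Sample fields and the name secure_pass_rate are hypothetical and are not metrics defined in the paper.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    passed_tests: bool    # did the generated code pass its functional tests?
    manifest_clean: bool  # did every pinned dependency pass the audit?


def pass_rate(samples):
    """Fraction of samples whose code passes the functional tests."""
    return sum(s.passed_tests for s in samples) / len(samples)


def secure_pass_rate(samples):
    """Fraction of samples that pass tests AND have a clean manifest."""
    return sum(s.passed_tests and s.manifest_clean for s in samples) / len(samples)


if __name__ == "__main__":
    samples = [
        Sample(passed_tests=True, manifest_clean=True),
        Sample(passed_tests=True, manifest_clean=False),
        Sample(passed_tests=False, manifest_clean=True),
        Sample(passed_tests=True, manifest_clean=False),
    ]
    print(pass_rate(samples))         # 0.75
    print(secure_pass_rate(samples))  # 0.25
```

The gap between the two numbers is the blind spot: samples that a correctness-only benchmark counts as successes but that a dependency audit would reject.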

Why it matters: Functionally correct LLM-generated code can still ship known vulnerabilities through the version numbers it pins, exposing a blind spot in how we currently evaluate AI coding assistants.
