2026-05-08
When you ask an LLM to write Python code, it doesn't just produce logic — it also tells you which third-party libraries to install, often pinned to specific version numbers like requests==2.25.0 or numpy==1.19.2. Those version numbers seem like a minor detail, but they're effectively security and compatibility decisions. This paper is the first big systematic look at whether LLMs are picking good versions, and the answer is: not really.
The researchers built a benchmark called PinTrace (1,000 real Stack Overflow-style coding tasks) and ran 10 different LLMs on it, then inspected every library version those models suggested. They checked each pinned version against public vulnerability databases and against actual Python ecosystem rules: does this version even exist, and is it compatible with the other libraries the model picked?
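To make the kind of check described above concrete, here is a minimal sketch in Python. It is not the paper's tooling: the function names (parse_pins, classify_pin, audit) are made up for illustration, and the KNOWN_VERSIONS and ADVISORIES tables are small hand-written stand-ins for a real package index and a real vulnerability database. The advisory ID is a placeholder, not a real identifier.

```python
import re

# ASSUMPTION: hand-written stand-in for a package index snapshot
# (package name -> set of released versions). A real check would query the index.
KNOWN_VERSIONS = {
    "requests": {"2.25.0", "2.31.0"},
    "numpy": {"1.19.2", "1.26.4"},
}

# ASSUMPTION: hand-written stand-in for a vulnerability database.
# The advisory ID below is a placeholder, not a real identifier.
ADVISORIES = {
    ("requests", "2.25.0"): ["EXAMPLE-ADVISORY-1"],
}

# Matches exact pins of the form name==version; other specifiers are ignored.
PIN_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([^\s;#]+)")


def parse_pins(requirements_text):
    """Return (name, version) tuples for every exact pin in the text."""
    pins = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = PIN_RE.match(line)
        if match:
            pins.append((match.group(1).lower(), match.group(2)))
    return pins


def classify_pin(name, version):
    """Label a pin as 'nonexistent', 'vulnerable', or 'ok'."""
    if version not in KNOWN_VERSIONS.get(name, set()):
        return "nonexistent"  # candidate dependency hallucination
    if ADVISORIES.get((name, version)):
        return "vulnerable"
    return "ok"


def audit(requirements_text):
    """Map each 'name==version' pin to its label."""
    return {
        f"{name}=={version}": classify_pin(name, version)
        for name, version in parse_pins(requirements_text)
    }
```

Calling audit("requests==2.25.0\nnumpy==9.9.9\n") with the stand-in tables above labels the first pin as vulnerable and the second as nonexistent. Compatibility between pins is deliberately left out of this sketch, since it requires real dependency metadata.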
The headline finding: even when an LLM produces code that works perfectly, the dependencies it specifies frequently come with known security holes or are simply incompatible with each other. The model's training data is frozen at some past date, so it tends to recommend versions that were popular at that time — many of which have since had CVEs disclosed against them. Worse, models sometimes confidently produce version numbers that never existed at all, a kind of dependency hallucination.
Why this matters in practice:
pip install command is potentially pulling in vulnerable code paths that a security scanner will flag the moment it runs.
requirements.txt the model would write.
The key insight is conceptual as much as empirical: an LLM coding assistant is implicitly making supply-chain decisions every time it writes an import statement, and those decisions inherit all the messiness of the real package ecosystem (versioning, deprecation, vulnerability disclosure, dependency conflicts), none of which the model has any live awareness of. The training cutoff isn't just a knowledge gap about new APIs; it's a security liability.
The paper essentially argues we need a new evaluation axis for code-generating models. Beyond "does the code run?" we should be asking "would a security scanner approve the dependency manifest this model just wrote?" That's a much harder bar, and current models clear it much less often than their pass@1 scores would suggest.
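One way to picture that extra evaluation axis is to score each task twice: once on whether the code passes its tests, and once on whether it passes and also ships a clean manifest. The sketch below assumes a per-task record with two boolean fields. The names TaskResult, pass_rate, and secure_pass_rate are hypothetical and are not metrics defined in the paper.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # ASSUMPTION: both fields are produced by some upstream evaluation harness.
    passed_tests: bool    # the generated code ran and passed its tests
    manifest_clean: bool  # no nonexistent, vulnerable, or conflicting pins


def pass_rate(results):
    """Fraction of tasks whose code passed its tests."""
    if not results:
        return 0.0
    return sum(r.passed_tests for r in results) / len(results)


def secure_pass_rate(results):
    """Fraction of tasks that passed tests AND had a clean manifest."""
    if not results:
        return 0.0
    return sum(r.passed_tests and r.manifest_clean for r in results) / len(results)
```

By construction the second rate can never exceed the first, so the difference between them gives a rough measure of how much working code arrives with a problematic dependency manifest.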
