2026-06-08
AI coding agents like Claude Code and Gemini CLI have a feature that's become wildly popular: skills. A skill is a little bundle — a markdown file with natural-language instructions, some executable scripts, and a list of tool permissions — that you can drop into your agent to teach it a new trick. Want your agent to know how to deploy to your specific cloud setup? Install a skill. Want it to handle a niche file format? Install a skill.
The problem: skills are basically a new kind of software supply chain, and nobody knows how dangerous it is. A skill is simultaneously code (the scripts it runs) and prompt (the instructions it feeds the agent). That hybrid nature means traditional malware scanners miss things — a script might look benign, but the markdown instructions could nudge the agent into doing something harmful with totally legitimate tools. And prompt-injection detectors miss the other half: a perfectly innocent-looking instruction file paired with a malicious script.
This paper introduces MalSkillBench, the first benchmark designed to measure how well detection tools catch malicious skills. The key contribution is the word runtime-verified: rather than just labeling skills as "looks suspicious," the authors actually ran each malicious skill and confirmed it does the bad thing it claims to do. That ground truth is what's been missing — previous security tools were essentially being graded against vibes.
What's in the benchmark:
The key insight is that a skill's risk lives in the seam between code and language. A reviewer reading just the script sees nothing wrong. A reviewer reading just the markdown sees nothing wrong. Only when an agent stitches them together does the attack materialize — and that's exactly the analysis gap attackers will exploit.
For anyone running agents with third-party skills (which is increasingly everyone), this is the first honest measuring stick for whether your defenses work. Expect this to become the de facto evaluation for skill-scanning tools, the same way SWE-bench became the yardstick for coding agents.
