ArXiv Paper Digest: MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills

MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills

2026-06-08

Authors: Wenbo Guo, Wei Zeng, Chengwei Liu, Xiaojun Jia

AI coding agents like Claude Code and Gemini CLI have a feature that's become wildly popular: skills. A skill is a little bundle — a markdown file with natural-language instructions, some executable scripts, and a list of tool permissions — that you can drop into your agent to teach it a new trick. Want your agent to know how to deploy to your specific cloud setup? Install a skill. Want it to handle a niche file format? Install a skill.

The problem: skills are basically a new kind of software supply chain, and nobody knows how dangerous it is. A skill is simultaneously code (the scripts it runs) and prompt (the instructions it feeds the agent). That hybrid nature means traditional malware scanners miss things — a script might look benign, but the markdown instructions could nudge the agent into doing something harmful with totally legitimate tools. And prompt-injection detectors miss the other half: a perfectly innocent-looking instruction file paired with a malicious script.

This paper introduces MalSkillBench, the first benchmark designed to measure how well detection tools catch malicious skills. The key contribution is the word runtime-verified: rather than just labeling skills as "looks suspicious," the authors actually ran each malicious skill and confirmed it does the bad thing it claims to do. That ground truth is what's been missing — previous security tools were essentially being graded against vibes.

What's in the benchmark:

A curated collection of skills covering the hybrid attack surface — pure code attacks, pure prompt-injection attacks, and the nasty in-between cases where the code and instructions only become malicious in combination.
Verified execution traces proving each malicious skill actually does damage when an agent loads it.
A way to evaluate detection tools across the whole spectrum, so we can finally see where current defenses fall down.

The key insight is that a skill's risk lives in the seam between code and language. A reviewer reading just the script sees nothing wrong. A reviewer reading just the markdown sees nothing wrong. Only when an agent stitches them together does the attack materialize — and that's exactly the analysis gap attackers will exploit.

For anyone running agents with third-party skills (which is increasingly everyone), this is the first honest measuring stick for whether your defenses work. Expect this to become the de facto evaluation for skill-scanning tools, the same way SWE-bench became the yardstick for coding agents.

Why it matters: Agent skills are a fast-growing supply chain attack vector that sits awkwardly between code and prompts, and MalSkillBench gives the security community its first verified ground truth for measuring whether defenses actually work.

All newsletters