SkillsBench: A Benchmark for Evaluating AI Agent Skills Across Diverse Tasks
By mustaphah
Summary
SkillsBench is a new benchmark for evaluating how well AI agent skills work across diverse tasks. The benchmark includes 86 tasks across 11 domains with curated skills and deterministic verifiers. Testing 7 agent-model configurations over 7,308 trajectories showed that curated skills improve average pass rates by 16.2 percentage points, though effectiveness varies significantly by domain (from +4.5pp in Software Engineering to +51.9pp in Healthcare). Notably, 16 of 84 tasks showed negative performance impacts from skills. Self-generated skills provided no benefit on average, indicating models cannot reliably author the procedural knowledge they benefit from consuming. The research also found that focused skills with 2-3 modules outperform comprehensive documentation, and smaller models with skills can match larger models without them.
Key quotes
Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare)
Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming
Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them
16 of 84 tasks show negative deltas
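The figures quoted above are differences in percentage points (pp), which are absolute differences between two pass rates rather than relative changes: moving from a 20% to a 60% pass rate is +40pp. The short sketch below shows one way such per-task deltas, their average, and the count of negative-delta tasks might be computed. The task names, pass counts, and the `delta_pp` helper are hypothetical and invented for illustration; they are not SkillsBench data or the authors' evaluation code.

```python
from statistics import mean

# Hypothetical results: (passes, trials) per task and condition.
# These numbers are made up for illustration; they are NOT SkillsBench data.
results = {
    "task_a": {"no_skills": (2, 10), "with_skills": (6, 10)},
    "task_b": {"no_skills": (5, 10), "with_skills": (4, 10)},
    "task_c": {"no_skills": (1, 10), "with_skills": (3, 10)},
}


def pass_rate_pct(passes: int, trials: int) -> float:
    """Pass rate as a percentage of trials."""
    return 100 * passes / trials


def delta_pp(no_skills: tuple, with_skills: tuple) -> float:
    """Absolute change in pass rate, in percentage points."""
    return pass_rate_pct(*with_skills) - pass_rate_pct(*no_skills)


# Per-task delta, average delta, and tasks where skills hurt.
deltas = {
    task: delta_pp(r["no_skills"], r["with_skills"])
    for task, r in results.items()
}
average_delta = mean(deltas.values())
negative_tasks = sorted(task for task, d in deltas.items() if d < 0)

for task, d in deltas.items():
    print(f"{task}: {d:+.1f}pp")
print(f"average: {average_delta:+.1f}pp")
print(f"tasks with negative delta: {negative_tasks}")
```

In this toy data one of three tasks gets worse with skills even though the average delta is positive, which mirrors the pattern the quotes describe: an average gain can coexist with individual tasks that show negative deltas.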
