
SkillsBench: A Benchmark for Evaluating AI Agent Skills Across Diverse Tasks

By mustaphah · 3mo ago · 2 min read · en · Insight

Summary

SkillsBench is a new benchmark for evaluating how well AI agent skills work across diverse tasks. The benchmark includes 86 tasks across 11 domains with curated skills and deterministic verifiers. Testing 7 agent-model configurations over 7,308 trajectories showed that curated skills improve average pass rates by 16.2 percentage points, though effectiveness varies significantly by domain (from +4.5pp in Software Engineering to +51.9pp in Healthcare). Notably, 16 of 84 tasks showed negative performance impacts from skills. Self-generated skills provided no benefit on average, indicating models cannot reliably author the procedural knowledge they benefit from consuming. The research also found that focused skills with 2-3 modules outperform comprehensive documentation, and smaller models with skills can match larger models without them.
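The percentage-point figures above are differences between pass rates with and without skills. The sketch below shows that arithmetic under loud assumptions: the task names, the `results` layout, and the outcomes are hypothetical, not SkillsBench data, and the per-task averaging is an assumption, since the summary does not say how the benchmark aggregates its deltas.

```python
from statistics import mean


def pass_rate(outcomes):
    """Percentage (0-100) of trajectories whose verifier passed."""
    return 100.0 * sum(outcomes) / len(outcomes)


def skill_deltas(results):
    """Per-task delta in percentage points: with-skills rate minus no-skills rate."""
    return {
        task: pass_rate(runs["with_skills"]) - pass_rate(runs["no_skills"])
        for task, runs in results.items()
    }


# Hypothetical verifier outcomes (True = passed); illustrative only.
results = {
    "task_a": {"no_skills": [True, False, False, False],
               "with_skills": [True, True, True, False]},
    "task_b": {"no_skills": [True, True, False, False],
               "with_skills": [True, False, False, False]},
    "task_c": {"no_skills": [False, False, False, False],
               "with_skills": [True, True, False, False]},
}

deltas = skill_deltas(results)

# Assumed aggregation: unweighted mean of the per-task deltas.
average_delta = mean(list(deltas.values()))

# Tasks where adding skills lowered the pass rate.
negative_tasks = sorted(task for task, d in deltas.items() if d < 0)

print(f"average delta: {average_delta:+.1f}pp, negative tasks: {negative_tasks}")
```

In this toy data one task gets worse with skills even though the average delta is positive, which is the same pattern the summary reports at benchmark scale.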

Key quotes

Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare)
Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming
Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them
16 of 84 tasks show negative deltas
Snippet from the RSS feed
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domai
