agent-skills-eval: An open-source test framework for measuring AI agent skill effectiveness
By darkrishabh
Summary
agent-skills-eval is an open-source test runner for evaluating AI agent skills (SKILL.md files) written to Anthropic's Agent Skills standard. It runs the same prompts twice, once with the skill loaded and once without, then has a judge model grade both outputs and produces a side-by-side comparison report. Developers can use it to measure whether a skill actually improves agent performance, which gives them empirical evidence ("receipts") of the skill's effect.
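The with/without comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not agent-skills-eval's actual API: the names `compare`, `report`, `Comparison`, `toy_agent`, and `toy_judge` are hypothetical, and the agent and judge are stand-in functions rather than real model calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Optional

# Hypothetical signatures (assumptions, not the tool's real interface):
# an agent maps (prompt, optional skill text) to an output string,
# and a judge maps (prompt, output) to a score between 0 and 1.
Agent = Callable[[str, Optional[str]], str]
Judge = Callable[[str, str], float]


@dataclass
class Comparison:
    prompt: str
    with_skill: float      # judge score when the skill is in context
    without_skill: float   # judge score for the baseline run

    @property
    def delta(self) -> float:
        return self.with_skill - self.without_skill


def compare(prompts: List[str], skill_text: str, agent: Agent, judge: Judge) -> List[Comparison]:
    """Run every prompt twice (with and without the skill) and grade both outputs."""
    rows = []
    for prompt in prompts:
        out_with = agent(prompt, skill_text)
        out_without = agent(prompt, None)
        rows.append(Comparison(prompt, judge(prompt, out_with), judge(prompt, out_without)))
    return rows


def report(rows: List[Comparison]) -> str:
    """Format a plain-text side-by-side table plus the mean score difference."""
    lines = [f"{'prompt':<30} {'with':>6} {'without':>8} {'delta':>7}"]
    for r in rows:
        lines.append(
            f"{r.prompt[:30]:<30} {r.with_skill:>6.2f} {r.without_skill:>8.2f} {r.delta:>+7.2f}"
        )
    lines.append(f"mean delta: {mean(r.delta for r in rows):+.2f}")
    return "\n".join(lines)


# Toy stand-ins so the sketch runs without any model access.
def toy_agent(prompt: str, skill: Optional[str]) -> str:
    return f"{prompt} [using skill]" if skill else prompt


def toy_judge(prompt: str, output: str) -> float:
    return 1.0 if "[using skill]" in output else 0.5


if __name__ == "__main__":
    rows = compare(["Summarize the changelog", "Write a unit test"], "skill body", toy_agent, toy_judge)
    print(report(rows))
```

A mean delta near zero would suggest the skill makes no measurable difference on these prompts; in a real run the agent and judge would be model calls, and the scores would be noisier than in this toy setup.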
Key quotes
Agent Skills — the open standard from Anthropic for giving agents domain knowledge — make it easy to ship a SKILL.md and assume your agent is now better at the task. The hard part is proving it.
agent-skills-eval is the missing piece. It runs your skill against the same prompts twice — once with_skill loaded into context, once without_skill (baseline) — has a judge model grade both outputs, and gives you a side-by-side report.
If the skill doesn't make a measurable difference, you'll see it. If it does, you have receipts.
