SkillsBench: A Benchmark for Evaluating AI Agent Skills Across Diverse Tasks

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We…


mustaphah · 5mo ago · 2 min read · en · Insight

technology · science · artificial intelligence · benchmarking

You might also wanna read

UniClawBench: A Universal Benchmark for Proactive Agents on Real-World Tasks

arXiv:2607.08768v1 Announce Type: new Abstract: The rapid development of large language models and multimodal large language models has acce

machinebrief.com·7d ago

EdgeBench: A Benchmark for Measuring AI Environment Learning Through Extended Real-World Tasks

EdgeBench studies how agents learn from real-world environments across 134 day-long executable tasks.

edge-bench.org·14d ago

VideoWeaver: A Benchmark for Evaluating AI Agent Skills in Long Video Generation

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but whether they can handle long

arxiv.org·1mo ago

New Alibaba AI framework skips loading every tool, cutting agent token use 99%

As enterprise AI systems scale to handle complex workflows, practitioners face the challenge of routing subtasks to the right tools and skil

VentureBeat·14d ago

MCPEvol-Bench: Benchmarking LLM Agent Performance Across Dynamic Evolutions of MCP Servers

arXiv:2607.14642v1 Announce Type: new Abstract: As Model Context Protocol (MCP) servers emerge as the core infrastructure for connecting LLM

machinebrief.com·11h ago

Skill-MAS: A Meta-Skill Approach to Improving Multi-Agent Systems Without Retraining

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. Ho

arxiv.org·26d ago
