OSWorld 2.0: A new benchmark for testing AI agents on complex, real-world computer tasks
By
XLANG Lab
Summary
This article introduces OSWorld 2.0, a new benchmark for evaluating computer-use AI agents on long-horizon, real-world tasks. Unlike existing benchmarks that involve simple, short tasks (around 30 tool calls), OSWorld 2.0 features 108 complex workflows that take human users a median of 1.6 hours to complete and require an average of 318 tool calls with Claude Opus 4.5 using maximum thinking. The benchmark aims to better capture the realism, complexity, and long-horizon demands of actual computer use, revealing limitations of current frontier AI agents.
Source
Key quotes
· 3 pulledExisting computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents.
We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows spanning everyday and professional tasks.
Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.5 using maximum thinking, compared with about 30 in OSWorld 1.0.
You might also wanna read
SkillsBench: A Benchmark for Evaluating AI Agent Skills Across Diverse Tasks
SkillsBench is a new benchmark for evaluating how well AI agent skills work across diverse tasks. The benchmark includes 86 tasks across 11
PA Bench: A New Benchmark for Evaluating AI Web Agents on Real-World Personal Assistant Workflows
The article introduces PA Bench, a new benchmark for evaluating web-based AI agents on real-world personal assistant workflows. It addresses
AI Task Completion Capabilities Show Exponential Growth, Could Handle Most Software Tasks Within a Decade
The article presents a methodology for measuring AI performance based on the length of tasks AI agents can complete independently. It shows
Assessing the Real-World Impact of AI on Open-Source Developer Productivity in Early 2025
This article examines the limitations of current AI coding benchmarks, arguing that they sacrifice realism for scale and efficiency. Benchma
ProgramBench: New Benchmark Reveals Language Models Struggle to Build Complete Software Projects From Scratch
This paper introduces ProgramBench, a new benchmark designed to evaluate the ability of language model-based software engineering agents to
Web Bench: A Comprehensive Benchmark for AI Browser Agent Performance
Web Bench is a new benchmark platform designed to evaluate and compare AI browser agents' performance in web navigation tasks. It provides c

Comments
Sign in to join the conversation.
No comments yet. Be the first.