OSWorld 2.0: A new benchmark for testing AI agents on complex, real-world computer tasks

XLANG Lab

5h ago· 4 min readenNews

technology programming ai benchmarks computer-use agents

Summary

This article introduces OSWorld 2.0, a new benchmark for evaluating computer-use AI agents on long-horizon, real-world tasks. Unlike existing benchmarks that involve simple, short tasks (around 30 tool calls), OSWorld 2.0 features 108 complex workflows that take human users a median of 1.6 hours to complete and require an average of 318 tool calls with Claude Opus 4.5 using maximum thinking. The benchmark aims to better capture the realism, complexity, and long-horizon demands of actual computer use, revealing limitations of current frontier AI agents.

Source

Twitter / XOSWorld 2.0: A new benchmark for testing AI agents on complex, real-world computer tasksosworld-v2.xlang.ai

Key quotes

· 3 pulled

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents.

We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows spanning everyday and professional tasks.

Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.5 using maximum thinking, compared with about 30 in OSWorld 1.0.

Snippet from the RSS feed

OSWorld 2.0: Benchmarking computer-use agents on long-horizon real-world tasks

You might also wanna read

SkillsBench: A Benchmark for Evaluating AI Agent Skills Across Diverse Tasks

SkillsBench is a new benchmark for evaluating how well AI agent skills work across diverse tasks. The benchmark includes 86 tasks across 11

arxiv.org·4mo ago

PA Bench: A New Benchmark for Evaluating AI Web Agents on Real-World Personal Assistant Workflows

The article introduces PA Bench, a new benchmark for evaluating web-based AI agents on real-world personal assistant workflows. It addresses

vibrantlabs.com·4mo ago

AI Task Completion Capabilities Show Exponential Growth, Could Handle Most Software Tasks Within a Decade

The article presents a methodology for measuring AI performance based on the length of tasks AI agents can complete independently. It shows

metr.org·6mo ago

Assessing the Real-World Impact of AI on Open-Source Developer Productivity in Early 2025

This article examines the limitations of current AI coding benchmarks, arguing that they sacrifice realism for scale and efficiency. Benchma

metr.org·11mo ago

ProgramBench: New Benchmark Reveals Language Models Struggle to Build Complete Software Projects From Scratch

This paper introduces ProgramBench, a new benchmark designed to evaluate the ability of language model-based software engineering agents to

arXiv.org·1mo ago

Web Bench: A Comprehensive Benchmark for AI Browser Agent Performance

Web Bench is a new benchmark platform designed to evaluate and compare AI browser agents' performance in web navigation tasks. It provides c

Product Hunt·1y ago

Comments

No comments yet. Be the first.