OSWorld 2.0: A new benchmark for testing AI agents on complex, real-world computer tasks

By XLANG Lab

5h ago · 4 min read · en · News

Summary

This article introduces OSWorld 2.0, a benchmark for evaluating computer-use AI agents on long-horizon, real-world tasks. It comprises 108 workflows spanning everyday and professional tasks. Each workflow takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.5 using maximum thinking, compared with about 30 in OSWorld 1.0. The authors argue that existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, which limits their ability to reveal the limitations of frontier agents.

Source

Twitter / X · OSWorld 2.0: A new benchmark for testing AI agents on complex, real-world computer tasks · osworld-v2.xlang.ai

Key quotes

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents.
We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows spanning everyday and professional tasks.
Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.5 using maximum thinking, compared with about 30 in OSWorld 1.0.
Snippet from the RSS feed
OSWorld 2.0: Benchmarking computer-use agents on long-horizon real-world tasks
