OSWorld 2.0: A new benchmark for testing AI agents on complex, real-world computer tasks
This article introduces OSWorld 2.0, a benchmark for evaluating computer-use AI agents on long-horizon, real-world tasks. Unlike existing benchmarks, whose tasks are simple and short (around 30 tool calls), OSWorld 2.0 features 108 complex workflows that take human users a med