CursorBench 3.1: Benchmarking AI Coding Agents on Real-World Multi-File Tasks

By handfuloflight

6h ago · 7 min read · en · Insight

Summary

CursorBench is a benchmark developed by Cursor to evaluate AI coding agents on ambiguous, multi-file tasks drawn from real Cursor sessions. The article presents CursorBench 3.1 results that plot each model's score against its average cost per task; the models compared are Fable 5, Opus 4.8, Opus 4.7, GPT-5.5, Sonnet 5, Sonnet 4.6, GLM 5.2, Composer 2.5, and Composer 2. Fable 5 Max leads with a score of 72.9% at $18.02 per task, while the other Fable 5 variants and the remaining models occupy different points on the cost-performance curve. The benchmark aims to measure how well agents handle realistic, messy coding scenarios.
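
One way to read a score-versus-cost comparison like this is to compute the dollars spent per percentage point and to keep only the results that no other result beats on both cost and score (a Pareto frontier). The sketch below is illustrative and is not Cursor's methodology. Only the Fable 5 Max row comes from the summary; the two `placeholder-*` rows are hypothetical values included so the function has something to filter, and the names `Result`, `cost_per_point`, and `pareto_frontier` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    score: float     # benchmark score in percent, higher is better
    cost_usd: float  # average cost per task in US dollars


def cost_per_point(r: Result) -> float:
    """Dollars spent per percentage point of score."""
    return r.cost_usd / r.score


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep results that no other result beats on both cost and score."""
    frontier = []
    best_score = float("-inf")
    # Visit cheapest first; on equal cost, visit the higher score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score)):
        # A result survives only if it scores higher than everything cheaper.
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


results = [
    Result("Fable 5 Max", 72.9, 18.02),   # from the summary above
    Result("placeholder-a", 60.0, 5.00),  # hypothetical, NOT a real result
    Result("placeholder-b", 55.0, 9.00),  # hypothetical, NOT a real result
]

frontier_names = [r.name for r in pareto_frontier(results)]
fable_cost_per_point = round(cost_per_point(results[0]), 3)

print(frontier_names)
print(fable_cost_per_point)
```

With these inputs, `placeholder-b` drops out because `placeholder-a` is both cheaper and higher scoring, while Fable 5 Max stays on the frontier as the most expensive and highest-scoring entry. Replacing the placeholder rows with the published figures would give the actual frontier.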

Source

Hacker News · CursorBench 3.1: Benchmarking AI Coding Agents on Real-World Multi-File Tasks · cursor.com

Key quotes

· 3 pulled
We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions.
Higher scores are better.
A scatter and line chart comparing Fable 5, Opus 4.8, Opus 4.7, GPT-5.5, Sonnet 5, Sonnet 4.6, GLM 5.2, Composer 2.5, and Composer 2 scores against average cost per task.
Snippet from the RSS feed
Compare CursorBench 3.1 results across the models Cursor evaluates.
