
DeepSWE: A New Long-Horizon Benchmark for Evaluating Frontier Coding Agents on Complex Engineering Tasks

By ammar_x

5d ago · 17 min read · en · Insight

Summary

DeepSWE is a new long-horizon software engineering benchmark designed to evaluate frontier coding agents on original, complex engineering tasks. It addresses shortcomings in existing benchmarks such as SWE-bench Pro, whose tasks average only 120 lines of code to solve and whose verifier, according to the authors' audit, misgrades agent outputs (8% false positives, 24% false negatives). It also responds to growing concerns, raised by frontier labs, about benchmark contamination.

Key quotes

DeepSWE is a long-horizon software engineering benchmark that delivers four major advances over today's public benchmarks
Existing benchmarks fall short on several of these axes
SWE-bench Pro, the leading agentic coding benchmark, has tasks averaging just 120 lines of code to solve
our audit found its verifier misgrades agent outputs at rates of 8% false positives and 24% false negatives
Frontier labs are also raising growing concerns about benchmark contamination
Snippet from the RSS feed
DeepSWE measures frontier coding agents on original, long-horizon software engineering tasks.
