DeepSWE: A New Long-Horizon Benchmark for Evaluating Frontier Coding Agents on Complex Engineering Tasks

ammar_x

5d ago · 17 min read · en · Insight


Summary

DeepSWE is a new long-horizon software engineering benchmark for evaluating frontier coding agents on original, complex engineering tasks. It targets shortcomings in existing public benchmarks such as SWE-bench Pro, whose tasks average only 120 lines of code to solve and whose verifier, according to the authors' audit, misgrades agent outputs at rates of 8% false positives and 24% false negatives. It also responds to growing concerns from frontier labs about benchmark contamination.
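
To make the misgrading figures concrete, here is a minimal sketch of how false-positive and false-negative rates for a verifier could be computed from a hand-audited sample. The function name `misgrade_rates`, the input format, and the choice of denominators (false positives over truly incorrect outputs, false negatives over truly correct outputs) are assumptions for illustration; the source does not say how the audit defined its rates, and the data below is a toy sample, not the audit's.

```python
from typing import Iterable, Tuple


def misgrade_rates(samples: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    Each sample is (verifier_passed, truly_correct), where truly_correct
    would come from a human audit. Hypothetical format, not from the source.
    """
    false_pos = false_neg = truly_wrong = truly_right = 0
    for verifier_passed, truly_correct in samples:
        if truly_correct:
            truly_right += 1
            if not verifier_passed:
                false_neg += 1  # correct output rejected by the verifier
        else:
            truly_wrong += 1
            if verifier_passed:
                false_pos += 1  # incorrect output accepted by the verifier
    # Guard against empty classes so the function never divides by zero.
    fpr = false_pos / truly_wrong if truly_wrong else 0.0
    fnr = false_neg / truly_right if truly_right else 0.0
    return fpr, fnr


# Toy audit: 4 truly correct outputs (1 rejected), 4 truly incorrect (1 accepted).
audit = (
    [(True, True)] * 3
    + [(False, True)]
    + [(True, False)]
    + [(False, False)] * 3
)
print(misgrade_rates(audit))
```

Under a different convention (for example, dividing by all graded outputs instead of per-class counts) the same audit would yield different percentages, which is why the denominator matters when comparing reported rates.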

Key quotes


DeepSWE is a long-horizon software engineering benchmark that delivers four major advances over today's public benchmarks

Existing benchmarks fall short on several of these axes

SWE-bench Pro, the leading agentic coding benchmark, has tasks averaging just 120 lines of code to solve

our audit found its verifier misgrades agent outputs at rates of 8% false positives and 24% false negatives

Frontier labs are also raising growing concerns about benchmark contamination

Snippet from the RSS feed

DeepSWE measures frontier coding agents on original, long-horizon software engineering tasks.
