SWE-bench Verified benchmark no longer accurately measures AI coding capabilities due to contamination

By kmdupree · 1mo ago · 8 min read

Summary

OpenAI's analysis finds that SWE-bench Verified, a benchmark for measuring AI coding capabilities, is increasingly contaminated and no longer accurately measures frontier coding progress. The benchmark suffers from training leakage (models exposed to benchmark data during training) and flawed test cases that do not properly evaluate autonomous software engineering. OpenAI recommends moving to SWE-bench Pro as a more robust alternative for measuring coding model performance.

Key quotes

Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks.
SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases.
Tracking and forecasting progress of these capabilities is also an important part of OpenAI's Preparedness Framework.
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress.
Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.