Study Finds Half of AI-Generated SWE-bench Pull Requests Would Not Be Merged by Human Maintainers

Summary: We find that roughly half of the test-passing SWE-bench Verified PRs written by AI agents from mid-2024 to mid/late 2025 would not be merged into main by repo maintainers, even after adjusting for noise…


mustaphah · 4mo ago · 68 min read · en · Insight

technology programming software development ai research

