Study Finds Half of AI-Generated SWE-bench Pull Requests Would Not Be Merged by Human Maintainers
By mustaphah
Summary
A study finds that roughly half of the test-passing SWE-bench Verified pull requests (PRs) written by AI agents from mid-2024 to mid/late-2025 would not be merged into main by the repositories' maintainers, even after adjusting for noise in maintainer merge decisions. The authors caution that a naive reading of benchmark scores may overestimate how useful agents are without more elicitation or human feedback. Because the agents were not given a chance to iterate on their solutions in response to feedback, as a human developer would, the authors do not claim that the result reflects a fundamental capability limitation.
Key quotes
We find that roughly half of test-passing SWE-bench Verified PRs written by mid-2024 to mid/late-2025 agents would not be merged into main by repo maintainers, even after adjusting for noise in maintainer merge decisions.
Since the agents are not given a chance to iterate on their solution in response to feedback the way a human developer would, we do not claim that this represents a fundamental capability limitation.
Rather, our results indicate that a naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more elicitation or human feedback.