Study Finds Half of AI-Generated SWE-bench Pull Requests Would Not Be Merged by Human Maintainers

By mustaphah · 2mo ago · 68 min read · en · Insight

Summary

A study finds that roughly half of the test-passing SWE-bench Verified pull requests (PRs) written by AI agents from mid-2024 to mid/late-2025 would not be merged into main by repository maintainers, even after adjusting for noise in maintainer merge decisions. The authors do not claim this reflects a fundamental capability limitation, because the agents, unlike human developers, get no chance to iterate on their solutions in response to feedback. Their caution is instead that a naive reading of benchmark scores may overestimate how useful agents are without more elicitation or human feedback.

Key quotes

We find that roughly half of test-passing SWE-bench Verified PRs written by mid-2024 to mid/late-2025 agents would not be merged into main by repo maintainers, even after adjusting for noise in maintainer merge decisions.
Since the agents are not given a chance to iterate on their solution in response to feedback the way a human developer would, we do not claim that this represents a fundamental capability limitation.
Rather, our results indicate that a naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more elicitation or human feedback.