Codex iteratively optimized AGENTS.md 8 times against real PRs, but the best version regressed on a clean holdout
By Ben Redmond
Summary
The author describes using Codex (an AI coding agent) to iteratively optimize their AGENTS.md file (a configuration file that guides AI agent behavior) against a benchmark of real pull requests from their Stet repository. After 8 iterations, the best-performing version improved performance on the training data but regressed on a clean holdout set, meaning it wasn't safe to deploy. The article explores the tension between vibe-coded configurations and data-driven optimization, the risks of overfitting agent instructions, and the importance of rigorous evaluation for AI agent behavior files.
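The train-versus-holdout check described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual harness: the `Candidate` type, the `safe_to_ship` and `pick_candidate` helpers, the `tolerance` parameter, and every score are hypothetical and chosen only to show the shape of the gate.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One AGENTS.md revision with its benchmark scores (all values hypothetical)."""
    name: str
    train_score: float    # score on the PRs used while iterating
    holdout_score: float  # score on PRs never seen during iteration


def safe_to_ship(baseline: Candidate, candidate: Candidate, tolerance: float = 0.0) -> bool:
    """Accept a candidate only if it beats the baseline on the training slice
    and does not regress on the holdout by more than `tolerance`."""
    improves_train = candidate.train_score > baseline.train_score
    holds_up = candidate.holdout_score >= baseline.holdout_score - tolerance
    return improves_train and holds_up


def pick_candidate(baseline: Candidate, candidates: list[Candidate],
                   tolerance: float = 0.0) -> Candidate:
    """Take the best candidate by training score, then fall back to the
    baseline if that candidate fails the holdout gate."""
    best = max(candidates, key=lambda c: c.train_score)
    return best if safe_to_ship(baseline, best, tolerance) else baseline


# Illustrative numbers only; they are not taken from the article.
baseline = Candidate("baseline", train_score=0.50, holdout_score=0.50)
overfit = Candidate("iteration-8", train_score=0.60, holdout_score=0.42)
chosen = pick_candidate(baseline, [overfit])
print(chosen.name)
```

Selecting on the training score and gating on the holdout keeps the holdout out of the optimization loop, which is what lets it expose an overfit candidate.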
Key quotes
I vibe-coded my AGENTS.md, and I'm pretty sure it's slop.
Codex used a benchmark on my repo to measure each change, and optimized AGENTS.md against the data, instead of on pure vibes.
Someone adds a rule that sounds smart, senior, and reasonable, commits it, and hopes the agent behaves better.
The best candidate improved the training slice, then regressed enough on a clean holdout that it was not safe to ship.
You might also wanna read
AGENTS.md: Standardized Documentation Format for AI Agents Adopted by Major Platforms
The article introduces AGENTS.md, a standardized format for AI agents that serves as a structured alternative to human-readable README files
How I Used Coding Agents to Automate My AI Research Work in Copilot Applied Science
An AI researcher shares their experience using coding agents to automate intellectual work, specifically building agents that automate parts
OpenAI's Codex 3.0 becomes an autonomous cross-app coding agent with GPT-5.5
OpenAI's Codex 3.0, powered by GPT-5.5, has evolved into a cross-app coding agent that can autonomously navigate browsers, interact with web
AGENTS.md: An Open Format for Guiding AI Coding Agents in Open-Source Projects
AGENTS.md is a simple, open format for guiding AI coding agents, functioning as a README specifically designed for agents rather than humans
OpenAI Updates Agents SDK with Codex-Style Harness and Enhanced Sandboxing
OpenAI's Build Hour session, led by engineer Steve Corley, introduced key updates to the Agents SDK, including a new "Codex-style harness" t
Scorecard CEO warns of AI agent dangers in high-stakes domains, offers evaluation platform
Darius, CEO of Scorecard, shares a cautionary tale about building AI agents in high-stakes domains. He describes how his EMR agent for docto
