
Codex iteratively optimized AGENTS.md 8 times against real PRs, but the best version regressed on a clean holdout

By

Ben Redmond

4d ago · 11 min read · en · Insight

Summary

The author describes using Codex (an AI coding agent) to iteratively optimize their AGENTS.md file (a configuration file that guides AI agent behavior) against a benchmark of real pull requests from their Stet repository. After 8 iterations, the best-performing version improved performance on the training data but regressed on a clean holdout set, meaning it wasn't safe to deploy. The article explores the tension between vibe-coded configurations and data-driven optimization, the risks of overfitting agent instructions, and the importance of rigorous evaluation for AI agent behavior files.
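The ship/no-ship decision described in the summary can be sketched as a small gate: a candidate AGENTS.md is accepted only if it improves the training slice and does not regress on the holdout. This is a minimal illustration, not the author's actual harness. The names `Scores` and `safe_to_ship`, the `tolerance` parameter, and all score values are hypothetical placeholders and do not come from the article.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Scores:
    """Per-task benchmark scores (hypothetical scale: 0.0 = fail, 1.0 = pass)."""
    train: list[float]    # tasks the optimizer was allowed to see
    holdout: list[float]  # clean tasks never used during optimization


def safe_to_ship(baseline: Scores, candidate: Scores, tolerance: float = 0.0) -> bool:
    """Accept a candidate only if it helps on train and does not regress on holdout.

    `tolerance` is an assumed knob: how much holdout regression is acceptable.
    """
    train_gain = mean(candidate.train) - mean(baseline.train)
    holdout_gain = mean(candidate.holdout) - mean(baseline.holdout)
    return train_gain > 0 and holdout_gain >= -tolerance


# Made-up numbers that mimic the pattern in the summary:
# better on the training slice, worse on the holdout.
baseline = Scores(train=[0.5, 0.5, 0.5, 0.5], holdout=[0.5, 0.5])
candidate = Scores(train=[0.75, 0.75, 0.5, 0.5], holdout=[0.25, 0.5])

print(safe_to_ship(baseline, candidate))  # False: the holdout mean dropped
```

Checking the holdout separately is what catches a candidate that has merely overfit to the tasks it was tuned on; a gate that looked at the training slice alone would have accepted this candidate.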

Key quotes

I vibe-coded my AGENTS.md, and I'm pretty sure it's slop.
Codex used a benchmark on my repo to measure each change, and optimized AGENTS.md against the data, instead of on pure vibes.
Someone adds a rule that sounds smart, senior, and reasonable, commits it, and hopes the agent behaves better.
The best candidate improved the training slice, then regressed enough on a clean holdout that it was not safe to ship.
Snippet from the RSS feed
Codex optimized its own AGENTS.md against real Stet repo tasks. The best candidate improved the training slice, then regressed enough on a clean holdout that it was not safe to ship.
