
Red Queen Gödel Machine: An Evolutionary Framework for Self-Improving AI with Dynamic Evaluation

[Submitted on 24 Jun 2026]

Summary

This paper introduces the Red Queen Gödel Machine (RQGM), an evolutionary framework for recursive self-improvement of AI agents under non-stationary evaluation criteria. Unlike prior self-improving agents that assume fixed benchmarks or verifiers, RQGM allows the evaluation utility to evolve alongside the agent, organized into epochs with fixed within-epoch criteria and updated objectives at epoch boundaries. The framework is tested across three domains: (1) verifiable coding tasks, where it improves test pass rates over prior SOTA while using fewer tokens; (2) scientific paper writing and reviewing, where co-evolved writers achieve 1.78x-1.86x higher acceptance rates and co-evolved graders reach 9% higher accuracy; and (3) Olympiad-level proof writing and grading. Notably, RQGM corrects a bias in baseline reviewers that over-accept AI-generated papers by introducing adversarial objectives that enforce equal stringency on AI and human work.
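The epoch structure described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's algorithm: agents are plain floats, the utility is distance to a moving target, and the names `make_utility`, `run_epoch`, and `evolve` are hypothetical. It only shows the control flow the summary describes: the criterion stays fixed inside an epoch and is updated at the epoch boundary.

```python
import random
from typing import Callable, List, Tuple

Agent = float  # hypothetical stand-in; a real agent would be code or a prompt
Utility = Callable[[Agent], float]


def make_utility(target: float) -> Utility:
    """Toy evaluation criterion: higher score the closer an agent is to target."""
    return lambda agent: -abs(agent - target)


def run_epoch(population: List[Agent], utility: Utility,
              steps: int, rng: random.Random) -> List[Agent]:
    """Evolve the population under a utility that is fixed for the whole epoch."""
    for _ in range(steps):
        # Pick the better of two randomly sampled agents as the parent.
        parent = max(rng.sample(population, k=2), key=utility)
        child = parent + rng.gauss(0.0, 0.5)  # toy mutation
        worst = min(range(len(population)), key=lambda i: utility(population[i]))
        # Replace the worst agent only if the child scores higher.
        if utility(child) > utility(population[worst]):
            population[worst] = child
    return population


def evolve(epochs: int = 3, steps: int = 200,
           seed: int = 0) -> List[Tuple[float, Agent]]:
    """Return one (target, best agent) pair per epoch."""
    rng = random.Random(seed)
    population = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    target = 1.0  # assumed starting criterion
    history = []
    for _ in range(epochs):
        utility = make_utility(target)  # fixed within the epoch
        population = run_epoch(population, utility, steps, rng)
        best = max(population, key=utility)
        history.append((target, best))
        # Epoch boundary: the criterion moves relative to the current best,
        # so the agents have to keep improving to keep their score.
        target = best + 1.0
    return history
```

Calling `evolve()` returns the target and the best agent for each epoch. In the paper the evaluators themselves are evolved; here the boundary update is a fixed rule chosen purely for illustration.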

Source

Twitter / X · Red Queen Gödel Machine: An Evolutionary Framework for Self-Improving AI with Dynamic Evaluation · arxiv.org

Key quotes

We aim to bring the same principle to recursive self-improvement, making evaluation part of the improvement loop and opening search to evolving evaluators, adversarial objectives, and dynamic utilities that may surpass static benchmarks.
The RQGM improves test pass rate over the prior SOTA by adding a complementary agent-as-a-judge code-review signal. This signal is cheaper and the RQGM uses 1.35x-1.72x fewer tokens.
Co-evolved writers reach 1.78x-1.86x higher acceptance rates under a diverse agent-as-a-judge panel, while co-evolved graders reach 9% higher ground-truth accuracy.
The strongest baseline reviewer over-accepts AI-generated papers at up to 1.91x the human rate. The RQGM corrects this by introducing an adversarial objective that discovers reviewers equally stringent on AI and human work.
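The over-acceptance figure in the last quote is a ratio of acceptance rates. The sketch below shows one way such a ratio, and a penalty for deviating from equal stringency, could be computed. The function names and the sample decisions are hypothetical and are not taken from the paper; the paper's actual adversarial objective is not specified in this summary. The ratio is only meaningful if the AI and human paper pools are of comparable quality, which is assumed here.

```python
from typing import Sequence


def acceptance_rate(decisions: Sequence[bool]) -> float:
    """Fraction of papers accepted (True = accept)."""
    if not decisions:
        raise ValueError("need at least one decision")
    return sum(decisions) / len(decisions)


def over_acceptance_ratio(ai_decisions: Sequence[bool],
                          human_decisions: Sequence[bool]) -> float:
    """AI acceptance rate divided by human acceptance rate (1.0 = parity)."""
    human_rate = acceptance_rate(human_decisions)
    if human_rate == 0:
        raise ValueError("human acceptance rate is zero; ratio undefined")
    return acceptance_rate(ai_decisions) / human_rate


def stringency_gap(ai_decisions: Sequence[bool],
                   human_decisions: Sequence[bool]) -> float:
    """Hypothetical penalty: distance of the ratio from parity."""
    return abs(over_acceptance_ratio(ai_decisions, human_decisions) - 1.0)


# Made-up decisions for illustration only: 6 of 10 AI papers accepted,
# 3 of 10 human papers accepted.
ai = [True] * 6 + [False] * 4
human = [True] * 3 + [False] * 7
ratio = over_acceptance_ratio(ai, human)
gap = stringency_gap(ai, human)
```

With these made-up inputs the reviewer accepts AI papers at twice the human rate, so a search that minimizes `stringency_gap` would favor reviewers whose ratio is closer to 1.0.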

Snippet from the RSS feed

Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier, benchmark, or labeled da
