Red Queen Gödel Machine: An Evolutionary Framework for Self-Improving AI with Dynamic Evaluation

[Submitted on 24 Jun 2026]

Summary

This paper introduces the Red Queen Gödel Machine (RQGM), an evolutionary framework for recursive self-improvement of AI agents under non-stationary evaluation criteria. Unlike prior self-improving agents, which assume a fixed benchmark or verifier, RQGM lets the evaluation utility evolve alongside the agent. Search is organized into epochs: the criteria stay fixed within an epoch, and the objectives are updated at epoch boundaries. The framework is tested across three domains: (1) verifiable coding tasks, where adding a complementary agent-as-a-judge code-review signal improves test pass rates over the prior SOTA while using 1.35x-1.72x fewer tokens; (2) scientific paper writing and reviewing, where co-evolved writers achieve 1.78x-1.86x higher acceptance rates under a diverse agent-as-a-judge panel and co-evolved graders reach 9% higher ground-truth accuracy; and (3) Olympiad-level proof writing and grading. Notably, the strongest baseline reviewer over-accepts AI-generated papers at up to 1.91x the human rate; RQGM corrects this bias with an adversarial objective that discovers reviewers equally stringent on AI and human work.
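The epoch structure can be pictured with a toy loop. The sketch below is a minimal illustration under stated assumptions, not the RQGM algorithm: an "agent" is a single float, the utility rewards closeness to a target, selection is greedy over an archive, and the hypothetical `target += 1.0` rule stands in for whatever process the paper uses to update objectives at epoch boundaries. The only property it is meant to show is that the utility is frozen within an epoch and changed between epochs.

```python
import random
from typing import Callable, List

# Hypothetical toy setting: an "agent" is a single float parameter and a
# utility scores it. None of these names come from the RQGM paper.
Agent = float
Utility = Callable[[Agent], float]


def make_utility(target: float) -> Utility:
    """Hypothetical utility: higher when the agent is closer to `target`."""
    return lambda agent: -abs(agent - target)


def run_epochs(num_epochs: int, steps_per_epoch: int, seed: int = 0) -> List[Agent]:
    """Evolve an archive of agents while the utility changes only between epochs."""
    rng = random.Random(seed)
    archive: List[Agent] = [0.0]
    target = 1.0
    best_per_epoch: List[Agent] = []
    for _ in range(num_epochs):
        # Freeze the evaluation criterion for the whole epoch.
        utility = make_utility(target)
        for _ in range(steps_per_epoch):
            # Select under the current utility; the whole archive is re-scored,
            # so agents from earlier epochs can become parents again.
            parent = max(archive, key=utility)
            # Gaussian mutation as a stand-in for self-modification.
            child = parent + rng.gauss(0.0, 0.5)
            # Keep every variant, as in an open-ended archive.
            archive.append(child)
        best_per_epoch.append(max(archive, key=utility))
        # Epoch boundary: hypothetical rule that moves the goalpost, so the
        # next epoch is scored by a different utility ("Red Queen" pressure).
        target += 1.0
    return best_per_epoch
```

For example, `run_epochs(num_epochs=3, steps_per_epoch=50)` returns the best agent found in each of three epochs, each judged against that epoch's target. Keeping the utility fixed inside an epoch keeps scores comparable among that epoch's candidates, while the boundary update is what makes the overall objective non-stationary.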

Source

Red Queen Gödel Machine: An Evolutionary Framework for Self-Improving AI with Dynamic Evaluation (arxiv.org)

Key quotes

We aim to bring the same principle to recursive self-improvement, making evaluation part of the improvement loop and opening search to evolving evaluators, adversarial objectives, and dynamic utilities that may surpass static benchmarks.

The RQGM improves test pass rate over the prior SOTA by adding a complementary agent-as-a-judge code-review signal. This signal is cheaper and the RQGM uses 1.35x-1.72x fewer tokens.

Co-evolved writers reach 1.78x-1.86x higher acceptance rates under a diverse agent-as-a-judge panel, while co-evolved graders reach 9% higher ground-truth accuracy.

The strongest baseline reviewer over-accepts AI-generated papers at up to 1.91x the human rate. The RQGM corrects this by introducing an adversarial objective that discovers reviewers equally stringent on AI and human work.
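The bias in the last quote can be quantified as the ratio of a reviewer's acceptance rate on AI-generated papers to its rate on human papers; the quoted 1.91x presumably corresponds to a ratio of 1.91. Below is a minimal sketch, assuming binary accept/reject decisions, of that ratio together with a hypothetical fitness term that rewards accuracy and penalizes any departure from parity. The names `over_acceptance_ratio` and `parity_fitness`, and the linear penalty, are illustrative assumptions, not the paper's adversarial objective.

```python
from typing import Sequence


def acceptance_rate(decisions: Sequence[bool]) -> float:
    """Fraction of papers a reviewer accepted (True = accept)."""
    if not decisions:
        raise ValueError("need at least one decision")
    return sum(decisions) / len(decisions)


def over_acceptance_ratio(ai: Sequence[bool], human: Sequence[bool]) -> float:
    """How many times more often AI-written papers are accepted than human ones."""
    human_rate = acceptance_rate(human)
    if human_rate == 0:
        raise ValueError("human acceptance rate is zero; ratio undefined")
    return acceptance_rate(ai) / human_rate


def parity_fitness(accuracy: float, ratio: float, weight: float = 1.0) -> float:
    """Hypothetical fitness: reward accuracy, penalize any gap from parity (ratio 1.0)."""
    return accuracy - weight * abs(ratio - 1.0)
```

For example, a reviewer that accepts 4 of 10 AI-written papers and 2 of 10 human ones has a ratio of 2.0, so at equal accuracy it scores lower than a reviewer with a ratio of 1.0. Penalizing the absolute gap means that under-accepting AI work is discouraged just as much as over-accepting it.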

Snippet from the RSS feed

Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier, benchmark, or labeled data…

You might also wanna read

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

arxiv.org·1y ago

The Evolution of AI: From Static Benchmarks to Inference-Time Search for Autonomous Agents

The article explores the shift from traditional AI benchmarking to inference-time search as the future of AI development. It discusses how…

adlrocha.substack.com·5mo ago

Survey of Self-Evolving AI Agents: Bridging Foundation Models and Lifelong Adaptability

The article surveys the emerging field of self-evolving AI agents, which aim to bridge the static capabilities of foundation models with the…

arxiv.org·10mo ago

Google DeepMind's Aletheia: An Autonomous AI System for Mathematical Research and Proof Generation

Google DeepMind researchers introduce Aletheia, an autonomous mathematics research agent that can generate, verify, and revise mathematical…

arxiv.org·4mo ago

Turing-RL: A Reinforcement Learning Approach for Training User Simulators Using Turing Test Rewards

This paper introduces Turing-RL, a novel reinforcement learning approach for training user simulator models that can mimic human users in…

arxiv.org·9d ago

R-Zero: A Self-Evolving LLM Framework That Generates Its Own Training Data Without Human Input

R-Zero is a fully autonomous framework for training self-evolving Large Language Models (LLMs) that generates its own training data from scratch…

arxiv.org·9mo ago