Scaling Karpathy's Autoresearch: Parallel GPU Processing Enables New AI Experimentation Strategies
By hopechong
Summary
The article describes an experiment in which researchers scaled Andrej Karpathy's autoresearch system by giving it 16 GPUs on a Kubernetes cluster instead of running one experiment at a time. Over 8 hours, the AI agent submitted roughly 910 experiments and found that scaling model width mattered more than any single hyperparameter. Without being told to, the agent learned to screen ideas on H100 GPUs and reserve H200 GPUs for validation. It lowered validation bits per byte (val_bpb) from 1.003 to 0.974, a 2.87% improvement over baseline. The key insight is that parallelism changed the agent's search strategy: with one GPU it was limited to greedy hill-climbing (try one change, check, repeat), while with 16 GPUs it ran factorial grids of 10-13 experiments per wave, which let it catch interactions between hyperparameters that sequential execution would miss.
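To make the difference between the two search strategies concrete, here is a minimal Python sketch. It is not the agent's code: the loss function, the option names, and the numbers are all made up for illustration, and the grid has only 4 configurations rather than the 10-13 per wave reported in the article. The toy loss is built so that a wider model and a higher learning rate each hurt slightly on their own but help when combined, which is the kind of interaction a one-at-a-time search cannot see.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
OPTIONS = {"width": ["base", "wide"], "lr": ["low", "high"]}


def toy_loss(width, lr):
    """Made-up loss (lower is better) with an interaction between width and lr."""
    loss = 1.000
    if width == "wide":
        loss += 0.010  # wider alone is slightly worse
    if lr == "high":
        loss += 0.010  # higher learning rate alone is slightly worse
    if width == "wide" and lr == "high":
        loss -= 0.050  # together they help: the interaction term
    return round(loss, 3)


def greedy(start):
    """Sequential hill-climbing: change one option, keep it only if it helps."""
    current = dict(start)
    improved = True
    while improved:
        improved = False
        for name, values in OPTIONS.items():
            for value in values:
                trial = {**current, name: value}
                if toy_loss(**trial) < toy_loss(**current):
                    current, improved = trial, True
    return current


def factorial_grid():
    """Evaluate every combination in one wave and keep the best."""
    configs = [dict(zip(OPTIONS, combo)) for combo in product(*OPTIONS.values())]
    return min(configs, key=lambda cfg: toy_loss(**cfg))


baseline = {"width": "base", "lr": "low"}
print("greedy:   ", greedy(baseline))
print("factorial:", factorial_grid())
```

In this toy setup the greedy search never leaves the baseline, because every single change looks worse in isolation, while the grid finds the combined setting. On real hardware each grid cell would be a separate training run, which is why the strategy only becomes practical when many GPUs are available at once.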
Key quotes
Over 8 hours it submitted ~910 experiments, found that scaling model width mattered more than any single hyperparameter
taught itself to use H200s for validation while screening ideas on H100s
drove val_bpb from 1.003 down to 0.974 - a 2.87% improvement over baseline
With one GPU, it's stuck doing greedy hill-climbing - try one thing, check, repeat
With 16 GPUs, it ran factorial grids of 10-13 experiments per wave, catching interactions
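The screen-then-validate pattern mentioned in the quotes (ideas screened on H100s, validation on H200s) can be pictured as a two-stage funnel. The sketch below is an assumption about the general shape of such a workflow, not a description of what the agent actually ran: the function name, the `top_k` cutoff, the candidate labels, and all scores are hypothetical.

```python
def screen_then_validate(candidates, screen, validate, top_k=2):
    """Rank all candidates with a cheap score, then re-check only the best few.

    `screen` and `validate` map a candidate to a val_bpb-like score
    (lower is better). In the article's setting these would correspond to
    runs on different GPU types; here they are plain lookups.
    """
    finalists = sorted(candidates, key=screen)[:top_k]
    validated = {name: validate(name) for name in finalists}
    best = min(validated, key=validated.get)
    return best, validated


# Made-up scores for four hypothetical ideas.
screen_scores = {"a": 0.990, "b": 0.985, "c": 0.995, "d": 0.987}
validate_scores = {"a": 0.988, "b": 0.986, "c": 0.994, "d": 0.981}

best, validated = screen_then_validate(
    list(screen_scores), screen_scores.get, validate_scores.get
)
print(best, validated)
```

The point of the second stage is that the screening winner is not always the validated winner: in these made-up numbers "b" screens best, but "d" comes out ahead once both finalists are re-evaluated.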
