Systematic Evaluation of Deep Learning Optimizers Reveals Limited Speedup Over AdamW in Language Model Pretraining
By fzliu
A good honest bake. Not flashy, but you'll finish the whole bagel.
Summary
This paper systematically evaluates ten deep learning optimizers for language model pretraining and challenges earlier claims of 1.4x to 2x speedups over AdamW. It identifies two methodological flaws in prior comparisons: unequal hyperparameter tuning and limited or misleading evaluation setups. Testing across four model scales (0.1B to 1.2B parameters) and multiple data-to-model ratios, the authors find that speedups over a well-tuned baseline are lower than claimed and shrink with model size, falling to only 1.1x for 1.2B-parameter models. Matrix-based optimizers such as Muon and Soap perform best, but their advantage drops from 1.4x at 0.1B parameters to 1.1x at 1.2B.
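To put the speedup figures in perspective, here is a small sketch that converts a speedup factor into the share of training budget saved. It assumes speedup means the baseline's budget divided by the new optimizer's budget for reaching the same loss, which is a common reading but not a definition quoted from the paper; the function name `budget_saved` is made up for illustration.

```python
def budget_saved(speedup: float) -> float:
    """Fraction of the baseline budget saved at a given speedup.

    Assumption: speedup = baseline_budget / optimizer_budget at equal loss.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# The claimed range (2x, 1.4x) versus the 1.1x reported for 1.2B models.
for s in (2.0, 1.4, 1.1):
    print(f"{s:.1f}x speedup -> {budget_saved(s):.0%} of the budget saved")
```

Under that assumption, a 2x speedup halves the budget, 1.4x saves roughly 29%, and 1.1x saves roughly 9%.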
Key quotes
AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup
We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups
The actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1x for 1.2B parameter models
All the fastest optimizers such as Muon and Soap, use matrices as preconditioners -- multiplying gradients with matrices rather than entry-wise scalars
The speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models
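The distinction between entry-wise scalars and matrix preconditioners can be illustrated with a toy sketch. This is not an implementation of Muon, Soap, or AdamW; the helper names and the hypothetical 2x2 values are chosen only to show that entry-wise scaling treats each gradient entry independently, while a matrix multiply mixes information across rows.

```python
def entrywise_precondition(grad, scale):
    """Entry-wise idea: each gradient entry gets its own scalar multiplier."""
    return [[g * s for g, s in zip(g_row, s_row)]
            for g_row, s_row in zip(grad, scale)]


def matrix_precondition(precond, grad):
    """Matrix idea: left-multiply the gradient by a preconditioner matrix."""
    rows, inner, cols = len(precond), len(grad), len(grad[0])
    return [[sum(precond[i][k] * grad[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]


grad = [[1.0, 0.0],
        [0.0, 0.0]]      # gradient with a single non-zero entry

scale = [[0.5, 0.5],
         [0.5, 0.5]]     # hypothetical per-entry scalars
precond = [[0.5, 0.5],
           [0.5, 0.5]]   # hypothetical 2x2 preconditioner

print(entrywise_precondition(grad, scale))  # zero entries stay zero
print(matrix_precondition(precond, grad))   # second row becomes non-zero
```

With the same numbers in `scale` and `precond`, the entry-wise version leaves every zero entry at zero, whereas the matrix version spreads the single non-zero gradient entry into the second row.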
