
Systematic Evaluation of Deep Learning Optimizers Reveals Limited Speedup Over AdamW in Language Model Pretraining

By fzliu

8mo ago · 2 min read · en · Insight

Summary

This research paper systematically evaluates ten deep learning optimizers for language model pretraining and challenges earlier claims of 1.4x to 2x speedups over AdamW. It identifies two methodological flaws in prior comparisons: unequal hyperparameter tuning and limited or misleading evaluation setups. Testing across four model scales (0.1B to 1.2B parameters) and a range of data-to-model ratios, the researchers find that speedups over a well-tuned baseline are lower than claimed and shrink as models grow, falling to only 1.1x for 1.2B-parameter models. Matrix-based optimizers such as Muon and Soap are the fastest, but their advantage narrows from 1.4x at 0.1B parameters to 1.1x at 1.2B.
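To make a figure like "1.1x" concrete, here is a minimal sketch of one way a speedup could be computed. The summary does not define the metric, so this assumes speedup means the ratio of training tokens the baseline needs to reach a target loss to the tokens the candidate optimizer needs. The function names and the loss curves are hypothetical, made up for illustration, and are not results from the paper.

```python
def tokens_to_reach(loss_curve, target_loss):
    """Return the first token count at which the curve reaches target_loss, else None.

    loss_curve is a list of (tokens_seen, loss) pairs sorted by tokens_seen.
    """
    for tokens, loss in loss_curve:
        if loss <= target_loss:
            return tokens
    return None


def speedup(baseline_curve, candidate_curve, target_loss):
    """Assumed definition: baseline tokens divided by candidate tokens at equal loss."""
    base = tokens_to_reach(baseline_curve, target_loss)
    cand = tokens_to_reach(candidate_curve, target_loss)
    if base is None or cand is None:
        return None  # one of the runs never reached the target
    return base / cand


# Made-up curves for illustration only (not data from the paper).
adamw_curve = [(1_000, 4.0), (2_000, 3.6), (3_000, 3.4), (4_000, 3.3)]
other_curve = [(1_000, 3.9), (2_000, 3.5), (3_000, 3.3), (4_000, 3.25)]

ratio = speedup(adamw_curve, other_curve, target_loss=3.3)
unreached = speedup(adamw_curve, other_curve, target_loss=3.0)
```

Under this definition, the comparison is only as fair as the baseline curve: a poorly tuned AdamW run takes longer to reach the target loss and inflates the ratio, which is the first flaw the paper names.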

Key quotes

AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup
We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups
The actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1x for 1.2B parameter models
All the fastest optimizers such as Muon and Soap, use matrices as preconditioners -- multiplying gradients with matrices rather than entry-wise scalars
The speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models
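The contrast drawn in the quotes between entry-wise scalars and matrix preconditioners can be sketched on a toy 2x2 gradient. This is not an implementation of AdamW, Muon, or Soap. It only shows the shape of the two operations, and every name and number in it is a hypothetical chosen for illustration.

```python
def entrywise_precondition(grad, scale):
    """Scale each gradient entry by its own scalar; entries never mix."""
    return [[g * s for g, s in zip(g_row, s_row)]
            for g_row, s_row in zip(grad, scale)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matrix_precondition(left, grad):
    """Multiply the gradient by a matrix; each output entry mixes a whole column."""
    return matmul(left, grad)


# Toy values, assumed for illustration only.
grad = [[1.0, 2.0], [3.0, 4.0]]
scale = [[0.5, 0.5], [0.1, 0.1]]   # one scalar per gradient entry
left = [[0.5, 0.1], [0.1, 0.5]]    # a single matrix applied to the whole gradient

entrywise = entrywise_precondition(grad, scale)
matrix = matrix_precondition(left, grad)
```

In the entry-wise case, each output entry depends only on the gradient entry in the same position. In the matrix case, the off-diagonal terms of `left` let one row of the gradient influence another, which is the extra structure the matrix-based optimizers exploit.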
