Systematic Evaluation of Deep Learning Optimizers Reveals Limited Speedup Over AdamW in Language Model Pretraining

AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4x to 2x speedups. We posit that two methodological shortcomings…

fzliu · 10mo ago · 2 min read · en · Insight

technology science machine learning research methods
