Study Reveals Convergent Evolution in How Language Models Learn Number Representations
By
Anon84
Properly proved. Has structure, has flavour, has a point.
Summary
This research paper investigates how different language models (Transformers, Linear RNNs, LSTMs, and classical word embeddings) learn to represent numbers using periodic features with dominant periods at T=2, 5, and 10. The authors identify a two-tiered hierarchy: while all models learn features with period-T spikes in the Fourier domain, only some learn geometrically separable features for mod-T classification. They prove that Fourier domain sparsity is necessary but not sufficient for geometric separability. The study finds that data, architecture, optimizer, and tokenizer all influence whether models acquire geometrically separable features, which can come from complementary co-occurrence signals in language data or from multi-token addition problems. The paper highlights convergent evolution in feature learning across diverse model architectures.
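As a rough illustration of the gap between a Fourier spike and separability, here is a toy sketch that is not taken from the paper. It assumes a hypothetical one-dimensional feature, a single cosine with period 10 over the integers 0 to 99, and uses a naive DFT from the standard library. The feature has a clean spike at period 10, yet residues 1 and 9 (mod 10) receive the same value, so this feature alone cannot tell those two classes apart. The paper's formal definition of geometric separability may differ from this simplified picture.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Return |DFT| of a real sequence for frequencies 0..N//2 (naive O(N^2))."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        mags.append(abs(coeff))
    return mags

N = 100  # assumed toy range: integers 0..99
T = 10   # period of interest

# Hypothetical 1-D "embedding feature": a single cosine with period T.
feature = [math.cos(2 * math.pi * n / T) for n in range(N)]

# Locate the dominant non-zero frequency and convert it to a period.
mags = dft_magnitudes(feature)
peak_k = max(range(1, len(mags)), key=lambda k: mags[k])
peak_period = N / peak_k

# Residues 1 and 9 (mod 10) map to the same value under this feature,
# so a spectral spike at T does not by itself separate all mod-T classes.
collide = math.isclose(feature[1], feature[9])

print("dominant period:", peak_period)
print("residues 1 and 9 collide:", collide)
```

Adding a second, phase-shifted feature such as a sine with the same period would break the collision in this toy, which is one way to picture why spectral sparsity alone leaves room for both outcomes.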
Key quotes
Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$
Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability
Our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals