
Study Reveals Convergent Evolution in How Language Models Learn Number Representations

By Anon84

1mo ago · 2 min read · en · Insight

Summary

This research paper investigates how different language models (Transformers, Linear RNNs, LSTMs, and classical word embeddings) learn to represent numbers using periodic features with dominant periods at T=2, 5, and 10. The authors identify a two-tiered hierarchy: while all models learn features with period-T spikes in the Fourier domain, only some learn geometrically separable features for mod-T classification. They prove that Fourier domain sparsity is necessary but not sufficient for geometric separability. The study finds that data, architecture, optimizer, and tokenizer all influence whether models acquire geometrically separable features, which can come from complementary co-occurrence signals in language data or from multi-token addition problems. The paper highlights convergent evolution in feature learning across diverse model architectures.
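The gap between a period-T Fourier spike and mod-T separability can be illustrated with a toy sketch. This is not the paper's construction or proof: the synthetic features, the helper names `dominant_period` and `mod_t_accuracy`, and the nearest-centroid rule standing in for a linear probe are all assumptions made for illustration.

```python
import numpy as np

def dominant_period(features):
    """Period of the strongest non-DC Fourier component along the number axis."""
    n = features.shape[0]
    centered = features - features.mean(axis=0)
    # Power per frequency bin, summed over feature dimensions.
    power = (np.abs(np.fft.rfft(centered, axis=0)) ** 2).sum(axis=1)
    k = int(np.argmax(power[1:])) + 1  # skip the DC bin
    return n / k

def mod_t_accuracy(features, T):
    """Accuracy of a linear nearest-centroid classifier predicting n mod T."""
    labels = np.arange(features.shape[0]) % T
    centroids = np.stack([features[labels == r].mean(axis=0) for r in range(T)])
    # Linear scores w_r . x + b_r with w_r = centroid_r, b_r = -|centroid_r|^2 / 2.
    scores = features @ centroids.T - 0.5 * (centroids ** 2).sum(axis=1)
    return float((scores.argmax(axis=1) == labels).mean())

N, T = 100, 10  # hypothetical: numbers 0..99, period 10
n = np.arange(N)
angle = 2 * np.pi * n / T

# Rounding makes mathematically equal values compare exactly equal.
cos_only = np.round(np.cos(angle), 12)[:, None]                        # 1-D feature
cos_sin = np.round(np.stack([np.cos(angle), np.sin(angle)], axis=1), 12)  # 2-D feature

for name, feats in [("cos only", cos_only), ("cos + sin", cos_sin)]:
    print(name, dominant_period(feats), mod_t_accuracy(feats, T))
```

Both toy features show a dominant period of 10, yet only the two-dimensional one classifies every residue correctly. The cosine-only feature maps residues r and 10 - r to the same value, so no classifier can tell them apart even though the spectrum looks equally sparse.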

Key quotes (3 pulled)
Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$
Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability
Our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals
Snippet from the RSS feed
Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical wo
