Transformers Can Learn to Predict Permuted Congruential Generator Sequences Through Curriculum Learning and Scaling Laws
[Submitted on 30 Oct 2025 (v1), last revised 16 Feb 2026 (this version, v2)]
Properly proved. Has structure, has flavour, has a point.
Summary
This paper investigates whether Transformer models can learn to predict sequences generated by Permuted Congruential Generators (PCGs), a family of pseudorandom number generators (PRNGs) that apply an output permutation to the state of a linear congruential generator (LCG) and are therefore harder to predict than plain LCGs. The authors show that Transformers can perform in-context prediction on unseen sequences from diverse PCG variants, including tasks beyond published classical attacks. Key findings: (1) the output remains reliably predictable even when truncated to a single bit; (2) a single model can jointly learn multiple distinct PRNGs; (3) the number of in-context sequence elements needed for near-perfect prediction grows as √m, where m is the modulus; (4) in the authors' experiments, learning large moduli (m ≥ 2^20) requires curriculum learning that includes training data from smaller moduli; and (5) in the embedding layers, the top principal components group integer inputs into bitwise rotationally-invariant clusters, which shows how representations can transfer from smaller to larger moduli.
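To make the LCG-versus-PCG distinction concrete, here is a minimal Python sketch of one well-known PCG output permutation, XSH-RR (xorshift high bits, then a state-dependent rotation), layered on a 64-bit LCG. The summary does not say which PCG variants, bit widths, or constants the authors train on, so everything here is an illustrative assumption: the multiplier is the one used in the public PCG32 reference design, the function names are hypothetical, and the reference seeding routine is omitted.

```python
# Illustrative sketch only: a 64-bit LCG with an XSH-RR output permutation.
# Names and parameters are assumptions, not taken from the paper.

MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1
PCG_MULT = 6364136223846793005  # multiplier from the public PCG32 reference design


def rotr32(x: int, r: int) -> int:
    """Rotate a 32-bit value right by r bits."""
    r &= 31
    return ((x >> r) | (x << (32 - r))) & MASK32


def lcg_step(state: int, inc: int) -> int:
    """Advance the hidden LCG state modulo 2**64."""
    return (state * PCG_MULT + inc) & MASK64


def xsh_rr(state: int) -> int:
    """Permute a 64-bit state into a 32-bit output (xorshift, then rotate)."""
    xorshifted = (((state >> 18) ^ state) >> 27) & MASK32
    rot = state >> 59  # top 5 bits of the state choose the rotation amount
    return rotr32(xorshifted, rot)


def pcg_sequence(seed: int, inc: int, n: int) -> list[int]:
    """Return n permuted outputs; the raw LCG state is never exposed."""
    state = seed & MASK64
    inc = (inc | 1) & MASK64  # the increment must be odd
    outputs = []
    for _ in range(n):
        outputs.append(xsh_rr(state))
        state = lcg_step(state, inc)
    return outputs


def top_bit(outputs: list[int]) -> list[int]:
    """Truncate each 32-bit output to its single highest bit."""
    return [x >> 31 for x in outputs]
```

A predictor in this setting sees only values like `pcg_sequence(seed, inc, n)` (or, in the single-bit case, something like `top_bit(...)`) and has to guess the next element without access to `state`.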
Key quotes
We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks.
Surprisingly, we find even when the output is truncated to a single bit, it can be reliably predicted by the model.
We demonstrate a scaling law with modulus m: the number of in-context sequence elements required for near-perfect prediction grows as √m.
For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli m ≥ 2^20 requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning.
We analyze embedding layers and uncover a novel clustering phenomenon: the top principal components spontaneously group the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
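As a rough illustration of what a "bitwise rotationally-invariant" grouping of integers could look like, the Python sketch below partitions all k-bit integers into classes whose members are cyclic bit rotations of one another. This is only one plausible reading of the phrase: the quotes do not define the clusters precisely, and the function names here are hypothetical.

```python
# Illustrative sketch only: group k-bit integers by cyclic bit rotation.
# This is an assumed interpretation, not the paper's analysis code.
from collections import defaultdict


def rotl(x: int, r: int, bits: int) -> int:
    """Rotate a bits-wide integer left by r positions."""
    r %= bits
    mask = (1 << bits) - 1
    return ((x << r) | (x >> (bits - r))) & mask


def canonical_rotation(x: int, bits: int) -> int:
    """Smallest value reachable from x by rotating its bits."""
    return min(rotl(x, r, bits) for r in range(bits))


def rotation_classes(bits: int) -> dict[int, list[int]]:
    """Map each canonical representative to all integers in its rotation class."""
    classes = defaultdict(list)
    for x in range(1 << bits):
        classes[canonical_rotation(x, bits)].append(x)
    return dict(classes)
```

For 4-bit inputs, `rotation_classes(4)` yields 6 classes; for example, 1, 2, 4, and 8 (binary 0001, 0010, 0100, 1000) share the representative 1. Under this reading, a rotation-invariant embedding would place all members of such a class close together.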
