Theoretical Analysis Reveals Why Linear RNNs Are More Parallelizable Than Nonlinear RNNs
By
[Submitted on 4 Mar 2026 (v1), last revised 1 Jun 2026 (this version, v3)]
Lightly toasted, lightly seasoned, mostly correct.
Summary
This paper establishes a theoretical connection between types of RNNs and standard complexity classes to explain why linear RNNs (LRNNs) are more parallelizable than traditional nonlinear RNNs. The authors show that LRNNs can be viewed as log-depth arithmetic circuits (slight depth overhead vs. transformers' boolean circuits), while nonlinear RNNs can solve L-complete and even P-complete problems, creating a fundamental barrier to parallelization. The theory also identifies fine-grained expressivity differences between LRNN variants (permutation-diagonal vs. diagonal-plus-low-rank) and associates each RNN type with corresponding automata-theoretic models. The work aims to provide a foundation for designing LLM architectures that balance expressivity and parallelism.
Key quotes
· 4 pulledWe show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit.
We show that nonlinear RNNs can solve L-complete problems (and even P-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers.
Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are NC1-complete whereas diagonal-plus-low-rank LRNNs are more expressive (PNC1-complete).
Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.
You might also wanna read
Study Reveals Convergent Evolution in How Language Models Learn Number Representations
This research paper investigates how different language models (Transformers, Linear RNNs, LSTMs, and classical word embeddings) learn to re
GPU Programming Project: Implementing Parallelizable RNNs with CUDA
A student's final project for CS179: GPU Programming implementing the paper "Were RNNs All We Needed?" by Feng et al. The project focuses on
dhruvmsheth.github.io·8mo agoThe surprising structural similarities between neural networks and cryptographic ciphers
This article explores the structural and algorithmic similarities between neural network architectures (like recurrent neural networks) and
Transformers Can Learn to Predict Permuted Congruential Generator Sequences Through Curriculum Learning and Scaling Laws
This research paper investigates whether Transformer models can learn to predict sequences generated by Permuted Congruential Generators (PC
Understanding Continuous Batching in Large Language Models: From Attention Mechanisms to Throughput Optimization
This technical blog post explains continuous batching in large language models (LLMs) by starting from first principles of attention mechani
Research Directions for Overcoming Memory and Interconnect Challenges in Large Language Model Inference Hardware
This article discusses the technical challenges of Large Language Model (LLM) inference, highlighting how the autoregressive Decode phase ma
