Minimal Transformer Circuits Achieve Perfect Indirect Object Identification with Only Two Attention Heads
By
[Submitted on 28 Oct 2025 (v1), last revised 29 Jun 2026 (this version, v2)]
Summary
This paper presents research on mechanistic interpretability of transformers, specifically training small attention-only models from scratch on a symbolic Indirect Object Identification (IOI) task. The authors find that a single-layer model with just two attention heads achieves perfect IOI accuracy without MLPs or normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, they discover the two heads specialize into additive and contrastive subcircuits. A two-layer, one-head model composes information across layers primarily through query-key interactions. The work demonstrates that task-specific training induces highly interpretable, minimal circuits for studying transformer reasoning.
Source
Key quotes
· 3 pulledSurprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers.
Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution.
These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
You might also wanna read
Reverse-Engineering Transformer Attention Heads Using Program Synthesis
This paper proposes a scalable pipeline for reverse-engineering attention heads in transformer language models by approximating their behavi
Understanding Transformer Circuits: A Mechanistic Interpretability Perspective
This article explores mechanistic interpretability of transformer neural networks, focusing on understanding how transformers work mathemati
Research Proves Transformer Language Models Are Injective and Invertible
This research paper challenges the conventional view that transformer language models are non-injective due to non-linear components. The au
New Method Enables Constant-Cost Self-Attention Computation for Transformers
Researchers present a novel mathematical approach to compute self-attention in Transformer AI models with constant cost per token, rather th
Research: 224× Compression of Llama-70B Achieved with Improved Accuracy Through Meaning Field Extraction
This research paper introduces a novel method for eliminating transformers from inference while maintaining or improving accuracy. The appro
Systematic Study Shows Transformers Can Drop One or More QKV Projections Without Quality Loss
This research paper systematically evaluates whether Transformers need all three QKV (query, key, value) projections in attention mechanisms
Systematic Study Shows Transformers Can Drop One or More QKV Projections Without Quality Loss
This research paper systematically evaluates whether Transformers need all three QKV (query, key, value) projections in attention mechanisms

Comments
Sign in to join the conversation.
No comments yet. Be the first.