Minimal Transformer Circuits Achieve Perfect Indirect Object Identification with Only Two Attention Heads

[Submitted on 28 Oct 2025 (v1), last revised 29 Jun 2026 (this version, v2)]



Summary

This paper studies the mechanistic interpretability of transformers by training small attention-only models from scratch on a symbolic Indirect Object Identification (IOI) task. The authors find that a single-layer model with just two attention heads achieves perfect IOI accuracy without MLPs or normalization layers. Using residual stream decomposition, spectral analysis, and embedding interventions, they find that the two heads specialize into additive and contrastive subcircuits that jointly resolve the task. A two-layer, one-head model instead composes information across layers, primarily through query-key interactions. The work argues that task-specific training induces highly interpretable, minimal circuits that can serve as a controlled testbed for studying transformer reasoning.
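To make the architecture in the summary concrete, here is a minimal sketch of a one-layer, two-head attention-only forward pass with no MLP and no normalization. Because every step after the attention weights is linear, the final-position logits split exactly into a direct embedding term plus one term per head, which is the basic idea behind residual stream decomposition. This is not the authors' code: the prompt encoding (`[name, name, repeated subject]`), the learned positional embeddings, the dimensions, and all function names are assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def make_ioi_example(n_names, rng):
    """Hypothetical symbolic IOI prompt: two distinct names in random order,
    then the subject repeated. The target is the non-repeated name (the IO)."""
    io, s = rng.sample(range(n_names), 2)
    first_two = [io, s]
    rng.shuffle(first_two)
    return first_two + [s], io


def init_params(n_names, d_model, d_head, n_heads, rng):
    """Random (untrained) weights; shapes are illustrative assumptions."""
    return {
        "embed": rand_matrix(n_names, d_model, rng),
        "pos": rand_matrix(3, d_model, rng),
        "heads": [
            (
                rand_matrix(d_head, d_model, rng),  # W_q
                rand_matrix(d_head, d_model, rng),  # W_k
                rand_matrix(d_head, d_model, rng),  # W_v
                rand_matrix(d_model, d_head, rng),  # W_o
            )
            for _ in range(n_heads)
        ],
        "unembed": rand_matrix(n_names, d_model, rng),
    }


def head_output(x, W_q, W_k, W_v, W_o):
    """One head's write to the residual stream at the final position.
    The final position may attend to every position under a causal mask."""
    q = matvec(W_q, x[-1])
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, matvec(W_k, xi))) / scale for xi in x]
    attn = softmax(scores)
    values = [matvec(W_v, xi) for xi in x]
    mixed = [sum(a * v[j] for a, v in zip(attn, values)) for j in range(len(values[0]))]
    return matvec(W_o, mixed), attn


def forward(tokens, params):
    """Return final-position logits and the per-component logit contributions."""
    x = [
        [e + p for e, p in zip(params["embed"][t], params["pos"][i])]
        for i, t in enumerate(tokens)
    ]
    components = {"direct": x[-1]}
    for h, head in enumerate(params["heads"]):
        components[f"head{h}"], _ = head_output(x, *head)
    # Residual stream = embedding + sum of head outputs (no MLP, no LayerNorm).
    residual = [sum(vals) for vals in zip(*components.values())]
    logits = matvec(params["unembed"], residual)
    # The unembedding is linear, so each component has its own logit share.
    contribs = {name: matvec(params["unembed"], vec) for name, vec in components.items()}
    return logits, contribs
```

In a trained model, comparing `contribs["head0"]` and `contribs["head1"]` on the indirect-object and subject logits is one way to see which head pushes the correct name up and which pushes the repeated name down; with the random weights here, the numbers carry no such meaning.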

Source

Minimal Transformer Circuits Achieve Perfect Indirect Object Identification with Only Two Attention Heads (arxiv.org)

Key quotes

Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers.

Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution.

These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.

Snippet from the RSS feed

Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks.

You might also wanna read

Reverse-Engineering Transformer Attention Heads Using Program Synthesis

This paper proposes a scalable pipeline for reverse-engineering attention heads in transformer language models by approximating their behavi

arxiv.org·4d ago

Understanding Transformer Circuits: A Mechanistic Interpretability Perspective

This article explores mechanistic interpretability of transformer neural networks, focusing on understanding how transformers work mathemati

connorjdavis.com·3mo ago

Research Proves Transformer Language Models Are Injective and Invertible

This research paper challenges the conventional view that transformer language models are non-injective due to non-linear components. The au

arxiv.org·8mo ago

New Method Enables Constant-Cost Self-Attention Computation for Transformers

Researchers present a novel mathematical approach to compute self-attention in Transformer AI models with constant cost per token, rather th

arxiv.org·5mo ago

Research: 224× Compression of Llama-70B Achieved with Improved Accuracy Through Meaning Field Extraction

This research paper introduces a novel method for eliminating transformers from inference while maintaining or improving accuracy. The appro

zenodo.org·6mo ago

Systematic Study Shows Transformers Can Drop One or More QKV Projections Without Quality Loss

This research paper systematically evaluates whether Transformers need all three QKV (query, key, value) projections in attention mechanisms

arxiv.org·1mo ago
