Reverse-Engineering Transformer Attention Heads Using Program Synthesis
[Submitted on 17 Jun 2026 (v1), last revised 29 Jun 2026 (this version, v2)]
Summary
This paper proposes a scalable pipeline for reverse-engineering attention heads in transformer language models by approximating their behavior with executable Python programs. The approach computes attention matrices from training examples, uses a pre-trained language model to generate candidate programs that reproduce attention patterns, and re-ranks them based on predictive accuracy on held-out inputs. The method achieves over 75% average Intersection-over-Union similarity on TinyStories across GPT-2, TinyLlama-1.1B, and Llama-3B models. Replacing 25% of attention heads with programmatic surrogates causes only a 16% average perplexity increase while maintaining downstream QA performance. The work advances symbolic transparency in neural models by producing human-readable, executable code that explains attention head behavior.
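The re-ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes attention matrices are binarized at a fixed threshold before computing Intersection-over-Union, and the names `attention_iou`, `rerank`, `previous_token_program`, and `first_token_program` are hypothetical, as is the threshold value. In the paper's pipeline the candidate programs are generated by a language model; here two hand-written candidates stand in for them.

```python
import numpy as np


def attention_iou(pred, target, threshold=0.1):
    """IoU of two attention matrices after binarizing at `threshold`.

    The binarization rule is an assumption made for this sketch.
    """
    p = np.asarray(pred) >= threshold
    t = np.asarray(target) >= threshold
    union = np.logical_or(p, t).sum()
    if union == 0:
        return 1.0  # both masks are empty, so treat them as identical
    return float(np.logical_and(p, t).sum() / union)


def previous_token_program(tokens):
    """Hypothetical candidate: each position attends to the one before it."""
    n = len(tokens)
    attn = np.zeros((n, n))
    attn[0, 0] = 1.0  # the first token can only attend to itself
    for i in range(1, n):
        attn[i, i - 1] = 1.0
    return attn


def first_token_program(tokens):
    """Hypothetical candidate: every position attends to the first token."""
    n = len(tokens)
    attn = np.zeros((n, n))
    attn[:, 0] = 1.0
    return attn


def rerank(candidates, held_out):
    """Sort candidate programs by mean IoU on held-out examples.

    candidates: dict mapping a name to a function tokens -> (n, n) array
    held_out:   list of (tokens, observed attention matrix) pairs
    """
    scores = {
        name: float(np.mean([attention_iou(prog(toks), attn)
                             for toks, attn in held_out]))
        for name, prog in candidates.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Given held-out examples from a head that mostly attends to the preceding token, `rerank` would place `previous_token_program` ahead of `first_token_program`. The program that ranks first is the one kept as the readable description of that head.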
Key quotes
A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions.
We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories.
Replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks.
This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
