Reverse-Engineering Transformer Attention Heads Using Program Synthesis

[Submitted on 17 Jun 2026 (v1), last revised 29 Jun 2026 (this version, v2)]

Summary

This paper proposes a scalable pipeline for reverse-engineering attention heads in transformer language models by approximating their behavior with executable Python programs. The approach computes attention matrices from training examples, uses a pre-trained language model to generate candidate programs that reproduce attention patterns, and re-ranks them based on predictive accuracy on held-out inputs. The method achieves over 75% average Intersection-over-Union similarity on TinyStories across GPT-2, TinyLlama-1.1B, and Llama-3B models. Replacing 25% of attention heads with programmatic surrogates causes only a 16% average perplexity increase while maintaining downstream QA performance. The work advances symbolic transparency in neural models by producing human-readable, executable code that explains attention head behavior.
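The re-ranking step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the summary does not say how attention matrices are turned into sets for the Intersection-over-Union comparison, so the binarization threshold, the program interface (a function from tokens to a set of query/key index pairs), the two candidate programs, and the toy attention matrix below are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

# Hypothetical representation: an attention pattern is the set of
# (query_index, key_index) pairs that a head "attends to".
Pairs = Set[Tuple[int, int]]
Program = Callable[[Sequence[str]], Pairs]


def attended_pairs(attn: List[List[float]], threshold: float = 0.3) -> Pairs:
    """Binarize a causal attention matrix (assumed threshold, not from the paper)."""
    return {
        (i, j)
        for i, row in enumerate(attn)
        for j, weight in enumerate(row)
        if j <= i and weight >= threshold
    }


def iou(a: Pairs, b: Pairs) -> float:
    """Intersection-over-Union of two sets of index pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two hypothetical candidate programs, standing in for LM-generated code.
def previous_token(tokens: Sequence[str]) -> Pairs:
    return {(i, i - 1) for i in range(1, len(tokens))}


def first_token(tokens: Sequence[str]) -> Pairs:
    return {(i, 0) for i in range(len(tokens))}


def rerank(
    candidates: Dict[str, Program],
    held_out: List[Tuple[Sequence[str], List[List[float]]]],
    threshold: float = 0.3,
) -> List[Tuple[float, str]]:
    """Score each candidate by mean IoU on held-out inputs, best first."""
    scored = []
    for name, program in candidates.items():
        scores = [
            iou(program(tokens), attended_pairs(attn, threshold))
            for tokens, attn in held_out
        ]
        scored.append((sum(scores) / len(scores), name))
    return sorted(scored, reverse=True)


# Toy held-out example: a made-up head that mostly attends to the previous token.
TOKENS = ["Once", "upon", "a", "time"]
ATTN = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.10, 0.80, 0.10, 0.00],
    [0.05, 0.05, 0.85, 0.05],
]
CANDIDATES = {"previous_token": previous_token, "first_token": first_token}

ranking = rerank(CANDIDATES, [(TOKENS, ATTN)])
print(ranking)
```

On this toy input the previous-token program ranks first with an IoU of 0.75, because it misses only the first token's attention to itself. A real pipeline would extract the matrices from the model and score many generated candidates per head.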

Source

arxiv.org: Reverse-Engineering Transformer Attention Heads Using Program Synthesis

Key quotes

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions.

We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories.

Replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks.

This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
