Reverse-Engineering Transformer Attention Heads Using Program Synthesis
This paper proposes a scalable pipeline for reverse-engineering attention heads in transformer language models by approximating their behavior with executable Python programs. The approach computes attention matrices from training examples, uses a pre-trained language model to ge