Localizing Factual Recall Circuits in Gemma Models via Activation Patching

Subhanga Upadhyay

6d ago · 9 min read

technology science ai research mechanistic interpretability

Summary

This article presents BizzaroWorld, a mechanistic interpretability study that uses activation patching to localize factual recall circuits in the Gemma-2B and Gemma-12B-IT models, using 60 prompt pairs and 20 knowledge categories. The study investigates how factual knowledge is stored, routed, and read out across transformer layers, and finds that the residual stream does most of the work. It is influenced by prior work on entity tracking in the LLaMA model series and asks whether the location of factual knowledge is consistent across model scales.
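To make the technique concrete, here is a minimal sketch of activation patching on a toy model. It is not the study's code and does not load Gemma: the three layers (`enrich`, `move`, `readout`), the two token positions, and the normalized recovery metric are all assumptions chosen for illustration. The idea is to cache the residual stream from a clean run, overwrite one (layer, position) slot in a corrupted run with the cached value, and measure how much of the clean output comes back.

```python
from typing import Callable, List, Optional, Tuple

Resid = List[float]  # toy residual stream: one scalar per token position

# Hypothetical toy layers. Position 0 is the "subject" token, position 1 the
# final token whose value stands in for the answer logit difference.
def enrich(resid: Resid) -> Resid:
    out = list(resid)
    out[0] += 2.0 * out[0]   # MLP-like step: amplify the subject signal in place
    return out

def move(resid: Resid) -> Resid:
    out = list(resid)
    out[1] += out[0]         # attention-like step: copy subject info to the last position
    return out

def readout(resid: Resid) -> Resid:
    out = list(resid)
    out[1] += 0.5 * out[1]   # final step: scale the signal at the last position
    return out

LAYERS: List[Callable[[Resid], Resid]] = [enrich, move, readout]

def run(tokens: Resid,
        patch: Optional[Tuple[int, int]] = None,
        cache: Optional[List[Resid]] = None) -> Tuple[float, List[Resid]]:
    """Run the toy model and return (output, residual stream after each layer).

    If patch=(layer, pos) is given, the residual stream at `pos` is overwritten
    with cache[layer][pos] right after `layer` has run.
    """
    resid = list(tokens)
    states: List[Resid] = []
    for i, layer in enumerate(LAYERS):
        resid = layer(resid)
        if patch is not None and cache is not None and patch[0] == i:
            resid[patch[1]] = cache[i][patch[1]]
        states.append(list(resid))
    return resid[-1], states

CLEAN: Resid = [1.0, 0.0]     # stands in for the prompt with the correct subject
CORRUPT: Resid = [-1.0, 0.0]  # stands in for the prompt with a swapped subject

clean_out, clean_cache = run(CLEAN)
corrupt_out, _ = run(CORRUPT)

def recovery(layer: int, pos: int) -> float:
    """Fraction of the clean-vs-corrupt output gap restored by a single patch."""
    patched_out, _ = run(CORRUPT, patch=(layer, pos), cache=clean_cache)
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)

# One row per layer, one column per token position.
grid = [[recovery(layer, pos) for pos in range(len(CLEAN))]
        for layer in range(len(LAYERS))]

for layer, row in enumerate(grid):
    print(f"layer {layer}: {row}")
```

In this toy, patching the subject position restores the output only before the `move` layer, and patching the final position restores it only from `move` onward, which is the kind of stored, routed, read-out pattern a patching sweep is meant to expose. On a real model the same loop would run over cached activations captured with forward hooks, and the choice of metric and corruption scheme would need to follow the original study.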

Source

Localizing Factual Recall Circuits in Gemma Models via Activation Patching (towardsdatascience.com)

Key quotes

This post presents BizzaroWorld, a mechanistic interpretability study attempting to localize factual recall circuits in the Gemma model family using activation patching across 60 prompt pairs and 20 knowledge categories.

The goal: localize where factual knowledge lives inside a transformer, and whether that location is consistent across model scale.

Activation patching reveals how facts are stored, routed, and read out across transformer layers, and why the residual stream does most of the work
