
Anthropic researchers extract interpretable features from Claude 3 Sonnet using sparse autoencoders


[Submitted on 28 May 2026]


Summary

Researchers at Anthropic show that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, which addresses the open question of whether dictionary learning methods scale beyond small transformers. They trained autoencoders with up to 34 million features on the model's middle layer. The resulting features are multilingual and multimodal, and they respond to both concrete instances and abstract discussions of a concept. Some are relevant to ways a language model might cause harm, including features representing deception, power-seeking, sycophancy, and bias. Manipulating features steers model behavior in ways consistent with their interpretations. Significant limitations remain: the feature set is incomplete, and the authors lack rigorous methods for evaluating whether the features faithfully capture the model's computations.
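For readers unfamiliar with the technique, here is a minimal toy sketch of what a sparse autoencoder does and how clamping one feature can steer an activation. It is not Anthropic's implementation: the class name `ToySparseAutoencoder`, the sizes, the random weights, the loss form, and the `steer` rule are all illustrative assumptions, and a real autoencoder would be trained on model activations rather than left at random initialization.

```python
import random


class ToySparseAutoencoder:
    """Hypothetical toy SAE: activation (d_model) -> features (n_features) -> activation."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # One encoder row and one decoder row per feature (random, untrained).
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Each decoder row is the direction a feature writes into the activation.
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # ReLU keeps feature activations non-negative, so many can be exactly zero.
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, f):
        # Reconstruction is the bias plus a weighted sum of feature directions.
        out = list(self.b_dec)
        for f_i, direction in zip(f, self.w_dec):
            for j, d in enumerate(direction):
                out[j] += f_i * d
        return out

    def loss(self, x, l1_coeff=1e-3):
        # Assumed objective: squared reconstruction error plus an L1 sparsity penalty.
        f = self.encode(x)
        x_hat = self.decode(f)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        return mse + l1_coeff * sum(f)  # f >= 0, so sum(f) is its L1 norm

    def steer(self, x, feature_index, value):
        # Assumed steering rule: clamp one feature to `value` and shift the
        # activation along that feature's decoder direction by the difference.
        f = self.encode(x)
        delta = value - f[feature_index]
        return [xi + delta * d for xi, d in zip(x, self.w_dec[feature_index])]


sae = ToySparseAutoencoder(d_model=4, n_features=8, seed=0)
activation = [1.0, -0.5, 0.25, 2.0]
features = sae.encode(activation)
steered = sae.steer(activation, feature_index=3, value=5.0)
```

In this sketch the number of features (8) exceeds the activation width (4), which mirrors the overcomplete dictionary idea at a much smaller scale than the 34 million features mentioned above.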

Key quotes

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers.
The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations.
We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated.
However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.
