All Topics
All Topics
Technology
Technology
Design
Design
Programming
Programming
Science
Science
News
News
Gaming
Gaming
Entertainment
Entertainment
Business
Business
Finance
Finance
Sports
Sports
Health
Health
Food
Food
Travel
Travel
Art
Art
Music
Music
Books
Books
Education
Education
Politics
Politics
Personal
Personal
No algorithm. No AI slop. No ads. Just RSS. Pro-human. Indie writers. Real journalism. Open web. Chronological. Hand toasted.

Steerling-8B: Direct Concept Control in Language Models Through Internal Representation Editing

By

luulinh90s

3mo ago· 5 min readenInsight

Summary

Steerling-8B is a language model architecture that enables direct editing of internal representations to control concepts at inference time. Unlike traditional prompting, it allows injection and suppression of learned concepts without changing the input prompt. The model supports compositional control in multi-turn dialogues, enabling fine-grained manipulation of concepts like toxicity suppression while preserving fluency. This approach provides reliable, interpretable control over language model generation by directly manipulating human-interpretable concepts during inference.

Key quotes

· 5 pulled
Steerling-8B's architecture natively supports injecting and suppressing any concept the model has learned, directly at inference time.
In multi-turn dialog settings, steering one concept at a time is insufficient. You need compositional control, not just on a neutral prompt, but on a conversation that is already shaped by prior context.
Consider a content moderation that must suppress toxicity yet preserve fluency.
We demonstrate reliable, fine-grained control over language model generation by directly injecting, suppressing, and composing human-interpretable concepts at inference time.
What if you could directly edit the internal representations of a model towards any concept you care about, without changing the prompt?
Snippet from the RSS feed
We demonstrate reliable, fine-grained control over language model generation by directly injecting, suppressing, and composing human-interpretable concepts at inference time.

You might also wanna read