Research Reveals LLM Refusal Behavior Is Controlled by a Single Direction in Model Activations

[Submitted on 17 Jun 2024 (v1), last revised 30 Oct 2024 (this version, v3)]

Summary

This paper investigates the internal mechanism behind refusal in large language models (LLMs). Across 13 popular open-source chat models (up to 72B parameters), the authors show that refusal of harmful instructions is mediated by a single direction, that is, a one-dimensional subspace, in the model's residual stream activations. Erasing this direction makes the model comply with harmful requests, while adding it causes refusal even on harmless instructions. The paper introduces a white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities, and it analyzes how adversarial suffixes work by suppressing propagation of the refusal-mediating direction. The findings underscore the brittleness of current safety fine-tuning methods.
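
To make "erasing" and "adding" a direction concrete, here is a minimal NumPy sketch on synthetic vectors. It is an illustration under stated assumptions, not the authors' code: the random arrays merely stand in for residual stream activations, the direction is assumed to be estimated as a difference of mean activations between harmful and harmless prompts (a detail this summary does not spell out), and every function name is hypothetical.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def difference_of_means(acts_a, acts_b):
    """Candidate direction: mean of group A minus mean of group B (assumed estimator)."""
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)


def ablate(x, r_hat):
    """Erase the component of x along the unit vector r_hat: x - (x . r_hat) r_hat."""
    coeff = np.asarray(x @ r_hat)[..., None]
    return x - coeff * r_hat


def add_direction(x, r, scale=1.0):
    """Shift activations along r, the 'adding the direction' intervention."""
    return x + scale * r


# Synthetic stand-ins for activations (NOT real model data).
rng = np.random.default_rng(0)
d_model = 16
planted = unit(rng.normal(size=d_model))  # hypothetical planted direction
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * planted

r = difference_of_means(harmful, harmless)
r_hat = unit(r)

ablated = ablate(harmful, r_hat)      # no component left along r_hat
steered = add_direction(harmless, r)  # pushed along r

print("max |projection| after ablation:", np.abs(ablated @ r_hat).max())
print("mean projection, harmless vs. steered:",
      (harmless @ r_hat).mean(), (steered @ r_hat).mean())
```

In this toy setting, ablation drives the projection onto the direction to zero (up to floating-point error) while leaving the orthogonal components untouched, and addition raises the mean projection by exactly the norm of `r`. Whether such an intervention changes a real model's behavior is the empirical question the paper addresses.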

Key quotes

We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
Erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions.
Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
Our findings underscore the brittleness of current safety fine-tuning methods.
Our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
Snippet from the RSS feed
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms r