Research Reveals LLM Refusal Behavior Is Controlled by a Single Direction in Model Activations
[Submitted on 17 Jun 2024 (v1), last revised 30 Oct 2024 (this version, v3)]
Summary
This research paper investigates the internal mechanisms of refusal behavior in large language models (LLMs). The authors demonstrate that across 13 popular open-source chat models (up to 72B parameters), refusal to comply with harmful instructions is mediated by a single one-dimensional direction in the model's residual stream activations. Erasing this direction makes the model comply with harmful requests, while adding it causes refusal even on harmless instructions. The paper introduces a novel white-box jailbreak method that surgically disables refusal with minimal impact on other capabilities, and it analyzes how adversarial suffixes work by suppressing propagation of this refusal-mediating direction. The findings highlight the brittleness of current safety fine-tuning approaches.
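The "erase" and "add" operations described above can be illustrated with plain linear algebra. The following is a minimal sketch on synthetic vectors, not the authors' code: it assumes the direction is estimated as the difference between mean activations on harmful and harmless prompts, and the function names, array shapes, and toy data are all hypothetical.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def ablate_direction(acts, direction):
    """Remove the component of each activation vector along `direction`.

    acts has shape (..., d); direction has shape (d,).
    """
    r = unit(direction)
    proj = np.asarray(acts @ r)[..., None]  # scalar projection per vector
    return acts - proj * r


def add_direction(acts, direction, scale):
    """Shift each activation vector by `scale` along `direction`."""
    return acts + scale * unit(direction)


# Toy stand-ins for residual stream activations (assumption: 16 prompts, d = 8).
rng = np.random.default_rng(0)
d = 8
offset = 2.0 * np.eye(d)[0]  # synthetic "refusal" offset along axis 0
harmful = rng.normal(size=(16, d)) + offset
harmless = rng.normal(size=(16, d))

# Assumed estimator: difference of the two class means.
refusal_dir = harmful.mean(axis=0) - harmless.mean(axis=0)

ablated = ablate_direction(harmful, refusal_dir)
steered = add_direction(harmless, refusal_dir, scale=4.0)
```

After ablation, every vector in `ablated` has zero projection onto `refusal_dir`; after addition, every vector in `steered` has its projection increased by exactly `scale`. In a real model these operations would be applied to residual stream activations during the forward pass, which this sketch does not attempt.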
Key quotes
We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
Erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions.
Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
Our findings underscore the brittleness of current safety fine-tuning methods.
Our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.