Research Reveals LLM Refusal Behavior Is Controlled by a Single Direction in Model Activations

[Submitted on 17 Jun 2024 (v1), last revised 30 Oct 2024 (this version, v3)]

Summary

This paper investigates the internal mechanism behind refusal behavior in large language models (LLMs). The authors show that, across 13 popular open-source chat models (up to 72B parameters), refusal of harmful instructions is mediated by a single direction, that is, a one-dimensional subspace, in the model's residual stream activations. Erasing this direction makes the model comply with harmful requests, while adding it elicits refusal even on harmless instructions. Building on this, the paper proposes a white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities, and analyzes how adversarial suffixes work by suppressing propagation of the refusal-mediating direction. The authors conclude that current safety fine-tuning methods are brittle.
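The two interventions described in the summary, erasing a direction and adding it, can be illustrated on synthetic data. The sketch below is a toy under stated assumptions, not the authors' implementation: the activations are random NumPy arrays rather than real residual stream activations, the direction is estimated as a difference of mean activations (an assumption, since the summary does not say how the direction is found), and the names `refusal_direction`, `ablate`, and `add_direction` are hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a unit direction as the difference of mean activations.

    Assumption: difference of means is used here only for illustration.
    Both inputs have shape (n_samples, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts, direction):
    """Erase the component of each activation along a unit direction."""
    # For each row x: x - (x . r) r
    return acts - np.outer(acts @ direction, direction)


def add_direction(acts, direction, scale):
    """Shift every activation along the direction by a fixed amount."""
    return acts + scale * direction


# Synthetic stand-ins for activations (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0

harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r_hat)                      # "erasing" the direction
steered = add_direction(harmless, r_hat, scale=4.0)   # "adding" the direction
```

After ablation, every activation has zero projection onto `r_hat`, which is the sense in which the direction is erased. In a real model the same projection would presumably have to be applied wherever the residual stream is read or written, a detail this toy example does not cover.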

Key quotes

We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.

Erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions.

Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.

Our findings underscore the brittleness of current safety fine-tuning methods.

Our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.

Snippet from the RSS feed

Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms r
