BPE Tokenization Creates Exploitable Safety Gaps in LLM Alignment, Study Finds

[Submitted on 1 May 2026]

Summary

This research paper identifies a structural vulnerability in LLM safety alignment caused by BPE tokenization. Character-level perturbations leave a prompt human-readable but cause the tokenizer to split safety-critical words into sub-word pieces, and safety alignment fails on these inputs because alignment datasets contain no intentionally fragmented prompts: a scan of 30,000 alignment examples found none. The authors tested the attack across five model families (including Qwen, Gemma, Llama, and Mistral). An optimization targeting safety-token fragmentation flipped the first-token refusal trigger on 80-100% of refused HarmBench prompts, and 48% of those flips produced genuinely harmful outputs. The authors localized the disrupted signal to the last ~30% of layers. Of the defenses tested, DPO failed to achieve stable closure of the attack success rate (ASR), while SFT on fragmented prompts closed ASR on 3 of 5 families, but only through global collapse that also raised refusal on benign prompts. The paper introduces Conv-Benign, a candidate paired diagnostic, to distinguish selective repair from global collapse.
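To make the fragmentation mechanism concrete, here is a minimal sketch of BPE merge application on a toy vocabulary. The merge table, the example word, and the function name `bpe_encode` are all hypothetical and chosen for illustration; they are not taken from the paper or from any real tokenizer, whose learned merge tables are far larger.

```python
from typing import List, Tuple

# Hypothetical, hand-written merge table in priority order (assumption:
# a real BPE tokenizer learns its merges from a large corpus).
MERGES: List[Tuple[str, str]] = [
    ("h", "a"), ("r", "m"), ("f", "u"),
    ("ha", "rm"), ("fu", "l"), ("harm", "ful"),
]
RANK = {pair: i for i, pair in enumerate(MERGES)}


def bpe_encode(word: str) -> List[str]:
    """Apply BPE merges to one word, always merging the best-ranked pair."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: RANK.get(p, float("inf")))
        if best not in RANK:
            break  # no adjacent pair is mergeable any more
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe_encode("harmful"))   # ['harmful']
print(bpe_encode("ha.rmful"))  # ['ha', '.', 'rm', 'ful']
```

In this toy setting the clean word maps to a single token, while one inserted character blocks the higher-level merges and leaves four pieces. A model whose alignment data only ever contained the single-token form would see a token sequence it was never trained to refuse on, which is the kind of gap the summary describes.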

Source

BPE Tokenization Creates Exploitable Safety Gaps in LLM Alignment, Study Finds (arxiv.org, via Bluesky)

Key quotes

Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable.
An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs.
An alignment-data scan finds zero fragmented prompts among 30,000 examples (positive-control recall ≥ 99% at attack-relevant intensities).
SFT trained on fragmented prompts closes ASR on 3/5 families but only via global collapse that raises refusal on benign prompts as well.
To distinguish selective repair from global collapse, we introduce Conv-Benign, a candidate paired diagnostic.
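The distinction between selective repair and global collapse in the quotes above can be sketched as a comparison of two rates after a defense is applied: attack success on fragmented harmful prompts and refusal on benign prompts. This is only an illustration of the idea and not the paper's Conv-Benign diagnostic, whose definition is not given here; the function names and both thresholds are assumptions.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True entries; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def classify_defense(
    attack_success_after: Sequence[bool],
    benign_refused_before: Sequence[bool],
    benign_refused_after: Sequence[bool],
    asr_closed_below: float = 0.05,     # hypothetical threshold
    max_benign_increase: float = 0.05,  # hypothetical threshold
) -> str:
    """Label a defense by its effect on attacked and benign prompts."""
    asr = rate(attack_success_after)
    benign_delta = rate(benign_refused_after) - rate(benign_refused_before)
    if asr > asr_closed_below:
        return "asr_not_closed"
    if benign_delta > max_benign_increase:
        # The attack is blocked, but benign prompts are now refused too.
        return "global_collapse"
    return "selective_repair"


# Attack fully blocked, but 15 of 20 benign prompts are now refused.
print(classify_defense([False] * 20, [False] * 20, [True] * 15 + [False] * 5))
```

Pairing the two measurements is what matters: ASR alone would make the collapsed model in the usage example look like a successful defense.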
Snippet from the RSS feed
Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable. We identify and test a central structural mechanism: BPE tokenization fragments safety-critical words into sub-word pieces, and the three public a
