BPE Tokenization Creates Exploitable Safety Gaps in LLM Alignment, Study Finds
[Submitted on 1 May 2026]
Summary
This research paper identifies a structural vulnerability in LLM safety alignment that stems from BPE (byte-pair encoding) tokenization. Character-level perturbations fragment safety-critical words into sub-word pieces while leaving the prompt human-readable, and alignment fails on these inputs because alignment datasets contain no intentionally fragmented prompts. The authors tested the attack across five model families (including Qwen, Gemma, Llama, and Mistral) and flipped the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs. They localized the disrupted signal to the last ~30% of layers and found zero fragmented prompts in a scan of 30,000 alignment examples. Two defenses were tested: DPO (direct preference optimization) failed to achieve stable closure of the attack success rate (ASR), while SFT (supervised fine-tuning) on fragmented prompts closed ASR on 3 of 5 families, but only through global collapse that raised refusal on benign prompts as well. The paper introduces Conv-Benign, a candidate paired diagnostic, to distinguish selective repair from global collapse.
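To make the idea of "fragmentation" concrete, here is a minimal sketch of how a data scan might flag prompts whose words split into unusually many sub-word pieces. It is not the paper's scanner: `TOY_VOCAB`, `toy_tokenize`, `max_pieces_per_word`, `count_fragmented`, and the threshold of 3 are all hypothetical, and the greedy longest-match segmenter is only a stand-in for a real BPE tokenizer.

```python
from typing import Iterable, List, Set

# Hypothetical toy vocabulary (assumption); real BPE vocabularies are learned
# from data and contain tens of thousands of entries.
TOY_VOCAB: Set[str] = {"the", "safety", "filter", "saf", "ety", "fil", "ter"}


def toy_tokenize(word: str, vocab: Set[str] = TOY_VOCAB) -> List[str]:
    """Greedy longest-match segmentation, a stand-in for BPE (assumption).

    Any character not covered by the vocabulary becomes its own piece.
    """
    pieces: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def max_pieces_per_word(prompt: str) -> int:
    """Largest number of pieces that any whitespace-separated word splits into."""
    return max((len(toy_tokenize(w)) for w in prompt.split()), default=0)


def count_fragmented(prompts: Iterable[str], threshold: int = 3) -> int:
    """Count prompts containing a word that splits into >= threshold pieces."""
    return sum(1 for p in prompts if max_pieces_per_word(p) >= threshold)


if __name__ == "__main__":
    clean = "the safety filter"
    perturbed = "the saf-ety filter"  # one inserted character
    print(toy_tokenize("safety"))     # ['safety']
    print(toy_tokenize("saf-ety"))    # ['saf', '-', 'ety']
    print(count_fragmented([clean, perturbed]))  # 1
```

A single inserted character turns one whole-word token into three pieces, which is the kind of input the summary says is absent from alignment data. A real scan would use the target model's own tokenizer and a threshold calibrated against known-fragmented positive controls.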
Key quotes
Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable.
An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs.
An alignment-data scan finds zero fragmented prompts among 30,000 examples (positive-control recall ≥ 99% at attack-relevant intensities).
SFT trained on fragmented prompts closes ASR on 3/5 families but only via global collapse that raises refusal on benign prompts as well.
To distinguish selective repair from global collapse, we introduce Conv-Benign, a candidate paired diagnostic.
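The distinction between selective repair and global collapse drawn in the quotes above can be illustrated with a small sketch. The construction of Conv-Benign itself is not described here, so this is not that diagnostic: it is a generic check that compares attack success rate with benign refusal rate before and after a defense. The `EvalRates` type, the `classify_defense` function, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalRates:
    """Rates in [0, 1] measured on two evaluation sets (hypothetical schema)."""
    attack_success_rate: float  # share of fragmented harmful prompts answered
    benign_refusal_rate: float  # share of benign prompts refused


def classify_defense(
    before: EvalRates,
    after: EvalRates,
    asr_closed: float = 0.05,        # assumed cutoff for "ASR closed"
    benign_tolerance: float = 0.05,  # assumed allowed rise in benign refusals
) -> str:
    """Label a defense by what it did to attack success and benign refusal."""
    if after.attack_success_rate > asr_closed:
        return "not closed"
    # ASR is closed; check whether benign prompts paid the price.
    rise = after.benign_refusal_rate - before.benign_refusal_rate
    if rise > benign_tolerance:
        return "global collapse"
    return "selective repair"


if __name__ == "__main__":
    baseline = EvalRates(attack_success_rate=0.90, benign_refusal_rate=0.02)
    print(classify_defense(baseline, EvalRates(0.40, 0.02)))  # not closed
    print(classify_defense(baseline, EvalRates(0.00, 0.60)))  # global collapse
    print(classify_defense(baseline, EvalRates(0.01, 0.03)))  # selective repair
```

The numbers in the usage example are invented for illustration only. The point of a paired check is that a low ASR on its own cannot tell a defense that learned to recognize fragmented harmful prompts from one that simply refuses more of everything.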