BPE Tokenization Creates Exploitable Safety Gaps in LLM Alignment, Study Finds

[Submitted on 1 May 2026]


technology science cybersecurity ai safety

Summary

This research paper identifies a structural vulnerability in LLM safety alignment that stems from BPE (byte-pair encoding) tokenization. Character-level perturbations leave a prompt human-readable but fragment safety-critical words into sub-word pieces, and safety alignment fails on these inputs because the alignment datasets contain no intentionally fragmented prompts. The authors tested the attack across five model families (including Qwen, Gemma, Llama, and Mistral). An optimization targeting safety-token fragmentation flipped the first-token refusal trigger on 80-100% of refused HarmBench prompts, and 48% of those flips produced genuinely harmful outputs. The authors localized the disrupted signal to the last ~30% of layers and found zero fragmented prompts among 30,000 alignment examples. Of the defenses tested, DPO (direct preference optimization) failed to achieve stable closure of the attack success rate (ASR), while SFT (supervised fine-tuning) on fragmented prompts closed ASR on 3 of 5 families, but only through global collapse that also raised refusal on benign prompts. The paper introduces Conv-Benign, a candidate paired diagnostic, to distinguish selective repair from global collapse.
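To make the fragmentation mechanism concrete, here is a minimal sketch of BPE merging on a single word. The merge table `TOY_MERGES`, the function name `bpe_tokenize`, and the example word are hypothetical and chosen for illustration; they do not come from the paper or from any real model's vocabulary.

```python
def bpe_tokenize(word, merges):
    """Apply BPE merges to one word, always merging the lowest-ranked pair first."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table (an assumption) that assembles "harmful" into one token.
TOY_MERGES = [
    ("h", "a"), ("r", "m"), ("f", "u"),
    ("ha", "rm"), ("fu", "l"), ("harm", "ful"),
]

clean = bpe_tokenize("harmful", TOY_MERGES)       # ['harmful']
perturbed = bpe_tokenize("har.mful", TOY_MERGES)  # ['ha', 'r', '.', 'm', 'ful']
```

In this toy setting a single inserted character blocks the merges that would have rebuilt the word, so the model receives five sub-word pieces instead of the one token it saw during alignment training. Real tokenizers have far larger merge tables, so the exact split will differ.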

Source

BPE Tokenization Creates Exploitable Safety Gaps in LLM Alignment, Study Finds (arxiv.org)

Key quotes


Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable.

An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs.

An alignment-data scan finds zero fragmented prompts among 30,000 examples (positive-control recall ≥ 99% at attack-relevant intensities).

SFT trained on fragmented prompts closes ASR on 3/5 families but only via global collapse that raises refusal on benign prompts as well.

To distinguish selective repair from global collapse, we introduce Conv-Benign, a candidate paired diagnostic.
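The distinction between selective repair and global collapse in the quotes above can be sketched as simple arithmetic on refusal rates. This is not the paper's Conv-Benign definition, which the summary does not give; the function names, thresholds, and input numbers below are all illustrative assumptions.

```python
def refusal_rate(refused):
    """Fraction of responses flagged as refusals (list of booleans)."""
    return sum(refused) / len(refused)


def classify_repair(attack_refusal_after, benign_refusal_before,
                    benign_refusal_after, min_attack_refusal=0.9,
                    max_benign_increase=0.05):
    """Label a defense outcome. Both thresholds are hypothetical, not from the paper."""
    closes_attack = attack_refusal_after >= min_attack_refusal
    over_refuses = (benign_refusal_after - benign_refusal_before) > max_benign_increase
    if closes_attack and over_refuses:
        return "global collapse"   # attack closed, but benign prompts refused too
    if closes_attack:
        return "selective repair"  # attack closed, benign behavior preserved
    return "attack not closed"


# Made-up rates for illustration only; these are not results from the study.
print(classify_repair(0.95, 0.02, 0.60))  # global collapse
print(classify_repair(0.95, 0.02, 0.03))  # selective repair
```

The point of pairing the two measurements is that a drop in ASR alone cannot tell these outcomes apart: a model that refuses everything also has an ASR of zero.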

Snippet from the RSS feed

Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable. We identify and test a central structural mechanism: BPE tokenization fragments safety-critical words into sub-word pieces, and the three public a

