Adversarial Poetry Functions as Universal Jailbreak Technique for Large Language Models
By capgre
Summary
Research demonstrates that adversarial poetry works as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. On average, poetic framing achieved a jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, pointing to a systematic vulnerability across model families and safety training approaches. Because stylistic variation alone was enough to circumvent contemporary safety mechanisms, the findings suggest fundamental limitations in current alignment methods and evaluation protocols.
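To make the two figures in the summary concrete, here is a minimal sketch of how an attack-success rate and a "times higher than baseline" ratio could be computed from per-prompt judge labels. The function names and the label lists are hypothetical and purely illustrative; they are not the study's code or data, and the study's actual judging procedure is not described here.

```python
def attack_success_rate(outcomes):
    """Return the fraction of prompts whose response was judged unsafe.

    `outcomes` is a list of booleans, one per prompt (True = jailbreak succeeded).
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(1 for judged_unsafe in outcomes if judged_unsafe) / len(outcomes)


def asr_ratio(poetic_asr, prose_asr):
    """Return how many times higher the poetic ASR is than the prose baseline."""
    if prose_asr == 0:
        # The ratio is undefined when the baseline never succeeds.
        return float("inf")
    return poetic_asr / prose_asr


# Hypothetical judge labels for illustration only (NOT figures from the study).
prose_outcomes = [True] * 1 + [False] * 19    # 1 of 20 prose prompts succeeded
poetic_outcomes = [True] * 9 + [False] * 11   # 9 of 20 poetic prompts succeeded

prose_asr = attack_success_rate(prose_outcomes)
poetic_asr = attack_success_rate(poetic_outcomes)
print(f"prose ASR:  {prose_asr:.0%}")
print(f"poetic ASR: {poetic_asr:.0%}")
print(f"ratio:      {asr_ratio(poetic_asr, prose_asr):.1f}x")
```

Note that a ratio such as "18 times higher" is sensitive to a small baseline: the lower the prose ASR, the larger the multiple a given poetic ASR produces, which is why the averages (62% and 43%) are reported alongside it.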
Key quotes
Adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs)
Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%
Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines
Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions
These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols