Adversarial Poetry Functions as Universal Jailbreak Technique for Large Language Models

By capgre · 6mo ago · 2 min read

Summary

Research demonstrates that adversarial poetry serves as an effective universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts achieved high attack-success rates (ASR), with some providers exceeding 90%. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. On average, poetic framing achieved a jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, revealing systematic vulnerabilities across model families and safety training approaches. The findings indicate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
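To make the "times higher than prose baselines" comparison concrete, here is a minimal sketch of the arithmetic. It assumes the conventional definition of attack-success rate (ASR) as the fraction of prompts whose responses were judged unsafe; the summary does not spell out the study's exact scoring procedure. The `EvalResult` type, the function names, and the counts in the usage lines are all hypothetical illustrations, not figures from the research.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical tally for one prompt style on one model."""
    attempts: int  # number of prompts submitted
    unsafe: int    # number of responses judged unsafe


def attack_success_rate(result: EvalResult) -> float:
    """ASR, assumed here to be unsafe responses divided by attempts."""
    if result.attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= result.unsafe <= result.attempts:
        raise ValueError("unsafe must be between 0 and attempts")
    return result.unsafe / result.attempts


def uplift(poetic: EvalResult, prose: EvalResult) -> float:
    """How many times higher the poetic ASR is than the prose baseline."""
    baseline = attack_success_rate(prose)
    if baseline == 0:
        # A zero baseline makes the ratio undefined; report infinity.
        return float("inf")
    return attack_success_rate(poetic) / baseline


# Illustrative counts only (not from the study).
prose = EvalResult(attempts=1200, unsafe=150)
poetic = EvalResult(attempts=1200, unsafe=600)
print(attack_success_rate(prose))   # 0.125
print(attack_success_rate(poetic))  # 0.5
print(uplift(poetic, prose))        # 4.0
```

A ratio like this is sensitive to small baselines: a model that almost never complies with the prose prompt can show a very large multiplier even when its absolute poetic ASR is modest, so the two figures are best read together.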

Key quotes

Adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs)
Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%
Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines
Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions
These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols
Snippet from the RSS feed
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), w
