Adversarial Poetry Functions as Universal Jailbreak Technique for Large Language Models

capgre

6mo ago · 2 min read · en · Insight

Score: 75 · Type: analysis · Sentiment: neutral

Summary

The research reports that adversarial poetry works as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts achieved high attack-success rates (ASR), with some providers exceeding 90%. Converting 1,200 MLCommons harmful prompts into verse with a standardized meta-prompt produced ASRs up to 18 times higher than the prose baselines. On average, hand-crafted poems succeeded 62% of the time and meta-prompt conversions approximately 43% of the time, which points to systematic vulnerabilities across model families and safety training approaches. The authors conclude that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
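To make the headline figures easier to read, here is a minimal sketch of how an attack-success rate (ASR) and a "times higher than baseline" ratio can be computed. It assumes ASR means the share of prompts whose response was judged unsafe; the summary does not spell out the study's exact definition. The function names and the outcome lists are hypothetical and are not data from the study.

```python
def attack_success_rate(outcomes):
    """Return the fraction of judged prompts marked as successful attacks.

    `outcomes` is a list of booleans (True = response judged unsafe).
    Assumption: this is the usual ASR definition, not confirmed by the source.
    """
    if not outcomes:
        raise ValueError("need at least one judged prompt")
    return sum(outcomes) / len(outcomes)


def relative_increase(poetic_asr, prose_asr):
    """Return how many times higher the poetic ASR is than the prose baseline."""
    if prose_asr == 0:
        raise ValueError("prose baseline ASR is zero; ratio is undefined")
    return poetic_asr / prose_asr


# Hypothetical judged outcomes for illustration only, not results from the study.
prose_outcomes = [True] * 2 + [False] * 38    # 2 successes out of 40 prompts
poetic_outcomes = [True] * 18 + [False] * 22  # 18 successes out of 40 prompts

prose_asr = attack_success_rate(prose_outcomes)
poetic_asr = attack_success_rate(poetic_outcomes)
ratio = relative_increase(poetic_asr, prose_asr)

print(f"prose ASR:  {prose_asr:.0%}")
print(f"poetic ASR: {poetic_asr:.0%}")
print(f"poetic ASR is {ratio:.1f}x the prose baseline")
```

The ratio depends heavily on the baseline: when the prose ASR is already low, a moderate poetic ASR yields a large multiple. That is why a figure such as "up to 18 times higher" is best read alongside the absolute rates.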

Key quotes

Adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs)

Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%

Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines

Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions

These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols
