MemoAttack: A Memory-Driven Framework for Automated LLM Jailbreak Attacks
[Submitted on 28 May 2026]
Summary
This paper introduces MemoAttack, a novel memory-driven black-box jailbreak framework for large language models (LLMs). Unlike existing methods that rely on heuristic search or unstructured strategy pools, MemoAttack systematically organizes attack experience through three key components: (1) Skill-Structured Memory Modeling that abstracts attack experience into reusable units pairing skills with templates, evidence, and lifecycle states; (2) Lifecycle-Driven Memory Evolution that manages memory through probation, promotion, retirement, and elimination; and (3) Explore-Exploit Balanced Memory Selection using contextual Thompson Sampling. Experiments on AdvBench show MemoAttack achieves a 98.00% average attack success rate, outperforming the strongest baseline by 16.67 percentage points while reducing request count by 45.9%.
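The selection and lifecycle ideas in the summary can be illustrated with a generic bandit sketch. This is not the paper's implementation: it uses plain (non-contextual) Beta-Bernoulli Thompson Sampling rather than the contextual variant the summary names, and the class `MemoryUnit`, the functions `select` and `update`, the transition rules, and the thresholds `min_trials`, `promote_at`, and `retire_below` are all hypothetical choices made for illustration. Only the four lifecycle stage names come from the summary.

```python
import random
from dataclasses import dataclass


@dataclass
class MemoryUnit:
    """A reusable unit with outcome counts and a lifecycle state (hypothetical layout)."""
    name: str
    successes: int = 0
    failures: int = 0
    state: str = "probation"  # probation, promoted, retired, or eliminated

    @property
    def trials(self) -> int:
        return self.successes + self.failures

    @property
    def rate(self) -> float:
        return self.successes / self.trials if self.trials else 0.0

    def sample(self, rng: random.Random) -> float:
        # Draw from the Beta(1 + successes, 1 + failures) posterior.
        return rng.betavariate(1 + self.successes, 1 + self.failures)


def select(units, rng):
    """Thompson Sampling: pick the selectable unit with the highest posterior draw."""
    candidates = [u for u in units if u.state in ("probation", "promoted")]
    if not candidates:
        raise ValueError("no selectable units")
    return max(candidates, key=lambda u: u.sample(rng))


def update(unit, success, min_trials=5, promote_at=0.6, retire_below=0.2):
    """Record an outcome, then apply assumed lifecycle transitions."""
    if success:
        unit.successes += 1
    else:
        unit.failures += 1
    if unit.trials < min_trials:
        return  # not enough evidence to change state yet
    if unit.state == "probation":
        if unit.rate >= promote_at:
            unit.state = "promoted"
        elif unit.rate < retire_below:
            unit.state = "eliminated"
    elif unit.state == "promoted" and unit.rate < retire_below:
        unit.state = "retired"
```

Sampling from the posterior, rather than always choosing the unit with the best observed rate, is what gives the explore-exploit balance: a unit with few trials has a wide posterior and is still chosen occasionally, while a unit with a long record is chosen roughly in proportion to how likely it is to be the best.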
Key quotes
MemoAttack achieves an average attack success rate of 98.00%, outperforming the strongest baseline by 16.67 percentage points, while reducing request count by 45.9%.
Existing black-box jailbreak methods either depend on sample-wise heuristic search or leverage attack experience through accumulating strategy pools or method libraries, lacking a systematic organization and management of attack experience.
MemoAttack comprises three key designs: (1) Skill-Structured Memory Modeling, (2) Lifecycle-Driven Memory Evolution, and (3) Explore-Exploit Balanced Memory Selection.
MemoAttack continuously improves as memory accumulates over more samples.
