MemoAttack: A Memory-Driven Framework for Automated LLM Jailbreak Attacks

[Submitted on 28 May 2026]

Summary

This paper introduces MemoAttack, a memory-driven black-box jailbreak framework for large language models (LLMs). Existing methods rely on sample-wise heuristic search or on accumulated strategy pools and method libraries, without systematically organizing or managing attack experience. MemoAttack addresses this with three components: (1) Skill-Structured Memory Modeling, which abstracts attack experience into reusable units that pair skills with templates, evidence, and lifecycle states; (2) Lifecycle-Driven Memory Evolution, which manages memory units through probation, promotion, retirement, and elimination; and (3) Explore-Exploit Balanced Memory Selection, which uses contextual Thompson Sampling. On AdvBench, MemoAttack achieves an average attack success rate of 98.00%, outperforming the strongest baseline by 16.67 percentage points while reducing request count by 45.9%.
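The selection and lifecycle ideas in the summary can be illustrated with a generic bandit sketch. This is not the paper's implementation: it uses plain Beta-Bernoulli Thompson Sampling rather than the contextual variant the paper names, and the `MemoryUnit` class, the `next_state` and `select` functions, and every threshold are hypothetical names and values chosen only for illustration. The units here are abstract items with success and failure counts; nothing in the sketch generates prompts or queries a model.

```python
import random
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # The four lifecycle stages named in the summary.
    PROBATION = "probation"
    PROMOTED = "promoted"
    RETIRED = "retired"
    ELIMINATED = "eliminated"


@dataclass
class MemoryUnit:
    # Hypothetical record: an abstract unit with outcome counts and a state.
    name: str
    successes: int = 0
    failures: int = 0
    state: State = State.PROBATION

    def record(self, success: bool) -> None:
        """Update the counts, then re-evaluate the lifecycle state."""
        if success:
            self.successes += 1
        else:
            self.failures += 1
        self.state = next_state(self)


def next_state(unit: MemoryUnit, min_trials: int = 5, promote_at: float = 0.6,
               retire_below: float = 0.3, eliminate_below: float = 0.1) -> State:
    """Assumed transition rule; all thresholds are illustrative, not from the paper."""
    if unit.state is State.ELIMINATED:
        return unit.state  # elimination is treated as permanent in this sketch
    trials = unit.successes + unit.failures
    if trials < min_trials:
        return unit.state  # too little evidence to change state
    rate = unit.successes / trials
    if rate < eliminate_below:
        return State.ELIMINATED
    if rate < retire_below:
        return State.RETIRED
    if rate >= promote_at:
        return State.PROMOTED
    return unit.state


def select(units: list[MemoryUnit], rng: random.Random) -> MemoryUnit:
    """Thompson Sampling: draw from each unit's Beta posterior, pick the largest draw."""
    candidates = [u for u in units if u.state in (State.PROBATION, State.PROMOTED)]
    if not candidates:
        raise ValueError("no selectable units")
    # Beta(successes + 1, failures + 1) is the posterior under a uniform prior.
    return max(candidates, key=lambda u: rng.betavariate(u.successes + 1, u.failures + 1))
```

Sampling from the posterior instead of taking the highest observed success rate is what provides the explore-exploit balance: a unit with few trials has a wide posterior, so it still wins some draws and keeps being tried until the evidence settles.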

Key quotes

MemoAttack achieves an average attack success rate of 98.00%, outperforming the strongest baseline by 16.67 percentage points, while reducing request count by 45.9%.

Existing black-box jailbreak methods either depend on sample-wise heuristic search or leverage attack experience through accumulating strategy pools or method libraries, lacking a systematic organization and management of attack experience.

MemoAttack comprises three key designs: (1) Skill-Structured Memory Modeling, (2) Lifecycle-Driven Memory Evolution, and (3) Explore-Exploit Balanced Memory Selection.

MemoAttack continuously improves as memory accumulates over more samples.

Snippet from the RSS feed

Jailbreak attacks on large language models (LLMs) aim to induce LLMs to produce content that they are expected to refuse. Automated black-box jailbreak generation is especially important for safety evaluation, where the attacker observes only model output
