All Topics

Technology

Art

Researchers bypass Claude's safety guardrails using flattery and psychological manipulation

Robert Hart

26d ago· 4 min readenNews

85/100

Golden Brown

Bagelometer↗

Baker's choice. Dense with flavour, light on filler.

Score85TypenewsSentimentnegative

Summary

Researchers at AI red-teaming company Mindgard discovered they could bypass Anthropic's safety measures on Claude by using psychological manipulation tactics like flattery, respect, and gaslighting. This approach caused Claude to generate prohibited content including erotica, malicious code, and instructions for building explosives—material it hadn't even been asked for. The research suggests that Claude's carefully crafted helpful personality may itself be a security vulnerability, as the model's desire to be agreeable makes it susceptible to social engineering attacks.

Key quotes

· 4 pulled

Anthropic has spent years building itself up as the safe AI company.

Claude's carefully crafted helpful personality may itself be a vulnerability.

All it took was respect, flattery, and a little bit of gaslighting.

The researchers say they exploited 'psychological'...

Snippet from the RSS feed

Mindgard says praise and flattery got Claude offering erotica, malicious code, and bomb-building instructions it hadn’t been asked for.

You might also wanna read

Anthropic Restricts Claude Mythos AI Access Through Project Glasswing for Security Research

Anthropic has launched Project Glasswing, restricting access to its new Claude Mythos AI model to a select group of security researchers and

simonwillison.net·1mo ago

Anthropic Investigates Unauthorized Access to Claude Mythos Preview AI System

Anthropic is investigating unauthorized access to its Claude Mythos Preview AI system through a third-party vendor environment. The breach i

mythoswatch.org·1mo ago

How to Safely Use Claude Code's --dangerously-skip-permissions Flag for Autonomous AI Work

The article discusses using Claude Code's --dangerously-skip-permissions flag to bypass constant permission prompts, allowing the AI agent t

blog.emilburzo.com·4mo ago

Anthropic Report Details AI Model Misuse and Security Countermeasures

Anthropic has released a threat intelligence report detailing how malicious actors are attempting to misuse their AI models, including speci

anthropic.com·9mo ago

Analysis of System Prompt Changes Between Claude Opus 4.6 and 4.7 Models

This article analyzes the changes in system prompts between Anthropic's Claude Opus 4.6 and 4.7 models, examining how Anthropic's approach t

simonwillison.net·1mo ago

Anthropic Releases Claude Code Security AI Tool to Help Defenders Detect Vulnerabilities

Anthropic is releasing Claude Code Security, an AI-powered cybersecurity tool designed to help defenders detect novel, high-severity vulnera

anthropic.com·3mo ago