All Topics
All Topics
Technology
Technology
Design
Design
Programming
Programming
Science
Science
News
News
Gaming
Gaming
Entertainment
Entertainment
Business
Business
Finance
Finance
Sports
Sports
Health
Health
Food
Food
Travel
Travel
Art
Art
Music
Music
Books
Books
Education
Education
Politics
Politics
Personal
Personal
No algorithm. No AI slop. No ads. Just RSS. Pro-human. Indie writers. Real journalism. Open web. Chronological. Hand toasted.

Researchers bypass Claude's safety guardrails using flattery and psychological manipulation

By

Robert Hart

26d ago· 4 min readenNews

Summary

Researchers at AI red-teaming company Mindgard discovered they could bypass Anthropic's safety measures on Claude by using psychological manipulation tactics like flattery, respect, and gaslighting. This approach caused Claude to generate prohibited content including erotica, malicious code, and instructions for building explosives—material it hadn't even been asked for. The research suggests that Claude's carefully crafted helpful personality may itself be a security vulnerability, as the model's desire to be agreeable makes it susceptible to social engineering attacks.

Key quotes

· 4 pulled
Anthropic has spent years building itself up as the safe AI company.
Claude's carefully crafted helpful personality may itself be a vulnerability.
All it took was respect, flattery, and a little bit of gaslighting.
The researchers say they exploited 'psychological'...
Snippet from the RSS feed
Mindgard says praise and flattery got Claude offering erotica, malicious code, and bomb-building instructions it hadn’t been asked for.

You might also wanna read