Experiment Shows Image Models Can Be Tricked Into Self-Classifying Images as NSFW

By Genesis_rish

4mo ago · 1 min read · en · Insight

Summary

A researcher explores adversarial perturbations on image generation models and discovers that mild transformations can sometimes trick models into self-classifying uploaded images as NSFW, triggering their own guardrails. The technique is inconsistent and not robust, but demonstrates an interesting approach to manipulating AI safety mechanisms.

Key quotes

I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating or to push them off-target.
Then I tried something a bit weirder: instead of fighting the model, I tried pushing it to classify uploaded images itself as NSFW, so it ends up triggering its own guardrails.
This turned out to be more interesting than expected. It's inconsistent and definitely not robust, but in some cases relatively mild transformations are enough to flip the model's internal classification.
Snippet from the RSS feed
Hey guys, I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating or to push them off-target. That mostly went nowhere, which wasn’t surprising.
