Experiment Shows Image Models Can Be Tricked Into Self-Classifying Images as NSFW
By Genesis_rish
Has the shape of a bagel but none of the steam.
Summary
A researcher explores adversarial perturbations on image generation models and discovers that mild transformations can sometimes trick models into self-classifying uploaded images as NSFW, triggering their own guardrails. The technique is inconsistent and not robust, but demonstrates an interesting approach to manipulating AI safety mechanisms.
Key quotes
I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating or to push them off-target.
Then I tried something a bit weirder: instead of fighting the model, I tried pushing it to classify uploaded images itself as NSFW, so it ends up triggering its own guardrails.
This turned out to be more interesting than expected. It's inconsistent and definitely not robust, but in some cases relatively mild transformations are enough to flip the model's internal classification.
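The quotes do not say how "mild" was measured. One common way to put a number on distortion is the largest per-pixel change together with the peak signal-to-noise ratio (PSNR). The sketch below is only an illustration under stated assumptions: it is not the researcher's method, the bounded random noise is a stand-in for whatever transformations were actually used, the `original` pixel list is a hypothetical placeholder for a real image, and no model or classifier is involved.

```python
import math
import random

def perturb(pixels, epsilon, seed=0):
    """Add integer noise in [-epsilon, epsilon] to each 8-bit value, clipped to 0..255."""
    rng = random.Random(seed)
    return [min(255, max(0, p + rng.randint(-epsilon, epsilon))) for p in pixels]

def linf(a, b):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(x - y) for x, y in zip(a, b))

def psnr(a, b):
    """Peak signal-to-noise ratio in dB for 8-bit data; higher means less distortion."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10 * math.log10(255 ** 2 / mse)

# Hypothetical stand-in for an image: a flat list of 8-bit grayscale values.
original = [(i * 7) % 256 for i in range(64 * 64)]

# Compare a few noise budgets (assumed values, chosen only for illustration).
for eps in (2, 8, 32):
    changed = perturb(original, eps)
    print(f"eps={eps}  max change={linf(original, changed)}  PSNR={psnr(original, changed):.1f} dB")
```

A larger budget lowers the PSNR, so reporting both numbers alongside whether a model refused would make a claim like "relatively mild" comparable across experiments.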
You might also wanna read
Open-Weight AI Video Models Enable Non-Consensual Deepfake Imagery, Study Finds
This paper analyzes how AI video generation models in 2025 are following the same harmful patterns seen with AI image generators in 2022. It
AI safety guardrails removed from Meta and Google models in minutes, research finds
The article reports on research showing that safety guardrails designed to prevent AI models from generating harmful content can be easily s

AI-Generated Nonconsensual Nudes Highlight Internet's Content Paradox
The article highlights the paradoxical state of the internet where certain sexual content is being restricted, while AI-generated nonconsens

AI Image Generators Improve Realism Through Controlled Quality Degradation
AI image generators are improving their ability to create realistic fakes by intentionally degrading image quality slightly, making it harde
Unrestricted open-weight AI models raise safety concerns as they become more accessible
The article discusses the growing accessibility of open-weight AI models that lack safety guardrails, allowing users to generate harmful con
4chan Users Collaborate to Create Nonconsensual Deepfake Nudes of Women via AI Tools
The article reports on how 4chan users are collaborating to create nonconsensual explicit deepfakes ("nudifying" photos of women) through AI
