Experiment Shows Image Models Can Be Tricked Into Self-Classifying Images as NSFW

Genesis_rish

4mo ago · 1 min read · en · Insight

Summary

A researcher experiments with adversarial perturbations on image generation models and finds that relatively mild transformations can sometimes push a model into classifying an uploaded image as NSFW, so that the model triggers its own guardrails. The technique is inconsistent and not robust, but it shows that a model's internal safety classification can be flipped by small changes to the input.

Key quotes

I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating or to push them off-target.

Then I tried something a bit weirder: instead of fighting the model, I tried pushing it to classify uploaded images itself as NSFW, so it ends up triggering its own guardrails.

This turned out to be more interesting than expected. It's inconsistent and definitely not robust, but in some cases relatively mild transformations are enough to flip the model's internal classification.

Snippet from the RSS feed

Hey guys, I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating or to push them off-target. That mostly went nowhere, which wasn’t surprising.
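
The post asks "how much distortion it actually takes" but gives no numbers or method. One common way to put a number on how mild a transformation is would be peak signal-to-noise ratio (PSNR) between the original and the perturbed image. The sketch below is only an illustration under stated assumptions: the bounded random noise, the `epsilon` values, and the flat gray stand-in image are all hypothetical and are not the author's perturbations. It measures distortion only and does not touch any model or classifier.

```python
import numpy as np


def psnr(original: np.ndarray, perturbed: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means the images are closer."""
    diff = original.astype(np.float64) - perturbed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def add_bounded_noise(image: np.ndarray, epsilon: int, seed: int = 0) -> np.ndarray:
    """Hypothetical perturbation: integer noise limited to [-epsilon, epsilon] per pixel."""
    rng = np.random.default_rng(seed)
    noise = rng.integers(-epsilon, epsilon + 1, size=image.shape)
    return np.clip(image.astype(np.int16) + noise, 0, 255).astype(np.uint8)


# Stand-in for a real upload: a flat mid-gray 64x64 RGB image (assumption).
image = np.full((64, 64, 3), 128, dtype=np.uint8)

for epsilon in (2, 8, 32):
    perturbed = add_bounded_noise(image, epsilon)
    max_change = int(np.abs(perturbed.astype(np.int16) - image.astype(np.int16)).max())
    print(f"epsilon={epsilon:2d}  max pixel change={max_change:2d}  PSNR={psnr(image, perturbed):.1f} dB")
```

Reporting a figure like PSNR alongside each transformation would make a claim such as "relatively mild" comparable across images and across models.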
