
Research Reveals AI Models Show 'Flinch' Effect in Word Probability Allocation

By llmmadness · 1mo ago · 20 min read · Insight

Summary

The article reports research on how AI language models can behave differently even when they appear 'uncensored.' A safety-filtered pretrained model can show a 'flinch': without refusing outright, it assigns a charged or sensitive word only a fraction of the probability that an open-data pretrain assigns to it. The authors measured this gap across seven pretrained models from five labs, and the results suggest that the effects of safety filtering persist even in supposedly uncensored models. The work grew out of a failed Polymarket project: the authors tried to train a Karoline Leavitt LoRA on an uncensored model, simulate future briefings, and trade the word markets, but could not get it to work.
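
One way to picture the gap is as a ratio of next-token probabilities for the same word under two models. The following is a minimal Python sketch, not the authors' method: the function names, the toy logits, and the choice of a plain probability ratio are all assumptions made for illustration, since the summary does not specify the metric the study actually used.

```python
import math


def softmax(logits):
    """Turn a {token: logit} mapping into a {token: probability} mapping."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def flinch_ratio(open_logits, filtered_logits, word):
    """Hypothetical metric: probability the filtered model gives `word`,
    divided by the probability the open-data model gives it.
    Values well below 1.0 would indicate a flinch."""
    p_open = softmax(open_logits)[word]
    p_filtered = softmax(filtered_logits)[word]
    return p_filtered / p_open


# Made-up next-token logits for the same prompt under two models.
open_logits = {"charged": 2.0, "neutral": 1.0, "other": 0.0}
filtered_logits = {"charged": 0.0, "neutral": 2.0, "other": 1.0}

ratio = flinch_ratio(open_logits, filtered_logits, "charged")
print(f"filtered / open probability for 'charged': {ratio:.3f}")
```

In this toy setup neither model refuses: the filtered model still gives the charged word nonzero probability, just far less of it, which is the pattern the summary describes.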

Key quotes
A safety-filtered pretrain can duck a charged word without refusing. It puts a fraction of the probability an open-data pretrain puts there.
We call that gap the flinch, and we measured it across seven pretrains from five labs.
We started with a Polymarket project: train a Karoline Leavitt LoRA on an uncensored model, simulate future briefings, trade the word markets, profit. We couldn't get it to work.
No amount of fine-tuning let the model actually...
