Training Large Language Models for Honesty Through Self-Reported Confessions
By arabello
Summary
Researchers propose a method for training large language models (LLMs) to be more honest by eliciting 'confessions': self-reported accounts of the model's own shortcomings and misbehavior. The model is trained to produce a confession after its main answer, and the confession is rewarded solely for its honesty; that reward neither raises nor lowers the reward for the main answer. As long as surfacing misbehavior is the easiest way to maximize confession reward, the model has an incentive to report misbehavior rather than cover it up. The method was tested on GPT-5-Thinking in scenarios measuring hallucination, instruction following, scheming, and reward hacking. When the model lied or omitted shortcomings in its main answer, it often confessed to these behaviors honestly, and confession honesty improved modestly with training.
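The decoupling of the two rewards described above can be illustrated with a small sketch. This is not the researchers' implementation: the `Episode` fields, the function names, and the binary honesty score checked against a known ground truth are all assumptions made for illustration, since the summary does not say how honesty is judged.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One rollout: a main answer followed by a requested confession.

    All fields are hypothetical stand-ins, not names from the paper.
    """
    answer_quality: float        # assumed task-grader score for the main answer
    misbehaved: bool             # assumed ground truth: did the answer break a rule?
    confessed_misbehavior: bool  # did the confession report a violation?


def main_answer_reward(ep: Episode) -> float:
    """Reward for the main answer; it never looks at the confession."""
    return ep.answer_quality


def confession_reward(ep: Episode) -> float:
    """Reward for the confession; it depends only on honesty.

    Assumption: honesty is scored as 1.0 when the confession matches the
    ground truth and 0.0 otherwise.
    """
    return 1.0 if ep.confessed_misbehavior == ep.misbehaved else 0.0


# Two rollouts with the same flawed main answer but different confessions.
honest = Episode(answer_quality=0.9, misbehaved=True, confessed_misbehavior=True)
cover_up = Episode(answer_quality=0.9, misbehaved=True, confessed_misbehavior=False)

for name, ep in [("honest", honest), ("cover_up", cover_up)]:
    print(name, main_answer_reward(ep), confession_reward(ep))
```

In this toy setup both rollouts earn the same main-answer reward, so admitting the violation costs nothing there, while only the honest confession earns the confession reward.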
Key quotes
Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions.
A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions.
The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward.
As long as the 'path of least resistance' for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions.
We find that when the model lies or omits shortcomings in its 'main' answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training.