All Topics
All Topics
Technology
Technology
Design
Design
Programming
Programming
Science
Science
News
News
Gaming
Gaming
Entertainment
Entertainment
Business
Business
Finance
Finance
Sports
Sports
Health
Health
Food
Food
Travel
Travel
Art
Art
Music
Music
Books
Books
Education
Education
Politics
Politics
Personal
Personal
No algorithm. No AI slop. No ads. Just RSS. Pro-human. Indie writers. Real journalism. Open web. Chronological. Hand toasted.

Training Large Language Models for Honesty Through Self-Reported Confessions

By

arabello

5mo ago· 2 min readenInsight

Summary

Researchers propose a novel method to train large language models (LLMs) to be more honest by eliciting 'confessions' - self-reported accounts of their shortcomings and misbehavior. The approach involves training models to produce confessions after their main answers, with rewards based solely on confession honesty rather than the quality of the main response. This creates incentives for models to honestly surface misbehavior rather than cover it up. The method was tested on GPT-5-Thinking in scenarios measuring hallucination, instruction following, scheming, and reward hacking, showing that models often confess to lies or omissions in their main answers, with confession honesty improving with training.

Key quotes

· 5 pulled
Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions.
A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions.
The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward.
As long as the 'path of least resistance' for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions.
We find that when the model lies or omits shortcomings in its 'main' answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training.
Snippet from the RSS feed
Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinfor

You might also wanna read