
Study Finds AI Discourse in Pretraining Data Creates Self-Fulfilling (Mis)alignment in LLMs

By anigbrowl

13d ago · 2 min read · en · Insight

Summary

This research paper presents the first controlled study of how discourse about AI systems in pretraining corpora causally influences downstream alignment in LLMs. The authors pretrained 6.9B-parameter models with varying amounts of (mis)alignment discourse and found that upsampling synthetic training documents about AI misalignment increases misaligned behavior, while upsampling documents about aligned behavior reduces misalignment scores from 45% to 9%. The effects persist through post-training, establishing "alignment pretraining" as a complement to post-training alignment techniques.

Key quotes (5 pulled)

We find that discussion of AI contributes to misalignment.
Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour.
Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%.
Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training.
We recommend practitioners consider pretraining for alignment alongside capabilities.
Snippet from the RSS feed
Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise cor
