Study Finds AI Discourse in Pretraining Data Creates Self-Fulfilling (Mis)alignment in LLMs
By anigbrowl
Summary
This research paper presents the first controlled study of how pretraining corpora containing discourse about AI systems causally influence downstream alignment in LLMs. By pretraining 6.9B-parameter models with varying amounts of (mis)alignment discourse, the authors found that upsampling synthetic training documents about AI misalignment increases misaligned behavior, while upsampling documents about aligned behavior reduces misalignment scores from 45% to 9%. The effects persist through post-training, establishing "alignment pretraining" as a complement to post-training alignment techniques.
Key quotes
We find that discussion of AI contributes to misalignment.
Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour.
Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%.
Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training.
We recommend practitioners consider pretraining for alignment alongside capabilities.