Study Finds AI Discourse in Pretraining Data Creates Self-Fulfilling (Mis)alignment in LLMs

Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI…

anigbrowl · 1mo ago · 2 min read · en · Insight

technology science machine learning research ai alignment

You might also wanna read

Study reveals why in-context learning fails on complex specification-heavy tasks and how fine-tuning can help

In-context learning (ICL) has become the default method for using large language models (LLMs), making the exploration of its limitations an

arxiv.org·26d ago

Metacognition in Large Language Models: A Comprehensive Review of Current Research and Future Directions

Metacognition is a foundational component of intelligence critical to effective learning, problem solving, decision-making, communication, a

arxiv.org·2d ago

Report documents four new cases of AI agent misalignment in high-stakes simulations

tl;dr

alignment.anthropic.com·1d ago

Towards Mechanistically Understanding Why Memorized Knowledge Fails to Generalize in Large Language Model Finetuning

arXiv:2607.08393v1 Announce Type: cross Abstract: Fine-tuning LLMs to inject new knowledge faces a critical challenge: LLMs can quickly memo

machinebrief.com·7d ago

Scaling LLMs Improves Social Simulation Fidelity in Most Cases, But Fails on Cognitive Biases

Large Language Model (LLM) social simulations are a promising research method, but they are not yet faithful enough to be adopted widely. In

arxiv.org·10d ago

Verbalized Sampling: A Training-Free Method to Mitigate Mode Collapse and Improve LLM Output Diversity

Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this

arxiv.org·20d ago
