Research on Hallucination-Associated Neurons in Large Language Models: Identification, Impact, and Origins
By bilsbie
Summary
This research paper investigates hallucination-associated neurons (H-Neurons) in large language models, examining how they can be identified, how they affect behavior, and where they come from. The study finds that a remarkably sparse subset of neurons (less than 0.1% of total neurons) reliably predicts when hallucinations occur, and controlled interventions show that these neurons are causally linked to over-compliance behaviors. Traced back to the pre-trained base models, the same neurons remain predictive for hallucination detection, which indicates that they emerge during pre-training. The authors present these results as a bridge between macroscopic behavioral patterns and microscopic neural mechanisms.
Key quotes
Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability.
We demonstrate that a remarkably sparse subset of neurons (less than 0.1% of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios.
Controlled interventions reveal that these neurons are causally linked to over-compliance behaviors.
We trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training.
Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
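To make the sparse-subset claim concrete, here is a minimal toy sketch in Python. It is not the paper's method, since the summary does not describe the identification procedure. It assumes synthetic Gaussian "activations" in which two planted neurons out of 4000 (0.05%) shift on samples labeled as hallucinated. The selection rule (largest gap in class means) and all names (`make_toy_activations`, `fit_sparse_probe`, `SIGNAL`) are hypothetical and chosen for illustration only.

```python
import random

def make_toy_activations(n_samples, n_neurons, signal_neurons, shift=3.0, seed=0):
    """Synthetic data (an assumption, not real model activations):
    every neuron is N(0, 1) noise, and only `signal_neurons` are shifted
    upward on samples labeled 1 ("hallucinated")."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the classes are balanced
        row = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
        if label == 1:
            for j in signal_neurons:
                row[j] += shift
        X.append(row)
        y.append(label)
    return X, y

def mean_gap_per_neuron(X, y):
    """Per-neuron difference between class-1 and class-0 mean activation."""
    n_neurons = len(X[0])
    sums = [[0.0] * n_neurons, [0.0] * n_neurons]
    counts = [0, 0]
    for row, label in zip(X, y):
        counts[label] += 1
        acc = sums[label]
        for j, v in enumerate(row):
            acc[j] += v
    return [sums[1][j] / counts[1] - sums[0][j] / counts[0]
            for j in range(n_neurons)]

def probe_score(row, neurons, signs):
    """Signed sum of the selected neurons' activations."""
    return sum(s * row[j] for j, s in zip(neurons, signs))

def fit_sparse_probe(X, y, k):
    """Keep the k neurons with the largest absolute mean gap, then place the
    decision threshold midway between the two class-mean scores."""
    gaps = mean_gap_per_neuron(X, y)
    neurons = sorted(range(len(gaps)), key=lambda j: abs(gaps[j]), reverse=True)[:k]
    signs = [1.0 if gaps[j] > 0 else -1.0 for j in neurons]
    scores = [probe_score(row, neurons, signs) for row in X]
    mean1 = sum(s for s, l in zip(scores, y) if l == 1) / y.count(1)
    mean0 = sum(s for s, l in zip(scores, y) if l == 0) / y.count(0)
    return neurons, signs, (mean0 + mean1) / 2.0

def accuracy(X, y, neurons, signs, threshold):
    """Fraction of samples whose thresholded probe score matches the label."""
    hits = sum(int(probe_score(row, neurons, signs) > threshold) == label
               for row, label in zip(X, y))
    return hits / len(y)

N_NEURONS = 4000
SIGNAL = (3, 17)  # planted neurons, hypothetical indices

X, y = make_toy_activations(300, N_NEURONS, SIGNAL)
X_train, y_train = X[:200], y[:200]   # fit the probe on the first 200 samples
X_test, y_test = X[200:], y[200:]     # evaluate on the held-out 100

neurons, signs, threshold = fit_sparse_probe(X_train, y_train, k=2)
test_acc = accuracy(X_test, y_test, neurons, signs, threshold)
print("selected neurons:", sorted(neurons))
print("fraction of neurons used:", len(neurons) / N_NEURONS)
print("held-out accuracy:", test_acc)
```

The sketch only shows that a tiny fraction of units can carry a predictive signal when that signal is planted by construction. It makes no claim about how such neurons look in a real model, and it does not reproduce the causal interventions described in the quotes above.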
