Research: LLMs Encode Human-Labeled Problem Difficulty Better Than Model-Derived Difficulty
By stansApprentice
Summary
This research paper investigates whether large language models (LLMs) internally encode problem difficulty in alignment with human judgment. The study trains linear probes across layers and token positions on 60 models using mathematical and coding subsets of Easy2HardBench. Findings show human-labeled difficulty is strongly linearly decodable and scales with model size, while LLM-derived difficulty is weaker and scales poorly. Steering models toward "easier" representations reduces hallucination and improves accuracy. During GRPO training, human-difficulty probes strengthen and correlate positively with test accuracy, while LLM-difficulty probes degrade and correlate negatively, suggesting human annotations provide stable difficulty signals that reinforcement learning amplifies.
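To make the probing setup concrete, here is a minimal sketch of a per-layer linear probe. It is not the paper's code: it assumes a ridge-regression probe, assumes the reported ρ is a rank correlation such as Spearman's, and uses synthetic activations in place of real hidden states. The names `fit_ridge_probe`, `probe_layer`, and `spearman_rho` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation (no tie handling; fine for continuous values)."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def fit_ridge_probe(X, y, alpha=1.0):
    """Closed-form ridge regression on centred data; returns (weights, bias)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def probe_layer(X_train, y_train, X_test, y_test, alpha=1.0):
    """Fit a probe on one layer's activations and score it on held-out problems."""
    w, b = fit_ridge_probe(X_train, y_train, alpha)
    return spearman_rho(X_test @ w + b, y_test)

# Synthetic stand-in for hidden states: 400 problems, 32-dimensional activations.
rng = np.random.default_rng(0)
n, d, split = 400, 32, 300
difficulty = rng.normal(size=n)          # assumed difficulty labels
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)   # hidden "difficulty direction"

# "Layer A" embeds difficulty linearly plus noise; "layer B" is pure noise.
layer_a = 3.0 * np.outer(difficulty, direction) + rng.normal(size=(n, d))
layer_b = rng.normal(size=(n, d))

rho_a = probe_layer(layer_a[:split], difficulty[:split], layer_a[split:], difficulty[split:])
rho_b = probe_layer(layer_b[:split], difficulty[:split], layer_b[split:], difficulty[split:])
print(f"layer A rho = {rho_a:.2f}, layer B rho = {rho_b:.2f}")
```

Repeating `probe_layer` over every layer and token position, and comparing the held-out ρ for human-labeled versus model-derived difficulty labels, would give the kind of comparison the summary describes.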
Key quotes
Large language models exhibit a puzzling inconsistency: they solve complex problems yet frequently fail on seemingly simpler ones.
We find that human-labeled difficulty is strongly linearly decodable (AMC: $ρ≈0.88$) and exhibits clear model-size scaling, whereas LLM-derived difficulty is substantially weaker and scales poorly.
Steering along the difficulty direction reveals that pushing models toward 'easier' representations reduces hallucination and improves accuracy.
During GRPO training on Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and positively correlates with test accuracy across training steps, while the LLM-difficulty probe degrades and negatively correlates with performance.
These results suggest that human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates derived from model performance become misaligned precisely as models improve.
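The steering result quoted above can be illustrated with a small sketch. It assumes steering means adding a scaled copy of the unit-normalised probe direction to the hidden states, and that a higher probe output means "harder", so a negative strength pushes toward "easier". The paper's exact procedure may differ; `steer` is a hypothetical helper, and the random arrays stand in for real activations and a trained probe.

```python
import numpy as np

def steer(hidden, probe_w, strength):
    """Shift hidden states along the unit-normalised probe direction."""
    unit = probe_w / np.linalg.norm(probe_w)
    return hidden + strength * unit

rng = np.random.default_rng(0)
probe_w = rng.normal(size=16)       # stand-in for trained probe weights
hidden = rng.normal(size=(5, 16))   # stand-in for 5 token activations

# Negative strength: move toward "easier" under the assumed sign convention.
steered = steer(hidden, probe_w, strength=-2.0)

# The probe readout of every token drops by strength * ||probe_w||.
readout_shift = (steered - hidden) @ probe_w
print(readout_shift)
```

In a real model this shift would be applied inside the forward pass at a chosen layer, for example through a forward hook, and the effect on accuracy or hallucination measured on the generated outputs.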