Research: LLMs Encode Human-Labeled Problem Difficulty Better Than Model-Derived Difficulty

stansApprentice

6mo ago · 2 min read · en · Insight

Summary

This research paper investigates whether large language models (LLMs) internally encode problem difficulty in a way that aligns with human judgment. The study trains linear probes across layers and token positions on 60 models, using the mathematical and coding subsets of Easy2HardBench. Human-labeled difficulty turns out to be strongly linearly decodable (ρ ≈ 0.88 on AMC) and scales with model size, while LLM-derived difficulty is substantially weaker and scales poorly. Steering models toward "easier" representations reduces hallucination and improves accuracy. During GRPO training of Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and correlates positively with test accuracy, while the LLM-difficulty probe degrades and correlates negatively. The authors take this to suggest that human annotations provide a stable difficulty signal that reinforcement learning amplifies, whereas difficulty estimates derived from model performance drift out of alignment as models improve.
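To make "linearly decodable" concrete, here is a minimal sketch of a linear probe on synthetic data. It is not the paper's setup: the toy "hidden states", the hidden difficulty direction, the gradient-descent ridge fit, and the choice of Spearman rank correlation as the metric are all assumptions made for illustration. The summary reports ρ without saying which correlation it is.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_linear_probe(X, y, lr=0.05, epochs=300, l2=1e-3):
    """Fit y ~ w.x + b by full-batch gradient descent on ridge-regularised MSE."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = dot(w, x) + b - t
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(a, b):
    """Spearman rho = Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

def make_toy_data(n, direction, rng, noise=0.5):
    """Synthetic stand-in for (hidden state, difficulty label) pairs."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in direction]
        X.append(x)
        y.append(dot(direction, x) + rng.gauss(0, noise))
    return X, y

rng = random.Random(0)
d = 8
# Hypothetical "difficulty direction" hidden in the toy activations.
true_direction = [(-1) ** j * (1 + j / d) for j in range(d)]
X_train, y_train = make_toy_data(200, true_direction, rng)
X_test, y_test = make_toy_data(100, true_direction, rng)

w, b = fit_linear_probe(X_train, y_train)
preds = [dot(w, x) + b for x in X_test]
rho = spearman(preds, y_test)
print(f"held-out Spearman rho: {rho:.2f}")
```

In a real experiment `X_train` would hold activations taken from one layer and token position of a model, and the labels would be human-assigned or model-derived difficulty scores. Repeating the fit per layer and per model is what lets probe quality be compared across depth and model size.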

Key quotes

Large language models exhibit a puzzling inconsistency: they solve complex problems yet frequently fail on seemingly simpler ones.

We find that human-labeled difficulty is strongly linearly decodable (AMC: $\rho \approx 0.88$) and exhibits clear model-size scaling, whereas LLM-derived difficulty is substantially weaker and scales poorly.

Steering along the difficulty direction reveals that pushing models toward 'easier' representations reduces hallucination and improves accuracy.

During GRPO training on Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and positively correlates with test accuracy across training steps, while the LLM-difficulty probe degrades and negatively correlates with performance.

These results suggest that human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates derived from model performance become misaligned precisely as models improve.
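The steering result quoted above can be illustrated with a small sketch of what "pushing toward easier" might mean geometrically. It assumes the steering direction is the probe's weight vector and that steering is a plain additive shift of one hidden state. The layer, coefficient, and exact procedure used in the paper are not given here, so the names `steer`, `probe_score`, and `alpha` are hypothetical.

```python
import math

def unit(v):
    """Normalise v to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [a / norm for a in v]

def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the unit difficulty direction.

    Negative alpha moves toward 'easier', positive toward 'harder'
    (assuming the probe assigns higher scores to harder problems).
    """
    u = unit(direction)
    return [h + alpha * ui for h, ui in zip(hidden, u)]

def probe_score(hidden, w, b=0.0):
    """Linear probe readout: predicted difficulty for one hidden state."""
    return sum(wi * hi for wi, hi in zip(w, hidden)) + b

w = [3.0, 0.0, -4.0]   # hypothetical probe weights (norm 5)
h = [1.0, 2.0, -1.0]   # hypothetical hidden state

before = probe_score(h, w)
h_easy = steer(h, w, alpha=-2.0)
after = probe_score(h_easy, w)

# The readout moves by alpha * ||w||, here -2 * 5 = -10.
print(before, after)
```

In an actual model the shift would be applied to the residual stream during the forward pass and generation would continue from the modified state. The sketch only shows that moving along the probe direction changes the decoded difficulty by `alpha` times the weight norm.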
