LEVANTE-bench: Benchmark Reveals Partial Alignment Between Vision-Language Models and Children's Cognitive Abilities
[Submitted on 3 Jun 2026]
Summary
The article introduces LEVANTE-bench, a benchmark for comparing vision-language models (VLMs) with children's cognitive development. Based on data from the Learning Variability Network (LEVANTE), it assesses VLMs on six cognitive tasks and compares their performance against children aged 5-12 (N=1547) across three countries. Key findings show that alignment between VLMs and children is heterogeneous: more capable models align better with humans at task and item levels, but error distribution matching varies widely across tasks. Smaller models sometimes matched younger children's errors better, and even top-performing VLMs struggled with matrix reasoning and mental rotation tasks, indicating only partial alignment with children's cognitive abilities.
Key quotes
Alignment was heterogeneous across scales: at the level of tasks and items, more capable models aligned better with humans.
However, match to human error distributions varied widely across tasks, and for several tasks, smaller models matched younger children's errors better.
In addition, even the best-performing VLMs struggled on matrix reasoning and mental rotation tasks.
Thus, current VLM architectures align only partially with the cognitive abilities of children.
Given the inherently multimodal nature of human experience, vision-language models (VLMs) hold substantial promise for modeling human cognition as it grows and develops with experience.
