BabyVision Benchmark Reveals MLLMs Fail at Basic Visual Tasks That 3-Year-Olds Can Solve
By
[Submitted on 10 Jan 2026]
Summary
This paper introduces BabyVision, a benchmark designed to assess core visual reasoning abilities in Multimodal LLMs (MLLMs) independent of linguistic knowledge. The benchmark contains 388 items across 22 subclasses and 4 categories, testing basic visual tasks that even 3-year-old humans can solve. Results show that state-of-the-art MLLMs like Gemini3-Pro-Preview score only 49.7, far below the average adult score of 94.1 and even below 6-year-old humans, revealing that current MLLMs lack fundamental visual primitives despite excelling in knowledge-heavy evaluations. The authors also propose BabyVision-Gen, a generative approach to visual reasoning, and release their code and benchmark data publicly.
Source
Key quotes
· 5 pulledWhile humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding.
We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly.
Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1.
Despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives.
Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities.
You might also wanna read
LEVANTE-bench: Benchmark Reveals Partial Alignment Between Vision-Language Models and Children's Cognitive Abilities
The article introduces LEVANTE-bench, a benchmark for comparing vision-language models (VLMs) with children's cognitive development. Based o
DatBench: A New Framework for More Faithful and Efficient Vision-Language Model Evaluation
The article introduces DatBench, a new evaluation framework for vision-language models (VLMs) that addresses critical issues in current eval
ClinHallu: A New Benchmark for Diagnosing Hallucination Sources in Medical AI Reasoning
This paper introduces ClinHallu, a benchmark designed to diagnose stage-wise hallucinations in medical multimodal large language models (MLL
BilliardPhys-Bench: New Benchmark Reveals Physical Reasoning Gaps in Multimodal AI Models
This paper introduces BilliardPhys-Bench, a benchmark designed to evaluate multimodal large language models (MLLMs) on intuitive physical re
Using Vision-Language Models to Segment Robot Demonstration Videos into Subtask Annotations
This article presents a benchmark and field report on using Vision-Language Models (VLMs) to segment robot demonstration videos and egocentr
New Benchmark Uses Esoteric Programming Languages to Evaluate LLM Reasoning Abilities
Researchers introduce EsoLang-Bench, a new benchmark for evaluating large language models (LLMs) using esoteric programming languages like B

Comments
Sign in to join the conversation.
No comments yet. Be the first.