BabyVision Benchmark Reveals MLLMs Fail at Basic Visual Tasks That 3-Year-Olds Can Solve

[Submitted on 10 Jan 2026]

11d ago· 2 min readenInsight

technology science artificial intelligence computer vision

Summary

This paper introduces BabyVision, a benchmark designed to assess core visual reasoning abilities in Multimodal LLMs (MLLMs) independent of linguistic knowledge. The benchmark contains 388 items across 22 subclasses and 4 categories, testing basic visual tasks that even 3-year-old humans can solve. Results show that state-of-the-art MLLMs like Gemini3-Pro-Preview score only 49.7, far below the average adult score of 94.1 and even below 6-year-old humans, revealing that current MLLMs lack fundamental visual primitives despite excelling in knowledge-heavy evaluations. The authors also propose BabyVision-Gen, a generative approach to visual reasoning, and release their code and benchmark data publicly.

Source

Twitter / XBabyVision Benchmark Reveals MLLMs Fail at Basic Visual Tasks That 3-Year-Olds Can Solvearxiv.org

Key quotes

· 5 pulled

While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding.

We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly.

Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1.

Despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives.

Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities.

Snippet from the RSS feed

While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state-of-the-art MLLMs

You might also wanna read

LEVANTE-bench: Benchmark Reveals Partial Alignment Between Vision-Language Models and Children's Cognitive Abilities

The article introduces LEVANTE-bench, a benchmark for comparing vision-language models (VLMs) with children's cognitive development. Based o

arxiv.org·28d ago

DatBench: A New Framework for More Faithful and Efficient Vision-Language Model Evaluation

The article introduces DatBench, a new evaluation framework for vision-language models (VLMs) that addresses critical issues in current eval

arxiv.org·5mo ago

ClinHallu: A New Benchmark for Diagnosing Hallucination Sources in Medical AI Reasoning

This paper introduces ClinHallu, a benchmark designed to diagnose stage-wise hallucinations in medical multimodal large language models (MLL

arxiv.org·18d ago

BilliardPhys-Bench: New Benchmark Reveals Physical Reasoning Gaps in Multimodal AI Models

This paper introduces BilliardPhys-Bench, a benchmark designed to evaluate multimodal large language models (MLLMs) on intuitive physical re

arxiv.org·1mo ago

Using Vision-Language Models to Segment Robot Demonstration Videos into Subtask Annotations

This article presents a benchmark and field report on using Vision-Language Models (VLMs) to segment robot demonstration videos and egocentr

macrodata.co·4d ago

New Benchmark Uses Esoteric Programming Languages to Evaluate LLM Reasoning Abilities

Researchers introduce EsoLang-Bench, a new benchmark for evaluating large language models (LLMs) using esoteric programming languages like B

esolang-bench.vercel.app·3mo ago

Comments

No comments yet. Be the first.