BilliardPhys-Bench: New Benchmark Reveals Physical Reasoning Gaps in Multimodal AI Models
[Submitted on 29 May 2026]
Summary
This paper introduces BilliardPhys-Bench, a benchmark that evaluates multimodal large language models (MLLMs) on intuitive physical reasoning in synthetic billiards environments. A procedural engine generates randomized billiard scenarios with friction and elastic collisions, and the benchmark tests three abilities: predicting ball-to-ball collisions, reasoning about wall bounces, and estimating final ball positions. Evaluations of MLLMs from the GPT, Claude, Gemini, and Qwen families show that performance degrades as simulation time increases and scene geometry grows more complex. A key finding is "stasis bias", a consistent failure mode in which models predict no interaction when the correct physical outcome is harder to infer. The research highlights current limitations in visual dynamics understanding and argues for better physical inductive biases in multimodal architectures.
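To make the setup concrete, here is a minimal sketch of what a procedural billiards generator with friction and elastic collisions could look like. It is not the paper's engine: the names `Ball`, `step`, `simulate`, and `random_scene`, the unit-square table, the equal-mass balls, the constant-deceleration friction model, and the fixed time step are all assumptions made for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Ball:
    x: float
    y: float
    vx: float
    vy: float
    r: float = 0.03  # radius in table units (assumed value)


def step(balls, dt, width=1.0, height=1.0, friction=0.2):
    """Advance all balls by dt; return (wall_bounces, ball_collisions)."""
    wall_bounces = 0
    collisions = 0
    for b in balls:
        # Assumed friction model: constant deceleration opposing motion.
        speed = math.hypot(b.vx, b.vy)
        if speed > 0.0:
            scale = max(0.0, speed - friction * dt) / speed
            b.vx *= scale
            b.vy *= scale
        b.x += b.vx * dt
        b.y += b.vy * dt
        # Elastic wall bounce: flip the velocity component normal to the wall.
        if (b.x < b.r and b.vx < 0) or (b.x > width - b.r and b.vx > 0):
            b.vx = -b.vx
            wall_bounces += 1
        if (b.y < b.r and b.vy < 0) or (b.y > height - b.r and b.vy > 0):
            b.vy = -b.vy
            wall_bounces += 1
    # Elastic ball-to-ball collisions between equal masses:
    # exchange the velocity components along the line of centers.
    for i in range(len(balls)):
        for j in range(i + 1, len(balls)):
            a, c = balls[i], balls[j]
            dx, dy = c.x - a.x, c.y - a.y
            dist = math.hypot(dx, dy)
            if 0.0 < dist < a.r + c.r:
                nx, ny = dx / dist, dy / dist
                rel = (a.vx - c.vx) * nx + (a.vy - c.vy) * ny
                if rel > 0.0:  # only resolve pairs that are approaching
                    a.vx -= rel * nx
                    a.vy -= rel * ny
                    c.vx += rel * nx
                    c.vy += rel * ny
                    collisions += 1
    return wall_bounces, collisions


def simulate(balls, t_end, dt=0.001, **kwargs):
    """Run the scene for t_end seconds and total the interaction counts."""
    total_walls = total_hits = 0
    for _ in range(int(round(t_end / dt))):
        walls, hits = step(balls, dt, **kwargs)
        total_walls += walls
        total_hits += hits
    return total_walls, total_hits


def random_scene(n_balls, seed, max_speed=1.0, r=0.03):
    """Place non-overlapping balls with random velocities on a unit table."""
    rng = random.Random(seed)
    balls = []
    while len(balls) < n_balls:
        x, y = rng.uniform(r, 1 - r), rng.uniform(r, 1 - r)
        if all(math.hypot(x - b.x, y - b.y) >= 2 * r for b in balls):
            angle = rng.uniform(0, 2 * math.pi)
            speed = rng.uniform(0, max_speed)
            balls.append(Ball(x, y, speed * math.cos(angle),
                              speed * math.sin(angle), r))
    return balls


scene = random_scene(n_balls=3, seed=0)
wall_bounces, ball_collisions = simulate(scene, t_end=2.0)
```

The two counts and the final `(x, y)` of each ball correspond to the three abilities the benchmark tests. Raising `t_end` or `n_balls` in a sketch like this is one way to produce the longer and more complex scenes on which the paper reports degraded performance.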
Key quotes
Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness.
We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments.
Performance drops as simulation time increases and scene geometry grows more complex.
We also observe a consistent failure mode we call 'stasis bias': when the correct physical outcome is harder to infer, models tend to predict no interaction.
These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.
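One way to quantify a tendency like stasis bias is to ask what share of a model's wrong answers are "no interaction" predictions. The sketch below assumes this definition; the function name `stasis_bias_rate` and the label strings are hypothetical, and the paper's own metric may differ.

```python
def stasis_bias_rate(predictions, truths, no_interaction="none"):
    """Share of incorrect predictions that claim no interaction occurred.

    Assumed definition for illustration only; labels are hypothetical.
    """
    errors = [(p, t) for p, t in zip(predictions, truths) if p != t]
    if not errors:
        return 0.0
    stasis_errors = sum(1 for p, _ in errors if p == no_interaction)
    return stasis_errors / len(errors)


# Hypothetical labels: "ball" = ball-to-ball collision, "wall" = wall bounce.
preds = ["none", "none", "wall", "ball"]
truth = ["ball", "wall", "wall", "wall"]
rate = stasis_bias_rate(preds, truth)
```

A rate well above the share expected from errors spread evenly across labels would suggest that a model falls back on "nothing happens" when a scene is hard to reason about.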
