BilliardPhys-Bench: New Benchmark Reveals Physical Reasoning Gaps in Multimodal AI Models

[Submitted on 29 May 2026]

Summary

This paper introduces BilliardPhys-Bench, a benchmark that evaluates multimodal large language models (MLLMs) on intuitive physical reasoning in synthetic billiards environments. A procedural engine generates randomized billiard scenarios with friction and elastic collisions, and the benchmark tests three abilities: predicting ball-to-ball collisions, reasoning about wall bounces, and estimating final ball positions. Evaluations of MLLMs from the GPT, Claude, Gemini, and Qwen families show that performance degrades as simulation time increases and scene geometry grows more complex. A key finding is "stasis bias", a consistent failure mode in which models predict no interaction when the correct physical outcome is harder to infer. The paper argues that these results expose current limits in visual dynamics understanding and motivate better physical inductive biases in multimodal architectures.

Key quotes

Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness.

We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments.

Performance drops as simulation time increases and scene geometry grows more complex.

We also observe a consistent failure mode we call 'stasis bias': when the correct physical outcome is harder to infer, models tend to predict no interaction.

These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.
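The source does not say how the paper measures stasis bias, so the following is only one plausible way to quantify it: among examples where some interaction really occurs, count how often the model answers "no interaction". The function name `stasis_bias_rate` and the label `"none"` are hypothetical.

```python
def stasis_bias_rate(predictions, ground_truth, no_interaction="none"):
    """Fraction of true-interaction cases predicted as no interaction.

    Hypothetical metric, not taken from the paper. Returns 0.0 when
    the ground truth contains no interaction cases.
    """
    # Keep only examples where an interaction actually happens.
    pairs = [(p, g) for p, g in zip(predictions, ground_truth)
             if g != no_interaction]
    if not pairs:
        return 0.0
    missed = sum(1 for p, _ in pairs if p == no_interaction)
    return missed / len(pairs)

preds = ["none", "wall", "none", "ball"]
truth = ["ball", "wall", "wall", "none"]
print(stasis_bias_rate(preds, truth))
```

A rate well above the model's overall error rate on the same examples would suggest that its mistakes skew toward predicting that nothing happens, rather than being spread evenly across the wrong answers.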

Snippet from the RSS feed

Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a b
