BilliardPhys-Bench: New Benchmark Reveals Physical Reasoning Gaps in Multimodal AI Models

[Submitted on 29 May 2026]

Summary

This paper introduces BilliardPhys-Bench, a benchmark designed to evaluate multimodal large language models (MLLMs) on intuitive physical reasoning using synthetic billiards environments. The benchmark uses a procedural engine to generate randomized billiard scenarios with friction and elastic collisions, testing three abilities: predicting ball-to-ball collisions, reasoning about wall bounces, and estimating final ball positions. Evaluations of MLLMs from GPT, Claude, Gemini, and Qwen families reveal that performance degrades with longer simulation times and more complex scene geometry. A key finding is "stasis bias"—a consistent failure mode where models predict no interaction when the correct physical outcome is harder to infer. The research highlights current limitations in visual dynamics understanding and argues for better physical inductive biases in multimodal architectures.
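To make the setup concrete, here is a minimal sketch of what a procedural billiards scenario with friction and elastic wall bounces could look like. It is not the benchmark's engine: the table size, ball radius, constant-deceleration friction model, time step, and every name (`Ball`, `random_scene`, `simulate`) are assumptions chosen for illustration, and ball-to-ball collisions are left out for brevity.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Ball:
    x: float
    y: float
    vx: float
    vy: float


def random_scene(rng, width=2.0, height=1.0, radius=0.03, max_speed=1.5):
    """Sample one ball with a random position and velocity inside the table.

    All dimensions are hypothetical, not taken from the paper.
    """
    x = rng.uniform(radius, width - radius)
    y = rng.uniform(radius, height - radius)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    speed = rng.uniform(0.2, max_speed)
    return Ball(x, y, speed * math.cos(angle), speed * math.sin(angle))


def simulate(ball, duration, width=2.0, height=1.0, radius=0.03,
             friction=0.3, dt=0.001):
    """Advance the ball for `duration` seconds.

    Friction is modelled as a constant deceleration (an assumption);
    wall bounces are perfectly elastic mirror reflections.
    Returns (final_ball, number_of_wall_bounces).
    """
    x, y, vx, vy = ball.x, ball.y, ball.vx, ball.vy
    bounces = 0
    for _ in range(int(round(duration / dt))):
        speed = math.hypot(vx, vy)
        if speed == 0.0:
            break  # the ball has come to rest
        # Reduce speed by friction without reversing direction.
        scale = max(0.0, speed - friction * dt) / speed
        vx *= scale
        vy *= scale
        x += vx * dt
        y += vy * dt
        # Reflect off the left/right cushions.
        if x < radius:
            x, vx, bounces = 2 * radius - x, -vx, bounces + 1
        elif x > width - radius:
            x, vx, bounces = 2 * (width - radius) - x, -vx, bounces + 1
        # Reflect off the bottom/top cushions.
        if y < radius:
            y, vy, bounces = 2 * radius - y, -vy, bounces + 1
        elif y > height - radius:
            y, vy, bounces = 2 * (height - radius) - y, -vy, bounces + 1
    return Ball(x, y, vx, vy), bounces


if __name__ == "__main__":
    rng = random.Random(0)
    start = random_scene(rng)
    end, n = simulate(start, duration=3.0)
    print(f"start=({start.x:.2f}, {start.y:.2f}) "
          f"end=({end.x:.2f}, {end.y:.2f}) wall bounces={n}")
```

Under these assumptions, the wall-bounce and final-position tasks correspond to asking a model to recover `bounces` and the final `(x, y)` from an image of the starting state. A longer `duration` means more events to track, which is consistent with the reported drop in accuracy as simulation time grows.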

Key quotes

Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness.
We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments.
Performance drops as simulation time increases and scene geometry grows more complex.
We also observe a consistent failure mode we call 'stasis bias': when the correct physical outcome is harder to infer, models tend to predict no interaction.
These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.
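One way to quantify the "stasis bias" described in the quotes is to measure how often a model answers "no interaction" on items where an interaction does occur. The sketch below is an illustrative metric, not the paper's evaluation code; the label strings, the `Item` record, and the function name `stasis_bias_rate` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    truth: str        # ground-truth label, e.g. "collision" or "no_interaction"
    prediction: str   # label the model produced


def stasis_bias_rate(items, null_label="no_interaction"):
    """Fraction of truly interacting items that the model labelled as null.

    Returns 0.0 when no item has an interacting ground truth.
    """
    interacting = [it for it in items if it.truth != null_label]
    if not interacting:
        return 0.0
    missed = sum(1 for it in interacting if it.prediction == null_label)
    return missed / len(interacting)


if __name__ == "__main__":
    results = [
        Item("collision", "collision"),
        Item("collision", "no_interaction"),
        Item("wall_bounce", "no_interaction"),
        Item("wall_bounce", "wall_bounce"),
        Item("no_interaction", "no_interaction"),
    ]
    print(stasis_bias_rate(results))
```

Comparing this rate across short and long simulation horizons would show whether the null answer becomes more common as the outcome gets harder to infer, which is the pattern the authors describe.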
Snippet from the RSS feed
Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a b
