Sensorimotor World Model: Learning Action-Aligned Representations via Inverse Dynamics Regularization
[Submitted on 18 Jun 2026]
Summary
This paper introduces the Sensorimotor World Model (SMWM), a latent world model trained end-to-end with inverse dynamics regularization. The approach addresses two challenges in perception-for-action and JEPA-style world models: preventing representation collapse and inducing action-aligned representations. Forcing latent states to preserve information about the action underlying each transition biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. Training is stable on offline, reward-free trajectories without frozen encoders, exponential moving averages, or complex latent regularizers, and the model achieves competitive planning performance across simple 2D and 3D control tasks.
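The idea of combining latent prediction with an inverse dynamics term can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: linear maps stand in for the encoder, predictor, and inverse dynamics head; both terms use mean squared error (which presumes continuous actions); and the names `smwm_loss`, `W_enc`, `W_pred`, `W_inv`, and the weight `lam` are hypothetical. The example shows why a collapsed encoder, which trivially minimizes the prediction term, is still penalized by the inverse dynamics term.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def smwm_loss(obs, action, next_obs, W_enc, W_pred, W_inv, lam=1.0):
    """Hypothetical combined objective (assumed form, not taken from the paper).

    Returns (total, prediction_loss, inverse_dynamics_loss).
    """
    z = matvec(W_enc, obs)             # latent state at time t
    z_next = matvec(W_enc, next_obs)   # latent state at time t+1, same encoder
    z_pred = matvec(W_pred, z + action)    # predict z_next from [z, action]
    a_hat = matvec(W_inv, z + z_next)      # recover action from [z, z_next]
    pred_loss = mse(z_pred, z_next)
    inv_loss = mse(a_hat, action)
    return pred_loss + lam * inv_loss, pred_loss, inv_loss


# Toy dimensions chosen for illustration only.
OBS_DIM, LATENT_DIM, ACT_DIM = 4, 3, 2
rng = random.Random(0)
obs = [rng.uniform(-1.0, 1.0) for _ in range(OBS_DIM)]
next_obs = [rng.uniform(-1.0, 1.0) for _ in range(OBS_DIM)]
action = [1.0, -1.0]
W_inv = random_matrix(ACT_DIM, 2 * LATENT_DIM, rng)

# Collapsed solution: the encoder and predictor map everything to zero.
collapsed_total, collapsed_pred, collapsed_inv = smwm_loss(
    obs, action, next_obs,
    zeros(LATENT_DIM, OBS_DIM),
    zeros(LATENT_DIM, LATENT_DIM + ACT_DIM),
    W_inv,
)

# Non-collapsed random encoder, for comparison.
total, pred, inv = smwm_loss(
    obs, action, next_obs,
    random_matrix(LATENT_DIM, OBS_DIM, rng),
    random_matrix(LATENT_DIM, LATENT_DIM + ACT_DIM, rng),
    W_inv,
)
```

In the collapsed case the prediction loss is exactly zero, yet the inverse dynamics head receives only zeros and cannot recover the action, so its loss equals the mean squared magnitude of the action. Under these assumptions, the inverse dynamics term is what makes collapse a poor solution.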
Key quotes
Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions.
We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization.
By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors.
This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers.
Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.