Sensorimotor World Model: Learning Action-Aligned Representations via Inverse Dynamics Regularization

[Submitted on 18 Jun 2026]

Summary

This paper introduces the Sensorimotor World Model (SMWM), a latent world model trained end-to-end with inverse dynamics regularization. The approach addresses two challenges that arise when combining perception for action with JEPA-style world models: preventing representation collapse and inducing action-aligned representations. Because latent states are forced to preserve information about the action underlying each transition, the model is biased toward the controllable degrees of freedom of the environment and discards uncontrollable distractors. According to the authors, training is stable on offline, reward-free trajectories without frozen encoders, exponential moving averages, or complex latent regularizers, and the learned latent spaces support competitive planning on simple 2D and 3D control tasks.
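To make the role of the inverse dynamics term concrete, here is a minimal NumPy sketch. It is not the paper's implementation. The linear encoder, the linear forward and inverse heads, the weight `lam`, the function name `smwm_style_loss`, and the toy transition are all assumptions chosen for illustration. It shows that a collapsed encoder can drive the latent prediction loss to zero, yet cannot lower the inverse dynamics loss, because a constant latent carries no information about the action.

```python
import numpy as np

def smwm_style_loss(W_enc, W_dyn, W_inv, obs, act, next_obs, lam=1.0):
    """Latent prediction loss plus an inverse dynamics penalty (illustrative only)."""
    z = obs @ W_enc              # latent of the current observation
    z_next = next_obs @ W_enc    # latent of the next observation, same encoder
    # Forward model: predict the next latent from the current latent and the action.
    z_pred = np.concatenate([z, act], axis=1) @ W_dyn
    pred_loss = float(np.mean((z_pred - z_next) ** 2))
    # Inverse dynamics head: recover the action from the pair of latents.
    a_pred = np.concatenate([z, z_next], axis=1) @ W_inv
    inv_loss = float(np.mean((a_pred - act) ** 2))
    return pred_loss + lam * inv_loss, pred_loss, inv_loss

rng = np.random.default_rng(0)
n, obs_dim, act_dim, lat_dim = 512, 4, 2, 2
obs = rng.normal(size=(n, obs_dim))
act = rng.normal(size=(n, act_dim))
# Hypothetical toy transition: the action shifts the first two observation coordinates.
next_obs = obs.copy()
next_obs[:, :act_dim] += act

eye = np.eye(lat_dim)

# Collapsed solution: every observation maps to the zero latent, trivial dynamics.
W_enc_c = np.zeros((obs_dim, lat_dim))
W_dyn_c = np.zeros((lat_dim + act_dim, lat_dim))
W_inv_any = rng.normal(size=(2 * lat_dim, act_dim))  # any head sees only zeros
total_c, pred_c, inv_c = smwm_style_loss(W_enc_c, W_dyn_c, W_inv_any, obs, act, next_obs)

# Action-aligned solution: keep the two controllable coordinates.
W_enc_g = np.vstack([eye, np.zeros((obs_dim - lat_dim, lat_dim))])
W_dyn_g = np.vstack([eye, eye])    # z_next = z + a
W_inv_g = np.vstack([-eye, eye])   # a = z_next - z
total_g, pred_g, inv_g = smwm_style_loss(W_enc_g, W_dyn_g, W_inv_g, obs, act, next_obs)

print(f"collapsed:      pred={pred_c:.3g}  inv={inv_c:.3g}")
print(f"action-aligned: pred={pred_g:.3g}  inv={inv_g:.3g}")
```

In this toy setting the prediction loss alone cannot distinguish the two solutions, while the combined objective penalizes the collapsed one. That is the intuition behind using inverse dynamics as a regularizer; the paper's actual architecture and loss weighting may differ.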

Source

Twitter / X: Sensorimotor World Model: Learning Action-Aligned Representations via Inverse Dynamics Regularization (arxiv.org)

Key quotes

Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions.

We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization.

By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors.

This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers.

Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.
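The quoted claim about controllable degrees of freedom versus uncontrollable distractors can be illustrated with a small experiment. This is a sketch under stated assumptions, not the authors' setup: the two-coordinate observation, the hypothetical helper `inverse_dynamics_error`, and the least-squares inverse head are all invented for the example. One coordinate is moved by the action, the other is resampled as noise independent of the action, and a linear inverse dynamics head is fitted on each coordinate in turn.

```python
import numpy as np

def inverse_dynamics_error(select, obs, act, next_obs):
    """Fit a linear inverse dynamics head by least squares and return its MSE.

    `select` lists the observation coordinates the (hypothetical) encoder keeps.
    """
    z, z_next = obs[:, select], next_obs[:, select]
    features = np.concatenate([z, z_next], axis=1)
    weights, *_ = np.linalg.lstsq(features, act, rcond=None)
    return float(np.mean((features @ weights - act) ** 2))

rng = np.random.default_rng(0)
n = 512
act = rng.normal(size=(n, 1))
obs = rng.normal(size=(n, 2))
next_obs = np.empty_like(obs)
next_obs[:, 0] = obs[:, 0] + act[:, 0]   # controllable: moved by the action
next_obs[:, 1] = rng.normal(size=n)      # distractor: independent of the action

err_controllable = inverse_dynamics_error([0], obs, act, next_obs)
err_distractor = inverse_dynamics_error([1], obs, act, next_obs)

print(f"inverse dynamics MSE, controllable coordinate: {err_controllable:.3g}")
print(f"inverse dynamics MSE, distractor coordinate:   {err_distractor:.3g}")
```

An encoder that keeps the controllable coordinate lets the action be recovered almost exactly, while one that keeps only the distractor leaves the error near the variance of the action. An inverse dynamics penalty therefore gives the encoder no incentive to spend capacity on the distractor in this toy case.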

Snippet from the RSS feed

Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dime
