ReMoT: A Reinforcement Learning Framework Using Motion Contrast Triplets to Improve VLM Spatio-Temporal Reasoning
By
[Submitted on 28 Feb 2026 (v1), last revised 10 Jun 2026 (this version, v3)]
Summary
ReMoT (Reinforcement Learning with Motion Contrast Triplets) is a unified training paradigm designed to address spatio-temporal consistency shortcomings in Vision-Language Models (VLMs), particularly for navigation, robotics, and autonomous driving. It introduces two core components: (1) ReMoT-16K, a large-scale motion-contrast dataset of 16.5K triplets generated via a rule-based automatic framework from video meta-annotations, and (2) Group Relative Policy Optimization (GRPO) for optimal contrastive reasoning learning. The paper also presents the first benchmark for fine-grained motion contrast triplets. The resulting model achieves state-of-the-art performance, including a 25.1% improvement on spatio-temporal reasoning tasks.
Source
Key quotes
· 5 pulledWe present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving.
A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation.
Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning.
We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions).
The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
You might also wanna read
GLM-5V-Turbo: A Native Multimodal Foundation Model for Agentic AI Tasks
GLM-5V-Turbo is a new multimodal foundation model developed by the GLM-V Team that integrates perception, reasoning, planning, tool use, and
Skill-MAS: A Meta-Skill Approach to Improving Multi-Agent Systems Without Retraining
Skill-MAS proposes a novel approach to LLM-based automatic Multi-Agent Systems (MAS) generation that bridges the gap between inference-time
Skill-MAS: A Meta-Skill Approach to Improving Multi-Agent Systems Without Retraining
Skill-MAS proposes a novel approach to LLM-based automatic Multi-Agent Systems (MAS) generation that bridges the gap between inference-time
Research on Contrastive Representations for Temporal Reasoning in AI Systems
The article presents research on Contrastive Representations for Temporal Reasoning (CRTR), examining whether temporal reasoning can emerge
AgentGym-RL: A Reinforcement Learning Framework for Training LLM Agents in Multi-Turn Decision Making
This paper introduces AgentGym-RL, a unified reinforcement learning framework for training LLM agents to perform multi-turn interactive deci
AgentGym-RL: A Reinforcement Learning Framework for Training LLM Agents in Multi-Turn Decision Making
This paper introduces AgentGym-RL, a unified reinforcement learning framework for training LLM agents to perform multi-turn interactive deci
StreamingVLM: Real-Time Vision-Language Model for Infinite Video Stream Processing
StreamingVLM is a new vision-language model designed for real-time understanding of infinite video streams, addressing the computational cha
Ultralytics YOLO26: A Unified Real-Time Vision Model Family with NMS-Free Inference and Advanced Training Pipeline
Ultralytics YOLO26 is a new family of real-time vision models that addresses key limitations of prior YOLO detectors. It introduces a dual-h
Comments
Sign in to join the conversation.
No comments yet. Be the first.
