DataComp-VLM: A Benchmark Reveals Data Mixing Beats Filtering for Vision-Language Model Training
By
[Submitted on 26 Jun 2026 (v1), last revised 30 Jun 2026 (this version, v2)]
Summary
This paper introduces DataComp for VLMs (DCVLM), a benchmark for controlled data-centric experiments to improve Vision-Language Model (VLM) training. The authors collected 160 datasets spanning four data types into a corpus of 6 trillion multimodal tokens. Through extensive experiments, they found that data mixing (not filtering) is key to high-quality training, with instruction-heavy mixtures scaling better than caption-heavy ones. Their resulting DCVLM-Baseline dataset enables training an 8B VLM to 63.6% accuracy on a 33-task core suite, outperforming the state-of-the-art open VLM dataset FineVision by +5.4 percentage points.
Source
Key quotes
· 4 pulledWe introduce DataComp for VLMs (DCVLM), a benchmark for controlled data-centric experiments to improve VLM training.
We find that data mixing, not filtering, is key to a high-quality training dataset: instruction-heavy mixtures scale better than caption-heavy ones, with gains widening at larger scales.
Compared to FineVision, the state-of-the-art open VLM training dataset, this represents an improvement of +5.4pp.
DCVLM and all accompanying artifacts will be made publicly available at https://www.datacomp.ai/dcvlm/.
You might also wanna read
DatBench: A New Framework for More Faithful and Efficient Vision-Language Model Evaluation
The article introduces DatBench, a new evaluation framework for vision-language models (VLMs) that addresses critical issues in current eval
StreamingVLM: Real-Time Vision-Language Model for Infinite Video Stream Processing
StreamingVLM is a new vision-language model designed for real-time understanding of infinite video streams, addressing the computational cha
Efficient Vision Encoding for Vision Language Models
Vision Language Models (VLMs) combine visual understanding with textual inputs by utilizing pretrained vision encoders and Large Language Mo
Using Vision-Language Models to Segment Robot Demonstration Videos into Subtask Annotations
This article presents a benchmark and field report on using Vision-Language Models (VLMs) to segment robot demonstration videos and egocentr
Fast-dLLM: Training-Free Acceleration Method for Diffusion Language Models Using KV Cache and Parallel Decoding
Researchers introduce Fast-dLLM, a training-free acceleration method for diffusion-based large language models that addresses their slower i
LoomVideo: A 5B-Parameter Unified Model for Efficient Video Generation and Editing
LoomVideo is a new 5-billion parameter unified architecture for video generation and editing that addresses computational bottlenecks in exi

Comments
Sign in to join the conversation.
No comments yet. Be the first.