DataComp-VLM: A Benchmark Reveals Data Mixing Beats Filtering for Vision-Language Model Training

[Submitted on 26 Jun 2026 (v1), last revised 30 Jun 2026 (this version, v2)]

8h ago· 3 min readenInsight

technology science machine learning computer vision

Summary

This paper introduces DataComp for VLMs (DCVLM), a benchmark for controlled data-centric experiments to improve Vision-Language Model (VLM) training. The authors collected 160 datasets spanning four data types into a corpus of 6 trillion multimodal tokens. Through extensive experiments, they found that data mixing (not filtering) is key to high-quality training, with instruction-heavy mixtures scaling better than caption-heavy ones. Their resulting DCVLM-Baseline dataset enables training an 8B VLM to 63.6% accuracy on a 33-task core suite, outperforming the state-of-the-art open VLM dataset FineVision by +5.4 percentage points.

Source

Twitter / XDataComp-VLM: A Benchmark Reveals Data Mixing Beats Filtering for Vision-Language Model Trainingarxiv.org

Key quotes

· 4 pulled

We introduce DataComp for VLMs (DCVLM), a benchmark for controlled data-centric experiments to improve VLM training.

We find that data mixing, not filtering, is key to a high-quality training dataset: instruction-heavy mixtures scale better than caption-heavy ones, with gains widening at larger scales.

Compared to FineVision, the state-of-the-art open VLM training dataset, this represents an improvement of +5.4pp.

DCVLM and all accompanying artifacts will be made publicly available at https://www.datacomp.ai/dcvlm/.

Snippet from the RSS feed

Building performant Vision-Language Models (VLMs) requires carefully curating large-scale training datasets, yet the community lacks systematic benchmarks for evaluating such curation strategies. We introduce DataComp for VLMs (DCVLM), a benchmark for con

You might also wanna read

DatBench: A New Framework for More Faithful and Efficient Vision-Language Model Evaluation

The article introduces DatBench, a new evaluation framework for vision-language models (VLMs) that addresses critical issues in current eval

arxiv.org·5mo ago

StreamingVLM: Real-Time Vision-Language Model for Infinite Video Stream Processing

StreamingVLM is a new vision-language model designed for real-time understanding of infinite video streams, addressing the computational cha

arxiv.org·8mo ago

Efficient Vision Encoding for Vision Language Models

Vision Language Models (VLMs) combine visual understanding with textual inputs by utilizing pretrained vision encoders and Large Language Mo

machinelearning.apple.com·11mo ago

Using Vision-Language Models to Segment Robot Demonstration Videos into Subtask Annotations

This article presents a benchmark and field report on using Vision-Language Models (VLMs) to segment robot demonstration videos and egocentr

macrodata.co·4d ago

Fast-dLLM: Training-Free Acceleration Method for Diffusion Language Models Using KV Cache and Parallel Decoding

Researchers introduce Fast-dLLM, a training-free acceleration method for diffusion-based large language models that addresses their slower i

arxiv.org·8mo ago

LoomVideo: A 5B-Parameter Unified Model for Efficient Video Generation and Editing

LoomVideo is a new 5-billion parameter unified architecture for video generation and editing that addresses computational bottlenecks in exi

arxiv.org·28d ago

Comments

No comments yet. Be the first.