DataComp-VLM: A Benchmark Reveals Data Mixing Beats Filtering for Vision-Language Model Training

[Submitted on 26 Jun 2026 (v1), last revised 30 Jun 2026 (this version, v2)]

Summary

This paper introduces DataComp for VLMs (DCVLM), a benchmark for controlled data-centric experiments on Vision-Language Model (VLM) training data. The authors collected 160 datasets spanning four data types into a corpus of 6 trillion multimodal tokens. Their experiments indicate that data mixing, not filtering, is the key to a high-quality training dataset: instruction-heavy mixtures scale better than caption-heavy ones, with gains widening at larger scales. The resulting DCVLM-Baseline dataset trains an 8B VLM to 63.6% accuracy on a 33-task core suite, 5.4 percentage points above FineVision, the state-of-the-art open VLM training dataset.
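
The headline gain is stated in percentage points (an absolute difference), which is easy to confuse with a relative improvement. The short sketch below works through the arithmetic using only the two figures quoted in the summary. It assumes the FineVision comparison was run on the same 33-task core suite at the same 8B scale, so the implied baseline is a derived value rather than a number reported here; the function names are illustrative.

```python
def implied_baseline(score_pct: float, gain_pp: float) -> float:
    """Baseline accuracy implied by a score and an absolute gain in percentage points."""
    return score_pct - gain_pp


def relative_gain_pct(score_pct: float, gain_pp: float) -> float:
    """Relative improvement over the implied baseline, in percent."""
    return 100.0 * gain_pp / implied_baseline(score_pct, gain_pp)


# Figures quoted in the summary above.
dcvlm_score = 63.6          # accuracy (%) of the 8B VLM trained on DCVLM-Baseline
gain_over_finevision = 5.4  # absolute gain in percentage points

# Derived values (assumption: same suite and model scale for both datasets).
baseline = implied_baseline(dcvlm_score, gain_over_finevision)
relative = relative_gain_pct(dcvlm_score, gain_over_finevision)

print(f"Implied FineVision accuracy: {baseline:.1f}%")
print(f"Relative improvement: {relative:.1f}%")
```

Under that assumption, a 5.4 percentage point gain corresponds to a relative improvement of roughly 9% over the implied FineVision score.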

Source

DataComp-VLM: A Benchmark Reveals Data Mixing Beats Filtering for Vision-Language Model Training (arxiv.org), via Twitter / X

Key quotes

We introduce DataComp for VLMs (DCVLM), a benchmark for controlled data-centric experiments to improve VLM training.
We find that data mixing, not filtering, is key to a high-quality training dataset: instruction-heavy mixtures scale better than caption-heavy ones, with gains widening at larger scales.
Compared to FineVision, the state-of-the-art open VLM training dataset, this represents an improvement of +5.4pp.
DCVLM and all accompanying artifacts will be made publicly available at https://www.datacomp.ai/dcvlm/.
Snippet from the RSS feed
Building performant Vision-Language Models (VLMs) requires carefully curating large-scale training datasets, yet the community lacks systematic benchmarks for evaluating such curation strategies. We introduce DataComp for VLMs (DCVLM), a benchmark for con
