DatBench: A New Framework for More Faithful and Efficient Vision-Language Model Evaluation
By circuithunter
Summary
The article introduces DatBench, an evaluation framework for vision-language models (VLMs) that addresses critical issues in current evaluation practice. It identifies three key problems with existing benchmarks: (1) multiple-choice formats that reward guessing and saturate early, (2) "blindly solvable" questions that can be answered without the image (up to 70% of some evaluations), and (3) mislabeled or ambiguous samples (up to 42% of examples in certain datasets). The authors propose a cleaned evaluation suite that converts multiple-choice questions to generative tasks (revealing capability drops of up to 35%) and filters out problematic samples, achieving a 13x average speedup while maintaining discriminative power. The work aims to make VLM evaluations more faithful, discriminative, and efficient as models continue to scale.
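To make the idea of a "blindly solvable" question concrete, here is a minimal Python sketch of one way such samples could be filtered: ask one or more text-only models the question without the image and drop the sample if too many of them get it right. This is an illustration under stated assumptions, not the authors' pipeline. The `Sample` type, the `filter_blindly_solvable` function, the exact-match comparison, and the `max_blind_correct` threshold are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """Hypothetical benchmark record: a question, its reference answer, and an image id."""
    question: str
    answer: str
    image_id: str


# A text-only model is assumed to be any callable mapping a question string to an answer string.
TextOnlyModel = Callable[[str], str]


def _normalize(text: str) -> str:
    # Assumption: case-insensitive exact match is good enough for this sketch.
    return text.strip().lower()


def filter_blindly_solvable(
    samples: List[Sample],
    text_only_models: List[TextOnlyModel],
    max_blind_correct: int = 0,
) -> List[Sample]:
    """Keep samples that at most `max_blind_correct` text-only models answer correctly."""
    kept = []
    for sample in samples:
        # Count models that reach the reference answer without ever seeing the image.
        blind_correct = sum(
            1
            for model in text_only_models
            if _normalize(model(sample.question)) == _normalize(sample.answer)
        )
        if blind_correct <= max_blind_correct:
            kept.append(sample)
    return kept
```

In this sketch, a question such as "What color is the sky?" would be dropped if a text-only model answers "blue" from priors alone, while a question that depends on counting objects in a specific image would be kept. The threshold is a design choice: a stricter setting removes more samples and shrinks the benchmark further.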
Key quotes
Empirical evaluation serves as the primary compass guiding research progress in foundation models.
Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets.
We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%.
Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.