MMR-GRPO: Diversity-Aware Reward Reweighting Accelerates Mathematical Reasoning Model Training
[Submitted on 14 Jan 2026 (v1), last revised 7 Jun 2026 (this version, v2)]
Summary
This paper introduces MMR-GRPO, a method that integrates Maximal Marginal Relevance (MMR) into Group Relative Policy Optimization (GRPO) to reweight rewards based on completion diversity during training of mathematical reasoning models. The key insight is that semantically redundant completions provide limited marginal learning signal, so prioritizing diverse solutions yields more informative updates and accelerates convergence. Evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time.
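To make the reweighting idea concrete, here is a minimal Python sketch of how an MMR-style pass might discount redundant completions before GRPO's group-relative normalization. It is not the paper's implementation: the summary does not give the formulas, so the scoring rule, the down-weighting rule, the trade-off parameter `lam`, the use of cosine similarity over precomputed embeddings, and the function names `mmr_reweight` and `group_advantages` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mmr_reweight(rewards, embeddings, lam=0.7):
    """Greedy MMR pass over one group of completions.

    ASSUMPTION: both the selection score and the down-weighting rule
    below are illustrative choices, not formulas taken from the paper.
    """
    selected = []
    remaining = list(range(len(rewards)))
    weighted = [0.0] * len(rewards)
    while remaining:
        best, best_score, best_red = None, -math.inf, 0.0
        for i in remaining:
            # Redundancy: highest similarity to any already-selected completion.
            red = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            # Classic MMR trade-off between relevance (reward) and redundancy.
            score = lam * rewards[i] - (1.0 - lam) * red
            if score > best_score:
                best, best_score, best_red = i, score, red
        # Hypothetical rule: shrink the reward in proportion to redundancy.
        weighted[best] = rewards[best] * (1.0 - (1.0 - lam) * max(best_red, 0.0))
        selected.append(best)
        remaining.remove(best)
    return weighted


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


# Toy group: completions 0 and 1 share an embedding, completion 2 is distinct.
rewards = [1.0, 1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weighted = mmr_reweight(rewards, embeddings, lam=0.5)
advantages = group_advantages(weighted)
```

In this toy group, completion 1 repeats completion 0's embedding, so with `lam=0.5` its reward is halved while the distinct completion 2 keeps its full reward. After normalization the duplicate receives a smaller advantage than the two diverse correct completions, which is the qualitative effect the summary describes.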
Key quotes
Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence.
MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time.
These gains are consistent across models, methods, and benchmarks.
