
MMR-GRPO: Diversity-Aware Reward Reweighting Accelerates Mathematical Reasoning Model Training

[Submitted on 14 Jan 2026 (v1), last revised 7 Jun 2026 (this version, v2)]

Summary

This paper introduces MMR-GRPO, a method that integrates Maximal Marginal Relevance (MMR) into Group Relative Policy Optimization (GRPO) to reweight rewards based on completion diversity during training of mathematical reasoning models. The key insight is that semantically redundant completions provide limited marginal learning signal, so prioritizing diverse solutions yields more informative updates and accelerates convergence. Evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time.
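
To make the reweighting idea concrete, here is a minimal sketch in pure Python. It is not the paper's implementation: the summary does not give the exact formula, so the greedy MMR ordering, the trade-off parameter `lam`, the rule `weight = 1 - redundancy`, and the toy embeddings are all assumptions chosen for illustration. The function names (`cosine`, `mmr_weights`, `group_advantages`) are hypothetical. The group-normalized advantage at the end follows the usual GRPO form (reward minus group mean, divided by group standard deviation).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_weights(rewards, embeddings, lam=0.5):
    """Greedy MMR pass over one group of completions.

    Assumed rule (illustrative only): at each step pick the completion that
    maximizes lam * reward - (1 - lam) * redundancy, where redundancy is the
    highest cosine similarity to any completion already picked. The weight
    assigned at pick time is 1 - redundancy, clipped to [0, 1].
    """
    remaining = set(range(len(rewards)))
    selected = []
    weights = [0.0] * len(rewards)
    while remaining:
        best, best_score, best_red = None, -math.inf, 0.0
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            red = max((cosine(embeddings[i], embeddings[j]) for j in selected),
                      default=0.0)
            score = lam * rewards[i] - (1.0 - lam) * red
            if score > best_score:
                best, best_score, best_red = i, score, red
        weights[best] = 1.0 - min(1.0, max(0.0, best_red))
        selected.append(best)
        remaining.remove(best)
    return weights

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group: completions 0 and 1 are identical and both correct,
# completion 2 is correct but different, completion 3 is incorrect.
rewards = [1.0, 1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights = mmr_weights(rewards, embeddings)
reweighted = [r * w for r, w in zip(rewards, weights)]
advantages = group_advantages(reweighted)
```

In this toy group the duplicate of an already selected completion ends up with zero weight, so its reward no longer contributes a positive advantage. That is the intuition the summary describes: a redundant completion adds little marginal learning signal.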

Key quotes

Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence.
MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time.
These gains are consistent across models, methods, and benchmarks.
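
As a back-of-the-envelope reading of the reported percentages, the short calculation below converts "fraction saved" into a speedup factor. The helper name `speedup` is hypothetical. The last line assumes both averages apply to a single run, which the source does not state; since the figures are averages across models and methods, treat that ratio as a rough indication only.

```python
def speedup(fraction_saved):
    """Convert a fractional saving (e.g. 0.479 for 47.9%) into a speedup factor."""
    return 1.0 / (1.0 - fraction_saved)

steps_speedup = speedup(0.479)  # about 1.92x fewer training steps
time_speedup = speedup(0.702)   # about 3.36x less wall-clock time

# Assumption: if both averages held for the same run, the implied
# per-step wall-clock cost relative to the baseline would be about 0.57.
relative_step_cost = (1.0 - 0.702) / (1.0 - 0.479)
```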
Snippet from the RSS feed
Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the nu
