ConSPO: A Contrastive Approach to Improving Reinforcement Learning with Verifiable Rewards for LLMs
[Submitted on 13 May 2026 (v1), last revised 30 May 2026 (this version, v3)]
Summary
This paper analyzes Group Relative Policy Optimization (GRPO), a widely used algorithm for reinforcement learning with verifiable rewards (RLVR) when post-training large language models on reasoning tasks. The authors identify two limitations in GRPO's objective. The first is likelihood-misaligned surrogate scores: GRPO optimizes clipped ratio-based scores rather than sequence likelihoods. The second is score-insensitive credit assignment: the credit assigned to each rollout does not reflect the current score gap between positive and negative rollouts. To address both, they propose ConSPO (Contrastive Sequence-level Policy Optimization), which uses length-normalized sequence log-probabilities as rollout scores and contrasts verified positive rollouts against negative distractors within the same group. ConSPO optimizes a group-wise InfoNCE-style objective with a curriculum-scheduled margin. The authors report that ConSPO outperforms strong baselines on challenging reasoning benchmarks across diverse settings.
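To make the two ingredients named in the summary concrete, here is a minimal sketch in plain Python of a length-normalized rollout score and a group-wise InfoNCE-style loss with a margin. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `temperature` parameter, the way the margin is subtracted from the positive score, and the choice to return zero loss for groups with no positives or no negatives are all hypothetical. The curriculum schedule for the margin is not described in this summary, so the margin is simply passed in as a number.

```python
import math
from typing import List, Sequence


def length_normalized_score(token_logprobs: Sequence[float]) -> float:
    """Mean per-token log-probability of one rollout (assumed score definition)."""
    return sum(token_logprobs) / len(token_logprobs)


def _logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def group_contrastive_loss(
    scores: Sequence[float],
    rewards: Sequence[float],
    margin: float = 0.0,       # hypothetical: how a scheduler would set this is not shown
    temperature: float = 1.0,  # hypothetical scaling parameter
) -> float:
    """InfoNCE-style loss over one group of rollouts for the same prompt.

    Each verified positive (reward > 0) is contrasted against every
    negative (reward <= 0) in the group. The margin is subtracted from
    the positive's score, so a positive must beat the negatives by at
    least `margin` before its loss becomes small.
    """
    positives = [s for s, r in zip(scores, rewards) if r > 0]
    negatives = [s for s, r in zip(scores, rewards) if r <= 0]
    if not positives or not negatives:
        # Assumption: a group with nothing to contrast contributes no loss.
        return 0.0

    losses = []
    for pos in positives:
        logits = [(pos - margin) / temperature]
        logits += [neg / temperature for neg in negatives]
        # Negative log-softmax probability of the positive entry.
        losses.append(_logsumexp(logits) - logits[0])
    return sum(losses) / len(losses)


# Usage: one positive and one negative rollout, scored from token log-probs.
scores = [
    length_normalized_score([-0.2, -0.4]),        # verified correct
    length_normalized_score([-0.1, -0.3, -0.5]),  # verified incorrect
]
loss = group_contrastive_loss(scores, rewards=[1.0, 0.0], margin=0.1)
```

In this form the gradient on each positive grows as its softmax probability within the group shrinks, and higher-scoring negatives receive more of the softmax weight, which is one way to read the "poorly separated positives and high-scoring negatives" behavior described in the quotes.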
Key quotes
We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts.
This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores... and score-insensitive credit assignment...
To address these limitations, we propose ConSPO, a Contrastive Sequence-level Policy Optimization method that uses length-normalized sequence log-probabilities as rollout scores...
ConSPO optimizes a group-wise InfoNCE-style objective to adaptively strengthen updates for poorly separated positives and high-scoring negatives...
Experiments across diverse settings show that ConSPO outperforms strong baselines on challenging reasoning benchmarks.
