ConSPO: A Contrastive Approach to Improving Reinforcement Learning with Verifiable Rewards for LLMs

[Submitted on 13 May 2026 (v1), last revised 30 May 2026 (this version, v3)]

Summary

This paper analyzes Group Relative Policy Optimization (GRPO), a widely used algorithm for reinforcement learning with verifiable rewards (RLVR) when post-training large language models on reasoning tasks. The authors show that GRPO admits an equivalent discriminative reformulation and use it to identify two objective-level limitations: likelihood-misaligned surrogate scores (clipped ratio-based scores are optimized instead of sequence likelihoods) and score-insensitive credit assignment (rollout-level credit does not reflect the current score gaps between positive and negative rollouts). To address these issues, they propose ConSPO (Contrastive Sequence-level Policy Optimization), which uses length-normalized sequence log-probabilities as rollout scores and contrasts verified positive rollouts against negative distractors within the same group. ConSPO optimizes a group-wise InfoNCE-style objective with a curriculum-scheduled margin. Experiments across diverse settings show that ConSPO outperforms strong baselines on challenging reasoning benchmarks.
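
To make the ingredients named in the summary concrete, here is a minimal Python sketch of a group-wise InfoNCE-style loss over length-normalized sequence log-probabilities. It is an illustration under stated assumptions, not the paper's exact objective: the function names, the `temperature` parameter, the way the margin is subtracted from the positive score, and the linear margin schedule are all hypothetical choices made for this example.

```python
import math

def length_normalized_logprob(token_logprobs):
    """Score of one rollout: mean per-token log-probability."""
    return sum(token_logprobs) / len(token_logprobs)

def group_contrastive_loss(scores, is_correct, margin=0.0, temperature=1.0):
    """InfoNCE-style loss for the rollouts sampled from one prompt.

    Each verified-positive rollout is contrasted against every negative
    rollout in the same group. The margin handling and temperature are
    assumptions of this sketch, not details taken from the paper.
    Returns 0.0 for groups with no positives or no negatives.
    """
    pos = [s for s, ok in zip(scores, is_correct) if ok]
    neg = [s for s, ok in zip(scores, is_correct) if not ok]
    if not pos or not neg:
        return 0.0
    total = 0.0
    for p in pos:
        # Positive logit (reduced by the margin) followed by negative logits.
        logits = [(p - margin) / temperature] + [n / temperature for n in neg]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[0]  # negative log-softmax of the positive
    return total / len(pos)

def linear_margin(step, total_steps, max_margin=0.5):
    """Hypothetical curriculum: ramp the margin linearly from 0 to max_margin."""
    return max_margin * min(1.0, step / total_steps)

# Three rollouts for one prompt; only the first is verified correct.
labels = [True, False, False]
well_separated = group_contrastive_loss([-0.2, -1.5, -1.6], labels)
poorly_separated = group_contrastive_loss([-1.4, -1.5, -0.3], labels)
```

In this sketch the loss is larger when the positive rollout scores below a negative one, and the softmax weighting means the gradient on each negative grows with its score. That is one way to obtain the adaptive emphasis on poorly separated positives and high-scoring negatives that the summary describes.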

Key quotes

We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts.
This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores... and score-insensitive credit assignment...
To address these limitations, we propose ConSPO, a Contrastive Sequence-level Policy Optimization method that uses length-normalized sequence log-probabilities as rollout scores...
ConSPO optimizes a group-wise InfoNCE-style objective to adaptively strengthen updates for poorly separated positives and high-scoring negatives...
Experiments across diverse settings show that ConSPO outperforms strong baselines on challenging reasoning benchmarks.
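
The first quote's "score gap" reading can be checked numerically in a simplified setting. The sketch below assumes binary verifier rewards, population-standard-deviation normalization with no epsilon, and an arbitrary per-rollout score with clipping ignored; these simplifications and all function names are assumptions of this example, so it should not be read as the paper's exact reformulation. Under those assumptions, the advantage-weighted mean of the scores equals sqrt(p(1 - p)) times the gap between the mean positive score and the mean negative score, where p is the fraction of correct rollouts in the group.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: (reward - group mean) / group std.

    Uses the population standard deviation without an epsilon term,
    so the group must contain both correct and incorrect rollouts.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / std for r in rewards]

def advantage_weighted_mean(scores, rewards):
    """Average of advantage * score over the group."""
    adv = grpo_advantages(rewards)
    return sum(a * s for a, s in zip(adv, scores)) / len(scores)

def scaled_score_gap(scores, rewards):
    """sqrt(p * (1 - p)) * (mean positive score - mean negative score)."""
    pos = [s for s, r in zip(scores, rewards) if r == 1]
    neg = [s for s, r in zip(scores, rewards) if r == 0]
    p = len(pos) / len(scores)
    gap = sum(pos) / len(pos) - sum(neg) / len(neg)
    return math.sqrt(p * (1 - p)) * gap

scores = [0.9, 1.1, 1.0, 0.7]   # arbitrary per-rollout scores
rewards = [1, 0, 0, 1]          # binary verifier outcomes
lhs = advantage_weighted_mean(scores, rewards)
rhs = scaled_score_gap(scores, rewards)
```

Note that the weights in this sketch depend only on the rewards, never on the scores themselves. This is consistent with the "score-insensitive credit assignment" limitation named in the second quote.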
