All Topics
All Topics
Technology
Technology
Design
Design
Programming
Programming
Science
Science
News
News
Gaming
Gaming
Entertainment
Entertainment
Business
Business
Finance
Finance
Sports
Sports
Health
Health
Food
Food
Travel
Travel
Art
Art
Music
Music
Books
Books
Education
Education
Politics
Politics
Personal
Personal
No algorithm. No AI slop. No ads. Just RSS. Pro-human. Indie writers. Real journalism. Open web. Chronological. Hand toasted.

Contextual Rollout Bandits: A Neural Scheduling Framework for Efficient Reinforcement Learning with Verifiable Rewards

By

@ai-firehose.column.social

5d ago· 2 min readenInsight

Summary

This paper introduces Contextual Rollout Bandits, a novel framework for Reinforcement Learning with Verifiable Rewards (RLVR) that addresses inefficiencies in how rollouts are used during training of large language models. The authors formulate rollout scheduling as a contextual bandit problem, where each rollout is treated as an arm with reward defined by performance gain between optimization steps. The framework supports noise-aware intra-group selection and adaptive reuse of historical rollouts. Theoretical sublinear regret bounds are derived, and experiments on six mathematical reasoning benchmarks show consistent improvements in performance and training efficiency across multiple RLVR methods.

Key quotes

· 5 pulled
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models.
Existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use.
We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training.
We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound.
Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
Snippet from the RSS feed
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of h

You might also wanna read