Contextual Rollout Bandits: A Neural Scheduling Framework for Efficient Reinforcement Learning with Verifiable Rewards
By @ai-firehose.column.social
Summary
This paper introduces Contextual Rollout Bandits, a novel framework for Reinforcement Learning with Verifiable Rewards (RLVR) that addresses inefficiencies in how rollouts are used during training of large language models. The authors formulate rollout scheduling as a contextual bandit problem, where each rollout is treated as an arm with reward defined by performance gain between optimization steps. The framework supports noise-aware intra-group selection and adaptive reuse of historical rollouts. Theoretical sublinear regret bounds are derived, and experiments on six mathematical reasoning benchmarks show consistent improvements in performance and training efficiency across multiple RLVR methods.
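To make the bandit framing concrete, here is a minimal sketch of rollout scheduling under loud assumptions. The paper describes a neural scheduler; this sketch swaps in a plain linear scorer with epsilon-greedy exploration so it stays short. The names `Rollout`, `LinearScheduler`, the feature vector, and the choice to credit every selected rollout with the same observed gain are all hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Rollout:
    rollout_id: int
    features: List[float]  # hypothetical context, e.g. [advantage, staleness]


@dataclass
class LinearScheduler:
    """Linear stand-in for a neural scheduler (an assumption, not the paper's model)."""
    dim: int
    lr: float = 0.1
    epsilon: float = 0.1
    seed: int = 0
    weights: List[float] = field(init=False)
    rng: random.Random = field(init=False)

    def __post_init__(self) -> None:
        self.weights = [0.0] * self.dim
        self.rng = random.Random(self.seed)

    def score(self, rollout: Rollout) -> float:
        # Predicted performance gain from training on this rollout.
        return sum(w * x for w, x in zip(self.weights, rollout.features))

    def select(self, buffer: Sequence[Rollout], k: int) -> List[Rollout]:
        # The buffer may mix fresh and historical rollouts, so old ones can be reused.
        k = min(k, len(buffer))
        if self.rng.random() < self.epsilon:
            return self.rng.sample(list(buffer), k)  # explore
        return sorted(buffer, key=self.score, reverse=True)[:k]  # exploit

    def update(self, chosen: Sequence[Rollout], reward: float) -> None:
        # reward: observed performance gain between two optimization steps.
        # Simplification: every chosen rollout is credited with the same gain.
        for r in chosen:
            error = reward - self.score(r)
            self.weights = [
                w + self.lr * error * x for w, x in zip(self.weights, r.features)
            ]
```

In a training loop, one would call `select` on the buffer before each policy update, run the update on the chosen rollouts, measure the change in a validation metric, and pass that change to `update` as the reward.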
Key quotes
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models.
Existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use.
We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training.
We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound.
Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
