Understanding Reinforcement Learning for Model Training, and future directions with GRAPE
By sonabinu
8mo ago · 2 min read · News
This paper provides a self-contained, from-scratch exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optim
