Why Would GLM-5.2 Move Away From GRPO?

🌟 Insights from Zhihu contributor 九老师

TL;DR: GLM-5.2 dropping GRPO does not mean GRPO is “bad.” It means the assumptions that made GRPO attractive for short LLM RL tasks may no longer hold for long-horizon agentic tasks. When rollouts get longer, environments get noisier, and credit assignment gets harder, PPO + value modeling starts looking useful again.

The key question is not simply “why did GLM-5.2 stop using GRPO?” A better question is: why did GRPO become useful for LLM RL in the first place? If the reasons that made GRPO attractive no longer hold, then going back to PPO becomes natural.

GRPO can be understood as a sampled-baseline method. Instead of training a separate value model, it samples multiple responses for the same prompt and uses the group's average reward as a baseline. That is elegant. You get a relative reward signal without paying for a separate critic. In short tasks, this is very appealing.

But there is a tradeoff. ⚖️

PPO uses a learned value function, or critic. This critic is expensive to train and harder to tune. It also has its own problems: the policy keeps changing, so the value model is always chasing a moving target. That can introduce bias.

GRPO avoids that by using a baseline sampled from the current policy, so it is always up to date. Its estimate is closer to low-bias, but it tends to have higher variance.

For early LLM RL tasks, that tradeoff made sense
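The sampled-baseline idea described above can be sketched in a few lines of Python. This is a minimal illustration, not GLM's or any library's implementation: the function name `group_relative_advantages`, the `normalize` flag, and the example rewards are all hypothetical. Dividing by the group standard deviation is how GRPO is commonly described, but the post itself only mentions the group average, so normalization is left optional here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=False, eps=1e-8):
    """Score each sampled response against the mean reward of its own group.

    `rewards` holds one scalar reward per response, all sampled for the
    same prompt. No learned critic is involved: the baseline is just the
    group average. (Hypothetical helper for illustration only.)
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    # Optional variant: rescale by the group's standard deviation.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


# Hypothetical rewards for four responses to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.2f}")
```

Because the baseline is recomputed from fresh samples at every step, it never lags behind the policy the way a learned critic can. The cost is that it rests on only a handful of samples per prompt, which is where the higher variance comes from.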
