
Why Did GLM-5.2 Move Away From GRPO?

8h ago

Source

Twitter / X: Why Did GLM-5.2 Move Away From GRPO? (zhihu.com)
Snippet from the RSS feed
🌟 Insights from Zhihu contributor 划水的青蛙

TL;DR: GRPO is still a good algorithm, but long-horizon agentic tasks break the assumptions that made it work well for short, verifiable tasks. The problem is not just theoretical. It also involves reward sparsity, credit assignment, and throughput pressure.

GLM-5.2 moving away from GRPO is a practical correction rather than a rejection of GRPO itself. GRPO is still useful. It is just no longer the best algorithm for long-horizon tasks.

Think back to late 2024 and early 2025. Most models were still rough by today's standards. Models that could really handle long-horizon tasks, such as Claude Sonnet 4.5 and the Opus series with Claude Code, only became truly impressive later. Before that, models like DeepSeek R1 and OpenAI's o-series reasoning models were still mainly optimized for short, verifiable problems: math, coding unit tests, and similar tasks.

But the industry moved extremely fast. Long-horizon coding went from an idea to a real training target in a very short time. If we force GRPO into long-horizon training, two problems become very clear: sparse rewards on the algorithm side, and painful throughput pressure on the engineering side.

⚙️ The Engineering Problem

For long-horizon tasks, the hardest trade-off is throughput vs. sample diversity. If you train on short tasks first and long tasks later, the gradient signal may swing violently. If you mix them together, short tasks finish early while long tasks keep running. The system then has to wait for the slowest rollout before it can score the whole group, which wastes a lot of compute. So even before the algorithm question comes up, the infrastructure pressure is already real.

🧠 The Algorithm Problem

GRPO originally worked because of three assumptions
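To make the two pressures described in this post concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not GLM-5.2's or any library's actual implementation: the function names `group_advantages` and `idle_fraction` are hypothetical, the advantage uses the commonly described group-relative form (reward minus group mean, divided by group standard deviation), and the idle-time model assumes one worker per rollout that stays reserved until the whole group can be scored.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage for one group of rollouts on the same prompt.

    Assumption: population standard deviation plus a small eps; real
    implementations may normalize differently.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def idle_fraction(durations):
    """Share of reserved worker time spent waiting for the slowest rollout.

    Assumption: each rollout holds one worker until the group is scored.
    """
    longest = max(durations)
    reserved = longest * len(durations)
    return 1 - sum(durations) / reserved


# Sparse reward: every rollout in the group fails, so there is no signal.
print(group_advantages([0, 0, 0, 0]))   # all advantages are 0.0

# One success among failures gives a non-zero learning signal.
print(group_advantages([1, 0, 0, 0]))

# Three short rollouts wait on one long rollout before scoring.
print(idle_fraction([1, 1, 1, 10]))
```

When every rollout in a group receives the same reward, which is common when a long task only pays out at the very end, all advantages collapse to zero and the group contributes no gradient. The second function shows the throughput side: with durations of 1, 1, 1, and 10 time units, roughly two thirds of the reserved worker time is spent idle.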
