HSIR: New Method Improves Self-Improvement Training for Large Reasoning Models

By

@ai-firehose.column.social

5d ago · 2 min read · en · Insight

Summary

This research paper identifies two key problems in self-improvement training for Large Reasoning Models (LRMs): data imbalance (most self-generated training samples are simple, while challenging ones are scarce) and overthinking (samples with redundant reasoning steps end up in the training data). The authors propose HSIR (Harnessing Self-Improvement in large Reasoning models), which uses a verify-then-exit sampling strategy to address data imbalance and an Intrinsic Diversity score to filter out overthinking samples. They also introduce H-GRPO, an enhanced reinforcement learning algorithm. Reported results show average performance gains of up to +10.9% and a relative reduction in inference overhead of up to 42.4%.
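The summary names the verify-then-exit strategy but does not spell out its procedure. As a rough illustration only, here is one plausible reading: keep sampling trajectories for a prompt until one passes a correctness check, then stop, so harder prompts receive more attempts while easy prompts exit after a single sample. The function names (`verify_then_exit`, `build_training_set`), the `generate` and `verify` callables, and the `max_samples` budget are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional, Tuple


def verify_then_exit(
    prompt: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_samples: int = 8,
) -> Optional[str]:
    """Sample up to max_samples trajectories and return the first verified one.

    Assumption: 'generate' stands in for the model's sampler and 'verify'
    for an answer checker. Neither is specified in the summary.
    """
    for _ in range(max_samples):
        trajectory = generate(prompt)
        if verify(prompt, trajectory):
            # Exit as soon as a trajectory verifies, so easy prompts
            # do not consume the whole sampling budget.
            return trajectory
    # No verified trajectory within the budget: skip this prompt.
    return None


def build_training_set(
    prompts: List[str],
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_samples: int = 8,
) -> List[Tuple[str, str]]:
    """Collect at most one verified (prompt, trajectory) pair per prompt."""
    data = []
    for prompt in prompts:
        trajectory = verify_then_exit(prompt, generate, verify, max_samples)
        if trajectory is not None:
            data.append((prompt, trajectory))
    return data
```

Under this reading, each prompt contributes at most one sample regardless of difficulty, which is one way the imbalance toward simple samples could be reduced. The Intrinsic Diversity filter and H-GRPO are not sketched here because the summary gives no definition for either.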

Key quotes

Self-improvement training enables the large reasoning models (LRMs) to improve themselves by self-generating reasoning trajectories as training data without external supervision.
We reveal two problems: (1) data imbalance, where most training samples are simple, but the challenging yet crucial samples are scarce; (2) overthinking, where many undesired samples with redundant reasoning steps are used for self-training.
HSIR not only effectively enhances the reasoning performance, i.e., bringing up to +10.9% average performance gains, but also significantly improves the reasoning efficiency by reducing up to 42.4% relative inference overhead.
Snippet from the RSS feed
Self-improvement training enables the large reasoning models (LRMs) to improve themselves by self-generating reasoning trajectories as training data without external supervision. However, we find that this method often falls short in complex reasoning tas
