Feedback Distillation: A New Training Method for Improving LLM Reasoning in Theorem Proving
[Submitted on 29 May 2026]
Summary
This paper introduces Feedback Distillation, a novel training method for reasoning models that improves upon standard GRPO (Group Relative Policy Optimization). The method trains a model to match its own token-level distribution conditioned on privileged feedback from a language model, offering denser supervision and better exploration. Applied to Lean4 theorem-proving, Feedback Distillation maintains greater trajectory diversity and higher policy entropy than GRPO, and combining both methods (initializing GRPO from a Feedback Distillation checkpoint) outperforms either approach alone.
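The token-level objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the loss is a per-token KL divergence from the feedback-conditioned distribution (treated as a fixed target) to the distribution the same model produces without feedback, averaged over the trajectory. The direction of the KL, the averaging, and all names (`log_softmax`, `token_kl`, `feedback_distillation_loss`) are assumptions made for illustration, and the logits are toy values.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(target_logits, policy_logits):
    """KL(target || policy) at a single token position.

    Assumption: target = model conditioned on privileged feedback,
    policy = same model without the feedback.
    """
    lp = log_softmax(target_logits)
    lq = log_softmax(policy_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def feedback_distillation_loss(target_seq, policy_seq):
    """Mean per-token KL over a trajectory (hypothetical loss form)."""
    if len(target_seq) != len(policy_seq):
        raise ValueError("sequences must have the same number of tokens")
    kls = [token_kl(t, p) for t, p in zip(target_seq, policy_seq)]
    return sum(kls) / len(kls)

# Toy trajectory: 2 tokens, vocabulary of 3.
with_feedback = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
without_feedback = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
loss = feedback_distillation_loss(with_feedback, without_feedback)
print(f"distillation loss: {loss:.4f}")
```

Because every token position contributes its own term, a loss of this shape gives a learning signal at each step of a proof attempt, in contrast to a single scalar reward per trajectory.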
Key quotes
Feedback Distillation offers token-level supervision and can inject external knowledge.
Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling.
The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone.
All in all, our results suggest a promising avenue to improve post-training for complex reasoning.
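For readers unfamiliar with the pass@k metric mentioned in the quotes, the sketch below shows the commonly used unbiased estimator: given n sampled attempts of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The source does not state which estimator the paper uses, so this is an assumption, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn per problem
    c: number of those samples that are correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 proof attempts, 3 of which are verified correct.
for k in (1, 2, 5):
    print(k, round(pass_at_k(10, 3, k), 4))
```

A policy that keeps more diverse trajectories tends to gain more from larger k, which is the sense in which the quote speaks of better pass@k scaling.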
