New Framework Formalizes Learning from Language Feedback with Provable Performance Guarantees
[Submitted on 12 Jun 2025 (v1), last revised 6 Jun 2026 (this version, v2)]
Summary
This paper formalizes the Learning from Language Feedback (LLF) problem, giving a principled framework for interactive learning in which the learner receives language feedback rather than traditional reward signals, and the reward itself stays latent. The authors state assumptions sufficient for learning in this setting and introduce the "transfer eluder dimension" as a measure of how hard an LLF problem is. They develop a no-regret algorithm, HELiX, that provably solves LLF problems through sequential interaction, with performance guarantees that scale with the transfer eluder dimension. They also exhibit cases where learning from rich language feedback is exponentially faster than learning from reward, and report empirical results across several domains in which HELiX performs well even when repeatedly prompting LLMs does not work reliably.
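To give a feel for the "exponentially faster" claim, here is a toy sketch. It is not the paper's construction and not HELiX. It assumes a hypothetical number-guessing game in which a learner must find a secret integer in `range(n)`. With reward-only feedback the learner hears just 0 or 1. With language feedback it hears "higher", "lower", or "correct", which lets it discard half of the remaining hypotheses each round. All function names are illustrative.

```python
def language_feedback(guess: int, secret: int) -> str:
    """Hypothetical feedback oracle: a short verbal hint instead of a reward."""
    if guess == secret:
        return "correct"
    return "higher" if guess < secret else "lower"


def rounds_with_reward_only(secret: int, n: int) -> int:
    """Binary reward carries no direction, so this learner tries candidates in order."""
    for rounds, guess in enumerate(range(n), start=1):
        reward = 1 if guess == secret else 0
        if reward == 1:
            return rounds
    raise ValueError("secret must lie in range(n)")


def rounds_with_language_feedback(secret: int, n: int) -> int:
    """Each hint rules out about half of the remaining hypotheses (binary search)."""
    lo, hi = 0, n - 1
    rounds = 0
    while lo <= hi:
        guess = (lo + hi) // 2
        rounds += 1
        hint = language_feedback(guess, secret)
        if hint == "correct":
            return rounds
        if hint == "higher":
            lo = guess + 1  # secret is above the guess
        else:
            hi = guess - 1  # secret is below the guess
    raise ValueError("secret must lie in range(n)")


if __name__ == "__main__":
    n = 1024
    worst_reward = max(rounds_with_reward_only(s, n) for s in range(n))
    worst_language = max(rounds_with_language_feedback(s, n) for s in range(n))
    print(f"worst case, reward only:       {worst_reward} rounds")
    print(f"worst case, language feedback: {worst_language} rounds")
```

In this toy game the worst case grows linearly in `n` under reward-only feedback and logarithmically under language feedback, which is the kind of gap the summary describes. The paper's formal statement is expressed through the transfer eluder dimension rather than this game.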
Key quotes
We formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce transfer eluder dimension as a measure to characterize the hardness of LLF.
We formalize the intuition that information in the language feedback governs the learning complexity, and demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward.
We develop a no-regret algorithm, called HELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension.
Across several empirical domains, we show that HELiX performs well even when repeatedly prompting LLMs does not work reliably.
Our contributions mark an important step towards designing principled interactive learning algorithms using generic language feedback.
