Proxy-KD: A Novel Method for Knowledge Distillation from Black-Box Large Language Models
[Submitted on 13 Jan 2024 (v1), last revised 9 Nov 2024 (this version, v2)]
Summary
This paper introduces Proxy-KD, a knowledge distillation (KD) method for transferring capabilities from black-box large language models such as GPT-4 to smaller models. Proprietary LLMs do not expose their internal states, so distillation techniques that rely on those states cannot be applied to them directly. Proxy-KD places a proxy model between teacher and student so that knowledge can be transferred efficiently without access to the teacher's internals. In the authors' experiments, Proxy-KD improves on standard black-box KD and also surpasses traditional white-box distillation techniques, which the authors present as a new direction for distilling knowledge from advanced LLMs.
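The difference between the two training signals mentioned in the summary can be illustrated with a toy sketch. It is not the paper's objective: the function names, the mixing weight `alpha`, and the choice to simply add the two terms are assumptions made here for illustration. A black-box teacher yields only the token it emitted, while a white-box model (such as a proxy) can yield a full distribution over the vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label_loss(student_logits, teacher_token):
    """Black-box signal: cross-entropy against the single token the teacher emitted."""
    return -math.log(softmax(student_logits)[teacher_token])


def soft_label_loss(student_logits, proxy_logits):
    """White-box signal: KL(proxy || student) over the whole vocabulary."""
    p = softmax(proxy_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(student_logits, teacher_token, proxy_logits, alpha=0.5):
    """Hypothetical mix of both signals; alpha is an illustrative weight."""
    hard = hard_label_loss(student_logits, teacher_token)
    soft = soft_label_loss(student_logits, proxy_logits)
    return alpha * hard + (1 - alpha) * soft


# Toy vocabulary of four tokens; the teacher emitted token 2.
student = [0.0, 0.0, 0.0, 0.0]
proxy = [0.1, 0.2, 2.0, 0.3]
loss = combined_loss(student, teacher_token=2, proxy_logits=proxy)
```

With `alpha=1.0` the sketch reduces to plain black-box KD on teacher outputs, and with `alpha=0.0` it reduces to white-box KD from the proxy alone.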
Key quotes
Given the exceptional performance of proprietary large language models (LLMs) like GPT-4, recent research has increasingly focused on boosting the capabilities of smaller models through knowledge distillation (KD) from these powerful yet black-box teachers.
To overcome this limitation, we introduce Proxy-KD, a novel method that uses a proxy model to facilitate the efficient transfer of knowledge from black-box LLMs to smaller models.
Our experiments show that Proxy-KD not only enhances the performance of KD from black-box teacher models but also surpasses traditional white-box KD techniques.
This approach presents a compelling new avenue for distilling knowledge from advanced LLMs.