Proxy-KD: A Novel Method for Knowledge Distillation from Black-Box Large Language Models

[Submitted on 13 Jan 2024 (v1), last revised 9 Nov 2024 (this version, v2)]

Summary

This paper introduces Proxy-KD, a knowledge distillation (KD) method for transferring capabilities from black-box large language models (LLMs) such as GPT-4 to smaller models. Because proprietary LLMs do not expose their internal states, traditional KD methods that rely on those states cannot be applied to them directly. Proxy-KD uses a proxy model to facilitate efficient knowledge transfer without requiring access to the teacher's internal states. In the authors' experiments, Proxy-KD improves on standard black-box KD and also surpasses traditional white-box KD techniques, which the authors present as a new direction for distilling knowledge from advanced LLMs.
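
To make the black-box versus white-box distinction concrete, here is a minimal sketch of the two generic kinds of distillation loss. It is not Proxy-KD's actual objective, which the summary does not describe. It only assumes that a black-box teacher yields just the tokens it emits, while a white-box model (for instance, a locally hosted proxy) can supply a full next-token distribution. The function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label_loss(student_logits, teacher_token):
    """Black-box style: cross-entropy against the single token the teacher emitted.

    Only the teacher's sampled output is needed, not its probabilities.
    """
    probs = softmax(student_logits)
    return -math.log(probs[teacher_token])


def soft_label_loss(student_logits, white_box_logits):
    """White-box style: KL(white-box model || student) over the whole vocabulary.

    Requires the full output distribution, which a black-box API does not expose.
    """
    p = softmax(white_box_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    student = [0.2, 1.5, -0.3, 0.0]   # toy logits over a 4-token vocabulary
    proxy = [0.1, 2.0, -1.0, 0.3]     # hypothetical white-box (proxy) logits
    print("hard-label loss:", hard_label_loss(student, teacher_token=1))
    print("soft-label loss:", soft_label_loss(student, proxy))
```

The soft-label loss carries information about every token in the vocabulary, whereas the hard-label loss sees only one token per position; that gap is the limitation of black-box distillation that the summary refers to.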

Source

Hacker News: Proxy-KD: A Novel Method for Knowledge Distillation from Black-Box Large Language Models (arxiv.org)

Key quotes

Given the exceptional performance of proprietary large language models (LLMs) like GPT-4, recent research has increasingly focused on boosting the capabilities of smaller models through knowledge distillation (KD) from these powerful yet black-box teachers.

To overcome this limitation, we introduce Proxy-KD, a novel method that uses a proxy model to facilitate the efficient transfer of knowledge from black-box LLMs to smaller models.

Our experiments show that Proxy-KD not only enhances the performance of KD from black-box teacher models but also surpasses traditional white-box KD techniques.

This approach presents a compelling new avenue for distilling knowledge from advanced LLMs.
