Awesome Knowledge Distillation of LLM Papers
<!-- Big font size --> <h2 align="center"> A Survey on Knowledge Distillation of Large Language Models </h2>

<p align="center"> Xiaohan Xu<sup>1</sup>   Ming Li<sup>2</sup>   Chongyang Tao<sup>3</sup>   Tao Shen<sup>4</sup>   Reynold Cheng<sup>1</sup>   Jinyang Li<sup>1</sup>   Can Xu<sup>5</sup>   Dacheng Tao<sup>6</sup>   Tianyi Zhou<sup>2</sup> </p>

<p align="center"> <sup>1</sup> The University of Hong Kong    <sup>2</sup> University of Maryland    <sup>3</sup> Microsoft    <sup>4</sup> University of Technology Sydney    <sup>5</sup> Peking University    <sup>6</sup> The University of Sydney </p>

<div align="center"> <img src="imgs/framework.png" width="700"><br> </div>

A collection of papers related to knowledge distillation of large language models (LLMs). If you want to use LLMs to improve the training of your own smaller models, or to improve a model with its own self-generated knowledge, take a look at this collection.
We will update this collection every week. Feel free to star ⭐️ this repo to keep track of the updates.
❗️Legal Consideration: Please note the legal implications of utilizing LLM outputs, such as those from ChatGPT (Restrictions), Llama (License), etc. We strongly advise users to adhere to the terms of use specified by the model providers, e.g., restrictions on developing competing products.
💡 News
- 2024-2-20: 📃 We released the survey paper "A Survey on Knowledge Distillation of Large Language Models". You are welcome to read and cite it; we look forward to your feedback and suggestions.
Update Log
- 2024-3-19: Add 14 papers.
Contributing to This Collection
Feel free to open an issue/PR or e-mail shawnxxh@gmail.com, minglii@umd.edu, hishentao@gmail.com and chongyangtao@gmail.com if you find any missing taxonomies or papers. We will keep updating this collection and survey.
📝 Introduction
KD of LLMs: This survey delves into knowledge distillation (KD) techniques in Large Language Models (LLMs), highlighting KD's crucial role in transferring advanced capabilities from proprietary LLMs like GPT-4 to open-source counterparts such as LLaMA and Mistral. We also explore how KD enables the compression and self-improvement of open-source LLMs by using them as teachers.
KD and Data Augmentation: Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts.
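As a concrete, purely illustrative example of DA in the service of KD, the sketch below expands a few seed instructions with a teacher LLM into harder variants and then labels them with teacher responses, producing (instruction, response) pairs for student training. The `query_teacher` helper and the prompt template are hypothetical placeholders, not part of the survey or of any listed paper; wire them to whatever teacher model and prompt you actually use.

```python
# Hypothetical sketch of expansion-style data augmentation for KD.
# `query_teacher` is a placeholder for a call to a teacher LLM API.

import json


def query_teacher(prompt: str) -> str:
    """Placeholder: send `prompt` to a teacher LLM and return its text output."""
    raise NotImplementedError("Wire this to your teacher LLM of choice.")


EXPAND_TEMPLATE = (
    "Rewrite the following instruction into a more challenging version that "
    "requires deeper reasoning. Keep it self-contained.\n\nInstruction: {instruction}"
)


def expand_and_label(seed_instructions):
    """Expand seed instructions with the teacher, then label them with teacher responses."""
    distilled = []
    for seed in seed_instructions:
        harder = query_teacher(EXPAND_TEMPLATE.format(instruction=seed))
        answer = query_teacher(harder)
        distilled.append({"instruction": harder, "response": answer})
    return distilled


if __name__ == "__main__":
    seeds = ["Explain what knowledge distillation is in one paragraph."]
    try:
        print(json.dumps(expand_and_label(seeds), indent=2))
    except NotImplementedError as err:
        print(err)
```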
Taxonomy: Our analysis is meticulously structured around three foundational pillars: algorithm, skill, and verticalization -- providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields.
KD Algorithms: For KD algorithms, we categorize them into two principal steps: "Knowledge Elicitation", which focuses on eliciting knowledge from teacher LLMs, and "Distillation Algorithms", which center on injecting this knowledge into student models.
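The data-generation side of this pipeline is sketched above; to make the "Distillation Algorithms" step equally concrete, here is a minimal, assumption-laden sketch of a divergence-based objective (the white-box family listed under "Divergence and Similarity" below), in which the student's next-token distribution is pulled toward the teacher's via KL divergence. Random logits stand in for real model outputs; this is an illustration, not the method of any specific paper.

```python
# Minimal sketch of a divergence-based distillation objective (white-box KD).
# Random tensors stand in for real teacher/student logits over a shared vocabulary.

import torch
import torch.nn.functional as F


def kd_kl_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL(teacher || student) over next-token distributions, averaged over the batch."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return loss * (t ** 2)  # conventional temperature scaling


if __name__ == "__main__":
    batch, seq_len, vocab = 2, 8, 32000
    student_logits = torch.randn(batch * seq_len, vocab)
    teacher_logits = torch.randn(batch * seq_len, vocab)
    print(kd_kl_loss(student_logits, teacher_logits).item())
```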
<div align="center"> <img src="imgs/knowledge.png" width="600"><br> <em>Figure: An illustration of different knowledge elicitation methods from teacher LLMs.</em> </div> <br>Skill Distillation: We delve into the enhancement of specific cognitive abilities, such as context following, alignment, agent, NLP task specialization, and multi-modality.
Verticalization Distillation: We explore the practical implications of KD across diverse fields, including law, medical & healthcare, finance, science, and miscellaneous domains.
Note that both Skill Distillation and Verticalization Distillation rely on the Knowledge Elicitation and Distillation Algorithms described under KD Algorithms, so the two categories overlap. However, they offer different perspectives on the same papers.
Why KD of LLMs?
In the era of LLMs, KD of LLMs plays the following crucial roles:
<div align="center"> <img src="imgs/kd_role_bg.png" width="400"><br> </div> <br>Role | Description | Trend |
---|---|---|
① Advancing Smaller Models | Transferring advanced capabilities from proprietary LLMs to smaller models, such as open-source LLMs or other smaller models. | Most common |
② Compression | Compressing open-source LLMs to make them more efficient and practical. | More popular with the prosperity of open-source LLMs |
③ Self-Improvement | Refining open-source LLMs' performance by leveraging their own knowledge, i.e. self-knowledge. | New trend to make open-source LLMs more competitive |
📒 Table of Contents
KD Algorithms
Knowledge Elicitation
Labeling
Expansion
Curation
Feature
Feedback
Self-Knowledge
Distillation Algorithms
Supervised Fine-Tuning
Due to the large number of works applying supervised fine-tuning, we only list the most representative ones here.
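For orientation, here is a minimal sketch of the objective such works typically optimize on distilled data: next-token cross-entropy over the teacher-written response tokens, with prompt positions masked out. The toy tensors stand in for a real tokenizer and student model, so treat every detail as an assumption rather than any particular paper's recipe.

```python
# Sketch of supervised fine-tuning on distilled (instruction, response) pairs:
# cross-entropy on response tokens only, with prompt positions masked to -100.

import torch
import torch.nn.functional as F


def sft_loss(logits, labels):
    """Next-token cross-entropy; positions labeled -100 (e.g. the prompt) are ignored."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )


if __name__ == "__main__":
    batch, seq_len, vocab = 2, 16, 1000
    logits = torch.randn(batch, seq_len, vocab)          # stand-in for student outputs
    labels = torch.randint(0, vocab, (batch, seq_len))   # stand-in for token ids
    labels[:, :6] = -100                                 # mask the instruction/prompt part
    print(sft_loss(logits, labels).item())
```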
Divergence and Similarity
Reinforcement Learning
Rank Optimization
Skill Distillation
Context Following
Instruction Following
Multi-turn Dialogue
RAG Capability
Alignment
Thinking Pattern
Preference
Value
Agent
Tool Using
Planning
NLP Task Specialization
NLU
NLG
Information Retrieval
Recommendation
Text Generation Evaluation
Code
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Magicoder: Source Code Is All You Need | arXiv | 2023-12 | Github | Data <br> Data |
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation | arXiv | 2023-12 | ||
Instruction Fusion: Advancing Prompt Evolution through Hybridization | arXiv | 2023-12 | ||
MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning | arXiv | 2023-11 | Github | Data <br> Data |
LLM-Assisted Code Cleaning For Training Accurate Code Generators | arXiv | 2023-11 | ||
Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation | EMNLP | 2023-10 | Github | |
Code Llama: Open Foundation Models for Code | arXiv | 2023-08 | Github | |
Distilled GPT for Source Code Summarization | arXiv | 2023-08 | Github | Data |
Textbooks Are All You Need: A Large-Scale Instructional Text Data Set for Language Models | arXiv | 2023-06 | ||
Code Alpaca: An Instruction-following LLaMA model for code generation | - | 2023-03 | Github | Data |
Multi-Modality
Summary Table
<div align="center"> <img src="imgs/table.jpg"><br> <em>Figure: A summary of representative works about skill distillation.</em> </div> <br>Verticalization Distillation
Law
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Fuzi | - | 2023-08 | Github | |
ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases | arXiv | 2023-06 | Github | |
Lawyer LLaMA Technical Report | arXiv | 2023-05 | Github | Data |
Medical & Healthcare
Finance
Title | Venue | Date | Code | Data |
---|---|---|---|---|
XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters | CIKM | 2023-05 | | |
Science
Misc.
Title | Venue | Date | Code | Data |
---|---|---|---|---|
OWL: A Large Language Model for IT Operations | arXiv | 2023-09 | Github | Data |
EduChat: A Large-Scale Language Model-based Chatbot System for Intelligent Education | arXiv | 2023-08 | Github | Data |
Encoder-based KD
Note: Our survey mainly focuses on generative LLMs, so encoder-based KD is not covered in the survey. However, we are also interested in this topic and will keep updating the latest works in this area here.
Title | Venue | Date | Code | Data |
---|---|---|---|---|
Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling | Findings of ACL | 2023-08 | ||
Better Together: Jointly Using Masked Latent Semantic Modeling and Masked Language Modeling for Sample Efficient Pre-training | CoNLL | 2023-08 |
Citation
If you find this repository helpful, please consider citing the following paper:
```bibtex
@misc{xu2024survey,
  title={A Survey on Knowledge Distillation of Large Language Models},
  author={Xiaohan Xu and Ming Li and Chongyang Tao and Tao Shen and Reynold Cheng and Jinyang Li and Can Xu and Dacheng Tao and Tianyi Zhou},
  year={2024},
  eprint={2402.13116},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```