MiniDisc

This repository contains the code for the paper MiniDisc: Minimal Distillation Schedule for Language Model Compression.

**************************** Updates ****************************

<!-- Thanks for your interest in our repo! -->

Quick Links

Overview

Recent studies have uncovered that language model distillation is less effective when there is a large capacity gap between the teacher and the student, and have introduced teacher assistant-based distillation to bridge the gap. Since the teacher assistant serves as the connection, its scale and performance are of vital importance for bringing the knowledge from the teacher to the student. However, existing teacher assistant-based methods require a maximal number of trials before an optimal teacher assistant can be scheduled. To this end, we propose a minimal distillation schedule (MiniDisc) that schedules the optimal teacher assistant in minimally one trial. In particular, motivated by the finding that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant, MiniDisc is designed with a λ-tradeoff that measures the optimality of the teacher assistant without trial distillation to the student. MiniDisc can then schedule the optimal teacher assistant, i.e., the one with the best λ-tradeoff, within a sandwich framework. MiniDisc is evaluated with an extensive set of experiments on GLUE. Experimental results demonstrate the improved efficiency of MiniDisc compared to several state-of-the-art baselines. We further apply MiniDisc to a language model with billions of parameters and show its scalability.
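
As a minimal sketch of the scheduling idea (the exact λ-tradeoff is defined in the paper; the weighted combination of score and compression below, and all names, are illustrative assumptions):

```python
# Illustrative sketch only: the paper defines the lambda-tradeoff precisely;
# here it is approximated as a weighted sum of a candidate's task score and
# its compression ratio, and the candidate with the best tradeoff is scheduled
# as the teacher assistant without any trial distillation to the student.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str          # identifier of a teacher-assistant candidate
    score: float       # task metric of the candidate (e.g., dev accuracy)
    compression: float # fraction of parameters removed

def lambda_tradeoff(c: Candidate, lam: float = 0.5) -> float:
    # Hypothetical scale-performance tradeoff used to rank candidates.
    return (1.0 - lam) * c.score + lam * c.compression

candidates = [Candidate("ta-50%", 0.88, 0.50), Candidate("ta-70%", 0.86, 0.70)]
best = max(candidates, key=lambda_tradeoff)
print(best.name)
```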

<img src="./assets/minidisc_motivation.png" alt="minidisc" align=center/> <img src="./assets/minidisc_method.png" alt="minidisc" align=center/>

Getting Started

Requirements

:warning: This repository only includes task-specific distillation, not task-agnostic distillation, due to code discrepancies. If you are interested in that part, please refer to our more recent work MiniMoE.

GLUE & CoNLL Data

Download the GLUE data through the link, and the CoNLL data (in the exact CoNLL format) through another link. Put them into the corresponding directories; for example, the MRPC dataset should be placed in datasets/mrpc.
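
For a quick check that the data landed where the code expects it, a small sketch (only datasets/mrpc is named above; the conll directory and the file names are assumptions based on common GLUE/CoNLL conventions):

```python
# Sanity-check the expected dataset layout. Only datasets/mrpc is stated in the
# README; datasets/conll and the file names below are assumptions.
from pathlib import Path

expected = {
    "datasets/mrpc": ["train.tsv", "dev.tsv"],
    "datasets/conll": ["train.txt", "dev.txt"],
}

for directory, files in expected.items():
    for name in files:
        path = Path(directory) / name
        print(f"{path}: {'found' if path.exists() else 'missing'}")
```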

Distillation

The distillation is carried out with several scripts. We provide example scripts in the following sections.

Finetuning

We do not provide scripts for finetuning teacher models, but you can find them in our previous work StarK, along with finetuned checkpoints. Alternatively, you can use our code for finetuning by simply ignoring the teacher models; an example is bert_scripts/run_finetuning_conll.sh.
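
For reference, a minimal, hedged sketch of what task-specific finetuning of a BERT teacher on a GLUE-style task looks like with HuggingFace Transformers (this is not the repo's entry point; the model name, hyperparameters, and toy batch are illustrative):

```python
# Conceptual finetuning step for a sentence-pair task such as MRPC; the actual
# training loop, data loading, and hyperparameters live in the repo's scripts.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy batch standing in for MRPC-style sentence pairs.
batch = tokenizer(["a sentence", "another sentence"],
                  ["its pair", "a different pair"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
optimizer.zero_grad()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
```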

Sparsification

We provide example scripts for sparsifying/pruning finetuned teacher models. The pruned models are then used to initialize the student models. For example, bert_scripts/run_sparsification_mrpc.sh is used to prune a teacher model finetuned on MRPC. We explain some key arguments in the following:
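
As a rough illustration of what sparsification to a target sparsity involves (magnitude pruning is shown here as an assumption and may differ from the repo's actual procedure):

```python
# Illustrative magnitude pruning: zero out the smallest-magnitude weights in
# every linear layer until roughly a target sparsity is reached. The pruned
# model can then serve to initialize a parameter-sparsified student.
import torch
from transformers import AutoModelForSequenceClassification

def magnitude_prune_(model: torch.nn.Module, sparsity: float) -> None:
    """Prune each nn.Linear in place to approximately the given fraction of zeros."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            weight = module.weight.data
            k = int(weight.numel() * sparsity)
            if k == 0:
                continue
            threshold = weight.abs().flatten().kthvalue(k).values
            weight.mul_((weight.abs() > threshold).float())

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
magnitude_prune_(model, sparsity=0.7)  # e.g., keep roughly 30% of linear weights
```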

(Conventional) Distillation

We provide example scripts for conventionally distilling finetuned teacher models into layer-dropped or parameter-sparsified student models. For example, bert_scripts/run_distillation_mrpc.sh is used to distill a teacher model finetuned on MRPC into a properly initialized (either layer-dropped or parameter-sparsified) student model. We explain some key arguments in the following:
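
For orientation, the conventional distillation objective can be sketched as a temperature-scaled KL term between teacher and student logits plus the supervised loss; the weighting and temperature below are illustrative, not the repo's defaults:

```python
# Sketch of a standard knowledge distillation loss for classification.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random logits for a binary task.
student, teacher = torch.randn(4, 2), torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
print(distillation_loss(student, teacher, labels).item())
```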

MaxiDisc

We provide example scripts for distilling finetuned teacher models via teacher assistants with maximal effort. For example, bert_scripts/run_maxidisc_mrpc.sh is used to distill a teacher model finetuned on MRPC into a properly initialized (either layer-dropped or parameter-sparsified) student model via teacher assistants. Here, the optimal teacher assistant has to be found through many trials. We explain some important arguments in the following:
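
To see why this is costly, a toy sketch of the maximal search (distill() and evaluate() are placeholders for full training and evaluation runs, and the candidate grid is illustrative):

```python
# Every teacher-assistant candidate requires a full trial distillation down to
# the student before the best candidate can be identified.
import random

def distill(source: str, target: str) -> str:
    # Placeholder for a full distillation run (hours of training in practice).
    return f"{target}<-{source}"

def evaluate(model: str) -> float:
    # Placeholder for dev-set evaluation.
    return random.random()

candidate_sparsities = [0.5, 0.6, 0.7, 0.8, 0.9]
trials = []
for s in candidate_sparsities:
    assistant = distill("teacher", f"ta@{s:.0%}")  # teacher -> assistant
    student = distill(assistant, "student")        # assistant -> student (one trial)
    trials.append((evaluate(student), assistant))

best_score, best_assistant = max(trials)
print(f"{best_assistant} chosen after {len(trials)} trial distillations")
```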

MiniDisc

We provide example scripts for distilling finetuned teacher models via teacher assistants with minimal effort. For example, bert_scripts/run_minidisc_mrpc.sh is used to distill a teacher model finetuned on MRPC into a properly initialized (either layer-dropped or parameter-sparsified) student model via teacher assistants. Here, the optimal teacher assistant is found in only one trial. We explain some important arguments in the following:
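
By contrast, a toy sketch of the minimal schedule (the assistants' own scores stand in for what the sandwich framework provides, and the 0.5/0.5 tradeoff weighting is an assumption; the exact λ-tradeoff is given in the paper):

```python
# Rank assistant candidates by a scale-performance tradeoff computed from the
# assistants themselves (no trial distillation to the student), then distill
# only the winner. All helpers are placeholders for real runs.
import random

def evaluate_assistant(sparsity: float) -> float:
    # Placeholder for the assistant's dev score from the sandwich framework.
    return random.random()

def distill(source: str, target: str) -> str:
    # Placeholder for a single, full distillation run.
    return f"{target}<-{source}"

candidate_sparsities = [0.5, 0.6, 0.7, 0.8, 0.9]
scores = {s: evaluate_assistant(s) for s in candidate_sparsities}
best = max(candidate_sparsities, key=lambda s: 0.5 * scores[s] + 0.5 * s)
student = distill(f"ta@{best:.0%}", "student")  # minimally one trial
print(student)
```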

:warning: In our experiments, we find that the optimal teacher assistants rarely fall at sparsities below 50%. We therefore truncate the set of teacher assistant candidates according to this observation, leading to a further speedup in practice. However, this heuristic may not fit all cases (e.g., large language models), so we do not include it in the paper.
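
In code, the heuristic amounts to dropping candidates below the 50% mark before scheduling (the candidate grid itself is illustrative):

```python
# Truncate the teacher-assistant candidate set to sparsities of at least 50%.
candidate_sparsities = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
truncated = [s for s in candidate_sparsities if s >= 0.5]
print(truncated)  # [0.5, 0.6, 0.7, 0.8, 0.9]
```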

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Chen (czhang@bit.edu.cn). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in as much detail as possible so we can help you better and more quickly!

Citation

Please cite our paper if you use the code in your work:

@inproceedings{zhang2022minidisc,
   title={MiniDisc: Minimal Distillation Schedule for Language Model Compression},
   author={Zhang, Chen and Yang, Yang and Wang, Qifan and Liu, Jiahao and Wang, Jingang and Xian, Yunsen and Wu, Wei and Song, Dawei},
   booktitle={arXiv},
   year={2022}
}