GLMD
GLMD: General Language Model Distillation without Intermediate Layer Features and Hard Labels
- GLMD eliminates the tedious work on intermediate layers and gold labels, which allows distillation between different model structures without a labeled dataset or the selection of intermediate layers (a minimal sketch of the logits-only objective follows this list).
- GLMD introduces a novel vocabulary compression method that further helps reduce the final model size.
- GLMD is implemented based on the GKD framework.
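To make the idea concrete, here is a minimal, illustrative Python sketch of a logits-only distillation objective (not the repository's implementation; the function name and the temperature default of 15, taken from the GLMD commands below, are assumptions for illustration): the student matches the teacher's softened output distribution, so no intermediate-layer features and no hard (gold) labels are required.

import torch.nn.functional as F

def logits_distill_loss(student_logits, teacher_logits, temperature=15.0):
    # student_logits, teacher_logits: (batch, seq_len, vocab_size)
    s_log_prob = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between the softened distributions, scaled by T^2 as is standard in KD
    return F.kl_div(s_log_prob, t_prob, reduction="batchmean") * temperature ** 2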
For more details about the techniques of GLMD, refer to our paper:
Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method
Related Model Files
Download link: https://pan.baidu.com/s/1Q2lUY96Ix5emMAb-fkJgbQ?pwd=wwm8
- The fine-tuning teacher models for glm-large, glm-base, glm-2b, glm-10b, and ibglm-large.
- The best-performing models for GLMD, along with models from some of the other methods.
- The pretraining teacher models for ibglm-large and alglm-base.
GKD
GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model
- It provides a flexible architecture for efficiently implementing various language model distillation methods, and it allows these methods to be used in combination.
- The toolkit integrates techniques such as model parallelism (Megatron-LM) and ZeRO (DeepSpeed) to make it efficient for distilling very large models.
For more details about the techniques of GKD, refer to our paper:
GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model
Get Started
Docker Image
We provide a Docker image based on Python 3.8.13, PyTorch 1.9.1, and CUDA 11.1. You can pull the pre-built image from Docker Hub and run it with Docker v19.03+:
docker run --gpus all --rm -it --ipc=host aitsc/glm:v1.5
Manual Installation
git clone https://github.com/aitsc/GLMKD
cd GLMKD
conda create -n GLMKD python=3.8
conda activate GLMKD
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 22.04-dev
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd .. && rm -rf apex
Model Parallelism
If you encounter a CUDA out of memory error, which means your GPU memory is limited, you can use model parallelism to split the parameters across multiple GPUs. Take two-way model parallelism as an example. First run change_mp.py to split the checkpoint:
python change_mp.py path_to_the_checkpoint 2
Then change --model-parallel-size in the command to 2.
Usage of existing methods
We provide DeepSpeed commands for distilling GLM with all of the supported methods.
Suppose we want to distill a 12-layer teacher model into a 6-layer student model and evaluate it on the ReCoRD dataset. We first define four command prefixes that are independent of the specific method.
1. Prefix-pretrain: NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port=13761 --include=localhost:0,1 distill/pretrain.py --deepspeed_config=config/config_block_tiny6.json --deepspeed-activation-checkpointing --deepspeed --block-lm --num-layers=6 --hidden-size=768 --num-attention-heads=12 --max-position-embeddings=512 --tokenizer-model-type=bert-base-uncased --tokenizer-type=BertWordPieceTokenizer --fp16 --checkpoint-activations --model-parallel-size=1 --save-interval=5000 --save=../GLM/data/checkpoints/distill/tiny6 --experiment-name=test --bert-prob=1.0 --train-data=bert-base --split=949,50,1 --distributed-backend=nccl --lr-decay-style=cosine --lr-decay-iters=120000 --lr-decay-ratio=0.05 --warmup=.05 --train-iters=150000 --no-lazy-loader --resume-dataloader
2. Prefix-finetune: NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port=20696 --include=localhost:0 --hostfile= distill/finetune.py --finetune --cloze-eval --experiment-name=blank-tiny6-ReCoRD-test --task=ReCoRD --data-dir=../GLM/data/english_data/superglue/ReCoRD --save=../GLM/data/checkpoints/distill/tiny6/test/ft --seq-length=512 --checkpoint-activations --eval-batch-size=16 --save-epoch=100000 --block-lm --num-layers=6 --hidden-size=768 --num-attention-heads=12 --max-position-embeddings=512 --tokenizer-model-type=bert-base-uncased --tokenizer-type=BertWordPieceTokenizer --load-pretrained=../GLM/data/checkpoints/distill/tiny6/test --fp16 --lr-decay-style=linear --warmup=0.1 --weight-decay=1.0e-1 --pattern-id=0 --save-interval=10000 --log-interval=50 --eval-interval=1000 --eval-iters=100 --batch-size=8 --epochs=5 --lr=1e-5 --overwrite --deepspeed-activation-checkpointing --deepspeed --deepspeed_config=config/config_block_tiny6.json --custom_first_eval
3. Prefix-single-teacher: --teacher_load_pretrained=../GLM/data/checkpoints/pretrain/blocklm-base-blank --teacher_num_layers=12 --teacher_hidden_size=768 --teacher_num_attention_heads=12 --teacher_max_position_embeddings=512 --teacher_fp16
4. Prefix-multi-teacher: --mt_num_attention_heads=a1:a2 --mt_hidden_size=h1:h2 --mt_num_layers=l1:l2 --mt_max_position_embeddings=m1:m2 --mt_load_pretrained=p1:p2 --teacher_fp16
Then we can build the commands for the different methods by appending the method-specific flags to these prefixes, as sketched below (see distill/readme.md and distill/prepare.py for more detailed descriptions and parameters).
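As a purely illustrative example of how an entry expands, the hypothetical Python helper below simply concatenates the prefixes with the method-specific flags of the KD entry; the prefix variables stand for the full command strings defined in steps 1-3 above and must be pasted in by hand.

# Hypothetical helper: assemble the full command line for the KD fine-tuning entry.
prefix_finetune = "..."        # paste the full [Prefix-finetune] command from step 2 here
prefix_single_teacher = "..."  # paste the full [Prefix-single-teacher] arguments from step 3 here
kd_flags = "--student_model=kd --distill_ft_soft --distill_ft_hard --distill_temperature=10"

command = " ".join([prefix_finetune, prefix_single_teacher, kd_flags])
print(command)  # run the printed command in your shell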
GLMD-vc
- [Prefix-pretrain] [Prefix-single-teacher] --student_model=logitsdistil --distill_temperature=15 --logitsdistil_mask_pad
- [Prefix-finetune] [Prefix-single-teacher] --student_model=logitsdistil --distill_temperature=15 --logitsdistil_mask_pad
- [Prefix-finetune] [Prefix-single-teacher] --student_model=logitsdistil --distill_ft_soft --logitsdistil_wo_inter
KD
- [Prefix-finetune] [Prefix-single-teacher] --student_model=kd --distill_ft_soft --distill_ft_hard --distill_temperature=10
PD
- [Prefix-pretrain]
- [Prefix-finetune] [Prefix-single-teacher] --student_model=kd --distill_ft_soft --distill_temperature=1
TinyBERT
- [Prefix-pretrain] [Prefix-single-teacher] --student_model=tinybert
- [Prefix-finetune] [Prefix-single-teacher] --student_model=tinybert
- [Prefix-finetune] [Prefix-single-teacher] --student_model=tinybert --distill_ft_soft --tinybert_wo_inter
MiniLMv2
- [Prefix-pretrain] [Prefix-single-teacher] --student_model=minilmv2 --minilmv2_relation_heads=48 --minilmv2_teacher_layer=12
- [Prefix-finetune]
MiniLM
- [Prefix-pretrain] [Prefix-single-teacher] --student_model=minilm
- [Prefix-finetune]
DistilBERT
- [Prefix-pretrain] [Prefix-single-teacher] --student_model=distilbert --distill_temperature=2 --distilbert_alpha_ce=5 --distilbert_alpha_mlm=2 --distilbert_alpha_cos=1 --distilbert_cos_mask_padding
- [Prefix-finetune]
PKD
- [Prefix-finetune] [Prefix-single-teacher] --student_model=pkd --distill_ft_soft --distill_ft_soft_kl --distill_ft_hard --distill_temperature=10 --pkd_normalized_patience --pkd_alpha=0.5 --pkd_beta=100 --student_truncate_tn=0 --pkd_wo_final --pkd_only_cls
RAIL_KD
- from DistilBERT
- [Prefix-finetune] [Prefix-single-teacher] --student_model=rail_kd --distill_ft_soft --distill_soft_rate=0.3333 --distill_ft_hard --distill_hard_rate=0.3333 --distill_temperature=10 --rail_kd_inter_rate=0.3333 --rail_kd_layer_wise_alpha=1 --rail_kd_u=128 --rail_kd_concatenated --rail_kd_epochs=1 --rail_kd_show_hook_change
MGSKD
- from TinyBERT
- [Prefix-finetune] [Prefix-single-teacher] --student_model=mgskd --mgskd_weight_sample=4 --mgskd_weight_token=1 --mgskd_weight_span=1 --mgskd_sample_level_m=3 --mgskd_triplet_k1=20 --mgskd_triplet_k2=20
- [Prefix-finetune] [Prefix-single-teacher] --student_model=mgskd --distill_ft_soft --distill_ft_soft_kl --distill_temperature=1 --mgskd_wo_inter
DIITO
- [Prefix-pretrain](w/o --checkpoint-activations) [Prefix-single-teacher] --student_model=diito --forward_repeat_num=1 --diito_alignment=full --diito_interchange_prop=0.3 --diito_interchange_way=consecutive --diito_interchange_max_token=-1 --diito_alpha_mlm=0.25 --diito_alpha_ce=0.25 --diito_alpha_causal_ce=0.25 --diito_alpha_cos=0.25 --diito_alpha_causal_cos=0 --distill_pt_soft --distill_pt_hard --distill_temperature=2
- [Prefix-finetune]
SID
- [Prefix-pretrain]
- [Prefix-finetune] [Prefix-single-teacher] --student_model=sid --sid_accumulate_t=0 --sid_lim_e=avg --distill_ft_soft --distill_temperature=1
ALP_KD
- [Prefix-finetune] [Prefix-single-teacher] --student_model=alp_kd --alp_kd_lambda=0.2 --distill_soft_rate=0.7 --distill_hard_rate=0.1 --distill_temperature=20 --student_truncate_tn=0 --distill_ft_soft --distill_ft_hard
CKD
- [Prefix-pretrain]
- [Prefix-finetune] [Prefix-single-teacher] --student_model=ckd --ckd_window_size=21 --ckd_wrdist_w=1 --ckd_ltrdist_w=1 --ckd_wrangle_w=10 --ckd_ltrangle_w=10 --distill_ft_soft --distill_ft_hard --distill_temperature=3 --distill_soft_rate=0.9 --distill_hard_rate=0.1
Theseus
- [Prefix-finetune] [Prefix-single-teacher] --student_model=theseus --distill_ft_hard --student_truncate_tn=0 --theseus_replacing_rate=0.3 --theseus_not_replaced_steps=0.66 --mt_disable_operation=1
Universal_KD
- [Prefix-finetune] [Prefix-single-teacher] --student_model=universal_kd --distill_ft_soft --distill_ft_soft_kl --distill_soft_rate=0.5 --universal_kd_gamma=0.5 --student_truncate_tn=0 --universal_kd_size=0
- [Prefix-finetune]
LRC_BERT
- [Prefix-pretrain]
- [Prefix-finetune] [Prefix-single-teacher] --student_model=lrc_bert --lrc_bert_gard_perturb --ignore_first_backward_gard --forward_repeat_num=1 --lrc_bert_gather_dp --fix_variable_num_choices
- [Prefix-finetune] [Prefix-single-teacher] --student_model=lrc_bert --lrc_bert_alpha=1 --distill_ft_soft --distill_ft_soft_kl --distill_soft_rate=1 --distill_ft_hard --distill_hard_rate=3 --distill_temperature=1.1 --lrc_bert_gard_perturb --ignore_first_backward_gard --forward_repeat_num=1 --lrc_bert_gather_dp --fix_variable_num_choices
Annealing_KD
- [Prefix-pretrain]
- [Prefix-finetune] [Prefix-single-teacher] --student_model=annealing_kd --annealing_kd_max_t=7 --distill_ft_soft --distill_ft_soft_mse
- [Prefix-finetune]
MobileBERT
- [Prefix-pretrain] --inverted_bottleneck_mode --ib_hidden_size=1024 --ib_ffn_num=1 --hidden-size=512 --num-attention-heads=4 --ib_word_emb=128
- [Prefix-pretrain] [Prefix-single-teacher] --student_model=mobilebert --mobilebert_kd_w=0.5 --mobilebert_pkt_small_lr=0.1 --distill_pt_hard --inverted_bottleneck_mode --ib_hidden_size=128 --ib_ffn_num=4 --hidden-size=512 --num-attention-heads=4 --ib_word_emb=128 --teacher_inverted_bottleneck_mode --teacher_ib_hidden_size=1024 --teacher_ib_ffn_num=1 --teacher_hidden_size=512 --teacher_num_attention_heads=4 --teacher_ib_word_emb=128
- [Prefix-finetune] --inverted_bottleneck_mode --ib_hidden_size=128 --ib_ffn_num=4 --hidden-size=512 --num-attention-heads=4 --ib_word_emb=128
Continuation_KD
- [Prefix-pretrain]
- [Prefix-finetune] [Prefix-single-teacher] --student_model=continuation_kd --continuation_kd_max_t=10 --continuation_kd_margin=1 --continuation_kd_psi_sep=0.666 --continuation_kd_psi_denominator=1.333 --distill_ft_soft --distill_ft_soft_mse --distill_ft_hard
TMKD
- [Prefix-pretrain] [Prefix-multi-teacher] --student_model=kd --distill_pt_soft --distill_pt_soft_mse --multi_teacher_model=tmkd --student_truncate_tn=0
- [Prefix-finetune] [Prefix-multi-teacher] --student_model=kd --distill_ft_soft --distill_ft_soft_mse --distill_ft_hard --distill_hard_rate=1/teacher_num --multi_teacher_model=tmkd
MT-BERT
- [Prefix-finetune] [Prefix-multi-teacher] --student_model=pkd --distill_ft_soft --distill_temperature=1 --pkd_alpha=1 --pkd_beta=1 --student_truncate_tn=0 --multi_teacher_model=mt_bert --mt_has_loss --mt_bert_fit_teacher
Uncertainty
- [Prefix-finetune] [Prefix-multi-teacher] --student_model=kd --distill_ft_soft --distill_temperature=1 --distill_ft_soft_kl --distill_soft_rate=0.5 --distill_hard_rate=0.5 --student_truncate_tn=0 --multi_teacher_model=uncertainty --uncertainty_hard
RL-KD
- [Prefix-finetune] [Prefix-multi-teacher] --student_model=kd --distill_ft_soft --distill_temperature=10 --student_truncate_tn=0 --multi_teacher_model=rl_kd --rl_kd_only_avg --rl_kd_alpha=0.5 --rl_kd_semantic_model_dim=768
- [Prefix-finetune] [Prefix-multi-teacher](One more base teacher) --student_model=kd --distill_ft_soft --distill_temperature=10 --multi_teacher_model=rl_kd --rl_kd_reward=1 --rl_kd_semantic_model=0 --mt_has_loss --rl_kd_alpha=0.5 --fix_variable_num_choices
ALBERT
- [Prefix-pretrain] --compress_word_emb=128 --cross_layer_parameter_sharing
- [Prefix-finetune] --compress_word_emb=128 --cross_layer_parameter_sharing
Other
- TAKD simply replaces [Prefix-single-teacher] with the student from a previous training run of any of the single-teacher methods.
- DGKD replaces [Prefix-multi-teacher] with all the teachers and students previously trained with any of the multi-teacher methods.
- We also support more robust distillation using randomly perturbed data. For example, append the parameters --distill_random_data=replace --distill_random_data_n=1 --forward_repeat_num=0 --distill_random_data_method=sample to the end of the command.
Combined use of methods
For example (pt/ft1: TinyBERT + MiniLMv2 + MiniLM + DistilBERT; ft2: KD + TinyBERT):
- [Prefix-pretrain] [Prefix-single-teacher] --student_model=mixbaseline --distill_temperature=2 --minilmv2_relation_heads=48 --minilmv2_teacher_layer=12 --distilbert_alpha_ce=5 --distilbert_alpha_mlm=2 --distilbert_alpha_cos=1 --distilbert_cos_mask_padding --mixbaseline_inter_bl=TinyBERT,MiniLMv2,MiniLM,DistilBERT --mixbaseline_pre_bl_pt_soft=DistilBERT
- [Prefix-finetune] [Prefix-single-teacher] --student_model=mixbaseline --distill_temperature=2 --minilmv2_relation_heads=48 --minilmv2_teacher_layer=12 --distilbert_alpha_ce=5 --distilbert_alpha_mlm=2 --distilbert_alpha_cos=1 --distilbert_cos_mask_padding --mixbaseline_inter_bl=TinyBERT,MiniLMv2,MiniLM,DistilBERT --mixbaseline_pre_bl_ft_soft=DistilBERT
- [Prefix-finetune] [Prefix-single-teacher] --student_model=mixbaseline --mixbaseline_wo_inter --tinybert_wo_inter --distill_ft_soft --distill_ft_hard --distill_temperature=10 --mixbaseline_tinybert_t=1 --mixbaseline_pre_bl_ft_soft=TinyBERT
Examples
GLMD+al 110M-66M
pre-training stage
deepspeed --master_port=12761 --include=localhost:4,5,6,7 distill/pretrain.py --deepspeed_config=config/config_block_tiny6.json --deepspeed-activation-checkpointing --deepspeed --block-lm --num-layers=6 --hidden-size=768 --num-attention-heads=12 --max-position-embeddings=512 --tokenizer-model-type=bert-base-uncased --tokenizer-type=BertWordPieceTokenizer --fp16 --model-parallel-size=1 --save-interval=5000 --save=../GLM/data/checkpoints/distill/paper --experiment-name=test --bert-prob=1.0 --train-data=bert-base --split=949,50,1 --distributed-backend=nccl --lr-decay-style=cosine --lr-decay-iters=120000 --lr-decay-ratio=0.05 --warmup=.05 --train-iters=150000 --no-lazy-loader --resume-dataloader --teacher_load_pretrained=../GLM/data/checkpoints/pretrain/blocklm-base-blank --teacher_num_layers=12 --teacher_hidden_size=768 --teacher_num_attention_heads=12 --teacher_max_position_embeddings=512 --teacher_fp16 --student_model=logitsdistil --distill_temperature=15 --map_vocab_size=0.5 --distill_logit_mask_map --student_build_map_vocab --student_map_vocab_tn=0 --student_map_vocab_method=decoder --unmap_vocab_output --logitsdistil_mask_pad --compress_word_emb=384
task-specific stage
python -u distill/auto_tune.py --py_file=distill/finetune.py \
--gpus=2,3 \
--model=block_tiny6 \
--model_path=checkpoints/distill/paper/12.768-6.768_64-15w_glmd-dta_vc.5de-albert \
--task_t_load=base \
--tasks=record,copa,wsc,rte,boolq,wic,cb,multirc,wsc_generative \
--student_model=logitsdistil --distill_temperature=15 --map_vocab_size=0.5 --distill_logit_mask_map --unmap_vocab_output --logitsdistil_mask_pad --compress_word_emb=384 --del_checkpoint_activations \
--again_1__distill_ft_soft \
--again_1__distill_temperature=1 \
--again_1__logitsdistil_wo_inter \
--seed=1759 \
--ds_train_micro_batch_size_per_gpu="16;16;32;16;16;16;32;16;32" \
--ds_gradient_accumulation_steps=2 \
--rate_ds_train_micro_batch_size_per_gpu=0.25 \
--rate_ds_gradient_accumulation_steps=1 \
--ds_optimizer__params__lr="2E-05;5E-06;2E-05;2E-05;2E-05;2E-05;2E-05;5E-06;2E-05"
GLMD-vc 10B-2B
pre-training stage
deepspeed --master_port=31761 --include=localhost:0,1,2,3,4,5,6,7 distill/pretrain.py --deepspeed_config=config/config_block_tiny6.json --deepspeed-activation-checkpointing --deepspeed --block-lm --task-mask --num-layers=36 --hidden-size=2048 --num-attention-heads=32 --max-position-embeddings=1024 --tokenizer-type=GPT2BPETokenizer --checkpoint-activations --model-parallel-size=4 --save-interval=5000 --save=../GLM/data/checkpoints/distill/paper --experiment-name=10b --bert-prob=0.5 --gap-sentence-prob=0.3 --avg-block-length=3 --gpt-min-ratio=0.25 --block-mask-prob=0.1 --short-seq-prob=0.02 --train-data=bert-large --split=949,50,1 --distributed-backend=nccl --lr-decay-style=cosine --lr-decay-iters=120000 --lr-decay-ratio=0.1 --warmup=.04 --train-iters=150000 --no-lazy-loader --resume-dataloader --log-interval=50 --teacher_load_pretrained=../GLM/data/checkpoints/pretrain/blocklm-xxlarge_MP4 --teacher_num_layers=48 --teacher_hidden_size=4096 --teacher_num_attention_heads=64 --teacher_max_position_embeddings=1024 --teacher_fp16 --batch-size=4 --gradient-accumulation-steps=8 --args_to_ds_config --student_model=logitsdistil --distill_temperature=15 --logitsdistil_mask_pad
task-specific stage
python -u distill/auto_tune.py --py_file=distill/finetune.py \
--gpus=0,1,2,3,4,5,6,7 \
--model=model_blocklm_10B \
--num-layers=36 \
--hidden-size=2048 \
--num-attention-heads=32 \
--model_path=checkpoints/distill/paper/48.4096-36.2048_64-15w_glmd-vc-dta_MP4 \
--model-parallel-size=2 \
--task_t_load=10b \
--tasks=record \
--student_model=logitsdistil --distill_temperature=15 --logitsdistil_mask_pad \
--again_1__distill_ft_soft \
--again_1__distill_temperature=1 \
--again_1__logitsdistil_wo_inter \
--seed=1234 \
--ds_train_micro_batch_size_per_gpu=1 \
--ds_gradient_accumulation_steps=4 \
--ds_optimizer__params__lr=1E-05 --eval-batch-size=4
TinyBERT 340M-66M
pre-training stage
deepspeed --master_port=18161 --include=localhost:4,5,6,7 distill/pretrain.py --deepspeed_config=config/config_block_tiny6.json --deepspeed-activation-checkpointing --deepspeed --block-lm --num-layers=6 --hidden-size=768 --num-attention-heads=16 --max-position-embeddings=512 --tokenizer-model-type=bert-base-uncased --tokenizer-type=BertWordPieceTokenizer --fp16 --checkpoint-activations --model-parallel-size=1 --save-interval=5000 --save=../GLM/data/checkpoints/distill/paper --experiment-name=test --bert-prob=1.0 --train-data=bert-base --split=949,50,1 --distributed-backend=nccl --lr-decay-style=cosine --lr-decay-iters=120000 --lr-decay-ratio=0.05 --warmup=.05 --train-iters=150000 --no-lazy-loader --resume-dataloader --teacher_load_pretrained=../GLM/data/checkpoints/pretrain/blocklm-large-blank --teacher_num_layers=24 --teacher_hidden_size=1024 --teacher_num_attention_heads=16 --teacher_max_position_embeddings=512 --teacher_fp16 --student_model=tinybert
task-specific stage
python -u distill/auto_tune.py --py_file=distill/finetune.py \
--gpus=0,1,2,3,4,5,6,7 \
--model=block_tiny6 \
--model_path=checkpoints/distill/paper/24.1024-6.768_64-15w_tinybert \
--task_t_load=large \
--tasks=record,copa,wsc,rte,boolq,wic,cb,multirc,wsc_generative \
--num-attention-heads=16 \
--student_model=tinybert \
--again_1__distill_ft_soft \
--again_1__tinybert_wo_inter \
--seed=6899 \
--ds_train_micro_batch_size_per_gpu="32;32;32;32;16;16;16;16;16" \
--ds_gradient_accumulation_steps=1 \
--rate_ds_train_micro_batch_size_per_gpu=0.125 \
--rate_ds_gradient_accumulation_steps=1 \
--ds_optimizer__params__lr="1E-05;5E-06;2E-05;1E-05;1E-05;1E-05;2E-05;2E-05;5E-06"
For more details on the task-specific distillation stage, please refer to distill/ft_logs.
Create a new distillation method
Implementing a new distillation method only requires adding a class to the distill/distill_model.py file. For example, a minimal skeleton for intermediate-layer distillation:
class MethodName(GLMStudent):
    def __init__(self, language_model, args, **kwargs):
        super().__init__(language_model, args, **kwargs)

    def get_teacher_hook(self, **kwargs):
        # Return a dict selecting the teacher intermediate layers you want to use.
        return {...}

    def get_student_hook(self, **kwargs):
        # Return a dict selecting the student intermediate layers you want to use.
        return {...}

    def inter_loss(self, s_inter_vars, t_inter_vars, s_hook, t_hook, **kwargs):
        # Compute and return the loss over the captured intermediate layers.
        ...
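As a purely illustrative sketch (the layer mapping and tensor shapes are assumptions, not the framework's exact API), an inter_loss over captured hidden states could be a simple mean-squared error with a uniform student-to-teacher layer mapping:

import torch.nn.functional as F

def example_inter_loss(student_hiddens, teacher_hiddens):
    # student_hiddens: e.g. 6 hidden-state tensors; teacher_hiddens: e.g. 12.
    # A common uniform mapping pairs student layer i with teacher layer 2 * (i + 1).
    loss = 0.0
    for i, s_h in enumerate(student_hiddens):
        t_h = teacher_hiddens[2 * (i + 1) - 1]
        loss = loss + F.mse_loss(s_h, t_h)
    return loss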
Citation
Part of the code is based on GLM.
Please cite our papers if you find this code useful for your research:
@inproceedings{tan2023glmd,
author = {Shicheng Tan and
Weng Lam Tam and
Yuanchun Wang and
Wenwen Gong and
Shu Zhao and
Peng Zhang and
Jie Tang},
title = {Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method},
booktitle = {ACL},
year = {2023},
}
@inproceedings{tan2023gkd,
author = {Shicheng Tan and
Weng Lam Tam and
Yuanchun Wang and
Wenwen Gong and
Shu Zhao and
Peng Zhang and
Jie Tang},
title = {GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model},
booktitle = {ACL},
year = {2023},
}