Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing

Yonggan Fu, Yang Zhang, Kaizhi Qian, Zhifan Ye, Zhongzhi Yu, Cheng-I Lai, Yingyan (Celine) Lin

Accepted at NeurIPS 2022. [ Paper | Video | Slides ]

S<sup>3</sup>-Router: Overview

S<sup>3</sup>-Router: Framework

<p align="center"> <img src="images/overview.png" width="700"> </p>

S<sup>3</sup>-Router's Application 1: A New Finetuning Paradigm

<p align="center"> <img src="images/finetuning.png" width="700"> </p>

S<sup>3</sup>-Router's Application 2: An Efficient Multilingual/Multitask Solution

<p align="center"> <img src="images/multilingual.png" width="550"> </p>

S<sup>3</sup>-Router's Application 3: A SOTA Pruning Scheme

<p align="center"> <img src="images/asr_pruning.png" width="700"> </p>

S<sup>3</sup>-Router's Application 4: Analyzing Speech SSL Models

<p align="center"> <img src="images/analysis.png" width="500"> </p>

Code Usage

Our code is built on top of [Fairseq](https://github.com/facebookresearch/fairseq).

Installation

```bash
# Install this repo (a fork of fairseq) in editable mode
pip install --editable ./

# Build NVIDIA apex with its fused CUDA kernels
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
```
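
As an optional sanity check before launching any runs, you can verify that fairseq and apex's fused CUDA extensions import cleanly (the imports below are standard fairseq/apex APIs and fail loudly if the CUDA extensions were not built):

```bash
python -c "import fairseq; print(fairseq.__version__)"
python -c "from apex.optimizers import FusedAdam; from apex.normalization import FusedLayerNorm; print('apex OK')"
```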

Finetune wav2vec 2.0 @ LibriSpeech via S<sup>3</sup>-Router

```bash
# model._name=wav2vec_ctc_st activates S^3-Router: the pretrained weights are
# frozen and binary masks are learned in their place. model.prune_rate sets the
# sparsity of the learned masks, model.init_score selects the mask-score
# initialization, and model.trainable_proj=true keeps the output projection trainable.
CUDA_VISIBLE_DEVICES=0,1 fairseq-hydra-train \
  task.data=path-to-librispeech \
  model.w2v_path=path-to-wav2vec_small.pt \
  dataset.train_subset=train-1h dataset.valid_subset=test-clean \
  hydra.run.dir=outputs/wav2vec2-base \
  dataset.max_tokens_valid=800000 dataset.max_tokens=1200000 \
  distributed_training.distributed_init_method=tcp://localhost:15460 \
  model._name=wav2vec_ctc_st model.prune_rate=0.9 \
  checkpoint.no_epoch_checkpoints=false checkpoint.save_interval=1000 \
  optimization.lr=[0.00005] \
  model.fix_attn=true model.trainable_proj=true \
  lr_scheduler.final_lr_scale=0.0001 model.init_score=weight_rank \
  --config-dir examples/wav2vec/config/finetuning --config-name base_1h
```
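
To sanity-check a finetuned checkpoint, fairseq's stock CTC inference script can be used. The command below is a minimal sketch, assuming the checkpoint layout fairseq writes under hydra.run.dir and LM-free viterbi decoding; the data path, subset, and token budget are placeholders to adapt:

```bash
python examples/speech_recognition/infer.py path-to-librispeech \
  --task audio_finetuning --nbest 1 \
  --path outputs/wav2vec2-base/checkpoints/checkpoint_best.pt \
  --gen-subset test-clean --results-path outputs/wav2vec2-base/decode \
  --w2l-decoder viterbi --criterion ctc --labels ltr \
  --max-tokens 800000 --post-process letter
```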

Finetune data2vec @ LibriSpeech via S<sup>3</sup>-Router

```bash
# data2vec audio models expect normalized input (task.normalize=true), and their
# model code is registered through common.user_dir=examples/data2vec.
CUDA_VISIBLE_DEVICES=0,1 fairseq-hydra-train \
  task.data=path-to-librispeech \
  model.w2v_path=path-to-audio_base_ls.pt \
  dataset.train_subset=train-1h dataset.valid_subset=test-clean \
  hydra.run.dir=outputs/data2vec \
  dataset.max_tokens_valid=800000 dataset.max_tokens=1200000 \
  distributed_training.distributed_init_method=tcp://localhost:15460 \
  task.normalize=true \
  model._name=wav2vec_ctc_st model.prune_rate=0.9 \
  checkpoint.no_epoch_checkpoints=false checkpoint.save_interval=1000 \
  optimization.lr=[0.00005] \
  model.fix_attn=true model.trainable_proj=true \
  lr_scheduler.final_lr_scale=0.0001 \
  common.user_dir=examples/data2vec \
  --config-dir examples/wav2vec/config/finetuning --config-name base_1h
```
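
Decoding works as in the wav2vec 2.0 example above, except that the data2vec model code must again be on fairseq's path; presumably the generation-side counterpart of common.user_dir is the --user-dir flag, as sketched here:

```bash
python examples/speech_recognition/infer.py path-to-librispeech \
  --user-dir examples/data2vec --task audio_finetuning --nbest 1 \
  --path outputs/data2vec/checkpoints/checkpoint_best.pt \
  --gen-subset test-clean --results-path outputs/data2vec/decode \
  --w2l-decoder viterbi --criterion ctc --labels ltr \
  --max-tokens 800000 --post-process letter
```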

Finetune XLS-R @ CommonVoice via S<sup>3</sup>-Router

```bash
# The 300M-parameter XLS-R model is larger, so activation checkpointing is
# enabled (model.checkpoint_activations=true), and the best checkpoint is
# selected by unit (character) error rate (checkpoint.best_checkpoint_metric=uer).
CUDA_VISIBLE_DEVICES=0,1 fairseq-hydra-train \
  task.data=path-to-commonvoice-zh_TW-train \
  model.w2v_path=path-to-xlsr2_300m.pt \
  dataset.train_subset=zh_TW-train dataset.valid_subset=zh_TW-test \
  hydra.run.dir=outputs/xlsr \
  dataset.max_tokens_valid=800000 dataset.max_tokens=1200000 \
  distributed_training.distributed_init_method=tcp://localhost:15460 \
  model._name=wav2vec_ctc_st model.prune_rate=0.94 \
  model.checkpoint_activations=true \
  checkpoint.no_epoch_checkpoints=false checkpoint.save_interval=1000 \
  optimization.lr=[0.00005] \
  model.trainable_proj=true model.fix_attn=true \
  checkpoint.best_checkpoint_metric=uer \
  model.init_score=weight_rank lr_scheduler.final_lr_scale=0.0001 \
  --config-dir examples/wav2vec/xlsr/config --config-name finetune
```

Prune wav2vec 2.0 @ LibriSpeech via S<sup>3</sup>-Router

Pruning is a two-step recipe: first finetune wav2vec 2.0 with the standard CTC head (model._name=wav2vec_ctc), then learn the pruning masks on top of that finetuned checkpoint; note checkpoint.finetune_from_model pointing at step 1's best checkpoint, and model.init_score=weight_magnitude_with_scale for the mask-score initialization.

```bash
# Step 1: standard finetuning (no masks)
CUDA_VISIBLE_DEVICES=0,1 fairseq-hydra-train \
  task.data=path-to-librispeech \
  model.w2v_path=path-to-wav2vec_small.pt \
  dataset.train_subset=train-1h dataset.valid_subset=test-clean \
  hydra.run.dir=outputs/finetune \
  dataset.max_tokens_valid=800000 dataset.max_tokens=1200000 \
  distributed_training.distributed_init_method=tcp://localhost:15460 \
  model._name=wav2vec_ctc \
  checkpoint.no_epoch_checkpoints=false checkpoint.save_interval=1000 \
  optimization.lr=[0.00005] \
  --config-dir examples/wav2vec/config/finetuning --config-name base_1h

# Step 2: learn pruning masks on top of the finetuned checkpoint
CUDA_VISIBLE_DEVICES=0,1 fairseq-hydra-train \
  task.data=path-to-librispeech \
  model.w2v_path=path-to-wav2vec_small.pt \
  dataset.train_subset=train-1h dataset.valid_subset=test-clean \
  hydra.run.dir=outputs/pruning_wav2vec2 \
  dataset.max_tokens_valid=800000 dataset.max_tokens=1200000 \
  distributed_training.distributed_init_method=tcp://localhost:15460 \
  model._name=wav2vec_ctc_st model.prune_rate=0.3 \
  checkpoint.no_epoch_checkpoints=false checkpoint.save_interval=1000 \
  optimization.lr=[0.00005] \
  model.fix_attn=false model.trainable_proj=true \
  lr_scheduler.final_lr_scale=0.0001 \
  model.init_score=weight_magnitude_with_scale \
  checkpoint.finetune_from_model=outputs/finetune/checkpoints/checkpoint_best.pt \
  --config-dir examples/wav2vec/config/finetuning --config-name base_1h
```
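
As a rough diagnostic, the achieved sparsity can be estimated by counting exactly-zero weights in the saved checkpoint. This is only a sketch: it assumes pruned weights are materialized as zeros in the exported model state and that checkpoints land under the hydra.run.dir used above; if the fork instead stores masks as separate score tensors, the counting logic would need adapting.

```bash
python - <<'EOF'
import torch

# Checkpoint path follows the hydra.run.dir used in step 2 above.
state = torch.load("outputs/pruning_wav2vec2/checkpoints/checkpoint_best.pt",
                   map_location="cpu")

total = zeros = 0
for name, t in state["model"].items():
    if torch.is_tensor(t) and t.is_floating_point():
        total += t.numel()
        zeros += (t == 0).sum().item()

print(f"overall sparsity: {zeros / total:.2%}")
EOF
```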

Citation

```bibtex
@article{fu2022losses,
  title={Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing},
  author={Fu, Yonggan and Zhang, Yang and Qian, Kaizhi and Ye, Zhifan and Yu, Zhongzhi and Lai, Cheng-I and Lin, Yingyan},
  journal={arXiv preprint arXiv:2211.01522},
  year={2022}
}
```