NISQA: Speech Quality and Naturalness Assessment
+++ News: The NISQA model has recently been updated to NISQA v2.0. The new version offers multidimensional predictions with higher accuracy and allows for training and finetuning the model.
Speech Quality Prediction:
NISQA is a deep learning model/framework for speech quality prediction. The NISQA model weights can be used to predict the quality of a speech sample that has been sent through a communication system (e.g. a telephone or video call). Besides overall speech quality, NISQA also predicts the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness to give more insight into the cause of the quality degradation.
TTS Naturalness Prediction:
The NISQA-TTS model weights can be used to estimate the Naturalness of synthetic speech generated by a Voice Conversion or Text-To-Speech system (Siri, Alexa, etc.).
Training/Finetuning:
NISQA can be used to train new single-ended or double-ended speech quality prediction models with different deep learning architectures, such as CNN or DFF -> Self-Attention or LSTM -> Attention-Pooling or Max-Pooling. The provided model weights can also be used to finetune the trained model towards new data or for transfer learning to a different regression task (e.g. quality estimation of enhanced speech, speaker similarity estimation, or emotion recognition).
Speech Quality Datasets:
We provide a large corpus of more than 14,000 speech samples with subjective speech quality and speech quality dimension labels.
For more information about the deep learning model structure, the training datasets used, and the training options, see the NISQA paper and the Wiki.
Installation
To install the requirements, install Anaconda and then run:
conda env create -f env.yml
This will create a new environment with the name "nisqa". Activate this environment to continue:
conda activate nisqa
Using NISQA
We provide examples for using NISQA to predict the quality of speech samples, to train a new speech quality model, and to evaluate the performance of a trained speech quality model.
Three different sets of model weights are available. The appropriate weights should be loaded depending on the domain:
Model | Prediction Output | Domain | Filename |
---|---|---|---|
NISQA (v2.0) | Overall Quality, Noisiness, Coloration, Discontinuity, Loudness | Transmitted Speech | nisqa.tar |
NISQA (v2.0) mos only | Overall Quality only (for finetuning/transfer learning) | Transmitted Speech | nisqa_mos_only.tar |
NISQA-TTS (v1.0) | Naturalness | Synthesized Speech | nisqa_tts.tar |
Prediction
There are three modes available to predict the quality of speech via command line arguments:
- Predict a single file
- Predict all files in a folder
- Predict all files in a CSV table
Important: Select "nisqa.tar" to predict the quality of a transmitted speech sample and "nisqa_tts.tar" to predict the Naturalness of a synthesized speech sample.
To predict the quality of a single .wav file use:
python run_predict.py --mode predict_file --pretrained_model weights/nisqa.tar --deg /path/to/wav/file.wav --output_dir /path/to/dir/with/results
To predict the quality of all .wav files in a folder use:
python run_predict.py --mode predict_dir --pretrained_model weights/nisqa.tar --data_dir /path/to/folder/with/wavs --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results
To predict the quality of all .wav files listed in a csv table use:
python run_predict.py --mode predict_csv --pretrained_model weights/nisqa.tar --csv_file files.csv --csv_deg column_name_of_filepaths --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results
The results are printed to the console and, if --output_dir is given, saved to a CSV file in that folder. To speed up prediction, the number of workers and the batch size of the PyTorch DataLoader can be increased with the optional --num_workers and --bs arguments. For stereo files, --ms_channel can be used to select the audio channel.
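For the CSV mode described above, the input table can be generated with a few lines of Python. The following is a minimal sketch, not part of the NISQA code: the helper names write_file_list and build_predict_csv_command and the column name filepath_deg are chosen here for illustration. It writes absolute paths because this section does not state how relative paths in the CSV are resolved, and it only assembles the command from the flags shown above without running it.

```python
import csv
from pathlib import Path


def write_file_list(wav_dir, csv_path, column="filepath_deg"):
    """Write one row per .wav file found directly inside wav_dir.

    Returns the number of files listed. The column name is an
    illustrative choice; it is passed to run_predict.py via --csv_deg.
    """
    wav_files = sorted(Path(wav_dir).glob("*.wav"))
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        for wav in wav_files:
            # Absolute paths avoid any assumption about a base folder.
            writer.writerow([str(wav.resolve())])
    return len(wav_files)


def build_predict_csv_command(csv_path, column, output_dir,
                              weights="weights/nisqa.tar",
                              num_workers=0, batch_size=10):
    """Assemble the predict_csv call as an argument list (not executed)."""
    return [
        "python", "run_predict.py",
        "--mode", "predict_csv",
        "--pretrained_model", weights,
        "--csv_file", str(csv_path),
        "--csv_deg", column,
        "--num_workers", str(num_workers),
        "--bs", str(batch_size),
        "--output_dir", str(output_dir),
    ]
```

The returned list can be passed to subprocess.run from the repository root once the nisqa environment is active. For synthesized speech, weights would be set to weights/nisqa_tts.tar instead.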
Training
Finetuning / Transfer Learning
To finetune the model on a new dataset using the provided model weights, only a CSV file with the filenames and labels is needed. The training is configured through a YAML file and can be started as follows:
python run_train.py --yaml config/finetune_nisqa.yaml
- If the NISQA Corpus is used, only two arguments need to be updated in the YAML file: the data_dir, which should point to the extracted NISQA_Corpus folder, and the output_dir, where the results should be stored.
- If you use your own dataset or want to load the NISQA-TTS model, some other updates are needed.
Your CSV file needs to contain at least three columns with the following names:
- db: the individual dataset name for each file
- filepath_deg: filepath to the degraded WAV file, either an absolute path or a path relative to the data_dir (CSV column name can be changed in YAML)
- mos: the target labels (CSV column name can be changed in YAML)
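As an illustration of the required layout, the sketch below writes such a CSV from a list of tuples. The function name write_label_csv and the example dataset name, file paths, and MOS values are hypothetical; only the three column names come from the description above.

```python
import csv

# Column names required by the finetuning configuration described above.
REQUIRED_COLUMNS = ("db", "filepath_deg", "mos")


def write_label_csv(rows, csv_path):
    """Write (dataset_name, wav_path, mos) tuples to a CSV file.

    wav_path may be absolute or relative to the data_dir set in the YAML.
    """
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(REQUIRED_COLUMNS)
        for db, filepath, mos in rows:
            # float() rejects non-numeric labels early.
            writer.writerow([db, filepath, float(mos)])


# Hypothetical example rows; replace with your own datasets and ratings.
example_rows = [
    ("my_train_set", "my_train_set/deg/file_001.wav", 3.4),
    ("my_train_set", "my_train_set/deg/file_002.wav", 1.8),
    ("my_val_set", "my_val_set/deg/file_001.wav", 4.2),
]
```

Calling write_label_csv(example_rows, "my_labels.csv") produces a file whose header line is db,filepath_deg,mos, followed by one line per speech sample.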
The finetune_nisqa.yaml needs to be updated as follows:
- data_dir: path to the main folder, which contains the CSV file and the datasets
- output_dir: path to the output folder with saved model weights and results
- pretrained_model: filename of the pretrained model, either nisqa_mos_only.tar for natural speech or nisqa_tts.tar for synthesized speech
- csv_file: name of the CSV with filepaths and target labels
- csv_deg: CSV column name that contains filepaths (e.g. filepath_deg)
- csv_mos_train and csv_mos_val: CSV column names of the target value (e.g. mos)
- csv_db_train and csv_db_val: names of the datasets you want to use for training and validation. Dataset names must be in the db column.
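Before starting a long training run, it can help to check that the YAML settings and the CSV agree. The sketch below is one possible sanity check and is not part of the NISQA code: it takes the settings as a plain Python dict that mirrors the YAML keys listed above, and it assumes that csv_db_train and csv_db_val hold lists of dataset names. The function name check_finetune_settings is hypothetical.

```python
import csv


def check_finetune_settings(settings, csv_path):
    """Return a list of problems; an empty list means the checks passed.

    settings mirrors the YAML keys csv_deg, csv_mos_train, csv_mos_val,
    csv_db_train and csv_db_val (the last two assumed to be lists).
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        rows = list(reader)

    problems = []

    # Every column named in the settings must exist in the CSV header.
    needed = dict.fromkeys(
        ["db", settings["csv_deg"],
         settings["csv_mos_train"], settings["csv_mos_val"]]
    )
    for name in needed:
        if name not in columns:
            problems.append(f"missing column: {name}")

    # Every requested dataset name must appear in the db column.
    known = {row["db"] for row in rows} if "db" in columns else set()
    for key in ("csv_db_train", "csv_db_val"):
        for name in settings[key]:
            if name not in known:
                problems.append(f"{key}: dataset '{name}' not in db column")

    return problems
```

A typical use is to load the YAML into a dict, call the function with the path to the CSV, and print any returned messages before running run_train.py.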
See the comments in the YAML configuration file and the Wiki (not yet added) for more advanced training options. A good starting point would be to use the NISQA Corpus to get the training started with the standard configuration.
Training a new model
NISQA can also be used as a framework to train new speech quality models with different deep learning architectures. The general model structure is as follows:
- Framewise model: CNN or Feedforward network
- Time-Dependency model: Self-Attention or LSTM
- Pooling: Average, Max, Attention or Last-Step-Pooling
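To make the pooling stage concrete, the following pure-Python sketch reduces a sequence of per-frame feature vectors to a single vector in the four ways named above. It is a simplified illustration and not the NISQA implementation: in particular, attention_pool scores each frame with a single hypothetical weight vector (score_weights), whereas a trained model would learn its scoring function.

```python
import math


def average_pool(frames):
    """Mean of each feature over all time steps."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def max_pool(frames):
    """Maximum of each feature over all time steps."""
    dim = len(frames[0])
    return [max(f[i] for f in frames) for i in range(dim)]


def last_step_pool(frames):
    """Feature vector of the final time step only."""
    return list(frames[-1])


def attention_pool(frames, score_weights):
    """Softmax-weighted sum of frames.

    Each frame gets a scalar score (dot product with score_weights);
    the scores are normalised with a softmax over time.
    """
    scores = [sum(w * x for w, x in zip(score_weights, f)) for f in frames]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(a * f[i] for a, f in zip(attention, frames))
            for i in range(dim)]
```

With all-zero score_weights every frame receives the same attention weight, so attention_pool reduces to average_pool. Non-zero weights shift the result towards the frames with the highest scores.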
The framewise and time-dependency models can be skipped, for example to train an LSTM model without a CNN that uses the last time step for prediction. A second time-dependency stage can also be added, for example for an LSTM-Self-Attention structure. The model structure is controlled via the YAML configuration file. Training with the standard NISQA model configuration can be started with the NISQA Corpus as follows:
python run_train.py --yaml config/train_nisqa_cnn_sa_ap.yaml
If the NISQA Corpus is used, only the data_dir (pointing to the unzipped NISQA_Corpus folder) and the output_dir need to be updated in the YAML file. Otherwise, see the previous finetuning section for how to update the YAML file for a custom dataset.
It is also possible to train any other combination of neural networks. For example, to train a model with LSTM instead of Self-Attention, the train_nisqa_cnn_lstm_avg.yaml example configuration file is provided.
To train a double-ended model for full-reference speech quality prediction, the train_nisqa_double_ended.yaml configuration file can be used as an example. See the comments in the YAML files and the Wiki (not yet added) for more details on different possible model structures and advanced training options.
Evaluation
Trained models can be evaluated on a given dataset as follows (can also be used as a conformance test of the model installation):
python run_evaluate.py
Before running, the options and paths inside the Python script run_evaluate.py should be updated. If the NISQA Corpus is used, only the data_dir and output_dir paths need to be adjusted. In addition to Pearson's correlation and RMSE, an RMSE after first-order polynomial mapping is calculated. If a CSV file with per-condition labels is provided, the script will also output per-condition results and RMSE*. Optionally, correlation diagrams can be plotted. The script should return the same results as in the NISQA paper when it is run on the NISQA Corpus.
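The three per-file metrics mentioned above can be sketched in plain Python as follows. This is an illustrative reimplementation under simple assumptions, not the code in run_evaluate.py: the mapping is an ordinary least-squares line fitted from predictions to subjective scores, and the RMSE divides by the number of samples, whereas the evaluation script may apply a different normalisation. RMSE* is not covered here.

```python
import math


def pearson(pred, target):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def rmse(pred, target):
    """Root-mean-square error, dividing by the number of samples."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    )


def fit_first_order(pred, target):
    """Least-squares slope and intercept mapping pred onto target."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    slope = (sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
             / sum((p - mean_p) ** 2 for p in pred))
    return slope, mean_t - slope * mean_p


def rmse_after_mapping(pred, target):
    """RMSE after mapping predictions with a first-order polynomial."""
    slope, intercept = fit_first_order(pred, target)
    mapped = [slope * p + intercept for p in pred]
    return rmse(mapped, target)
```

The mapping removes a constant offset and linear scaling between predicted and subjective scores, so a model whose predictions are consistently 0.5 MOS too high has an RMSE of 0.5 but an RMSE after mapping close to zero.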
NISQA Corpus
The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet-loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions.
For the download link and more details on the datasets and the source speech samples used, see the NISQA Corpus Wiki.
Paper and License
- If you use the NISQA model or the NISQA Corpus for your research, please cite the following paper:
G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proc. Interspeech 2021, 2021.
- Please cite the following paper if you use the NISQA-TTS model for Naturalness prediction of synthesized speech:
G. Mittag and S. Möller, “Deep Learning Based Assessment of Synthetic Speech Naturalness,” in Proc. Interspeech 2020, 2020.
- Please cite the following paper if you use the double-ended NISQA model:
G. Mittag and S. Möller, “Full-reference speech quality estimation with attentional Siamese neural networks,” in Proc. ICASSP 2020, 2020.
- The older NISQA (v0.42) model version is described in the following paper:
G. Mittag and S. Möller, “Non-intrusive speech quality assessment for super-wideband speech communication networks,” in Proc. ICASSP 2019, 2019.
The NISQA code is licensed under MIT License.
The model weights (nisqa.tar, nisqa_mos_only.tar, nisqa_tts.tar) are provided under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.
The NISQA Corpus is provided under the original terms of the used source speech and noise samples. More information can be found in the NISQA Corpus Wiki.
Copyright © 2021 Gabriel Mittag
www.qu.tu-berlin.de