diar-az

Diarization A to Z - Kaldi to Gecko to Kaldi/corpus and back

There are two goals when automating the diarization process:

  1. Adding to an existing diarization corpus
  2. Using the corpus to train diarization models through Kaldi

CSV file creation was added to Gecko to support both anonymous and named speakers. You need to use this version of Gecko, or a fork of it, to create the type of diarization corpus specified here.

The corpus data is assumed to live in the data directory specified at the top of the create_corpus.sh script.

First data formatting tool supporting RUV-DI in Awesome Diarization. See PR.

Background

The corpus has the following format:

                       README.txt
                       reco2spk_num2spk_label.csv
                       rttm/
                       json/
                       segments/
                       wav/
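As a quick sanity check of the layout above, the following minimal Python sketch reports which of the listed entries are missing from a corpus directory. It is not part of diar-az: the function name `missing_corpus_entries` is hypothetical, and it only checks that the entries exist, not what they contain.

```python
from pathlib import Path

# Entries taken from the corpus format listed above.
CORPUS_FILES = ("README.txt", "reco2spk_num2spk_label.csv")
CORPUS_DIRS = ("rttm", "json", "segments", "wav")


def missing_corpus_entries(corpus_root):
    """Return the expected files and directories not found under corpus_root.

    Hypothetical helper for illustration; directories are reported with a
    trailing slash so they are easy to tell apart from files.
    """
    root = Path(corpus_root)
    missing = [name for name in CORPUS_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in CORPUS_DIRS if not (root / name).is_dir()]
    return missing
```

Calling `missing_corpus_entries` with the path to your corpus returns an empty list when every listed file and directory is present.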

The gecko archive should have the following format:

                       corrected_rttm/
                       json/
                       srt/
                       csv/
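Before running the scripts, it can help to confirm that a Gecko archive contains the directories listed above. The sketch below is an illustration, not part of diar-az. It assumes the archive is a zip file and that the four directories sit at the top level of the archive rather than inside a wrapper folder. The function name `missing_archive_dirs` is hypothetical.

```python
import zipfile

# Directories taken from the Gecko archive format listed above.
ARCHIVE_DIRS = ("corrected_rttm", "json", "srt", "csv")


def missing_archive_dirs(archive_path):
    """Return the expected top-level directories that have no entry in the zip.

    Assumption: the directories are at the top level of the archive.
    """
    with zipfile.ZipFile(archive_path) as archive:
        # First path component of every member that lives inside a directory,
        # e.g. "json" for "json/a.json" or for the directory entry "json/".
        top_level = {
            name.split("/", 1)[0] for name in archive.namelist() if "/" in name
        }
    return [d for d in ARCHIVE_DIRS if d not in top_level]
```

An empty list means all four directories were found. If your archive wraps everything in a single parent folder, the check would need to strip that first path component.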

Installation

Before using the scripts in diar-az, you must install the Levenshtein package:

pip install python-Levenshtein

or, if you're using conda environments:

conda install python-Levenshtein

Running

To run all the steps, call create_corpus.sh with the Gecko archive and the audio directory:

bash create_corpus.sh <gecko archive> <audio directory>

For example:

bash create_corpus.sh gecko_files.zip /data/ruv_unprocessed/audio/

If you have multiple versions of an srt, rttm, or json file and you know which one you want to exclude, you can move it aside with move_dups.sh:

bash move_dups.sh filename-without-ext good-or-bad-dir

bash move_dups.sh 4882718R8_* good

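To see the matching filenames, run the following (the pattern is quoted so that the shell does not expand it before find sees it):

find data/ -maxdepth 3 -iname '*4882718R8*'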
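The same lookup can be done from Python, which is convenient when checking many recordings at once. The sketch below roughly mirrors the find command above. It is not part of diar-az: the function name `find_versions` and its parameters are hypothetical, and unlike find it treats the recording ID as a literal substring rather than a glob pattern.

```python
from pathlib import Path


def find_versions(data_dir, recording_id, max_depth=3):
    """List paths under data_dir whose name contains recording_id, ignoring case.

    Roughly mirrors: find data_dir -maxdepth max_depth -iname '*recording_id*'
    """
    root = Path(data_dir)
    needle = recording_id.lower()
    matches = []
    for path in root.rglob("*"):
        # Depth 1 is a direct child of data_dir, as with find's -maxdepth.
        depth = len(path.relative_to(root).parts)
        if depth <= max_depth and needle in path.name.lower():
            matches.append(path)
    return sorted(matches)
```

For example, `find_versions("data", "4882718R8")` returns every matching path up to three levels below `data/`, so several versions of the same file show up side by side.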

Notes

Everything except the validation of people's names should be done by calling a single script. That script can call other scripts, but the user should only have to call one. So possibly two scripts in total.

You do not need to concern yourself with the wav folder for this project. Assume you'll be working in the directory above the corpus.

I’ve created Bash and Python files that use gawk, sed, sort -u and, I believe, sox. Create the appropriate folders.

Do not commit any files or information that is specific to this corpus, e.g. names, the corpus README.

Tasks

Possible tasks if the above are done

If you have Kaldi set up, then run local/make_ruvdi.sh, fix_data_dir and utils/validate_data_dir.sh

TODO

License

This project is licensed under Apache 2.0.

Acknowledgements

This project was funded by the Icelandic Directorate of Labour's student summer job program in 2020.