diar-az
Diarization A to Z - Kaldi to Gecko to Kaldi/corpus and back
There are two goals when automating the diarization process:
- Adding to an existing diarization corpus
- Using the corpus to train diarization models through Kaldi
CSV file creation was added to Gecko to allow for both anonymous and named speakers. You need to use this version of Gecko, or a fork of it, to create the diarization corpus type specified here.
It is assumed that the corpus data will exist within the data directory specified at the top of the create_corpus bash script.
This is the first data formatting tool supporting RUV-DI in Awesome Diarization. See the PR.
Background
The corpus has the following format:
README.txt
reco2spk_num2spk_label.csv
rttm/
json/
segments/
wav/
The Gecko archive should have the following format:
corrected_rttm/
json/
srt/
csv/
Installation
Before using the scripts in diar-az, you must install the Levenshtein package:
pip install python-Levenshtein
or if you're using conda environments:
conda install python-Levenshtein
Running
To run all the steps, do the following:
bash create_corpus.sh <gecko archive> <audio directory>
bash create_corpus.sh gecko_files.zip /data/ruv_unprocessed/audio/
If you have multiple versions of an srt, rttm, or json file and you know which one you want to exclude, you can do so with move_dups.sh:
bash move_dups.sh <filename-without-ext> <good-or-bad-dir>
bash move_dups.sh '4882718R8_*' good
To see the filenames:
find data/ -maxdepth 3 -iname '*4882718R8*'
- Use sort -k3,3 -t, filename.csv to look for longer name mistakes.
- Use sort -k3,3 -u -t, filename.csv if you only want one occurrence of each name.
Notes
Everything except the person-name validation should be done by calling just one script. That script can call other scripts, but the user should only have to invoke one; so possibly two scripts in total.
You do not need to concern yourself with the wav folder for this project. Assume you'll be working in the directory above the corpus.
The bash and python files I've created use gawk, sed, sort -u and, I believe, sox. Create the appropriate folders.
Do not commit any files or information that is specific to this corpus, e.g. names, the corpus README.
Tasks
- 1. Add audio filenames to rttm files as the second field (see the sketch after this list). See the template file in kaldi-speaker-diarization/master/templates.md for an example. DO NOT put angle brackets around the recording-id/audio filenames.
- 2. Remove bracketed [] entries (foreign, noise, music) from rttm files and srt segments. For rttm files, that means removing the whole line, or removing the [] portion of a line with speaker-ids such as [foreign]+15. For srt segments, that means only removing the segments which don't have any speech.
- 3. Rename the rttm/json/srt files themselves to just the audio filename.
- 4. Also include the command to call create_segments_and_text.py. This might be difficult due to where the resulting files are created; if so, the python file will need to be generalized. Do this and create a pull request.
- 5. Generate a text file with the updated corpus numbers for the corpus readme. If you know how, also auto-replace the values in the readme.
- 6. Create a csv file like the one in the corpus: <audio-filename>,<spk-num>,<speaker label>. This involves pairing up all the written names across files, then creating new speaker labels for the speakers. This needs to be done for unknowns too, but they also need to be renamed to the next available numbered unknown.
- 7. Also create <audio-filename>,<spk-num>,<speaker name>,<speaker label>.
- 8. Allow for 1-3 spelling mistakes in the names, which will then be manually validated and corrected. A sketch covering tasks 6-8 follows this list.
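Here is a minimal Python sketch of tasks 1-2, assuming the standard whitespace-separated RTTM fields (speaker name in field 8) and that the recording-id equals the rttm file's own stem after task 3; the names in it are illustrative and not taken from the actual diar-az scripts:

#!/usr/bin/env python3
# Hypothetical sketch, not the repo's actual script: fixes one rttm
# file in place for tasks 1-2.
import re
import sys
from pathlib import Path

NON_SPEECH = re.compile(r"\[(?:foreign|noise|music)\]")

def fix_rttm(path, recording_id):
    kept = []
    for line in path.read_text().splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip malformed lines
        speaker = fields[7]
        if NON_SPEECH.fullmatch(speaker):
            continue  # task 2: drop pure [noise]/[music]/[foreign] lines
        # Task 2: strip the bracketed part of ids such as "[foreign]+15".
        fields[7] = NON_SPEECH.sub("", speaker).lstrip("+")
        # Task 1: field 2 is the recording-id; no angle brackets around it.
        fields[1] = recording_id
        kept.append(" ".join(fields))
    path.write_text("\n".join(kept) + "\n")

if __name__ == "__main__":
    rttm = Path(sys.argv[1])
    fix_rttm(rttm, rttm.stem)  # task 3: filename matches the audio name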
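And a hypothetical sketch of the name pairing for tasks 6-8, assuming python-Levenshtein from the installation step and an input csv of <audio-filename>,<spk-num>,<speaker name> rows; the label scheme and unknown renumbering are placeholders:

#!/usr/bin/env python3
# Hypothetical sketch for tasks 6-8; input/output formats and the
# label/unknown naming schemes are assumptions, not the repo's choices.
import csv
import sys
from Levenshtein import distance

MAX_TYPOS = 3  # task 8: tolerate up to 3 edits, then validate by hand

def canonical(name, seen):
    # Map a possibly misspelled name onto an already-seen spelling.
    for known in seen:
        if distance(name, known) <= MAX_TYPOS:
            return known  # probably the same person; flag for review
    seen.append(name)
    return name

if __name__ == "__main__":
    seen, labels = [], {}
    unknowns = 0
    writer = csv.writer(sys.stdout)
    with open(sys.argv[1], newline="") as f:
        # Assumed input rows: audio-filename,spk-num,speaker-name
        for audio, spk_num, name in csv.reader(f):
            if name.lower().startswith("unknown"):
                unknowns += 1  # rename to the next free numbered unknown
                name = f"unknown{unknowns:02d}"
            else:
                name = canonical(name, seen)
            label = labels.setdefault(name, f"spk{len(labels) + 1:04d}")
            writer.writerow((audio, spk_num, name, label))  # task 7 format

Usage would be something like python pair_names.py names.csv > labeled.csv (both filenames hypothetical); dropping the name column from the output gives the task 6 format.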
Possible tasks if the above are done
If you have a Kaldi setup, run local/make_ruvdi.sh, utils/fix_data_dir.sh, and utils/validate_data_dir.sh.
- 1. Split each week's files into 70/15/15%, with the 70% portion holding the extra audio files (see the sketch after this list).
- 2. Run the kaldi recipe and split_rttm (I'll need to supply this file), then add the resulting files to the callhome_rttm directory.
- 3. Run the kaldi recipe (kaldi-speaker-diarization/v4) to evaluate the new DER% with the increased data.
- 4. Create a script which builds new segments based on 2-6 speaker turns, so that the result looks like the current corpus but with those new audio files.
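A rough sketch of the split in item 1, assuming one directory per week of wav files; rounding leftovers land in the 70% training portion:

#!/usr/bin/env python3
# Rough sketch of the 70/15/15 split; grouping files by week and the
# directory layout are assumptions about the corpus, not facts from it.
import random
import sys
from pathlib import Path

def split_week(files, seed=0):
    files = sorted(files)
    random.Random(seed).shuffle(files)  # fixed seed keeps splits stable
    n_eval = max(1, round(0.15 * len(files)))
    # dev and test each get ~15%; everything left over stays in train,
    # so the 70% portion holds the extra audio files.
    test = files[:n_eval]
    dev = files[n_eval:2 * n_eval]
    train = files[2 * n_eval:]
    return train, dev, test

if __name__ == "__main__":
    train, dev, test = split_week(Path(sys.argv[1]).glob("*.wav"))
    for part, names in (("train", train), ("dev", dev), ("test", test)):
        print(part, *(p.name for p in names))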
TODO
- in rttm files, identify spk_ids like 001 Jane Doe, 1.[noise], + and crosstalk (see the sketch below)
- preserve existing speaker labels
- check for rttm files with unspecified channel numbers
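A hypothetical detector for the first item; the patterns simply mirror the examples given above and would need tuning against the real rttm files:

#!/usr/bin/env python3
# Hypothetical sketch: flags rttm lines with suspect speaker ids.
import re
import sys

SUSPECT = re.compile(
    r"\d+ \w+ \w+"     # numbered name, e.g. "001 Jane Doe"
    r"|\d+\.\[\w+\]"   # forms like "1.[noise]"
    r"|\+"             # "+"-joined ids, possible crosstalk
)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                if SUSPECT.search(line):
                    print(f"{path}:{lineno}: {line.rstrip()}")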
License
This project is licensed under Apache 2.0.
Acknowledgements
This project was funded by the Icelandic Directorate of Labour's student summer job program in 2020.