
MSDWild Dataset

MSDWILD: MULTI-MODAL SPEAKER DIARIZATION DATASET IN THE WILD

This dataset is designed for multi-modal speaker diarization and lip-speech synchronization in the wild. Demo

Dataset Statistics

<img src='imgs/metrics.png' width=70% />

Dataset Comparison

<img src='imgs/percentile_chart.png' width=70% />

Compared with other multi-modal datasets, the segment-length distribution of our dataset is close to that of audio-only in-the-wild diarization datasets such as CALLHOME and DIHARD II.

Labels

rttms (all)

rttms (few train)

rttms (few val)

rttms (many val)
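The label files above use the standard RTTM convention, where each `SPEAKER` line carries a file id, channel, onset, duration, and speaker label. A minimal parsing sketch (the exact file names in this repository may differ; the example lines below are illustrative, not real dataset entries):

```python
# Minimal RTTM parser sketch. Standard RTTM "SPEAKER" lines have the form:
# SPEAKER <file-id> <channel> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>
# Field positions follow the NIST RTTM convention.

def parse_rttm(lines):
    """Return a list of (file_id, onset, duration, speaker) tuples."""
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank or non-SPEAKER lines
        segments.append((fields[1],            # file id
                         float(fields[3]),     # onset (seconds)
                         float(fields[4]),     # duration (seconds)
                         fields[7]))           # speaker label
    return segments

# Illustrative lines only (not actual MSDWild entries):
example = [
    "SPEAKER 00001 1 0.50 2.30 <NA> <NA> spk00 <NA> <NA>",
    "SPEAKER 00001 1 2.10 1.10 <NA> <NA> spk01 <NA> <NA>",
]
print(parse_rttm(example))
```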

Wavs

md5: 0057f82daaddf2ce993d1bf0679929c4

Video part

The video file name corresponds to the audio file name.

(For researchers in China, Baidu Drive or Quark Drive (code: 5v8a) may speed up downloads.)

Our multimodal speaker diarization baseline includes a subtask, active speaker detection. To train the active speaker detection algorithm (TalkNet, mentioned in our paper), we use the 'cropped faces'. These are generated at random positions from the videos based on the video content and the RTTM labels, and subsequently rectified manually. If you do not use these resources, you can ignore the 'cropped faces'.

There are four categories of cropped-face videos:

Time is denoted in seconds, and segment_id corresponds to the cropped-face video id within each video folder.

[Update] Please disregard files with negative filenames (approximately 90 files).

Notes:

Videos with frame-by-frame face position annotation

We have added bounding boxes for every face across all frames. Our trained annotators have reviewed the facial annotations on each frame to guarantee accuracy: no faces have been ignored or incorrectly tagged. They have also realigned any improperly positioned face bounding boxes. The refined annotations are archived in a correspondingly named directory, with the data structured in CSV files as outlined below. One Sample

CSV line: 3363,face,1,398,129,479,244,0

Description: frame id, face (fixed), face_id, x1, y1, x2, y2, 0 (fixed)
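Given the field layout above, each CSV line maps to a per-frame face record. A minimal parsing sketch (interpreting (x1, y1) as the top-left and (x2, y2) as the bottom-right corner is an assumption, inferred from x1 < x2 and y1 < y2 in the sample):

```python
# Parse one line of the per-frame face-annotation CSV:
#   frame id, face (fixed), face_id, x1, y1, x2, y2, 0 (fixed)
# (x1, y1) / (x2, y2) are presumed to be the top-left / bottom-right
# corners of the face bounding box (assumption).

def parse_face_row(row):
    frame_id, label, face_id, x1, y1, x2, y2, _zero = row.strip().split(",")
    assert label == "face"  # second field is fixed
    return {
        "frame": int(frame_id),
        "face_id": int(face_id),
        "box": (int(x1), int(y1), int(x2), int(y2)),
    }

rec = parse_face_row("3363,face,1,398,129,479,244,0")
print(rec)
```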

Download Link: Google Drive

(For researchers in China, Baidu Drive or Quark Drive (code: 5v8a) may speed up downloads.)

How to Preview Annotations

<img src='imgs/boundingbox1.jpg' width=70% />

Click DarkLabel.exe and select a video file to preview.

<img src='imgs/boundingbox2.jpg' width=70% />

Move the slider to preview the positions and IDs of faces on different frames, without altering any other default settings.

[Update] You can also use this Link to visualize the relationship between the audio label and the visual label.

Notes:

Baseline Code

You can easily reproduce the results by following this guide.

No other post-processing methods are used.

Baseline Result

<img src='imgs/baseline_results.png' width=70% />

Analysis Result

You can refer to URL to visualize the dataset based on your algorithm result.

<img src='imgs/via_example.png' width=70% />

Acknowledgments

Thanks to You Zhang for pointing out some annotation issues and helping improve the quality of the dataset.

Reference

@inproceedings{liu22t_interspeech,
  author={Tao Liu and Shuai Fan and Xu Xiang and Hongbo Song and Shaoxiong Lin and Jiaqi Sun and Tianyuan Han and Siyuan Chen and Binwei Yao and Sen Liu and Yifei Wu and Yanmin Qian and Kai Yu},
  title={{MSDWild: Multi-modal Speaker Diarization Dataset in the Wild}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={1476--1480},
  doi={10.21437/Interspeech.2022-10466}
}