
Seeking the Shape of Sound

An implementation of the CVPR 2021 paper: Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association

Environments

Ubuntu 16.04
CUDA 10.2
Python 3.7.3
PyTorch 1.4.0

See requirement.txt.
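
To compare a local setup against the versions listed above, a short check like the following can help. It is a minimal sketch, not part of the repository: the function name check_environment is hypothetical, and it only reports what the current interpreter sees without enforcing any version.

```python
import importlib
import platform


def check_environment():
    """Collect the versions of the components listed above (hypothetical helper)."""
    report = {"python": platform.python_version()}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this interpreter.
        report["torch"] = None
        report["cuda"] = None
        report["cuda_available"] = False
    else:
        report["torch"] = torch.__version__
        report["cuda"] = torch.version.cuda  # None for CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

The Ubuntu version is not checked here; platform.platform() can be printed as well if that matters.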

Data preparation

Download VoxCeleb and VGGFace, and unzip them into ./data.

Because of file size limits, only some of the query lists are included in ./data. The other lists used in the paper can be downloaded from Google drive or Baidu drive (passwd: rfri).

Training

Download pretrained models for backbones into ./pretrained_models.

Google drive:

SE-ResNet-50

Thin-ResNet-34

Baidu drive:

SE-ResNet-50 (passwd: jy55)

Thin-ResNet-34 (passwd: tc6i)

Train the model and update identity weights:

python3 train.py config/train_reweight.yaml

Extract identity weights from saved model file:

python3 extract_id_weight.py config/train_reweight.yaml

Retrain the final model:

python3 train.py config/train_main.yaml

The model and log are saved in save/vox1_train/Voice2Face/main by default.
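
The three training commands above can also be chained from one script so that a failed step stops the run. This is a sketch under assumptions: build_commands and run_pipeline are hypothetical helpers, not part of the repository, and the script is assumed to be started from the repository root with the pretrained backbones already in ./pretrained_models.

```python
import subprocess
import sys

# The three commands from the steps above, in order.
PIPELINE = [
    ["train.py", "config/train_reweight.yaml"],            # train and update identity weights
    ["extract_id_weight.py", "config/train_reweight.yaml"],  # extract identity weights
    ["train.py", "config/train_main.yaml"],                # retrain the final model
]


def build_commands(python=sys.executable):
    """Return the full argument list for each step (hypothetical helper)."""
    return [[python, *step] for step in PIPELINE]


def run_pipeline(dry_run=False):
    """Run the steps in order; with dry_run=True only print them."""
    for cmd in build_commands():
        print("+", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits with a non-zero status.
            subprocess.run(cmd, check=True)
```

Calling run_pipeline(dry_run=True) prints the commands without executing them, which is a cheap way to confirm the order before starting a long training run.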

Evaluation

Download the pretrained model from Google drive or Baidu drive (passwd: 4vyf).
Edit config/train_main.yaml: set resume_eval to the path where the model is saved.
Run:

python3 eval.py config/train_main.yaml

Expected results (%):

                1:2 Matching (U)  1:2 Matching (G)  Verification (U)  Verification (G)  Retrieval
Voice-to-Face   87.2              77.7              87.2              77.5              5.5
Face-to-Voice   86.5              75.3              87.0              76.1              5.8

The results might slightly differ from the above due to random factors in the training process.
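
For readers unfamiliar with the 1:2 matching columns, the sketch below shows one common way such a score is computed. It assumes that each trial pairs a probe from one modality with one matching and one non-matching candidate from the other modality, and that candidates are ranked by cosine similarity. Both points are assumptions for illustration, not a description of what eval.py does, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def matching_accuracy(trials):
    """Percentage of trials where the probe is more similar to the positive.

    trials: iterable of (probe, positive, negative) embedding triples.
    """
    trials = list(trials)
    correct = sum(cosine(p, pos) > cosine(p, neg) for p, pos, neg in trials)
    return 100.0 * correct / len(trials)
```

With a voice embedding as the probe and two face embeddings as candidates this corresponds to the Voice-to-Face row; swapping the roles gives Face-to-Voice.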

References

If this code is helpful to you, please consider citing our paper:

@inproceedings{wen2021seeking,
  title={Seeking the shape of sound: An adaptive framework for learning voice-face association},
  author={Wen, Peisong and Xu, Qianqian and Jiang, Yangbangyan and Yang, Zhiyong and He, Yuan and Huang, Qingming},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}