
Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

We provide PyTorch implementations for our arXiv paper "Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose" (http://arxiv.org/abs/2002.10137) and our IEEE TMM paper "Predicting Personalized Head Movement From Short Video and Speech Signal" (https://ieeexplore.ieee.org/document/9894719).

Note that this code is protected under patent. It is for research purposes at your university (research institution) only. If you are interested in business or for-profit use, please contact Prof. Liu (the corresponding author, email: liuyongjin@tsinghua.edu.cn).

We provide a demo video here (please search for "Talking Face" on that page and click the "demo video" button).


Our Proposed Framework

<img src = 'pipeline.jpg'>

Prerequisites

Getting Started

Installation

pip install -r requirements.txt
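
If you want to keep the dependencies isolated, here is a minimal sketch using Python's standard venv module (the environment name is arbitrary, not something the repo prescribes):

python3 -m venv talkingface-env       # hypothetical environment name
source talkingface-env/bin/activate   # activate it (Linux/macOS)
pip install -r requirements.txt       # install the pinned dependencies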

Download pre-trained models

Download the face model for 3D face reconstruction

Fine-tune on a target person's short video

Note: you can convert a video to 25 fps with

ffmpeg -i xxx.mp4 -r 25 xxx1.mp4

Extract frames from the target person's video:

cd Data/
python extract_frame1.py [person_id].mp4

Run 3D face reconstruction on the extracted frames:

cd Deep3DFaceReconstruction/
CUDA_VISIBLE_DEVICES=0 python demo_19news.py ../Data/[person_id]

This process takes about 2 minutes on a Titan Xp.
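
If you preprocess several target videos, the steps above can be chained in a small shell script. This is only a sketch, not part of the repo; it assumes you start from the repository root and that Data/[person_id].mp4 already exists at 25 fps:

#!/usr/bin/env bash
# Hypothetical helper (not in the repo): preprocess one target person's video.
set -e
PERSON_ID=$1    # target person id; the video must be Data/${PERSON_ID}.mp4
GPU_ID=${2:-0}

# Extract frames from Data/${PERSON_ID}.mp4
(cd Data/ && python extract_frame1.py ${PERSON_ID}.mp4)

# Run 3D face reconstruction on the extracted frames (~2 min on a Titan Xp)
(cd Deep3DFaceReconstruction/ && CUDA_VISIBLE_DEVICES=${GPU_ID} python demo_19news.py ../Data/${PERSON_ID})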

Fine-tune the audio network:

cd Audio/code/
python train_19news_1.py [person_id] [gpu_id]

The saved models are in Audio/model/atcnet_pose0_con3/[person_id]. This process takes about 5 minutes on a Titan Xp.

Fine-tune the render-to-video network:

cd render-to-video/
python train_19news_1.py [person_id] [gpu_id]

The saved models are in render-to-video/checkpoints/memory_seq_p2p/[person_id]. This process takes about 40 minutes on a Titan Xp.
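
Since both fine-tuning scripts take the same arguments, the two stages can also be run back to back. A sketch (hypothetical wrapper, not part of the repo), again starting from the repository root:

#!/usr/bin/env bash
# Hypothetical wrapper (not in the repo): run both fine-tuning stages for one person.
set -e
PERSON_ID=$1
GPU_ID=${2:-0}

# Stage 1: audio network (~5 min on a Titan Xp);
# checkpoints land in Audio/model/atcnet_pose0_con3/${PERSON_ID}
(cd Audio/code/ && python train_19news_1.py ${PERSON_ID} ${GPU_ID})

# Stage 2: render-to-video network (~40 min on a Titan Xp);
# checkpoints land in render-to-video/checkpoints/memory_seq_p2p/${PERSON_ID}
(cd render-to-video/ && python train_19news_1.py ${PERSON_ID} ${GPU_ID})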

Test on a target person

Place the test audio file (.wav or .mp3) under Audio/audio/. Then run [with generated poses]:

cd Audio/code/
python test_personalized.py [audio] [person_id] [gpu_id]

or [with poses from the short video]:

cd Audio/code/
python test_personalized2.py [audio] [person_id] [gpu_id]

The program prints 'saved to xxx.mov' when the videos are generated successfully. It outputs two .mov files: one with the face only (_full9.mov) and one with the background (_transbigbg.mov).
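
To generate videos for every audio file at once, the test script can be looped over Audio/audio/. A sketch (hypothetical, not part of the repo); it assumes [audio] is the file name without its extension, so adjust if the script expects a full path:

#!/usr/bin/env bash
# Hypothetical batch driver (not in the repo): test all audio files for one person.
set -e
PERSON_ID=$1
GPU_ID=${2:-0}

cd Audio/code/
for f in ../audio/*.wav ../audio/*.mp3; do
  [ -e "$f" ] || continue             # skip unmatched glob patterns
  audio=$(basename "${f%.*}")         # assumed: audio name without extension
  python test_personalized.py "${audio}" "${PERSON_ID}" "${GPU_ID}"
done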

Colab

A Colab demo is here.

Citation

If you use this code for your research, please cite our papers:

@article{yi2020audio,
  title     = {Audio-driven talking face video generation with learning-based personalized head pose},
  author    = {Yi, Ran and Ye, Zipeng and Zhang, Juyong and Bao, Hujun and Liu, Yong-Jin},
  journal   = {arXiv preprint arXiv:2002.10137},
  year      = {2020}
}
@article{YiYSZZWBL22,
  title     = {Predicting Personalized Head Movement From Short Video and Speech Signal},
  author    = {Yi, Ran and Ye, Zipeng and Sun, Zhiyao and Zhang, Juyong and Zhang, Guoxin and Wan, Pengfei and Bao, Hujun and Liu, Yong-Jin},
  journal   = {IEEE Transactions on Multimedia},
  year      = {2022},
  pages     = {1-13},
  doi       = {10.1109/TMM.2022.3207606}
}

Acknowledgments

The face reconstruction code is from Deep3DFaceReconstruction, the ArcFace code is from insightface, and the GAN code is developed based on pytorch-CycleGAN-and-pix2pix.