Home

Awesome

Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation (ECCV 2022 Oral)

Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, Bolei Zhou.

Project | Paper | Demo

Animating high-fidelity video portrait with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation. In order to capture the inconsistent motions as well as the semantic difference between human head and torso, some work models them via two individual sets of NeRF, leading to unnatural results. In this work, we propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF. The proposed model can handle the detailed local facial semantics and the global head-torso relationship through two semantic-aware modules. Specifically, we first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering. Moreover, to enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions. Extensive evaluations demonstrate that our proposed approach renders realistic video portraits.

<img src='./misc/SSP-NeRF.png' width=800>

Environment and Pre-requisites

This project is developed and tested on Ubuntu 18.04, Python 3.7, PyTorch 1.9.1 and CUDA version: 11.1. Since the repository is developed based on AD-NeRF of Guo et al., the environment installation and some pre-requisites like PyTorch3D, 3DMM basis params generally follow theirs.

Dataset Preparation and Pre-processing

Dataset Download and Preparation

Due to the YouTube video copyright issues, we could not provide the raw videos/frames. The YouTube video ID of each corresponding videos are provided in video_id.txt. You could download them by efficient tools like youtube-dl. Afterwards, you could select the clips of interest (the clips with clear talking face of a certain speaker), and then crop the images into square shape for later pre-processing (i.e., keep the head, torso and background parts remained).

Data Pre-processing

The data pre-processing codes are included in the data_util folder, where the main pre-processing calling scripts are in the data_util/process_data.py. To implement it step by step, you could run the below bash command:

bash process_data.sh $person_name$

The additional features are that we parse the more detailed facial regions as well as save the 3DMM facial keypoints.

Training

Since the work is done during internship at SenseTime Research, part of the codes still need to undergo a series of review and permission process.

Citation

If you find our work useful, please kindly cite as:

@article{liu2022semantic,
  title = {Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation},
  author = {Liu, Xian and Xu, Yinghao and Wu, Qianyi and Zhou, Hang and Wu, Wayne and Zhou, Bolei},
  journal={arXiv preprint arXiv:2201.07786},
  year = {2022}
}

Related Projects

If you are interested in NeRF / neural implicit representions + semantic map, we would also like to recommend you to check out other related works:

Acknowledgement