<div align="center">

Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation (EAT <a href="https://github.com/yuangan/EAT_code"><img src="./doc/favicon_eat.png" style="width: 25px;"></a>)

<a href="https://yuangan.github.io/"><strong>Yuan Gan</strong></a> · <a href="https://z-x-yang.github.io/"><strong>Zongxin Yang</strong></a> · <a><strong>Xihang Yue</strong></a> · <a href="https://scholar.google.com/citations?user=zzW8d-wAAAAJ&hl=zh-CN&oi=ao"><strong>Lingyun Sun</strong></a> · <a href="https://scholar.google.com/citations?user=RMSuNFwAAAAJ&hl=en"><strong>Yi Yang</strong></a>

<a href="https://colab.research.google.com/drive/133hwDHzsfRYl-nQCUQxJGjcXa5Fae22Z#scrollTo=GWqHlw6kKrbo"><img src="https://colab.research.google.com/assets/colab-badge.svg" height="20" alt="google colab logo"></a>

</div> <div align="justify">

News:

Environment

If you wish to run only our demo, we recommend trying it out in Colab. Please note that our preprocessing and training code must be run locally and requires the following environment configuration:

conda/mamba env create -f environment.yml

Note: We recommend using mamba to install dependencies, which is faster than conda.
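
If mamba is not already installed, here is a minimal sketch of one way to set it up and create the environment (installing mamba from conda-forge into the base conda environment is a suggestion, not a requirement of this repo):

```bash
# Install mamba into the base conda environment (skip if it is already available).
conda install -n base -c conda-forge mamba

# Create the "eat" environment from the provided environment file.
mamba env create -f environment.yml
```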

Checkpoints && Demo dependencies

In the EAT_code folder, use gdown (commands below) or manually download and unzip ckpt, demo, and Utils into their corresponding folders.

gdown --id 1KK15n2fOdfLECWN5wvX54mVyDt18IZCo && unzip -q ckpt.zip -d ckpt
gdown --id 1MeFGC7ig-vgpDLdhh2vpTIiElrhzZmgT && unzip -q demo.zip -d demo
gdown --id 1HGVzckXh-vYGZEUUKMntY1muIbkbnRcd && unzip -q Utils.zip -d Utils
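
As a quick sanity check (a minimal sketch; the exact contents of each archive may differ), confirm that the three folders ended up inside EAT_code:

```bash
# Verify that the unzipped folders are in place.
ls -d ckpt demo Utils
```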

Demo

Run the code inside our <strong>eat</strong> environment. Activate it with:

conda activate eat

Then, run the demo with:

CUDA_VISIBLE_DEVICES=0 python demo.py --root_wav ./demo/video_processed/W015_neu_1_002 --emo hap

Note 1: Place your own images in ./demo/imgs/ and run the above command to generate talking-head videos with aligned new portraits. If you prefer not to align your portrait, simply place your cropped image (256x256) in ./demo/imgs_cropped. Due to the background used in the MEAD training set, results tend to be better with a similar background.

Note 2: To test with custom audio, replace video_name/video_name.wav and the corresponding DeepSpeech feature video_name/deepfeature32/video_name.npy. The output length is determined by the shorter of the audio and the driving pose sequence. Refer to here for more details.

Note 3: The audio used in our work should be sampled at 16,000 Hz and the corresponding video should have a frame rate of 25 fps.
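
For Notes 2 and 3, here is a minimal sketch of preparing custom audio and video with ffmpeg (assuming ffmpeg is installed; my_audio.m4a, my_video.mp4, and the clip W015_neu_1_002 are placeholder names, and the deepfeature32 .npy still needs to be regenerated, e.g. with our preprocessing code):

```bash
# Resample the custom audio to a 16 kHz WAV (mono is assumed here, matching typical DeepSpeech input).
ffmpeg -i my_audio.m4a -ar 16000 -ac 1 my_audio.wav

# Replace the clip's audio, keeping the <video_name>/<video_name>.wav convention.
cp my_audio.wav ./demo/video_processed/W015_neu_1_002/W015_neu_1_002.wav

# If you also bring your own driving video, convert it to 25 fps before preprocessing.
ffmpeg -i my_video.mp4 -r 25 my_video_25fps.mp4
```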

Test MEAD

To reproduce the results of MEAD as reported in our paper, follow these steps:

First, download the additional MEAD test data from mead_data and unzip it into the mead_data directory:

gdown --id 1_6OfvP1B5zApXq7AIQm68PZu1kNyMwUY && unzip -q mead_data.zip -d mead_data

Then, execute the test with the following command:

CUDA_VISIBLE_DEVICES=0 python test_mead.py [--part 0/1/2/3] [--mode 0]
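
A minimal sketch for running all four parts in parallel (assuming four GPUs are available; adjust the GPU ids to your machine):

```bash
# Run the four MEAD test partitions concurrently, one GPU per part.
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i python test_mead.py --part $i --mode 0 &
done
wait
```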

You can evaluate the results with our evaluation_eat code.

Test LRW

To reproduce the LRW results reported in our paper, you need to download and extract the LRW test dataset from here. Due to license restrictions, we cannot provide any video data. (The names of the test files can be found here for validation.) After downloading LRW, you will need to preprocess the videos with our preprocessing code. Then, move and rename the output files as follows:

- `imgs` --> `lrw/lrw_images`
- `latents` --> `lrw/lrw_latent`
- `deepfeature32` --> `lrw/lrw_df32`
- `poseimg` --> `lrw/poseimg`
- `video_fps25/*.wav` --> `lrw/lrw_wavs/*.wav`
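
A minimal sketch of this move/rename step (it assumes the preprocessing outputs sit in the current directory and that lrw/ is the test root; adjust both paths to your setup):

```bash
# Rearrange the preprocessing outputs into the layout expected by the LRW test script.
mkdir -p lrw lrw/lrw_wavs
mv imgs          lrw/lrw_images
mv latents       lrw/lrw_latent
mv deepfeature32 lrw/lrw_df32
mv poseimg       lrw/poseimg
mv video_fps25/*.wav lrw/lrw_wavs/
```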

Change the dataset path in test_lrw_posedeep_normalize_neutral.py.

Then, execute the following command:

CUDA_VISIBLE_DEVICES=0 python test_lrw_posedeep_normalize_neutral.py --name deepprompt_eam3d_all_final_313 --part [0/1/2/3] --mode 0

or run them concurrently:

bash test_lrw_posedeep_normalize_neutral.sh

The results will be saved in './result_lrw/'.

Preprocessing

If you want to test with your own driving video (which must include audio), place it in the preprocess/video directory, then run the preprocessing code:

cd preprocess
python preprocess_video.py

The processed video will be saved in demo/video_processed. To test it, run:

CUDA_VISIBLE_DEVICES=0 python demo.py --root_wav ./demo/video_processed/[fill in your video name] --emo [fill in emotion name]

The video should contain only one person. We crop the input video according to the facial landmarks estimated from the first frame. Refer to these videos for more details.
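
Putting the steps together, a minimal end-to-end sketch (my_video.mp4 is a placeholder, and we assume the processed folder takes the video's base name):

```bash
# 1. Drop the driving video (with audio, a single person, 25 fps) into the preprocessing folder.
cp my_video.mp4 preprocess/video/

# 2. Run the preprocessing.
cd preprocess
python preprocess_video.py
cd ..

# 3. Generate a talking-head video from the processed clip (folder name assumed to match the video name).
CUDA_VISIBLE_DEVICES=0 python demo.py --root_wav ./demo/video_processed/my_video --emo hap
```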

Note 1: The preprocessing code has been verified to work correctly with TensorFlow version 1.15.0, which can be installed on Python 3.7. Refer to this issue for more information.
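
A minimal sketch of a separate environment for the preprocessing code (the environment name is arbitrary, and tensorflow-gpu vs. tensorflow depends on your setup):

```bash
# TensorFlow 1.15.0 requires Python 3.7 or earlier.
conda create -n eat_preprocess python=3.7 -y
conda activate eat_preprocess
pip install tensorflow-gpu==1.15.0   # or tensorflow==1.15.0 for CPU-only
```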

Note 2: Extract the bounding boxes needed for training with preprocess/extract_bbox.py.

A2KP Training

Data&Ckpt Preparation:

Execution:

Emotional Adaptation Training

Data&Ckpt Preparation:

Execution:

Evaluation:

Zero-shot Editing

Contact

Our code is released under the CC-BY-NC 4.0 license and is intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at ganyuan@zju.edu.cn and yangyics@zju.edu.cn.

Citation

If you find this code helpful for your research, please cite:

@InProceedings{Gan_2023_ICCV,
    author    = {Gan, Yuan and Yang, Zongxin and Yue, Xihang and Sun, Lingyun and Yang, Yi},
    title     = {Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {22634-22645}
}

Acknowledgements

We thank the following works for their publicly available code and generous help: EAMM, OSFV (unofficial), AVCT, PC-AVS, Vid2Vid, AD-NeRF, and others.

</div>