<div align="center">Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation (EAT <a href="https://github.com/yuangan/EAT_code"><img src="./doc/favicon_eat.png" style="width: 25px;"></a>)
<a href="https://yuangan.github.io/"><strong>Yuan Gan</strong></a> · <a href="https://z-x-yang.github.io/"><strong>Zongxin Yang</strong></a> · <a><strong>Xihang Yue</strong></a> · <a href="https://scholar.google.com/citations?user=zzW8d-wAAAAJ&hl=zh-CN&oi=ao"><strong>Lingyun Sun</strong></a> · <a href="https://scholar.google.com/citations?user=RMSuNFwAAAAJ&hl=en"><strong>Yi Yang</strong></a>
<a href="https://colab.research.google.com/drive/133hwDHzsfRYl-nQCUQxJGjcXa5Fae22Z#scrollTo=GWqHlw6kKrbo"><img src="https://colab.research.google.com/assets/colab-badge.svg" height="20" alt="google colab logo"></a>
</div>
<div align="justify">
News:
- 28/05/2024 Released example scripts for zero-shot editing with CLIP loss. Thank you for your attention~ :tada:
- 10/03/2024 Released all evaluation code used in our paper; please refer to here for more details.
- 26/12/2023 Released the A2KP Training code. Thank you for your attention and patience~ :tada:
- 05/12/2023 Released the LRW test code.
- 27/10/2023 Released the Emotional Adaptation Training code. Thank you for your patience~ :tada:
- 17/10/2023 Released the evaluation code for the MEAD test results. For more information, please refer to evaluation_eat.
- 21/09/2023 Released the preprocessing code. Now, EAT can generate emotional talking-head videos with <strong>any</strong> portrait and driven video.
- 07/09/2023 Released the pre-trained weights and inference code.
Environment
If you wish to run only our demo, we recommend trying it out in Colab. Please note that our preprocessing and training code must be executed locally and requires the following environment configuration:
conda/mamba env create -f environment.yml
Note: We recommend using mamba to install dependencies, which is faster than conda.
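A minimal setup sketch, assuming mamba is already installed (plain conda works identically, just slower to solve dependencies); the final line is only a quick check that the GPU build of PyTorch landed correctly:

```bash
# Create the environment from the repo's environment.yml and activate it.
mamba env create -f environment.yml
conda activate eat
# Sanity check: PyTorch version and whether CUDA is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```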
Checkpoints && Demo dependencies
In the EAT_code folder, use gdown (or download manually) and unzip ckpt, demo, and Utils into the corresponding folders:
gdown --id 1KK15n2fOdfLECWN5wvX54mVyDt18IZCo && unzip -q ckpt.zip -d ckpt
gdown --id 1MeFGC7ig-vgpDLdhh2vpTIiElrhzZmgT && unzip -q demo.zip -d demo
gdown --id 1HGVzckXh-vYGZEUUKMntY1muIbkbnRcd && unzip -q Utils.zip -d Utils
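After the three archives above are unpacked, a quick sanity check (just an illustrative one-liner, not part of the repo's tooling) is to confirm the expected folders exist:

```bash
# The ckpt, demo, and Utils folders should all be present in EAT_code.
ls ckpt demo Utils
```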
Demo
Execute the code within our <strong>eat</strong> environment using the command:
conda activate eat
Then, run the demo with:
CUDA_VISIBLE_DEVICES=0 python demo.py --root_wav ./demo/video_processed/W015_neu_1_002 --emo hap
- Parameters:
  - root_wav: Choose from ['obama', 'M003_neu_1_001', 'W015_neu_1_002', 'W009_sad_3_003', 'M030_ang_3_004']. Preprocessed wavs are located in ./demo/video_processed/. The 'obama' wav is approximately 5 minutes long, while the others are much shorter.
  - emo: Choose from ['ang', 'con', 'dis', 'fea', 'hap', 'neu', 'sad', 'sur']. (A loop over all emotions is sketched below.)
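For example, a minimal bash loop (assuming the eat environment is active and the demo checkpoints above are in place) renders the same driven wav once per emotion:

```bash
# Generate one video per emotion label for a single driven wav.
# The emotion codes are exactly the --emo options listed above.
for emo in ang con dis fea hap neu sad sur; do
    CUDA_VISIBLE_DEVICES=0 python demo.py \
        --root_wav ./demo/video_processed/W015_neu_1_002 \
        --emo "$emo"
done
```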
Note 1: Place your own images in ./demo/imgs/ and run the above command to generate talking-head videos with aligned new portraits. If you prefer not to align your portrait, simply place your cropped image (256x256) in ./demo/imgs_cropped. Due to the background used in the MEAD training set, results tend to be better with a similar background.
Note 2: To test with custom audio, replace video_name/video_name.wav and the DeepSpeech feature video_name/deepfeature32/video_name.npy. The output length depends on the shorter of the audio and the driven poses. Refer to here for more details.
Note 3: The audio used in our work should be sampled at 16,000 Hz and the corresponding video should have a frame rate of 25 fps.
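If your own clip does not match these specs, a standard ffmpeg conversion can bring it into line; ffmpeg is not shipped with this repo, so treat the following as an illustrative sketch with placeholder file names:

```bash
# Re-encode a source clip to 25 fps video with 16 kHz mono audio...
ffmpeg -i input.mp4 -r 25 -ar 16000 -ac 1 input_25fps_16k.mp4
# ...or just extract a 16 kHz mono wav from it.
ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 input_16k.wav
```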
Test MEAD
To reproduce the results of MEAD as reported in our paper, follow these steps:
First, download the additional MEAD test data from mead_data and unzip it into the mead_data directory:
gdown --id 1_6OfvP1B5zApXq7AIQm68PZu1kNyMwUY && unzip -q mead_data.zip -d mead_data
Then, execute the test using the following command:
CUDA_VISIBLE_DEVICES=0 python test_mead.py [--part 0/1/2/3] [--mode 0]
- Parameters:
  - part: Choose from [0, 1, 2, 3]. These represent the four test parts in the MEAD test data.
  - mode: Choose from [0, 1], where 0 tests only 100 samples in total and 1 tests all samples (985 in total). (A loop over all four parts is sketched below.)
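For instance, the following sketch runs all four parts sequentially on a single GPU in the quick 100-sample mode (spread the parts across GPUs if you have more than one):

```bash
# Run every MEAD test part with mode 0 (100 samples in total).
for part in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=0 python test_mead.py --part "$part" --mode 0
done
```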
You can use our evaluation_eat code to evaluate.
Test LRW
To reproduce the results of LRW as reported in our paper, you need to download and extract the LRW test dataset from here. Due to license restrictions, we cannot provide any video data. (The names of the test files can be found here for validation.) After downloading LRW, you will need to preprocess it using our preprocessing code. Then, move and rename the output files as follows:
'imgs, latents, deepfeature32, poseimg, video_fps25/.wavs' --> 'lrw/lrw_images, lrw/lrw_latent, lrw/lrw_df32, lrw/poseimg, lrw/lrw_wavs/.wav'
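A minimal shell sketch of that move/rename step; OUT is a hypothetical placeholder for wherever your preprocessing run actually wrote its outputs, and the exact layout may differ on your machine:

```bash
# Map the preprocessing outputs onto the layout expected by the LRW test script.
OUT=preprocess_output   # placeholder: your actual preprocessing output directory
mkdir -p lrw/lrw_wavs
mv "$OUT/imgs"           lrw/lrw_images
mv "$OUT/latents"        lrw/lrw_latent
mv "$OUT/deepfeature32"  lrw/lrw_df32
mv "$OUT/poseimg"        lrw/poseimg
mv "$OUT"/video_fps25/*.wav lrw/lrw_wavs/
```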
Change the dataset path in test_lrw_posedeep_normalize_neutral.py.
Then, execute the following command:
CUDA_VISIBLE_DEVICES=0 python test_lrw_posedeep_normalize_neutral.py --name deepprompt_eam3d_all_final_313 --part [0/1/2/3] --mode 0
or run them concurrently:
bash test_lrw_posedeep_normalize_neutral.sh
The results will be saved in './result_lrw/'.
Preprocessing
If you want to test with your own driven video, place your video (which must include audio) in the preprocess/video directory. Then execute the preprocessing code:
cd preprocess
python preprocess_video.py
The video will be processed and saved in the demo/video_processed directory. To test it, run:
CUDA_VISIBLE_DEVICES=0 python demo.py --root_wav ./demo/video_processed/[fill in your video name] --emo [fill in emotion name]
The video should contain only one person. We will crop the input video according to the landmarks estimated from the first frame. Refer to these videos for more details.
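Putting the steps above together, an end-to-end run might look like the sketch below; my_video.mp4 is only an example file name, and the processed folder is assumed to take the video's base name:

```bash
# Preprocess a custom driven video (with audio), then render it with a chosen emotion.
cp my_video.mp4 preprocess/video/
cd preprocess && python preprocess_video.py && cd ..
CUDA_VISIBLE_DEVICES=0 python demo.py --root_wav ./demo/video_processed/my_video --emo hap
```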
Note 1: The preprocessing code has been verified to work correctly with TensorFlow version 1.15.0, which can be installed on Python 3.7. Refer to this issue for more information.
Note 2: Extract the bbox for training with preprocess/extract_bbox.py.
A2KP Training
Data&Ckpt Preparation:
- Download the VoxCeleb2 dataset from the official website or here and preprocess it with our code. Note that we exclude some videos with blurred or small faces, based on the face detected in the first frame.
- Modify the dataset path in config/pretrain_a2kp_s1.yaml, config/pretrain_a2kp_img_s2.yaml and frames_dataset_transformer25.py.
- Download ckpt 000299_1024-checkpoint.pth.tar to the 'ckpt/' folder.
- Download the voxselect file and untar it into the processed vox_path.
Execution:
- Run the following command to start training the A2KP transformer with latent and PCA losses on 4 GPUs:
python pretrain_a2kp.py --config config/pretrain_a2kp_s1.yaml --device_ids 0,1,2,3 --checkpoint ./ckpt/pretrain_new_274.pth.tar
- Note: Stop training when the loss converges. We trained for 8 epochs here. Our training log is at ./output/qvt_2 30_10_22_14.59.29/log.txt. Copy and rename the output checkpoint to the ./ckpt folder, for example: ckpt/qvt_2_1030_281.pth.tar
- Run the following command to start training the A2KP transformer with all losses on 4 GPUs:
python pretrain_a2kp_img.py --config config/pretrain_a2kp_img_s2.yaml --device_ids 0,1,2,3 --checkpoint ./ckpt/qvt_2_1030_281.pth.tar
- Note: Stop training when the loss converges. We trained for 24 epochs here. Our training log is at ./output/qvt_img_pca_sync_4 01_11_22_15.47.54/log.txt
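For reference, the whole A2KP stage is just these two runs back to back; the intermediate checkpoint name (qvt_2_1030_281.pth.tar) is the example from the note above and will differ for your run:

```bash
# Stage 1: A2KP transformer with latent and PCA losses.
python pretrain_a2kp.py --config config/pretrain_a2kp_s1.yaml --device_ids 0,1,2,3 \
    --checkpoint ./ckpt/pretrain_new_274.pth.tar
# ...stop when the loss converges, then copy/rename the latest output checkpoint into ./ckpt/ ...
# Stage 2: A2KP transformer with all losses, resuming from the renamed stage-1 checkpoint.
python pretrain_a2kp_img.py --config config/pretrain_a2kp_img_s2.yaml --device_ids 0,1,2,3 \
    --checkpoint ./ckpt/qvt_2_1030_281.pth.tar
```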
Emotional Adaptation Training
Data&Ckpt Preparation:
- The processed MEAD data used in our paper can be downloaded from Yandex or Baidu. After downloading, concatenate and unzip the files, then update the paths in deepprompt_eam3d_st_tanh_304_3090_all.yaml and frames_dataset_transformer25.py.
- We have updated environment.yaml to adapt to the training environment. You can install the required packages using pip or mamba, or reinstall the eat environment.
- We have also updated ckpt.zip, which contains the pre-trained checkpoints that can be used directly for the second phase of training.
Execution:
- Run the following command to start training on 4 GPUs:
python -u prompt_st_dp_eam3d.py --config ./config/deepprompt_eam3d_st_tanh_304_3090_all.yaml --device_ids 0,1,2,3 --checkpoint ./ckpt/qvt_img_pca_sync_4_01_11_304.pth.tar
- Note 1: The batch_size in the config should be consistent with the number of GPUs. To compute the sync loss, we train syncnet_T consecutive frames (5 in our paper) in a batch. Each GPU is assigned one batch during training, consuming around 17 GB of VRAM.
- Note 2: Our checkpoints are saved every half hour. The results in the paper were obtained with 4 NVIDIA 3090 GPUs, training for about 5-6 hours. Please refer to output/deepprompt_eam3d_st_tanh_304_3090_all\ 03_11_22_15.40.38/log.txt for the training logs at that time; the convergence speed of the training loss should be similar to what is shown there.
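While the job runs, one convenient way to watch progress is to follow the log file; the folder-name pattern below is taken from Note 2, and your timestamp will differ:

```bash
# Follow the training log of the most recent run (glob matches the timestamped folder).
tail -f ./output/deepprompt_eam3d_st_tanh_304_3090_all*/log.txt
```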
Evaluation:
- The checkpoints and logs are saved at ./output/deepprompt_eam3d_st_tanh_304_3090_all [timestamp].
- Change the data root in test_posedeep_deepprompt_eam3d.py and dirname in test_posedeep_deepprompt_eam3d.sh, then run the following command for batch testing:
bash test_posedeep_deepprompt_eam3d.sh
- The results from sample testing (100 samples) are stored in ./result. You can use our evaluation_eat code to evaluate.
Zero-shot Editing
- Install our CLIP: pip install git+https://github.com/yuangan/CLIP.git. (We modified model.py in the repository.)
- Modify the data path in pretrain_test_posedeep_deepprompt_eam3d_newstyle4.py, then fine-tune on the MEAD dataset with the text "She is talking while crying hard.":
CUDA_VISIBLE_DEVICES=0 python -u prompt_st_dp_eam3d_mapper_full.py --config config/prompt_st_eam3d_tanh_mapper.yaml --device_ids 0 --checkpoint ./ckpt/deepprompt_eam3d_all_final_313.pth.tar
- Test the fine-tuned model. The results will be saved in './result_mapper':
CUDA_VISIBLE_DEVICES=0 python pretrain_test_posedeep_deepprompt_eam3d_newstyle4.py --name prompt_st_eam3d_tanh_mapper\ xxxxxx(replace with your path) --part 3 --mode 1
- Similarly, you can fine-tune and test on the LRW dataset with the text "He is talking with a fierce expression.":
CUDA_VISIBLE_DEVICES=0 python -u prompt_st_dp_eam3d_mapper_full_lrw.py --config config/prompt_st_eam3d_tanh_mapper.yaml --device_ids 0 --checkpoint ./ckpt/deepprompt_eam3d_all_final_313.pth.tar
CUDA_VISIBLE_DEVICES=0 python pretrain_test_posedeep_deepprompt_eam3d_newstyle4_lrw.py --name prompt_st_eam3d_tanh_mapper\ xxxxxx(replace with your path) --part 3 --mode 1
Contact
Our code is under the CC-BY-NC 4.0 license and is intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at ganyuan@zju.edu.cn and yangyics@zju.edu.cn.
Citation
If you find this code helpful for your research, please cite:
@InProceedings{Gan_2023_ICCV,
author = {Gan, Yuan and Yang, Zongxin and Yue, Xihang and Sun, Lingyun and Yang, Yi},
title = {Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {22634-22645}
}
Acknowledgements
We acknowledge these works for their public code and selfless help: EAMM, OSFV (unofficial), AVCT, PC-AVS, Vid2Vid, AD-NeRF and so on.
</div>