<!-- # EDTalk -->

<div align="center">🚀 EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis</div>

<p align="center"> <a href="https://scholar.google.com.hk/citations?user=9KjKwDwAAAAJ&hl=en">Shuai Tan</a><sup>1</sup>, <a href="https://scholar.google.com.hk/citations?hl=zh-CN&user=uZeBvd8AAAAJ">Bin Ji</a><sup>1</sup>, <a href="">Mengxiao Bi</a><sup>2</sup>, <a href="">Ye Pan</a><sup>1</sup>, <br><br> <sup>1</sup>Shanghai Jiao Tong University<br> <sup>2</sup>NetEase Fuxi AI Lab<br> <br> <i><strong><a href='https://eccv2024.ecva.net/' target='_blank'>ECCV 2024 Oral</a></strong></i> </p> <div align="center"> <a href="https://tanshuai0219.github.io/EDTalk/"><img src="https://img.shields.io/badge/project-EDTalk-red"></a> &ensp; <a href="https://arxiv.org/abs/2404.01647"><img src="https://img.shields.io/badge/Arxiv-EDTalk-blue"></a> &ensp; <a href="https://github.com/tanshuai0219/EDTalk"><img src="https://img.shields.io/github/stars/tanshuai0219/EDTalk?style=social"></a> &ensp; <!-- [![GitHub Stars](https://img.shields.io/github/stars/yuangan/EAT_code?style=social)](https://github.com/yuangan/EAT_code) --> <!-- <a href="https://arxiv.org/abs/2404.01647"><img src="https://img.shields.io/badge/OpenXlab-EDTalk-grenn"></a> &ensp; --> </div> <div align="center"> <img src="assets/image/teaser.svg" width="900" ></img> <br> </div> <br>

## 🎏 Abstract

Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment value of talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they <strong>a)</strong> operate independently without mutual interference and <strong>b)</strong> can be preserved to share with different modal inputs; both aspects are often neglected in existing methods. To address this gap, this paper proposes a novel <strong>E</strong>fficient <strong>D</strong>isentanglement framework for <strong>Talk</strong>ing head generation (<strong>EDTalk</strong>). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on both video and audio inputs. Specifically, we employ three <strong>lightweight</strong> modules to decompose the facial dynamics into three distinct latent spaces representing mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among the bases and devise an <strong>efficient</strong> training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose the Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk.

## 💻 Overview

<div align="center"> <img src="assets/image/EDTalk.png" width="800" ></img> <br> </div> <br> <!-- Illustration of our proposed EDTalk. (a) EDTalk framework. Given an identity source $ I^i $ and various driving images $I^*$ ($ * \in \{m,p,e\} $) for controlling corresponding facial components, EDTalk animates the identity image $ I^i $ to mimic the mouth shape, head pose, and expression of $ I^m $, $ I^p $, and $ I^e $ with the assistance of three Component-aware Latent Navigation modules: MLN, PLN, and ELN. (b) Efficient Disentanglement. The disentanglement process consists of two parts: Mouth-Pose decouple and Expression Decouple. For the former, we introduce the cross-reconstruction training strategy aimed at separating mouth shape and head pose. For the latter, we achieve expression disentanglement using self-reconstruction complementary learning. -->

## 🔥 Update

## 📅 TODO

## 🎮 Installation

We train and test with Python 3.8 and PyTorch. To install the dependencies, run:

git clone https://github.com/tanshuai0219/EDTalk.git
cd EDTalk

Install the dependencies:

conda create -n EDTalk python=3.8
conda activate EDTalk
pip install -r requirements.txt
# on Windows:
pip install -r requirements_windows.txt
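
If you want to confirm the environment is usable before moving on, a minimal sanity check like the following (illustrative only, not part of the repo) verifies that PyTorch imports and a GPU is visible:

```python
# sanity_check.py -- minimal environment check (illustrative; not shipped with EDTalk)
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # Inference is much faster on a GPU, but this check is purely informational.
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```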

Thanks to nitinmukesh for providing a Windows 11 installation tutorial; feel free to follow his channel!

Run the Gradio web UI:

python webui_emotions.py
<div align="center"> <img src="assets/image/gradio.png" width="800" ></img> <br> </div>

## 🎬 Quick Start

Download the checkpoints (from either the checkpoints or the huggingface link) and put them into ./ckpts.

[For Chinese users] The weights can also be downloaded via this link.
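
As a quick check that the download landed in the right place, a small helper along these lines (hypothetical, not shipped with the repo) lists what is in ./ckpts:

```python
# check_ckpts.py -- list the checkpoint files found in ./ckpts (illustrative helper)
from pathlib import Path

ckpt_dir = Path("ckpts")
weights = sorted(ckpt_dir.glob("*.pt*")) if ckpt_dir.is_dir() else []

if not weights:
    print("No checkpoints found -- download them and place them in ./ckpts first.")
for w in weights:
    size_mb = w.stat().st_size / 1e6
    print(f"{w.name:<40s} {size_mb:8.1f} MB")
```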

**EDTalk-A: lip+pose+exp**: Run the demo in the audio-driven setting (EDTalk-A):

For user-friendliness, we extracted the weights of eight common expressions from the expression base, so you can directly specify the expression type to generate emotional talking face videos (recommended):

python demo_EDTalk_A_using_predefined_exp_weights.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_type type/of/expression --save_path path/to/save

# example:
python demo_EDTalk_A_using_predefined_exp_weights.py --source_path res/results_by_facesr/demo_EDTalk_A.png --audio_driving_path test_data/mouth_source.wav --pose_driving_path test_data/pose_source1.mp4 --exp_type angry --save_path res/demo_EDTalk_A_using_weights.mp4
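
If you want one clip per predefined expression, a simple loop over the script works. The list below is only illustrative (angry comes from the example above); check the --exp_type choices accepted by demo_EDTalk_A_using_predefined_exp_weights.py for the exact eight names:

```python
# batch_predefined_emotions.py -- run the audio-driven demo once per expression type (sketch)
import subprocess

# Illustrative subset; replace with the eight types the script actually accepts.
exp_types = ["angry", "happy", "sad", "surprised"]

for exp in exp_types:
    subprocess.run([
        "python", "demo_EDTalk_A_using_predefined_exp_weights.py",
        "--source_path", "res/results_by_facesr/demo_EDTalk_A.png",
        "--audio_driving_path", "test_data/mouth_source.wav",
        "--pose_driving_path", "test_data/pose_source1.mp4",
        "--exp_type", exp,
        "--save_path", f"res/demo_EDTalk_A_{exp}.mp4",
    ], check=True)
```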

<video controls loop src="https://github.com/user-attachments/assets/09ff9885-073b-4750-bec5-f1574126d6eb" muted="false"></video><video controls loop src="https://github.com/user-attachments/assets/f5c7ff63-66dd-45cb-9e2d-7808fbb0fbaf" muted="false"></video><video controls loop src="https://github.com/user-attachments/assets/3dbad54f-bd23-4e28-83c7-d941e58d506a" muted="false"></video>
<video controls loop src="https://github.com/user-attachments/assets/9a1423f6-3b5d-4cfc-8658-7b9d2fd348d2" muted="false"></video><video controls loop src="https://github.com/user-attachments/assets/c29148f6-6083-4ef7-8b76-3340ad32a832" muted="false"></video><video controls loop src="https://github.com/user-attachments/assets/0c8fed0f-3267-4507-9c43-db7464d65abf" muted="false"></video>

Alternatively, you can provide an expression reference (image/video) to specify the expression:

python demo_EDTalk_A.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_driving_path path/to/expression --save_path path/to/save

# example:
python demo_EDTalk_A.py --source_path res/results_by_facesr/demo_EDTalk_A.png --audio_driving_path test_data/mouth_source.wav --pose_driving_path test_data/pose_source1.mp4 --exp_driving_path test_data/expression_source.mp4 --save_path res/demo_EDTalk_A.mp4

The result will be stored in save_path.

The source_path image and the driving videos must first be cropped using the scripts crop_image2.py (download shape_predictor_68_face_landmarks.dat and put it in the ./data_preprocess dir) and crop_video.py. Make sure every video's frame rate is 25 fps.

You can also use crop_image.py to crop the image, but increase_ratio has to be set carefully and may need several tries to get the optimal result.
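
The 25 fps requirement is easy to overlook. The sketch below (assuming ffmpeg/ffprobe are on your PATH) checks the frame rate of the driving videos and re-encodes any that do not match; cropping itself is still done with the crop_image2.py / crop_video.py scripts mentioned above.

```python
# check_fps.py -- verify (and optionally fix) the 25 fps requirement for driving videos (sketch)
import subprocess
from pathlib import Path

def get_fps(video: Path) -> float:
    # ffprobe prints the rate as a fraction, e.g. "30000/1001"
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=r_frame_rate",
         "-of", "default=noprint_wrappers=1:nokey=1", str(video)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    num, den = out.split("/")
    return float(num) / float(den)

for video in Path("test_data").glob("*.mp4"):
    fps = get_fps(video)
    if abs(fps - 25.0) > 1e-3:
        fixed = video.with_name(video.stem + "_25fps.mp4")
        print(f"{video.name}: {fps:.2f} fps -> re-encoding to {fixed.name}")
        subprocess.run(["ffmpeg", "-y", "-i", str(video),
                        "-filter:v", "fps=25", str(fixed)], check=True)
    else:
        print(f"{video.name}: OK (25 fps)")
```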

<!-- For images where faces only make up a small portion of the image, we recommend using the [crop_image2.py](data_preprocess/crop_image2.py) to crop image. -->

**EDTalk-A: lip+pose without exp**: If you don't want to change the expression of the identity source, please download EDTalk_lip_pose.pt and put it into ./ckpts.

If you only want to change the lip motion of the identity source, run:

 python demo_lip_pose.py --fix_pose --source_path path/to/image --audio_driving_path path/to/audio --save_path path/to/save
 # example:
 python demo_lip_pose.py --fix_pose --source_path test_data/identity_source.jpg --audio_driving_path test_data/mouth_source.wav --save_path res/demo_lip_pose_fix_pose.mp4

Or you can additionally control the head pose on top of the above via pose_driving_path:

 python demo_lip_pose.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --save_path path/to/save

 # example:
 python demo_lip_pose.py --source_path test_data/identity_source.jpg --audio_driving_path test_data/mouth_source.wav --pose_driving_path test_data/pose_source1.mp4 --save_path res/demo_lip_pose.mp4
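
The two modes of demo_lip_pose.py differ only in whether --fix_pose or --pose_driving_path is passed; a small wrapper like this (illustrative only) makes that switch explicit:

```python
# run_lip_pose.py -- thin wrapper around demo_lip_pose.py (sketch)
import subprocess

def run_lip_pose(source, audio, save_path, pose=None):
    cmd = ["python", "demo_lip_pose.py",
           "--source_path", source,
           "--audio_driving_path", audio,
           "--save_path", save_path]
    if pose is None:
        cmd.append("--fix_pose")              # keep the head still
    else:
        cmd += ["--pose_driving_path", pose]  # follow the pose of a driving video
    subprocess.run(cmd, check=True)

run_lip_pose("test_data/identity_source.jpg", "test_data/mouth_source.wav",
             "res/demo_lip_pose_fix_pose.mp4")
run_lip_pose("test_data/identity_source.jpg", "test_data/mouth_source.wav",
             "res/demo_lip_pose.mp4", pose="test_data/pose_source1.mp4")
```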

| Source Img | EDTalk | EDTalk + LivePortrait |
|------------|--------|-----------------------|
| <img src="https://github.com/user-attachments/assets/1620d456-7bbf-436b-8bad-fdcd247e9f26" width="250" ></img> | <video controls loop src="https://github.com/user-attachments/assets/36ae9b6d-fc96-476a-8e63-8fe318b32782" muted="false"></video> | |
| <img src="https://github.com/user-attachments/assets/22fd0a6a-dc00-4719-9bc8-9778fd5b0e79" width="250" ></img> | <video controls loop src="https://github.com/user-attachments/assets/70c27d4b-dd06-4ae1-81ad-7e4795fce541" muted="false"></video> | <video controls loop src="https://github.com/user-attachments/assets/5cfb1933-ec7c-48a6-8343-507f5fd4a090" muted="false"></video> |

You can also control the lip motion via a driving video:

 python demo_lip_pose_V.py --source_path path/to/image --audio_driving_path path/to/audio --lip_driving_path path/to/mouth --pose_driving_path path/to/pose --save_path path/to/save

# example:
 python demo_lip_pose_V.py --source_path res/results_by_facesr/demo_lip_pose5.png --audio_driving_path test_data/mouth_source.wav --lip_driving_path test_data/mouth_source.mp4 --pose_driving_path test_data/pose_source1.mp4 --save_path demo_lip_pose_V.mp4

| Source Img | demo_lip_pose_V Results | + FaceSR |
|------------|-------------------------|----------|
| <img src="test_data/identity_source.jpg" width="250" ></img> | <video controls loop src="https://github.com/user-attachments/assets/912097cf-ce92-42ca-960b-c4e0906cb0b0" muted="false"></video> | <video controls loop src="https://github.com/user-attachments/assets/c4e1a81c-76c1-462a-b671-9c82e37e14ad" muted="false"></video> |
| <img src="test_data/leijun.png" width="250" ></img> | <video controls loop src="https://github.com/user-attachments/assets/4e630594-1dd2-47fb-b367-6be7a700c769" muted="false"></video> | <video controls loop src="https://github.com/user-attachments/assets/f1a0b477-a120-47a5-b925-00af4ff09781" muted="false"></video> |

To change the lip motion of a source video, run:

 python demo_change_a_video_lip.py --source_path path/to/video --audio_driving_path path/to/audio --save_path path/to/save

 # example
 python demo_change_a_video_lip.py --source_path test_data/pose_source1.mp4 --audio_driving_path test_data/mouth_source.wav --save_path res/demo_change_a_video_lip.mp4

| Source Video | Results #1 | Results #2 |
|--------------|------------|------------|
| <video controls loop src="https://github.com/user-attachments/assets/f940a507-d28c-4cc9-abda-af82c6bbf596" muted="false"></video> | <video controls loop src="https://github.com/user-attachments/assets/d199732f-66ad-4182-9df1-0e4416ec8a51" muted="false"></video> | <video controls loop src="https://github.com/user-attachments/assets/328d2b9d-8e98-4814-9d6f-195dddfd80f7" muted="false"></video> |

Run the demo in the video-driven setting (EDTalk-V):

python demo_EDTalk_V.py --source_path path/to/image --lip_driving_path path/to/lip --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_driving_path path/to/expression --save_path path/to/save

# example:
python demo_EDTalk_V.py --source_path test_data/identity_source.jpg --lip_driving_path test_data/mouth_source.mp4 --audio_driving_path test_data/mouth_source.wav --pose_driving_path test_data/pose_source1.mp4 --exp_driving_path test_data/expression_source.mp4 --save_path res/demo_EDTalk_V.mp4

The result will be stored in save_path.

### Face Super-resolution (Optional)

☺️🙏 Thanks to Tao Liu for the proposal~

The purpose is to upscale the output resolution from 256×256 to 512×512 and address the issue of blurry rendering.

Please install the additional dependencies:

pip install facexlib
pip install tb-nightly -i https://mirrors.aliyun.com/pypi/simple
pip install gfpgan

Then enable the --face_sr option in your command. The first run will download the GFPGAN weights (you can optionally download the GFPGAN checkpoints beforehand and put them in the gfpgan/weights dir).
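
To confirm the optional dependencies are in place before running with --face_sr, a quick import check such as this (illustrative only) is enough:

```python
# check_facesr_deps.py -- confirm the optional face-SR dependencies are importable (sketch)
import importlib
from pathlib import Path

for mod in ("facexlib", "gfpgan"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: OK")
    except ImportError as e:
        print(f"{mod}: missing ({e})")

weights_dir = Path("gfpgan/weights")
if weights_dir.is_dir() and any(weights_dir.iterdir()):
    print("Pre-downloaded GFPGAN weights found in gfpgan/weights.")
else:
    print("No local GFPGAN weights -- they will be downloaded on the first --face_sr run.")
```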

Here are some examples:


python demo_lip_pose.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --save_path path/to/save --face_sr

python demo_EDTalk_V.py --source_path path/to/image --lip_driving_path path/to/lip --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_driving_path path/to/expression --save_path path/to/save --face_sr

python demo_EDTalk_A_using_predefined_exp_weights.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_type type/of/expression --save_path path/to/save --face_sr
<!-- **Note:** Due to the limitations of markdown, we downsampled the results after facesr, which may be detrimental to video quality and smoothness, see the [results_by_facesr](res/results_by_facesr) for detailed results. | Source Img | EDTalk Results | EDTalk + FaceSR | |------------|--------------------------|---------------------------| |<img src="res/results_by_facesr/demo_lip_pose5.png" width="200" ></img> | <img src="res/results_by_facesr/gif/demo_lip_pose5.gif" width="200" ></img> | <img src="res/results_by_facesr/gif/demo_lip_pose5_512.gif" width="200" ></img> | |<img src="res/results_by_facesr/demo_EDTalk_A.png" width="200" ></img> | <img src="res/results_by_facesr/gif/demo_EDTalk_A.gif" width="200" ></img> | <img src="res/results_by_facesr/gif/demo_EDTalk_A_512.gif" width="200" ></img> | |<img src="res/results_by_facesr/RD_Radio51_000.png" width="200" ></img> | <img src="res/results_by_facesr/gif/RD_Radio51_000.gif" width="200" ></img> | <img src="res/results_by_facesr/gif/RD_Radio51_000_512.gif" width="200" ></img> | -->
| Source Img | EDTalk Results | EDTalk + FaceSR |
|------------|----------------|-----------------|
| <img src="res/results_by_facesr/demo_lip_pose5.png" width="250" ></img> | <video controls loop src="https://github.com/user-attachments/assets/f450414f-e272-49eb-a39e-0ffcb9269470" muted="false"></video> | <video controls loop src="https://github.com/user-attachments/assets/6ad42d0b-6c3d-498b-b16f-0bb0fc7699b7" muted="false"></video> |
| <img src="res/results_by_facesr/demo_EDTalk_A.png" width="250" ></img> | <video controls loop src="https://github.com/user-attachments/assets/8ca59ada-507c-4d4e-a126-0e806582b4b6" muted="false"></video> | <video controls loop src="https://github.com/user-attachments/assets/bccea19d-513c-4c22-8c49-4aac7c7d49d0" muted="false"></video> |
| <img src="res/results_by_facesr/RD_Radio51_000.png" width="250" ></img> | <video controls loop src="https://github.com/user-attachments/assets/b75f5a6c-0d38-4dc2-bbfa-330290f098ba" muted="false"></video> | <video controls loop src="https://github.com/user-attachments/assets/644100c6-608e-4266-8b94-6b61880dddbe" muted="false"></video> |

## 🎬 Fine-tune on a specific person

There are a few known issues at the moment, and I'll be checking them carefully, so please be patient! Note: We take Obama and the paths on my computer (/data/ts/xxxxxx) as an example; you should replace them with your own paths:

| Step #0 | Step #100 | Step #200 |
|---------|-----------|-----------|
| <img src="fine_tune/examples/step_00000.jpg" width="250" ></img> | <img src="fine_tune/examples/step_00200.jpg" width="250" ></img> | <img src="fine_tune/examples/step_00400.jpg" width="250" ></img> |
The first row shows the source image, the second row the driving image, and the third row the generated results.

## 🎬 Data Preprocessing for Training

<details>
<summary> Data Preprocessing for Training </summary>

**Note**: The functions provided work, but you will need to adjust how they are called, e.g. by modifying the paths to the data. If you run into any problems, feel free to open an issue!

- Download the MEAD and HDTF datasets:

1) **MEAD**. [download link](https://wywu.github.io/projects/MEAD/MEAD.html).

We only use the *Front* videos, extract the audio, and organize the data as follows:

```text
/dir_path/MEAD_front/
|-- Original_video
|   |-- M003#angry#level_1#001.mp4
|   |-- M003#angry#level_1#002.mp4
|   |-- ...
|-- audio
|   |-- M003#angry#level_1#001.wav
|   |-- M003#angry#level_1#002.wav
|   |-- ...
```
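
One way to populate the audio/ folder is to strip the audio track from each Front video with ffmpeg. The sketch below assumes the /dir_path/ layout shown above and a 16 kHz mono WAV output; adjust both to match your setup and whatever the training code expects.

```python
# extract_mead_audio.py -- build audio/ from Original_video/ (sketch; paths and sample rate are assumptions)
import subprocess
from pathlib import Path

root = Path("/dir_path/MEAD_front")
video_dir = root / "Original_video"
audio_dir = root / "audio"
audio_dir.mkdir(parents=True, exist_ok=True)

for video in sorted(video_dir.glob("*.mp4")):
    wav = audio_dir / (video.stem + ".wav")
    if wav.exists():
        continue  # skip clips that were already extracted
    subprocess.run(["ffmpeg", "-y", "-i", str(video),
                    "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
                    str(wav)], check=True)
    print(f"extracted {wav.name}")
```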

2) **HDTF**. [download link](https://github.com/MRzzm/HDTF).

We organize the data as follows:

```text
/dir_path/HDTF/
|-- audios
|   |-- RD_Radio1_000.wav
|   |-- RD_Radio2_000.wav
|   |-- ...
|-- original_videos
|   |-- RD_Radio1_000.mp4
|   |-- RD_Radio2_000.mp4
|   |-- ...
```

</details>

## 🎬 Start Training

<details>
<summary> Start Training </summary>

- Pretrain the Encoder $E$ and Generator $G$:
  - Please refer to [LIA](https://github.com/wyhsirius/LIA) to train them from scratch.
  - (Optional) If you want to accelerate convergence, you can download the pre-trained model of [LIA](https://github.com/wyhsirius/LIA).
  - We provide training code to fine-tune the model on the MEAD and HDTF datasets:
```bash
python -m torch.distributed.launch --nproc_per_node=2 --master_port 12345 train/train_E_G.py
```
<!-- - Train Expression Decouple module: ```bash python -m torch.distributed.launch --nproc_per_node=2 --master_port 12344 train/train_Expression_decouple.py ``` -->

</details>

## 🙏 Thanks to all contributors for their efforts

We hope more people can get involved, and we will promptly handle pull requests. Currently, there are still some tasks that need assistance, such as creating a Colab notebook, a web UI, and translations, among others.


## 👨‍👩‍👧‍👦 Other Talking head papers:

- [ICCV 23] EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation
- [AAAI 24] Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style
- [AAAI 24] Say Anything with Any Style
- [CVPR 24] FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization

## 🎓 Citation

@inproceedings{tan2024edtalk,
  title = {EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis},
  author = {Tan, Shuai and Ji, Bin and Bi, Mengxiao and Pan, Ye},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year = {2024}
}

## 🙏 Acknowledgement

Some code is borrowed from the following projects:

Some figures in the paper are inspired by:

Thanks to these great projects.