MEMO

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Longtao Zheng*, Yifan Zhang*, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan
Project Page | arXiv | Model

This repository contains the example inference script for the MEMO-preview model. The gif demo below is compressed. See our project page for full videos. Also, check out the community contributions, including a ComfyUI integration, Gradio app, demo, and Jupyter notebook.

Installation

conda create -n memo python=3.10 -y
conda activate memo
conda install -c conda-forge ffmpeg -y
pip install -e .

Our code downloads the checkpoint from Hugging Face automatically. The models for face analysis and vocal separation are downloaded to the directory set by misc_model_dir in configs/inference.yaml. To download the models manually instead, fetch the checkpoint from Hugging Face and set its local path as model_name_or_path in configs/inference.yaml.

Inference

python inference.py --config configs/inference.yaml --input_image <IMAGE_PATH> --input_audio <AUDIO_PATH> --output_dir <SAVE_PATH>

For example:

python inference.py --config configs/inference.yaml --input_image assets/examples/dicaprio.jpg --input_audio assets/examples/speech.wav --output_dir outputs
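To process several image/audio pairs, the command above can be wrapped in a small driver script. The following is a minimal sketch, not part of the repository: the helper names build_command and run_batch are hypothetical, and it assumes the script is launched from the repository root so that inference.py and configs/inference.yaml resolve. It uses only the four flags shown above.

```python
import subprocess
import sys
from pathlib import Path


def build_command(image, audio, output_dir, config="configs/inference.yaml"):
    """Return the argv list for one inference.py run (hypothetical helper).

    Only the flags documented above are used.
    """
    return [
        sys.executable, "inference.py",
        "--config", str(config),
        "--input_image", str(image),
        "--input_audio", str(audio),
        "--output_dir", str(output_dir),
    ]


def run_batch(pairs, output_dir="outputs", dry_run=True):
    """Print (dry_run=True) or execute one command per (image, audio) pair."""
    commands = [build_command(img, aud, output_dir) for img, aud in pairs]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the batch at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    pairs = [
        (Path("assets/examples/dicaprio.jpg"), Path("assets/examples/speech.wav")),
    ]
    run_batch(pairs, dry_run=True)
```

With dry_run=True the script only prints the commands, which is a cheap way to check paths before committing GPU time. Each run starts a fresh process and reloads the model, so this sketch favors simplicity over throughput.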

We tested the code on H100 and RTX 4090 GPUs using CUDA 12. Under the default settings (fps=30, inference_steps=20), the inference time is around 1 second per frame on H100 and 2 seconds per frame on RTX 4090. We welcome community contributions to improve the inference speed or add more features.
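The per-frame timings above translate into a rough wall-clock estimate for a given clip. The sketch below assumes the output video has the same duration as the input audio, so the frame count is simply duration times fps; that relationship and the function name estimate_inference_seconds are assumptions for illustration. It also ignores model loading and audio preprocessing time.

```python
def estimate_inference_seconds(audio_seconds, fps=30, seconds_per_frame=1.0):
    """Rough generation time, assuming one output frame per 1/fps of audio.

    seconds_per_frame: about 1.0 on H100 and 2.0 on RTX 4090 under the
    default settings quoted above (fps=30, inference_steps=20).
    """
    frames = round(audio_seconds * fps)
    return frames * seconds_per_frame


if __name__ == "__main__":
    for gpu, spf in [("H100", 1.0), ("RTX 4090", 2.0)]:
        total = estimate_inference_seconds(10, seconds_per_frame=spf)
        print(f"{gpu}: about {total / 60:.0f} min for a 10 s clip")
```

Under these assumptions, a 10-second clip is 300 frames, or roughly 5 minutes on H100 and 10 minutes on RTX 4090.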

Acknowledgement

Our work is made possible thanks to high-quality open-source talking video datasets (including HDTF, VFHQ, CelebV-HQ, MultiTalk, and MEAD) and some pioneering works (such as EMO and Hallo).

Ethics Statement

We acknowledge the potential of AI in generating talking videos, with applications spanning education, virtual assistants, and entertainment. However, we are equally aware of the ethical, legal, and societal challenges that misuse of this technology could pose. To reduce potential risks, we have only open-sourced a preview model for research purposes. Demos on our website use publicly available materials. We welcome copyright concerns—please contact us if needed, and we will address issues promptly. Users are required to ensure that their actions align with legal regulations, cultural norms, and ethical standards. It is strictly prohibited to use the model for creating malicious, misleading, defamatory, or privacy-infringing content, such as deepfake videos for political misinformation, impersonation, harassment, or fraud. We strongly encourage users to review generated content carefully, ensuring it meets ethical guidelines and respects the rights of all parties involved. Users must also ensure that their inputs (e.g., audio and reference images) and outputs are used with proper authorization. Unauthorized use of third-party intellectual property is strictly forbidden. While users may claim ownership of content generated by the model, they must ensure compliance with copyright laws, particularly when involving public figures' likeness, voice, or other aspects protected under personality rights.

Citation

If you find our work useful, please use the following citation:

@article{zheng2024memo,
  title={MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation},
  author={Zheng, Longtao and Zhang, Yifan and Guo, Hanzhong and Pan, Jiachun and Tan, Zhenxiong and Lu, Jiahao and Tang, Chuanxin and An, Bo and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2412.04448},
  year={2024}
}
