<p align="center"> <img src="https://github.com/user-attachments/assets/fba781e5-497d-44fa-abb5-07b3b3e8a471" width="256" style="margin-bottom: 0.2;"/> </p> <h2 align="center"> <a href="https://github.com/PKU-YuanGroup/WF-VAE/">WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model</a></h2> <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates. </h5> <details open><summary>💡 I also have other projects that may interest you ✨.</summary><p>Open-Sora Plan: Open-Source Large Video Generation Model <br> Bin Lin and Yunyang Ge and Xinhua Cheng and Zongjian Li and Bin Zhu and Shaodong Wang and Xianyi He and Yang Ye and Shenghai Yuan and Liuhan Chen and Tanghui Jia and Junwu Zhang and Zhenyu Tang and Yatian Pang and Bin She and Cen Yan and Zhiheng Hu and Xiaoyi Dong and Lin Chen and Zhang Pan and Xing Zhou and Shaoling Dong and Yonghong Tian and Li Yuan</p></details>
## 📰 News
- [2024.11.27] 🔥🔥🔥 We have published our report, which provides comprehensive training details and includes additional experiments.
- [2024.11.25] 🔥🔥🔥 We have released our 16-channel WF-VAE-L model along with the training code. It is available for download on Hugging Face.
## 😮 Highlights
WF-VAE utilizes a multi-level wavelet transform to construct an efficient energy pathway, enabling low-frequency information from video data to flow into the latent representation. This method achieves competitive reconstruction performance while markedly reducing computational cost.
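The core idea of a low-frequency energy pathway can be illustrated with a multi-level 1-D Haar transform. This is a simplified NumPy sketch, not the model's actual 3-D wavelet machinery: each level splits the signal into a coarse (low-frequency) half and a detail (high-frequency) half, and only the coarse band is carried forward.

```python
import numpy as np

def haar_level(x):
    """One Haar decomposition level: split a signal into
    low-frequency (averages) and high-frequency (differences) halves."""
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2)   # coarse approximation
    high = (even - odd) / np.sqrt(2)  # detail coefficients
    return low, high

def multi_level_low_path(x, levels=2):
    """Repeatedly keep only the low-frequency band, mimicking an
    energy pathway that feeds coarse structure toward the latent."""
    for _ in range(levels):
        x, _ = haar_level(x)
    return x

signal = np.arange(8, dtype=np.float64)
coarse = multi_level_low_path(signal, levels=2)
print(coarse)  # [ 3. 11.] -- 8 samples reduced 4x by two levels
```

Because the Haar transform is orthonormal, each level preserves signal energy while concentrating most of it in the low band, which is why routing that band directly into the latent is cheap yet informative.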
### 💡 Simpler Architecture, Faster Encoding
- This architecture substantially improves speed and reduces training costs in large-scale video generation models and data processing workflows.
### 🔥 Competitive Reconstruction Performance with SOTA VAEs
- Our experiments demonstrate that WF-VAE achieves reconstruction quality competitive with state-of-the-art VAEs.
## 🚀 Main Results
### Reconstruction
<div align="center"> <img src="https://github.com/user-attachments/assets/0b9d6203-ea31-47b0-86b6-fbfaf96ddb37" style="max-width: 80%;"> </div> <table> <thead> <tr> <th>WF-VAE</th> <th>CogVideoX</th> </tr> </thead> <tbody> <tr> <td> <img src="https://github.com/user-attachments/assets/da74cce6-7878-4aff-ba4a-ed2b3c23f530" alt="WF-VAE"> </td> <td> <img src="https://github.com/user-attachments/assets/a7c8c5f4-8487-485b-80d0-81caf2b01d9f" alt="CogVideoX"> </td> </tr> </tbody> </table>

### Efficiency
We conducted efficiency tests on 33-frame videos using float32 precision on an H100 GPU, with all models running without block-wise inference strategies. Our model matches the performance of state-of-the-art VAEs while significantly reducing encoding cost.
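A minimal timing harness along these lines can reproduce such measurements. This is a CPU-only sketch: `encode` is a stand-in for the real VAE encoder, and on GPU you would call `torch.cuda.synchronize()` before each timestamp so that asynchronous kernels are fully counted.

```python
import time
import numpy as np

def encode(video):
    """Stand-in for a VAE encoder: 8x spatial downsampling.
    Replace with the real model's encode() when benchmarking."""
    return video[:, ::8, ::8]

def benchmark(fn, video, warmup=2, runs=5):
    """Average wall-clock time over several runs after warmup.
    On GPU, synchronize the device before each timestamp."""
    for _ in range(warmup):
        fn(video)
    start = time.perf_counter()
    for _ in range(runs):
        fn(video)
    return (time.perf_counter() - start) / runs

video = np.random.rand(33, 256, 256).astype(np.float32)  # 33 frames
latency = benchmark(encode, video)
print(f"avg encode time: {latency * 1e3:.2f} ms")
```

Warmup runs matter in practice: the first calls pay one-time costs (allocation, kernel compilation) that would otherwise inflate the average.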
<div align="center"> <img src="https://github.com/user-attachments/assets/53f74160-81f0-486e-b294-10dbb5bed8e5" style="max-width: 80%;"> </div>

## 🛠️ Requirements and Installation
```bash
git clone https://github.com/PKU-YuanGroup/WF-VAE
cd WF-VAE
conda create -n wfvae python=3.10 -y
conda activate wfvae
pip install -r requirements.txt
```
## 🤖 Reconstructing Video or Image
To reconstruct a video or an image, execute the following commands:
### Video Reconstruction
```bash
CUDA_VISIBLE_DEVICES=1 python scripts/recon_single_video.py \
    --model_name WFVAE \
    --from_pretrained "Your VAE" \
    --video_path "Video Path" \
    --rec_path rec.mp4 \
    --device cuda \
    --sample_rate 1 \
    --num_frames 65 \
    --height 512 \
    --width 512 \
    --fps 30 \
    --enable_tiling
```
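The `--enable_tiling` flag reduces peak memory by encoding the input in pieces rather than all at once. Below is a rough NumPy sketch of spatial tiling with a toy per-tile transform standing in for the VAE encoder; the repository's actual implementation also handles tile overlap, seam blending, and temporal chunking.

```python
import numpy as np

def encode_tile(tile):
    """Toy stand-in for the VAE encoder on one tile (8x spatial downsample)."""
    return tile[:, ::8, ::8]

def tiled_encode(video, tile=256):
    """Encode a (T, H, W) video tile by tile and stitch the latents.
    Real tiled VAE inference overlaps tiles and blends the seams."""
    T, H, W = video.shape
    rows = []
    for y in range(0, H, tile):
        cols = [encode_tile(video[:, y:y + tile, x:x + tile])
                for x in range(0, W, tile)]
        rows.append(np.concatenate(cols, axis=2))
    return np.concatenate(rows, axis=1)

video = np.zeros((65, 512, 512), dtype=np.float32)
latent = tiled_encode(video)
print(latent.shape)  # (65, 64, 64)
```

Only one tile's activations need to be resident at a time, so peak memory scales with the tile size rather than the full frame.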
### Image Reconstruction
```bash
CUDA_VISIBLE_DEVICES=1 python scripts/recon_single_image.py \
    --model_name WFVAE \
    --from_pretrained "Your VAE" \
    --image_path assets/gt_5544.jpg \
    --rec_path rec.jpg \
    --device cuda \
    --short_size 512
```
For further guidance, refer to the example scripts `examples/rec_single_video.sh` and `examples/rec_single_image.sh`.
## 🗝️ Training & Validating
Training and validation instructions are provided in TRAIN_AND_VALIDATE.md.
## 👍 Acknowledgement
- Open-Sora Plan - https://github.com/PKU-YuanGroup/Open-Sora-Plan
- Allegro - https://github.com/rhymes-ai/Allegro
- CogVideoX - https://github.com/THUDM/CogVideo
- Stable Diffusion - https://github.com/CompVis/stable-diffusion
## ✏️ Citation
```bibtex
@misc{li2024wfvaeenhancingvideovae,
    title={WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model},
    author={Zongjian Li and Bin Lin and Yang Ye and Liuhan Chen and Xinhua Cheng and Shenghai Yuan and Li Yuan},
    year={2024},
    eprint={2411.17459},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.17459},
}
```
## 🔒 License
This project is released under the Apache 2.0 license as found in the LICENSE file.