CogVideo & CogVideoX

Read in Chinese

Read in Japanese

<div align="center">
<img src="resources/logo.svg" width="50%"/>
</div>
<p align="center">
Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank">🤗 Huggingface Space</a> or <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank">🤖 ModelScope Space</a>
</p>
<p align="center">
📚 View the <a href="https://arxiv.org/abs/2408.06072" target="_blank">paper</a> and <a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh" target="_blank">user guide</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/dCGfUsagrD" target="_blank">Discord</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
</p>

Project Updates

Quick Start

Prompt Optimization

Before running the model, please refer to this guide to see how we use large models like GLM-4 (or comparable products such as GPT-4) to optimize prompts. This is crucial because the model is trained on long prompts; a good prompt directly determines the quality of the generated video.
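As an illustration, here is a minimal sketch of such a prompt-upsampling step using an OpenAI-compatible client. The system prompt, model name, and helper function are illustrative assumptions, not the repository's exact script:

```python
# Hypothetical sketch: expand a short user prompt into the long, detailed
# style CogVideoX is trained on, using any OpenAI-compatible chat model.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a prompt engineer for a text-to-video model. Expand the user's "
    "short description into one detailed paragraph covering the scene, "
    "subjects, motion, camera, and lighting. Return only the rewritten prompt."
)

def upsample_prompt(short_prompt: str, model: str = "gpt-4") -> str:
    """Rewrite a terse prompt into the long-form style the model expects."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

print(upsample_prompt("a dog running on the beach at sunset"))
```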

SAT

Please make sure your Python version is between 3.10 and 3.12, inclusive.

Follow the instructions in sat_demo, which contains the inference and fine-tuning code for the SAT weights. This codebase is the recommended starting point for modifying the CogVideoX model structure, and researchers can use it for rapid prototyping and development.

Diffusers

Please make sure your Python version is between 3.10 and 3.12, inclusive.

pip install -r requirements.txt

Then follow diffusers_demo, which walks through the inference code in more detail and explains the meaning of the common parameters.
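For reference, a minimal text-to-video sketch with the Diffusers pipeline follows (the prompt and parameter values are illustrative; see diffusers_demo for the full walkthrough):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 5B text-to-video pipeline in BF16, the recommended precision.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# Memory optimizations: offload submodules to the CPU between forward
# passes and run the VAE in slices/tiles to fit on smaller GPUs.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = (
    "A panda, dressed in a small red jacket and a tiny hat, sits on a "
    "wooden stool in a serene bamboo forest, playing a miniature guitar."
)

video = pipe(
    prompt=prompt,
    num_inference_steps=50,  # the benchmark setting used in the table below
    guidance_scale=6.0,
    num_frames=49,           # 6 seconds at 8 fps, plus the initial frame
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```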

For more details on quantized inference, please refer to diffusers-torchao. With Diffusers and TorchAO, quantized inference reduces memory usage and, when combined with torch.compile, can also speed up inference in some cases. A full list of memory and time benchmarks with various settings on A100 and H100 GPUs has been published at diffusers-torchao.
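As a rough sketch, int8 weight-only quantization of the transformer could look like the following, assuming torchao's `quantize_` API (consult diffusers-torchao for the actual benchmarked configurations):

```python
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# Quantize the transformer weights to int8 in place to cut memory usage.
quantize_(pipe.transformer, int8_weight_only())
pipe.to("cuda")

# Optional: compiling the transformer can recover or improve throughput.
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune", fullgraph=True
)
```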

Gallery

CogVideoX-5B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video> </td> </tr> </table>

CogVideoX-2B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video> </td> </tr> </table>

To view the prompts used for the gallery videos, please click here.

Model Introduction

CogVideoX is an open-source version of the video generation model originating from QingYing. The table below displays the list of video generation models we currently offer, along with their foundational information.

<table style="border-collapse: collapse; width: 100%;"> <tr> <th style="text-align: center;">Model Name</th> <th style="text-align: center;">CogVideoX-2B</th> <th style="text-align: center;">CogVideoX-5B</th> <th style="text-align: center;">CogVideoX-5B-I2V</th> </tr> <tr> <td style="text-align: center;">Model Description</td> <td style="text-align: center;">Entry-level model, balancing compatibility. Low cost for running and secondary development.</td> <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td> <td style="text-align: center;">CogVideoX-5B image-to-video version.</td> </tr> <tr> <td style="text-align: center;">Inference Precision</td> <td style="text-align: center;"><b>FP16*(recommended)</b>, BF16, FP32, FP8*, INT8, not supported: INT4</td> <td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td> </tr> <tr> <td style="text-align: center;">Single GPU Memory Usage<br></td> <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB* </b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td> <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB* </b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td> </tr> <tr> <td style="text-align: center;">Multi-GPU Inference Memory Usage</td> <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td> <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td> </tr> <tr> <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td> <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td> <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td> </tr> <tr> <td style="text-align: center;">Fine-tuning Precision</td> <td style="text-align: center;"><b>FP16</b></td> <td colspan="2" style="text-align: center;"><b>BF16</b></td> </tr> <tr> <td style="text-align: center;">Fine-tuning Memory Usage</td> <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td> <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td> <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75GB (bs=1, SFT, 16GPU)<br></td> </tr> <tr> <td style="text-align: center;">Prompt Language</td> <td colspan="3" style="text-align: center;">English*</td> </tr> <tr> <td style="text-align: center;">Maximum Prompt Length</td> <td colspan="3" style="text-align: center;">226 Tokens</td> </tr> <tr> <td style="text-align: center;">Video Length</td> <td colspan="3" style="text-align: center;">6 Seconds</td> </tr> <tr> <td style="text-align: center;">Frame Rate</td> <td colspan="3" style="text-align: center;">8 Frames / Second</td> </tr> <tr> <td style="text-align: center;">Video Resolution</td> <td colspan="3" style="text-align: center;">720 x 480, no support for other resolutions (including fine-tuning)</td> </tr> <tr> <td style="text-align: center;">Position Encoding</td> <td style="text-align: center;">3d_sincos_pos_embed</td> <td style="text-align: center;">3d_sincos_pos_embed</td> <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td> </tr> <tr> <td style="text-align: center;">Download Link (Diffusers)</td> <td style="text-align: center;"><a 
href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td> <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td> <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td> </tr> <tr> <td style="text-align: center;">Download Link (SAT)</td> <td colspan="3" style="text-align: center;"><a href="./sat/README.md">SAT</a></td> </tr> </table>
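As a minimal sketch, the I2V checkpoint can be run through Diffusers' CogVideoXImageToVideoPipeline; the input image path and prompt below are placeholders:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()

# Placeholder: replace with your own 720x480 conditioning frame.
image = load_image("path/to/first_frame.png")

video = pipe(
    image=image,
    prompt="The scene slowly comes to life as the camera pans right.",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "i2v_output.mp4", fps=8)
```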

Data Explanation

The diffusers memory figures marked with * in the table above were measured with the library's memory optimizations enabled:

pipe.enable_sequential_cpu_offload()  # offload submodules to the CPU between uses
pipe.vae.enable_slicing()             # decode the latent video in slices
pipe.vae.enable_tiling()              # decode large frames in tiles

Note that enable_sequential_cpu_offload() trades speed for memory, since model components are moved to the GPU one at a time.

Friendly Links

We warmly welcome contributions from the community and actively contribute to the open-source ecosystem ourselves. The following works have already been adapted for CogVideoX, and we invite everyone to use them:

Project Structure

This open-source repository will help developers quickly get started with the basic usage and fine-tuning examples of the CogVideoX model.

Quick Start with Colab

Here we provide three projects that can be run directly on free Colab T4 instances:

Inference

<div style="text-align: center;"> <img src="resources/web_demo.png" style="width: 100%; height: auto;" /> </div>

Finetune

SAT

Tools

This folder contains tools for model conversion, caption generation, and more.

CogVideo(ICLR'23)

The official repository for the paper CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers is on the CogVideo branch.

CogVideo is able to generate relatively high-frame-rate videos. A 4-second clip of 32 frames is shown below.

High-frame-rate sample

<div align="center"> <video src="https://github.com/user-attachments/assets/2fa19651-e925-4a2a-b8d6-b3f216d490ba" width="80%" controls autoplay></video> </div>

The demo for CogVideo is at https://models.aminer.cn/cogvideo, where you can try text-to-video generation hands-on. Note that the original input is in Chinese.

Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}

We welcome your contributions! You can click here for more information.

License Agreement

The code in this repository is released under the Apache 2.0 License.

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.

The CogVideoX-5B model (Transformers module, including I2V and T2V) is released under the CogVideoX LICENSE.