
# CogVideoX Factory 🧪

Read in Chinese (中文阅读)

Fine-tune the CogVideoX family of video models for custom video generation under 24 GB of GPU memory ⚡️📼

<table align="center"> <tr> <td align="center"><video src="https://github.com/user-attachments/assets/aad07161-87cb-4784-9e6b-16d06581e3e5">Your browser does not support the video tag.</video></td> </tr> </table>

## Quickstart

Clone the repository and install the requirements with `pip install -r requirements.txt`. Then install `diffusers` from source with `pip install git+https://github.com/huggingface/diffusers`.

Then download a dataset:

```bash
# install `huggingface_hub` first: pip install huggingface_hub
huggingface-cli download \
  --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset \
  --local-dir video-dataset-disney
```
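
Equivalently, the same download can be done from Python with `huggingface_hub`'s `snapshot_download`:

```python
from huggingface_hub import snapshot_download

# Download the example dataset to the same local directory as the CLI call above.
snapshot_download(
    repo_id="Wild-Heart/Disney-VideoGeneration-Dataset",
    repo_type="dataset",
    local_dir="video-dataset-disney",
)
```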

Then launch LoRA fine-tuning for text-to-video (adjusting the hyperparameters, dataset root directory, and other configuration options as needed):

```bash
# For LoRA finetuning of the text-to-video CogVideoX models
./train_text_to_video_lora.sh

# For full finetuning of the text-to-video CogVideoX models
./train_text_to_video_sft.sh

# For LoRA finetuning of the image-to-video CogVideoX models
./train_image_to_video_lora.sh
```

Assuming your LoRA is saved and pushed to the HF Hub as `my-awesome-name/my-awesome-lora`, we can now use the finetuned model for inference:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="cogvideox-lora")
pipe.set_adapters(["cogvideox-lora"], [1.0])

video = pipe("<my-awesome-prompt>").frames[0]
export_to_video(video, "output.mp4", fps=8)
```

For image-to-video LoRAs trained with multiresolution videos, one must also add the following lines (see this issue for more details). The I2V model's positional embeddings are learned at a fixed base resolution, so they have to be deleted and disabled for multiresolution inputs to work:

```python
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

# ... load the LoRA weights and set adapters as shown above ...

# Remove the learned positional embeddings and disable their use
del pipe.transformer.patch_embed.pos_embedding
pipe.transformer.patch_embed.use_learned_positional_embeddings = False
pipe.transformer.config.use_learned_positional_embeddings = False
```

You can also check whether your LoRA is correctly mounted.
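
For a quick programmatic check, here is a minimal sketch (not part of the repository) that verifies PEFT actually injected LoRA layers after `load_lora_weights()`:

```python
# PEFT-injected submodules have names containing "lora" (e.g. lora_A, lora_B),
# so finding none means the LoRA was not mounted.
lora_modules = [
    name for name, _ in pipe.transformer.named_modules() if "lora" in name.lower()
]
print(f"Found {len(lora_modules)} LoRA submodules in the transformer")
assert lora_modules, "LoRA weights do not appear to be mounted"
```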

The sections below detail more of the options explored in this repository. They all aim to make fine-tuning video models as accessible as possible by keeping memory requirements low.

## Prepare Dataset and Training

Before starting training, please check whether the dataset has been prepared according to the dataset specifications. We provide training scripts suitable for text-to-video and image-to-video generation, compatible with the CogVideoX model family. Training can be started using the `train*.sh` scripts, depending on the task. Let's take LoRA fine-tuning for text-to-video as an example.
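
As a hedged pre-flight check, assuming the two-file `prompts.txt`/`videos.txt` layout used by the example Disney dataset (one caption and one relative video path per line):

```python
from pathlib import Path

root = Path("video-dataset-disney")
prompts = (root / "prompts.txt").read_text().splitlines()
videos = (root / "videos.txt").read_text().splitlines()

# Every video needs exactly one caption, and every listed file must exist.
assert len(prompts) == len(videos), "prompt/video counts do not match"
missing = [v for v in videos if not (root / v).exists()]
assert not missing, f"missing video files: {missing[:5]}"
print(f"OK: {len(videos)} video-caption pairs")
```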

> [!NOTE]
> Training scripts are untested on MPS, so performance and memory requirements can differ widely compared to the CUDA reports below.

### Memory requirements

All memory figures below are reported in GB.

<table align="center"> <tr> <td align="center" colspan="2"><b>CogVideoX LoRA Finetuning</b></td> </tr> <tr> <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-2b">THUDM/CogVideoX-2b</a></td> <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-5b">THUDM/CogVideoX-5b</a></td> </tr> <tr> <td align="center"><img src="assets/lora_2b.png" /></td> <td align="center"><img src="assets/lora_5b.png" /></td> </tr> <tr> <td align="center" colspan="2"><b>CogVideoX Full Finetuning</b></td> </tr> <tr> <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-2b">THUDM/CogVideoX-2b</a></td> <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-5b">THUDM/CogVideoX-5b</a></td> </tr> <tr> <td align="center"><img src="assets/sft_2b.png" /></td> <td align="center"><img src="assets/sft_5b.png" /></td> </tr> </table>

Supported and verified memory optimizations for training include:

- gradient checkpointing
- 8-bit optimizers from `bitsandbytes` (AdamW)
- `CPUOffloadOptimizer` (with optional gradient offloading)
- DeepSpeed (AdamW with CPU/parameter offloading)

Each of these is benchmarked in the collapsible sections below; a construction sketch follows this list.
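
For orientation, here is a hedged sketch of how two of these optimizers are typically constructed; the training scripts wire them up through CLI flags, so exact arguments may differ, and `transformer` stands in for the CogVideoX model being trained:

```python
import torch
import bitsandbytes as bnb
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

# 8-bit AdamW from bitsandbytes: optimizer states are stored in 8 bits.
optimizer = bnb.optim.AdamW8bit(transformer.parameters(), lr=1e-4)

# torchao's CPUOffloadOptimizer: keeps optimizer states on CPU and, with
# offload_gradients=True, offloads gradients as well ("gradient offloading").
optimizer = CPUOffloadOptimizer(
    transformer.parameters(), torch.optim.AdamW, offload_gradients=True, lr=1e-4
)
```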

> [!IMPORTANT]
> The memory requirements are reported after running `training/prepare_dataset.py`, which converts the videos and captions to latents and embeddings. During training we load those latents and embeddings directly, so neither the VAE nor the T5 text encoder is required. However, if you perform validation/testing, these must be loaded, which increases the amount of required memory. Skipping validation/testing therefore saves a significant amount of memory, letting smaller-VRAM GPUs focus solely on training.
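
As a conceptual sketch of that precomputation step (a hypothetical helper, not the actual `training/prepare_dataset.py`):

```python
import torch

@torch.no_grad()
def precompute_example(video_pixels, prompt, vae, tokenizer, text_encoder, out_path):
    """Hypothetical helper: encode a video and caption once, save the tensors."""
    # Video -> latents via the VAE (never needed again during training).
    latents = vae.encode(video_pixels).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    # Caption -> embeddings via the T5 text encoder.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_embeds = text_encoder(input_ids)[0]
    torch.save({"latents": latents.cpu(), "prompt_embeds": prompt_embeds.cpu()}, out_path)
```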

If you choose to run validation/testing, you can save some memory on lower-VRAM GPUs by specifying `--enable_model_cpu_offload`.
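
That flag maps onto diffusers' model-level CPU offloading; for reference, the same optimization at inference time looks like this:

```python
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
# Note: no .to("cuda") here; each component is moved to the GPU only while
# it is actually running, then returned to CPU.
pipe.enable_model_cpu_offload()
```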

#### LoRA finetuning

> [!NOTE]
> The memory requirements for image-to-video LoRA finetuning are similar to those of text-to-video on THUDM/CogVideoX-5b, so they haven't been reported explicitly.

Additionally, to prepare test images for I2V finetuning, you can either generate them on the fly by modifying the script, extract frames from your training data with `ffmpeg -i input.mp4 -frames:v 1 frame.png`, or provide a URL to a valid and accessible image.
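
A Python equivalent of the ffmpeg one-liner, assuming `opencv-python` is available:

```python
import cv2

# Grab the first frame of a training video to use as an I2V test image.
cap = cv2.VideoCapture("input.mp4")
ok, frame = cap.read()
cap.release()
assert ok, "could not read a frame from input.mp4"
cv2.imwrite("frame.png", frame)
```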

<details> <summary> AdamW </summary>

Note: Trying to run CogVideoX-5b without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.
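
(For reference, the gradient checkpointing toggled in these runs corresponds to the standard diffusers model switch; `transformer` again stands in for the model being trained.)

```python
# Recompute activations during backward instead of storing them,
# trading extra compute for a large cut in activation memory.
transformer.enable_gradient_checkpointing()
```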

With `train_batch_size = 1`:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.764 | 46.918 | 24.234 |
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.121 | 24.234 |
| THUDM/CogVideoX-2b | 64 | False | 13.035 | 44.314 | 47.469 | 24.469 |
| THUDM/CogVideoX-2b | 64 | True | 13.036 | 13.035 | 21.564 | 24.500 |
| THUDM/CogVideoX-2b | 256 | False | 13.095 | 45.826 | 48.990 | 25.543 |
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 13.095 | 22.344 | 25.537 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.746 | 38.123 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.818 | 30.338 | 38.738 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 22.119 | 31.939 | 41.537 |

With `train_batch_size = 4`:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.803 | 21.814 | 24.322 |
| THUDM/CogVideoX-2b | 64 | True | 13.035 | 22.254 | 22.254 | 24.572 |
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.033 | 25.574 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.492 | 46.492 | 38.197 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 47.805 | 47.805 | 39.365 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.268 | 47.332 | 41.008 |
</details> <details> <summary> AdamW (8-bit bitsandbytes) </summary>

Note: Trying to run CogVideoX-5b without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.

With `train_batch_size = 1`:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.732 | 46.887 | 24.195 |
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.430 | 24.195 |
| THUDM/CogVideoX-2b | 64 | False | 13.035 | 44.004 | 47.158 | 24.369 |
| THUDM/CogVideoX-2b | 64 | True | 13.035 | 13.035 | 21.297 | 24.357 |
| THUDM/CogVideoX-2b | 256 | False | 13.035 | 45.291 | 48.455 | 24.836 |
| THUDM/CogVideoX-2b | 256 | True | 13.035 | 13.035 | 21.625 | 24.869 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.602 | 38.049 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.818 | 29.359 | 38.520 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 21.352 | 30.727 | 39.596 |

With `train_batch_size = 4`:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.734 | 21.775 | 24.281 |
| THUDM/CogVideoX-2b | 64 | True | 13.036 | 21.941 | 21.941 | 24.445 |
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.266 | 24.943 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.320 | 46.326 | 38.104 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 46.820 | 46.820 | 38.588 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.920 | 47.980 | 40.002 |
</details> <details> <summary> AdamW + CPUOffloadOptimizer (with gradient offloading) </summary>

Note: Trying to run CogVideoX-5b without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.

With `train_batch_size = 1`:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.705 | 46.859 | 24.180 |
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.395 | 24.180 |
| THUDM/CogVideoX-2b | 64 | False | 13.035 | 43.916 | 47.070 | 24.234 |
| THUDM/CogVideoX-2b | 64 | True | 13.035 | 13.035 | 20.887 | 24.266 |
| THUDM/CogVideoX-2b | 256 | False | 13.095 | 44.947 | 48.111 | 24.607 |
| THUDM/CogVideoX-2b | 256 | True | 13.095 | 13.095 | 21.391 | 24.635 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.533 | 38.002 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.006 | 29.107 | 38.785 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 20.771 | 30.078 | 39.559 |

With `train_batch_size = 4`:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.709 | 21.762 | 24.254 |
| THUDM/CogVideoX-2b | 64 | True | 13.035 | 21.844 | 21.855 | 24.338 |
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.031 | 24.709 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.262 | 46.297 | 38.400 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 46.561 | 46.574 | 38.840 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.268 | 47.332 | 39.623 |
</details> <details> <summary> DeepSpeed (AdamW + CPU/Parameter offloading) </summary>

Note: Results are reported with gradient_checkpointing enabled, running on a 2x A100.

With `train_batch_size = 1`:

| model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 13.141 | 13.141 | 21.070 | 24.602 |
| THUDM/CogVideoX-5b | 20.170 | 20.170 | 28.662 | 38.957 |

With `train_batch_size = 4`:

| model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 13.141 | 19.854 | 20.836 | 24.709 |
| THUDM/CogVideoX-5b | 20.170 | 40.635 | 40.699 | 39.027 |
</details>

#### Full finetuning

> [!NOTE]
> The memory requirements for image-to-video full finetuning are similar to those of text-to-video on THUDM/CogVideoX-5b, so they haven't been reported explicitly.

Additionally, to prepare test images for I2V finetuning, you can either generate them on the fly by modifying the script, extract frames from your training data with `ffmpeg -i input.mp4 -frames:v 1 frame.png` (or the OpenCV snippet shown earlier), or provide a URL to a valid and accessible image.

> [!NOTE]
> Trying to run full finetuning without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.

<details> <summary> AdamW </summary>

With `train_batch_size = 1`:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 33.934 | 43.848 | 37.520 |
| THUDM/CogVideoX-5b | True | 30.061 | OOM | OOM | OOM |

With `train_batch_size = 4`:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 38.281 | 48.341 | 37.544 |
| THUDM/CogVideoX-5b | True | 30.061 | OOM | OOM | OOM |
</details> <details> <summary> AdamW (8-bit bitsandbytes) </summary>

With `train_batch_size = 1`:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 16.447 | 27.555 | 27.156 |
| THUDM/CogVideoX-5b | True | 30.061 | 52.826 | 58.570 | 49.541 |

With `train_batch_size = 4`:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 27.930 | 27.990 | 27.326 |
| THUDM/CogVideoX-5b | True | 16.396 | 66.648 | 66.705 | 48.828 |
</details> <details> <summary> AdamW + CPUOffloadOptimizer (with gradient offloading) </summary>

With `train_batch_size = 1`:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 16.396 | 26.100 | 23.832 |
| THUDM/CogVideoX-5b | True | 30.061 | 39.359 | 48.307 | 37.947 |

With `train_batch_size = 4`:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 27.916 | 27.975 | 23.936 |
| THUDM/CogVideoX-5b | True | 30.061 | 66.607 | 66.668 | 38.061 |
</details> <details> <summary> DeepSpeed (AdamW + CPU/Parameter offloading) </summary>

Note: Results are reported with gradient_checkpointing enabled, running on a 2x A100.

With `train_batch_size = 1`:

| model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 13.111 | 13.111 | 20.328 | 23.867 |
| THUDM/CogVideoX-5b | 19.762 | 19.998 | 27.697 | 38.018 |

With `train_batch_size = 4`:

| model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 13.111 | 21.188 | 21.254 | 23.869 |
| THUDM/CogVideoX-5b | 19.762 | 43.465 | 43.531 | 38.082 |
</details>

> [!NOTE]
> To learn more about the memory-optimization techniques used in this repository, check out the talk linked below:

<table align="center"> <tr> <td align="center"><a href="https://www.youtube.com/watch?v=UvRl4ansfCg"> Slaying OOMs with PyTorch</a></td> </tr> <tr> <td align="center"><img src="assets/slaying-ooms.png" style="width: 480px; height: 480px;"></td> </tr> </table>

## TODOs

> [!IMPORTANT]
> Since our goal is to make the scripts as memory-friendly as possible, we don't guarantee multi-GPU training.