
finetrainers 🧪

cogvideox-factory was renamed to finetrainers. If you're looking to train CogVideoX or Mochi with the legacy training scripts, please refer to this README instead. Everything in the training/ directory will eventually be moved to and supported under finetrainers.

FineTrainers is a work-in-progress library for training video models. The first priority is to support LoRA training for all models in Diffusers, and eventually other methods such as ControlNets, Control-LoRAs, and distillation.

<table align="center"> <tr> <td align="center"><video src="https://github.com/user-attachments/assets/aad07161-87cb-4784-9e6b-16d06581e3e5">Your browser does not support the video tag.</video></td> </tr> </table>

News

Quickstart

Clone the repository and install the requirements with pip install -r requirements.txt. Then install diffusers from source with pip install git+https://github.com/huggingface/diffusers.

Then download a dataset:

# install `huggingface_hub`
huggingface-cli download \
  --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset \
  --local-dir video-dataset-disney
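
The training scripts below point CAPTION_COLUMN and VIDEO_COLUMN at prompts.txt and videos.txt inside the dataset directory. The following is a minimal sanity check to run before training. It assumes the two files are line-aligned (line i of prompts.txt captions the video listed on line i of videos.txt) and that video paths are relative to the dataset root. `check_dataset` is a hypothetical helper for illustration, not part of finetrainers.

```python
from pathlib import Path


def check_dataset(data_root, caption_file="prompts.txt", video_file="videos.txt"):
    """Return (caption, video_path) pairs, raising if the dataset looks inconsistent.

    Assumption: one caption per line and one relative video path per line.
    """
    root = Path(data_root)
    captions = (root / caption_file).read_text(encoding="utf-8").splitlines()
    videos = (root / video_file).read_text(encoding="utf-8").splitlines()

    # Every video needs exactly one caption.
    if len(captions) != len(videos):
        raise ValueError(
            f"{caption_file} has {len(captions)} lines but {video_file} has {len(videos)}"
        )

    # Every listed video must exist on disk.
    missing = [v for v in videos if not (root / v).is_file()]
    if missing:
        raise FileNotFoundError(f"{len(missing)} listed videos not found, e.g. {missing[0]}")

    return list(zip(captions, videos))
```

Calling `check_dataset("video-dataset-disney")` after the download either returns the caption/video pairs or fails early with a message describing the mismatch.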

Then launch LoRA fine-tuning. For CogVideoX and Mochi, refer to this and this.

<details> <summary> LTX Video </summary>

Training:

#!/bin/bash

# export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
# export TORCHDYNAMO_VERBOSE=1
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/raid/aryan/video-dataset-disney"
CAPTION_COLUMN="prompts.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/path/to/output/directory/ltx-video/ltxv_disney"

# Model arguments
model_cmd="--model_name ltx_video \
  --pretrained_model_name_or_path Lightricks/LTX-Video"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token BW_STYLE \
  --video_resolution_buckets 49x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 0"

# Diffusion arguments
diffusion_cmd="--flow_resolution_shifting"

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 1 \
  --train_steps 1200 \
  --rank 128 \
  --lora_alpha 128 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 500 \
  --checkpointing_limit 2 \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 3e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions.@@@49x512x768:::A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@49x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-ltxv \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/uncompiled_2.yaml --gpu_ids $GPU_IDS train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
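
The --validation_prompts value in the script packs several prompts into one string: entries are separated by `:::`, and each entry ends with `@@@` followed by a resolution such as 49x512x768. The sketch below splits such a string so you can check it before launching a run. It assumes the resolution is ordered frames x height x width, matching --video_resolution_buckets. `parse_validation_prompts` is a hypothetical helper, not the parser finetrainers uses.

```python
def parse_validation_prompts(spec):
    """Split 'prompt@@@FxHxW:::prompt@@@FxHxW' into a list of dicts.

    Assumption: the resolution is ordered frames x height x width.
    """
    entries = []
    for item in spec.split(":::"):
        # rpartition splits on the last '@@@', so the prompt text is kept whole.
        prompt, separator, resolution = item.rpartition("@@@")
        if not separator:
            raise ValueError(f"missing '@@@<frames>x<height>x<width>' in: {item!r}")
        num_frames, height, width = (int(part) for part in resolution.split("x"))
        entries.append(
            {
                "prompt": prompt.strip(),
                "num_frames": num_frames,
                "height": height,
                "width": width,
            }
        )
    return entries
```

For example, `parse_validation_prompts("afkx A cat@@@49x512x768:::A dog@@@61x512x768")` yields two entries, the first with 49 frames at 512x768.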

Inference:

Assuming your LoRA is saved and pushed to the HF Hub, and named my-awesome-name/my-awesome-lora, we can now use the finetuned model for inference:

import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="ltxv-lora")
pipe.set_adapters(["ltxv-lora"], [0.75])

video = pipe("<my-awesome-prompt>").frames[0]
export_to_video(video, "output.mp4", fps=8)
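
How strongly the adapter affects the output depends on both the training-time --rank and --lora_alpha values and the weight passed to set_adapters. The arithmetic below is a rough sketch. It assumes the common LoRA convention in which the low-rank update is scaled by lora_alpha / rank and the adapter weight multiplies that factor. `effective_lora_scale` is a hypothetical helper, not a Diffusers API.

```python
def effective_lora_scale(rank, lora_alpha, adapter_weight=1.0):
    """Scale applied to the low-rank update, assuming scaling = alpha / rank."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return adapter_weight * lora_alpha / rank


# Values from the LTX Video example: --rank 128, --lora_alpha 128, adapter weight 0.75.
print(effective_lora_scale(rank=128, lora_alpha=128, adapter_weight=0.75))  # 0.75
```

Under this assumption, training with lora_alpha equal to rank means the adapter weight at inference is the only scaling factor.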
</details> <details> <summary> Hunyuan Video </summary>

Training:

#!/bin/bash

# export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
# export TORCHDYNAMO_VERBOSE=1
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1,2,3,4,5,6,7"

DATA_ROOT="/path/to/dataset"
CAPTION_COLUMN="prompts.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/path/to/models/hunyuan-video/hunyuan-video-loras/hunyuan-video_cakify_500_3e-5_constant_with_warmup"

# Model arguments
model_cmd="--model_name hunyuan_video \
  --pretrained_model_name_or_path tencent/HunyuanVideo \
  --revision refs/pr/18"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token afkx \
  --video_resolution_buckets 17x512x768 49x512x768 61x512x768 129x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 0"

# Diffusion arguments
diffusion_cmd=""

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 1 \
  --train_steps 500 \
  --rank 128 \
  --lora_alpha 128 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 500 \
  --checkpointing_limit 2 \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 2e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A baker carefully cuts a green bell pepper cake on a white plate against a bright yellow background, followed by a strawberry cake with a similar slice of cake being cut before the interior of the bell pepper cake is revealed with the surrounding cake-to-object sequence.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@97x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. 
A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@129x512x768:::A person with gloved hands carefully cuts a cake shaped like a Skittles bottle, beginning with a precise incision at the lid, followed by careful sequential cuts around the neck, eventually detaching the lid from the body, revealing the chocolate interior of the cake while showcasing the layered design's detail.@@@61x512x768:::afkx A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@61x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/uncompiled_8.yaml --gpu_ids $GPU_IDS train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"

Inference:

Assuming your LoRA is saved and pushed to the HF Hub, and named my-awesome-name/my-awesome-lora, we can now use the finetuned model for inference:

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16)
pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="hunyuanvideo-lora")
pipe.set_adapters(["hunyuanvideo-lora"], [0.6])
pipe.vae.enable_tiling()
pipe.to("cuda")

output = pipe(
    prompt="A cat walks on the grass, realistic",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(output, "output.mp4", fps=15)
</details>

If you would like to use a custom dataset, refer to the dataset preparation guide here.

Memory requirements

<table align="center"> <tr> <td align="center" colspan="2"><b>CogVideoX LoRA Finetuning</b></td> </tr> <tr> <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-2b">THUDM/CogVideoX-2b</a></td> <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-5b">THUDM/CogVideoX-5b</a></td> </tr> <tr> <td align="center"><img src="assets/lora_2b.png" /></td> <td align="center"><img src="assets/lora_5b.png" /></td> </tr> <tr> <td align="center" colspan="2"><b>CogVideoX Full Finetuning</b></td> </tr> <tr> <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-2b">THUDM/CogVideoX-2b</a></td> <td align="center"><a href="https://huggingface.co/THUDM/CogVideoX-5b">THUDM/CogVideoX-5b</a></td> </tr> <tr> <td align="center"><img src="assets/sft_2b.png" /></td> <td align="center"><img src="assets/sft_5b.png" /></td> </tr> </table>

Supported and verified memory optimizations for training include:

[!IMPORTANT] The memory requirements are reported after running training/prepare_dataset.py, which converts the videos and captions to latents and embeddings. During training, the latents and embeddings are loaded directly, so the VAE and the T5 text encoder are not required. However, if you perform validation/testing, these must be loaded, which increases the amount of memory required. Skipping validation/testing saves a significant amount of memory, which helps if you are training on GPUs with less VRAM.

If you choose to run validation/testing, you can save some memory on lower VRAM GPUs by specifying --enable_model_cpu_offload.

LoRA finetuning

[!NOTE] The memory requirements for image-to-video LoRA finetuning are similar to those of text-to-video on THUDM/CogVideoX-5b, so they haven't been reported separately.

Additionally, to prepare test images for I2V finetuning, you can generate them on the fly by modifying the script, extract frames from your training data using ffmpeg -i input.mp4 -frames:v 1 frame.png, or provide a URL to a valid and accessible image.
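
To extract a first frame from many training videos, the ffmpeg command above can be wrapped in a small script. This is a minimal sketch that assumes ffmpeg is on your PATH. `first_frame_command` and `extract_first_frame` are hypothetical helper names, not part of the training scripts.

```python
import subprocess


def first_frame_command(video_path, image_path):
    """Build the ffmpeg argument list that writes the first frame of a video."""
    return ["ffmpeg", "-i", str(video_path), "-frames:v", "1", str(image_path)]


def extract_first_frame(video_path, image_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(first_frame_command(video_path, image_path), check=True)
```

For example, `extract_first_frame("input.mp4", "frame.png")` runs the same command as shown above.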

<details> <summary> AdamW </summary>

Note: Trying to run CogVideoX-5b without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.

With train_batch_size = 1:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.764 | 46.918 | 24.234 |
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.121 | 24.234 |
| THUDM/CogVideoX-2b | 64 | False | 13.035 | 44.314 | 47.469 | 24.469 |
| THUDM/CogVideoX-2b | 64 | True | 13.036 | 13.035 | 21.564 | 24.500 |
| THUDM/CogVideoX-2b | 256 | False | 13.095 | 45.826 | 48.990 | 25.543 |
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 13.095 | 22.344 | 25.537 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.746 | 38.123 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.818 | 30.338 | 38.738 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 22.119 | 31.939 | 41.537 |

With train_batch_size = 4:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.803 | 21.814 | 24.322 |
| THUDM/CogVideoX-2b | 64 | True | 13.035 | 22.254 | 22.254 | 24.572 |
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.033 | 25.574 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.492 | 46.492 | 38.197 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 47.805 | 47.805 | 39.365 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.268 | 47.332 | 41.008 |
</details> <details> <summary> AdamW (8-bit bitsandbytes) </summary>

Note: Trying to run CogVideoX-5b without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.

With train_batch_size = 1:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.732 | 46.887 | 24.195 |
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.430 | 24.195 |
| THUDM/CogVideoX-2b | 64 | False | 13.035 | 44.004 | 47.158 | 24.369 |
| THUDM/CogVideoX-2b | 64 | True | 13.035 | 13.035 | 21.297 | 24.357 |
| THUDM/CogVideoX-2b | 256 | False | 13.035 | 45.291 | 48.455 | 24.836 |
| THUDM/CogVideoX-2b | 256 | True | 13.035 | 13.035 | 21.625 | 24.869 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.602 | 38.049 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.818 | 29.359 | 38.520 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 21.352 | 30.727 | 39.596 |

With train_batch_size = 4:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.734 | 21.775 | 24.281 |
| THUDM/CogVideoX-2b | 64 | True | 13.036 | 21.941 | 21.941 | 24.445 |
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.266 | 24.943 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.320 | 46.326 | 38.104 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 46.820 | 46.820 | 38.588 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.920 | 47.980 | 40.002 |
</details> <details> <summary> AdamW + CPUOffloadOptimizer (with gradient offloading) </summary>

Note: Trying to run CogVideoX-5b without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.

With train_batch_size = 1:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.705 | 46.859 | 24.180 |
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.395 | 24.180 |
| THUDM/CogVideoX-2b | 64 | False | 13.035 | 43.916 | 47.070 | 24.234 |
| THUDM/CogVideoX-2b | 64 | True | 13.035 | 13.035 | 20.887 | 24.266 |
| THUDM/CogVideoX-2b | 256 | False | 13.095 | 44.947 | 48.111 | 24.607 |
| THUDM/CogVideoX-2b | 256 | True | 13.095 | 13.095 | 21.391 | 24.635 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.533 | 38.002 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.006 | 29.107 | 38.785 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 20.771 | 30.078 | 39.559 |

With train_batch_size = 4:

| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.709 | 21.762 | 24.254 |
| THUDM/CogVideoX-2b | 64 | True | 13.035 | 21.844 | 21.855 | 24.338 |
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.031 | 24.709 |
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.262 | 46.297 | 38.400 |
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 46.561 | 46.574 | 38.840 |
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.268 | 47.332 | 39.623 |
</details> <details> <summary> DeepSpeed (AdamW + CPU/Parameter offloading) </summary>

Note: Results are reported with gradient_checkpointing enabled, running on 2x A100.

With train_batch_size = 1:

| model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 13.141 | 13.141 | 21.070 | 24.602 |
| THUDM/CogVideoX-5b | 20.170 | 20.170 | 28.662 | 38.957 |

With train_batch_size = 4:

| model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 13.141 | 19.854 | 20.836 | 24.709 |
| THUDM/CogVideoX-5b | 20.170 | 40.635 | 40.699 | 39.027 |
</details>
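
One way to use the LoRA tables above is to check which configurations fit a given VRAM budget. The sketch below hard-codes the AdamW rows for train_batch_size = 1 with gradient checkpointing enabled. It treats each column as the memory footprint at that stage, in GB, and takes the largest column as the requirement. That interpretation is an assumption, and `configs_within_budget` is a hypothetical helper.

```python
# Rows copied from the AdamW LoRA table (train_batch_size = 1, gradient_checkpointing = True).
# Columns: before_training, before_validation, after_validation, after_testing.
MEASUREMENTS = {
    ("THUDM/CogVideoX-2b", 16): (12.945, 12.945, 21.121, 24.234),
    ("THUDM/CogVideoX-2b", 64): (13.036, 13.035, 21.564, 24.500),
    ("THUDM/CogVideoX-2b", 256): (13.094, 13.095, 22.344, 25.537),
    ("THUDM/CogVideoX-5b", 16): (19.742, 19.742, 28.746, 38.123),
    ("THUDM/CogVideoX-5b", 64): (20.006, 20.818, 30.338, 38.738),
    ("THUDM/CogVideoX-5b", 256): (20.771, 22.119, 31.939, 41.537),
}


def configs_within_budget(budget, with_validation=True):
    """Return (model, lora_rank) pairs whose largest measurement fits the budget.

    With with_validation=False, only the two pre-validation columns are
    considered, reflecting the note that skipping validation/testing saves memory.
    """
    fitting = []
    for config, stages in MEASUREMENTS.items():
        considered = stages if with_validation else stages[:2]
        if max(considered) <= budget:
            fitting.append(config)
    return fitting


print(configs_within_budget(25.0))
print(len(configs_within_budget(24.0, with_validation=False)))
```

With these rows, a budget of 25.0 admits only the rank-16 and rank-64 CogVideoX-2b configurations when validation and testing are included, while all six configurations fit within 24.0 if validation and testing are skipped.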

Full finetuning

[!NOTE] The memory requirements for image-to-video full finetuning are similar to those of text-to-video on THUDM/CogVideoX-5b, so they haven't been reported separately.

Additionally, to prepare test images for I2V finetuning, you can generate them on the fly by modifying the script, extract frames from your training data using ffmpeg -i input.mp4 -frames:v 1 frame.png, or provide a URL to a valid and accessible image.

[!NOTE] Trying to run full finetuning without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.

<details> <summary> AdamW </summary>

With train_batch_size = 1:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 33.934 | 43.848 | 37.520 |
| THUDM/CogVideoX-5b | True | 30.061 | OOM | OOM | OOM |

With train_batch_size = 4:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 38.281 | 48.341 | 37.544 |
| THUDM/CogVideoX-5b | True | 30.061 | OOM | OOM | OOM |
</details> <details> <summary> AdamW (8-bit bitsandbytes) </summary>

With train_batch_size = 1:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 16.447 | 27.555 | 27.156 |
| THUDM/CogVideoX-5b | True | 30.061 | 52.826 | 58.570 | 49.541 |

With train_batch_size = 4:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 27.930 | 27.990 | 27.326 |
| THUDM/CogVideoX-5b | True | 16.396 | 66.648 | 66.705 | 48.828 |
</details> <details> <summary> AdamW + CPUOffloadOptimizer (with gradient offloading) </summary>

With train_batch_size = 1:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 16.396 | 26.100 | 23.832 |
| THUDM/CogVideoX-5b | True | 30.061 | 39.359 | 48.307 | 37.947 |

With train_batch_size = 4:

| model | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | True | 16.396 | 27.916 | 27.975 | 23.936 |
| THUDM/CogVideoX-5b | True | 30.061 | 66.607 | 66.668 | 38.061 |
</details> <details> <summary> DeepSpeed (AdamW + CPU/Parameter offloading) </summary>

Note: Results are reported with gradient_checkpointing enabled, running on 2x A100.

With train_batch_size = 1:

| model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 13.111 | 13.111 | 20.328 | 23.867 |
| THUDM/CogVideoX-5b | 19.762 | 19.998 | 27.697 | 38.018 |

With train_batch_size = 4:

| model | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
|:---:|:---:|:---:|:---:|:---:|
| THUDM/CogVideoX-2b | 13.111 | 21.188 | 21.254 | 23.869 |
| THUDM/CogVideoX-5b | 19.762 | 43.465 | 43.531 | 38.082 |
</details>

[!NOTE]

<table align="center"> <tr> <td align="center"><a href="https://www.youtube.com/watch?v=UvRl4ansfCg"> Slaying OOMs with PyTorch</a></td> </tr> <tr> <td align="center"><img src="assets/slaying-ooms.png" style="width: 480px; height: 480px;"></td> </tr> </table>