<p align="center" width="50%"> <img src="https://github.com/user-attachments/assets/38efb5bc-723e-4012-aebd-f55723c593fb" alt="VideoTuna" style="width: 75%; min-width: 450px; display: block; margin: auto; background-color: transparent;"> </p>
VideoTuna
<a href='https://github.com/user-attachments/assets/a48d57a3-4d89-482c-8181-e0bce4f750fd'><img src='https://badges.aleen42.com/src/wechat.svg'></a>
VideoTuna is a codebase for text-to-video applications.
- VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation (to the best of our knowledge).
- VideoTuna is the first repo that provides comprehensive pipelines for video generation, including pre-training, continuous training, post-training (alignment), and fine-tuning (to the best of our knowledge).
- The models in VideoTuna include both U-Net and DiT architectures for visual generation tasks.
- A new 3D video VAE and a controllable facial video generation model will be released soon.
Features
- All-in-one framework: Run inference with and fine-tune up-to-date video generation models.
- Pre-training: Build your own foundational text-to-video model.
- Continuous training: Keep improving your model with new data.
- Domain-specific fine-tuning: Adapt models to your specific scenario.
- Concept-specific fine-tuning: Teach your models unique concepts.
- Enhanced language understanding: Improve model comprehension through continuous training.
- Post-processing: Enhance the videos with a video-to-video enhancement model.
- Post-training/Human preference alignment: Post-train with RLHF for more attractive results.
Updates
- [2024-11-01] We made VideoTuna V0.1.0 public!
Demo
Model Inference and Comparison
3D Video VAE
The 3D video VAE from VideoTuna can accurately compress and reconstruct the input videos with fine details.
<table class="center"> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/0efcbf80-0074-4421-810f-79a1f1733ed3"><img src="https://github.com/user-attachments/assets/0efcbf80-0074-4421-810f-79a1f1733ed3" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/4adf29f2-d413-49b1-bccc-48adfd64a4da"><img src="https://github.com/user-attachments/assets/4adf29f2-d413-49b1-bccc-48adfd64a4da" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/508e2b10-487a-4850-a9d8-89fdbc13120a"><img src="https://github.com/user-attachments/assets/508e2b10-487a-4850-a9d8-89fdbc13120a" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/28029b2c-4922-46ee-88d3-ff8d577d2525"><img src="https://github.com/user-attachments/assets/28029b2c-4922-46ee-88d3-ff8d577d2525" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/51471c42-8a38-4f02-b29b-e34a5279753a"><img src="https://github.com/user-attachments/assets/51471c42-8a38-4f02-b29b-e34a5279753a" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/383f2120-5fed-4d9f-82d8-3de130d6bd65"><img src="https://github.com/user-attachments/assets/383f2120-5fed-4d9f-82d8-3de130d6bd65" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/28892f4c-54cc-4cea-91e1-8fe6bbc7a1a4"><img
src="https://github.com/user-attachments/assets/28892f4c-54cc-4cea-91e1-8fe6bbc7a1a4" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/d56d34e4-c4d5-4ed2-b2c5-fa1596714492"><img src="https://github.com/user-attachments/assets/d56d34e4-c4d5-4ed2-b2c5-fa1596714492" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/a0ffc2ca-c3e2-485f-b0ea-ead0d733cc8b"><img src="https://github.com/user-attachments/assets/a0ffc2ca-c3e2-485f-b0ea-ead0d733cc8b" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/1465ac70-caa9-42c7-874b-b01e13a78efb"><img src="https://github.com/user-attachments/assets/1465ac70-caa9-42c7-874b-b01e13a78efb" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/48e2eb49-265b-4eaf-b730-48fa4d7e5bfd"><img src="https://github.com/user-attachments/assets/48e2eb49-265b-4eaf-b730-48fa4d7e5bfd" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/24c893c5-865e-4af4-b003-17bda2ba4f59"><img src="https://github.com/user-attachments/assets/24c893c5-865e-4af4-b003-17bda2ba4f59" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/c18ed80f-3650-43a7-8438-7914de7e39ab"><img src="https://github.com/user-attachments/assets/c18ed80f-3650-43a7-8438-7914de7e39ab" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/89d38004-021b-4a4d-ab83-5627474f8928"><img src="https://github.com/user-attachments/assets/89d38004-021b-4a4d-ab83-5627474f8928" width="320"></a></td> 
</tr> </table>
Face domain
<table class="center"> <tr> <td><img src="https://github.com/user-attachments/assets/a1562c70-d97c-4324-bb11-47db2b83f443" width="240" alt="Image 1"></td> <td><img src="https://github.com/user-attachments/assets/3196810b-48d7-4024-b687-df2009774631" width="240" alt="Image 2"></td> <td><img src="https://github.com/user-attachments/assets/4e873f4c-ca56-4549-aaa1-ef24032ae96b" width="240" alt="Image 3"></td> </tr> <tr> <td style="text-align: center;">Input 1</td> <td style="text-align: center;">Input 2</td> <td style="text-align: center;">Input 3</td> </tr> </table> <table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/972dde7a-fa88-479a-a47e-71d3650b1826"><img src="https://github.com/user-attachments/assets/972dde7a-fa88-479a-a47e-71d3650b1826" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/3c216090-9ad1-4911-b990-179b45314d3e"><img src="https://github.com/user-attachments/assets/3c216090-9ad1-4911-b990-179b45314d3e" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/2e2fb78e-2f39-47bd-acaf-3cfbce83b162"><img src="https://github.com/user-attachments/assets/2e2fb78e-2f39-47bd-acaf-3cfbce83b162" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Anger</td> <td style="text-align:center;">Emotion: Disgust</td> <td style="text-align:center;">Emotion: Fear</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/f2f55021-4e0d-43a7-9f57-3c94b772f573"><img src="https://github.com/user-attachments/assets/f2f55021-4e0d-43a7-9f57-3c94b772f573" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/600a2f6c-7a8f-4304-bdc3-5f0d65d4fb83"><img src="https://github.com/user-attachments/assets/600a2f6c-7a8f-4304-bdc3-5f0d65d4fb83" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/8ad7c7d8-6492-4435-9436-168f90429be3"><img src="https://github.com/user-attachments/assets/8ad7c7d8-6492-4435-9436-168f90429be3" 
width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Happy</td> <td style="text-align:center;">Emotion: Sad</td> <td style="text-align:center;">Emotion: Surprise</td> </tr> </table> <table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/8ba84071-1978-4245-84b3-3a6fc3c9fa5a"><img src="https://github.com/user-attachments/assets/8ba84071-1978-4245-84b3-3a6fc3c9fa5a" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/d180c358-bdff-40b4-aa5a-fa4ec73d80b6"><img src="https://github.com/user-attachments/assets/d180c358-bdff-40b4-aa5a-fa4ec73d80b6" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/37004c20-3d0d-4cff-8b4a-f8a4d184de51"><img src="https://github.com/user-attachments/assets/37004c20-3d0d-4cff-8b4a-f8a4d184de51" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Anger</td> <td style="text-align:center;">Emotion: Disgust</td> <td style="text-align:center;">Emotion: Fear</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/025fe090-7d53-4a12-9498-1c814a0ee768"><img src="https://github.com/user-attachments/assets/025fe090-7d53-4a12-9498-1c814a0ee768" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/e8ddf3d1-57ea-4545-a004-66554c19f27b"><img src="https://github.com/user-attachments/assets/e8ddf3d1-57ea-4545-a004-66554c19f27b" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/519b3c87-baa6-408b-b3a5-a95eece9e19e"><img src="https://github.com/user-attachments/assets/519b3c87-baa6-408b-b3a5-a95eece9e19e" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Happy</td> <td style="text-align:center;">Emotion: Sad</td> <td style="text-align:center;">Emotion: Surprise</td> </tr> </table> <table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/f55a2b6c-ce10-4716-a001-f747c0da17a4"><img 
src="https://github.com/user-attachments/assets/f55a2b6c-ce10-4716-a001-f747c0da17a4" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/bb3620af-322d-43bc-9060-c7ce9fc32672"><img src="https://github.com/user-attachments/assets/bb3620af-322d-43bc-9060-c7ce9fc32672" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/46a39738-89c2-43fe-9f98-f0ac0d26e39b"><img src="https://github.com/user-attachments/assets/46a39738-89c2-43fe-9f98-f0ac0d26e39b" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Anger</td> <td style="text-align:center;">Emotion: Disgust</td> <td style="text-align:center;">Emotion: Fear</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/2d3d6e0d-2034-4341-8ca3-42e0cda2704f"><img src="https://github.com/user-attachments/assets/2d3d6e0d-2034-4341-8ca3-42e0cda2704f" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/331f23e1-f441-46f7-98a6-b25a684780f3"><img src="https://github.com/user-attachments/assets/331f23e1-f441-46f7-98a6-b25a684780f3" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/542f8535-634a-4f82-ae2a-39f988c6bc55"><img src="https://github.com/user-attachments/assets/542f8535-634a-4f82-ae2a-39f988c6bc55" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Happy</td> <td style="text-align:center;">Emotion: Sad</td> <td style="text-align:center;">Emotion: Surprise</td> </tr> </table>
Storytelling
<table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/27aee539-f2bf-467a-8da5-22f506713aa0"><img src="https://github.com/user-attachments/assets/27aee539-f2bf-467a-8da5-22f506713aa0" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/fef5b694-6e1f-42f6-a5a1-7f0b856f3678"><img src="https://github.com/user-attachments/assets/fef5b694-6e1f-42f6-a5a1-7f0b856f3678" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/91408eb8-264d-4d3f-9098-0bfe06022467"><img src="https://github.com/user-attachments/assets/91408eb8-264d-4d3f-9098-0bfe06022467" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/0b822858-da6f-4e6f-82dd-5f243e2feccc"><img src="https://github.com/user-attachments/assets/0b822858-da6f-4e6f-82dd-5f243e2feccc" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/75f69de2-9e5c-48d7-ae55-b28084772836"><img src="https://github.com/user-attachments/assets/75f69de2-9e5c-48d7-ae55-b28084772836" width="240"></a></td> </tr> <tr> <td style="text-align:center;">The picture shows a cozy room with a little girl telling her travel story to her teddybear beside the bed.</td> <td style="text-align:center;">As night falls, teddybear sits by the window, his eyes sparkling with longing for the distant place</td> <td style="text-align:center;">Teddybear was in a corner of the room, making a small backpack out of old cloth strips, with a map, a compass and dry food next to it.</td> <td style="text-align:center;">The first rays of sunlight in the morning came through the window, and teddybear quietly opened the door and embarked on his adventure.</td> <td style="text-align:center;">In the forest, the sun shines through the treetops, and teddybear moves among various animals and communicates with them.</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/3ae06dbf-f41f-4e7f-b384-fca3abf2c0aa"><img 
src="https://github.com/user-attachments/assets/3ae06dbf-f41f-4e7f-b384-fca3abf2c0aa" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/09a2fdb9-3e84-40a5-a729-075876b10412"><img src="https://github.com/user-attachments/assets/09a2fdb9-3e84-40a5-a729-075876b10412" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/a382aa30-895d-4476-8f86-b668fe153c16"><img src="https://github.com/user-attachments/assets/a382aa30-895d-4476-8f86-b668fe153c16" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/c906b20e-1576-4a96-9c77-b67a940dcce8"><img src="https://github.com/user-attachments/assets/c906b20e-1576-4a96-9c77-b67a940dcce8" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/04f553d2-8ac5-4085-8c1b-59bddf9deb41"><img src="https://github.com/user-attachments/assets/04f553d2-8ac5-4085-8c1b-59bddf9deb41" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Teddybear leaves his mark on the edge of a clear lake, surrounded by exotic flowers, and the picture is full of mystery and exploration.</td> <td style="text-align:center;">Teddybear climbs the rugged mountain road, the weather is changeable, but he is determined.</td> <td style="text-align:center;">The picture switches to the top of the mountain, where teddybear stands in the glow of the sunrise, with a magnificent mountain view in the background.</td> <td style="text-align:center;">On the way home, teddybear helps a wounded bird, the picture is warm and touching.</td> <td style="text-align:center;">Teddybear sits by the little girl's bed and tells her his adventure story, and the little girl is fascinated.</td> </tr> </table> <table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/14c5f6f1-7830-46fc-9ebd-8611f339d8ab"><img src="https://github.com/user-attachments/assets/14c5f6f1-7830-46fc-9ebd-8611f339d8ab" width="240"></a></td> <td><a 
href="https://github.com/user-attachments/assets/705eb5c4-b084-4752-a47b-ab20d206b9ee"><img src="https://github.com/user-attachments/assets/705eb5c4-b084-4752-a47b-ab20d206b9ee" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/39a814b6-8692-41f5-819d-eecfd6085b03"><img src="https://github.com/user-attachments/assets/39a814b6-8692-41f5-819d-eecfd6085b03" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/e2b7297d-d0bb-482e-9c16-f01f9a56fbb0"><img src="https://github.com/user-attachments/assets/e2b7297d-d0bb-482e-9c16-f01f9a56fbb0" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/3afc9aa8-92f5-4e27-bfed-950300645748"><img src="https://github.com/user-attachments/assets/3afc9aa8-92f5-4e27-bfed-950300645748" width="240"></a></td> </tr> <tr> <td style="text-align:center;">The scene shows a peaceful village, with moonlight shining on the roofs and streets, creating a peaceful atmosphere.</td> <td style="text-align:center;">cat sits by the window, her eyes twinkling in the night, reflecting her special connection with the moon and stars.</td> <td style="text-align:center;">Villagers gather in the center of the village for the annual Moon Festival celebration, with lanterns and colored lights adorning the night sky.</td> <td style="text-align:center;">cat feels the call of the moon, and her beard trembles with the excitement in her heart.</td> <td style="text-align:center;">cat quietly leaves her home in the night and embarks on a path illuminated by the silver moonlight.</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/b055930e-4f97-4872-bc07-2e9cb5641d7d"><img src="https://github.com/user-attachments/assets/b055930e-4f97-4872-bc07-2e9cb5641d7d" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/ece35515-295d-4655-a5c5-cca95fd11e92"><img src="https://github.com/user-attachments/assets/ece35515-295d-4655-a5c5-cca95fd11e92" 
width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/804e32f1-75ae-4a0b-b3c0-1d13c6e2d987"><img src="https://github.com/user-attachments/assets/804e32f1-75ae-4a0b-b3c0-1d13c6e2d987" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/e09d8134-7991-4585-a09a-b00283ab6a56"><img src="https://github.com/user-attachments/assets/e09d8134-7991-4585-a09a-b00283ab6a56" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/4c722b97-79d9-4281-8451-b31eb3393c3a"><img src="https://github.com/user-attachments/assets/4c722b97-79d9-4281-8451-b31eb3393c3a" width="240"></a></td> </tr> <tr> <td style="text-align:center;">A group of forest elves dance around glowing mushrooms, their costumes and movements full of magic and vitality.</td> <td style="text-align:center;">cat joins the celebration and dances with the elves, the picture is full of joy and freedom.</td> <td style="text-align:center;">A wise old owl reveals the secret power of the moon to cat and the light of the moon in the picture becomes brighter.</td> <td style="text-align:center;">cat closes her eyes in the moonlight, puts her hands together, and makes a wish, surrounded by the light of stars and the moon.</td> <td style="text-align:center;">cat feels the surge of power, and her eyes become more determined.</td> </tr> </table>
TODOs
- More demos and applications
- More functionalities such as control modules. (Suggestions are welcome!)
Information
Code Structure
VideoTuna/
├── assets       # images for the README
├── checkpoints  # put model checkpoints here
├── configs      # model and experiment configs
├── data         # data processing scripts and dataset files
├── docs         # documentation
├── eval         # evaluation scripts
├── inputs       # input examples for testing
├── scripts      # training and inference Python scripts
├── shscripts    # training and inference shell scripts
├── src          # model-related source code
├── tests        # testing scripts
└── tools        # utility scripts
Supported Models
T2V-Models | HxWxL | Checkpoints |
---|---|---|
HunyuanVideo | 720x1280x129 | Hugging Face |
Mochi | 480x848, 3s | Hugging Face |
CogVideoX-2B | 480x720x49 | Hugging Face |
CogVideoX-5B | 480x720x49 | Hugging Face |
Open-Sora 1.0 | 512x512x16 | Hugging Face |
Open-Sora 1.0 | 256x256x16 | Hugging Face |
Open-Sora 1.0 | 256x256x16 | Hugging Face |
VideoCrafter2 | 320x512x16 | Hugging Face |
VideoCrafter1 | 576x1024x16 | Hugging Face |
VideoCrafter1 | 320x512x16 | Hugging Face |
I2V-Models | HxWxL | Checkpoints |
---|---|---|
CogVideoX-5B-I2V | 480x720x49 | Hugging Face |
DynamiCrafter | 576x1024x16 | Hugging Face |
VideoCrafter1 | 320x512x16 | Hugging Face |
- Note: H: height; W: width; L: length (number of frames)
Please check docs/CHECKPOINTS.md to download all the model checkpoints.
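For scripting against the tables above, the HxWxL notation can be parsed mechanically. The following is a minimal sketch, not part of VideoTuna: `parse_hxwxl` and `VideoShape` are hypothetical names, and entries that do not follow the three-number form (such as the Mochi row) are rejected rather than guessed at.

```python
import re
from typing import NamedTuple


class VideoShape(NamedTuple):
    height: int
    width: int
    length: int  # number of frames


def parse_hxwxl(spec: str) -> VideoShape:
    """Parse a spec such as '480x720x49' into (height, width, length)."""
    # Accept both the ASCII 'x' and the multiplication sign as separators.
    parts = re.split(r"[x×]", spec.strip())
    if len(parts) != 3:
        raise ValueError(f"expected HxWxL, got {spec!r}")
    height, width, length = (int(p) for p in parts)
    return VideoShape(height, width, length)


if __name__ == "__main__":
    shape = parse_hxwxl("480x720x49")
    print(shape)
    print("pixels per clip:", shape.height * shape.width * shape.length)
```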
Get started
1. Prepare environment
conda create --name videotuna python=3.10 -y
conda activate videotuna
pip install -r requirements.txt
git clone https://github.com/JingyeChen/SwissArmyTransformer
pip install -e SwissArmyTransformer/
git clone https://github.com/tgxs002/HPSv2.git
cd ./HPSv2
pip install -e .
cd ..
conda config --add channels conda-forge
conda install ffmpeg
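After running the steps above, a quick sanity check can catch the two most visible mistakes: the wrong Python version and a missing ffmpeg binary. This is a minimal sketch under the assumption that it is run inside the activated videotuna environment; `check_environment` is a hypothetical helper, and it does not verify the pip-installed packages.

```python
import shutil
import sys


def check_environment(required=(3, 10)) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    found = tuple(sys.version_info[:2])
    if found != tuple(required):
        problems.append(
            f"Python {required[0]}.{required[1]} expected, found {found[0]}.{found[1]}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for line in issues or ["environment matches the setup steps"]:
        print(line)
```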
2. Prepare checkpoints
Please follow docs/CHECKPOINTS.md to download model checkpoints.
After downloading, place the model checkpoints according to the Checkpoint Structure layout.
3. Run inference with state-of-the-art T2V/I2V/T2I models
- Run a set of text-to-video models in one command:
bash tools/video_comparison/compare.sh
  - The default mode is to run all models, e.g.,
  inference_methods="videocrafter2;dynamicrafter;cogvideo-t2v;cogvideo-i2v;opensora"
  - To run specific models, modify the inference_methods variable in compare.sh and list the desired models separated by semicolons, using the identifiers exactly as they appear in that script.
  - Also specify the input directory via the input_dir variable. This directory should contain a prompts.txt file, where each line is one prompt for video generation. The default input_dir is inputs/t2v.
- Run a set of image-to-video models in one command:
bash tools/video_comparison/compare_i2v.sh
- To run inference with a specific model, use the corresponding command below:
Task | Model | Command | Length (#frames) | Resolution | Inference Time (s) | GPU Memory (GiB) |
---|---|---|---|---|---|---|
T2V | HunyuanVideo | bash shscripts/inference_hunyuan_diffusers.sh | 129 | 720x1280 | 1920 | 59.15 |
T2V | Mochi | bash shscripts/inference_mochi.sh | 84 | 480x848 | 109.0 | 26 |
I2V | CogVideoX-5b-I2V | bash shscripts/inference_cogVideo_i2v_diffusers.sh | 49 | 480x720 | 310.4 | 4.78 |
T2V | CogVideoX-2b | bash shscripts/inference_cogVideo_t2v_diffusers.sh | 49 | 480x720 | 107.6 | 2.32 |
T2V | Open Sora V1.0 | bash shscripts/inference_opensora_v10_16x256x256.sh | 16 | 256x256 | 11.2 | 23.99 |
T2V | VideoCrafter-V2-320x512 | bash shscripts/inference_vc2_t2v_320x512.sh | 16 | 320x512 | 26.4 | 10.03 |
T2V | VideoCrafter-V1-576x1024 | bash shscripts/inference_vc1_t2v_576x1024.sh | 16 | 576x1024 | 91.4 | 14.57 |
I2V | DynamiCrafter | bash shscripts/inference_dc_i2v_576x1024.sh | 16 | 576x1024 | 101.7 | 52.23 |
I2V | VideoCrafter-V1 | bash shscripts/inference_vc1_i2v_320x512.sh | 16 | 320x512 | 26.4 | 10.03 |
T2I | Flux-dev | bash shscripts/inference_flux.sh | 1 | 768x1360 | 238.1 | 1.18 |
T2I | Flux-schnell | bash shscripts/inference_flux.sh | 1 | 768x1360 | 5.4 | 1.20 |
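The comparison script described above reads one prompt per line from prompts.txt inside input_dir and takes a semicolon-separated inference_methods value. The sketch below prepares both from Python. It is an illustration only: `write_prompts` and `join_methods` are hypothetical helpers, and the model names must still match the identifiers used in compare.sh.

```python
import tempfile
from pathlib import Path


def write_prompts(input_dir: str, prompts: list[str]) -> Path:
    """Write one prompt per line to <input_dir>/prompts.txt and return its path."""
    directory = Path(input_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "prompts.txt"
    # Collapse internal whitespace so every prompt stays on a single line,
    # and skip prompts that are empty after stripping.
    lines = [" ".join(p.split()) for p in prompts if p.strip()]
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return path


def join_methods(methods: list[str]) -> str:
    """Build the semicolon-separated value for the inference_methods variable."""
    return ";".join(methods)


if __name__ == "__main__":
    # A temporary directory stands in for inputs/t2v in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        prompts_file = write_prompts(
            f"{tmp}/t2v",
            ["A teddy bear walking through a forest", "A cat sitting by a window at night"],
        )
        print(prompts_file.read_text(encoding="utf-8"), end="")
    print(f'inference_methods="{join_methods(["videocrafter2", "opensora"])}"')
```

The printed inference_methods line can be pasted into compare.sh, with input_dir pointing at the directory that holds prompts.txt.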
4. Fine-tune T2V models
4.1 Prepare dataset
Please follow docs/datasets.md to try the provided toy dataset or build your own dataset.
4.2 Fine-tune
1. VideoCrafter2 Full Fine-tuning
Before starting, make sure you have completed the following preliminary steps:
- Install the environment
- Prepare the dataset
- Download the checkpoints and confirm that these two files exist:
ls -l checkpoints/videocrafter/t2v_v2_512/model.ckpt
ls -l checkpoints/stablediffusion/v2-1_512-ema/model.ckpt
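The same check can be scripted so that a training launcher fails early when a checkpoint is missing. A minimal sketch, assuming it runs from the repository root; `missing_checkpoints` is a hypothetical helper and only tests for file existence, not file integrity.

```python
from pathlib import Path

# The two checkpoint files named in the preliminary steps above.
REQUIRED_CHECKPOINTS = [
    "checkpoints/videocrafter/t2v_v2_512/model.ckpt",
    "checkpoints/stablediffusion/v2-1_512-ema/model.ckpt",
]


def missing_checkpoints(repo_root: str = ".") -> list[str]:
    """Return the required checkpoint paths that are not regular files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints(".")
    if missing:
        print("missing checkpoints:")
        for rel in missing:
            print("  " + rel)
    else:
        print("all required checkpoints found")
```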
First, run the following command to convert the VC2 checkpoint, because we make minor modifications to the keys of the checkpoint's state dict. The converted checkpoint is saved automatically to checkpoints/videocrafter/t2v_v2_512/model_converted.ckpt.
python tools/convert_checkpoint.py --input_path checkpoints/videocrafter/t2v_v2_512/model.ckpt
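To make the conversion step less opaque, here is a rough sketch of what a key-renaming conversion generally looks like. It is not the code in tools/convert_checkpoint.py: the output naming follows the path stated above, but the prefixes "old." and "new." are purely hypothetical, and a plain dict stands in for a real state dict so the example runs without PyTorch.

```python
from pathlib import Path


def converted_path(input_path: str) -> Path:
    """Map .../model.ckpt to .../model_converted.ckpt in the same directory."""
    src = Path(input_path)
    return src.with_name(f"{src.stem}_converted{src.suffix}")


def rename_keys(state_dict: dict, old_prefix: str, new_prefix: str) -> dict:
    """Return a copy of state_dict with old_prefix swapped for new_prefix on matching keys."""
    renamed = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        renamed[key] = value
    return renamed


if __name__ == "__main__":
    print(converted_path("checkpoints/videocrafter/t2v_v2_512/model.ckpt").as_posix())
    # Hypothetical prefixes, for illustration only; the real mapping lives in the tool.
    toy_state = {"old.block.weight": 1, "other.bias": 2}
    print(rename_keys(toy_state, "old.", "new."))
```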
Second, run the following command to start training on a single GPU. The training results are saved automatically to results/train/${CURRENT_TIME}_${EXPNAME}.
bash shscripts/train_videocrafter_v2.sh
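Each run gets its own directory under results/train, named ${CURRENT_TIME}_${EXPNAME}. The sketch below finds the most recent run for a given experiment name. `latest_run` is a hypothetical helper; it selects by modification time so that it does not have to assume a particular timestamp format, and it will also match any other experiment whose name ends in the same suffix.

```python
from pathlib import Path
from typing import Optional


def latest_run(expname: str, root: str = "results/train") -> Optional[Path]:
    """Return the most recently modified '<anything>_<expname>' directory under root."""
    base = Path(root)
    if not base.is_dir():
        return None
    candidates = [p for p in base.glob(f"*_{expname}") if p.is_dir()]
    # default=None covers the case where no run directory matches.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    # "my_experiment" is a placeholder; use the EXPNAME set in the training script.
    run = latest_run("my_experiment")
    print(run if run is not None else "no matching run directory found")
```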
2. VideoCrafter2 LoRA Fine-tuning
We support LoRA fine-tuning so that the model can learn new concepts, characters, or styles.
- Example config file:
configs/001_videocrafter2/vc2_t2v_lora.yaml
- Train a LoRA based on VideoCrafter2:
bash shscripts/train_videocrafter_lora.sh
- Run inference with the trained model:
bash shscripts/inference_vc2_t2v_320x512_lora.sh
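For readers new to LoRA, the idea is to keep the base weight frozen and learn a low-rank update that is added on top. The sketch below shows the generic formulation, W = W0 + (alpha / r) * B A, on plain Python lists. It illustrates the technique in general and makes no claim about which layers, rank, or scaling the VideoTuna config uses.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w0, a, b, alpha):
    """Return w0 + (alpha / r) * (b @ a), where r is the LoRA rank.

    Shapes: w0 is (out, in), a is (r, in), b is (out, r).
    """
    r = len(a)
    delta = matmul(b, a)  # low-rank update with the same shape as w0
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


if __name__ == "__main__":
    w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2x3 base weight
    a = [[1.0, 2.0, 3.0]]                    # trainable factor, rank r = 1
    b = [[0.0], [0.0]]                       # zero-initialized trainable factor
    # With b all zeros the update vanishes, so the result equals w0.
    print(lora_weight(w0, a, b, alpha=1.0))
```

Because only the two small factors are trained, the number of trainable parameters is r * (in + out) per adapted layer instead of in * out.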
3. Open-Sora Fine-tuning
We support Open-Sora fine-tuning. Run the following command:
# fine-tune Open-Sora v1.0
bash shscripts/train_opensorav10.sh
Fine-tuning for enhanced language understanding
5. Evaluation
We support VBench for evaluating T2V generation performance. Please check eval/README.md for details.
Acknowledgement
We thank the following repos for sharing their awesome models and code!
- Mochi: A new SOTA in open-source video generation models
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
- Open-Sora: Democratizing Efficient Video Production for All
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
- VADER: Video Diffusion Alignment via Reward Gradients
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- Flux: Text-to-image models from Black Forest Labs.
- SimpleTuner: A fine-tuning kit for text-to-image generation.
Some Resources
- LLMs-Meet-MM-Generation: A paper collection on using LLMs for multimodal generation (image, video, 3D and audio).
- MMTrail: A multimodal trailer video dataset with language and music descriptions.
- Seeing-and-Hearing: A versatile framework for Joint VA generation, V2A, A2V, and I2A.
- Self-Cascade: A Self-Cascade model for higher-resolution image and video generation.
- ScaleCrafter and HiPrompt: Tuning-free methods for higher-resolution image and video generation.
- FreeTraj and FreeNoise: Tuning-free methods for video trajectory control and longer-video generation.
- Follow-Your-Emoji, Follow-Your-Click, and Follow-Your-Pose: Follow family for controllable video generation.
- Animate-A-Story: A framework for storytelling video generation.
- LVDM: Latent Video Diffusion Model for long video generation and text-to-video generation.
Contributors
<a href="https://github.com/VideoVerses/VideoTuna/graphs/contributors"> <img src="https://contrib.rocks/image?repo=VideoVerses/VideoTuna" /> </a>
License
This project is released under CC-BY-NC-ND. For license authorization, please contact the project leads Yingqing He (yhebm@connect.ust.hk) and Yazhou Xing (yxingag@connect.ust.hk).
Citation
@software{videotuna,
author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
month = {Nov},
year = {2024},
url = {https://github.com/VideoVerses/VideoTuna}
}