<p align="center" width="50%"> <img src="https://github.com/user-attachments/assets/38efb5bc-723e-4012-aebd-f55723c593fb" alt="VideoTuna" style="width: 75%; min-width: 450px; display: block; margin: auto; background-color: transparent;"> </p>VideoTuna
<a href='https://github.com/user-attachments/assets/a48d57a3-4d89-482c-8181-e0bce4f750fd'><img src='https://badges.aleen42.com/src/wechat.svg'></a>
VideoTuna is a useful codebase for text-to-video applications.
- VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation (to the best of our knowledge).
- VideoTuna is the first repo that provides comprehensive pipelines for video generation, including pre-training, continuous training, post-training (alignment), and fine-tuning (to the best of our knowledge).
- The models in VideoTuna cover both U-Net and DiT architectures for visual generation tasks.
- A new 3D video VAE and a controllable facial video generation model will be released soon.
## Features
- All-in-one framework: Run inference and fine-tune up-to-date video generation models.
- Pre-training: Build your own foundational text-to-video model.
- Continuous training: Keep improving your model with new data.
- Domain-specific fine-tuning: Adapt models to your specific scenario.
- Concept-specific fine-tuning: Teach your models unique concepts.
- Enhanced language understanding: Improve model comprehension through continuous training.
- Post-processing: Enhance videos with a video-to-video enhancement model.
- Post-training/human preference alignment: Post-training with RLHF for more attractive results.
## Updates
- [2024-11-01] We made VideoTuna v0.1.0 public!
## Demo
### Model Inference and Comparison
### 3D Video VAE
The 3D video VAE from VideoTuna can accurately compress and reconstruct the input videos with fine details.
<table class="center"> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/0efcbf80-0074-4421-810f-79a1f1733ed3"><img src="https://github.com/user-attachments/assets/0efcbf80-0074-4421-810f-79a1f1733ed3" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/4adf29f2-d413-49b1-bccc-48adfd64a4da"><img src="https://github.com/user-attachments/assets/4adf29f2-d413-49b1-bccc-48adfd64a4da" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/508e2b10-487a-4850-a9d8-89fdbc13120a"><img src="https://github.com/user-attachments/assets/508e2b10-487a-4850-a9d8-89fdbc13120a" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/28029b2c-4922-46ee-88d3-ff8d577d2525"><img src="https://github.com/user-attachments/assets/28029b2c-4922-46ee-88d3-ff8d577d2525)" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/51471c42-8a38-4f02-b29b-e34a5279753a"><img src="https://github.com/user-attachments/assets/51471c42-8a38-4f02-b29b-e34a5279753a" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/383f2120-5fed-4d9f-82d8-3de130d6bd65"><img src="https://github.com/user-attachments/assets/383f2120-5fed-4d9f-82d8-3de130d6bd65" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/28892f4c-54cc-4cea-91e1-8fe6bbc7a1a4"><img src="https://github.com/user-attachments/assets/28892f4c-54cc-4cea-91e1-8fe6bbc7a1a4" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/d56d34e4-c4d5-4ed2-b2c5-fa1596714492"><img src="https://github.com/user-attachments/assets/d56d34e4-c4d5-4ed2-b2c5-fa1596714492" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/a0ffc2ca-c3e2-485f-b0ea-ead0d733cc8b"><img src="https://github.com/user-attachments/assets/a0ffc2ca-c3e2-485f-b0ea-ead0d733cc8b" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/1465ac70-caa9-42c7-874b-b01e13a78efb"><img src="https://github.com/user-attachments/assets/1465ac70-caa9-42c7-874b-b01e13a78efb" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/48e2eb49-265b-4eaf-b730-48fa4d7e5bfd"><img src="https://github.com/user-attachments/assets/48e2eb49-265b-4eaf-b730-48fa4d7e5bfd" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/24c893c5-865e-4af4-b003-17bda2ba4f59"><img src="https://github.com/user-attachments/assets/24c893c5-865e-4af4-b003-17bda2ba4f59" width="320"></a></td> </tr> <tr> <td style="text-align:center;" width="320">Ground Truth</td> <td style="text-align:center;" width="320">Reconstruction</td> </tr> <tr> <td><a 
href="https://github.com/user-attachments/assets/c18ed80f-3650-43a7-8438-7914de7e39ab"><img src="https://github.com/user-attachments/assets/c18ed80f-3650-43a7-8438-7914de7e39ab" width="320"></a></td> <td><a href="https://github.com/user-attachments/assets/89d38004-021b-4a4d-ab83-5627474f8928"><img src="https://github.com/user-attachments/assets/89d38004-021b-4a4d-ab83-5627474f8928" width="320"></a></td> </tr> </table>Face domain
<table class="center"> <tr> <td><img src="https://github.com/user-attachments/assets/a1562c70-d97c-4324-bb11-47db2b83f443" width="240" alt="Image 1"></td> <td><img src="https://github.com/user-attachments/assets/3196810b-48d7-4024-b687-df2009774631" width="240" alt="Image 2"></td> <td><img src="https://github.com/user-attachments/assets/4e873f4c-ca56-4549-aaa1-ef24032ae96b" width="240" alt="Image 3"></td> </tr> <tr> <td style="text-align: center;">Input 1</td> <td style="text-align: center;">Input 2</td> <td style="text-align: center;">Input 3</td> </tr> </table> <table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/972dde7a-fa88-479a-a47e-71d3650b1826"><img src="https://github.com/user-attachments/assets/972dde7a-fa88-479a-a47e-71d3650b1826" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/3c216090-9ad1-4911-b990-179b45314d3e"><img src="https://github.com/user-attachments/assets/3c216090-9ad1-4911-b990-179b45314d3e" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/2e2fb78e-2f39-47bd-acaf-3cfbce83b162"><img src="https://github.com/user-attachments/assets/2e2fb78e-2f39-47bd-acaf-3cfbce83b162" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Anger</td> <td style="text-align:center;">Emotion: Disgust</td> <td style="text-align:center;">Emotion: Fear</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/f2f55021-4e0d-43a7-9f57-3c94b772f573"><img src="https://github.com/user-attachments/assets/f2f55021-4e0d-43a7-9f57-3c94b772f573" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/600a2f6c-7a8f-4304-bdc3-5f0d65d4fb83"><img src="https://github.com/user-attachments/assets/600a2f6c-7a8f-4304-bdc3-5f0d65d4fb83" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/8ad7c7d8-6492-4435-9436-168f90429be3"><img src="https://github.com/user-attachments/assets/8ad7c7d8-6492-4435-9436-168f90429be3" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Happy</td> <td style="text-align:center;">Emotion: Sad</td> <td style="text-align:center;">Emotion: Surprise</td> </tr> </table> <table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/8ba84071-1978-4245-84b3-3a6fc3c9fa5a"><img src="https://github.com/user-attachments/assets/8ba84071-1978-4245-84b3-3a6fc3c9fa5a" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/d180c358-bdff-40b4-aa5a-fa4ec73d80b6"><img src="https://github.com/user-attachments/assets/d180c358-bdff-40b4-aa5a-fa4ec73d80b6" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/37004c20-3d0d-4cff-8b4a-f8a4d184de51"><img src="https://github.com/user-attachments/assets/37004c20-3d0d-4cff-8b4a-f8a4d184de51" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Anger</td> <td style="text-align:center;">Emotion: Disgust</td> <td style="text-align:center;">Emotion: Fear</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/025fe090-7d53-4a12-9498-1c814a0ee768"><img src="https://github.com/user-attachments/assets/025fe090-7d53-4a12-9498-1c814a0ee768" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/e8ddf3d1-57ea-4545-a004-66554c19f27b"><img src="https://github.com/user-attachments/assets/e8ddf3d1-57ea-4545-a004-66554c19f27b" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/519b3c87-baa6-408b-b3a5-a95eece9e19e"><img 
src="https://github.com/user-attachments/assets/519b3c87-baa6-408b-b3a5-a95eece9e19e" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Happy</td> <td style="text-align:center;">Emotion: Sad</td> <td style="text-align:center;">Emotion: Surprise</td> </tr> </table> <table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/f55a2b6c-ce10-4716-a001-f747c0da17a4"><img src="https://github.com/user-attachments/assets/f55a2b6c-ce10-4716-a001-f747c0da17a4" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/bb3620af-322d-43bc-9060-c7ce9fc32672"><img src="https://github.com/user-attachments/assets/bb3620af-322d-43bc-9060-c7ce9fc32672" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/46a39738-89c2-43fe-9f98-f0ac0d26e39b"><img src="https://github.com/user-attachments/assets/46a39738-89c2-43fe-9f98-f0ac0d26e39b" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Anger</td> <td style="text-align:center;">Emotion: Disgust</td> <td style="text-align:center;">Emotion: Fear</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/2d3d6e0d-2034-4341-8ca3-42e0cda2704f"><img src="https://github.com/user-attachments/assets/2d3d6e0d-2034-4341-8ca3-42e0cda2704f" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/331f23e1-f441-46f7-98a6-b25a684780f3"><img src="https://github.com/user-attachments/assets/331f23e1-f441-46f7-98a6-b25a684780f3" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/542f8535-634a-4f82-ae2a-39f988c6bc55"><img src="https://github.com/user-attachments/assets/542f8535-634a-4f82-ae2a-39f988c6bc55" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Emotion: Happy</td> <td style="text-align:center;">Emotion: Sad</td> <td style="text-align:center;">Emotion: Surprise</td> </tr> </table>Storytelling
<table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/27aee539-f2bf-467a-8da5-22f506713aa0"><img src="https://github.com/user-attachments/assets/27aee539-f2bf-467a-8da5-22f506713aa0" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/fef5b694-6e1f-42f6-a5a1-7f0b856f3678"><img src="https://github.com/user-attachments/assets/fef5b694-6e1f-42f6-a5a1-7f0b856f3678" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/91408eb8-264d-4d3f-9098-0bfe06022467"><img src="https://github.com/user-attachments/assets/91408eb8-264d-4d3f-9098-0bfe06022467" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/0b822858-da6f-4e6f-82dd-5f243e2feccc"><img src="https://github.com/user-attachments/assets/0b822858-da6f-4e6f-82dd-5f243e2feccc" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/75f69de2-9e5c-48d7-ae55-b28084772836"><img src="https://github.com/user-attachments/assets/75f69de2-9e5c-48d7-ae55-b28084772836" width="240"></a></td> </tr> <tr> <td style="text-align:center;">The picture shows a cozy room with a little girl telling her travel story to her teddybear beside the bed.</td> <td style="text-align:center;">As night falls, teddybear sits by the window, his eyes sparkling with longing for the distant place</td> <td style="text-align:center;">Teddybear was in a corner of the room, making a small backpack out of old cloth strips, with a map, a compass and dry food next to it.</td> <td style="text-align:center;">The first rays of sunlight in the morning came through the window, and teddybear quietly opened the door and embarked on his adventure.</td> <td style="text-align:center;">In the forest, the sun shines through the treetops, and teddybear moves among various animals and communicates with them.</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/3ae06dbf-f41f-4e7f-b384-fca3abf2c0aa"><img src="https://github.com/user-attachments/assets/3ae06dbf-f41f-4e7f-b384-fca3abf2c0aa" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/09a2fdb9-3e84-40a5-a729-075876b10412"><img src="https://github.com/user-attachments/assets/09a2fdb9-3e84-40a5-a729-075876b10412" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/a382aa30-895d-4476-8f86-b668fe153c16"><img src="https://github.com/user-attachments/assets/a382aa30-895d-4476-8f86-b668fe153c16" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/c906b20e-1576-4a96-9c77-b67a940dcce8"><img src="https://github.com/user-attachments/assets/c906b20e-1576-4a96-9c77-b67a940dcce8" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/04f553d2-8ac5-4085-8c1b-59bddf9deb41"><img src="https://github.com/user-attachments/assets/04f553d2-8ac5-4085-8c1b-59bddf9deb41" width="240"></a></td> </tr> <tr> <td style="text-align:center;">Teddybear leaves his mark on the edge of a clear lake, surrounded by exotic flowers, and the picture is full of mystery and exploration.</td> <td style="text-align:center;">Teddybear climbs the rugged mountain road, the weather is changeable, but he is determined.</td> <td style="text-align:center;">The picture switches to the top of the mountain, where teddybear stands in the glow of the sunrise, with a magnificent mountain view in the background.</td> <td style="text-align:center;">On the way home, teddybear helps a wounded bird, the picture is warm and touching.</td> <td 
style="text-align:center;">Teddybear sits by the little girl's bed and tells her his adventure story, and the little girl is fascinated.</td> </tr> </table> <table class="center"> <tr> <td><a href="https://github.com/user-attachments/assets/14c5f6f1-7830-46fc-9ebd-8611f339d8ab"><img src="https://github.com/user-attachments/assets/14c5f6f1-7830-46fc-9ebd-8611f339d8ab" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/705eb5c4-b084-4752-a47b-ab20d206b9ee"><img src="https://github.com/user-attachments/assets/705eb5c4-b084-4752-a47b-ab20d206b9ee" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/39a814b6-8692-41f5-819d-eecfd6085b03"><img src="https://github.com/user-attachments/assets/39a814b6-8692-41f5-819d-eecfd6085b03" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/e2b7297d-d0bb-482e-9c16-f01f9a56fbb0"><img src="https://github.com/user-attachments/assets/e2b7297d-d0bb-482e-9c16-f01f9a56fbb0" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/3afc9aa8-92f5-4e27-bfed-950300645748"><img src="https://github.com/user-attachments/assets/3afc9aa8-92f5-4e27-bfed-950300645748" width="240"></a></td> </tr> <tr> <td style="text-align:center;">The scene shows a peaceful village, with moonlight shining on the roofs and streets, creating a peaceful atmosphere.</td> <td style="text-align:center;">cat sits by the window, her eyes twinkling in the night, reflecting her special connection with the moon and stars.</td> <td style="text-align:center;">Villagers gather in the center of the village for the annual Moon Festival celebration, with lanterns and colored lights adorning the night sky.</td> <td style="text-align:center;">cat feels the call of the moon, and her beard trembles with the excitement in her heart.</td> <td style="text-align:center;">cat quietly leaves her home in the night and embarks on a path illuminated by the silver moonlight.</td> </tr> <tr> <td><a href="https://github.com/user-attachments/assets/b055930e-4f97-4872-bc07-2e9cb5641d7d"><img src="https://github.com/user-attachments/assets/b055930e-4f97-4872-bc07-2e9cb5641d7d" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/ece35515-295d-4655-a5c5-cca95fd11e92"><img src="https://github.com/user-attachments/assets/ece35515-295d-4655-a5c5-cca95fd11e92" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/804e32f1-75ae-4a0b-b3c0-1d13c6e2d987"><img src="https://github.com/user-attachments/assets/804e32f1-75ae-4a0b-b3c0-1d13c6e2d987" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/e09d8134-7991-4585-a09a-b00283ab6a56"><img src="https://github.com/user-attachments/assets/e09d8134-7991-4585-a09a-b00283ab6a56" width="240"></a></td> <td><a href="https://github.com/user-attachments/assets/4c722b97-79d9-4281-8451-b31eb3393c3a"><img src="https://github.com/user-attachments/assets/4c722b97-79d9-4281-8451-b31eb3393c3a" width="240"></a></td> </tr> <tr> <td style="text-align:center;">A group of forest elves dance around glowing mushrooms, their costumes and movements full of magic and vitality.</td> <td style="text-align:center;">cat joins the celebration and dances with the elves, the picture is full of joy and freedom.</td> <td style="text-align:center;">A wise old owl reveals the secret power of the moon to cat and the light of the moon in the picture becomes brighter.</td> <td style="text-align:center;">cat closes her eyes in the moonlight, puts her hands 
together, and makes a wish, surrounded by the light of stars and the moon.</td> <td style="text-align:center;">cat feels the surge of power, and her eyes become more determined.</td> </tr> </table>

## TODOs
- More demos and applications
- More functionalities such as control modules. (Suggestions are welcome!)
## Information
### Code Structure
VideoTuna/
├── assets        # put images for readme
├── checkpoints   # put model checkpoints here
├── configs       # model and experimental configs
├── data          # data processing scripts and dataset files
├── docs          # documentation
├── eval          # evaluation scripts
├── inputs        # input examples for testing
├── scripts       # train and inference python scripts
├── shscripts     # train and inference shell scripts
├── src           # model-related source code
├── tests         # testing scripts
└── tools         # utility scripts
### Supported Models
| T2V-Models | HxWxL | Checkpoints |
|---|---|---|
| CogVideoX-2B | 720x480, 6s | Hugging Face |
| CogVideoX-5B | 720x480, 6s | Hugging Face |
| Open-Sora 1.0 | 512x512x16 | Hugging Face |
| Open-Sora 1.0 | 256x256x16 | Hugging Face |
| Open-Sora 1.0 | 256x256x16 | Hugging Face |
| VideoCrafter2 | 320x512x16 | Hugging Face |
| VideoCrafter1 | 576x1024x16 | Hugging Face |
| VideoCrafter1 | 320x512x16 | Hugging Face |
| I2V-Models | HxWxL | Checkpoints |
|---|---|---|
| CogVideoX-5B-I2V | 720x480, 6s | Hugging Face |
| DynamiCrafter | 576x1024x16 | Hugging Face |
| VideoCrafter1 | 320x512x16 | Hugging Face |
- Note: H: height; W: width; L: length
Please check docs/CHECKPOINTS.md to download all the model checkpoints.
## Get started
### 1. Prepare environment
conda create --name videotuna python=3.10 -y
conda activate videotuna
pip install -U poetry pip
poetry config virtualenvs.create false
poetry install
pip install optimum-quanto==0.2.1
pip install -r requirements.txt
git clone https://github.com/JingyeChen/SwissArmyTransformer
pip install -e SwissArmyTransformer/
rm -rf SwissArmyTransformer
git clone https://github.com/tgxs002/HPSv2.git
cd ./HPSv2
pip install -e .
cd ..
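After installation, a quick sanity check can confirm that the environment is usable. This is a minimal sketch that only assumes PyTorch was pulled in by the steps above; adjust it to your setup:

```python
# check_env.py -- minimal sanity check for the freshly created "videotuna" env.
# It only assumes PyTorch was installed by the steps above; adjust as needed.
import torch

print("torch version :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU           :", torch.cuda.get_device_name(0))
```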
### 2. Prepare checkpoints
Please follow docs/CHECKPOINTS.md to download model checkpoints.
After downloading, place the model checkpoints according to the checkpoint structure described in docs/CHECKPOINTS.md.
### 3. Inference state-of-the-art T2V/I2V/T2I models
- Run inference for a set of text-to-video models in one command: `bash tools/video_comparison/compare.sh`
  - The default mode runs all models, e.g., `inference_methods="videocrafter2;dynamicrafter;cogvideo-t2v;cogvideo-i2v;opensora"`.
  - To run only specific models, modify the `inference_methods` variable in `compare.sh` and list the desired models separated by semicolons.
  - Also specify the input directory via the `input_dir` variable. This directory should contain a `prompts.txt` file, where each line corresponds to a prompt for video generation. The default `input_dir` is `inputs/t2v` (a helper script for creating `prompts.txt` is sketched below the commands).
- Run inference for a set of image-to-video models in one command: `bash tools/video_comparison/compare_i2v.sh`
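As a concrete example of the prompt-file format described above (one prompt per line), the snippet below writes a small `prompts.txt` into the default `inputs/t2v` directory. The prompts themselves are only placeholders; point `input_dir` at any directory laid out the same way:

```python
# make_prompts.py -- create a prompts.txt for compare.sh (one prompt per line).
# "inputs/t2v" is the default input_dir mentioned above; the prompts are only
# placeholders for illustration.
from pathlib import Path

prompts = [
    "A corgi running on a beach at sunset, cinematic lighting",
    "Timelapse of clouds drifting over a snow-capped mountain range",
]

out_file = Path("inputs/t2v/prompts.txt")
out_file.parent.mkdir(parents=True, exist_ok=True)
out_file.write_text("\n".join(prompts) + "\n", encoding="utf-8")
print(f"Wrote {len(prompts)} prompts to {out_file}")
```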
- To run inference with a specific model, use the corresponding command from the table below:
| Task | Model | Command | Length (#frames) | Resolution | Inference Time (s) | GPU Memory (GiB) |
|---|---|---|---|---|---|---|
| I2V | CogVideoX-5b-I2V | `bash shscripts/inference_cogVideo_i2v_diffusers.sh` | 49 | 576x1024 | 310.4 | 4.78 |
| T2V | CogVideoX-2b | `bash shscripts/inference_cogVideo_t2v_diffusers.sh` | 49 | 576x1024 | 107.6 | 2.32 |
| T2V | Open Sora V1.0 | `bash shscripts/inference_opensora_v10_16x256x256.sh` | 16 | 256x256 | 11.2 | 23.99 |
| T2V | VideoCrafter-V2-320x512 | `bash shscripts/inference_vc2_t2v_320x512.sh` | 16 | 320x512 | 26.4 | 10.03 |
| T2V | VideoCrafter-V1-576x1024 | `bash shscripts/inference_vc1_t2v_576x1024.sh` | 16 | 576x1024 | 91.4 | 14.57 |
| I2V | DynamiCrafter | `bash shscripts/inference_dc_i2v_576x1024.sh` | 16 | 576x1024 | 101.7 | 52.23 |
| I2V | VideoCrafter-V1 | `bash shscripts/inference_vc1_i2v_320x512.sh` | 16 | 320x512 | 26.4 | 10.03 |
| T2I | Flux-dev | `bash shscripts/inference_flux.sh` | 1 | 768x1360 | 238.1 | 1.18 |
| T2I | Flux-schnell | `bash shscripts/inference_flux.sh` | 1 | 768x1360 | 5.4 | 1.20 |
### 4. Finetune T2V models
#### (1). Prepare Dataset
Please follow docs/datasets.md to try the provided toy dataset or build your own datasets.
#### (2). Finetune
Open-Sora finetuning
We support Open-Sora finetuning; you can simply run the following command:
# finetune the Open-Sora v1.0
bash shscripts/train_opensorav10.sh
LoRA finetuning
We support LoRA finetuning to let the model learn new concepts/characters/styles.
- Example config file: `configs/001_videocrafter2/vc2_t2v_lora.yaml`
- Train a LoRA based on VideoCrafter2: `bash shscripts/train_videocrafter_lora.sh`
- Run inference with the trained LoRA: `bash shscripts/inference_vc2_t2v_320x512_lora.sh`
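For readers new to LoRA, here is a minimal, generic PyTorch sketch of the idea behind this kind of finetuning: the pretrained weights are frozen and only a small low-rank update is trained. It is illustrative only and is not VideoTuna's implementation (which is configured through the YAML file above):

```python
# lora_sketch.py -- a generic illustration of the LoRA idea, not VideoTuna's code.
# A frozen pretrained Linear layer is augmented with a trainable low-rank update
# W + (alpha / r) * B @ A, so only the small A and B matrices are finetuned.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # pretrained path + learned low-rank update
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


layer = LoRALinear(nn.Linear(320, 320))
out = layer(torch.randn(2, 320))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, f"trainable params: {trainable}")
```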
Finetuning for enhanced language understanding
### 5. Evaluation
We support VBench for evaluating T2V generation performance. Please check eval/README.md for details.
<!-- ### 6. Alignment We support video alignment post-training to align human preference for video diffusion models. Please check [configs/train/004_rlhf_vc2/README.md](configs/train/004_rlhf_vc2/README.md) for details. -->

## Acknowledgement
We thank the following repos for sharing their awesome models and code!
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
- Open-Sora: Democratizing Efficient Video Production for All
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
- VADER: Video Diffusion Alignment via Reward Gradients
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- Flux: Text-to-image models from Black Forest Labs.
- SimpleTuner: A fine-tuning kit for text-to-image generation.
## Some Resources
- LLMs-Meet-MM-Generation: A paper collection of utilizing LLMs for multimodal generation (image, video, 3D and audio).
- MMTrail: A multimodal trailer video dataset with language and music descriptions.
- Seeing-and-Hearing: A versatile framework for Joint VA generation, V2A, A2V, and I2A.
- Self-Cascade: A Self-Cascade model for higher-resolution image and video generation.
- ScaleCrafter and HiPrompt: Free method for higher-resolution image and video generation.
- FreeTraj and FreeNoise: Free method for video trajectory control and longer-video generation.
- Follow-Your-Emoji, Follow-Your-Click, and Follow-Your-Pose: Follow family for controllable video generation.
- Animate-A-Story: A framework for storytelling video generation.
- LVDM: Latent Video Diffusion Model for long video generation and text-to-video generation.
## Contributors
<a href="https://github.com/VideoVerses/VideoTuna/graphs/contributors"> <img src="https://contrib.rocks/image?repo=VideoVerses/VideoTuna" /> </a>π License
Please follow CC-BY-NC-ND. If you want a license authorization, please contact yhebm@connect.ust.hk and yxingag@connect.ust.hk.
## Citation
To be updated...