<div align="center">

<h1>LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment</h1>

Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin†, Hao Li†

[Fudan University]

[Shanghai Academy of Artificial Intelligence for Science]

[Australian Institute for Machine Learning, The University of Adelaide]

(† corresponding author)

<a href="https://arxiv.org/pdf/2412.04814"><img src='https://img.shields.io/badge/arXiv-LiFT-blue' alt='Paper PDF'></a>
<a href="https://codegoat24.github.io/LiFT/"><img src='https://img.shields.io/badge/Project-Website-orange' alt='Project Page'></a>

</div>

## 🔥 News
- [2024/12/22] 🔥 We have updated our LiFT-HRA 10K/20K dataset. Download the latest version here!
- [2024/12/20] 🔥 The supplementary material of our paper will be updated on arXiv soon.
- [2024/12/17] 🔥 We released our optimized evaluation prompts derived from VBench in `Vbench/Vbench_full_info_opt.json` so that users can reproduce the results in our paper.
- [2024/12/17] 🔥🔥 We released our LiFT-HRA dataset (10K/20K) and the enhanced LiFT-Critic-v1.5!
- [2024/12/16] 🔥 Our LiFT-HRA dataset (10K/20K) and the enhanced LiFT-Critic-v1.5 are coming soon!
- [2024/12/10] 🔥🔥 We released the training and inference code.
- [2024/12/9] 🔥 We released LiFT-Critic-v1.0 and CogVideoX-2B-LiFT. Our code is coming soon!
- [2024/12/9] 🔥 We released the paper.
- [2024/12/6] 🔥 We launched the project page.
## 📖 Abstract
<p>
Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models still fall short of aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address because human preferences are inherently subjective and hard to formalize as objective functions. This paper therefore proposes LiFT, a novel fine-tuning method that leverages human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, which includes approximately 10k human annotations, each comprising a score and the corresponding rationale. Based on this, we train a reward model, LiFT-Critic, to learn the human feedback-based reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.
</p>
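For reference, a common way to write the reward-weighted likelihood objective mentioned above is sketched below. The notation here is illustrative ($r_{\phi}$ for the learned LiFT-Critic reward, $p_{\theta}$ for the T2V model being aligned); the exact formulation in the paper may differ:

$$
\max_{\theta} \; \mathbb{E}_{(c,\, x) \sim \mathcal{D}} \big[ \, r_{\phi}(x, c) \, \log p_{\theta}(x \mid c) \, \big]
$$

where $c$ is the text prompt, $x$ the synthesized video, and $\mathcal{D}$ the fine-tuning data.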
## 🔧 Installation

- Clone this repository and navigate to the LiFT folder:

```bash
git clone https://github.com/CodeGoat24/LiFT.git
cd LiFT
```
- Install the required packages:

```bash
bash ./environment_setup.sh lift
```
## 🚀 Inference
### LiFT-Critic-13b/40b-lora Weights
Please download all public LiFT-Critic checkpoints from Huggingface.
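For example, a minimal download sketch using `huggingface_hub` (the repository id below is a placeholder; use the actual checkpoint repositories linked above):

```python
# Minimal download sketch (assumes huggingface_hub is installed).
# NOTE: the repo_id is a placeholder; replace it with the actual
# LiFT-Critic checkpoint repository on Hugging Face.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/LiFT-Critic-13b-lora",  # placeholder repository id
    local_dir="./LiFT-Critic-13b-lora",    # matches the --model-path used below
)
```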
### Run

We provide some synthesized videos for quick inference in the `./demo` directory.
LiFT-Critic-13b:

```bash
python LiFT-Critic/test/run_critic_13b.py --model-path ./LiFT-Critic-13b-lora
```

LiFT-Critic-40b:

```bash
python LiFT-Critic/test/run_critic_40b.py --model-path ./LiFT-Critic-40b-lora
```
### Examples
## 💻 Training
LiFT-Critic is trained on 8 H100 GPUs with 80GB memory.
### Dataset
Please download our LiFT-HRA dataset and the 1K subset of VIDGEN-1M (derived from HD-VILA) that we used in our paper, and place them under the `./dataset` directory. The data structure should look like this:
```
dataset
├── LiFT-HRA
│   ├── LiFT-HRA-data.json
│   └── videos
├── VIDGEN
│   ├── vidgen-data.json
│   └── videos
```
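Optionally, a quick sanity check of the layout (a sketch based on the tree above; adjust paths if your setup differs):

```python
# Verify that the expected dataset files and folders are in place.
from pathlib import Path

root = Path("./dataset")
expected = [
    root / "LiFT-HRA" / "LiFT-HRA-data.json",
    root / "LiFT-HRA" / "videos",
    root / "VIDGEN" / "vidgen-data.json",
    root / "VIDGEN" / "videos",
]
missing = [p for p in expected if not p.exists()]
print("Missing entries:" if missing else "Dataset layout looks good.")
for p in missing:
    print(" ", p)
```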
### Training
LiFT-Critic-13b:

```bash
bash LiFT_Critic/train/train_critic_13b.sh
```

LiFT-Critic-40b:

```bash
bash LiFT_Critic/train/train_critic_40b.sh
```
## 🗓️ TODO
- ✅ Release project page
- ✅ Release paper
- ✅ Release LiFT-Critic 13B/40B-v1.0
- ✅ Release CogVideoX-2B-LiFT
- ✅ Release inference code
- ✅ Release training code
- ✅ Release LiFT-Critic 13B/40B-v1.5
- ✅ Release dataset LiFT-HRA 10K
- ✅ Release dataset LiFT-HRA 20K
- Release CogVideoX-5B-LiFT
- Release LiFT-Critic 13B/40B-v2.0
## 📧 Contact
If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.
## 🖊️ Citation
🌟 If you find our work helpful, please leave us a star and cite our paper.
```bibtex
@article{LiFT,
  title={LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment},
  author={Wang, Yibin and Tan, Zhiyu and Wang, Junyan and Yang, Xiaomeng and Jin, Cheng and Li, Hao},
  journal={arXiv preprint arXiv:2412.04814},
  year={2024}
}
```
## 🖼️ Results
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td>
      <h2>CogVideoX-2B</h2>
      <video src="https://github.com/user-attachments/assets/6e05e678-88ad-499a-b31f-66679746f7b7" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <h2>CogVideoX-2B-LiFT (Ours)</h2>
      <video src="https://github.com/user-attachments/assets/e45af501-8d89-4db0-8e4c-3a1e1b0e948b" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td><video src="https://github.com/user-attachments/assets/a5a35d67-3ce1-415a-a7f4-c2e982b3b318" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/aea1c0ff-cc1c-476a-8c0e-7c4a34ed404d" width="100%" controls autoplay loop></video></td>
  </tr>
  <tr>
    <td><video src="https://github.com/user-attachments/assets/8818d282-09e2-47df-9f50-92c6281c7da7" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/df1c487a-3a60-4ee2-b8ef-98fafed9bb09" width="100%" controls autoplay loop></video></td>
  </tr>
  <tr>
    <td><video src="https://github.com/user-attachments/assets/59874ca4-d3df-4e76-a1bc-909f5d3424c5" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/f7ced2e8-7e68-4549-91b7-164d54a7bad3" width="100%" controls autoplay loop></video></td>
  </tr>
  <tr>
    <td><video src="https://github.com/user-attachments/assets/c1930e74-b9e2-4df2-84a2-f51bcbf153fe" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/5d310ea7-ba24-4e83-8701-e2bb4217837d" width="100%" controls autoplay loop></video></td>
  </tr>
  <tr>
    <td><video src="https://github.com/user-attachments/assets/b426c98a-6816-4fe1-aabf-cf9444262761" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/81ea3a02-979f-43a4-97ca-445f3414b51f" width="100%" controls autoplay loop></video></td>
  </tr>
  <tr>
    <td><video src="https://github.com/user-attachments/assets/b51b211f-20ea-4895-b117-a147bc7f63a8" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/54e52501-087b-4127-9a3c-fd481c990820" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

## 🙏 Acknowledgement
Our work is based on LLaVA and VILA. Thanks to all the contributors!