<div align="center"> <h1>LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment</h1>

Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin†, Hao Li†

[Fudan University]

[Shanghai Academy of Artificial Intelligence for Science]

[Australian Institute for Machine Learning, The University of Adelaide]

(†corresponding author)

<a href="https://arxiv.org/pdf/2412.04814"> <img src='https://img.shields.io/badge/arXiv-LiFT-blue' alt='Paper PDF'></a> <a href="https://codegoat24.github.io/LiFT/"> <img src='https://img.shields.io/badge/Project-Website-orange' alt='Project Page'></a>


</div>

🔥 News

📖 Abstract

<p> Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a novel fine-tuning method leveraging human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, which includes approximately 10k human annotations, each comprising a score and the corresponding rationale. Based on this, we train a reward model, LiFT-Critic, to learn the human feedback-based reward function effectively; it serves as a proxy for human judgment, measuring the alignment between a given video and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos. </p>
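For intuition, the alignment stage maximizes a reward-weighted likelihood. A simplified form of that objective (our notation, summarizing the abstract rather than quoting the paper) is:

$$
\max_{\theta}\ \mathbb{E}_{(x,\,c)}\big[\, r_{\phi}(x, c)\, \log p_{\theta}(x \mid c) \,\big],
$$

where $p_{\theta}$ is the T2V model being fine-tuned, $c$ is a text prompt, $x$ is a synthesized video, and $r_{\phi}(x, c)$ is the reward predicted by LiFT-Critic.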

*(Teaser figure)*

🔧 Installation

  1. Clone this repository and navigate to the LiFT folder:

```bash
git clone https://github.com/CodeGoat24/LiFT.git
cd LiFT
```

  2. Install packages:

```bash
bash ./environment_setup.sh lift
```

🚀 Inference

LiFT-Critic-13b/40b-lora Weights

Please download all public LiFT-Critic checkpoints from Hugging Face.

Run

We provide some synthesized videos in the ./demo directory for quick inference.

LiFT-Critic-13b:

```bash
python LiFT-Critic/test/run_critic_13b.py --model-path ./LiFT-Critic-13b-lora
```

LiFT-Critic-40b:

```bash
python LiFT-Critic/test/run_critic_40b.py --model-path ./LiFT-Critic-40b-lora
```
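To run both public critic sizes over the bundled demo videos in one pass, a small wrapper such as the one below works; it simply replays the two commands above and assumes the checkpoint folders sit in the repository root as shown.

```python
import subprocess

# Replay the two inference commands above; adjust --model-path if your
# checkpoints are stored elsewhere.
runs = [
    ("LiFT-Critic/test/run_critic_13b.py", "./LiFT-Critic-13b-lora"),
    ("LiFT-Critic/test/run_critic_40b.py", "./LiFT-Critic-40b-lora"),
]

for script, ckpt in runs:
    subprocess.run(["python", script, "--model-path", ckpt], check=True)
```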

Examples

*(Figure: LiFT-Critic example cases)*

💻 Training

LiFT-Critic is trained on 8 H100 GPUs, each with 80GB of memory.

Dataset

Please download our LiFT-HRA dataset and the 1K subset of VIDGEN-1M (derived from HD-VILA) used in our paper.

Please put them under the ./dataset directory. The expected structure is:

```
dataset
├── LiFT-HRA
│  ├── LiFT-HRA-data.json
│  ├── videos
├── VIDGEN
│  ├── vidgen-data.json
│  ├── videos
```
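As a quick sanity check after downloading, you can peek at the annotation files to confirm the layout. The snippet below only assumes each JSON file holds a list of per-sample dicts; it prints whatever keys it finds rather than assuming a particular schema.

```python
import json
from pathlib import Path

# Paths follow the directory tree above.
annotation_files = [
    Path("dataset/LiFT-HRA/LiFT-HRA-data.json"),
    Path("dataset/VIDGEN/vidgen-data.json"),
]

for ann_file in annotation_files:
    with ann_file.open() as f:
        data = json.load(f)
    print(f"{ann_file}: {len(data)} samples")
    # Assumes a list of per-sample dicts; show the first sample's keys
    # instead of hard-coding a schema.
    if isinstance(data, list) and data and isinstance(data[0], dict):
        print("  example keys:", sorted(data[0].keys()))
```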

Training

LiFT-Critic-13b:

```bash
bash LiFT_Critic/train/train_critic_13b.sh
```

LiFT-Critic-40b:

```bash
bash LiFT_Critic/train/train_critic_40b.sh
```

πŸ—“οΈ TODO

📧 Contact

If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.

πŸ–ŠοΈ Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

```bibtex
@article{LiFT,
  title={LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment},
  author={Wang, Yibin and Tan, Zhiyu and Wang, Junyan and Yang, Xiaomeng and Jin, Cheng and Li, Hao},
  journal={arXiv preprint arXiv:2412.04814},
  year={2024}
}
```

πŸ–ΌοΈ Results

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <h2>CogVideoX-2B</h2> <video src="https://github.com/user-attachments/assets/6e05e678-88ad-499a-b31f-66679746f7b7" width="100%" controls autoplay loop></video> </td> <td> <h2>CogVideoX-2B-LiFT(Ours)</h2> <video src="https://github.com/user-attachments/assets/e45af501-8d89-4db0-8e4c-3a1e1b0e948b" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/a5a35d67-3ce1-415a-a7f4-c2e982b3b318" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/aea1c0ff-cc1c-476a-8c0e-7c4a34ed404d" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/8818d282-09e2-47df-9f50-92c6281c7da7" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/df1c487a-3a60-4ee2-b8ef-98fafed9bb09" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/59874ca4-d3df-4e76-a1bc-909f5d3424c5" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/f7ced2e8-7e68-4549-91b7-164d54a7bad3" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/c1930e74-b9e2-4df2-84a2-f51bcbf153fe" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/5d310ea7-ba24-4e83-8701-e2bb4217837d" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/b426c98a-6816-4fe1-aabf-cf9444262761" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/81ea3a02-979f-43a4-97ca-445f3414b51f" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/b51b211f-20ea-4895-b117-a147bc7f63a8" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/54e52501-087b-4127-9a3c-fd481c990820" width="100%" controls autoplay loop></video> </td> </tr> </table>

πŸ™ Acknowledgement

Our work builds on LLaVA and VILA; thanks to all their contributors!