Home

Awesome

<div align="center"> <h2><font color="red"> ๐Ÿ•บ๐Ÿ•บ๐Ÿ•บ Follow-Your-Pose ๐Ÿ’ƒ๐Ÿ’ƒ๐Ÿ’ƒ </font></center> <br> <center>Pose-Guided Text-to-Video Generation using Pose-Free Videos (AAAI 2024)</h2>

Yue Ma*, Yingqing He*, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, and Qifeng Chen

<a href='https://arxiv.org/abs/2304.01186'><img src='https://img.shields.io/badge/ArXiv-2304.01186-red'></a> <a href='https://follow-your-pose.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> Open In Colab Hugging Face Spaces Open in OpenXLab visitors GitHub

</div> <!-- ![fatezero_demo](./docs/teaser.png) --> <table class="center"> <td><img src="gif_results/new_result_0830/a_man_in_the_park.gif"></td> <td><img src="gif_results/new_result_0830/a_Iron_man_in_the_street.gif"></td> <tr> <td width=25% style="text-align:center;">"The man is sitting on chair, on the park"</td> <td width=25% style="text-align:center;">"The Iron man, on the street "</td> <!-- <td width=25% style="text-align:center;">"Wonder Woman, wearing a cowboy hat, is skiing"</td> <td width=25% style="text-align:center;">"A man, wearing pink clothes, is skiing at sunset"</td> --> </tr> <td><img src="gif_results/new_result_0830/a_strom.gif"></td> <td><img src="gif_results/new_result_0830/a_astronaut_cartoon.gif"></td> <tr> <td width=25% style="text-align:center;">"The stormtrooper, in the gym "</td> <td width=25% style="text-align:center;">"The astronaut, earth background, Cartoon Style "</td> </tr> </table >

๐Ÿ’ƒ๐Ÿ’ƒ๐Ÿ’ƒ Demo Video

https://github.com/mayuelala/FollowYourPose/assets/38033523/e021bce6-b9bd-474d-a35a-7ddff4ab8e75

๐Ÿ’ƒ๐Ÿ’ƒ๐Ÿ’ƒ Abstract

<b>TL;DR: We tune the text-to-image model (e.g., stable diffusion) to generate the character videos from pose and text description.</b>

<details><summary>CLICK for full abstract</summary>

Generating text-editable and pose-controllable character videos have an imperious demand in creating various digital human. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and the generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image pose pair and pose-free video) and the pre-trained text-to-image (T2I) model to obtain the pose-controllable character videos. Specifically, in the first stage, only the keypoint-image pairs are used only for a controllable textto-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding the learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeps the editing and concept composition ability of the pre-trained T2I model. The code and models will be made publicly available.

</details>

๐Ÿ•บ๐Ÿ•บ๐Ÿ•บ Changelog

<!-- A new option store all the attentions in hard disk, which require less ram. -->

๐Ÿ’ƒ๐Ÿ’ƒ๐Ÿ’ƒ HuggingFace Demo

<table class="center"> <td><img src="https://user-images.githubusercontent.com/38033523/231338219-94b54b10-3fdc-4bf5-9e07-0c1ff236793a.png"></td> <td><img src="https://user-images.githubusercontent.com/38033523/231337960-a30db639-2ecc-486f-8d95-3b3e9c2ed338.png"></td> </tr> </table>

๐ŸŽค๐ŸŽค๐ŸŽค Todo

๐Ÿป๐Ÿป๐Ÿป Setup Environment

Our method is trained using cuda11, accelerator and xformers on 8 A100.

conda create -n fupose python=3.8
conda activate fupose

pip install -r requirements.txt

xformers is recommended for A100 GPU to save memory and running time.

<details><summary>Click for xformers installation </summary>

We find its installation not stable. You may try the following wheel:

wget https://github.com/ShivamShrirao/xformers-wheels/releases/download/4c06c79/xformers-0.0.15.dev0+4c06c79.d20221201-cp38-cp38-linux_x86_64.whl
pip install xformers-0.0.15.dev0+4c06c79.d20221201-cp38-cp38-linux_x86_64.whl
</details>

Our environment is similar to Tune-A-video (official, unofficial). You may check them for more details.

๐Ÿ’ƒ๐Ÿ’ƒ๐Ÿ’ƒ Training

We fix the bug in Tune-a-video and finetune stable diffusion-1.4 on 8 A100. To fine-tune the text-to-image diffusion models for text-to-video generation, run this command:

TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch \
    --multi_gpu --num_processes=8 --gpu_ids '0,1,2,3,4,5,6,7' \
    train_followyourpose.py \
    --config="configs/pose_train.yaml" 

๐Ÿ•บ๐Ÿ•บ๐Ÿ•บ Inference

Once the training is done, run inference:

TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch \
    --gpu_ids '0' \
    txt2video.py \
    --config="configs/pose_sample.yaml" \
    --skeleton_path="./pose_example/vis_ikun_pose2.mov"

You could make the pose video by mmpose , we detect the skeleton by HRNet. You just need to run the video demo to obtain the pose video. Remember to replace the background with black.

๐Ÿ’ƒ๐Ÿ’ƒ๐Ÿ’ƒ Local Gradio Demo

You could run the gradio demo locally, only need a A100/3090.

python app.py

then the demo is running on local URL: http://0.0.0.0:Port

๐Ÿ•บ๐Ÿ•บ๐Ÿ•บ Weight

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-4)

[FollowYourPose] We also provide our pretrained checkpoints in Huggingface. you could download them and put them into checkpoints folder to inference our models.

FollowYourPose
โ”œโ”€โ”€ checkpoints
โ”‚   โ”œโ”€โ”€ followyourpose_checkpoint-1000
โ”‚   โ”‚   โ”œโ”€โ”€...
โ”‚   โ”œโ”€โ”€ stable-diffusion-v1-4
โ”‚   โ”‚   โ”œโ”€โ”€...
โ”‚   โ””โ”€โ”€ pose_encoder.pth

๐Ÿ’ƒ๐Ÿ’ƒ๐Ÿ’ƒ Results

We show our results regarding various pose sequences and text prompts.

Note mp4 and gif files in this github page are compressed. Please check our Project Page for mp4 files of original video results.

<table class="center"> <tr> <td><img src="gif_results/new_result_0830/Trump_on_the_mountain.gif"></td> <td><img src="gif_results/new_result_0830/Trump_wear_yellow.gif"></td> <td><img src="gif_results/new_result_0830/astranaut_on_the_mountain.gif"></td> </tr> <tr> <td width=25% style="text-align:center;">"Trump, on the mountain "</td> <td width=25% style="text-align:center;">"man, on the mountain "</td> <td width=25% style="text-align:center;">"astronaut, on mountain"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/new_result_0830/a_girl_yellow.gif"></td> <td><img src="gif_results/new_result_0830/a_Iron_man_on_the_beach.gif"></td> <td><img src="gif_results/new_result_0830/hulk_on_the_mountain.gif"></td> </tr> <tr> <td width=25% style="text-align:center;">"girl, simple background"</td> <td width=25% style="text-align:center;">"A Iron man, on the beach"</td> <td width=25% style="text-align:center;">"A Hulk, on the mountain"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/new_result_0830/a_policeman_in_the_street.gif"></td> <td><img src="gif_results/new_result_0830/a_girl_in_the_forest.gif"></td> <td><img src="gif_results/new_result_0830/Iron_man_in_the_street.gif"></td> </tr> <tr> <td width=25% style="text-align:center;">"A policeman, on the street"</td> <td width=25% style="text-align:center;">"A girl, in the forest"</td> <td width=25% style="text-align:center;">"A Iron man, on the street"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/A Robot is dancing in Sahara desert.gif"></td> <td><img src="gif_results/sdsdA Iron man on the beach.gif"></td> <td><img src="gif_results/A Panda on the sea.gif"></td> </tr> <tr> <td width=25% style="text-align:center;">"A Robot, in Sahara desert"</td> <td width=25% style="text-align:center;">"A Iron man, on the beach"</td> <td width=25% style="text-align:center;">"A panda, son the sea"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/a man in the park, Van Gogh style.gif"></td> <td><img src="gif_results/fireman in the beach.gif"></td> <td><img src="gif_results/Batman brown background.gif"></td> </tr> <tr> </tr> <tr> <td width=25% style="text-align:center;">"A man in the park, Van Gogh style"</td> <td width=25% style="text-align:center;">"The fireman in the beach"</td> <td width=25% style="text-align:center;">"Batman, brown background"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/ertA Hulk on the sea .gif"></td> <td><img src="gif_results/lokA Superman on the forest.gif"></td> <td><img src="gif_results/A Iron man on the snow.gif"></td> </tr> <tr> </tr> <tr> <td width=25% style="text-align:center;">"A Hulk, on the sea"</td> <td width=25% style="text-align:center;">"A superman, in the forest"</td> <td width=25% style="text-align:center;">"A Iron man, in the snow"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/A man in the forest, Minecraft.gif"></td> <td><img src="gif_results/a man in the sea, at sunset.gif"></td> <td><img src="gif_results/James Bond, grey simple background.gif"></td> </tr> <tr> </tr> <tr> <td width=25% style="text-align:center;">"A man in the forest, Minecraft."</td> <td width=25% style="text-align:center;">"A man in the sea, at sunset"</td> <td width=25% style="text-align:center;">"James Bond, grey simple background"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/vryvA Panda on the sea.gif"></td> <td><img src="gif_results/nbthA Stormtrooper on the sea.gif"></td> <td><img src="gif_results/egbA astronaut on the moon.gif"></td> </tr> <tr> </tr> <tr> <td width=25% style="text-align:center;">"A Panda on the sea."</td> <td width=25% style="text-align:center;">"A Stormtrooper on the sea"</td> <td width=25% style="text-align:center;">"A astronaut on the moon"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/sssA astronaut on the moon.gif"></td> <td><img src="gif_results/ccA Robot in Antarctica.gif"></td> <td><img src="gif_results/zzzzA Iron man on the beach..gif"></td> </tr> <tr> </tr> <tr> <td width=25% style="text-align:center;">"A astronaut on the moon."</td> <td width=25% style="text-align:center;">"A Robot in Antarctica."</td> <td width=25% style="text-align:center;">"A Iron man on the beach."</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/yrvA Obama in the desert.gif"></td> <td><img src="gif_results/dfcA Astronaut on the beach.gif"></td> <td><img src="gif_results/sasqA Iron man on the snow.gif"></td> </tr> <tr> </tr> <tr> <td width=25% style="text-align:center;">"The Obama in the desert"</td> <td width=25% style="text-align:center;">"Astronaut on the beach."</td> <td width=25% style="text-align:center;">"Iron man on the snow"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/aaaA Stormtrooper on the sea.gif"></td> <td><img src="gif_results/A Iron man on the beach..gif"></td> <td><img src="gif_results/A astronaut on the moon.gif"></td> </tr> <tr> </tr> <tr> <td width=25% style="text-align:center;">"A Stormtrooper on the sea"</td> <td width=25% style="text-align:center;">"A Iron man on the beach."</td> <td width=25% style="text-align:center;">"A astronaut on the moon."</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/cdAstronaut on the beach.gif"></td> <td><img src="gif_results/cswSuperman on the forest.gif"></td> <td><img src="gif_results/cwIron man on the beach..gif"></td> </tr> <tr> </tr> <tr> <td width=25% style="text-align:center;">"Astronaut on the beach"</td> <td width=25% style="text-align:center;">"Superman on the forest"</td> <td width=25% style="text-align:center;">"Iron man on the beach"</td> </tr> </table> <!--#########################################################--> <table class="center"> <tr> <td><img src="gif_results/dfewcAstronaut on the beach.gif"></td> <td><img src="gif_results/ewA Robot in Antarctica.gif"></td> <td><img src="gif_results/vervA Stormtrooper on the sea.gif"></td> </tr> <tr> </tr> <tr> <td width=25% style="text-align:center;">"Astronaut on the beach"</td> <td width=25% style="text-align:center;">"Robot in Antarctica"</td> <td width=25% style="text-align:center;">"The Stormtroopers, on the beach"</td> </tr> </table> <!--#########################################################-->

๐ŸŽผ๐ŸŽผ๐ŸŽผ Citation

If you think this project is helpful, please feel free to leave a starโญ๏ธโญ๏ธโญ๏ธ and cite our paper:

@article{ma2023follow,
  title={Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos},
  author={Ma, Yue and He, Yingqing and Cun, Xiaodong and Wang, Xintao and Shan, Ying and Li, Xiu and Chen, Qifeng},
  journal={arXiv preprint arXiv:2304.01186},
  year={2023}
}

๐Ÿ‘ฏ๐Ÿ‘ฏ๐Ÿ‘ฏ Acknowledgements

This repository borrows heavily from Tune-A-Video and FateZero. thanks the authors for sharing their code and models.

๐Ÿ•บ๐Ÿ•บ๐Ÿ•บ Maintenance

This is the codebase for our research work. We are still working hard to update this repo and more details are coming in days. If you have any questions or ideas to discuss, feel free to contact Yue Ma or Yingqing He or Xiaodong Cun.

โญ๏ธโญ๏ธโญ๏ธ Star History

Star History Chart