<p align="center"> <img src="__assets__/ThisThat_logo.png" height=100> </p> <div align="center">

This&That: Language-Gesture Controlled Video Generation for Robot Planning

Paper | Website | HuggingFace Demo | HuggingFace Weights

</div>

This is the official implementation of the video generation part of This&That: Language-Gesture Controlled Video Generation for Robot Planning.

The robotics part can be found here.

🔥 Update | 👀 Visualization | 🔧 Installation | ⚡ Test | 🧩 Dataset Curation | 💻 Train

<a name="Update"></a>Update ๐Ÿ”ฅ๐Ÿ”ฅ๐Ÿ”ฅ

<!-- - [ ] Release the huggingface pretrained IssacGym-trained paper weight of This&That -->

:star: If you like This&That, please help star this repo. Thanks! :hugs:

<a name="Visualization"></a> Visualization ๐Ÿ‘€


https://github.com/user-attachments/assets/fc6b00c1-db7d-4278-8965-a6cf802a2b08


<a name="installation"></a> Installation ๐Ÿ”ง

conda create -n ttvdm python=3.10
conda activate ttvdm
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
git lfs install
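
Optionally, you can sanity-check the environment before running anything else. This is a generic PyTorch check, not a script from this repo, and only confirms that a CUDA device is visible:

```bash
# Generic check (not part of this repo): confirm PyTorch sees a CUDA device
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```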

<a name="fast_inference"></a> Fast Inference โšกโšกโšก

An interactive Gradio demo is available via:

  python app.py

This will use our v1.1 weight in VGL mode only. The Hugging Face online demo can be found here.
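
If you run the demo locally, Gradio prints a local URL in the terminal; serving on port 7860 is standard Gradio default behavior (not something specific to this repo), unless app.py configures it otherwise:

```bash
# Start the demo, then open the URL Gradio prints (typically http://127.0.0.1:7860)
python app.py
```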

<a name="regular_inference"></a> Regular Inference โšก

We provide an easy inference method that automatically downloads the pretrained weights and the required yaml file. A testing dataset with all the required formats can be found in the assets folder. Generated results are saved to generated_results. Feel free to explore the code structure; we won't go into too much detail here.

Note that the weight we currently provide is Bridge-trained; the IsaacGym-trained weight is different and will be provided later.

python test_code/inference.py --model_type GestureNet --huggingface_pretrained_path HikariDawn/This-and-That-1.1

The default arguments of test_code/inference.py run the sample images from the assets folder, and many settings are fixed. Please have a look at the available arguments.

Change --model_type to UNet for VL (Vision+Language), or to GestureNet for VGL (Vision+Gesture+Language); we recommend VGL for the best performance (see the example below).

We provide two model weights: V1.0 is the paper weight, and V1.1 has lightly fine-tuned hyperparameters for slightly better performance.
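
For example, both modes can be run against the same pretrained path and only --model_type changes (we assume here that the UNet weight ships under the same Hugging Face repo, as in the default setup):

```bash
# VGL (Vision + Gesture + Language), recommended
python test_code/inference.py --model_type GestureNet --huggingface_pretrained_path HikariDawn/This-and-That-1.1

# VL (Vision + Language)
python test_code/inference.py --model_type UNet --huggingface_pretrained_path HikariDawn/This-and-That-1.1
```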

<a name="curation"></a> Dataset Curation

For the training below, we preprocess the original Bridge dataset (a recursive folder structure) into a flat, single-folder style. If you want to use a different layout, you may need to write your own scripts or modify the DataLoader class under the data_loader folder.

To prepare the dataset, you can use our provided sample code:

python curation_pipeline/prepare_bridge_v1.py --dataset_path /path/to/Bridge/raw/bridge_data_v1/berkeley --destination_path XXX
python curation_pipeline/prepare_bridge_v2.py --dataset_path /path/to/Bridge/raw/bridge_data_v2 --destination_path XXX

For v1, you need to point inside the berkeley folder, but not for v2.

Gesture labelling also assumes the flat folder style above. First download our pretrained YOLO weight for gripper detection here, as well as the SAM1 weight (sam_vit_h_4b8939.pth). The default setting allows 14 frames with at most 4x acceleration duration (so 56 frames max). To execute, run the following (twice, once for the prepared V1 data and once for V2):

python curation_pipeline/select_frame_with_this_that.py --dataset_path XXX --destination_path XXX --yolo_pretarined_path XXX --sam_pretrained_path XXX
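
For instance, if the v1 and v2 data were prepared into separate flat folders, the two passes might look like the sketch below; every path is a placeholder to substitute with your own folders and downloaded weights:

```bash
# Placeholder paths -- substitute your own prepared folders and downloaded weights
python curation_pipeline/select_frame_with_this_that.py \
    --dataset_path /path/to/bridge_v1_flat --destination_path /path/to/bridge_v1_labelled \
    --yolo_pretarined_path /path/to/yolo_gripper.pt --sam_pretrained_path /path/to/sam_vit_h_4b8939.pth

python curation_pipeline/select_frame_with_this_that.py \
    --dataset_path /path/to/bridge_v2_flat --destination_path /path/to/bridge_v2_labelled \
    --yolo_pretarined_path /path/to/yolo_gripper.pt --sam_pretrained_path /path/to/sam_vit_h_4b8939.pth
```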

The validation files should have the same format as the training files. You can copy a few instances (usually 3-5) to serve as the validation dataset during training. We recommend checking the training code and yaml to set "validation_img_folder". Validation data for VL and VGL should not be mixed.
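
As a minimal sketch (the folder names below are placeholders, not a layout prescribed by the repo), copying a few instances out of the flat training folder into a validation folder could look like:

```bash
# Placeholder layout: copy 3-5 instances from the flat training folder into a validation folder
mkdir -p /path/to/validation_set
for instance in instance_0001 instance_0002 instance_0003; do
    cp -r "/path/to/training_set/$instance" /path/to/validation_set/
done
```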

<a name="training"></a> Training

For the Text+Image2Video training, edit line 14 of "config/train_image2video.yaml" to set the dataset path, and adjust the other settings to your preference. Also, set "num_processes" in "config/accelerate_config.json" to the number of GPUs you use, and check the other settings there, following the accelerate package.

accelerate launch --config_file config/accelerate_config.json --main_process_port 24532 train_code/train_svd.py

Set "--main_process_port" to what you need

For the Text+Image+Gesture-to-Video training, first edit line 16 of "config/train_image2video_controlnet.yaml" to set the dataset path. Then, edit "load_unet_path" on line 2 to point to your trained UNet weight. Read through the yaml settings file for finer control over training.

accelerate launch --config_file config/accelerate_config.json --main_process_port 24532 train_code/train_csvd.py

There are many details not shown here; please check the code and the yaml files.

:books: Citation

If you make use of our work, please cite our paper.

@article{wang2024language,
  title={This\&That: Language-Gesture Controlled Video Generation for Robot Planning},
  author={Wang, Boyang and Sridhar, Nikhil and Feng, Chao and Van der Merwe, Mark and Fishman, Adam and Fazeli, Nima and Park, Jeong Joon},
  journal={arXiv preprint arXiv:2407.05530},
  year={2024}
}

🤗 Acknowledgment

The current version of This&That is built on SVD. The Hugging Face Gradio demo is based on DragDiffusion.

We appreciate the authors for sharing their awesome codebase.