LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
ECCV 2024 (Oral, Best Paper Finalist)
Project Page | Paper | Dataset
Update
[10/01] Our paper was nominated as a Best Paper Finalist.
[08/15] Our work was selected for an oral presentation at ECCV 2024.
[08/15] We have released our model weights and completed the README with detailed guidance.
[07/21] <font color=red>Our dataset has been released!</font>
<img src='https://bolinlai.github.io/Lego_EgoActGen/figures/visualization_new_actions.png'/>
Contents
- Setup
- Dataset
- GPT-Curated Data (Optional)
- VLLM Features
- Model Weights
- Train and Inference
- Metrics
- BibTeX
- Acknowledgement
TODO
- Move model weights to Hugging Face for easier download.
- Implement inference on a single sample.
Setup
Due to the incompatibility between the VLLM and LDM packages, we use a separate environment for each model.
Install the dependencies for VLLM.
conda env create -f vllm_env.yaml # The env name is "vllm".
conda activate vllm
pip install flash-attn==2.0.7 --no-build-isolation
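As an optional sanity check (not part of the original setup), you can verify from inside the activated vllm environment that the key packages import cleanly, e.g. with a short Python snippet like the one below.
# Optional sanity check for the "vllm" environment (assumes the packages installed above).
import torch
import flash_attn

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())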
Install the dependencies for LDM.
conda env create -f ldm_env.yaml # The env name is "ldm".
Dataset
Run the command below to download our dataset, or download it via this link and unzip EgoGen.zip to your local path.
bash scripts/download_dataset.sh [your_local_path]  # replace [your_local_path] with your local download path
Our dataset is composed of video frames (in EgoGen.zip) and action labels/descriptions (in *.json) from Ego4D and Epic-Kitchens.
The structure of the dataset is as follows. Note that val_gt_for_metric only contains the ground truth images of val; it is used solely for metric calculation (FID score) and is not involved in training or inference.
[your_local_path]
|
|-- EgoGen
| |
| |-- ego4d.fho
| | |-- train
| | |-- val
| | |-- val_gt_for_metric
| |
| |-- epickitchen
| |-- train
| |-- val
| |-- val_gt_for_metric
|
|-- ego4d_train.json # ego4d image paths and actions for training
|-- ego4d_val.json # ego4d image paths and actions for evaluation
|-- ego4d_metadata.json # ego4d metadata of selected video frames
|-- epickitchen_train.json # epickitchens image paths and actions for training
|-- epickitchen_val.json # epickitchens image paths and actions for evaluation
|-- epickitchen_metadata.json # epickitchens metadata of selected video frames
The test sets of Ego4D and Epic-Kitchens are hidden, so we use the validation sets as test sets in our experiments. In ego4d_train.json, ego4d_val.json, epickitchen_train.json and epickitchen_val.json, image_0 and image_1 denote the source and target images for action frame generation, and action refers to the original action labels/descriptions in Ego4D and Epic-Kitchens. We also release the enriched action descriptions (llava_forecast_finetune) generated by the LLM. In ego4d_metadata.json and epickitchen_metadata.json, we release the metadata (e.g., fps, resolution, object bbox, etc.) of the selected video frames.
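If you want to sanity-check the annotation files programmatically, the snippet below is a minimal sketch; the exact nesting of the json is an assumption, so print the keys first and adapt as needed.
# Minimal sketch for inspecting the annotation files; the exact nesting is an
# assumption, so print the keys first and adapt accordingly.
import json

with open("[your_local_path]/ego4d_train.json") as f:
    annotations = json.load(f)

# The file may be a list of records or a dict keyed by sample id.
records = annotations if isinstance(annotations, list) else list(annotations.values())
sample = records[0]
print(sample.keys())

# Fields described above: source/target frames, original action label,
# and the LLM-enriched description.
print(sample.get("image_0"), sample.get("image_1"))
print(sample.get("action"))
print(sample.get("llava_forecast_finetune"))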
GPT-Curated Data (Optional)
We release our detailed action descriptions curated from GPT-3.5, which are used for VLLM instruction tuning. You can download them by running the following command, or from this link.
bash scripts/download_gpt_curated_data.sh
Note: This step is only necessary if you want to finetune the VLLM component in LEGO. Otherwise, you can directly use the released VLLM weights for inference.
VLLM Features
We also release the VLLM image and text features, so you can train the LDM component without running VLLM inference. You can download and unzip all features by running this command, or from this link.
Please make sure there is at least 500GB of free space on your machine before downloading.
bash scripts/download_vllm_features.sh [your_local_path]  # replace [your_local_path] with your local download path
If you download from the link, please unzip each zip file in the same folder using unzip xxx.zip. For the Ego4D image features, we split the data into .zip and .z01 files. Put them in the same folder and run zip -s0 llava-llama-2-13b-chat-forecasting-finetune-ckpt450-split.zip --out merge.zip to merge them into a single zip file. Then unzip merge.zip.
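After unzipping, you can optionally spot-check the features with a short script like the sketch below. It assumes each feature is stored as an individual file readable by torch.load; adjust the placeholder path and glob pattern to the actual layout you see on disk.
# Optional spot-check of the downloaded VLLM features (assumes per-sample files
# readable by torch.load; adjust paths/patterns to the actual layout).
import glob
import os
import torch

feature_dir = "[your_local_path]/vllm_features"  # placeholder path
files = [p for p in glob.glob(os.path.join(feature_dir, "**", "*"), recursive=True) if os.path.isfile(p)]
print(len(files), "feature files found")

sample = torch.load(files[0], map_location="cpu")
print(type(sample), getattr(sample, "shape", None))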
Model Weights
Please download our released model weights via the links below.
| | Ego4D | Epic-Kitchens |
| --- | --- | --- |
| VLLM | download | download |
| LDM | download | download |
| LDM (scaleup) | download | download |
VLLM and LDM are each trained on only one of the datasets, while LDM (scaleup) refers to the latent diffusion models trained on both the Ego4D and Epic-Kitchens training sets, which yields better performance. We release two LDM (scaleup) checkpoints that achieve the best performance on the two test sets, respectively. At inference time, you can load an LDM (scaleup) checkpoint in the same way as a regular LDM checkpoint.
Train and Inference
We train the VLLM and LDM components separately. Both are trained on 8x 40GB A100 GPUs.
Note: As a quick start, you can directly use our released enriched action descriptions (included in our dataset) and VLLM features to skip VLLM instruction tuning and inference, then jump to LDM Training.
VLLM Training
Preparation
Activate the vllm virtual environment.
conda activate vllm
Download the pretrained LLaVA weights by running
wget -O [your_path]/llava_pretrained.zip "https://www.dropbox.com/scl/fi/q5yy8znjirymfe9kte2a2/llava_pretrained.zip?rlkey=qbskcxd85qxg5jphb50lvxd4a&st=qwdxpg2o&dl=1"
unzip [your_path]/llava_pretrained.zip -d [your_path]
rm [your_path]/llava_pretrained.zip
Before running the script, you have to update the paths of the dataset and pretrained weights in vllm/scripts/finetune_ego4d.sh (for training on Ego4D) and vllm/scripts/finetune_epickitchen.sh (for training on Epic-Kitchens) to your local paths.
--model_name_or_path: The path of the pretrained VLLM checkpoint for initialization.
--data_path: The path of the detailed action descriptions curated from GPT-3.5.
--image_folder: The path of the Ego4D/Epic-Kitchens training data (i.e., video frames).
--output_dir: The path to save checkpoints.
Train VLLM on Ego4D
Then run the command below.
bash vllm/scripts/finetune_ego4d.sh
Train VLLM on Epic-Kitchens
Then run the command below.
bash vllm/scripts/finetune_epickitchen.sh
VLLM Inference
Preparation
Activate the vllm virtual environment.
conda activate vllm
To speed up inference, we divide the data into 5 chunks and run inference on them separately. There are two ways to run inference on Ego4D or Epic-Kitchens.
(1) Use Slurm
If you are using slurm to launch jobs, before running the script, you have to update the paths in vllm/scripts/sbatch_inference.sh to your local paths.
model_path: The path of instruction-tuned VLLM weights.
image_dir: The path of video frames.
action_label: The path of action labels (i.e., *.json files downloaded with video frames).
save_path: The path to save generated enriched action descriptions.
save_image_feature_path: The path to save VLLM image features.
save_text_feature_path: The path to save VLLM text features.
The configuration of slurm can be edited in vllm/scripts/sbatch_inference/inference_sbatch_*.sh. Then run the command below and check the logs in vllm/out/logs.
bash vllm/scripts/sbatch_inference.sh
Then merge the outputs into one json file using
python vllm/scripts/merge_inference_results.py
(2) Without Slurm
You need to manually run the inference on each chunk. You can use --num-chunks
to control how many chunks the data will be divided into, and --chunk-idx
to specify which chunk to be used for inference (e.g., --num-chunks=5 --chunk-idx=3
means dividing data into 5 chunks and run inference on the third chunk). The paths should be changed to your local paths as elaborated above.
export PYTHONPATH=$PYTHONPATH:./vllm
python -m llava.eval.run_llava_in_loop \
--model-path /fsx-project/bolinlai/Release/checkpoints/VLLM/ego4d/llava-llama-2-13b-chat-forecasting-finetune \
--image-dir /fsx-project/bolinlai/Release/dataset/EgoGen/ego4d.fho/val \
--action-label /fsx-project/bolinlai/Release/dataset/ego4d_val.json \
--query "How does the person properly {} that is displayed in the video frame?" \
--save-path /fsx-project/bolinlai/Release/vllm_output/ego4d/llava-llama-2-13b-chat-forecasting-finetune/val \
--save-image-feature-path /fsx-project/bolinlai/Release/vllm_features/ego4d/vllm_image_features/llava-llama-2-13b-chat-forecasting-finetune/val \
--save-text-feature-path /fsx-project/bolinlai/Release/vllm_features/ego4d/vllm_text_features/llava-llama-2-13b-chat-forecasting-finetune/val \
--seed 42 \
--num-chunks 5 \
--chunk-idx 1
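If you prefer not to launch the five chunks by hand, the sketch below loops over the chunk indices sequentially with the same flags as the command above. This driver is our own suggestion rather than a script shipped with the repo, the paths are placeholders, and running chunks one after another gives up the speed-up of parallel jobs; the 1-based chunk indexing follows the example above.
# Hypothetical sequential driver for the command above (not part of the repo).
# Replace the placeholder paths with your local ones.
import os
import subprocess

env = dict(os.environ)
env["PYTHONPATH"] = env.get("PYTHONPATH", "") + ":./vllm"

num_chunks = 5
for chunk_idx in range(1, num_chunks + 1):  # 1-based, as in the example above
    subprocess.run([
        "python", "-m", "llava.eval.run_llava_in_loop",
        "--model-path", "[your_path]/VLLM/ego4d/llava-llama-2-13b-chat-forecasting-finetune",
        "--image-dir", "[your_local_path]/EgoGen/ego4d.fho/val",
        "--action-label", "[your_local_path]/ego4d_val.json",
        "--query", "How does the person properly {} that is displayed in the video frame?",
        "--save-path", "[your_path]/vllm_output/ego4d/val",
        "--save-image-feature-path", "[your_path]/vllm_image_features/ego4d/val",
        "--save-text-feature-path", "[your_path]/vllm_text_features/ego4d/val",
        "--seed", "42",
        "--num-chunks", str(num_chunks),
        "--chunk-idx", str(chunk_idx),
    ], check=True, env=env)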
Then merge the outputs into one json file using
python vllm/scripts/merge_inference_results.py
LDM Training
Preparation
Activate the ldm virtual environment.
conda activate ldm
Download the pretrained Stable Diffusion weights by running
wget -O [your_path]/stable_diffusion.zip "https://www.dropbox.com/scl/fi/773bpwnb2m4db2uvo0d64/stable_diffusion.zip?rlkey=qgk8mg5j4hrqqbsxkz0gt0os7&st=b5wltovy&dl=1"
unzip [your_path]/stable_diffusion.zip -d [your_path]
rm [your_path]/stable_diffusion.zip
Before launching the job, you have to update the paths in configs/train_ego4d.yaml and configs/train_epickitchen.yaml for training on Ego4D and Epic-Kitchens, respectively (a quick way to double-check these paths is sketched after the list below).
model.params.ckpt_path: The path of the pretrained latent diffusion model weights for initialization.
data.params.train.params.data_path: The path of video frames in the training set.
data.params.train.params.edit_path: The path of action descriptions in the training set (i.e., ego4d_train.json or epickitchen_train.json downloaded with video frames; alternatively, you can replace the action descriptions with the VLLM output you generated in VLLM Inference).
data.params.train.params.additional_cond_path: The paths of VLLM image and text features in the training set (our released features, or features you generated in VLLM Inference).
data.params.validation.params.data_path: The path of video frames in the val set.
data.params.validation.params.edit_path: The path of action descriptions in the val set (i.e., ego4d_val.json or epickitchen_val.json downloaded with video frames; alternatively, you can replace the action descriptions with the VLLM output you generated in VLLM Inference).
data.params.validation.params.additional_cond_path: The paths of VLLM image and text features in the val set (our released features, or features you generated in VLLM Inference).
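Before launching training, a quick path check like the sketch below can save a failed run. It assumes omegaconf is available in the ldm environment (a common dependency of latent-diffusion-style codebases, but please verify); the additional_cond_path entries are skipped here because they may hold multiple paths.
# Minimal sketch (not part of the repo) to verify the paths set in
# configs/train_ego4d.yaml before launching training.
import os
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/train_ego4d.yaml")

paths_to_check = [
    cfg.model.params.ckpt_path,
    cfg.data.params.train.params.data_path,
    cfg.data.params.train.params.edit_path,
    cfg.data.params.validation.params.data_path,
    cfg.data.params.validation.params.edit_path,
]
for p in paths_to_check:
    print(p, "->", "ok" if os.path.exists(p) else "MISSING")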
Train LDM on Ego4D
Run the command below. The checkpoints will be saved in logs/.
python main.py --name lego_ego4d --base configs/train_ego4d.yaml --train --gpus 0,1,2,3,4,5,6,7
Train LDM on Epic-Kitchens
Run the command below. The checkpoints will be saved in logs/.
python main.py --name lego_epickitchens --base configs/train_epickitchen.yaml --train --gpus 0,1,2,3,4,5,6,7
LDM Inference
Preparation
Activate the ldm virtual environment.
conda activate ldm
To speed up inference, we divide the data into 8 chunks and run inference on them separately. Similar to training, you need to update the paths in configs/generate_ego4d.yaml and configs/generate_epickitchen.yaml for Ego4D and Epic-Kitchens inference, respectively.
data.metadata_path: The path of metadata (released in our dataset).
data.params.data_path: The path of video frames in the val set.
data.params.edit_path: The path of action descriptions in the val set (i.e., ego4d_val.json or epickitchen_val.json).
data.params.additional_cond_path: The paths of VLLM image and text features in the val set.
Run LDM inference on Ego4D
(1) Use Slurm
Edit the slurm configuration in sbatch_inference/ego4d_inference/test_ego4d_sbatch_*.sh to comply with your cluster.
Run the following command with your local path to the checkpoint and check the logs in logs/out. Generated images will be saved under the same directory as the checkpoint.
bash test_ego4d.sh logs/ego4d_diffusion_with_vllm_feature.ckpt
(2) Without Slurm
You need to manually run inference on each chunk with the command below. --n_chunk 8 --chunk_idx 1 means dividing the data into 8 chunks and running inference on the first chunk.
python metrics/inference.py --config configs/generate_ego4d.yaml --ckpt logs/ego4d_diffusion_with_vllm_feature.ckpt --n_chunk 8 --chunk_idx 1
Run LDM inference on Epic-Kitchens
(1) Use Slurm
Edit the slurm configuration in sbatch_inference/epickitchen_inference/test_epickitchen_sbatch_*.sh to comply with your cluster.
Run the following command with your local path to the checkpoint and check the logs in logs/out. Generated images will be saved under the same directory as the checkpoint.
bash test_epickitchen.sh logs/epickitchen_diffusion_with_vllm_feature.ckpt
(2) Without Slurm
You need to manually run inference on each chunk with the command below. --n_chunk 8 --chunk_idx 1 means dividing the data into 8 chunks and running inference on the first chunk. Generated images will be saved under the same directory as the checkpoint.
python metrics/inference.py --config configs/generate_epickitchen.yaml --ckpt logs/epickitchen_diffusion_with_vllm_feature.ckpt --n_chunk 8 --chunk_idx 1
Metrics
Preparation
Activate the ldm virtual environment.
conda activate ldm
If this is your first time running metric calculation, you need to download some model weights by running
bash scripts/download_metric_weights.sh
Calculate metrics on Ego4D
Replace the following paths with your local paths and run the command.
--gen_path: The path of generated action frames.
--gt_path: The path of val_gt_for_metric (downloaded with the dataset).
--edit_file: The path of action descriptions (i.e., ego4d_val.json).
python metrics/all_metrics_in_one.py --dataset ego4d --llava_key llava_forecast_finetune --gen_path logs/ego4d_diffusion_with_vllm_feature-e=0-s=100-si=1.5/images --gt_path /fsx-project/bolinlai/Release/dataset/EgoGen/ego4d.fho/val_gt_for_metric --edit_file /fsx-project/bolinlai/Release/dataset/ego4d_val.json
Calculate metrics on Epic-Kitchens
Similarly, update the paths (as above) and run the command.
python metrics/all_metrics_in_one.py --dataset epickitchen --llava_key llava_forecast_finetune --gen_path logs/epickitchen_diffusion_with_vllm_feature-e=0-s=100-si=1.5/images --gt_path /fsx-project/bolinlai/Release/dataset/EgoGen/epickitchen/val_gt_for_metric --edit_file /fsx-project/bolinlai/Release/dataset/epickitchen_val.json
BibTeX
If you find LEGO useful for your work, please cite using this BibTeX.
@inproceedings{lai2024lego,
title={Lego: Learning egocentric action frame generation via visual instruction tuning},
author={Lai, Bolin and Dai, Xiaoliang and Chen, Lawrence and Pang, Guan and Rehg, James M and Liu, Miao},
booktitle={European Conference on Computer Vision},
pages={135--155},
year={2024},
organization={Springer}
}
Acknowledgement
Our code was built on LLaVA and InstructPix2Pix. We appreciate the authors of the two awesome codebases.