HierVL: Learning Hierarchical Video-Language Embeddings

Official code of HierVL: Learning Hierarchical Video-Language Embeddings, CVPR 2023.

Project page | arXiv

Teaser figure

Introduction

HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. We pretrain on Ego4D narrations and summaries and also transfer the representations to Charades-Ego, EPIC-KITCHENS and HowTo100M.

Installation

To create a conda environment with the required dependencies, run the following commands:

conda env create -f environment.yml
source activate hiervl
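
As a quick sanity check that the environment resolved correctly (a minimal sketch, assuming environment.yml pins PyTorch):

# Verify PyTorch imports and sees the GPUs
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"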

Dataset Preparation

Please refer to the EgoVLP codebase for data preparation. We use the downsampled and chunked videos (the output of utils/video_chunk.py) as the input to our method. For summary sentences, we provide the processed summary and narration hierarchy here. The egosummary_full.csv file we use is available here.
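
As a rough sketch of that step (utils/video_chunk.py lives in the EgoVLP codebase; the CSV location below is illustrative):

# EgoVLP utility: downsample and chunk the raw videos
python utils/video_chunk.py
# Inspect the first rows of the summary hierarchy file
head -n 3 egosummary_full.csv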

Setting Correct Paths

All references to the datasets must be set correctly to run the code. To help with this process, we have replaced each such path with a suitable placeholder string and documented them in PATHS. Use git grep <placeholder> to find all occurrences of a placeholder and replace it with your processed path, as sketched below.
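
For example (a sketch; DATASET_ROOT is a hypothetical placeholder, see PATHS for the real strings, and the target path is illustrative):

# List every file that still contains the placeholder
git grep -l DATASET_ROOT
# Replace it everywhere in-place (GNU sed)
git grep -l DATASET_ROOT | xargs sed -i 's|DATASET_ROOT|/data/hiervl/ego4d_chunked|g'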

Pretraining

We use four nodes for distributed training, each with 32 GB GPUs and 480 GB of CPU memory. The pretraining can be launched as:

python -m torch.distributed.launch \
  --nnodes=$HOST_NUM \
  --node_rank=$INDEX \
  --master_addr $CHIEF_IP \
  --nproc_per_node $HOST_GPU_NUM \
  --master_port 8081 \
  run/train_egoaggregate.py --config configs/pt/egoaggregation.json
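
These environment variables must be set on each node before the command above runs; the values below are illustrative for the four-node setup (the per-node GPU count is our assumption, match it to your hardware):

export HOST_NUM=4          # total number of nodes
export HOST_GPU_NUM=8      # GPUs per node (assumption)
export CHIEF_IP=10.0.0.1   # reachable IP of the rank-0 node
export INDEX=0             # this node's rank: 0, 1, 2 or 3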

We experiment mainly on SLURM; instructions for running this code on a SLURM cluster are given next.

Running on SLURM cluster

To run the pretraining on a distributed SLURM system, copy the contents of slurm_scripts to this directory level and run

bash mover_trainer.sh job_name

The parameters of the SLURM job can be changed in the trainer.sh script. We use four nodes, each with 32 GB GPUs. The submission script first copies the required scripts to a separate folder and then runs them from there; this copying ensures the code can be safely edited while a job waits in the SLURM queue. A minimal job-script sketch follows.
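
A minimal sbatch sketch consistent with the setup above; the GPU count per node and the time limit are assumptions, and the authoritative knobs live in trainer.sh itself:

#!/bin/bash
#SBATCH --job-name=hiervl_pretrain
#SBATCH --nodes=4                 # four nodes, as described above
#SBATCH --gres=gpu:8              # GPUs per node (assumption)
#SBATCH --time=48:00:00           # illustrative wall-clock limit
srun bash trainer.sh              # trainer.sh holds the actual launch command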

Pretraining Checkpoint

The pretraining checkpoint is available here.
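
Before wiring the checkpoint into a config, a quick load check can catch download issues (a sketch; the filename is illustrative and the checkpoint's key layout is our assumption):

# Confirm the file deserializes and peek at its top-level keys
python -c "import torch; ckpt = torch.load('hiervl_pretrained.pth', map_location='cpu'); print(list(ckpt.keys())[:5])"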

Configs for Baseline and Ablations

Change the following flags to run the baselines and ablations:

Downstream Task Training

To run the downstream tasks, modify the trainer.sh commands with the following flags:

Downstream Task Testing

Charades-Ego Action Classification

To test the performance, run

python run/test_charades.py

Remember to use the released finetuned checkpoint here or the zero-shot checkpoint here.
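
If it is unclear where the script picks up its checkpoint, a quick search over the script and configs helps; the same applies to the EPIC-KITCHENS test below:

# Locate where the test script (or its config) references a checkpoint path
git grep -n "checkpoint" run/test_charades.py configs/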

EPIC-KITCHENS-100 Multi-Instance Retrieval

To test the performance, run

python run/test_epic.py

Remember to use the released finetuned checkpoint here or the zero-shot checkpoint here.

Issues

Please open an issue in this repository (preferred for better visibility) or reach out to kumar.ashutosh@utexas.edu.

Contributing

See the CONTRIBUTING file for how to help out.

Citation

If you use the code or the method, please cite the following paper:

@InProceedings{Ashutosh_2023_CVPR,
    author    = {Ashutosh, Kumar and Girdhar, Rohit and Torresani, Lorenzo and Grauman, Kristen},
    title     = {HierVL: Learning Hierarchical Video-Language Embeddings},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23066-23078}
}

Acknowledgement

The pretraining codebase and the Charades-Ego and EPIC-KITCHENS finetuning codebases are based on the EgoVLP repository. Ego4D LTA is based on the Ego4D Baseline Code. We thank the authors and maintainers of these codebases.

License

HierVL is licensed under the CC-BY-NC license.