HierVL: Learning Hierarchical Video-Language Embeddings

Official code of HierVL: Learning Hierarchical Video-Language Embeddings, CVPR 2023.

Introduction

HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. We pretrain on Ego4D narrations and summaries and also transfer the representations to Charades-Ego, EPIC-KITCHENS and HowTo100M.

Installation

To create a conda environment with the required dependencies, run the following commands:

conda env create -f environment.yml
source activate hiervl

Dataset Preparation

Please refer to the EgoVLP codebase for data preparation. We use the downsampled and chunked videos (the output of utils/video_chunk.py) as the input to our method. For summary sentences, we provide the processed summary and narration hierarchy here. The egosummary_full.csv file we used is available here.

Setting Correct Paths

All references to the datasets must be set correctly before running the code. To help with this, we have replaced every path with a suitable placeholder string and documented each one in PATHS. Use git grep path to find all occurrences of a given file path and replace it with your processed path.
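The search-and-replace step can also be scripted. The following is a minimal sketch and is not part of the repository: the function name, the file suffixes it scans, and the placeholder string used in the usage note are all assumptions, and the real placeholder names should be taken from PATHS.

```python
from pathlib import Path


def replace_placeholder(root, placeholder, real_path, suffixes=(".py", ".json", ".sh")):
    """Replace every occurrence of `placeholder` in matching files under `root`.

    Only files whose suffix is in `suffixes` are touched (an assumption about
    where paths live). Returns the list of files that were modified.
    """
    changed = []
    for file in sorted(Path(root).rglob("*")):
        if not file.is_file() or file.suffix not in suffixes:
            continue
        text = file.read_text(encoding="utf-8")
        if placeholder in text:
            file.write_text(text.replace(placeholder, real_path), encoding="utf-8")
            changed.append(str(file))
    return changed
```

For example, `replace_placeholder(".", "SOME_PLACEHOLDER", "/data/ego4d/chunked")` would rewrite a hypothetical placeholder named `SOME_PLACEHOLDER` everywhere under the current directory. Running `git diff` afterwards is a simple way to review what changed.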

Pretraining

We use four nodes for distributed training. Each node has 32 GB GPUs and 480 GB of CPU memory. Pretraining can be launched with:

python -m torch.distributed.launch  --nnodes=$HOST_NUM  --node_rank=$INDEX  --master_addr $CHIEF_IP  --nproc_per_node $HOST_GPU_NUM  --master_port 8081  run/train_egoaggregate.py --config configs/pt/egoaggregation.json
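For orientation, the launcher flags above determine how many processes are started and how each one is numbered. The sketch below is not taken from the repository; it only reproduces the standard arithmetic used by the PyTorch distributed launcher, where the global rank of a process is `node_rank * nproc_per_node + local_rank`. The function names are illustrative, and the GPU count per node in the usage note is an assumption, since it is not stated here.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of training processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of the process with index `local_rank` on node `node_rank`."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank
```

With four nodes and a hypothetical eight GPUs per node, `world_size(4, 8)` is 32, and the process with local rank 3 on the node with `--node_rank=2` has global rank `global_rank(2, 8, 3)`, which is 19.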

We mainly run our experiments on SLURM; instructions for running this code on SLURM are given next.

Running on SLURM cluster

To run the pretraining on a distributed SLURM system, copy the contents of slurm_scripts to this directory level and run

bash mover_trainer.sh job_name

The parameters of the SLURM job can be changed in the trainer.sh script. We use 4 nodes, each with 32 GB GPUs. The submission script first copies the required scripts to a different folder and then runs them from there. This copying ensures the code can be safely edited while a job is in the SLURM queue.
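The copy-then-run idea can be illustrated in a few lines. This is a sketch of the general pattern only, not the contents of mover_trainer.sh: the function name, the timestamped folder naming, and the ignored directory names are assumptions.

```python
import shutil
import time
from pathlib import Path


def snapshot_code(src_dir, jobs_root, job_name, ignore=(".git", "__pycache__")):
    """Copy `src_dir` into a fresh, timestamped folder under `jobs_root`.

    The queued job is then launched from the returned folder, so later edits
    to `src_dir` do not affect it. `jobs_root` should live outside `src_dir`.
    """
    stamp = time.strftime("%Y%m%d_%H%M%S")
    dest = Path(jobs_root) / f"{job_name}_{stamp}"
    shutil.copytree(src_dir, dest, ignore=shutil.ignore_patterns(*ignore))
    return dest
```

Because the job reads only from the snapshot, the working copy can keep changing while the job waits in the queue.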

Pretraining Checkpoint

The pretraining checkpoint is available here.

Configs for Baseline and Ablations

Change the following flags to run the baselines and ablations:

HierVL-Avg: Change self-attention to average in configs/pt/egoaggregation.json.
HierVL-w/o Joint: Set catastrophic_forgetting_baseline to True in trainer/trainer_egoaggregate.py.
HierVL-w/o Hier: Set append_summary_baseline to True in EgoClip_EgoMCQ_dataset.py and run EgoVLP pretraining.
HierVL-w/o Summ: Set only_sa_no_summary_baseline to True in trainer/trainer_egoaggregate.py.
HierVL-w/o Summ <-> Narr: Set only_video_with_summary_baseline to True in trainer/trainer_egoaggregate.py.
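To make the HierVL-Avg variant concrete, the sketch below shows what averaging clip-level embeddings into a single video-level embedding means. It is purely illustrative and not the repository's implementation, which operates on tensors inside the model; the function name and the use of plain Python lists are assumptions.

```python
def average_aggregate(clip_embeddings):
    """Return the element-wise mean of a list of equal-length embeddings."""
    if not clip_embeddings:
        raise ValueError("need at least one clip embedding")
    dim = len(clip_embeddings[0])
    if any(len(e) != dim for e in clip_embeddings):
        raise ValueError("all clip embeddings must have the same dimension")
    n = len(clip_embeddings)
    # Average each coordinate across clips; every clip gets equal weight.
    return [sum(e[i] for e in clip_embeddings) / n for i in range(dim)]
```

Unlike self-attention, this aggregation has no learned parameters and weights every clip equally.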

Downstream Task Training

To run the downstream tasks, modify the trainer.sh commands with the following flags:

--experiment charades --config configs/ft/charades.json for Charades-Ego Action Classification downstream training
--experiment epic_mir --config configs/ft/epic.json for EPIC-KITCHENS-100 MIR downstream training
--experiment howto100m --config configs/ft/howto100m.json for HowTo100M long video classification
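The flag combinations above can be kept in one place when scripting several runs. The helper below is a small convenience sketch and not part of the repository: the dictionary and function names are made up, while the experiment names and config paths are the ones listed above.

```python
# Experiment name -> finetuning config, as listed in this README.
DOWNSTREAM_CONFIGS = {
    "charades": "configs/ft/charades.json",
    "epic_mir": "configs/ft/epic.json",
    "howto100m": "configs/ft/howto100m.json",
}


def downstream_args(experiment):
    """Return the command-line flags for one downstream experiment."""
    try:
        config = DOWNSTREAM_CONFIGS[experiment]
    except KeyError:
        known = ", ".join(sorted(DOWNSTREAM_CONFIGS))
        raise ValueError(f"unknown experiment {experiment!r}; expected one of: {known}")
    return ["--experiment", experiment, "--config", config]
```

For example, `" ".join(downstream_args("epic_mir"))` gives `--experiment epic_mir --config configs/ft/epic.json`, which can be pasted into the command in trainer.sh.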

Downstream Task Testing

Charades-Ego Action Classification

To test the performance, run

python run/test_charades.py

Remember to use the released finetuned checkpoint here or zero-shot checkpoint here.

EPIC-KITCHENS-100 Multi-Instance Retrieval

To test the performance, run

python run/test_epic.py

Remember to use the released finetuned checkpoint here or zero-shot checkpoint here.

Issues

Please open an issue in this repository (preferred for better visibility) or reach out to kumar.ashutosh@utexas.edu.

Contributing

See the CONTRIBUTING file for how to help out.

Citation

If you use the code or the method, please cite the following paper:

@InProceedings{Ashutosh_2023_CVPR,
    author    = {Ashutosh, Kumar and Girdhar, Rohit and Torresani, Lorenzo and Grauman, Kristen},
    title     = {HierVL: Learning Hierarchical Video-Language Embeddings},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23066-23078}
}

Acknowledgement

The pretraining code and the Charades-Ego and EPIC-KITCHENS finetuning code are based on the EgoVLP repository. Ego4D LTA is based on the Ego4D Baseline Code. We thank the authors and maintainers of these codebases.

License

HierVL is licensed under the CC-BY-NC license.