EgoVLP: Egocentric Video-Language Pretraining

Project page | arXiv

TL;DR: We pioneer Egocentric Video-Language Pretraining across the pretraining dataset, model, and development benchmark; the resulting pretrained model exhibits strong performance on five downstream tasks across three egocentric datasets.


📢 News

📝 Preparation

Install dependencies

conda env create -f environment.yml
source activate egovlp
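
After activating the environment, a quick sanity check (a minimal snippet, assuming environment.yml installs PyTorch with CUDA support) confirms that the GPUs are visible:

```python
# Sanity check after `source activate egovlp`: PyTorch import and GPU visibility.
import torch

print(torch.__version__)          # version installed from environment.yml
print(torch.cuda.is_available())  # expected True on a GPU machine
print(torch.cuda.device_count())  # number of visible GPUs
```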

Ego4D videos and metadata

You can skip the source video download if pretraining is not required.

  1. Follow the guideline here and download the following to {PATH_TO_EGO4D}:

    • Ego4D source videos (nearly 7 TB).
    • Ego4D video metadata manifest.csv and benchmark metadata, e.g., nlq_train.json for NLQ.
    • Create the directory dataset and add a soft link via ln -s {PATH_TO_EGO4D} dataset/ego4d.
  2. For effective pretraining, we compress the videos in two steps (a hedged sketch of both steps follows this list):

    • Resize the source videos so that the short side equals 256 via the script utils/video_resize.py.
    • Chunk the resized videos into segments of up to 600 seconds via the script utils/video_chunk.py.
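
If you want to reproduce these two steps outside the provided scripts, the sketch below drives ffmpeg from Python. It is a minimal sketch under stated assumptions: ffmpeg is on PATH, and utils/video_resize.py / utils/video_chunk.py remain the reference implementations (their exact flags may differ).

```python
# Hedged sketch of the resize + chunk preprocessing. The repository scripts
# utils/video_resize.py and utils/video_chunk.py are the reference implementations.
import subprocess
from pathlib import Path


def resize_short_side(src: Path, dst: Path, short_side: int = 256) -> None:
    """Resize a video so its shorter side equals `short_side`, keeping the aspect ratio."""
    scale = f"scale='if(lt(iw,ih),{short_side},-2)':'if(lt(iw,ih),-2,{short_side})'"
    subprocess.run(["ffmpeg", "-y", "-i", str(src), "-vf", scale, str(dst)], check=True)


def chunk_video(src: Path, out_dir: Path, segment_sec: int = 600) -> None:
    """Split a video into segments of at most `segment_sec` seconds without re-encoding."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-c", "copy", "-map", "0",
         "-f", "segment", "-segment_time", str(segment_sec), "-reset_timestamps", "1",
         str(out_dir / f"{src.stem}_%03d.mp4")],
        check=True,
    )
```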

EgoClip: an egocentric video-language pretraining dataset

^ The tags tag_verb and tag_noun are used by the EgoNCE pretraining objective, which takes synonyms into account. For example, pick, collect, and gather all belong to the verb parent with idx 93: take_(pick,_grab,_get). The mapping dictionary can be found here.
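
To make the role of these tags concrete, here is an illustrative sketch of how shared parent classes can mark two narrations as positives for EgoNCE. The field names and the noun indices are assumptions for illustration only; the actual EgoClip schema and taxonomy files are the reference.

```python
# Illustrative only: how tag_verb / tag_noun parent classes can define EgoNCE positives.
# Field names and the noun indices below are assumptions, not the real EgoClip schema.
def share_parent(tags_a: set, tags_b: set) -> bool:
    """True if two narrations share at least one parent class index."""
    return bool(tags_a & tags_b)


# "pick", "collect", and "gather" all map to verb parent 93: take_(pick,_grab,_get)
narr_a = {"text": "picks up a cup",    "tag_verb": {93}, "tag_noun": {10}}
narr_b = {"text": "gathers the tools", "tag_verb": {93}, "tag_noun": {57}}

is_positive = (share_parent(narr_a["tag_verb"], narr_b["tag_verb"])
               or share_parent(narr_a["tag_noun"], narr_b["tag_noun"]))
print(is_positive)  # True: both verbs fall under parent 93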

EgoMCQ: an egocentric video-language development set


🏋️‍️ Pretraining

This code is built on PyTorch with DistributedDataParallel (DDP). We pretrain EgoVLP on 4 nodes, each with 8 A100 GPUs; 10 epochs take about two days. A generic launch skeleton follows below.
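
The launch command depends on your cluster setup, so below is only a generic DDP training skeleton (placeholder model, dataset, and loss; not the project's actual entry point under run/) showing the distributed plumbing a torchrun launch needs.

```python
# Generic DDP skeleton for a torchrun launch. Model, dataset, and loss are placeholders;
# EgoVLP's real training entry point and configs live in the repository.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    dist.init_process_group("nccl")                      # rendezvous via env:// from torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 256).cuda(local_rank)   # placeholder for the video-text model
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 512))      # placeholder for the EgoClip dataset
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
    for epoch in range(10):
        sampler.set_epoch(epoch)                         # reshuffle shards across ranks
        for (x,) in loader:
            loss = model(x.cuda(local_rank)).pow(2).mean()  # placeholder loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A single-node run of this skeleton would look like `torchrun --nproc_per_node=8 train_sketch.py`; multi-node runs additionally set `--nnodes`, `--node_rank`, and the rendezvous address.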

🗄 Pretrained Weights

^ This checkpoint is used for the EPIC-Kitchens, NLQ, MQ, OSCC, and PNR tasks, but not for Charades-Ego. Since we found that VLP (CC3M+WebVid2M, EgoClip) always degrades significantly on Charades-Ego after the first epoch, we evaluate Charades-Ego with the weights from the first pretraining epoch of EgoVLP, i.e., EgoVLP_PT_EPO1.

^^ You can use our checkpoint to power other egocentric video benchmarks. :)
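
A minimal sketch for inspecting one of these checkpoints is below. The file name and the key layout (a possible "state_dict" entry and "module." prefixes left by DDP) are assumptions; adjust them to the file you downloaded.

```python
# Hedged sketch: inspect a downloaded EgoVLP checkpoint and strip DDP prefixes.
# The file name and the "state_dict" / "module." layout are assumptions.
import torch

ckpt = torch.load("EgoVLP_PT_BEST.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
state_dict = {k.replace("module.", "", 1): v for k, v in state_dict.items()}

# With the model built from this repo's configs:
# missing, unexpected = model.load_state_dict(state_dict, strict=False)
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```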

🔧 Downstream Tasks

EPIC-Kitchens MIR

  1. Follow the instructions here and download the EPIC-Kitchens dataset (RGB frames) and annotations to dataset/epic-kitchens/.
  2. Follow the instructions here -> How do I create the relevance matrix? to construct the relevance matrix for evaluation.

| Model | Mode | # Frames | Video-Text PT | Weights | mAP (V2T) | mAP (T2V) | mAP (Avg) | nDCG (V2T) | nDCG (T2V) | nDCG (Avg) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EgoVLP | Zero-shot | 4 | EgoClip w/ EgoNCE | EgoVLP_PT_BEST | 19.4 | 13.9 | 16.6 | 24.1 | 22.0 | 23.1 |
| EgoVLP | Fine-tuning w/ MI-MM | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_EPIC | 49.9 | 40.5 | 45.0 | 60.9 | 57.9 | 59.4 |
| EgoVLP+ | Fine-tuning w/ Adaptive-MI-MM + Dual-softmax | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_EPIC+ | 53.8 | 40.9 | 47.4 | 63.3 | 59.6 | 61.4 |

^ EgoVLP+ denotes our submission to the Multi-Instance Retrieval@EPIC-Kitchens Challenge 2022, which uses the Adaptive MI-MM loss and Dual-softmax for prediction (see the sketch below).
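
For readers unfamiliar with the trick, here is a hedged sketch of dual-softmax re-scoring as commonly applied to retrieval similarity matrices; the exact formulation and temperature in our challenge submission may differ.

```python
# Hedged sketch of dual-softmax re-scoring for retrieval inference.
# The exact variant and temperature used in the EgoVLP+ submission may differ.
import torch


def dual_softmax(sim: torch.Tensor, temperature: float = 100.0) -> torch.Tensor:
    """sim: [num_videos, num_texts] cosine-similarity matrix."""
    v2t = torch.softmax(sim * temperature, dim=1)  # normalize over texts for each video
    t2v = torch.softmax(sim * temperature, dim=0)  # normalize over videos for each text
    return v2t * t2v                               # element-wise product re-scores every pair


sim = torch.randn(4, 6)                            # toy similarity matrix
print(dual_softmax(sim).argmax(dim=1))             # top-ranked text for each video
```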

Charades-Ego

  1. Follow the instructions here and download the Charades-Ego dataset (480p) and annotations to dataset/charades/.
  2. Create the training metadata via utils/charades_meta.py.

| Model | Mode | # Frames | Video-Text PT | Weights | mAP |
| --- | --- | --- | --- | --- | --- |
| EgoVLP | Zero-shot | 16 | EgoClip w/ EgoNCE | EgoVLP_PT_EPO1 | 25.0 |
| EgoVLP | Fine-tuning w/ InfoNCE | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_CHARADES | 32.1 |

NLQ @ Ego4D

  1. Make sure you have prepared the NLQ metadata.
  2. For the video branch, download the EgoVLP clip-level features for NLQ. ^ We extract these dense video features (fps=1.87) with the script run/test_nlq.py.
  3. For the text branch, you can extract EgoVLP text features via python3 run/test_nlq.py --subsample 'text' or use our pretrained text encoder.
  4. Fine-tune VSLNet or other methods by replacing their input video-text features.

^ We provide our VSLNet codebase, which adapts EgoVLP features, as an example; you can refer to its data loader and text encoder. A rough illustration of consuming the features follows below.
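
As a rough illustration of how the pre-extracted features plug into a grounding model such as VSLNet, the snippet below loads one video and one query feature tensor. The file names, keys, and shapes are assumptions, not the released feature format.

```python
# Illustrative only: consuming pre-extracted EgoVLP features for NLQ grounding.
# File names and tensor shapes are assumptions, not the released format.
import torch

video_feat = torch.load("nlq_features/clip_0001.pt")   # e.g. [num_steps, feat_dim] at ~1.87 fps
text_feat = torch.load("nlq_features/query_0001.pt")   # e.g. [num_tokens, feat_dim]

# A grounding head predicts start/end indices over the video feature sequence;
# indices map back to seconds via the feature rate: t = index / 1.87.
print(video_feat.shape, text_feat.shape)
```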

^ Our EgoVLP brings consistent improvement over multiple NLQ challenge baselines.

| Model | Video-Text Pre-extracted Features | R@1, IoU=0.3 | R@5, IoU=0.3 | R@1, IoU=0.5 | R@5, IoU=0.5 |
| --- | --- | --- | --- | --- | --- |
| VSLNet | SlowFast + BERT | 5.45 | 10.74 | 3.12 | 6.63 |
| VSLNet | EgoVLP | 10.84 | 18.84 | 6.81 | 13.45 |
| CONE | SlowFast + BERT | 10.40 | 22.74 | 5.03 | 11.87 |
| CONE | EgoVLP | 14.15 | 30.33 | 8.18 | 18.02 |

MQ @ Ego4D

  1. Make sure you have prepared the MQ metadata.
  2. Download the EgoVLP clip-level features for MQ. ^ We extract these dense video features (fps=1.87) with the script run/test_mq.py.
  3. Fine-tune VSGN or other methods by replacing their input video features.

^ We provide our VSGN codebase, which adapts EgoVLP features, as an example; you can refer to its data loader.

^ Our EgoVLP brings consistent improvement over multiple MQ challenge baselines.

| Model | Video Pre-extracted Features | R@1, IoU=0.5 | R@5, IoU=0.5 | mAP |
| --- | --- | --- | --- | --- |
| VSGN | SlowFast | 25.16 | 46.18 | 6.03 |
| VSGN | EgoVLP | 30.14 | 51.98 | 11.39 |
| ActionFormer | SlowFast + Omnivore | 33.46 | - | 17.17 |
| ActionFormer | SlowFast + Omnivore + EgoVLP | 36.84 | - | 20.90 |

OSCC @ Ego4D

  1. Make sure you have prepared the OSCC videos and metadata.
  2. Extract the clip frames following the instructions here -> Data Preparation.

| Model | Video-Text Pretrained | OSCC Acc % |
| --- | --- | --- |
| TimeSformer | ImageNet Init. | 70.3 |
| TimeSformer | EgoVLP | 73.9 |

PNR @ Ego4D

| Model | Video-Text Pretrained | PNR Err (sec) |
| --- | --- | --- |
| TimeSformer | ImageNet Init. | 0.616 |
| TimeSformer | EgoVLP | 0.622 |

^ We found that the effect of VLP is minor on the PNR task.

🎓 Citation

If you find our work helpful, please cite our paper.

@article{kevin2022egovlp,
  title={Egocentric Video-Language Pretraining},
  author={Lin, Kevin Qinghong and Wang, Alex Jinpeng and Soldan, Mattia and Wray, Michael and Yan, Rui and Xu, Eric Zhongcong and Gao, Difei and Tu, Rongcheng and Zhao, Wenzhe and Kong, Weijie and others},
  journal={arXiv preprint arXiv:2206.01670},
  year={2022}
}

✉️ Contact

This repo is maintained by Kevin. Questions and discussions are welcome via kevin.qh.lin@gmail.com.

We are happy to merge results and code if you transfer our EgoVLP to other egocentric tasks or datasets.

🙏 Acknowledgements

This codebase is based on Frozen.

Thanks to Alex for the help with DDP and Mattia for the help with NLQ and MQ benchmarks.

LICENSE

MIT