EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval [ECCV 2024]

Authors: Thomas Hummel*, Shyamgopal Karthik*, Mariana-Iuliana Georgescu, Zeynep Akata

Abstract

In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks may not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR.

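In other words, a Composed Video Retrieval model must score every clip in a gallery against the combination of a query video and a modification text. Below is a minimal sketch of this ranking step, assuming pre-computed, L2-normalized embeddings and a simple additive fusion (an illustrative choice, not the method proposed in this work):

import numpy as np

def rank_gallery(query_video_emb, modification_text_emb, gallery_embs):
    """Rank gallery clips for a composed (video + text) query.

    Assumes L2-normalized numpy arrays: query and text embeddings of
    shape (d,), gallery embeddings of shape (n_clips, d).
    """
    # Fuse the two query modalities; an element-wise sum is one simple choice.
    fused = query_video_emb + modification_text_emb
    fused /= np.linalg.norm(fused)
    # For normalized vectors, cosine similarity reduces to a dot product.
    scores = gallery_embs @ fused
    # Gallery indices, sorted from best to worst match.
    return np.argsort(-scores)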

The EgoCVR benchmark

We propose EgoCVR, a benchmark with 2,295 queries, to evaluate vision-language models for the task of Composed Video Retrieval. The videos and corresponding annotations were collected from the Ego4D FHO task.

Comparison to existing benchmarks

Compared to the existing WebVid-CoVR-Test benchmark, EgoCVR focuses to a much greater extent on temporal and action-related modifications (blue) rather than object-centred modifications (orange).

Dataset structure

Annotations

The annotations for the EgoCVR benchmark are stored in annotation/egocvr/egocvr_annotations.csv. Each row in the CSV file corresponds to a single query.
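A quick way to inspect the annotation schema is to load the CSV with pandas; the path is the one given above, and printing the columns shows the full list:

import pandas as pd

# Each row of the CSV describes one EgoCVR query.
annotations = pd.read_csv("annotation/egocvr/egocvr_annotations.csv")
print(annotations.columns.tolist())  # full column schema
print(annotations.head())            # first few queries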

Evaluation

We consider two possible evaluation settings for EgoCVR: a global setting, in which the target clip is retrieved from the full gallery of clips, and a local setting, in which retrieval is restricted to clips from the same source video as the query.

The gallery information for the EgoCVR benchmark is stored in annotation/egocvr/egocvr_annotations_gallery.csv. In addition to the columns from the query annotations, the gallery annotations contain further columns describing the gallery clips.
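To illustrate how the two settings use the gallery file, the sketch below builds the candidate set for each; the column name video_uid is a hypothetical placeholder for whatever identifier links a clip to its source video:

import pandas as pd

gallery = pd.read_csv("annotation/egocvr/egocvr_annotations_gallery.csv")

# Global setting: every clip in the gallery is a retrieval candidate.
global_candidates = gallery

# Local setting: candidates come from the query's own source video
# ("video_uid" is a hypothetical column name for illustration).
def local_candidates(query_video_uid):
    return gallery[gallery["video_uid"] == query_video_uid]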

Please follow the instructions below for downloading the gallery information.

Setup

  1. Install the required packages
  2. Download the EgoCVR gallery information
  3. Download either the EgoCVR videos or the pre-computed model embeddings
  4. Download the model weights

1. Installing Required Packages

Instructions coming soon.

2. Downloading the EgoCVR Gallery Information

The gallery information for the EgoCVR benchmark is stored in annotation/egocvr/egocvr_annotations_gallery.csv. The gallery information can be downloaded from the following link:

3. Downloading the EgoCVR Videos or Pre-computed Model Embeddings

EgoCVR Videos

We provide the video clips of the EgoCVR benchmark for download, both at their original full scale and downscaled so that the short side is 256px. For all models we use the full-scale video clips as input, except for EgoVLPv2, for which we follow the model's recommendation of downscaling first.

After downloading, please extract the zip files and place the video clips in the data/ directory.

unzip egocvr_clips.zip -d data/
unzip egocvr_clips_256.zip -d data/
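The 256px clips are included in the downloads above, but if you need to reproduce the downscaling yourself, a call along the following lines resizes a clip so that its short side becomes 256 pixels (a sketch assuming ffmpeg is installed; file names are placeholders and encoder settings are left at their defaults):

import subprocess

# Scale so the shorter side becomes 256px while keeping the aspect ratio;
# force_divisible_by=2 keeps dimensions valid for most video encoders.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-vf", "scale=256:256:force_original_aspect_ratio=increase:force_divisible_by=2",
    "output_256.mp4",
], check=True)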

Pre-computed Model Embeddings

We also provide pre-computed model embeddings for the EgoCVR benchmark for download.

After downloading, please extract the zip file and place the model embeddings in the embeddings/ directory.

unzip egocvr_embeddings_egovlpv2.zip -d embeddings/
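Once extracted, the embeddings can be used for retrieval without the raw videos. The archive layout is not documented here, so the file name and keys below are assumptions for illustration; the metric itself is the usual Recall@k:

import numpy as np

# Hypothetical file layout; adapt to the actual contents of embeddings/.
data = np.load("embeddings/egovlpv2.npz")     # assumed .npz archive
gallery_embs = data["gallery"]                # (n_clips, d), L2-normalized
query_embs = data["queries"]                  # (n_queries, d), L2-normalized

def recall_at_k(query_embs, gallery_embs, target_idx, k=5):
    """Fraction of queries whose ground-truth clip appears in the top-k."""
    scores = query_embs @ gallery_embs.T                # cosine similarities
    topk = np.argsort(-scores, axis=1)[:, :k]           # best k per query
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return hits.mean()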

4. Downloading the Model Weights

The model weights should be placed in the checkpoints/ directory.

Evaluation

To evaluate different methods on the EgoCVR benchmark, please run the following command:

# Evaluation in the global setting
python egocvr_retrieval.py --evaluation global
# Evaluation in the local setting
python egocvr_retrieval.py --evaluation local

You can specify the model and the modalities to evaluate using the --model (e.g. clip, blip, languagebind, egovlpv2), --modalities (visual, text, or visual-text), and --text (instruction or tfcvr) arguments; see the examples below.

TFR-CVR: Training-Free Re-ranking for Composed Video Retrieval

In the paper, we propose a simple training-free re-ranking method for Composed Video Retrieval. To evaluate the TFR-CVR and TF-CVR methods, please run the following commands:

# 2-Stage retrieval in the global setting
python egocvr_retrieval.py --evaluation global --model languagebind egovlpv2 --modalities visual text --text tfcvr

# 1-Stage TF-CVR in the local setting
python egocvr_retrieval.py --evaluation local --model egovlpv2 --modalities text --text tfcvr
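Schematically, the 2-stage approach first shortlists candidates with a visual model and then re-ranks the shortlist by text similarity to a description of the modified target clip. Below is a minimal sketch over hypothetical pre-computed embeddings (not the exact pipeline of egocvr_retrieval.py):

import numpy as np

def two_stage_retrieval(query_visual, target_text, gallery_visual, gallery_text, k=50):
    """Stage 1: visual shortlist; stage 2: text-based re-ranking.

    Assumes L2-normalized embeddings: query_visual and target_text of
    shape (d,), gallery_visual and gallery_text of shape (n_clips, d).
    """
    # Stage 1: shortlist the k clips most visually similar to the query video.
    visual_scores = gallery_visual @ query_visual
    shortlist = np.argsort(-visual_scores)[:k]
    # Stage 2: re-rank the shortlist by similarity to a text description
    # of the desired target clip (e.g. derived from the modification text).
    text_scores = gallery_text[shortlist] @ target_text
    return shortlist[np.argsort(-text_scores)]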

Additional Examples

# CLIP
python egocvr_retrieval.py --evaluation global --model clip --modalities visual-text --text instruction
# BLIP
python egocvr_retrieval.py --evaluation global --model blip --modalities visual-text --text instruction
# LanguageBind
python egocvr_retrieval.py --evaluation global --model languagebind --modalities visual-text --text instruction
# BLIP_CoVR
python egocvr_retrieval.py --evaluation global --model blip --modalities visual-text --text instruction --fusion crossattn --finetuned

Citation

@inproceedings{hummel2024egocvr,
  title={EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval},
  author={Thomas Hummel and Shyamgopal Karthik and Mariana-Iuliana Georgescu and Zeynep Akata},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}