RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Tanveer Hannan Md Mohaiminul Islam Thomas Seidl Gedas Bertasius
Accepted at ECCV 2024
:loudspeaker: Latest Updates
- Jul-13: The trained model weights are available here.
- Jul-13: Released the training and evaluation code.
- Jul-1: RGNet is accepted to ECCV 2024! :fire::fire:
RGNet Overview :bulb:
RGNet is a novel architecture for fine-grained moment understanding and reasoning in long videos (20–120 minutes). It predicts the moment boundary specified by a textual query from an hour-long video. RGNet unifies retrieval and moment detection into a single network and processes long videos at multiple levels of granularity, e.g., clips and frames.
<img src="main.png" alt="drawing" width="1000"/>
Contributions :trophy:
- We systematically deconstruct existing LVTG methods into clip retrieval and grounding stages. Through empirical evaluations, we discern that disjoint retrieval is the primary factor contributing to poor performance.
- Based on our observations, we introduce RGNet, which integrates clip retrieval with grounding through parallel clip and frame-level modeling. This obviates the necessity for a separate video retrieval network, replaced instead by an end-to-end clip retrieval module tailored specifically for long videos.
- We introduce sparse attention to the retriever and a corresponding loss to model fine-grained event understanding in long-range videos. We propose a contrastive negative clip-mining strategy to simulate clip retrieval from a long video during training (a minimal sketch of this idea appears after this list).
- RGNet surpasses prior methods, showcasing state-of-the-art performance on long video temporal grounding (LVTG) datasets MAD and Ego4D.
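As a rough illustration of the negative clip-mining idea, the snippet below contrasts a clip that overlaps the annotated moment against non-overlapping clips sampled from the same long video, using an InfoNCE-style loss. This is an assumption-laden sketch, not RGNet's actual training objective; the names, loss form, and sampling scheme are illustrative only.

```python
import torch
import torch.nn.functional as F

def negative_clip_mining_loss(query_feat, clip_feats, clip_bounds, gt_span,
                              num_negatives=8, temperature=0.07):
    """Hypothetical InfoNCE-style loss with intra-video negative clips.

    query_feat:  (D,)    pooled text-query feature
    clip_feats:  (N, D)  pooled features of all N clips from the same video
    clip_bounds: list of (start_sec, end_sec) per clip
    gt_span:     (start_sec, end_sec) of the annotated moment
    Assumes at least one overlapping and one non-overlapping clip exist.
    """
    gs, ge = gt_span
    # Positive = a clip overlapping the moment; negatives = non-overlapping clips.
    overlaps = torch.tensor([not (ce <= gs or cs >= ge) for cs, ce in clip_bounds])
    pos_idx = overlaps.nonzero().flatten()[0]
    neg_pool = (~overlaps).nonzero().flatten()
    neg_idx = neg_pool[torch.randperm(len(neg_pool))[:num_negatives]]

    q = F.normalize(query_feat, dim=-1)
    feats = F.normalize(clip_feats, dim=-1)
    logits = torch.cat([feats[pos_idx].unsqueeze(0), feats[neg_idx]]) @ q / temperature
    # The positive clip is placed at index 0, so the target label is 0.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage: 10 one-minute clips, moment annotated at 125-150 s.
loss = negative_clip_mining_loss(
    torch.randn(256), torch.randn(10, 256),
    [(i * 60, (i + 1) * 60) for i in range(10)], (125, 150))
```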
Installation :wrench:
- Follow INSTALL.md for installing necessary dependencies and compiling the code.
Prepare Offline Data
- Download the full Ego4D-NLQ data: Ego4D-NLQ (8.29GB).
- Download the partial MAD data: MAD (6.5GB). We cannot share the MAD visual features at this moment; please request access to the MAD dataset from the official resource, MAD github.
- We provide detailed feature extraction and file pre-processing procedures for both benchmarks; please refer to Feature_Extraction_MD.
- Follow DATASET.md for processing the dataset.
Ego4D-NLQ Training
Training can be launched by running the following commands. The checkpoints and other experiment log files will be written into the results directory.
```bash
bash rgnet/scripts/pretrain_ego4d.sh
bash rgnet/scripts/finetune_ego4d.sh
```
Ego4D-NLQ Inference
Once the model is trained, you can use the following command for inference, where CHECKPOINT_PATH is the path to the saved checkpoint.
```bash
bash rgnet/scripts/inference_ego4d.sh CHECKPOINT_PATH
```
MAD Training
Training can be launched by running the following command:
```bash
bash rgnet/scripts/train_mad.sh
```
MAD Inference
Once the model is trained, you can use the following command for inference, where CUDA_DEVICE_ID is the CUDA device ID and CHECKPOINT_PATH is the path to the saved checkpoint.
```bash
bash rgnet/scripts/inference_mad.sh CHECKPOINT_PATH
```
Qualitative Analysis :mag:
A comprehensive evaluation of RGNet's performance on the Ego4D-NLQ dataset.
<img src="qual.png" alt="drawing" width="1000"/>
Acknowledgements :pray:
We are grateful to the following awesome projects that RGNet builds upon:
- Moment-DETR: Detecting Moments and Highlights in Videos via Natural Language Queries
- QD-DETR: Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
- CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
- MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
- Ego-4D: Ego4D Episodic Memory Benchmark
If you're using RGNet in your research or applications, please cite using this BibTeX:
```bibtex
@article{hannan2023rgnet,
  title={RGNet: A Unified Retrieval and Grounding Network for Long Videos},
  author={Hannan, Tanveer and Islam, Md Mohaiminul and Seidl, Thomas and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2312.06729},
  year={2023}
}
```
License :scroll:
<a rel="license" href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/80x15.png" /></a>
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License</a>.
Looking forward to your feedback, contributions, and stars! :star2: