
[ECCV2024] UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

<a href='https://arxiv.org/abs/2404.04933'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>

<div align="center"> <img src="./images/intro_unimd.png" width="600px"/> </div> <div align="center"> <img src="./images/network.png" width="600px"/> </div>

Introduction

In this paper, we investigate the potential synergy between TAD and MR. First, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR, as shown in Fig. 1. It transforms the inputs of the two tasks, namely actions for TAD and events for MR, into a common embedding space, and uses two novel query-dependent decoders to produce a uniform output of classification scores and temporal segments, as shown in Fig. 4. Second, we explore the efficacy of two task-fusion learning approaches, pre-training and co-training, to enhance the mutual benefit between TAD and MR. Extensive experiments demonstrate that the proposed task-fusion learning scheme enables the two tasks to help each other and to outperform their separately trained counterparts.
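
As a rough illustration of this interface, the sketch below conditions video features on a single task query (an action-class embedding for TAD, or a sentence embedding for MR) and emits per-timestep classification scores and segment offsets. It is a minimal sketch: the module name, the concatenation-based fusion, and the tensor shapes are assumptions for exposition, not the actual UniMD decoders.

```python
# Minimal sketch of query-dependent decoding (illustrative assumptions only;
# see the repository code for the actual UniMD decoders).
import torch
import torch.nn as nn

class QueryDependentDecoder(nn.Module):
    """Conditions video features on one task query and predicts, per timestep,
    a classification score and a (start, end) segment regression."""

    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.cls_head = nn.Linear(dim, 1)  # score: does the query occur here?
        self.reg_head = nn.Linear(dim, 2)  # offsets to segment start/end

    def forward(self, video_feats: torch.Tensor, query_emb: torch.Tensor):
        # video_feats: (B, T, D); query_emb: (B, D) text embedding of either
        # an action name (TAD) or a natural-language event (MR).
        q = query_emb.unsqueeze(1).expand(-1, video_feats.size(1), -1)
        fused = self.fuse(torch.cat([video_feats, q], dim=-1))
        scores = self.cls_head(fused).squeeze(-1)  # (B, T)
        segments = self.reg_head(fused)            # (B, T, 2)
        return scores, segments

# Both tasks share the same decoder interface, so TAD and MR differ only in
# the query embedding that is fed in.
decoder = QueryDependentDecoder(dim=512)
feats = torch.randn(2, 128, 512)    # dummy clip features
action_query = torch.randn(2, 512)  # e.g. a text embedding of an action name
scores, segments = decoder(feats, action_query)
```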

This repository contains the code for UniMD and the video features used in the paper. Our code is built upon the ActionFormer codebase. Our paper has been accepted to ECCV 2024.

Changelog

Video Features

We provide the video features of the three paired datasets used in our experiments: "Ego4D-MQ & Ego4D-NLQ", "Charades & Charades-STA", and "ActivityNet & ActivityNet-Caption".

Query embeddings & ground truth
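
To give a sense of how these artifacts fit together, the snippet below loads pre-extracted features and query embeddings for one dataset. The file paths, the `.pt` format, and the shapes are hypothetical; consult the released files for the actual layout.

```python
# Hypothetical loading sketch: file names, formats, and shapes below are
# assumptions for illustration and may not match the actual release.
import torch

# Pre-extracted clip features for one video, assumed shape (T, D).
video_feats = torch.load("features/charades/VIDEO_ID.pt")

# Query embeddings, assumed to map query text (MR) or action name (TAD)
# to a (D,)-dim text embedding.
query_embs = torch.load("query_emb/charades_sta_queries.pt")

print(video_feats.shape, len(query_embs))
```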

Installation

Inference

Ego4D

| Dataset | Method | Feats | TAD-mAP | TAD-r1@50 | MR-r1@30 | MR-r1@50 | Checkpoint |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Ego4D-MQ | individual | InternVid-verb | 22.61 | 42.82 | - | - | ego4d_mq_individual.pth.tar |
| Ego4D-NLQ | individual | InternVid-verb | - | - | 13.99 | 9.34 | ego4d_nlq_individual.pth.tar |

```shell
# inference
cd ./tools/
sh run_predict_ego4d.sh    # set data_type=tad for TAD, or data_type=mr for MR
```

Charades & Charades-STA

| Dataset | Method | Feats | TAD-mAP | MR-r1@50 | MR-r1@70 | Checkpoint |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Charades | individual | I3D | 22.31 | - | - | charades_individual.pth.tar |
| Charades | individual | I3D+CLIP | 26.18 | - | - | charades_i3dClip_individual.pth.tar |
| Charades-STA | individual | I3D | - | 60.19 | 41.02 | charadesSTA_individual.pth.tar |
| Charades-STA | individual | I3D+CLIP | - | 58.79 | 40.08 | charadesSTA_i3dClip_individual.pth.tar |

```shell
# inference
cd ./tools/
sh run_predict_charades.sh    # set data_type=tad for TAD, or data_type=mr for MR
```

ActivityNet & ActivityNet-Caption

| Dataset | Method | Feats | TAD-mAP | TAD-mAP@50 | MR-r5@50 | MR-r5@70 | Checkpoint |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ANet | individual | InternVid | 38.60 | 58.31 | - | - | anet_tad_individual.pth.tar |
| ANet-Caption | individual | InternVid | - | - | 77.28 | 52.22 | anet_caption_individual.pth.tar |

```shell
# inference
cd ./tools/
sh run_predict_anet.sh    # set data_type=tad for TAD, or data_type=mr for MR
```

Training

```shell
cd ./tools/
# for example, to train Ego4D on the TAD task individually, run:
sh individual_train_ego4d_tad.sh
```

```shell
cd ./tools/
# for example, to co-train Ego4D with randomly sampled task order, run:
sh cotrain_random_ego4d.sh
```

```shell
cd ./tools/
# for example, to co-train Ego4D with synchronized task sampling, run:
sh cotrain_sync_ego4d.sh
```
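
The two co-training scripts differ in how TAD and MR batches are scheduled during training. The sketch below shows one plausible reading of the `random` and `sync` variants, inferred from the script names; it is an illustration, not the repository's actual training loop.

```python
# Illustrative co-training schedules (assumed from the script names
# cotrain_random / cotrain_sync; see the shell scripts for the real config).
import random

def cotrain_random(tad_batches, mr_batches, train_step):
    """Interleave TAD and MR batches in a random order."""
    mixed = [("tad", b) for b in tad_batches] + [("mr", b) for b in mr_batches]
    random.shuffle(mixed)
    for task, batch in mixed:
        train_step(task, batch)

def cotrain_sync(tad_batches, mr_batches, train_step):
    """Step the two tasks in lock-step: one TAD batch, then one MR batch."""
    for tad_batch, mr_batch in zip(tad_batches, mr_batches):
        train_step("tad", tad_batch)
        train_step("mr", mr_batch)
```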

Citation

```bibtex
@misc{zeng2024unimd,
      title={UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection},
      author={Yingsen Zeng and Yujie Zhong and Chengjian Feng and Lin Ma},
      year={2024},
      eprint={2404.04933},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

Acknowledgement

This repository is built upon the ActionFormer, InternVideo-ego4d, InternVideo, and i3d-feature-extraction repositories.