<div align="center"> <h1>Zero-Shot Temporal Action Detection via Vision-Language Prompting</h1> <div> <a href='https://sauradip.github.io/' target='_blank'>Sauradip Nag</a><sup>1,2,+</sup>&emsp; <a href='https://scholar.google.co.uk/citations?hl=en&user=ZbA-z1cAAAAJ&view_op=list_works&sortby=pubdate' target='_blank'>Xiatian Zhu</a><sup>1,3</sup>&emsp; <a href='https://scholar.google.co.uk/citations?user=irZFP_AAAAAJ&hl=en' target='_blank'>Yi-Zhe Song</a><sup>1,2</sup>&emsp; <a href='https://scholar.google.co.uk/citations?hl=en&user=MeS5d4gAAAAJ&view_op=list_works&sortby=pubdate' target='_blank'>Tao Xiang</a><sup>1,2</sup>&emsp; </div> <div> <sup>1</sup>CVSSP, University of Surrey, UK&emsp; <sup>2</sup>iFlyTek-Surrey Joint Research Center on Artificial Intelligence, UK&emsp; <br> <sup>3</sup>Surrey Institute for People-Centred Artificial Intelligence, UK </div> <div> <sup>+</sup>corresponding author </div> <h3><strong>Accepted to <a href='https://eccv2022.ecva.net/' target='_blank'>ECCV 2022</a></strong></h3> <h3 align="center"> <a href="https://arxiv.org/abs/2207.08184" target='_blank'>Paper</a> | <a href="https://sauradip.github.io/project_pages/STALE/" target='_blank'>Project Page</a> </h3> <table> <tr> <td><img src="assets/STALE_intro.gif" width="100%"/></td> </tr> </table> </div>

Updates

Summary

Abstract

Existing temporal action detection (TAD) methods rely on large training data with segment-level annotations and are limited to recognizing previously seen classes during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging and has received significantly less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP-style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). This design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors.

Architecture

Getting Started

Requirements

Environment Setup

We suggest creating a Conda environment and installing the requirements into it:

pip3 install -r requirements.txt

Extra Dependencies

We use the MaskFormer implementation for representation masking.

git clone https://github.com/sauradip/STALE.git
cd STALE
git clone https://github.com/facebookresearch/MaskFormer

Follow the MaskFormer installation instructions to install Detectron2 and the other modules, within this same environment if possible. After this step, copy the files in /STALE/extra_files into /STALE/MaskFormer/mask_former/modeling/transformer/.
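The copy step above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `copy_extra_files` is hypothetical, and it assumes that `extra_files` and the cloned `MaskFormer` directory both sit directly under the STALE repository root, as the paths above suggest.

```python
import shutil
from pathlib import Path


def copy_extra_files(repo_root):
    """Copy every file in extra_files into MaskFormer's transformer package.

    Hypothetical helper: the directory layout is taken from the README text.
    """
    repo_root = Path(repo_root)
    src_dir = repo_root / "extra_files"
    dst_dir = repo_root / "MaskFormer" / "mask_former" / "modeling" / "transformer"
    if not src_dir.is_dir():
        raise FileNotFoundError(f"missing source directory: {src_dir}")
    if not dst_dir.is_dir():
        raise FileNotFoundError(f"missing target directory (is MaskFormer cloned?): {dst_dir}")
    copied = []
    for src in sorted(src_dir.iterdir()):
        if src.is_file():
            # copy2 keeps file metadata and overwrites a same-named target file
            shutil.copy2(src, dst_dir / src.name)
            copied.append(src.name)
    return copied
```

Calling `copy_extra_files(".")` from the STALE repository root returns the names of the files it copied. Note that same-named files in the target directory are overwritten.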

Download Features

Download the video features and update the video paths and output paths in the config/anet.yaml file. For now, only the ActivityNet v1.3 dataset config is available. We plan to release the code for the THUMOS14 dataset soon.

| Dataset | Feature Backbone | Pre-Training | Link |
| --- | --- | --- | --- |
| ActivityNet | ViT-B/16-CLIP | CLIP | Google Drive |
| THUMOS | ViT-B/16-CLIP | CLIP | Google Drive |
| ActivityNet | I3D | Kinetics-400 | Google Drive |
| THUMOS | I3D | Kinetics-400 | Google Drive |
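After editing config/anet.yaml, it can help to confirm that the configured paths exist before starting a long run. The sketch below is an assumption-laden example: the key names inside config/anet.yaml are not documented here, so it walks the whole config and treats any string value containing a path separator as a path. That heuristic can produce false positives, for instance a backbone name such as `ViT-B/16`. The function names are hypothetical, and PyYAML is assumed to be installed.

```python
from pathlib import Path


def find_missing_paths(config, prefix=""):
    """Return (key, value) pairs for path-like string values that do not exist on disk."""
    missing = []
    for key, value in config.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections, keeping a dotted key for reporting
            missing.extend(find_missing_paths(value, full_key))
        elif isinstance(value, str) and ("/" in value or "\\" in value):
            # Heuristic: any string with a path separator is treated as a path
            if not Path(value).exists():
                missing.append((full_key, value))
    return missing


def check_config_file(config_path="config/anet.yaml"):
    """Load a YAML config and report path-like values that are missing."""
    import yaml  # PyYAML; assumed to be available in the environment

    with open(config_path) as handle:
        config = yaml.safe_load(handle)
    return find_missing_paths(config)
```

An empty list from `check_config_file()` means every path-like value was found on disk. Output directories that the scripts create themselves may also be reported, which is harmless.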

Training Splits

Currently we support the training splits provided by the EfficientPrompt paper. Both the 50% and 75% labeled-data splits are available for training. They can be found in STALE/splits.
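To illustrate what a seen/unseen split means, the sketch below partitions a list of class names into disjoint seen (training) and unseen (test) sets. It is only an illustration under stated assumptions: the function `split_classes`, its seed handling, and the placeholder class names are hypothetical, and it does not reproduce the file format in STALE/splits or the exact procedure used by EfficientPrompt.

```python
import random


def split_classes(class_names, seen_fraction, seed=0):
    """Randomly partition class names into disjoint seen and unseen sets."""
    if not 0.0 < seen_fraction < 1.0:
        raise ValueError("seen_fraction must be strictly between 0 and 1")
    names = sorted(set(class_names))  # deterministic starting order
    rng = random.Random(seed)         # local generator, leaves global state alone
    rng.shuffle(names)
    num_seen = round(len(names) * seen_fraction)
    return sorted(names[:num_seen]), sorted(names[num_seen:])


# Placeholder names standing in for the 200 ActivityNet v1.3 action classes
classes = [f"class_{i:03d}" for i in range(200)]
seen, unseen = split_classes(classes, seen_fraction=0.75)
print(len(seen), len(unseen))
```

With 200 classes, a 75% split leaves 150 seen and 50 unseen classes, and a 50% split leaves 100 of each. The model trains only on videos of seen classes and is tested on the unseen ones.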

Model Training

To train STALE from scratch, run the following command. The training configuration can be adjusted in the config/anet.yaml file.

python stale_train.py

Model Inference

We provide pretrained checkpoints for both the 50% and 75% labeled-data splits in the zero-shot setting:

| Dataset | Split (Seen-Unseen) | Feature | Link |
| --- | --- | --- | --- |
| ActivityNet | 50%-50% | CLIP | ckpt |
| ActivityNet | 75%-25% | CLIP | ckpt |

After downloading the checkpoints, set the checkpoint path in the config/anet.yaml file. Model inference can then be performed using the following command:

python stale_inference.py

Model Evaluation

To evaluate our STALE model, run the following command:

python eval.py
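This README does not spell out what eval.py computes. TAD benchmarks are commonly scored by mean average precision at several temporal IoU (tIoU) thresholds, so as background, here is a minimal sketch of the tIoU between a predicted and a ground-truth segment. The function `temporal_iou` is hypothetical and is not taken from the repository.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in the same time unit."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the segments do not intersect
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# A prediction covering 0-10 s against a ground truth covering 5-15 s
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

In this example the segments overlap for 5 s out of a 15 s union, so the tIoU is one third. Under such a metric, a detection counts as correct at a given threshold when its tIoU with a same-class ground-truth segment meets that threshold.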

TO-DO Checklist

Acknowledgement

Our source code is based on the implementations of DenseCLIP, MaskFormer and CoOp. We thank the authors for open-sourcing their code.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{nag2022zero,
  title={Zero-shot temporal action detection via vision-language prompting},
  author={Nag, Sauradip and Zhu, Xiatian and Song, Yi-Zhe and Xiang, Tao},
  journal={arXiv e-prints},
  pages={arXiv--2207},
  year={2022}
}