Localizing the Common Action Among a Few Videos (ECCV 2020)
This paper strives to localize the temporal extent of an action in a long untrimmed video. Where existing work leverages many examples with their start, their ending, and/or the class of the action during training time, we propose few-shot common action localization. The start and end of an action in a long untrimmed video are determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module to simultaneously complement the representation of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module to weigh the importance of different support videos. Evaluation of few-shot common action localization in untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.
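As a rough intuition for item (iii), the sketch below shows one way a pairwise matching step could weigh support videos by their similarity to the query and fuse them into a single prototype. It is a minimal PyTorch illustration only; the class name `PairwiseMatching` and the exact feature shapes are assumptions, and the actual module is implemented in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseMatching(nn.Module):
    """Illustrative sketch (not the repo's exact implementation):
    weigh each support video by its similarity to the query and fuse them."""

    def forward(self, query_feat, support_feats):
        # query_feat:    (C,)   pooled feature of the untrimmed query video
        # support_feats: (K, C) pooled features of the K trimmed support videos
        sims = F.cosine_similarity(support_feats, query_feat.unsqueeze(0), dim=1)  # (K,)
        weights = torch.softmax(sims, dim=0)                                       # (K,)
        # Weighted fusion of the support videos into a single action prototype
        prototype = (weights.unsqueeze(1) * support_feats).sum(dim=0)              # (C,)
        return prototype, weights

# Toy usage with random features
query = torch.randn(512)
supports = torch.randn(5, 512)
prototype, weights = PairwiseMatching()(query, supports)
```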
For more details, please check our paper.
System Requirements
- cuda=9.0
- python 3.6
- gcc=5.5.0
- torch=0.4 (torch 0.4.1 is currently not supported; for a smooth installation of NMS, see https://github.com/jwyang/faster-rcnn.pytorch/issues/235#issuecomment-409493006). A quick sanity check of the environment is sketched below.
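The following minimal snippet is one way to verify the environment before compiling the CUDA extensions; the expected values in the comments simply restate the requirements above.

```python
# Quick sanity check of the environment assumed by this repo (PyTorch 0.4.0 + CUDA 9.0).
import torch

print(torch.__version__)          # expected: 0.4.0 (0.4.1 breaks the NMS build)
print(torch.version.cuda)         # expected: 9.0
print(torch.cuda.is_available())  # the compiled CUDA extensions require a GPU
```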
Package Requirements
conda create -n focal python=3.6
conda activate focal
pip install torch==0.4
pip install --no-cache --upgrade git+https://github.com/dongzhuoyao/pytorchgo.git
pip install -r requirements.txt
Compile the CUDA dependencies with the following commands:
cd lib
sh make.sh
Preparation
Download the initial backbone weight from OneDrive and put it in the directory data/pretrained_model.
Extract ActivityNet 1.3 frames at FPS=3 following R-C3D, then put them in the directory dataset/activitynet13/train_val_frames_3/; it should contain two folders: training and validation. A minimal extraction sketch is shown below.
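R-C3D ships its own extraction scripts, so treat the following only as an illustration of how frames at 3 fps map onto the directory layout above; the video path and frame naming pattern are assumptions.

```python
# Minimal sketch of extracting frames at 3 fps with ffmpeg.
import os
import subprocess

def extract_frames(video_path, out_dir, fps=3):
    """Dump JPEG frames from one video at the given fps into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.check_call([
        "ffmpeg", "-i", video_path, "-r", str(fps),
        os.path.join(out_dir, "image_%05d.jpg"),
    ])

# Hypothetical example for one validation video of ActivityNet 1.3:
# extract_frames("videos/validation/v_xxxxx.mp4",
#                "dataset/activitynet13/train_val_frames_3/validation/v_xxxxx")
```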
The detailed split structure of the dataset is already provided in our pickle files in ./preprocess. If you want to create your own dataset, you can follow here to create your own pickle file; a hypothetical sketch is given below.
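The real schema is defined by the files in ./preprocess, so inspect those first; the snippet below only illustrates dumping an annotation dictionary to a pickle, and every key and value in it is hypothetical.

```python
# Hypothetical example of writing a dataset pickle -- the keys are illustrative only.
import pickle

annotations = {
    "v_query_0001": {                      # untrimmed query video (frame folder name)
        "num_frames": 540,
        "actions": [[96, 183]],            # [start_frame, end_frame] of the common action
        "support": ["v_support_0001",      # trimmed support videos of the same action
                    "v_support_0002"],
    },
}

with open("preprocess/my_dataset.pkl", "wb") as f:
    pickle.dump(annotations, f)
```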
Training
python main.py --bs 1 --gpus 0
Evaluate our trained weight
First, download our trained weight from OneDrive and put the weight file best_model.pth in train_log/main, then run the evaluation with the following command:
python main.py --test
Email for QA
For any question related to this repo, please send an email to yangpengwan2016@gmail.com.
Acknowledgement
This repo is developed based on https://github.com/sunnyxiaohu/R-C3D.pytorch; thanks for their contribution.
Citation
If you find our work useful, please kindly cite it.
@INPROCEEDINGS{YangECCV20,
author = {Pengwan Yang and Vincent Tao Hu and Pascal Mettes and Cees G. M. Snoek},
title = {Localizing the Common Action Among a Few Videos},
booktitle = {European Conference on Computer Vision},
month = {August},
year = {2020},
address = {Glasgow, UK},
}