ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound

License: MIT <img src="https://raw.githubusercontent.com/facebookresearch/unbiased-teacher/main/teaser/pytorch-logo-dark.png" width="10%">

This is the PyTorch implementation of our paper: <br> ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound<br> Yan-Bo Lin, Jie Lei, Mohit Bansal, and Gedas Bertasius<br> In European Conference on Computer Vision, 2022. <br>

paper

šŸ“ Preparation

  1. Install the dependencies: pip3 install -r requirements.txt
  2. Datasets: ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades.
  3. Extract video frames at 3 fps.
  4. Extract audio features.
  5. Download the pretrained CLIP weights.

The download links are taken from the official CLIP4Clip repository. Download the CLIP (ViT-B/32) weights:

wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt

or download the CLIP (ViT-B/16) weights:

wget -P ./modules https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
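An interrupted download can leave a truncated checkpoint in ./modules, so it may be worth checking the files before training. The sketch below is not part of this repository. It assumes that the long hexadecimal path component in each URL above is the SHA-256 digest of the file, which is how the OpenAI CLIP loader treats it. The helper names `sha256_of` and `verify_clip_weight` are hypothetical.

```python
import hashlib
from pathlib import Path

# Digests copied from the download URLs above (the path component just
# before the file name). Assumption: they are SHA-256 digests of the files.
EXPECTED_SHA256 = {
    "ViT-B-32.pt": "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af",
    "ViT-B-16.pt": "5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_clip_weight(path):
    """Return True if the file's digest matches the one listed for its name."""
    path = Path(path)
    expected = EXPECTED_SHA256.get(path.name)
    if expected is None:
        raise ValueError(f"no known digest for {path.name}")
    return sha256_of(path) == expected


if __name__ == "__main__":
    # Check whichever of the two checkpoints is present in ./modules.
    for name in EXPECTED_SHA256:
        weight = Path("./modules") / name
        if weight.exists():
            status = "OK" if verify_clip_weight(weight) else "checksum mismatch"
            print(f"{name}: {status}")
```

If a file reports a mismatch, downloading it again with the wget command above is the simplest fix.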

💿 Extract images and audio features.

ActivityNet/
├── raw_frames/
│       └── VIDEO_NAME/
│           ├── 0001.jpg
│           ├── ...
│           └── 00...jpg
│
└── VGGSound_Audio_features_10s_aligned/
        └── VIDEO_NAME/
            ├── 0000.pt
            ├── ...
            └── 00...pt
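One way to produce the raw_frames/ layout above is to call ffmpeg once per video. The sketch below is an illustration, not the authors' extraction script. It assumes ffmpeg is installed and on the PATH, that VIDEO_NAME is the video file name without its extension, and that frames use four-digit numbering starting at 0001 (ffmpeg's default start number for image sequences is 1). The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, frames_root, fps=3):
    """Return (output_dir, argv) for sampling a video at `fps` frames per second."""
    video_path = Path(video_path)
    # Assumption: VIDEO_NAME is the file name without its extension.
    out_dir = Path(frames_root) / video_path.stem
    command = [
        "ffmpeg",
        "-i", str(video_path),
        "-vf", f"fps={fps}",         # resample to the target frame rate
        str(out_dir / "%04d.jpg"),   # 0001.jpg, 0002.jpg, ...
    ]
    return out_dir, command


def extract_frames(video_path, frames_root, fps=3):
    """Create the output directory and run ffmpeg; raises if ffmpeg fails."""
    out_dir, command = build_ffmpeg_command(video_path, frames_root, fps)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(command, check=True)
    return out_dir
```

For example, `extract_frames("videos/VIDEO_NAME.mp4", "ActivityNet/raw_frames")` would write the frames to ActivityNet/raw_frames/VIDEO_NAME/, where the videos/ directory is a placeholder for wherever the raw videos are stored.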

💿 Extracted audio features.

VGGSound features on ActivityNet Captions: Google Drive
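After extracting frames and unpacking the audio features, a quick consistency check can confirm that every video has both modalities before training starts. The following is a minimal sketch that relies only on the directory layout shown earlier (raw_frames/ with .jpg files and VGGSound_Audio_features_10s_aligned/ with .pt files). The function names are hypothetical, and the check does not open or validate the feature files themselves.

```python
from pathlib import Path

FRAMES_DIR = "raw_frames"
AUDIO_DIR = "VGGSound_Audio_features_10s_aligned"


def count_files(root, pattern):
    """Map each video directory under `root` to its number of matching files."""
    root = Path(root)
    if not root.is_dir():
        return {}
    return {d.name: len(list(d.glob(pattern))) for d in root.iterdir() if d.is_dir()}


def summarize_dataset(dataset_root):
    """Return {video_name: (num_frames, num_audio_features)} for one dataset."""
    dataset_root = Path(dataset_root)
    frames = count_files(dataset_root / FRAMES_DIR, "*.jpg")
    audio = count_files(dataset_root / AUDIO_DIR, "*.pt")
    names = sorted(set(frames) | set(audio))
    return {name: (frames.get(name, 0), audio.get(name, 0)) for name in names}


def find_incomplete(summary):
    """List videos that are missing frames or audio features."""
    return [name for name, (n_frames, n_feats) in summary.items()
            if n_frames == 0 or n_feats == 0]


if __name__ == "__main__":
    summary = summarize_dataset("ActivityNet")
    missing = find_incomplete(summary)
    print(f"{len(summary)} videos found, {len(missing)} incomplete")
    for name in missing:
        print("  incomplete:", name, summary[name])
```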

📚 Train and evaluate

ActivityNet Captions: bash run_act.sh
DiDeMo: bash run_didemo.sh
Charades: bash run_cha.sh
QVHighlights: bash run_qvh.sh
YouCook2: bash run_yc2.sh

🎓 Cite

If you use this code in your research, please cite:

@InProceedings{ECLIPSE_ECCV22,
author = {Yan-Bo Lin and Jie Lei and Mohit Bansal and Gedas Bertasius},
title = {ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {October},
year = {2022}
}

šŸ‘ Acknowledgments

Our code is based on CLIP4Clip and VGGSound.

āœ Future works

License

This project is licensed under the MIT License, as found in the LICENSE file.