Local-Global Video-Text Interactions for Temporal Grounding

PyTorch implementation of Local-Global Interaction (LGI) network for temporal grounding given a text query.

Local-Global Video-Text Interactions for Temporal Grounding

Jonghwan Mun, Minsu Cho, Bohyung Han

Figure (model_overview): Overall architecture of our algorithm (LGI). Given a video and a text query, we encode them to obtain segment-level visual features and word-level and sentence-level textual features. We extract a set of semantic phrase features from the query using the Sequential Query Attention Network (SQAN). Then, we obtain semantics-aware segment features based on the extracted phrase features via local-global video-text interactions. Finally, we directly predict the time interval from the summarized video features using temporal attention. We train the model using a regression loss and two additional attention-related losses.
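The "summarized video features using temporal attention" step can be illustrated with a small sketch. This is not the repository's code: it is plain Python rather than PyTorch, the function names `softmax` and `attention_pool` are hypothetical, and the per-segment scores are passed in directly, whereas in a trained model they would come from learned layers.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw per-segment scores into attention weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    features: Sequence[Sequence[float]], scores: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Summarize segment features as an attention-weighted sum.

    features: one feature vector per video segment (all the same length).
    scores:   one raw attention score per segment (assumed given here).
    Returns the pooled feature vector and the attention weights.
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return pooled, weights


# Two segments with equal scores contribute equally to the summary.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

A regression head would then map the pooled vector to a (start, end) interval; that part is omitted here because its exact form is not described in this README.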

1. Dependencies

This repository is implemented in PyTorch and uses Anaconda for the environment.
Refer to "Setting environment with anaconda" or use Docker (choco1916/envs:temporal_grounding).

2. Prepare data

Running scripts/prepare_data.sh downloads all data, including annotations, video features (I3D for Charades-STA, C3D for ActivityNet Captions), and pre-processed annotation information.

bash scripts/prepare_data.sh

3. Evaluating pre-trained models

conda activate tg
bash scripts/download_pretrained_models.sh
# Evaluate the LGI model trained on the ActivityNet Captions dataset
CUDA_VISIBLE_DEVICES=0 python -m src.experiment.eval \
                     --config pretrained_models/anet_LGI/config.yml \
                     --checkpoint pretrained_models/anet_LGI/model.pkl \
                     --method tgn_lgi \
                     --dataset anet
# Evaluate the LGI model trained on the Charades-STA dataset
CUDA_VISIBLE_DEVICES=0 python -m src.experiment.eval \
                     --config pretrained_models/charades_LGI/config.yml \
                     --checkpoint pretrained_models/charades_LGI/model.pkl \
                     --method tgn_lgi \
                     --dataset charades

The pre-trained models should report the scores below. Because this code is a re-implementation, the reproduced numbers differ slightly from those in the paper (reproduced vs. paper at R@0.5: 41.65 vs. 41.51 on ActivityNet Captions and 59.17 vs. 59.46 on Charades-STA).

Dataset                 R@0.3   R@0.5   R@0.7   mIoU
ActivityNet Captions    58.48   41.65   24.10   41.48
Charades-STA            72.18   59.17   35.32   50.93
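For readers unfamiliar with the metrics, the following is a minimal sketch of how R@0.3/0.5/0.7 and mIoU are commonly defined for temporal grounding. It assumes the standard definitions (temporal IoU between predicted and ground-truth intervals, recall as the fraction of queries at or above a threshold) and is not the repository's evaluation code; the function names are hypothetical. The scores above are reported as percentages, while this sketch returns fractions.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_time, end_time) in seconds


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(ious: Sequence[float], threshold: float) -> float:
    """Fraction of queries whose prediction reaches the IoU threshold."""
    return sum(iou >= threshold for iou in ious) / len(ious)


def mean_iou(ious: Sequence[float]) -> float:
    """Average IoU over all queries."""
    return sum(ious) / len(ious)


# Illustrative (made-up) predictions and ground truths for three queries.
preds = [(2.0, 8.0), (0.0, 5.0), (10.0, 20.0)]
gts = [(4.0, 10.0), (4.0, 9.0), (10.0, 18.0)]
ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]

for t in (0.3, 0.5, 0.7):
    print(f"R@{t}: {100 * recall_at(ious, t):.2f}")
print(f"mIoU: {100 * mean_iou(ious):.2f}")
```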

4. Training models from scratch

This code loads all the data (~30GB for ActivityNet Captions and ~3GB for Charades-STA) into RAM for fast training. To disable this behavior, set in_memory to FALSE in the config file (config.yaml).

conda activate tg
# LGI model for ActivityNet Captions dataset
bash scripts/train_model.sh LGI tgn_lgi anet 0 4 0
# LGI model for Charades-STA dataset
bash scripts/train_model.sh LGI tgn_lgi charades 0 4 0
# LGI model for ActivityNet Captions dataset (using Docker)
bash scripts/run_docker.sh "bash scripts/train_model.sh LGI tgn_lgi anet 0 4 0" "gpu0"
# LGI model for Charades-STA dataset (using Docker)
bash scripts/run_docker.sh "bash scripts/train_model.sh LGI tgn_lgi charades 0 4 0" "gpu0"

5. Visualization

For visualization, the moviepy package and the raw videos are required.

# Path to directory for raw videos
ActivityNet Captions: data/anet/raw_videos/validation/
Charades-STA: data/charades/raw_videos/

Refer to visualization.ipynb

6. Citation

If you use this code in a publication, please cite our paper.

@inproceedings{mun2020LGI,
    title     = "{Local-Global Video-Text Interactions for Temporal Grounding}",
    author    = {Mun, Jonghwan and Cho, Minsu and Han, Bohyung},
    booktitle = {CVPR},
    year      = {2020}
}