Leveraging Temporal Contextualization for Video Action Recognition

[ECCV 2024] Leveraging Temporal Contextualization for Video Action Recognition
Minji Kim†, Dongyoon Han, Taekyung Kim*, Bohyung Han* <br> <sub> (†Work done during an internship at NAVER AI Lab, *corresponding authors) </sub> <br> NAVER AI LAB


Official PyTorch implementation of the ECCV 2024 paper "Leveraging Temporal Contextualization for Video Action Recognition"

<br>

[Figure: temporal_modeling_comparison]

Abstract

We propose a novel framework for video understanding, called Temporally Contextualized CLIP (TC-CLIP), which leverages essential temporal information through global interactions in a spatio-temporal domain within a video. Specifically, we introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos, which 1) extracts core information from each frame, 2) connects relevant information across frames and summarizes it into context tokens, and 3) leverages the context tokens for feature encoding. Furthermore, the Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality. Extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition validate the effectiveness of our model. Ablation studies for TC and VP support our design choices.

:rocket: Highlights

:exclamation: Motivation: insufficient token interactions in recent temporal modeling approaches

[Figure: teaser_attention]

Prior works consider temporal cues during the encoding process via (a) Cross-Frame Attention with CLS token interactions or (b) Temporal Window Expansion by adding adjacent frame tokens to key-value pairs. However, the former lacks patch-level details, while the latter limits the range of temporal interactions. (c) Joint Space-Time Attention allows full interactions across all tokens, but exhibits weak discriminability due to sparse attention on the backgrounds and faces extrapolation challenges (see the paper for details). (d) Temporal Contextualization (Ours) aggregates pivotal tokens from a broader range into key-value pairs, successfully focusing on informative regions across all frames.

:sparkles: Temporally Contextualized CLIP (TC-CLIP)

A novel video understanding framework that leverages holistic video information within its encoding process.

  1. Temporal Contextualization (TC): Unlike prior approaches that access only a limited number of tokens, TC allows global interactions by summarizing informative tokens from the entire video into context tokens and leveraging them during the feature encoding process.
  2. Video-conditional Prompting (VP): Based on the summarized context tokens from the visual domain, VP generates instance-level textual prompts that compensate for the lack of textual semantics in action recognition datasets.
  3. Solid performance: TC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully-supervised settings on five video action recognition benchmarks.
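
The three TC steps named above (select informative tokens per frame, summarize them across frames into context tokens, and use the context tokens as extra keys and values) can be illustrated with a toy, pure-Python sketch. This is not the repository's implementation: the function names, the importance scores, and the greedy pairwise-averaging merge are all assumptions made for illustration, and the real model operates on PyTorch tensors inside CLIP's transformer layers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_seed_tokens(frame_tokens, scores, k):
    """Step 1 (assumed form): keep the k highest-scoring patch tokens of one frame."""
    top = sorted(range(len(frame_tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [frame_tokens[i] for i in sorted(top)]


def summarize(tokens, num_context):
    """Step 2 (assumed form): greedily average the most similar pair of tokens
    until only num_context tokens remain."""
    groups = [(list(t), 1) for t in tokens]  # (mean vector, member count)
    while len(groups) > num_context:
        i, j = max(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda p: cosine(groups[p[0]][0], groups[p[1]][0]),
        )
        (va, na), (vb, nb) = groups[i], groups[j]
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(va, vb)]
        groups[i] = (merged, na + nb)
        del groups[j]  # j > i, so index i is unaffected
    return [vec for vec, _ in groups]


def attend(query, keys_values):
    """Step 3: single-head softmax attention; keys double as values for brevity."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, kv)) / scale for kv in keys_values]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    out = [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(len(query))]
    return out, weights


# Toy video: 3 frames, 4 patch tokens each, feature dimension 2.
video = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.1]],
    [[1.0, 0.1], [0.0, 0.9], [0.1, 0.0], [0.8, 0.2]],
    [[0.0, 1.0], [0.9, 0.0], [0.1, 1.0], [0.0, 0.1]],
]
# Hypothetical per-token importance scores (made-up values).
scores = [
    [0.9, 0.2, 0.8, 0.1],
    [0.7, 0.9, 0.1, 0.3],
    [0.8, 0.6, 0.2, 0.1],
]

seeds = [tok for frame, s in zip(video, scores) for tok in select_seed_tokens(frame, s, k=2)]
context_tokens = summarize(seeds, num_context=2)

# A frame-0 query attends to its own patches plus the video-level context tokens.
frame0_keys = video[0] + context_tokens
output, weights = attend(video[0][0], frame0_keys)
print(len(seeds), len(context_tokens), len(frame0_keys))  # 6 2 6
```

The point of the sketch is the shape of the key-value set: each frame keeps its own four patch tokens and gains only two extra context tokens that summarize all three frames, rather than attending to every token in the video.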

:file_folder: Models

We use CLIP ViT-B/16 for all experiments below. All the checkpoints can be downloaded at this link.

Zero-shot action recognition

| Scripts | HMDB-51 | UCF-101 | Kinetics-600 | Ckpt |
|---|---|---|---|---|
| TC-CLIP | 54.2 ± 0.7 | 82.9 ± 0.6 | 75.8 ± 0.5 | Link |
| TC-CLIP (LLM) | 56.0 ± 0.3 | 85.4 ± 0.8 | 78.1 ± 1.0 | Link |

Few-shot action recognition

| Scripts | HMDB-51 (K=2 / K=4 / K=8 / K=16) | UCF-101 (K=2 / K=4 / K=8 / K=16) | SSv2 (K=2 / K=4 / K=8 / K=16) | Ckpt |
|---|---|---|---|---|
| TC-CLIP | 57.3 / 62.3 / 67.3 / 68.6 | 85.9 / 89.9 / 92.5 / 94.6 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (LLM) | 58.6 / 63.3 / 65.5 / 68.8 | 86.8 / 90.1 / 92.0 / 94.3 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (P) | 65.3 / 68.5 / 71.4 / 73.0 | 94.1 / 95.6 / 96.6 / 97.3 | 8.7 / 10.1 / 12.1 / 15.2 | Link |

Base-to-novel generalization

| Scripts | K-400 (Base / Novel / HM) | HMDB-51 (Base / Novel / HM) | UCF-101 (Base / Novel / HM) | SSv2 (Base / Novel / HM) | Ckpt |
|---|---|---|---|---|---|
| TC-CLIP | 78.9 / 63.6 / 70.4 | 73.3 / 54.1 / 62.2 | 95.5 / 78.0 / 85.9 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (LLM) | 79.1 / 65.4 / 71.6 | 73.3 / 59.1 / 65.5 | 95.4 / 81.6 / 88.0 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (P) | N/A | 79.4 / 58.3 / 67.2 | 97.5 / 84.5 / 90.5 | 19.6 / 15.6 / 17.4 | Link |
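
The HM column is not defined in this README. Assuming it denotes the harmonic mean of base and novel accuracy, as is common in base-to-novel evaluation, it can be recomputed with the short helper below (the function name is illustrative, not part of the codebase).

```python
def harmonic_mean(base, novel):
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    if base + novel == 0:
        return 0.0
    return 2 * base * novel / (base + novel)


# TC-CLIP on K-400 and UCF-101, using the Base / Novel entries from the table.
print(round(harmonic_mean(78.9, 63.6), 1))  # 70.4
print(round(harmonic_mean(95.5, 78.0), 1))  # 85.9
```

Recomputing from the rounded table entries can differ from the reported HM in the last digit. For example, the HMDB-51 entries 73.3 and 54.1 give about 62.25, while the table reports 62.2, which would be consistent with the reported values being derived from unrounded accuracies.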

Fully-supervised action recognition

| Scripts | K-400 (Top-1) | K-400 (Top-5) | Ckpt |
|---|---|---|---|
| TC-CLIP | 85.2 | 96.9 | Link |

:hammer: Environments

Installation

Please follow the instructions in INSTALL.md.

Data preparation

Please follow the instructions in DATASETS.md for data preparation.

Configuration

The organization of configurations in this project is outlined in CONFIG.md.

:dizzy: Training and Evaluation

The basic usage of the commands for training and evaluation is outlined below. For detailed instructions on all experimental setups, please refer to TRAIN_EVAL.md.

Training for TC-CLIP

For all experiments in our main paper, we provide example training commands in the scripts/train folder. The basic usage of the training command is as follows:

# Basic usage:
torchrun --nproc_per_node=4 main.py -cn ${protocol} \
data=${protocol}_${dataset_name} output=${your_ckpt_saving_path} trainer=${trainer_name}

# Example:
torchrun --nproc_per_node=4 main.py -cn zero_shot \
data=zero_shot_k400 output=ckpt/zero_shot_k400_tc_clip trainer=tc_clip

Evaluation for TC-CLIP

We provide example evaluation commands in the scripts/eval folder. The basic usage of the evaluation command is as follows:

# Basic usage:
torchrun --nproc_per_node=4 main.py -cn ${protocol} \
data=${protocol}_${dataset_name} output=${your_result_saving_path} \
trainer=${trainer_name} eval=test resume=${ckpt_path}

# Example:
torchrun --nproc_per_node=4 main.py -cn zero_shot \
data=zero_shot_k400 output=/PATH/TO/OUTPUT \
trainer=tc_clip eval=test resume=ckpt/zero_shot_k400_tc_clip/best.pth
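
When launching many runs, it can help to assemble these commands programmatically. The helper below is a hypothetical convenience wrapper, not part of this repository; it only reproduces the argument pattern of the training and evaluation commands shown in this section and does not execute anything.

```python
import shlex


def build_command(protocol, dataset, output, trainer="tc_clip", nproc=4, resume=None):
    """Assemble the torchrun argument list; passing `resume` switches to evaluation."""
    args = [
        "torchrun", f"--nproc_per_node={nproc}", "main.py", "-cn", protocol,
        f"data={protocol}_{dataset}", f"output={output}", f"trainer={trainer}",
    ]
    if resume is not None:
        # Evaluation adds eval=test and the checkpoint to load.
        args += ["eval=test", f"resume={resume}"]
    return args


train_cmd = build_command("zero_shot", "k400", "ckpt/zero_shot_k400_tc_clip")
eval_cmd = build_command(
    "zero_shot", "k400", "/PATH/TO/OUTPUT",
    resume="ckpt/zero_shot_k400_tc_clip/best.pth",
)
print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the environment and datasets are set up.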

:telephone: Contact

If you have any questions, please create an issue on this repository or contact us at taekyung.k@navercorp.com and minji@snu.ac.kr.

:thumbsup: Acknowledgements

This project is built upon ViFi-CLIP and borrows features from FROSTER and ToMe. We sincerely thank the authors for these great codebases.

:lock: License

TC-CLIP
Copyright (c) 2024-present NAVER Cloud Corp.
CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)

:pushpin: Citation

If you find TC-CLIP useful in your research, please consider citing our paper:

@inproceedings{kim2024tcclip,
  title={Leveraging Temporal Contextualization for Video Action Recognition},
  author={Kim, Minji and Han, Dongyoon and Kim, Taekyung and Han, Bohyung},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}