PVLR

[ACM MM 2024] Official PyTorch Implementation of Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization


<img src="fig/figure2_framework.png" width="1280">

Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization<br> Geuntaek Lim (Sejong Univ.), Hyunwoo Kim (Sejong Univ.), Joonsoo Kim (ETRI), and Yukyung Choi† (Sejong Univ.)

Abstract: Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods.

Prerequisites

Recommended Environment

conda env create -f environment.yaml
conda activate PVLR

Data Preparation

├── PVLR
   ├── data
      ├── thumos
          ├── Thumos14_CLIP
          ├── Thumos14-Annotations
          ├── Thumos14reduced
          └── Thumos14reduced-Annotations
      ├── annet
          ├── Anet_CLIP
          ├── ActivityNet1.2-Annotations
          └── ActivityNet1.3
├── PVLR
   ├── data
      ├── ...
      ├── ...
      ├── init_thumos.pth
      └── init_annet.pth
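
Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the official code: the `EXPECTED` list and the `missing_paths` helper are hypothetical names introduced here, and the paths are simply copied from the tree above, relative to the `PVLR` root. It only checks that each entry exists, since the tree does not say which entries are files and which are directories.

```python
from pathlib import Path

# Expected entries, copied from the directory tree above (relative to the PVLR root).
# This list is an assumption based on that tree, not something read from the code.
EXPECTED = [
    "data/thumos/Thumos14_CLIP",
    "data/thumos/Thumos14-Annotations",
    "data/thumos/Thumos14reduced",
    "data/thumos/Thumos14reduced-Annotations",
    "data/annet/Anet_CLIP",
    "data/annet/ActivityNet1.2-Annotations",
    "data/annet/ActivityNet1.3",
    "data/init_thumos.pth",
    "data/init_annet.pth",
]


def missing_paths(repo_root):
    """Return the expected entries that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the PVLR root directory.
    missing = missing_paths(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected data entries were found.")
```

Run the script from the `PVLR` root. An empty result only means the entries exist; it does not validate their contents.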

Run

Training

OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 python main.py --model-name PVLR

Inference

OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 python eval/inference.py --pretrained-ckpt output/ckpt/PVLR/Best_model.pkl

References

We referenced the repos below for the code.

✉ Contact

If you have any questions or comments, please open an issue.