Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection

Authors: Yujiang Pu, Xiaoyu Wu, Lulu Yang, Shengjin Wang

Abstract

Video anomaly detection under weak supervision presents significant challenges, particularly due to the lack of frame-level annotations during training. While prior research has utilized graph convolution networks and self-attention mechanisms alongside multiple instance learning (MIL)-based classification loss to model temporal relations and learn discriminative features, these methods often employ multi-branch architectures to capture local and global dependencies separately, resulting in increased parameters and computational costs. Moreover, the coarse-grained interclass separability provided by the binary constraint of MIL-based loss neglects the fine-grained discriminability within anomalous classes. In response, this paper introduces a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability. We present a Temporal Context Aggregation (TCA) module that captures comprehensive contextual information by reusing the similarity matrix and implementing adaptive fusion. Additionally, we propose a Prompt-Enhanced Learning (PEL) module that integrates semantic priors using knowledge-based prompts to boost the discriminative capacity of context features while ensuring separability between anomaly sub-classes. Extensive experiments validate the effectiveness of our method's components, demonstrating competitive performance with reduced parameters and computational effort on three challenging benchmarks: UCF-Crime, XD-Violence, and ShanghaiTech datasets. Notably, our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.

Contents

1. Introduction
2. Requirements
3. Datasets
4. Quick Start
5. Results and Models
6. Acknowledgement
7. Citation

Introduction

This repo is the official implementation of "Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection" (IEEE-TIP). The original paper can be found here. Please feel free to contact me if you have any questions.

Requirements

The code requires python>=3.8 and the following packages:

torch==1.8.0
torchvision==0.9.0
numpy==1.21.2
scikit-learn==1.0.1
scipy==1.7.2
pandas==1.3.4
tqdm==4.63.0
xlwt==2.5

The environment with required packages can be created directly by running the following command:

conda env create -f environment.yml

Datasets

For the UCF-Crime and XD-Violence datasets, we use off-the-shelf features extracted by Wu et al. For the ShanghaiTech dataset, we use this repo to extract I3D features (highly recommended).

Dataset       Origin Video  I3D Features
UCF-Crime     homepage      download link
XD-Violence   homepage      download link
ShanghaiTech  homepage      download link

Before the Quick Start, please download the features above and set feat_prefix in config.py to your local path.

Quick Start

Change the hyperparameters in config.py if necessary; the defaults match the settings reported in our paper. An example configuration for UCF-Crime is shown below:

dataset = 'ucf-crime'
model_name = 'ucf_'
metrics = 'AUC'  # the evaluation metric
feat_prefix = '/data/pyj/feat/ucf-i3d'  # the prefix path of the video features
train_list = './list/ucf/train.list'  # the split file of training set
test_list = './list/ucf/test.list'  #  the split file of test/infer set
token_feat = './list/ucf/ucf-prompt.npy'  # the prompt feature extracted by CLIP
gt = './list/ucf/ucf-gt.npy'  # the ground-truth of test videos

# TCA settings
win_size = 9  # the local window size
gamma = 0.6  # initialization for DPE
bias = 0.2  # initialization for DPE 
norm = True  # whether adaptive fusion uses normalization

# CC settings
t_step = 9  # the kernel size of causal convolution

# training settings
temp = 0.09  # the temperature for contrastive learning
lamda = 1  # the loss weight
seed = 9  # random seed

# test settings
test_bs = 10  # test batch size
smooth = 'slide'  # the type of score smoothing ['None', 'fixed': 10, 'slide': 7]
kappa = 7  # the smoothing window
ckpt_path = './ckpt/ucf__8636.pkl'

python main.py --dataset 'ucf' --mode 'train'  # dataset:['ucf', 'xd', 'sh']  mode:['train', 'infer']
python main.py --dataset 'ucf' --mode 'infer'  # dataset:['ucf', 'xd', 'sh']  mode:['train', 'infer']
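
The smooth and kappa options control how frame-level anomaly scores are smoothed at test time. As a rough illustration only, the sketch below assumes that 'slide' replaces each score with the mean of a window of kappa consecutive scores starting at that frame, truncated at the end of the video. The function name slide_smooth_sketch and this windowing convention are assumptions made for illustration; they are not taken from the repository code, which may pad or center the window differently.

```python
from typing import List


def slide_smooth_sketch(scores: List[float], kappa: int = 7) -> List[float]:
    """Hypothetical sliding-window mean over frame-level anomaly scores.

    Assumption: output i is the mean of up to `kappa` consecutive scores
    starting at index i; the window is truncated at the end of the list.
    """
    if kappa < 1:
        raise ValueError("kappa must be >= 1")
    smoothed = []
    for i in range(len(scores)):
        window = scores[i:i + kappa]  # shorter near the end of the video
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    # Toy scores with an isolated spike and a sustained anomalous segment.
    raw = [0.0, 0.0, 1.0, 0.0, 0.0, 0.9, 1.0, 0.8]
    print(slide_smooth_sketch(raw, kappa=3))
```

Averaging over a short window like this damps isolated single-frame spikes while keeping sustained high-score segments, which is the usual motivation for smoothing scores before evaluation.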

Results and Models

Below are the results with score smoothing in the testing phase. Note that our experiments were conducted on a single Tesla A40 GPU, and different torch or CUDA versions can lead to slightly different results.

Dataset       AUC (%)  AP (%)  FAR (%)  ckpt  log
UCF-Crime     86.76    33.99   0.47     link  link
XD-Violence   94.94    85.59   0.57     link  link
ShanghaiTech  98.14    72.56   0.00     link  link
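
For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute a frame-level AUC and a false alarm rate (FAR) from per-frame scores and binary labels. It is written in plain Python so that it is self-contained. The function names, the tie handling, and the choice of 0.5 as the FAR threshold applied to normal frames are assumptions; the repository's evaluation code may differ.

```python
from typing import Sequence


def frame_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the probability that a random anomalous frame scores
    higher than a random normal frame (ties count as 0.5).

    Quadratic in the number of frames, so for illustration only; in
    practice a library routine such as sklearn.metrics.roc_auc_score
    is the usual choice.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one frame of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def false_alarm_rate(scores: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.5) -> float:
    """Fraction of normal frames whose score reaches the threshold.

    The 0.5 default is an assumed convention, not taken from the repo.
    """
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not neg:
        raise ValueError("need at least one normal frame")
    return sum(1 for s in neg if s >= threshold) / len(neg)


if __name__ == "__main__":
    toy_scores = [0.1, 0.2, 0.8, 0.9, 0.6]
    toy_labels = [0, 0, 1, 1, 0]  # 1 = anomalous frame, 0 = normal frame
    print("AUC:", frame_auc(toy_scores, toy_labels))
    print("FAR:", false_alarm_rate(toy_scores, toy_labels))
```

In the toy example, every anomalous frame outscores every normal frame, so the ranking is perfect even though one normal frame crosses the threshold; this is why AUC and FAR are reported side by side.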

Acknowledgement

Our codebase is mainly based on XDVioDet and CLIP. We greatly appreciate their excellent contributions and well-organized code!

Citation

If this repo is helpful for your research, please consider citing our paper. Thanks!

@article{pu2023learning,
  title={Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection},
  author={Pu, Yujiang and Wu, Xiaoyu and Yang, Lulu and Wang, Shengjin},
  journal={arXiv preprint arXiv:2306.14451},
  year={2023}
}