This is the official PyTorch implementation of EZ-CLIP: Efficient Zero-Shot Video Action Recognition [[arXiv]](https://arxiv.org/abs/2312.08010).
## Updates
- Trained model download links are available on Google Drive.
## Overview

### Introduction
In this study, we present EZ-CLIP, a simple and efficient adaptation of CLIP that addresses the challenges of adapting image-pretrained CLIP to video action recognition. EZ-CLIP leverages temporal visual prompting for seamless temporal adaptation, requiring no fundamental alterations to the core CLIP architecture while preserving its remarkable generalization abilities. Moreover, we introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion, thereby enhancing learning from video data.
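At a high level, the approach can be pictured as learnable prompt tokens attached to the per-frame tokens of a frozen image encoder, trained with an objective that rewards temporal variation. The sketch below only illustrates that idea and is not the released implementation: the module name, token shapes, and the toy motion loss are all assumptions made for illustration.

```python
# Illustrative sketch only -- NOT the EZ-CLIP implementation.
import torch
import torch.nn as nn


class TemporalVisualPrompts(nn.Module):
    """Learnable prompt tokens prepended to the token sequence of every frame."""

    def __init__(self, dim: int = 512, num_prompts: int = 4, num_frames: int = 8):
        super().__init__()
        # One prompt set per frame so the prompts can specialise over time.
        self.prompts = nn.Parameter(0.02 * torch.randn(num_frames, num_prompts, dim))

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames, tokens, dim) from a frozen image encoder.
        batch = frame_tokens.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1, -1)
        # Concatenate the prompts in front of each frame's tokens.
        return torch.cat([prompts, frame_tokens], dim=2)


def toy_motion_loss(frame_embeddings: torch.Tensor) -> torch.Tensor:
    # Toy stand-in for a motion-focused objective: reward temporal variation
    # of the per-frame embeddings (the paper's actual loss differs).
    return -frame_embeddings.var(dim=1).mean()


if __name__ == "__main__":
    tokens = torch.randn(2, 8, 197, 512)          # ViT-B/16-like tokens for 8 frames
    prompted = TemporalVisualPrompts()(tokens)    # -> (2, 8, 201, 512)
    frame_emb = prompted.mean(dim=2)              # crude per-frame pooling
    print(prompted.shape, toy_motion_loss(frame_emb).item())
```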
## Content

### Prerequisites
We provide a requirements.txt to help you install the required libraries. You can initialize the environment with `pip install -r requirements.txt`.
### Model Zoo
NOTE: All models in the experiments below use the publicly available ViT-B/16 based CLIP model.
#### Zero-shot results
All models are trained on Kinetics-400 and then evaluated directly on downstream datasets.
Model | Input | HMDB-51 | UCF-101 | Kinetics-600 | Checkpoint |
---|---|---|---|---|---|
EZ-CLIP (ViT-B/16) | 8x224 | 52.9 | 79.1 | 70.1 | link |
#### Base-to-novel generalization results
Here, we divide each dataset into base and novel classes. All models are trained on the base classes and evaluated on both base and novel classes. HM denotes the harmonic mean of the base and novel accuracies.
Dataset | Input | Base Acc. | Novel Acc. | HM | Checkpoint |
---|---|---|---|---|---|
K-400 | 8x224 | 73.1 | 60.6 | 66.3 | link |
HMDB-51 | 8x224 | 77.0 | 58.2 | 66.3 | link |
UCF-101 | 8x224 | 94.4 | 77.9 | 85.4 | link |
SSV2 | 8x224 | 16.6 | 13.3 | 14.8 | link |
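The checkpoints linked above are hosted on Google Drive. Assuming they are ordinary PyTorch checkpoint files, a downloaded file can be inspected with a few lines such as the sketch below; the file name and the wrapping key are illustrative, not taken from this repository.

```python
import torch

# Hypothetical file name for a downloaded checkpoint.
checkpoint = torch.load("ez_clip_k400.pt", map_location="cpu")
# Some checkpoints wrap the weights, e.g. {"model_state_dict": {...}, "epoch": ...}.
state_dict = checkpoint.get("model_state_dict", checkpoint) if isinstance(checkpoint, dict) else checkpoint
for name, value in list(state_dict.items())[:10]:
    if isinstance(value, torch.Tensor):
        print(name, tuple(value.shape))
```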
### Data Preparation
Videos first need to be extracted into frames for fast reading; please refer to 'Dataset_creation_scripts' for the data pre-processing. We have successfully trained on Kinetics-400, UCF-101, HMDB-51, and SSv2.
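As a rough illustration of what the frame-extraction step involves, the sketch below decodes a video with OpenCV and writes one JPEG per frame. It is not the repository's pre-processing script (use 'Dataset_creation_scripts' for the published splits), and the output layout is only an assumption.

```python
import os

import cv2  # pip install opencv-python


def extract_frames(video_path: str, out_dir: str) -> int:
    """Decode a video and save every frame as a JPEG; returns the frame count."""
    os.makedirs(out_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        count += 1
        cv2.imwrite(os.path.join(out_dir, f"img_{count:05d}.jpg"), frame)
    capture.release()
    return count


if __name__ == "__main__":
    n = extract_frames("example.mp4", "frames/example")  # illustrative paths
    print(f"extracted {n} frames")
```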
### Training

```bash
# Train
python train.py --config configs/K-400/k400_train.yaml
```
### Testing

```bash
# Test
python test.py --config configs/ucf101/UCF_zero_shot_testing.yaml
```
## Citation
If you find the code and pre-trained models useful for your research, please consider citing our paper:
```bibtex
@article{ez2022clip,
  title={EZ-CLIP: Efficient Zeroshot Video Action Recognition},
  author={Shahzad Ahmad and Sukalpa Chanda and Yogesh S Rawat},
  journal={arXiv preprint arXiv:2312.08010},
  year={2024}
}
```
## Acknowledgments
Our code is based on ActionCLIP.