<div align="center"> <h1> 【CVPR'24】OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition </h1>

Tongjia Chen<sup>1</sup>, Hongshan Yu<sup>1</sup>, Zhengeng Yang<sup>2</sup>, Zechuan Li<sup>1</sup>, Wei Sun<sup>1</sup>, Chen Chen<sup>3</sup>

<sup>1</sup>HNU, <sup>2</sup>HNNU, <sup>3</sup>CRCV, UCF

</div>

In this work, we introduce a novel general video recognition pipeline OST. We prompt an LLM to augment category names into Spatio-Temporal Descriptors and refine the semantic knowledge via Optimal Descriptor Solver.
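
The snippet below is only an illustrative sketch of the descriptor-generation step: the prompt wording, LLM, and helper function are placeholders rather than the exact ones used in the paper.

```python
# Illustrative sketch of querying an LLM for Spatio-Temporal Descriptors.
# The prompt text, model name, and response handling are placeholders, not the
# exact implementation used for the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_descriptors(category: str, n: int = 4) -> str:
    """Ask an LLM for static (spatial) and motion (temporal) descriptors of a category."""
    prompt = (
        f"For the video action category '{category}', list {n} short phrases describing "
        f"its static visual cues (objects, scene, appearance), and {n} short phrases "
        f"describing how the action evolves over time. Label the two lists "
        f"'Spatial:' and 'Temporal:'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # Parsing the reply into descriptor lists (and encoding them with the CLIP text
    # encoder) is omitted here for brevity.
    return response.choices[0].message.content

print(generate_descriptors("archery"))
```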

<div align=center> <img width="500" alt="image" src="imgs/teaser.png"> </div> Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the textual discrepancy between descriptive narratives and concise category names. This oversight results in a less separable latent space, which may hinder video recognition. <div align=center> <img width="1080" alt="image" src="imgs/pipeline.png"> </div>

We query a Large Language Model to augment category names into corresponding Category Descriptors. These descriptors disentangle category names into Spatio-Temporal Descriptors that capture static visual cues and temporal evolution, respectively. To fully refine the textual knowledge, we propose an Optimal Descriptor Solver that adaptively aligns descriptors with video frames: an optimal matching flow is computed by iteratively solving an entropy-regularized optimal transport problem, assigning optimal descriptors to each video instance.
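
For intuition, the snippet below gives a minimal sketch of this matching step as an entropy-regularized optimal transport problem between frame features and descriptor features, solved with Sinkhorn iterations. It only illustrates the idea; the tensor shapes, entropy weight, and iteration count are illustrative defaults, not the values used in this repository.

```python
# Minimal Sinkhorn sketch of the descriptor-frame matching idea. Shapes, the entropy
# weight eps, and the iteration count are illustrative defaults, not the exact values
# used in this codebase.
import torch
import torch.nn.functional as F

def sinkhorn_matching(frame_feats, desc_feats, eps=0.05, n_iters=50):
    """frame_feats: (T, D) per-frame features; desc_feats: (K, D) descriptor features.
    Returns a (T, K) transport plan that assigns descriptors to frames."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    desc_feats = F.normalize(desc_feats, dim=-1)
    cost = 1.0 - frame_feats @ desc_feats.t()        # (T, K) cosine distance
    T, K = cost.shape
    mu = torch.full((T,), 1.0 / T)                   # uniform marginal over frames
    nu = torch.full((K,), 1.0 / K)                   # uniform marginal over descriptors
    Kmat = torch.exp(-cost / eps)                    # Gibbs kernel of the regularized OT
    u = torch.ones(T)
    for _ in range(n_iters):                         # Sinkhorn-Knopp iterations
        v = nu / (Kmat.t() @ u)
        u = mu / (Kmat @ v)
    return u.unsqueeze(1) * Kmat * v.unsqueeze(0)    # optimal matching flow (T, K)

# Example: 8 frames and 12 descriptors with 512-d CLIP-like features.
plan = sinkhorn_matching(torch.randn(8, 512), torch.randn(12, 512))
print(plan.shape, plan.sum())                        # torch.Size([8, 12]), ~1.0
```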

Todo

Environments

Our codebase is mainly built on ViFi-CLIP; please follow the instructions provided in their repository to set up the environment.

(Note that you may need to build your environment with mmcv 1.x.)

Train & Eval

For all the experiments reported in the main paper, we provide config files in the configs folder. For example, to train OST on Kinetics-400, you can run the following command:

```bash
python -m torch.distributed.launch --nproc_per_node=8 \
main_nce.py -cfg configs/zero_shot/train/k400/16_32_ost.yaml --output /PATH/TO/OUTPUT
```

To evaluate a model, please use the config file in the configs folder that matches the dataset and data split. For example, to evaluate OST in the zero-shot setting with 32 frames on UCF-101 zero-shot split 1, run the command below:

```bash
python -m torch.distributed.launch --nproc_per_node=8 \
main_nce.py -cfg configs/zero_shot/eval/ucf/16_32_ost_zs_ucf101_split1.yaml --output /PATH/TO/OUTPUT \
--only_test --resume /PATH/TO/CKPT
```

Please note that we use 8 GPUs in all of our main experiments; results may vary slightly with different environment settings and hardware.

Checkpoints

We use the OpenAI pretrained CLIP-B/16 model in all of our experiments. We provide checkpoints of OST in the zero-shot and few-shot settings below. All of the model checkpoints are available in the HuggingFace Space 🤗.

Zero-shot setting

For the zero-shot setting, the model is first fine-tuned on Kinetics-400 and then directly evaluated on three downstream datasets. We therefore provide the Kinetics-400 fine-tuned model weights for reproducing the zero-shot results reported in the main paper.

| Config | Input | HMDB-51 | UCF-101 | Kinetics-600 | Checkpoints |
|--------|-------|---------|---------|--------------|-------------|
| OST | $32 \times 224^2$ | 55.9 | 79.7 | 75.1 | Link |

Few-shot setting

For the few-shot setting, we follow the evaluation protocol of mainstream pipelines. We evaluate OST in two different settings: directly tuning on CLIP, and fine-tuning from the K400 checkpoint.

Directly tuning on CLIP

| Config | Input | Shots | Dataset | Top-1 Acc. | Checkpoints |
|--------|-------|-------|---------|------------|-------------|
| OST | $32 \times 224^2$ | 2 | HMDB-51 | 59.1 | Link |
| OST | $32 \times 224^2$ | 4 | HMDB-51 | 62.9 | Link |
| OST | $32 \times 224^2$ | 8 | HMDB-51 | 64.9 | Link |
| OST | $32 \times 224^2$ | 16 | HMDB-51 | 68.2 | Link |
| OST | $32 \times 224^2$ | 2 | UCF-101 | 82.5 | Link |
| OST | $32 \times 224^2$ | 4 | UCF-101 | 87.5 | Link |
| OST | $32 \times 224^2$ | 8 | UCF-101 | 91.7 | Link |
| OST | $32 \times 224^2$ | 16 | UCF-101 | 93.9 | Link |
| OST | $32 \times 224^2$ | 2 | Something-Something V2 | 7.0 | Link |
| OST | $32 \times 224^2$ | 4 | Something-Something V2 | 7.7 | Link |
| OST | $32 \times 224^2$ | 8 | Something-Something V2 | 8.9 | Link |
| OST | $32 \times 224^2$ | 16 | Something-Something V2 | 12.2 | Link |

Fine-tuned on K400

Please note that for this setting, you only need to replace the original CLIP weights with our Kinetics-400 fine-tuned model (a rough sketch of this replacement is given after the table below).

| Config | Input | Shots | Dataset | Top-1 Acc. | Checkpoints |
|--------|-------|-------|---------|------------|-------------|
| OST | $32 \times 224^2$ | 2 | HMDB-51 | 64.8 | Link |
| OST | $32 \times 224^2$ | 4 | HMDB-51 | 66.7 | Link |
| OST | $32 \times 224^2$ | 8 | HMDB-51 | 69.2 | Link |
| OST | $32 \times 224^2$ | 16 | HMDB-51 | 71.6 | Link |
| OST | $32 \times 224^2$ | 2 | UCF-101 | 90.3 | Link |
| OST | $32 \times 224^2$ | 4 | UCF-101 | 92.6 | Link |
| OST | $32 \times 224^2$ | 8 | UCF-101 | 94.4 | Link |
| OST | $32 \times 224^2$ | 16 | UCF-101 | 96.2 | Link |
| OST | $32 \times 224^2$ | 2 | Something-Something V2 | 8.0 | Link |
| OST | $32 \times 224^2$ | 4 | Something-Something V2 | 8.9 | Link |
| OST | $32 \times 224^2$ | 8 | Something-Something V2 | 10.5 | Link |
| OST | $32 \times 224^2$ | 16 | Something-Something V2 | 12.6 | Link |
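
As a rough sketch, replacing the original CLIP with the K400 fine-tuned weights amounts to loading the released checkpoint into the CLIP-B/16 backbone before few-shot tuning. The checkpoint layout (a state dict possibly wrapped under a 'model' key) is an assumption; adjust it to the released file if it differs.

```python
# Hypothetical sketch: initialize the CLIP backbone from the K400 fine-tuned checkpoint
# instead of the raw OpenAI weights. The 'model' key and strict=False are assumptions
# about how the released checkpoint is stored.
import clip
import torch

model, _ = clip.load("ViT-B/16", device="cpu")            # original OpenAI CLIP-B/16
ckpt = torch.load("/PATH/TO/K400_CKPT.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)                      # unwrap if saved as {'model': ...}
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```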

Citation

If you find this work useful, please consider citing our paper! ;-)

```bibtex
@inproceedings{chen2023ost,
    title={OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition},
    author={Chen, Tongjia and Yu, Hongshan and Yang, Zhengeng and Li, Zechuan and Sun, Wei and Chen, Chen},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024},
}
```

Acknowledgment

The work was done while Tongjia was a research intern mentored by Chen Chen. We thank Ming Li (UCF) and Yong He (UWA) for proof-reading and discussion.

This repository is built upon portions of ViFi-CLIP, MAXI, and Text4Vis. We sincerely thank the authors for releasing their code.