Streaming Video Model
Streaming Video Model <br> Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, Zheng-Jun Zha <br> CVPR 2023 <br>
Description
The streaming video model is a general-purpose video model that applies to a broad range of video understanding tasks. Traditionally, frame-based and sequence-based video understanding tasks have been handled by two separate architectures, each tailored to its own task type. The streaming video model is the first deep learning architecture that unifies the two. We build an instance of the streaming video model, namely the streaming video Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled, temporally-aware spatial encoder to serve frame-based video tasks. The frame features are then fed into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT are demonstrated by state-of-the-art accuracy on the sequence-based action recognition task and a competitive advantage over conventional architectures on the frame-based multiple object tracking (MOT) task.
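The two-stage design can be pictured with a minimal PyTorch-style sketch. This is illustrative only; the module and tensor names below are hypothetical and do not correspond to the actual classes in this repository:
# Minimal sketch of the streaming design (illustrative, not the repo's real implementation).
import torch
import torch.nn as nn

class StreamingVideoModelSketch(nn.Module):
    def __init__(self, spatial_encoder: nn.Module, temporal_decoder: nn.Module):
        super().__init__()
        self.spatial_encoder = spatial_encoder    # memory-enabled, temporally-aware frame encoder
        self.temporal_decoder = temporal_decoder  # task-related sequence head

    def forward(self, frames: torch.Tensor):
        # frames: (T, C, H, W), processed one frame at a time as in a streaming setting
        memory, frame_feats = None, []
        for frame in frames:
            # the (assumed) encoder contract: returns a frame feature and an updated memory
            feat, memory = self.spatial_encoder(frame.unsqueeze(0), memory)
            frame_feats.append(feat)              # frame-level features serve frame-based tasks (e.g. MOT)
        clip_feat = self.temporal_decoder(torch.stack(frame_feats))
        return frame_feats, clip_feat             # clip_feat serves sequence-based tasks (e.g. action recognition)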
Usage
Installation
Clone the repo and install requirements:
conda create -n svm python=3.7 -y
conda activate svm
conda install pytorch==1.12.0 torchvision==0.13.0 cudatoolkit=11.3 -c pytorch
pip install git+https://github.com/JonathonLuiten/TrackEval.git
pip install mmcv-full==1.7.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12/index.html
pip install mmdet==2.26.0
pip install -r requirements/build.txt
pip install --user -v -e .
pip install einops
pip install future tensorboard
pip install -U fvcore
pip install click imageio[ffmpeg] path
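Optionally, you can confirm that the core dependencies resolved correctly. This quick check is not part of the official instructions, just a convenience:
python -c "import torch, mmcv, mmdet; print(torch.__version__, mmcv.__version__, mmdet.__version__, torch.cuda.is_available())"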
Dataset preparation
Download the MOT17, CrowdHuman, and MOTSynth datasets and put them under the data directory. The data directory is structured as follows:
data
├── crowdhuman
│   ├── annotation_train.odgt
│   ├── annotation_val.odgt
│   ├── train
│   │   ├── Images
│   │   ├── CrowdHuman_train01.zip
│   │   ├── CrowdHuman_train02.zip
│   │   ├── CrowdHuman_train03.zip
│   ├── val
│   │   ├── Images
│   │   ├── CrowdHuman_val.zip
├── MOT17
│   ├── train
│   ├── test
├── MOTSynth
│   ├── videos
│   ├── annotations
Then, we need to convert all datasets to COCO format. We provide scripts to do this:
# crowdhuman
python ./tools/convert_datasets/crowdhuman2coco.py -i ./data/crowdhuman -o ./data/crowdhuman/annotations
# MOT17
python ./tools/convert_datasets/mot2coco.py -i ./data/MOT17/ -o ./data/MOT17/annotations --split-train --convert-det
# MOTSynth
python ./tools/convert_datasets/extract_motsynth.py --input_dir_path ./data/MOTSynth/videos --out_dir_path ./data/MOTSynth/train/
python ./tools/convert_datasets/motsynth2coco.py --anns ./data/MOTSynth/annotations --out ./data/MOTSynth/all_cocoformat.json
The processed dataset will be structured as follows:
data
├── crowdhuman
│   ├── annotation_train.odgt
│   ├── annotation_val.odgt
│   ├── train
│   │   ├── Images
│   │   ├── CrowdHuman_train01.zip
│   │   ├── CrowdHuman_train02.zip
│   │   ├── CrowdHuman_train03.zip
│   ├── val
│   │   ├── Images
│   │   ├── CrowdHuman_val.zip
│   ├── annotations
│   │   ├── crowdhuman_train.json
│   │   ├── crowdhuman_val.json
├── MOT17
│   ├── train
│   │   ├── MOT17-02-DPM
│   │   ├── ...
│   ├── test
│   ├── annotations
│   │   ├── half-train_cocoformat.json
│   │   ├── ...
├── MOTSynth
│   ├── videos
│   ├── annotations
│   ├── train
│   │   ├── 000
│   │   │   ├── img1
│   │   │   │   ├── 000001.jpg
│   │   │   │   ├── ...
│   │   ├── ...
│   ├── all_cocoformat.json
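As a quick sanity check of the conversion output, you can load one of the generated COCO-format files and count its entries. This is an optional snippet, not a script shipped with the repo:
# count images and annotations in a converted COCO-format file
import json

with open("./data/MOT17/annotations/half-train_cocoformat.json") as f:
    coco = json.load(f)
print(len(coco["images"]), "images,", len(coco["annotations"]), "annotations")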
Pretrained models
We use CLIP-pretrained ViT models. You can download them from here and put them under the pretrain directory.
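If you use the official OpenAI CLIP release, the checkpoint is a TorchScript archive, so a quick load test looks like the sketch below (optional; assumes the file is saved as ./pretrain/ViT-B-16.pt, the path used in the training command):
# verify the CLIP ViT-B/16 checkpoint loads; the OpenAI release is a TorchScript archive
import torch

clip_model = torch.jit.load("./pretrain/ViT-B-16.pt", map_location="cpu")
print(sum(p.numel() for p in clip_model.parameters()), "parameters")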
Training and Evaluation
Training on single node
bash ./tools/dist_train.sh configs/mot/svm/svm_base.py 8 --cfg-options \
model.detector.backbone.pretrain=./pretrain/ViT-B-16.pt
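If only a single GPU is available, a non-distributed run should also work, assuming MMTracking's standard tools/train.py entry point is kept in this repo:
# single-GPU training (assumes the standard MMTracking tools/train.py is present)
python ./tools/train.py configs/mot/svm/svm_base.py --cfg-options \
    model.detector.backbone.pretrain=./pretrain/ViT-B-16.pt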
Evaluation on MOT17 half validation set
bash ./tools/dist_test.sh configs/mot/svm/svm_test.py 8 \
--eval bbox track --checkpoint svm_motsync_ch_mot17half.pth
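A non-distributed evaluation follows the same pattern, again assuming MMTracking's standard tools/test.py is present:
# single-GPU evaluation on the MOT17 half validation set
python ./tools/test.py configs/mot/svm/svm_test.py \
    --checkpoint svm_motsync_ch_mot17half.pth --eval bbox track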
Main Results
MOT17
Method | Dataset | Train Data | MOTA | HOTA | IDF1 | URL |
---|---|---|---|---|---|---|
SVM | MOT17 | MOT17 half-train + crowdhuman + MOTSynth | 79.7 | 68.1 | 80.9 | model |
Citation
If you find this work useful in your research, please consider citing:
@InProceedings{Zhao_2023_CVPR,
    author    = {Zhao, Yucheng and Luo, Chong and Tang, Chuanxin and Chen, Dongdong and Codella, Noel and Zha, Zheng-Jun},
    title     = {Streaming Video Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14602-14612}
}
Acknowledgement
Our code is built on top of MMTracking and CLIP. Many thanks for their wonderful work.