Awesome
AR-Seg
Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos <br> Yubin Hu, Yuze He, Yanghao Li, Jisheng Li, Yuxing Han, Jiangtao Wen, Yong-Jin Liu <br> CVPR 2023
Introduction
AR-Seg is an efficient video semantic segmentation framework for compressed videos. It consists of an HR branch for keyframes and an LR branch for non-keyframes.
<p align="center" width="100%"> <img width="80%" src="./static/diagram.png"> </p>We design a Cross Resolution Feature Fusion (CReFF) module and a Feature Similarity Training (FST) strategy to compensate for the performance drop because of low-resolution.
<p align="center" width="100%"> <img width="80%" src="./static/method.png"> </p>Environment
Create from Conda Config
conda env create -f environment.yml
conda activate AR-Seg
Create with Separate Steps
conda create -n AR-Seg python=3.6
conda activate AR-Seg
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=11.0 -c pytorch
pip install -r requirements.txt
Dataset & Pre-processing
Please refer to the documentation.
Evaluation
We provide the sample code, checkpoints and processed data for evaluting AR-Seg on CamVid and Cityscapes datasets.
Checkpoints
Please download the checkpoints from TsinghuaCloud / GoogleDrive. And then unzip the files into directory ./checkpoints/
. After unzipping, the directory structure should look like ./checkpoints/camvid-bise18/HR/
.
We release the checkpoints trained on CamVid for different LR branch resolutions, ranging from 0.3x to 0.9x. As for Cityscapes, we release the checkpoints trained for 0.5x LR resolution.
Processed Data
You can pre-process the CamVid and Cityscapes dataset following the instructions in documentation, and then place the processed data under ./data/
.
Or you can download our example processed data of CamVid from TsinghuaCloud / GoogleDrive. And then unzip the files into directory ./data/
. After unzipping, the directory structure should look like ./data/camvid-sequence/3M-GOP12
.
Run the Evaluation Script
You can run the evaluation script with different backbones on different datasets.
python evaluation.py --dataset [camvid(default) or cityscapes] --backbone [psp18(default) or bise18] --mode [1 1 1(default) or 0 0 1 or etc.]
For example, if you want to evaluate the HR branch performance with BiseNet-18 on CamVid, you can run the script below.
python evaluation.py --dataset camvid --backbone bise18 --mode 1 0 0
Check the Evaluation Results
The evaluation results will be stored under ./evaluation-result
. Each file contains $L$ $mIoU_d$ values for each reference distance, ranging from 0 to $L-1$, and the average $mIoU$ in the last row. In the example case, we have $L=12$.
Training
Soft Link of the Processed Dataset
CamVid:
ln -s camvid_root ./data/CamVid
ln -s camvid_sequence_root ./data/camvid-sequence
Note that the camvid_root
and camvid_sequence_root
is the same to the one you set when processing the dataset following documentation.
Cityscapes:
ln -s cityscapes_root ./data/cityscapes
ln -s cityscapes_root/leftImg8bit_sequence ./data/cityscapes-sequence
Note that the cityscapes_root
is the same to the one you set when processing the dataset following documentation.
Phase 1: Training of the HR branch
For phase 1, you can use a pre-trained image segmentation model or train an image segmentation model from scratch.
Train on CamVid:
## PSPNet-18
python train.py --data-path=./data/CamVid/ --models-path=./exp/pspnet18-camvid/scale1.0_epoch100_pure --backend='resnet18' --batch-size=8 --epochs=100 --scale=1.0 --gpu=4
## BiseNet-18
python train.py --data-path=./data/CamVid/ --models-path=./exp/bisenet18-camvid/scale1.0_epoch100_pure --backend='resnet18' --batch-size=8 --epochs=100 --scale=1.0 --gpu=7 --model_type=bisenet
Train on Cityscapes:
## PSPNet-18
python train.py --data-path=./data/cityscapes --models-path=./exp/pspnet18-cityscapes/scale1.0_epoch200_pure_bs8_0.5-2.0-aug-512x1024-lr-0.01-semsegPSP --backend='resnet18' --batch-size=8 --epochs=200 --scale=1.0 --gpu=4 --start-lr=0.01 --model_type=pspnet --dataset=cityscapes
## For BiseNet18, we directy use a pretrained model and convert its format.
Phase 2: Training of the LR branch
Train on CamVid:
## PSPNet-18
python train_pair.py --data-path=./data/CamVid/ --sequence-path=./data/camvid-sequence --models-path=./exp/pspnet18-camvid/paper/camvid-psp18-scale0.5-3M-GOP12-30fps/ --backend='resnet18' --batch-size=8 --epochs=100 --scale=0.5 --gpu=0,1 --feat_loss=mse --stage1_epoch=50 --ref_gap=12 --with_motion=1
## BiseNet-18
python train_pair.py --data-path=./data/CamVid/ --sequence-path=./data/camvid-sequence --models-path=./exp/bisenet18-camvid/paper/camvid-bise18-scale0.5-3M-GOP12-30fps/ --backend='resnet18' --batch-size=8 --epochs=100 --scale=0.5 --gpu=0 --feat_loss=mse --stage1_epoch=50 --ref_gap=12 --with_motion=1 --model_type=bisenet
Train on Cityscapes:
## PSPNet-18
python convert_model_for_cityscapes.py --backbone psp18
python train_pair.py --data-path=./data/cityscapes --sequence-path=./data/cityscapes-sequence --models-path=./exp/pspnet18-cityscapes/paper/cityscapes-psp18-scale0.5-5M-GOP12-30fps_0.01_epoch200-semseg-auxLoss/ --backend='resnet18' --batch-size=8 --epochs=200 --scale=0.5 --gpu=1,2 --feat_loss=mse --stage1_epoch=0 --ref_gap=12 --with_motion=1 --model_type=pspnet --start-lr=0.01 --dataset=cityscapes --bitrate=5
## BiseNet-18
python convert_model_for_cityscapes.py --backbone bise18
python train_pair.py --data-path=./data/cityscapes --sequence-path=./data/cityscapes-sequence --models-path=./exp/bisenet18-cityscapes/paper/cityscapes-bise18-scale0.5-5M-GOP12-30fps_0.01_epoch200 --backend='resnet18' --batch-size=16 --epochs=200 --scale=0.5 --gpu=2 --feat_loss=mse --start-lr=0.01 --stage1_epoch=0 --ref_gap=12 --with_motion=1 --model_type=bisenet --dataset=cityscapes --bitrate=5
If you want to train on the Cityscapes dataset, please download the initialization checkpoints of BiseNet from TsinghuaCloud / GoogleDrive. And then unzip the files into directory ./cityscapes_pretrained/
.
Citation
@InProceedings{Hu_2023_CVPR,
author = {Hu, Yubin and He, Yuze and Li, Yanghao and Li, Jisheng and Han, Yuxing and Wen, Jiangtao and Liu, Yong-Jin},
title = {Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {22627-22637}
}