Real-Time Video Inference on Edge Devices via Adaptive Model Streaming

Table of Contents

  - Installation
  - Running AMS
  - Extracting labels
  - Models & Checkpoints
  - Datasets
  - Citation
  - References

Installation

To install the required packages using Conda, run:

git clone https://github.com/modelstreaming/ams.git
cd ams
conda env create -f environment.yml

The current version relies on TensorFlow 1.x for training and inference.
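
As a quick sanity check (a minimal sketch; it only assumes the environment provides a TensorFlow 1.x build), you can confirm the installed version from inside the activated environment:

# check_tf.py -- verify that the active environment provides TensorFlow 1.x,
# which the training and inference code expects.
import tensorflow as tf

assert tf.__version__.startswith("1."), (
    "Expected a TensorFlow 1.x build, found " + tf.__version__)
print("TensorFlow", tf.__version__, "is ready")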

Running AMS

To run the code, use:

conda activate ams
python run.py --help

Extracting labels

To speed up experiments, we first extract teacher-inferred labels from video frames and save them for later use. Each video is defined by a number (VIDEO_NUM) and a name (VIDEO_NAME). Together they identify a video (e.g., 12 and la give 12-la.mp4) and are used to load meta information such as video length and class labels (from exp_configs.py). Videos from the four datasets described below have already been added.

Finally, to extract the labels, use this command:

python ams/extract_labels.py --input_video PATH_TO_VIDEO/VIDEO_NUM-VIDEO_NAME.mp4 \
    --dump_path PATH_TO_GT/VIDEO_NUM-VIDEO_NAME/ --teacher_checkpoint PATH_TO_TEACHER_MODEL
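
The path pattern above can be summarized with a short sketch (the 12-la example and the PATH_TO_* prefixes are placeholders taken from the description, not values hard-coded in the repo):

# Hypothetical illustration of the VIDEO_NUM-VIDEO_NAME convention; the actual
# meta information (video length, class labels) lives in exp_configs.py.
video_num, video_name = 12, "la"
video_id = "{}-{}".format(video_num, video_name)      # "12-la" -> 12-la.mp4
input_video = "PATH_TO_VIDEO/{}.mp4".format(video_id)
dump_path = "PATH_TO_GT/{}/".format(video_id)
print(input_video, dump_path)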

Models & Checkpoints

Student

For lightweight (student) models, we use DeeplabV3 with a MobileNetV2 backbone. We use the official pretrained checkpoints released in DeepLab's GitHub repo here. For compatibility with TF1 and our code, you may directly use the following checkpoints:

Teacher

For the teacher model, we use DeeplabV3 with an Xception65 backbone for the Outdoor Scenes, Cityscapes, and A2D2 datasets. For a compatible teacher checkpoint, use:

For the LVS dataset, we follow Mullapudi et al. [3] in using Mask R-CNN as the teacher and directly use the teacher labels they provide.
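
Before wiring a downloaded student or teacher checkpoint into the code, you can check that it is readable with the TF1 API; the sketch below only assumes a valid checkpoint prefix (the path is a placeholder):

# List a few variable names and shapes from a TF1 checkpoint to confirm the
# file is readable. Replace the placeholder path with your checkpoint prefix.
import tensorflow as tf

reader = tf.train.NewCheckpointReader("PATH_TO_TEACHER_MODEL/model.ckpt")
var_shapes = reader.get_variable_to_shape_map()
for name in sorted(var_shapes)[:10]:
    print(name, var_shapes[name])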

Datasets

Outdoor Scenes

The Outdoor Scenes video dataset that we introduce includes seven publicly available videos from YouTube, each between 7 and 15 minutes in duration. These videos span different levels of scene variability and were captured with four types of cameras: Stationary, Phone, Headcam, and DashCam. For each video, we manually select 5-7 classes that are detected frequently by our best semantic segmentation model (DeeplabV3 with Xception65 backbone, trained on Cityscapes data) at full resolution.

| Video | Link | Time Interval (min:sec) |
| --- | --- | --- |
| Interview | https://www.youtube.com/watch?v=zkIADOEhk5I | 00:20 - 07:25 |
| Dance recording | https://www.youtube.com/watch?v=2mtaoDYcisY | 00:06 - 15:20 |
| Street comedian | https://www.youtube.com/watch?v=1ESzHVhAKBI | 00:00 - 13:55 |
| Walking in Paris | https://www.youtube.com/watch?v=09oFgM5IHSI | 12:37 - 27:51 |
| Walking in NYC | https://www.youtube.com/watch?v=H_zosklgz18 | 100:40 - 115:43 |
| Driving in LA | https://www.youtube.com/watch?v=Cw0d-nqSNE8 | 08:24 - 23:38 |
| Running | https://www.youtube.com/watch?v=S9xzNyi_5TI | 16:21 - 29:56 |
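
One way to prepare clips for label extraction is to trim each downloaded video to the interval listed above; this is only a sketch (the filenames are placeholders, and it assumes ffmpeg is available on the system):

# Trim a downloaded video to the "Interview" interval from the table above.
# Filenames are placeholders; name the output VIDEO_NUM-VIDEO_NAME.mp4 to match
# the convention used by ams/extract_labels.py.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "interview_full.mp4",   # full-length download (placeholder)
    "-ss", "00:00:20", "-to", "00:07:25",   # 00:20 - 07:25 from the table
    "-c", "copy",                           # cut without re-encoding
    "1-interview.mp4",                      # hypothetical VIDEO_NUM-VIDEO_NAME.mp4
], check=True)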

Cityscapes

We use the entire Frankfurt unlabeled long video sequence from the Cityscapes dataset [1]. You can download this video here.

A2D2

We use the entire front-center camera video sequences from Gaimersheim, Ingolstadt, and Munich in the Audi Autonomous Driving Dataset (A2D2) [2].

LVS

To download the Long Videos Dataset (LVS) [3], you may check here.

Citation

You can cite this work using:

@inproceedings{khani2021real,
  title={Real-Time Video Inference on Edge Devices via Adaptive Model Streaming},
  author={Khani, Mehrdad and Hamadanian, Pouya and Nasr-Esfahany, Arash and Alizadeh, Mohammad},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={4572--4582},
  year={2021}
}

References

  1. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S. and Schiele, B., 2016. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3213-3223).
  2. Geyer, J., Kassahun, Y., Mahmudi, M., Ricou, X., Durgesh, R., Chung, A.S., Hauswald, L., Pham, V.H., Mühlegg, M., Dorn, S. and Fernandez, T., 2020. A2d2: Audi autonomous driving dataset. arXiv preprint arXiv:2004.06320.
  3. Mullapudi, R.T., Chen, S., Zhang, K., Ramanan, D. and Fatahalian, K., 2019. Online model distillation for efficient video inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3573-3582).