
DeepDetect performance report

This report documents the performance of the DeepDetect Open Source Deep Learning server across a variety of platforms and a selection of popular or particularly effective neural network architectures. The full server source code is available from https://github.com/beniz/deepdetect.

Reference platforms

These results should serve as a reference for users interested in choosing the right NN model for their work on servers or embedded systems.

Ordered from most to least powerful:

Note that the 1080Ti and TX1 use NVIDIA's cuDNN acceleration library, while the TK1 uses a GPU implementation without cuDNN, and the Raspberry Pi uses CPU only.

For a detailed description of all platforms, see the dedicated platform section.

Reference networks

We conducted experiments with multiple contemporary neural network (NN) models.

FLOPS and Parameters

One important aspect of choosing a model is the limitations of the hardware, such as its computational throughput (in FLOPS) and the amount of available RAM. The number of FLOPs required for a single forward pass of each model is displayed below, along with its number of parameters (the weights of the network).

<table style="width:100%"> <tr> <th><img src="cost/cost.png" width="450"></th> <th><img src="cost/small_cost.png" width="450"></th> </tr> </table>
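As a rule of thumb, the parameter count translates directly into the RAM needed just to store the weights. A minimal sketch (parameter counts taken from the raw-data section at the end of this report; float32 storage is assumed):

```python
def model_memory_mb(params_millions, bytes_per_param=4):
    """RAM needed just to hold the weights, assuming float32 parameters."""
    return params_millions * 1e6 * bytes_per_param / (1024 ** 2)

# Parameter counts (in millions) from the raw-data section of this report.
vgg16_mb = model_memory_mb(138.34)       # weights alone need roughly 530 MB
squeezenet_mb = model_memory_mb(1.2444)  # under 5 MB
```

This is only the weight storage; activations and framework overhead add to the real footprint, which is why small-memory platforms also cap the usable batch size.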

Results Overview

Below are the measured performances, displayed on a log scale and reported per image in ms. When the batch size is greater than one, the reported value is the average time per image for that batch size. On GPUs and platforms with limited memory, not all batch sizes are applicable.
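To be explicit about the averaging, the plotted quantity can be sketched as follows (the numbers in the example are illustrative, not measurements from this report):

```python
def per_image_ms(batch_latency_ms, batch_size):
    """Amortized per-image time, as reported in the plots."""
    return batch_latency_ms / batch_size

# e.g. if one forward pass over a batch of 32 images takes 104 ms in total,
# the plotted value is 3.25 ms per image.
```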

With Caffe as a backend

The reported performances use a customized version of Caffe as the backend.

<table style="width:100%"> <tr> <th><img src="graph/gtx1080_log.png" width="450"></th> <th><img src="graph/TK1_log.png" width="450"></th> </tr> </table> <table style="width:100%"> <tr> <th><img src="graph/TX1_log.png" width="450"></th> <th><img src="graph/TX2_caffe_log.png" width="450"></th> </tr> </table> <table style="width:100%"> <tr> <th><img src="graph/Jetson-nano-log.png" width="450"></th> <th><img src="graph/Raspi_log.png" width="450"></th> </tr> </table>

With TensorRT as a backend

<table style="width:100%"> <tr> <th><img src="graph/TX2_TensorRT_log.png" width="450"></th> </tr> </table> <details> <summary>See linear-scale plot</summary>

*(Linear-scale version of the TX2 TensorRT plot.)*

</details>

With NCNN as a backend

The graph shows the performance difference between the Raspberry Pi 3 and the Raspberry Pi 4 (2 GB) using NCNN as a backend.

<table style="width:100%"> <tr> <th><img src="graph/NCNN_models_RPI3_RPI4.png" width="450"></th> </tr> </table>

Discussion

Platforms

*(Per-platform log-scale plots, each with a collapsible linear-scale variant.)*

Networks comparison across platforms

The reported performances use a customized version of Caffe as the backend. The results comparing each model across multiple platforms are displayed below. The legend color-codes the batch sizes. Note that not all batch sizes are available for all architectures.

<details> <summary>see all plots..</summary>

*(Comparison plots, one per network.)*

</details>

Selecting an embedded platform and network

The challenge of deploying a NN on an embedded system is the limited memory and computational resources: the model should have a small computational footprint without sacrificing accuracy. To this purpose we looked into three rather novel architectures: SqueezeNet, MobileNet and ShuffleNet.


MobileNet

Mobilenet is an implementation of Google's MobileNet. MobileNet has a Top-1 accuracy of 70.81% and a Top-5 accuracy of 89.5%, compared to the leading model in accuracy, DenseNet-201, with 77.31% Top-1 and 93.64% Top-5. The MobileNet architecture thus shows a rather minimal loss in accuracy while reducing the footprint from 4.7 GFLOPs to 0.56 GFLOPs.

The initial result was rather underwhelming: while faster than DenseNet-201, MobileNet came nowhere near the leading models in terms of speed. The reason lies in the vanilla implementation of grouped convolutions in Caffe. A dedicated rewrite of the depthwise convolutions (modified from https://github.com/BVLC/caffe/pull/5665) yielded an order-of-magnitude speed-up, making MobileNet usable again.
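The saving comes from replacing each standard convolution with a depthwise pass followed by a 1x1 pointwise pass. A minimal operation-count sketch (the layer shape is illustrative, not taken from the MobileNet paper):

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution (stride 1)."""
    return h * w * c_in * c_out * k * k

def separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k pass over each input channel, then a 1x1 pointwise mix."""
    return h * w * c_in * k * k + h * w * c_in * c_out

# Illustrative layer shape (not from any specific network).
std = conv_macs(56, 56, 128, 128, 3)       # 462,422,016 MACs
sep = separable_macs(56, 56, 128, 128, 3)  # 54,992,896 MACs
speedup = std / sep                        # ~8.4x fewer operations
```

The arithmetic saving is only realized when the depthwise pass is implemented efficiently, which is exactly what the dedicated Caffe rewrite above provides.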

Our baseline was customized from https://github.com/shicai/MobileNet-Caffe.

The performance gain over the naive MobileNet implementation on vanilla Caffe can be seen below.

<table style="width:80%"> <tr> <th><img src="mobilenet/mobilenet_GTX1080Ti.png" width="450"></th> <th><img src="mobilenet/mobilenet_TX1.png" width="450"></th> </tr> </table> <table style="width:80%"> <tr> <th><img src="mobilenet/mobilenet_TK1.png" width="450"></th> <th><img src="mobilenet/mobilenet_RasPi3.png" width="450"></th> </tr> </table>

The gain is negligible on the Raspberry Pi 3 pure CPU platform. On GPU platforms the gain improves with batch size.

ShuffleNet

ShuffleNet promises a more efficient NN via depthwise convolutions and a dedicated shuffling of channels.

We used a customized implementation from https://github.com/farmingyard/ShuffleNet, which exhibits good performance.
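The channel shuffle itself is just a reshape and transpose, which can be sketched with NumPy (the tensor shape and group count below are illustrative):

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle on an (N, C, H, W) tensor:
    view channels as (groups, C // groups), swap the two axes, flatten back."""
    n, c, h, w = x.shape
    assert c % groups == 0
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

x = np.arange(6).reshape(1, 6, 1, 1)  # channels labeled 0..5
y = channel_shuffle(x, groups=2)      # channel order becomes 0, 3, 1, 4, 2, 5
```

Interleaving channels this way lets information flow between the groups of the grouped convolutions at negligible cost, since no arithmetic is performed.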

Methodology

Benchmarking

The benchmark uses the dd_bench.py Python script with images that can be downloaded from https://deepdetect.com/stuff/bench.tar.gz.

Assuming you have successfully built DeepDetect and the server is up and running, the following call to the benchmark tool was used:

python dd_bench.py --host localhost --port 8080 --sname imageserv --gpu --remote-bench-data-dir <bench folder's location> --max-batch-size 128 --create <NN model folder name>

Of course, you need to change `<bench folder's location>` to the location of your bench folder, and `<NN model folder name>` to your model's folder name or path, assuming it is saved under DeepDetect/models.

This creates a service named imageserv on the DD server listening on localhost:8080. Per --gpu, it uses the available GPU, and it attempts increasing batch sizes up to 128.
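For intuition, the measurement such a tool performs can be sketched as follows (`predict` stands in for a request to the server; this is not the actual dd_bench.py code):

```python
import time

def avg_ms_per_image(predict, batch, n_passes=5):
    """Average wall-clock time per image over n_passes forward calls."""
    start = time.perf_counter()
    for _ in range(n_passes):
        predict(batch)
    total_ms = (time.perf_counter() - start) * 1000.0
    return total_ms / (n_passes * len(batch))
```

Averaging over several passes smooths out warm-up and scheduling noise, which is why the raw-data tables below report 5-pass averages.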

Using additional models

To use additional models for benchmarking, two files are needed: the trained weights (a .caffemodel file) and the model definition (a deploy.prototxt file).

To train your own model beforehand, please refer to the section <a href="https://www.deepdetect.com/overview/train_images/">here</a>.

For a prototxt file taken from other resources, we need to make sure that its input and output are compatible with DeepDetect.

In the general case, we set the first layer to take a 224x224 image as input, and we append a layer to treat the output with softmax. A useful reference template is https://github.com/beniz/deepdetect/blob/6d0a1f2d1e487b492e004d7d5972f302d4182ab1/templates/caffe/googlenet/deploy.prototxt
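For reference, the kind of edit involved looks like the following prototxt fragment (layer and blob names such as `fc8` are illustrative and must match those of your network):

```protobuf
# Input: a single 3-channel 224x224 image.
input: "data"
input_shape { dim: 1 dim: 3 dim: 224 dim: 224 }

# ... network body ...

# Output: convert the last inner-product blob into class probabilities.
layer {
  name: "prob"
  type: "Softmax"
  bottom: "fc8"   # illustrative name; use your network's final blob
  top: "prob"
}
```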


Raw Data

<details>

5-pass average processing time in ms (GTX 1080 Ti); x = batch size not applicable:

| batch size | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Top-1 accuracy (%) | 70.81 | missing | 75.3 | 76.4 | 77 | 67.9 | 74.9 | 77.3 | 59.5 | 59.5 | 70.5 | 71.3 | missing |
| 1 | 37.2 | 12.2 | 19.8 | 35.8 | 44.4 | 16.6 | 45.6 | 69 | 8.4 | 8.6 | 14 | 14.6 | 15 |
| 2 | 36.3 | 6.2 | 14.1 | 22.5 | 27.8 | 9.8 | 24 | 38.6 | 4.1 | 5.5 | 9.9 | 11.2 | 9.1 |
| 4 | 22.1 | 4.3 | 8.8 | 13.8 | 18.5 | 5.25 | 16.5 | 25.9 | 2.6 | 3.55 | 6.95 | 8.2 | 6.95 |
| 8 | 21.2 | 3.52 | 7.27 | 10.4 | 14.6 | 3.93 | 11.92 | 18.5 | 2.38 | 2.33 | 5.7 | 6.25 | 4.55 |
| 16 | 19.5 | 3.73 | 6.33 | 8.63 | 11.6 | 3.18 | 9.06 | 13.7 | 2.16 | 1.97 | 5.18 | 6.21 | 4.71 |
| 32 | 18.2 | 3.23 | 5.9 | 7.82 | x | 3.3 | x | x | 2.59 | 2.96 | 5.15 | 6.05 | 3.49 |
| 64 | 19.3 | 3.12 | x | x | x | 3.13 | x | x | 2.5 | 2.33 | 4.82 | 5.63 | 3.26 |
| 128 | 16.8 | 2.63 | x | x | x | 3.05 | x | x | 2.2 | 2.2 | 4.97 | 5.57 | 2.87 |

5-pass average processing time in ms (Jetson TX1):

| batch size | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Top-1 accuracy (%) | 70.81 | missing | 75.3 | 76.4 | 77 | 67.9 | 74.9 | 77.3 | 59.5 | 59.5 | 70.5 | 71.3 | missing |
| 1 | 171 | 33.8 | 89 | 142 | 195 | 43.6 | 134 | 248 | 33.4 | 30.2 | 133 | 152 | 60 |
| 2 | 173 | 29.2 | 77.7 | 122 | 180 | 29.6 | 98.5 | 159 | 23.7 | 17.9 | 165 | 187 | 38.8 |
| 4 | 164 | 27 | 69.6 | 112 | x | 24 | 93.7 | x | 20.7 | 14.2 | 127 | 149 | 21.7 |
| 8 | 155 | 26.1 | 66.7 | x | x | 21.8 | x | x | 18.6 | 12.1 | 110 | 130 | 20.6 |
| 16 | x | 25.6 | x | x | x | 20.2 | x | x | 17.7 | 11.8 | 100 | 120 | 21.8 |
| 32 | x | 25.5 | x | x | x | 19.7 | x | x | 17.5 | 11.8 | x | x | 22.9 |
| 64 | x | x | x | x | x | 20 | x | x | 17.6 | 11.5 | x | x | x |
| 128 | x | x | x | x | x | x | x | x | x | 11.6 | x | x | x |

5-pass average processing time in ms (Jetson TK1):

| batch size | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Top-1 accuracy (%) | 70.81 | missing | 75.3 | 76.4 | 77 | 67.9 | 74.9 | 77.3 | 59.5 | 59.5 | 70.5 | 71.3 | missing |
| 1 | 464 | 336 | 203 | 283 | 400 | 197 | 294 | 637 | 119 | 90.2 | x | x | 82.8 |
| 2 | 462 | 210 | 231 | 351 | 477 | 127 | 225 | x | 88 | 71.3 | x | x | 63.8 |
| 4 | 453 | 135 | 234 | x | x | 87.2 | x | x | 70.8 | 50.9 | x | x | 53.4 |
| 8 | 441 | 141 | x | x | x | 78.8 | x | x | 62.9 | 53.6 | x | x | 52 |
| 16 | 452 | 137 | x | x | x | 87.8 | x | x | 67 | 40 | x | x | 51.3 |
| 32 | x | x | x | x | x | 93 | x | x | 81 | 46.8 | x | x | x |
| 64 | x | x | x | x | x | x | x | x | x | 45.2 | x | x | x |
| 128 | x | x | x | x | x | x | x | x | x | x | x | x | x |

5-pass average processing time in ms (Raspberry Pi 3):

| batch size | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Top-1 accuracy (%) | 70.81 | missing | 75.3 | 76.4 | 77 | 67.9 | 74.9 | 77.3 | 59.5 | 59.5 | 70.5 | 71.3 | missing |
| 1 | 1246 | 1443 | 3560 | x | x | 7980 | x | x | 1492 | 910 | x | x | 1115 |
| 2 | 1230 | 1370 | x | x | x | 8008 | x | x | 1478 | 917 | x | x | 1067 |
| 4 | x | 1372 | x | x | x | 7943 | x | x | 1493 | 919 | x | x | 1047 |
| 8 | x | 1401 | x | x | x | 8015 | x | x | 1444 | 913 | x | x | 1046 |
| 16 | x | x | x | x | x | x | x | x | 1456 | 909 | x | x | x |
| 32 | x | x | x | x | x | x | x | x | x | x | x | x | x |
| 64 | x | x | x | x | x | x | x | x | x | x | x | x | x |
| 128 | x | x | x | x | x | x | x | x | x | x | x | x | x |

FLOPs and parameters for each model:

| | mobilenet | mobilenet_depthwise | res50 | res101 | res152 | googlenet | densenet121 | densenet201 | Squeezenetv1.0 | Squeezenetv1.1 | vgg16 | vgg19 | shufflenet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GFLOPs | 0.5687 | 0.5514 | 3.8580 | 7.5702 | 11.282 | 1.5826 | 3.0631 | 4.7727 | 0.8475 | 0.3491 | 15.470 | 19.632 | 0.1234 |
| Params (millions) | 4.2309 | 4.2309 | 25.556 | 44.548 | 60.191 | 6.9902 | 7.9778 | 20.012 | 1.2444 | 1.2315 | 138.34 | 143.65 | 1.8137 |
</details>

The bulk of this work was done by https://github.com/jsaksris/ while on an internship at Jolibrain.