Pruned-OpenVINO-YOLO

简体中文

Prerequisite

Install mish-cuda first: https://github.com/JunnYu/mish-cuda (testing platform: Win10 + RTX 3090 + CUDA 11.2).

If you can't install it on your device, you can also try https://github.com/thomasbrandon/mish-cuda
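If your CUDA toolchain matches your PyTorch build, one common approach is installing directly from GitHub with pip, e.g. `pip install git+https://github.com/JunnYu/mish-cuda`; the exact steps may differ on your system.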


Introduction

When deploying YOLOv3/v4 on OpenVINO, the full model runs at a low FPS, while the tiny model has low accuracy and poor stability. The full model structure is designed to detect 80 or more classes in complex scenes, but in actual use there are often only a few classes and the scenes are not that complicated. This tutorial shares how to prune a YOLOv3/v4 model and then deploy it on OpenVINO. With little loss of accuracy, the frame rate can be increased severalfold on Intel inference devices. On an Intel GPU, it can even run inference on four video streams simultaneously while meeting basic real-time requirements.

The general process is as follows:

<img src="assets/enprocess.png" alt="image-20201113214025134" style="zoom:50%;" />

The following takes the YOLOv3-SPP and YOLOv4 models as examples to walk through the details of baseline training, model pruning, and deployment on OpenVINO.

Note: The dataset I used contains two classes (person and car), extracted from COCO2014 plus images I selected and labeled from the UA-DETRAC dataset: 54,647 training images and 22,998 test images.

Baseline training

Baseline training simply means training normally on your own dataset until the model reaches a suitable accuracy.
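For reference, with the pruning repo's training script used later in this tutorial, a plain baseline run might look like `python train.py --cfg cfg/my_cfg.cfg --data data/my_data.data --weights weights/yolov3-spp.weights --epochs 300 --batch-size 32` (the pretrained-weights path is illustrative); the point is simply to reach good accuracy before any sparsity penalty is applied.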

Recommended:

Note: Since the above-mentioned projects do not support YOLOv4 very well, YOLOv4's training results may be slightly worse.

YOLOv3-SPP baseline training result

| P | R | mAP@0.5 | Params | Size of .weights | Inference time (Tesla P100) | BFLOPS |
| --- | --- | --- | --- | --- | --- | --- |
| 0.554 | 0.709 | 0.667 | 62.5M | 238M | 17.4ms | 65.69 |

YOLOv4 baseline training result

| P | R | mAP@0.5 | Params | Size of .weights | Inference time (Tesla T4) | BFLOPS |
| --- | --- | --- | --- | --- | --- | --- |
| 0.587 | 0.699 | 0.669 | 62.5M | 244M | 28.3ms | 59.57 |

Model pruning

I use this repo: yolov3-channel-and-layer-pruning.

Thanks to tanluren and zbyuan for their great work.

This model pruning project is based on ultralytics/yolov3. The folder named Pruneyolov3v4 contains the version of the pruning code I used; it is based on the June 2020 version of ultralytics/yolov3 and is provided for reference only. Usage is the same as yolov3-channel-and-layer-pruning. Because more tricks are added to the training process, the training mAP@0.5 will be slightly higher, and P and R will not drift too far apart.

If you have any questions about the model pruning part, you can also ask them at yolov3-channel-and-layer-pruning; tanluren and zbyuan will be more professional. I only used part of the pruning strategy and share my pruning results here.

Sparsity training

Note: you can use `python -c "from models import *; convert('cfg/yolov3.cfg', 'weights/last.pt')"` to convert a .pt file to a .weights file.

A .pt file includes epoch information; converting it to .weights lets you train the model from epoch 0.

`python train.py --cfg cfg/my_cfg.cfg --data data/my_data.data --weights weights/last.weights --epochs 300 --batch-size 32 -sr --s 0.001 --prune 1`
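Conceptually, the `-sr --s 0.001` options add an L1 penalty on the BN gamma (scale) factors during training, pushing the scales of unimportant channels toward zero (the "network slimming" idea). Below is a minimal PyTorch sketch of that extra gradient step, assuming `model` is the YOLO network; the repo's actual implementation differs in details such as which BN layers are penalized:

```python
import torch
import torch.nn as nn

def add_bn_l1_grad(model: nn.Module, s: float = 0.001):
    """After loss.backward(), add the subgradient of s * |gamma|
    to every BN scale factor so gammas are pushed toward zero."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight.grad is not None:
            # d/d(gamma) of s * |gamma| is s * sign(gamma)
            m.weight.grad.add_(s * torch.sign(m.weight.detach()))

# usage inside the training loop (sketch):
#   loss.backward()
#   add_bn_l1_grad(model, s=0.001)
#   optimizer.step()
```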

Important note!

The picture below is the TensorBoard graph of P, R, and mAP@0.5 from my YOLOv4 sparsity training:

*(Figure: TensorBoard curves of P, R, and mAP@0.5 during YOLOv4 sparsity training)*

Although mAP declined in the early stage of sparsity training, it never fell below 0.4, indicating that the selected s value was appropriate. However, training became abnormal around epoch 230: P rose sharply, R fell sharply (at one point close to 0), and mAP@0.5 also fell sharply. This does not normally happen in the middle and late stages of training. Even if you encounter a situation like mine, don't panic: if the indicators show a tendency to return to normal, it has no effect. If they cannot recover after a while, you may need to retrain.

*(Figure: TensorBoard screenshot of the sparsity training metrics)*

The figure below shows the result of YOLOv4's sparsity training after 300 epochs. Most of the BN gamma weights tend toward 0, and the closer they are to 0, the sparser the model. The distribution below can already be considered an acceptable sparsity result and is for reference only.

*(Figure: distribution of BN gamma weights after 300 epochs of sparsity training)*

TensorBoard also provides the gamma weight distribution of the BN layers before sparsity training, which can be used for comparison:

*(Figure: distribution of BN gamma weights before sparsity training)*
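If you want to reproduce this kind of plot outside TensorBoard, here is a minimal sketch (assuming a PyTorch model; matplotlib required) that histograms the absolute BN gamma values:

```python
import torch.nn as nn
import matplotlib.pyplot as plt

def plot_bn_gamma_hist(model: nn.Module, title: str = "BN gamma distribution"):
    """Collect |gamma| from all BatchNorm2d layers and plot a histogram.
    After good sparsity training, most of the mass should sit near 0."""
    gammas = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            gammas.extend(m.weight.detach().abs().cpu().tolist())
    plt.hist(gammas, bins=100)
    plt.xlabel("|gamma|")
    plt.ylabel("count")
    plt.title(title)
    plt.show()
```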

After sparsity training, mAP@0.5 of YOLOv3-SPP dropped by 4 points, and YOLOv4 dropped by 7 points.

YOLOv3-SPP after sparsity training

| Model | P | R | mAP@0.5 |
| --- | --- | --- | --- |
| Sparsity training | 0.525 | 0.67 | 0.624 |

YOLOv4 after sparsity training

| Model | P | R | mAP@0.5 |
| --- | --- | --- | --- |
| Sparsity training | 0.665 | 0.570 | 0.595 |

Model pruning

Pruning can start once the model is sufficiently sparse. Pruning is divided into channel pruning and layer pruning, both of which are evaluated based on the gamma weights of the BN layers, so whether sparsity training was sufficient directly affects the pruning result. Channel pruning greatly reduces the number of model parameters and the size of the weight file; the speed-up on desktop GPU devices may be less obvious than on embedded devices. Layer pruning gives a more universal acceleration effect. After pruning is complete, fine-tune the model to restore accuracy. A sketch of how the gamma-based evaluation works is given below.
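To make "evaluated based on the gamma weights" concrete, here is a simplified sketch of how a global channel-pruning threshold can be derived from the Global percent used below. It assumes every BatchNorm2d layer is prunable; the actual repo excludes certain layers and keeps a minimum fraction of channels per layer (the `--layer_keep` option):

```python
import torch
import torch.nn as nn

def channel_mask_from_global_percent(model: nn.Module, global_percent: float):
    """Return a per-BN-layer boolean mask of channels to KEEP,
    using one global threshold over all |gamma| values."""
    all_gammas = torch.cat([
        m.weight.detach().abs().flatten()
        for m in model.modules() if isinstance(m, nn.BatchNorm2d)
    ])
    sorted_gammas, _ = torch.sort(all_gammas)
    # channels whose |gamma| falls below this percentile value are pruned
    thresh_index = int(len(sorted_gammas) * global_percent)
    threshold = sorted_gammas[min(thresh_index, len(sorted_gammas) - 1)]
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            masks[name] = m.weight.detach().abs() > threshold
    return masks
```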

The following uses YOLOv3-SPP and YOLOv4 as examples to show how to find a suitable pruning point (maintaining a high mAP@0.5 under the greatest possible pruning strength), which I call the "optimal pruning point":

Channel pruning

`python slim_prune.py --cfg cfg/my_cfg.cfg --data data/my_data.data --weights weights/last.pt --global_percent 0.8 --layer_keep 0.01`

When setting the global channel pruning ratio (Global percent), you can use a coarse-to-fine strategy to approach the "optimal pruning point": start with large intervals, then gradually subdivide. For example, first take Global percent values of 0.7, 0.8, and 0.9. At 0.7 and 0.8 the model is compressed while accuracy does not decline seriously, and mAP@0.5 even slightly exceeds the model after sparsity training. At 0.9, however, P rises sharply while R and mAP@0.5 drop sharply, so we can infer that 0.9 just exceeds the "optimal pruning point". The Global percent is therefore subdivided into 0.88 and 0.89. At 0.88 and 0.89 the metrics are identical to three decimal places and very close to the model after sparsity training, but 0.89 gives a better compression effect. If we continue with Global percent values of 0.91, 0.92, and 0.93, we find that at 0.91, P has risen to its limit of 1 while R and mAP@0.5 are close to 0; beyond this limit (Global percent greater than 0.91), P, R, and mAP@0.5 all become infinitely close to 0. This means the key channels have been cut off.

So it can be determined that a Global percent of 0.89 is the "optimal pruning point". This search can also be scripted, as sketched below.
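Here is a sketch that automates the coarse-to-fine search by rerunning the repo's slim_prune.py for each candidate ratio; the file paths are placeholders, and reading P/R/mAP@0.5 back from each run's output is left out:

```python
import subprocess

COARSE = [0.7, 0.8, 0.9]
FINE = [0.88, 0.89]   # chosen after inspecting the coarse results

for percent in COARSE + FINE:
    # run the pruning script from yolov3-channel-and-layer-pruning;
    # cfg/data/weights paths are placeholders for your own files
    subprocess.run([
        "python", "slim_prune.py",
        "--cfg", "cfg/my_cfg.cfg",
        "--data", "data/my_data.data",
        "--weights", "weights/last.pt",
        "--global_percent", str(percent),
        "--layer_keep", "0.01",
    ], check=True)
    # inspect the printed P / R / mAP@0.5 for each run and pick the
    # largest percent whose metrics stay close to the sparse model
```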

Metrics of the YOLOv3-SPP model (after sparsity training) under different global channel pruning ratios:

| Global percent | P | R | mAP@0.5 | Params | Size of .weights | Inference time (Tesla P100) | BFLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.7 | 0.572 | 0.659 | 0.627 | 15.7M | 59.8M | 16.7ms | 25.13 |
| 0.8 | 0.575 | 0.656 | 0.626 | 7.8M | 30M | 16.7ms | 18.07 |
| 0.88 | 0.574 | 0.652 | 0.621 | 2.7M | 10.2M | 16.6ms | 13.27 |
| 0.89 | 0.574 | 0.652 | 0.621 | 2.6M | 10.1M | 16.5ms | 13.23 |
| 0.9 | 0.859 | 0.259 | 0.484 | 2.5M | 9.41M | 16.3ms | 12.71 |
| 0.91 | 1 | 0.00068 | 0.14 | 2.1M | 9.02M | 16.4ms | 11.69 |
| 0.92 | 0 | 0 | 0.00118 | 1.9M | 7.15M | 16.1ms | 10.99 |
| 0.93 | 0 | 0 | 0 | 1.7M | 6.34M | 16.5ms | 10.37 |

Metrics of the YOLOv4 model (after sparsity training) under different global channel pruning ratios:

| Global percent | P | R | mAP@0.5 | Params | Size of .weights | Inference time (Tesla T4) | BFLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.5 | 0.693 | 0.559 | 0.594 | 19.8M | 75.8M | 18.0ms | 26.319 |
| 0.6 | 0.697 | 0.552 | 0.584 | 12.8M | 49.1M | 17.7ms | 20.585 |
| 0.7 | 0.699 | 0.55 | 0.581 | 7.1M | 27.0M | 17.6ms | 15.739 |
| 0.8 | 0.696 | 0.544 | 0.578 | 3.0M | 11.6M | 16.4ms | 11.736 |
| 0.82 | 0.697 | 0.542 | 0.575 | 2.4M | 9.49M | 16.5ms | 11.033 |
| 0.84 | 0.698 | 0.54 | 0.574 | 2.0M | 7.84M | 16.5ms | 10.496 |
| 0.86 | 0.698 | 0.54 | 0.571 | 1.7M | 6.58M | 16.4ms | 9.701 |
| 0.88 | 0.706 | 0.536 | 0.57 | 1.5M | 6.09M | 16.4ms | 8.692 |
| 0.89 | 0.787 | 0.0634 | 0.204 | 1.3M | 5.36M | 16.5ms | 8.306 |
| 0.9 | 0.851 | 0.00079 | 0.0329 | 1.2M | 4.79M | 16.5ms | 7.927 |

In the same way, it can be determined that a Global percent of 0.88 is the "optimal pruning point" for YOLOv4's channel pruning.

After channel pruning, we can perform layer pruning.

Layer pruning

`python layer_prune.py --cfg cfg/my_cfg.cfg --data data/my_data.data --weights weights/last.pt --shortcuts 12`

The `--shortcuts` parameter is the number of Res units to cut, shown as the "Cut Res units" column in the tables below; a sketch of the idea behind the ranking follows.
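Roughly speaking, layer pruning scores each Res unit by the mean |gamma| of the CBL block in front of its shortcut and cuts the weakest units; each cut removes the shortcut plus two preceding CBL layers, i.e. three layers per unit (which is why 17 units correspond to 51 layers below). Here is a simplified sketch of the ranking step, where `bn_of_shortcut` is an assumed mapping from each shortcut index to the BN layer of its preceding CBL:

```python
import torch.nn as nn

def rank_res_units(bn_of_shortcut, n_cut):
    """Score each Res unit by the mean |gamma| of the CBL before its
    shortcut and return the indices of the n_cut weakest units."""
    scores = {
        idx: bn.weight.detach().abs().mean().item()
        for idx, bn in bn_of_shortcut.items()
    }
    # the units with the smallest mean |gamma| contribute the least
    weakest = sorted(scores, key=scores.get)[:n_cut]
    return weakest  # each cut unit removes 3 layers (2 CBL + shortcut)
```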

Metrics of YOLOv3-SPP (Global percent 0.89) under different layer pruning strengths:

| Cut Res units | P | R | mAP@0.5 | Params | Size of .weights | Inference time (Tesla P100) | BFLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | 0.492 | 0.421 | 0.397 | 2.3M | 8.97M | 10.4ms | 12.39 |
| 17 | 0.48 | 0.365 | 0.342 | 2.2M | 8.55M | 9.7ms | 11.79 |
| 18 | 0.547 | 0.166 | 0.205 | 2.1M | 7.99M | 9.1ms | 11.02 |
| 19 | 0.561 | 0.0582 | 0.108 | 2.0M | 7.82M | 8.9ms | 10.06 |
| 20 | 0.631 | 0.0349 | 0.0964 | 1.9M | 7.43M | 8.2ms | 9.93 |

Analyzing the table above: for each additional Res unit cut, P increases while R and mAP@0.5 fall, which matches the expectations described for channel pruning. Generally, a good model has P and R at a high level and close to each other. When 18 Res units are cut, both R and mAP@0.5 have dropped significantly and there is already a large gap between R and P, so the optimal pruning point has been exceeded. Cutting even more Res units drives R and mAP@0.5 toward 0. To maximize the acceleration effect, you should cut as many Res units as possible while preserving accuracy, so cutting 17 Res units (51 layers in total) is clearly the best choice: that is the "optimal pruning point".

At the same time, the inference time shows the obvious acceleration effect of layer pruning compared with the baseline model.

Metrics of YOLOv4 (Global percent 0.88) under different layer pruning strengths:

| Cut Res units | P | R | mAP@0.5 | Params | Size of .weights | Inference time (Tesla T4) | BFLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 14 | 0.686 | 0.473 | 0.507 | 1.5M | 5.78M | 12.1ms | 8.467 |
| 17 | 0.704 | 0.344 | 0.419 | 1.4M | 5.39M | 11.0ms | 7.834 |
| 18 | 0.678 | 0.31 | 0.377 | 1.3M | 5.33M | 10.9ms | 7.815 |
| 19 | 0.781 | 0.0426 | 0.121 | 1.3M | 5.22M | 10.5ms | 7.219 |
| 20 | 0.765 | 0.0113 | 0.055 | 1.2M | 4.94M | 10.4ms | 6.817 |

In the same way, it can be determined that a global channel pruning ratio of 0.88 with 18 Res units cut (i.e., 54 layers) is the "optimal pruning point" of YOLOv4.

Model fine-tuning

`python train.py --cfg cfg/prune_0.85_my_cfg.cfg --data data/my_data.data --weights weights/prune_0.85_last.weights --epochs 100 --batch-size 32`

Warmup is applied during the first few epochs of fine-tuning, which helps restore the accuracy of the pruned model. The default is 6 epochs; if you think that is too many, you can modify the train.py code yourself. A sketch of the idea follows.
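For intuition, warmup simply ramps the learning rate up over those first epochs. Here is a minimal sketch using PyTorch's LambdaLR; the actual schedule in the repo's train.py may differ:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

WARMUP_EPOCHS = 6  # the default mentioned above

def warmup_factor(epoch: int) -> float:
    """Linearly scale the LR from near 0 up to its nominal value
    during the first WARMUP_EPOCHS epochs, then hold it there."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    return 1.0

# usage sketch:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# scheduler = LambdaLR(optimizer, lr_lambda=warmup_factor)
# call scheduler.step() once per epoch
```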

Using the default warmup of 6 epochs, the fine-tuning results are as follows:

Comparison of YOLOv3-SPP baseline model and the model after pruning and fine-tuning

| Model | P | R | mAP@0.5 | Params | Size of .weights | Inference time (Tesla P100) | BFLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.554 | 0.709 | 0.667 | 62.5M | 238M | 17.4ms | 65.69 |
| After fine-tuning | 0.556 | 0.663 | 0.631 | 2.2M | 8.55M | 9.7ms | 11.79 |

*(Figure: distribution of the absolute values of the BN layer weights after YOLOv3-SPP pruning (left) and after fine-tuning (right))*

This completes the whole model pruning process for YOLOv3-SPP. After pruning, the model loses about 3 points of mAP@0.5, the total parameters and weight file size are reduced by 96.4%, BFLOPS is reduced by 82%, and the inference time on a Tesla P100 GPU drops by 44%.

Comparison of YOLOv4 baseline model and the model after pruning and fine-tuning

| Model | P | R | mAP@0.5 | Params | Size of .weights | Inference time (Tesla T4) | BFLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.587 | 0.699 | 0.669 | 62.5M | 244M | 28.3ms | 59.57 |
| After fine-tuning | 0.565 | 0.626 | 0.601 | 1.3M | 5.33M | 10.9ms | 7.815 |

*(Figure: distribution of the absolute values of the BN layer weights after YOLOv4 pruning (left) and after fine-tuning (right))*

This completes the whole model pruning process for YOLOv4. After pruning, the model loses about 7 points of mAP@0.5, the total parameters and weight file size are reduced by 98%, BFLOPS is reduced by 87%, and the inference time on a Tesla T4 GPU drops by 61%.

Model training in PyTorch and Darknet differs in many details, and fine-tuning under the Darknet framework often gives better results. Note that in that case you only need the pruned .cfg file and should not load pretrained weights!

Deployment of the model after pruning on OpenVINO

There are many optimization algorithms for YOLO models, but because conversion to the OpenVINO IR model relies on TensorFlow 1.x, which is based on static graphs, the TensorFlow code must be adjusted whenever the model structure changes. To simplify this process, I made a tool that parses the .cfg file of the pruned model and generates the corresponding TensorFlow code. With this tool, the pruned model can be quickly deployed on OpenVINO.

Repositories: https://github.com/TNTWEN/OpenVINO-YOLO-Automatic-Generation

Under OpenVINO, the pruned model achieves a 2-3x frame rate increase for inference on Intel CPU, GPU, HDDL, and NCS2. Using video splicing, four 416×416 video streams are stitched into one 832×832 input, so that OpenVINO can run YOLO on four video streams simultaneously while meeting basic real-time requirements; a sketch of the splicing idea follows.
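Here is a minimal sketch of that splicing idea with OpenCV and NumPy, assuming four frame sources already decoded to BGR arrays (mapping detections back to each quadrant is left out):

```python
import cv2
import numpy as np

def splice_four(frames):
    """Tile four BGR frames into one 832x832 mosaic for a single
    YOLO inference pass; detections can be mapped back to each
    source by subtracting the tile offsets."""
    tiles = [cv2.resize(f, (416, 416)) for f in frames]
    top = np.hstack([tiles[0], tiles[1]])
    bottom = np.hstack([tiles[2], tiles[3]])
    return np.vstack([top, bottom])  # shape (832, 832, 3)
```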

This tool also has the potential to be compatible with other YOLO optimization algorithms: it only needs the .cfg file and weight file of the optimized model to complete the model conversion.

Thank you for your use and hope it will help you!