Read this in other languages: English, 简体中文.

News:

Tutorial Video

An Out-of-the-Box TensorRT-based Framework for High Performance Inference with C++/Python Support

INTRO

  1. High-level interface for C++/Python.
  2. Simplified implementation of custom plugins; serialization and deserialization are encapsulated for easier use.
  3. Simplified compilation in FP32, FP16 and INT8, which eases deployment with C++/Python on servers or embedded devices.
  4. Ready-to-use models with examples: RetinaFace, Scrfd, YoloV5, YoloX, Arcface, AlphaPose, CenterNet and DeepSORT (C++).

YoloX and YoloV5-series Model Test Report

<details> <summary>app_yolo.cpp speed testing</summary>
  1. Resolution (YoloV5P5, YoloX) = (640x640), (YoloV5P6) = (1280x1280)
  2. max batch size = 16
  3. preprocessing + inference + postprocessing
  4. cuda10.2, cudnn8.2.2.26, TensorRT-8.0.1.6
  5. RTX2080Ti
  6. Number of test runs: results are averaged over 100 runs, excluding the first run, which serves as warmup
  7. Testing log: workspace/perf.result.std.log
  8. Code for testing: src/application/app_yolo.cpp
  9. Images for testing: 6 images in workspace/inference
    • with resolutions 810x1080, 500x806, 1024x684, 550x676, 1280x720 and 800x533 respectively
  10. Testing method: load the 6 images, then run inference on all 6 images, repeated 100 times. Note that each image is preprocessed and postprocessed on every run.
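The measurement procedure described above (one untimed warmup run, then an average over 100 timed runs) can be sketched as follows. This is a minimal illustration, not the code in src/application/app_yolo.cpp; `average_latency_ms` and the stand-in workload are hypothetical names introduced here.

```python
import time


def average_latency_ms(run_once, repeats=100, warmup=1):
    """Return the mean wall-clock time of run_once() in milliseconds.

    The first `warmup` calls are executed but not timed.
    """
    for _ in range(warmup):
        run_once()  # warmup: its time is discarded
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()  # timed: preprocessing + inference + postprocessing
    total_s = time.perf_counter() - start
    return total_s * 1000.0 / repeats


# Hypothetical stand-in workload; replace it with a real inference call.
calls = []
latency_ms = average_latency_ms(lambda: calls.append(sum(range(1000))))
print(f"{latency_ms:.3f} ms per run, {len(calls)} calls in total")
```

Timing the whole loop once, rather than each call separately, keeps timer overhead out of the average.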

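| Model | Resolution | Type | Precision | Elapsed Time (ms) | FPS |
|---|---|---|---|---|---|
| yolox_x | 640x640 | YoloX | FP32 | 21.879 | 45.71 |
| yolox_l | 640x640 | YoloX | FP32 | 12.308 | 81.25 |
| yolox_m | 640x640 | YoloX | FP32 | 6.862 | 145.72 |
| yolox_s | 640x640 | YoloX | FP32 | 3.088 | 323.81 |
| yolox_x | 640x640 | YoloX | FP16 | 6.763 | 147.86 |
| yolox_l | 640x640 | YoloX | FP16 | 3.933 | 254.25 |
| yolox_m | 640x640 | YoloX | FP16 | 2.515 | 397.55 |
| yolox_s | 640x640 | YoloX | FP16 | 1.362 | 734.48 |
| yolox_x | 640x640 | YoloX | INT8 | 4.070 | 245.68 |
| yolox_l | 640x640 | YoloX | INT8 | 2.444 | 409.21 |
| yolox_m | 640x640 | YoloX | INT8 | 1.730 | 577.98 |
| yolox_s | 640x640 | YoloX | INT8 | 1.060 | 943.15 |
| yolov5x6 | 1280x1280 | YoloV5_P6 | FP32 | 68.022 | 14.70 |
| yolov5l6 | 1280x1280 | YoloV5_P6 | FP32 | 37.931 | 26.36 |
| yolov5m6 | 1280x1280 | YoloV5_P6 | FP32 | 20.127 | 49.69 |
| yolov5s6 | 1280x1280 | YoloV5_P6 | FP32 | 8.715 | 114.75 |
| yolov5x | 640x640 | YoloV5_P5 | FP32 | 18.480 | 54.11 |
| yolov5l | 640x640 | YoloV5_P5 | FP32 | 10.110 | 98.91 |
| yolov5m | 640x640 | YoloV5_P5 | FP32 | 5.639 | 177.33 |
| yolov5s | 640x640 | YoloV5_P5 | FP32 | 2.578 | 387.92 |
| yolov5x6 | 1280x1280 | YoloV5_P6 | FP16 | 20.877 | 47.90 |
| yolov5l6 | 1280x1280 | YoloV5_P6 | FP16 | 10.960 | 91.24 |
| yolov5m6 | 1280x1280 | YoloV5_P6 | FP16 | 7.236 | 138.20 |
| yolov5s6 | 1280x1280 | YoloV5_P6 | FP16 | 3.851 | 259.68 |
| yolov5x | 640x640 | YoloV5_P5 | FP16 | 5.933 | 168.55 |
| yolov5l | 640x640 | YoloV5_P5 | FP16 | 3.450 | 289.86 |
| yolov5m | 640x640 | YoloV5_P5 | FP16 | 2.184 | 457.90 |
| yolov5s | 640x640 | YoloV5_P5 | FP16 | 1.307 | 765.10 |
| yolov5x6 | 1280x1280 | YoloV5_P6 | INT8 | 12.207 | 81.92 |
| yolov5l6 | 1280x1280 | YoloV5_P6 | INT8 | 7.221 | 138.49 |
| yolov5m6 | 1280x1280 | YoloV5_P6 | INT8 | 5.248 | 190.55 |
| yolov5s6 | 1280x1280 | YoloV5_P6 | INT8 | 3.149 | 317.54 |
| yolov5x | 640x640 | YoloV5_P5 | INT8 | 3.704 | 269.97 |
| yolov5l | 640x640 | YoloV5_P5 | INT8 | 2.255 | 443.53 |
| yolov5m | 640x640 | YoloV5_P5 | INT8 | 1.674 | 597.40 |
| yolov5s | 640x640 | YoloV5_P5 | INT8 | 1.143 | 874.91 |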
</details> <details> <summary>app_yolo_fast.cpp speed testing. Never stop desiring for being faster</summary>

| Model | Resolution | Type | Precision | Elapsed Time (ms) | FPS |
|---|---|---|---|---|---|
| yolox_x_fast | 640x640 | YoloX | FP32 | 21.598 | 46.30 |
| yolox_l_fast | 640x640 | YoloX | FP32 | 12.199 | 81.97 |
| yolox_m_fast | 640x640 | YoloX | FP32 | 6.819 | 146.65 |
| yolox_s_fast | 640x640 | YoloX | FP32 | 2.979 | 335.73 |
| yolox_x_fast | 640x640 | YoloX | FP16 | 6.764 | 147.84 |
| yolox_l_fast | 640x640 | YoloX | FP16 | 3.866 | 258.64 |
| yolox_m_fast | 640x640 | YoloX | FP16 | 2.386 | 419.16 |
| yolox_s_fast | 640x640 | YoloX | FP16 | 1.259 | 794.36 |
| yolox_x_fast | 640x640 | YoloX | INT8 | 3.918 | 255.26 |
| yolox_l_fast | 640x640 | YoloX | INT8 | 2.292 | 436.38 |
| yolox_m_fast | 640x640 | YoloX | INT8 | 1.589 | 629.49 |
| yolox_s_fast | 640x640 | YoloX | INT8 | 0.954 | 1048.47 |
| yolov5x6_fast | 1280x1280 | YoloV5_P6 | FP32 | 67.075 | 14.91 |
| yolov5l6_fast | 1280x1280 | YoloV5_P6 | FP32 | 37.491 | 26.67 |
| yolov5m6_fast | 1280x1280 | YoloV5_P6 | FP32 | 19.422 | 51.49 |
| yolov5s6_fast | 1280x1280 | YoloV5_P6 | FP32 | 7.900 | 126.57 |
| yolov5x_fast | 640x640 | YoloV5_P5 | FP32 | 18.554 | 53.90 |
| yolov5l_fast | 640x640 | YoloV5_P5 | FP32 | 10.060 | 99.41 |
| yolov5m_fast | 640x640 | YoloV5_P5 | FP32 | 5.500 | 181.82 |
| yolov5s_fast | 640x640 | YoloV5_P5 | FP32 | 2.342 | 427.07 |
| yolov5x6_fast | 1280x1280 | YoloV5_P6 | FP16 | 20.538 | 48.69 |
| yolov5l6_fast | 1280x1280 | YoloV5_P6 | FP16 | 10.404 | 96.12 |
| yolov5m6_fast | 1280x1280 | YoloV5_P6 | FP16 | 6.577 | 152.06 |
| yolov5s6_fast | 1280x1280 | YoloV5_P6 | FP16 | 3.087 | 323.99 |
| yolov5x_fast | 640x640 | YoloV5_P5 | FP16 | 5.919 | 168.95 |
| yolov5l_fast | 640x640 | YoloV5_P5 | FP16 | 3.348 | 298.69 |
| yolov5m_fast | 640x640 | YoloV5_P5 | FP16 | 2.015 | 496.34 |
| yolov5s_fast | 640x640 | YoloV5_P5 | FP16 | 1.087 | 919.63 |
| yolov5x6_fast | 1280x1280 | YoloV5_P6 | INT8 | 11.236 | 89.00 |
| yolov5l6_fast | 1280x1280 | YoloV5_P6 | INT8 | 6.235 | 160.38 |
| yolov5m6_fast | 1280x1280 | YoloV5_P6 | INT8 | 4.311 | 231.97 |
| yolov5s6_fast | 1280x1280 | YoloV5_P6 | INT8 | 2.139 | 467.45 |
| yolov5x_fast | 640x640 | YoloV5_P5 | INT8 | 3.456 | 289.37 |
| yolov5l_fast | 640x640 | YoloV5_P5 | INT8 | 2.019 | 495.41 |
| yolov5m_fast | 640x640 | YoloV5_P5 | INT8 | 1.425 | 701.71 |
| yolov5s_fast | 640x640 | YoloV5_P5 | INT8 | 0.844 | 1185.47 |
</details>
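In the reports above, the FPS column is consistent with 1000 divided by the elapsed time, which implies the elapsed time is in milliseconds. The short sketch below reproduces that relation and derives a speedup between two precisions from the reported numbers; `fps` and `speedup` are helper names introduced here for illustration.

```python
def fps(elapsed_ms):
    """Frames per second for a per-frame latency given in milliseconds."""
    return 1000.0 / elapsed_ms


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized latency is than the baseline."""
    return baseline_ms / optimized_ms


# yolox_x in FP32 (app_yolo.cpp report): 21.879 ms
yolox_x_fp32_fps = round(fps(21.879), 2)

# yolox_s in FP32 (3.088 ms) versus INT8 (1.060 ms), same report
yolox_s_int8_speedup = round(speedup(3.088, 1.060), 2)

print(yolox_x_fp32_fps, yolox_s_int8_speedup)
```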

Setup and Configuration

<details> <summary>Linux</summary>
  1. VSCode (highly recommended!)
  2. Configure the paths to cuDNN, CUDA, TensorRT 8.0 and protobuf.
  3. Set the compute capability that matches your NVIDIA graphics card in Makefile/CMakeLists.txt
  4. Configure your library path in .vscode/c_cpp_properties.json
  5. CUDA version: CUDA10.2
  6. cuDNN version: cudnn8.2.2.26. Note that both the dev (.h files) and runtime (.so files) packages should be downloaded.
  7. TensorRT version: tensorRT-8.0.1.6-cuda10.2
  8. protobuf version (for the ONNX parser): protobufv3.11.4
</details> <details> <summary>Linux: Compile for Python</summary> </details> <details> <summary>Windows</summary>
  1. Please check lean/README.md for the detailed dependencies

  2. In TensorRT.vcxproj, replace the <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 10.0.props" /> with your own CUDA path

  3. In TensorRT.vcxproj, replace the <Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 10.0.targets" /> with your own CUDA path

  4. In TensorRT.vcxproj, replace the <CodeGeneration>compute_61,sm_61</CodeGeneration> with your compute capability.

  5. Configure your dependencies or download them to the folder /lean. Configure the VC++ directories (include directories and references)

  6. Configure your environment under Debug -> Environment

  7. Compile and run the example, where 3 options are available.
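The `<CodeGeneration>` value in step 4 follows the pattern `compute_XY,sm_XY`, where X.Y is the compute capability of the GPU (6.1 in the default project file). A small helper, hypothetical and introduced only for illustration, formats the string from a (major, minor) pair:

```python
def codegen_entry(major, minor):
    """Format a compute capability such as (6, 1) as 'compute_61,sm_61'."""
    cc = f"{major}{minor}"
    return f"compute_{cc},sm_{cc}"


# With PyTorch and a CUDA device available, the pair can be queried with
# torch.cuda.get_device_capability(); here the values are typed in by hand.
print(codegen_entry(6, 1))  # value used in the default TensorRT.vcxproj
print(codegen_entry(7, 5))  # e.g. a Turing card such as the RTX 2080 Ti
```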

</details> <details> <summary>Windows: Compile for Python</summary>
  1. Compile pytrtc.pyd. Choose python in visual studio to compile
  2. Copy dll and execute 'python/copy_dll_to_pytrt.bat'
  3. Execute the example in python dir by 'python test_yolov5.py'
</details> <details> <summary>Other Protobuf Version</summary>
# change directory to /onnx in the terminal
cd onnx

# execute the script to generate the pb files
bash make_pb.sh
mkdir build && cd build
cmake ..
make yolo -j64
make yolo -j64
</details> <details> <summary>TensorRT 7.x support</summary>
  1. Replace src/tensorRT/onnx_parser with onnx_parser_for_7.x/onnx_parser
    • bash onnx_parser/use_tensorrt_7.x.sh
  2. Point the TensorRT path in Makefile/CMakeLists.txt to TensorRT 7.x
  3. Execute make yolo -j64
</details> <details> <summary>TensorRT 8.x support</summary>
  1. Replace src/tensorRT/onnx_parser with onnx_parser_for_8.x/onnx_parser
    • bash onnx_parser/use_tensorrt_8.x.sh
  2. Point the TensorRT path in Makefile/CMakeLists.txt to TensorRT 8.x
  3. Execute make yolo -j64
</details>

Guide for Different Tasks/Model Support

<details> <summary>YoloV5 Support</summary>
  1. Download yolov5
git clone git@github.com:ultralytics/yolov5.git
  2. Modify the code for dynamic batchsize
# line 55 forward function in yolov5/models/yolo.py 
# bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
# modified into:

bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
bs = -1
ny = int(ny)
nx = int(nx)
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

# line 70 in yolov5/models/yolo.py
#  z.append(y.view(bs, -1, self.no))
# modified into:
z.append(y.view(bs, self.na * ny * nx, self.no))

############# for yolov5-6.0 #####################
# line 65 in yolov5/models/yolo.py
# if self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:
#    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)
# modified into:
if self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:
    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

# disconnect for pytorch trace
anchor_grid = (self.anchors[i].clone() * self.stride[i]).view(1, -1, 1, 1, 2)

# line 70 in yolov5/models/yolo.py
# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# modified into:
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * anchor_grid  # wh

# line 73 in yolov5/models/yolo.py
# wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# modified into:
wh = (y[..., 2:4] * 2) ** 2 * anchor_grid  # wh
############# for yolov5-6.0 #####################


# line 52 in yolov5/export.py
# torch.onnx.export(dynamic_axes={'images': {0: 'batch', 2: 'height', 3: 'width'},  # shape(1,3,640,640)
#                                'output': {0: 'batch', 1: 'anchors'}  # shape(1,25200,85)
# modified into:
torch.onnx.export(dynamic_axes={'images': {0: 'batch'},  # shape(1,3,640,640)
                                'output': {0: 'batch'}  # shape(1,25200,85) 
  3. Export to onnx model
cd yolov5
python export.py --weights=yolov5s.pt --dynamic --include=onnx --opset=11
  4. Copy the model and execute it
cp yolov5/yolov5s.onnx tensorRT_cpp/workspace/
cd tensorRT_cpp
make yolo -j32
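The modification above replaces the inferred `-1` in `y.view(bs, -1, self.no)` with the explicit constant `self.na * ny * nx`, so that only the batch dimension stays dynamic. As a worked check, the sketch below recomputes the second output dimension `25200` quoted in `shape(1,25200,85)`. It assumes the usual P5 strides 8, 16 and 32 and 3 anchors per cell (`self.na`); the function name is introduced here for illustration.

```python
def yolov5_num_predictions(input_size, strides=(8, 16, 32), anchors_per_cell=3):
    """Total number of predictions: sum over heads of na * ny * nx."""
    total = 0
    for stride in strides:
        cells = input_size // stride          # ny == nx for a square input
        total += anchors_per_cell * cells * cells
    return total


p5 = yolov5_num_predictions(640)                              # 3 * (80*80 + 40*40 + 20*20)
p6 = yolov5_num_predictions(1280, strides=(8, 16, 32, 64))    # assumed P6 strides
print(p5, p6)
```

For a 640x640 input this gives 25200, matching the exported output shape; the 1280x1280 P6 figure is derived under the assumed stride set and is not stated in this document.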
</details> <details> <summary>YoloV7 Support</summary>
  1. Download yolov7 and the pretrained weights (yolov7.pt)
# from cdn
# or wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt

wget https://cdn.githubjs.cf/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt
git clone git@github.com:WongKinYiu/yolov7.git
  2. Modify the code for dynamic batchsize
# line 45 forward function in yolov7/models/yolo.py 
# bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
# modified into:

bs, _, ny, nx = map(int, x[i].shape)  # x(bs,255,20,20) to x(bs,3,20,20,85)
bs = -1
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

# line 52 in yolov7/models/yolo.py
# y = x[i].sigmoid()
# y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# z.append(y.view(bs, -1, self.no))
# modified into:
y = x[i].sigmoid()
xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(1, -1, 1, 1, 2)  # wh
classif = y[..., 4:]
y = torch.cat([xy, wh, classif], dim=-1)
z.append(y.view(bs, self.na * ny * nx, self.no))

# line 57 in yolov7/models/yolo.py
# return x if self.training else (torch.cat(z, 1), x)
# modified into:
return x if self.training else torch.cat(z, 1)


# line 52 in yolov7/models/export.py
# output_names=['classes', 'boxes'] if y is None else ['output'],
# dynamic_axes={'images': {0: 'batch', 2: 'height', 3: 'width'},  # size(1,3,640,640)
#               'output': {0: 'batch', 2: 'y', 3: 'x'}} if opt.dynamic else None)
# modified into:
output_names=['classes', 'boxes'] if y is None else ['output'],
dynamic_axes={'images': {0: 'batch'},  # size(1,3,640,640)
              'output': {0: 'batch'}} if opt.dynamic else None)

  3. Export to onnx model
cd yolov7
python models/export.py --dynamic --grid --weight=yolov7.pt
  4. Copy the model and execute it
cp yolov7/yolov7.onnx tensorRT_cpp/workspace/
cd tensorRT_cpp
make yolo -j32
</details> <details> <summary>YoloX Support</summary>
  1. Download YoloX
git clone git@github.com:Megvii-BaseDetection/YOLOX.git
cd YOLOX
  2. Modify the code. The modification ensures successful INT8 compilation and inference; otherwise the error "Missing scale and zero-point for tensor (Unnamed Layer* 686)" will be raised.
# line 206, forward function in yolox/models/yolo_head.py. Replace the commented code with the uncommented code
# self.hw = [x.shape[-2:] for x in outputs] 
self.hw = [list(map(int, x.shape[-2:])) for x in outputs]


# line 208 forward function in yolox/models/yolo_head.py. Replace the commented code with the uncommented code
# [batch, n_anchors_all, 85]
# outputs = torch.cat(
#     [x.flatten(start_dim=2) for x in outputs], dim=2
# ).permute(0, 2, 1)
proc_view = lambda x: x.view(-1, int(x.size(1)), int(x.size(2) * x.size(3)))
outputs = torch.cat(
    [proc_view(x) for x in outputs], dim=2
).permute(0, 2, 1)


# line 253 decode_output function in yolox/models/yolo_head.py Replace the commented code with the uncommented code
#outputs[..., :2] = (outputs[..., :2] + grids) * strides
#outputs[..., 2:4] = torch.exp(outputs[..., 2:4]) * strides
#return outputs
xy = (outputs[..., :2] + grids) * strides
wh = torch.exp(outputs[..., 2:4]) * strides
return torch.cat((xy, wh, outputs[..., 4:]), dim=-1)

# line 77 in tools/export_onnx.py
model.head.decode_in_inference = True
  3. Export to onnx

# download model
wget https://github.com/Megvii-BaseDetection/YOLOX/releases/download/0.1.1rc0/yolox_m.pth

# export
export PYTHONPATH=$PYTHONPATH:.
python tools/export_onnx.py -c yolox_m.pth -f exps/default/yolox_m.py --output-name=yolox_m.onnx --dynamic --no-onnxsim
  4. Copy the model and execute it
cp YOLOX/yolox_m.onnx tensorRT_cpp/workspace/
cd tensorRT_cpp
make yolo -j32
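The rewritten `decode_output` builds `xy` and `wh` as new tensors and concatenates them instead of assigning into slices of `outputs`. The arithmetic itself is unchanged; for a single prediction it can be sketched in plain Python as below. The function name and the sample values are illustrative only.

```python
import math


def decode_yolox_prediction(raw, grid_xy, stride):
    """Decode one raw prediction [tx, ty, tw, th, obj, cls...] to pixel units.

    xy = (raw_xy + grid) * stride, wh = exp(raw_wh) * stride,
    the remaining entries (objectness and class scores) pass through.
    """
    gx, gy = grid_xy
    x = (raw[0] + gx) * stride
    y = (raw[1] + gy) * stride
    w = math.exp(raw[2]) * stride
    h = math.exp(raw[3]) * stride
    return [x, y, w, h] + list(raw[4:])


decoded = decode_yolox_prediction([0.5, 0.5, 0.0, 0.0, 0.9, 0.1], grid_xy=(2, 3), stride=8)
print(decoded)
```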
</details> <details> <summary>YoloV3 Support</summary>
  1. Download yolov3
git clone git@github.com:ultralytics/yolov3.git
  2. Modify the code for dynamic batchsize
# line 55 forward function in yolov3/models/yolo.py 
# bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
# modified into:

bs, _, ny, nx = map(int, x[i].shape)  # x(bs,255,20,20) to x(bs,3,20,20,85)
bs = -1
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()


# line 70 in yolov3/models/yolo.py
#  z.append(y.view(bs, -1, self.no))
# modified into:
z.append(y.view(bs, self.na * ny * nx, self.no))

# line 62 in yolov3/models/yolo.py
# if self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:
#    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)
# modified into:
if self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:
    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)
anchor_grid = (self.anchors[i].clone() * self.stride[i]).view(1, -1, 1, 1, 2)

# line 70 in yolov3/models/yolo.py
# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# modified into:
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * anchor_grid  # wh

# line 73 in yolov3/models/yolo.py
# wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# modified into:
wh = (y[..., 2:4] * 2) ** 2 * anchor_grid  # wh


# line 52 in yolov3/export.py
# torch.onnx.export(dynamic_axes={'images': {0: 'batch', 2: 'height', 3: 'width'},  # shape(1,3,640,640)
#                                'output': {0: 'batch', 1: 'anchors'}  # shape(1,25200,85) 
# modified into:
torch.onnx.export(dynamic_axes={'images': {0: 'batch'},  # shape(1,3,640,640)
                                'output': {0: 'batch'}  # shape(1,25200,85) 
  3. Export to onnx model
cd yolov3
python export.py --weights=yolov3.pt --dynamic --include=onnx --opset=11
  4. Copy the model and execute it
cp yolov3/yolov3.onnx tensorRT_cpp/workspace/
cd tensorRT_cpp

# change src/application/app_yolo.cpp: main
# test(Yolo::Type::V3, TRT::Mode::FP32, "yolov3");

make yolo -j32
</details> <details> <summary>UNet Support</summary>
make dunet -j32
</details> <details> <summary>Retinaface Support</summary>
  1. Download Pytorch_Retinaface Repo
git clone git@github.com:biubug6/Pytorch_Retinaface.git
cd Pytorch_Retinaface
  2. Download the model from the Training section of README.md at https://github.com/biubug6/Pytorch_Retinaface#training, then unzip it to /weights. Here, we use mobilenet0.25_Final.pth

  3. Modify the code

# line 24 in models/retinaface.py
# return out.view(out.shape[0], -1, 2) is modified into 
return out.view(-1, int(out.size(1) * out.size(2) * 2), 2)

# line 35 in models/retinaface.py
# return out.view(out.shape[0], -1, 4) is modified into
return out.view(-1, int(out.size(1) * out.size(2) * 2), 4)

# line 46 in models/retinaface.py
# return out.view(out.shape[0], -1, 10) is modified into
return out.view(-1, int(out.size(1) * out.size(2) * 2), 10)

# The following modification ensures the output of resize node is based on scale rather than shape such that dynamic batch can be achieved.
# line 89 in models/net.py
# up3 = F.interpolate(output3, size=[output2.size(2), output2.size(3)], mode="nearest") is modified into
up3 = F.interpolate(output3, scale_factor=2, mode="nearest")

# line 93 in models/net.py
# up2 = F.interpolate(output2, size=[output1.size(2), output1.size(3)], mode="nearest") is modified into
up2 = F.interpolate(output2, scale_factor=2, mode="nearest")

# The following code removes the softmax (it sometimes causes bugs) and concatenates the outputs to simplify the decoding.
# line 123 in models/retinaface.py
# if self.phase == 'train':
#     output = (bbox_regressions, classifications, ldm_regressions)
# else:
#     output = (bbox_regressions, F.softmax(classifications, dim=-1), ldm_regressions)
# return output
# the above is modified into:
output = (bbox_regressions, classifications, ldm_regressions)
return torch.cat(output, dim=-1)

# set 'opset_version=11' to ensure a successful export
# torch_out = torch.onnx._export(net, inputs, output_onnx, export_params=True, verbose=False,
#     input_names=input_names, output_names=output_names)
# is modified into:
torch_out = torch.onnx._export(net, inputs, output_onnx, export_params=True, verbose=False, opset_version=11,
    input_names=input_names, output_names=output_names)




  4. Export to onnx
python convert_to_onnx.py
  5. Execute
cp FaceDetector.onnx ../tensorRT_cpp/workspace/mb_retinaface.onnx
cd ../tensorRT_cpp
make retinaface -j64
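Because the export above removes the softmax and concatenates the three heads along the last dimension, each prior yields one row of 4 + 2 + 10 = 16 values, and the softmax has to be applied during decoding. The sketch below shows that split for a single row. It is an illustration rather than the decoder used in this repository, and it assumes that index 1 of the two class logits is the face class, as in the upstream Pytorch_Retinaface detection script.

```python
import math


def split_retinaface_row(row):
    """Split one 16-value row into bbox (4), class probabilities (2), landmarks (10)."""
    if len(row) != 16:
        raise ValueError("expected 4 + 2 + 10 = 16 values per prior")
    bbox = list(row[0:4])
    logits = row[4:6]
    landmarks = list(row[6:16])
    # numerically stable softmax over the two class logits
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return bbox, probs, landmarks


row = [0.1, 0.2, 0.3, 0.4, 2.0, 2.0] + [0.0] * 10
bbox, probs, landmarks = split_retinaface_row(row)
face_score = probs[1]  # assumption: index 1 is the face class
print(bbox, probs, len(landmarks), face_score)
```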
</details> <details> <summary>DBFace Support</summary>
make dbface -j64
</details> <details> <summary>Scrfd Support</summary> </details> <details> <summary>Arcface Support</summary>
auto arcface = Arcface::create_infer("arcface_iresnet50.fp32.trtmodel", 0);
auto feature = arcface->commit(make_tuple(face, landmarks)).get();
cout << feature << endl;  // 1x512
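The 1x512 feature returned above is typically compared against another face's feature to decide whether two faces match. A common choice, assumed here and not stated in this document, is cosine similarity. The sketch uses plain Python lists as stand-ins for the feature vectors; whether the returned feature is already L2-normalized is not specified, so the norms are computed explicitly.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 512-dimensional vectors standing in for two Arcface features.
feature_a = [1.0] * 512
feature_b = [2.0] * 512                    # same direction, different scale
feature_c = [1.0, -1.0] * 256              # orthogonal to feature_a
print(cosine_similarity(feature_a, feature_b), cosine_similarity(feature_a, feature_c))
```

Any decision threshold on the similarity depends on the model and dataset and would need to be tuned separately.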
</details> <details> <summary>CenterNet Support</summary>

See tutorial/2.0 for the details.

</details> <details> <summary>Bert Support(Chinese Classification)</summary> </details>

INTRO to Interface

<details> <summary>Python Interface:Get onnx and trtmodel from pytorch model more easily</summary>
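import torch
import torchvision.models as models
import pytrt

model = models.resnet18(True).eval()
# example input (assumed shape for resnet18); adjust it to your model
dummy_input = torch.zeros(1, 3, 224, 224)

pytrt.from_torch(
    model, 
    dummy_input, 
    max_batch_size=16, 
    onnx_save_file="test.onnx", 
    engine_save_file="engine.trtmodel"
)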
</details> <details> <summary>Python Interface:TensorRT Inference</summary>
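import cv2
import torch
import torchvision.models as models
import pytrt as tp

# Detection with a compiled engine
engine_file = "yolox_m.fp32.trtmodel"                # example path to the trtmodel file
yolo   = tp.Yolo(engine_file, type=tp.YoloType.X)
image  = cv2.imread("inference/car.jpg")
bboxes = yolo.commit(image).get()

# Wrapping a PyTorch model
device    = "cuda:0"                                 # assumed device
model     = models.resnet18(True).eval().to(device)  # pt model
input     = torch.zeros(1, 3, 224, 224).to(device)   # example input (assumed shape)
trt_model = tp.from_torch(model, input)
trt_out   = trt_model(input)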
</details> <details> <summary>C++ Interface:YoloX Inference</summary>

// create infer engine on gpu 0
auto engine = Yolo::create_infer("yolox_m.fp32.trtmodel", Yolo::Type::X, 0);

// load image
auto image = cv::imread("1.jpg");

// do inference and get the result
auto box = engine->commit(image).get();
</details> <details> <summary>C++ Interface:Compile Model in FP32/FP16</summary>
TRT::compile(
  TRT::Mode::FP32,   // compile model in fp32
  3,                          // max batch size
  "plugin.onnx",              // onnx file
  "plugin.fp32.trtmodel",     // save path
  {}                         //  redefine the shape of input when needed
);
</details> <details> <summary>C++ Interface:Compile in int8</summary>
// define the int8 calibration function, which reads the data and hands it to the tensor.
auto int8process = [](int current, int count, vector<string>& images, shared_ptr<TRT::Tensor>& tensor){
    for(int i = 0; i < images.size(); ++i){
        // int8 compilation requires calibration. We read the image data and call set_norm_mat, which transfers the data into the tensor.
        auto image = cv::imread(images[i]);
        cv::resize(image, image, cv::Size(640, 640));
        float mean[] = {0, 0, 0};
        float std[]  = {1, 1, 1};
        tensor->set_norm_mat(i, image, mean, std);
    }
};


// Specify TRT::Mode as INT8
auto model_file = "yolov5m.int8.trtmodel";
TRT::compile(
  TRT::Mode::INT8,            // INT8
  3,                          // max batch size
  "yolov5m.onnx",             // onnx
  model_file,                 // saved filename
  {},                         // redefine the input shape when needed
  int8process,                // the callback function for calibration
  ".",                        // the directory containing the image data used for calibration
  ""                          // the directory where the data generated by calibration is saved (i.e. where the calibration data is loaded from)
);
</details> <details> <summary>C++ Interface:Inference</summary>
// load model and get a shared_ptr. get nullptr if fail to load.
auto engine = TRT::load_infer("yolov5m.fp32.trtmodel");

// print model info
engine->print();

// load image
auto image = cv::imread("demo.jpg");

// get the model input and output node, which can be accessed by name or index
auto input = engine->input(0);   // or auto input = engine->input("images");
auto output = engine->output(0); // or auto output = engine->output("output");

// put the image into the input tensor at batch index 0 by calling set_norm_mat()
float mean[] = {0, 0, 0};
float std[]  = {1, 1, 1};
input->set_norm_mat(0, image, mean, std);

// do the inference. Here sync(true) or async(false) is optional
engine->forward(); // engine->forward(true or false)

// get output_ptr, which can be used to access the output
float* output_ptr = output->cpu<float>();
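The pointer obtained above addresses the output tensor as one flat float buffer. For a YoloV5 export with shape (1,25200,85), row i starts at offset i * 85. The sketch below walks such a buffer in Python, assuming the usual YoloV5 row layout [cx, cy, w, h, objectness, class scores...]. The function name and the 0.25 threshold are hypothetical, and non-maximum suppression is left out.

```python
def iter_detections(flat, num_rows, row_len, conf_threshold=0.25):
    """Yield (cx, cy, w, h, confidence, class_id) for rows above the threshold."""
    for i in range(num_rows):
        row = flat[i * row_len:(i + 1) * row_len]   # same as output_ptr + i * row_len
        objectness = row[4]
        class_scores = row[5:]
        class_id = max(range(len(class_scores)), key=class_scores.__getitem__)
        confidence = objectness * class_scores[class_id]
        if confidence >= conf_threshold:
            yield (row[0], row[1], row[2], row[3], confidence, class_id)


# Toy buffer: 2 rows, 2 classes, so row_len = 4 + 1 + 2 = 7.
flat = [10.0, 20.0, 4.0, 6.0, 0.9, 0.1, 0.8,
        0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 0.5]
detections = list(iter_detections(flat, num_rows=2, row_len=7))
print(detections)
```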
</details> <details> <summary>C++ Interface:Plugin</summary>
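// Element-wise HSwish: output = x * clamp(x + 3, 0, 6) / 6
__global__ void HSwishKernel(float* input, float* output, int edge) {

    // framework macro; expected to define `position` and return when it is out of range (>= edge)
    KernelPositionBlock;
    float x = input[position];
    float a = x + 3;
    a = a < 0 ? 0 : (a >= 6 ? 6 : a);
    output[position] = x * a / 6;
}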

int HSwish::enqueue(const std::vector<GTensor>& inputs, std::vector<GTensor>& outputs, const std::vector<GTensor>& weights, void* workspace, cudaStream_t stream) {

    int count = inputs[0].count();
    auto grid = CUDATools::grid_dims(count);
    auto block = CUDATools::block_dims(count);
    HSwishKernel <<<grid, block, 0, stream >>> (inputs[0].ptr<float>(), outputs[0].ptr<float>(), count);
    return 0;
}


RegisterPlugin(HSwish);
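When writing a plugin like this, a CPU reference is handy for checking the kernel output. The function below mirrors the arithmetic of `HSwishKernel` for a single value, x * clamp(x + 3, 0, 6) / 6. It is a reference sketch and not part of the plugin API.

```python
def hswish(x):
    """Reference HSwish: x * clamp(x + 3, 0, 6) / 6."""
    a = min(max(x + 3.0, 0.0), 6.0)
    return x * a / 6.0


samples = [-4.0, -3.0, 0.0, 1.0, 3.0, 5.0]
reference = [hswish(v) for v in samples]
print(reference)
```

Comparing `reference` against the values copied back from the GPU, within a small tolerance, gives a quick sanity check for the kernel.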
</details>

About Us