ONNX-MLIR Serving
This project implements a gRPC server written in C++ to serve onnx-mlir compiled models. Thanks to the C++ implementation, ONNX-MLIR Serving has very low latency overhead and high throughput.
ONNX-MLIR Serving provides dynamic batch aggregation and a worker pool to fully utilize the AI accelerators on the machine.
ONNX-MLIR
ONNX-MLIR is a compiler technology that transforms a valid Open Neural Network Exchange (ONNX) graph into code implementing the graph with minimum runtime support. It implements the ONNX standard and is based on the underlying LLVM/MLIR compiler technology.
Build
There are two ways to build this project.
- Build ONNX-MLIR Serving on local environment
- Build ONNX-MLIR Serving on Docker environment (Recommended)
Build ONNX-MLIR Serving on local environment
Prerequisites
1. gRPC is installed
gRPC installation directory example: grpc/cmake/install
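If gRPC is not installed yet, a possible way to build and install it from source into such a directory is sketched below, following the upstream gRPC C++ build flow; the version tag is an assumption, so substitute the release you need.
# Sketch: build gRPC from source and install it under grpc/cmake/install
git clone -b v1.46.3 --recurse-submodules https://github.com/grpc/grpc
cd grpc
mkdir -p cmake/build && cd cmake/build
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX=$PWD/../install ../..
make -j && make install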
2. ONNX-MLIR is built
Copy the include files from the onnx-mlir source tree into the onnx-mlir build directory, as in the sketch after the listing below.
ls onnx-mlir-serving/onnx-mlir-build/*
onnx-mlir-serving/onnx-mlir-build/include:
benchmark CMakeLists.txt google onnx onnx-mlir OnnxMlirCompiler.h OnnxMlirRuntime.h rapidcheck rapidcheck.h
onnx-mlir-serving/onnx-mlir-build/lib:
libcruntime.a
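A possible copy step, assuming ONNX_MLIR_SRC points at the onnx-mlir source tree and ONNX_MLIR_BUILD at the build directory listed above (both paths are placeholders to adjust):
# Placeholder paths; point them at your own onnx-mlir checkout and build.
export ONNX_MLIR_SRC=$HOME/onnx-mlir
export ONNX_MLIR_BUILD=$HOME/onnx-mlir/build
cp -r $ONNX_MLIR_SRC/include/* $ONNX_MLIR_BUILD/include/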
Build ONNX-MLIR Serving
cmake -DCMAKE_BUILD_TYPE=Release -DGRPC_DIR:STRING={GRPC_SRC_DIR} -DONNX_COMPILER_DIR:STRING={ONNX_MLIR_BUILD_DIR} -DCMAKE_PREFIX_PATH={GRPC_INSTALL_DIR} ../..
make -j
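For concreteness, a possible invocation from a cmake/build directory inside onnx-mlir-serving, with hypothetical paths filled in for the placeholders:
# Example paths only; substitute your own gRPC source, gRPC install, and onnx-mlir build locations.
mkdir -p cmake/build && cd cmake/build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DGRPC_DIR:STRING=$HOME/grpc \
      -DONNX_COMPILER_DIR:STRING=$HOME/onnx-mlir/build \
      -DCMAKE_PREFIX_PATH=$HOME/grpc/cmake/install \
      ../..
make -j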
Build ONNX-MLIR Serving on Docker environment
Build AI gRPC Server and Client
docker build -t onnx/aigrpc-server .
Run ONNX-MLIR Server and Client
Server:
./grpc_server -h
usage: grpc_server [options]
-w arg wait time for batch size, default is 0
-b arg server side batch size, default is 1
-n arg thread number, default is 1
./grpc_server
Add more models
Build Models Directory
In the cmake/build directory:
mkdir models
Example models directory:
models
└── mnist
├── config
├── model.so
└── model.onnx
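Here model.so is the shared library produced by compiling the ONNX model with onnx-mlir. A sketch of how it could be generated for the mnist example, using onnx-mlir's --EmitLib option (paths and optimization level are assumptions):
# Compile the ONNX model into model.so alongside it (example paths).
onnx-mlir -O3 --EmitLib models/mnist/model.onnx -o models/mnist/model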
config
Describes the model configuration; it can be generated using utils/OnnxReader <model.onnx>. Example of the mnist config:
input {
  name: "Input3"
  type {
    tensor_type {
      elem_type: 1
      shape {
        dim {
          dim_value: 1
        }
        dim {
          dim_value: 1
        }
        dim {
          dim_value: 28
        }
        dim {
          dim_value: 28
        }
      }
    }
  }
}
output {
  name: "Plus214_Output_0"
  type {
    tensor_type {
      elem_type: 1
      shape {
        dim {
          dim_value: 1
        }
        dim {
          dim_value: 10
        }
      }
    }
  }
}
max_batch_size: 1
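Assuming utils/OnnxReader writes the generated config to standard output (an assumption worth verifying against the tool), the config above could be produced with something like:
# Hypothetical redirection; check where OnnxReader actually writes its output.
./utils/OnnxReader models/mnist/model.onnx > models/mnist/config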
Inference request
See utils/inference.proto and utils/onnx.proto.
Use Batching
There are two places to set the batch size:
- In the model config file: max_batch_size
- When starting the server: grpc_server -b [batch size]
Situation 1: grpc_server is started without -b; the default batch size is 1, meaning no batching.
Situation 2: grpc_server -b <batch_size> with batch_size > 1, and model A's config has max_batch_size > 1; queries to model A use the minimum of the two batch sizes.
Situation 3: grpc_server -b <batch_size> with batch_size > 1, and model B's config has max_batch_size = 1 (the value generated by default); queries to model B do not use batching.
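For example, to enable server-side batching with the documented flags (the unit of the -w wait time is not stated here and is an assumption to verify):
# Batch up to 4 requests, wait briefly for a batch to fill, use 2 worker threads.
# The effective batch size for a model is the minimum of -b and its max_batch_size.
./grpc_server -b 4 -w 1000 -n 2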
Example clients:
example/cpp or example/python
Example
See grpc-test.cc
- TEST_F is the simplest example of serving the mnist model.