ONNX-MLIR Serving

This project implements a gRPC server, written in C++, to serve onnx-mlir compiled models. Thanks to the C++ implementation, ONNX-MLIR Serving has very low latency overhead and high throughput.

ONNX-MLIR Serving provides dynamic batch aggregation and a worker pool to fully utilize the AI accelerators on the machine.

ONNX-MLIR

ONNX-MLIR is compiler technology to transform a valid Open Neural Network Exchange (ONNX) graph into code that implements the graph with minimum runtime support. It implements the ONNX standard and is based on the underlying LLVM/MLIR compiler technology.

Build

There are two ways to build this project.

Build ONNX-MLIR Serving in a local environment

Prerequisite

1. gRPC is installed

Build gRPC from source.

gRPC installation dir example: grpc/cmake/install
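
A minimal sketch of building gRPC from source with CMake, based on the upstream gRPC build instructions; the release tag below is a placeholder, and the install prefix is chosen to match the example directory above:

# clone a gRPC release tag together with its submodules (tag is a placeholder)
git clone --recurse-submodules -b v1.50.0 --depth 1 https://github.com/grpc/grpc
cd grpc
mkdir -p cmake/build
cd cmake/build
# install into grpc/cmake/install so it matches the directory used later by -DCMAKE_PREFIX_PATH
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX=../install ../..
make -j
make install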

2. ONNX-MLIR is built

Copy the include files from the onnx-mlir source tree into the onnx-mlir build directory used by this project:

ls onnx-mlir-serving/onnx-mlir-build/*
onnx-mlir-serving/onnx-mlir-build/include:
benchmark  CMakeLists.txt  google  onnx  onnx-mlir  OnnxMlirCompiler.h  OnnxMlirRuntime.h  rapidcheck  rapidcheck.h

onnx-mlir-serving/onnx-mlir-build/lib:
libcruntime.a
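
A hedged sketch of assembling that directory; the onnx-mlir source and build paths below are assumptions and will likely differ on your machine:

mkdir -p onnx-mlir-serving/onnx-mlir-build/include onnx-mlir-serving/onnx-mlir-build/lib
# header files from the onnx-mlir source tree and its build tree (paths are assumptions)
cp -r onnx-mlir/include/. onnx-mlir-serving/onnx-mlir-build/include/
cp -r onnx-mlir/build/include/. onnx-mlir-serving/onnx-mlir-build/include/
# static runtime library produced by the onnx-mlir build (path is an assumption)
cp onnx-mlir/build/lib/libcruntime.a onnx-mlir-serving/onnx-mlir-build/lib/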

Build ONNX-MLIR Serving

cmake -DCMAKE_BUILD_TYPE=Release -DGRPC_DIR:STRING={GRPC_SRC_DIR} -DONNX_COMPILER_DIR:STRING={ONNX_MLIR_BUILD_DIR} -DCMAKE_PREFIX_PATH={GRPC_INSTALL_DIR} ../..
make -j
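
For example, assuming gRPC was cloned to $HOME/grpc and installed under $HOME/grpc/cmake/install, and the onnx-mlir-build directory from above sits in the project root (all paths are placeholders), the build could look like:

cd onnx-mlir-serving
mkdir -p cmake/build
cd cmake/build
# ../.. points back at the onnx-mlir-serving source root
cmake -DCMAKE_BUILD_TYPE=Release \
      -DGRPC_DIR:STRING=$HOME/grpc \
      -DONNX_COMPILER_DIR:STRING=$PWD/../../onnx-mlir-build \
      -DCMAKE_PREFIX_PATH=$HOME/grpc/cmake/install ../..
make -j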

Build ONNX-MLIR Serving on Docker environment

Build AI gRPC Server and Client

docker build -t onnx/aigrpc-server .
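
Once the image is built, a container can be started from it. This is only a sketch: it assumes the image's entrypoint launches grpc_server, and the published port is an assumption about the server's listening port:

# -p maps the assumed gRPC port; adjust it to the port the server actually listens on
docker run -it --rm -p 50051:50051 onnx/aigrpc-server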

Run ONNX-MLIR Server and Client

Server:

./grpc_server -h
usage: grpc_server [options]
    -w arg     wait time for batch size, default is 0
    -b arg     server side batch size, default is 1
    -n arg     thread number, default is 1

./grpc_server
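
A hedged example of starting the server with explicit options; the values are illustrative, and the unit of the wait time is not stated here:

# -b 8: aggregate up to 8 requests per batch
# -w 5000: how long to wait for a batch to fill (illustrative value, unit not documented here)
# -n 4: use 4 worker threads
./grpc_server -b 8 -w 5000 -n 4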

Add more models

Build Models Directory

cd cmake/build
mkdir models

Example models directory:

models
└── mnist
    ├── config
    ├── model.so
    └── model.onnx
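
Here model.so is the onnx-mlir compiled shared library for the model. As a sketch, assuming the onnx-mlir compiler driver is on PATH and mnist.onnx is the source model (both assumptions), it could be produced like this:

mkdir -p models/mnist
# --EmitLib compiles the ONNX graph to a shared library; -o sets the output base name,
# so this produces models/mnist/model.so
onnx-mlir --EmitLib mnist.onnx -o models/mnist/model
# keep the original ONNX file alongside it so the config can be generated from it
cp mnist.onnx models/mnist/model.onnx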

config

Describes the model configuration. It can be generated using utils/OnnxReader <model.onnx>. Example of the mnist config:

input {
  name: "Input3"
  type {
    tensor_type {
      elem_type: 1
      shape {
        dim {
          dim_value: 1
        }
        dim {
          dim_value: 1
        }
        dim {
          dim_value: 28
        }
        dim {
          dim_value: 28
        }
      }
    }
  }
}
output {
  name: "Plus214_Output_0"
  type {
    tensor_type {
      elem_type: 1
      shape {
        dim {
          dim_value: 1
        }
        dim {
          dim_value: 10
        }
      }
    }
  }
}
max_batch_size: 1
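
A hedged sketch of generating such a config with the reader utility mentioned above; whether the tool writes to stdout (assumed here) or directly to a file is not stated:

# redirecting stdout into the model's config file is an assumption
utils/OnnxReader models/mnist/model.onnx > models/mnist/config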

Inference request

See utils/inference.proto and utils/onnx.proto.

Use Batching

There are two places to set the batch size:

  1. In the model config file: 'max_batch_size'
  2. When starting grpc_server: -b [batch size]

situation_1: grpc_server without -b: the default batch size is 1, which means no batching.
situation_2: grpc_server -b <batch_size> with batch_size > 1, and model A's config has max_batch_size > 1: queries to model A use the minimum of the two batch sizes.
situation_3: grpc_server -b <batch_size> with batch_size > 1, and model B's config has max_batch_size = 1 (the generated default): queries to model B do not use batching.
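
For example, to get batching for the mnist model (situation_2), both values have to be greater than 1. A sketch, assuming the generated config shown above (the numbers are illustrative):

# raise the model-side limit in the mnist config from 1 to 8
sed -i 's/max_batch_size: 1/max_batch_size: 8/' models/mnist/config
# start the server with a server-side batch size of 8 and a non-zero wait time;
# queries to mnist are then batched up to min(8, 8) = 8
./grpc_server -b 8 -w 5000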

Example clients:

example/cpp or example/python

Example

See grpc-test.cc