TensorFlow Backend

The Triton backend for TensorFlow. You can learn more about backends in the backend repo. Ask questions or report problems on the main Triton issues page.

Frequently Asked Questions

Full documentation is included below but these shortcuts can help you get started in the right direction.

Where can I ask general questions about Triton and Triton backends?

Be sure to read all the information below as well as the general Triton documentation available in the main server repo. If you don't find your answer there you can ask questions on the main Triton issues page.

What versions of TensorFlow are supported by this backend?

Starting from 23.04, the TensorFlow backend only supports TensorFlow 2.x. You can find the specific version supported for any release by checking the Release Notes which are available from the main server repo.

Is the TensorFlow backend configurable?

Each model's configuration can enable TensorFlow-specific optimizations. There are also a few command-line options that can be used to configure the backend when launching Triton.

How do I build the TensorFlow backend?

See build instructions below.

Can I use any version of TensorFlow when building the backend?

Currently you must use a version of TensorFlow from NGC. See custom TensorFlow build instructions below.

How does the TensorFlow backend manage GPU memory?

The TensorFlow backend does not "release" GPU memory until the Triton process exits. TensorFlow uses a pool allocator and so it retains any memory it allocates until its own process exits. It will reuse that memory if you load another TensorFlow model, but it will not return it to the system, even if it is no longer using it. For this reason, it is preferred to keep TensorFlow models grouped together on the same Triton process if you will be repeatedly loading/unloading them.

From the TensorFlow GPU docs: "Memory is not released since it can lead to memory fragmentation".

Workarounds

The following are a few available options to limit the total amount of memory that TensorFlow allocates:

You can use gpu-memory-fraction, described under Command-line Options below. This sets an upper bound on the total memory TensorFlow can allocate for the process. Note, however, that this option sets allow-growth to false, so running TF models might still fail if TF needs to allocate more memory for its executions than the limit allows.
To limit large growths in memory from concurrent TensorFlow executions, you can also use the rate limiter in Triton to limit the number of requests allowed to enter execution.

Auto-Complete Model Configuration

Assuming Triton was not started with the --disable-auto-complete-config command-line option, the TensorFlow backend uses the metadata available in a TensorFlow SavedModel to populate the required fields in the model's config.pbtxt. You can learn more about Triton's support for auto-completing model configuration in the general Triton documentation.

Models in GraphDef format, however, do not carry sufficient metadata, so Triton cannot generate a model configuration for them. A config.pbtxt must therefore be provided explicitly for such models.

The TensorFlow backend can complete the following fields in the model configuration:

max_batch_size

Auto-completing max_batch_size depends on the following two conditions:

Auto-complete has determined that the model is capable of batching requests.
max_batch_size is 0 in the model configuration or max_batch_size is omitted from the model configuration.

If both conditions are met, max_batch_size is set to default-max-batch-size. Otherwise max_batch_size is set to 0.
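
The two conditions can be sketched as a small predicate. This is only an illustration of the rule as stated in this section, not the backend's own code; the function name and its arguments are hypothetical.

```python
def applies_default_max_batch_size(model_supports_batching, configured_max_batch_size=None):
    """Return True when auto-complete would use default-max-batch-size.

    Hypothetical helper: both conditions from the text must hold.
    configured_max_batch_size is None when the field is omitted from config.pbtxt.
    """
    # Condition 2: the field is omitted or explicitly set to 0.
    not_configured = configured_max_batch_size in (None, 0)
    # Condition 1: auto-complete detected batching support in the model.
    return bool(model_supports_batching) and not_configured


# A batching-capable model with no max_batch_size in its config gets the default.
print(applies_default_max_batch_size(True))
# An explicit non-zero max_batch_size means the default is not applied.
print(applies_default_max_batch_size(True, 8))
```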

Inputs and Outputs

The TensorFlow backend is able to fill in the name, data_type, and dims, provided this information is available in the model. Known limitations are inputs that are defined in the ragged_batching and sequence_batching fields: there is not enough information in the model for the backend to auto-complete these. Additionally, the backend cannot auto-complete configuration for scalar tensors.

Auto-completing outputs follows these rules:

If outputs is empty or undefined in the model configuration, all outputs in the SavedModel will be auto-completed.
If one or more outputs are defined in outputs, those outputs which are defined will be auto-completed and those which are omitted will be ignored.
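
As a rough sketch of these two rules, the selection can be written as a filter over the model's output names. The function and the output names below are hypothetical and only mirror the behavior described here.

```python
def outputs_to_autocomplete(model_outputs, configured_outputs=None):
    """Return the names of the outputs that would be auto-completed.

    Hypothetical helper: model_outputs are the output names found in the
    SavedModel, configured_outputs are the names listed in config.pbtxt.
    """
    # Rule 1: nothing configured, so every model output is auto-completed.
    if not configured_outputs:
        return list(model_outputs)
    # Rule 2: only the configured outputs are completed; the rest are ignored.
    configured = set(configured_outputs)
    return [name for name in model_outputs if name in configured]


print(outputs_to_autocomplete(["scores", "labels"]))
print(outputs_to_autocomplete(["scores", "labels"], ["labels"]))
```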

Dynamic Batching

If max_batch_size > 1 and no scheduler is provided, the dynamic batch scheduler will be enabled with default settings.

Command-line Options

The command-line options configure properties of the TensorFlow backend that are then applied to all models that use the backend.

--backend-config=tensorflow,allow-soft-placement=<boolean>

Instruct TensorFlow to use CPU implementation of an operation when a GPU implementation is not available.

--backend-config=tensorflow,gpu-memory-fraction=<float>

Reserve a portion of GPU memory for TensorFlow models. The default value of 0.0 indicates that TensorFlow should dynamically allocate memory as needed. A value of 1.0 indicates that TensorFlow should allocate all of the GPU memory.

--backend-config=tensorflow,version=<int>

Select the version of the TensorFlow library to be used. The default version is 2. Note that starting from the 23.04 release, the TensorFlow backend only supports TensorFlow 2. If you'd like to use TensorFlow 1 with a Triton release prior to 23.04, you can set the version to 1 using this command-line option.

--backend-config=tensorflow,default-max-batch-size=<int>

The default value to use for max_batch_size during auto-completing model configuration when batching support is detected in the model. Note that if not explicitly provided, the default value for this option is 4.
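
The options above share one flag pattern, so a launch command can be assembled programmatically. The sketch below is a hypothetical helper, not part of Triton. It assumes the server binary is named tritonserver and takes a --model-repository flag, uses /models as a placeholder path, and assumes boolean values are written as true/false.

```python
import shlex


def tensorflow_backend_args(**options):
    """Turn option_name=value keyword arguments into --backend-config flags."""
    args = []
    for name, value in options.items():
        # Python identifiers use underscores; the backend options use dashes.
        key = name.replace("_", "-")
        if isinstance(value, bool):
            # Assumption: booleans are passed as lowercase true/false.
            value = str(value).lower()
        args.append(f"--backend-config=tensorflow,{key}={value}")
    return args


# Assumed binary name and placeholder model repository path.
command = ["tritonserver", "--model-repository=/models"] + tensorflow_backend_args(
    allow_soft_placement=True,
    gpu_memory_fraction=0.5,
    default_max_batch_size=8,
)
print(shlex.join(command))
```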

Build the TensorFlow Backend

Use a recent cmake to build. First install the required dependencies.

$ apt-get install patchelf rapidjson-dev

The backend can be built to support TensorFlow 2.x. Starting from 23.04, Triton no longer supports TensorFlow 1.x and exclusively uses TensorFlow 2.x. An appropriate TensorFlow container from NGC must be used. For example, to build a backend that uses the 23.04 version of the TensorFlow 2.x container from NGC:

$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_TENSORFLOW_DOCKER_IMAGE="nvcr.io/nvidia/tensorflow:23.04-tf2-py3" ..
$ make install

The following required Triton repositories will be pulled and used in the build. By default the "main" branch/tag will be used for each repo but the listed CMake argument can be used to override.

triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=[tag]
triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]
triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag]

Build the TensorFlow Backend With Custom TensorFlow

Currently, Triton requires that a specially patched version of TensorFlow be used with the TensorFlow backend. The full source for these TensorFlow versions is available as Docker images from NGC. For example, the TensorFlow 2.x version compatible with the 23.04 release of Triton is available as nvcr.io/nvidia/tensorflow:23.04-tf2-py3.

You can modify and rebuild TensorFlow within these images to generate the shared libraries needed by the Triton TensorFlow backend. In the TensorFlow 2.x container you rebuild using:

$ /opt/tensorflow/nvbuild.sh

After rebuilding within the container you should save the updated container as a new Docker image (for example, by using docker commit), and then build the backend as described above with TRITON_TENSORFLOW_DOCKER_IMAGE set to refer to the new Docker image.

Using the TensorFlow Backend

Platform

TensorFlow recognizes two kinds of model format, GraphDef and SavedModel. To differentiate the model format, specify the appropriate platform in the model's config.pbtxt file, such as:

# config.pbtxt for GraphDef format
...
platform: "tensorflow_graphdef"
...

# config.pbtxt for SavedModel format
...
platform: "tensorflow_savedmodel"
...

Parameters

Configuration of TensorFlow for a model is done through the Parameters section of the model's config.pbtxt file. The parameters and their descriptions are as follows.

TF_NUM_INTRA_THREADS: Number of threads to use to parallelize the execution of an individual op. Auto-configured by default. Should be a non-negative number. See the TensorFlow protobuf definition for details.
TF_NUM_INTER_THREADS: Controls the number of operators that can be executed simultaneously. Auto-configured by default. See the TensorFlow protobuf definition for details.
TF_USE_PER_SESSION_THREADS: Boolean value that controls whether per-session threads are used. "True", "On" and "1" are accepted as true.
TF_GRAPH_TAG: Tag of the graphs to use. See the TensorFlow protobuf definition for details.
TF_SIGNATURE_DEF: Signature def to use. See the TensorFlow protobuf definition for details.
MAX_SESSION_SHARE_COUNT: This parameter specifies the maximum number of model instances that can share a TF session. The default value is 1, which means Triton will create a separate TF session for each model instance. If this parameter is set to the total number of instances, then Triton will create only a single TF session, which will be shared by all the instances. Sharing TF sessions among model instances can reduce the memory footprint of loading and executing the model.
TF_INIT_OPS_FILE: This parameter specifies the name of the file in JSON format that contains the initialization operations. The JSON file must have a single element named 'init_ops' which describes the list of initialization operations. This file can be stored in the model version folder or in the model directory. If it is provided in both locations, the one in the model version folder takes precedence over the one in the model folder. If it is provided in the model version folder, the directory structure should look like this:

|-- 1
|   |-- model.graphdef
|   `-- init_ops.json
`-- config.pbtxt

Below is an example of the contents of the init_ops.json file.

{
    "init_ops": ["init"]
}
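
A file with this layout can also be generated rather than written by hand. The following is a minimal sketch using only the Python standard library; the helper name and the directory passed to it are hypothetical.

```python
import json
from pathlib import Path


def write_init_ops(version_dir, init_ops, filename="init_ops.json"):
    """Write a JSON file with a single 'init_ops' element into version_dir.

    Hypothetical helper: version_dir is the model version folder (for
    example ".../1") and init_ops is the list of initialization op names.
    """
    path = Path(version_dir) / filename
    # Create the version folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"init_ops": list(init_ops)}, indent=4) + "\n")
    return path
```

Calling write_init_ops("<model directory>/1", ["init"]) would produce the file shown above; the file name passed here is the value to use for TF_INIT_OPS_FILE.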

The section of the model config file that specifies these parameters looks like:

parameters: {
  key: "TF_NUM_INTRA_THREADS"
  value: {
    string_value: "2"
  }
}

parameters: {
  key: "TF_USE_PER_SESSION_THREADS"
  value: {
    string_value: "True"
  }
}

parameters: {
  key: "TF_GRAPH_TAG"
  value: {
    string_value: "serve1"
  }
}

parameters: {
  key: "TF_INIT_OPS_FILE"
  value: {
    string_value: "init_ops.json"
  }
}

parameters: {
  key: "TF_SIGNATURE_DEF"
  value: {
    string_value: "serving2"
  }
}
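
When several models need similar settings, the repetitive blocks above can be generated from a dictionary. This is a hedged sketch: the helper is hypothetical, it emits every value as a string_value exactly as in the example above, and it does no escaping or validation of keys and values.

```python
def render_parameters(params):
    """Render a dict of parameter names and values as config.pbtxt text.

    Hypothetical helper: every value is converted to a string and emitted
    as string_value, matching the layout of the example in this section.
    """
    blocks = []
    for key, value in params.items():
        blocks.append(
            "parameters: {\n"
            f'  key: "{key}"\n'
            "  value: {\n"
            f'    string_value: "{value}"\n'
            "  }\n"
            "}"
        )
    # Separate blocks with a blank line, as in the example above.
    return "\n\n".join(blocks)


print(render_parameters({"TF_NUM_INTRA_THREADS": 2, "TF_GRAPH_TAG": "serve1"}))
```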

Important Notes

We have observed memory growth issues with the SavedModel format during model loading and unloading. It is possible that this is not an actual memory leak but rather a result of the system's malloc heuristics, which can keep memory from being released back to the operating system immediately. We have noticed a reduced memory footprint when replacing the default malloc library with either tcmalloc or jemalloc. Please refer to the documentation for instructions on how to use tcmalloc or jemalloc with Triton.