Model deployment with Triton Inference Server

Triton Inference Server is open source inference serving software that streamlines AI inference. With Triton, you can deploy models built with many deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more.

The SageMaker AI Triton containers help you deploy Triton Inference Server on the SageMaker AI Hosting platform to serve trained models in production. They support the different modes in which SageMaker AI operates. For a list of the Triton Inference Server containers available on SageMaker AI, see NVIDIA Triton Inference Containers (SM support only).

For end-to-end notebook examples, see the amazon-sagemaker-examples repository.

Hosting modes

The following SageMaker AI Hosting modes are supported by Triton containers:

  • Single model endpoints

    • This is SageMaker AI’s default mode of operation. In this mode, the Triton container can load a single model, or a single ensemble model.

    • The name of the model must be passed as a property of the container environment, which is part of the CreateModel SageMaker AI API call. The environment variable used to pass in the model name is SAGEMAKER_TRITON_DEFAULT_MODEL_NAME.

  • Single model endpoints with ensemble

    • Triton Inference Server supports ensembles. An ensemble is a pipeline, or a DAG (directed acyclic graph), of models. While an ensemble technically comprises multiple models, in the default single model endpoint mode, SageMaker AI can treat the ensemble proper (the meta-model that represents the pipeline) as the main model to load, and can subsequently load the associated models.

    • The ensemble proper’s model name must be used to load the model. It must be passed as a property of the container environment, which is part of the CreateModel SageMaker API call. The environment variable used to pass in the model name is SAGEMAKER_TRITON_DEFAULT_MODEL_NAME.

  • Multi-model endpoints

    • In this mode, SageMaker AI can serve multiple models on a single endpoint. You can use this mode by specifying the environment variable ‘MultiModel’: true as a property of the container environment, which is part of the CreateModel SageMaker API call.

    • By default, no model is loaded when the instance starts. To run an inference request against a particular model, specify the corresponding model's *.tar.gz file as an argument to the TargetModel property of the InvokeEndpoint SageMaker API call.

  • Multi-model endpoints with ensemble

    • In this mode, SageMaker AI functions as described for multi-model endpoints. However, the SageMaker AI Triton container can load multiple ensemble models, meaning that multiple model pipelines can run on the same instance. SageMaker AI treats every ensemble as one model, and the ensemble proper of each model can be invoked by specifying the corresponding *.tar.gz archive as the TargetModel.

    • For better memory management during dynamic memory LOAD and UNLOAD, we recommend that you keep the ensemble size small.
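
The hosting modes above differ mainly in what goes into the container definition of the CreateModel call and into the InvokeEndpoint call. The following is a minimal sketch of those request parameters, not a definitive implementation. The helper names, image URI, S3 paths, and model names are hypothetical. It also assumes that boto3 expresses the multi-model setting through the container's Mode field, and that application/octet-stream is an acceptable content type for your payload.

```python
def triton_container(image_uri, model_data_url, default_model=None, multi_model=False):
    """Build the container definition for a CreateModel request (hypothetical helper)."""
    container = {"Image": image_uri, "ModelDataUrl": model_data_url}
    if multi_model:
        # Multi-model endpoints: assumed to map to the container's Mode field in boto3.
        # ModelDataUrl then points at the S3 prefix that holds the *.tar.gz archives.
        container["Mode"] = "MultiModel"
    else:
        if not default_model:
            raise ValueError("single model mode needs the model (or ensemble) name")
        # Single model and single ensemble: name the model, or the ensemble proper.
        container["Environment"] = {
            "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": default_model,
        }
    return container


def invoke_kwargs(endpoint_name, body, target_model=None,
                  content_type="application/octet-stream"):
    """Build keyword arguments for an InvokeEndpoint request (hypothetical helper)."""
    kwargs = {"EndpointName": endpoint_name, "ContentType": content_type, "Body": body}
    if target_model is not None:
        # Multi-model endpoints: the *.tar.gz archive of the model or ensemble to run.
        kwargs["TargetModel"] = target_model
    return kwargs


single = triton_container(
    "<triton-image-uri>", "s3://example-bucket/resnet.tar.gz", default_model="resnet"
)
multi = triton_container("<triton-image-uri>", "s3://example-bucket/models/", multi_model=True)
request = invoke_kwargs("example-endpoint", b"{}", target_model="resnet.tar.gz")

# To send the requests (requires AWS credentials and real resources):
#   import boto3
#   boto3.client("sagemaker").create_model(
#       ModelName="example", ExecutionRoleArn="<role-arn>", PrimaryContainer=single)
#   boto3.client("sagemaker-runtime").invoke_endpoint(**request)
```

Building the parameters separately from the API calls keeps the difference between the modes visible: single model mode adds one environment variable, while multi-model mode changes the container mode and moves model selection to request time.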

Inference payload types

Triton supports two methods of sending an inference payload over the network: JSON and binary+json (binary-encoded JSON). The JSON payload in both cases includes the datatype, shape, and the actual inference request tensor. The request tensor must be a binary tensor.

With the binary+json format, you must specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. In the SageMaker AI Triton container, this is done using a custom Content-Type header: application/vnd.sagemaker-triton.binary+json;json-header-size={}. This is different from using the Inference-Header-Content-Length header on a stand-alone Triton Inference Server because custom headers are not allowed in SageMaker AI.
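
As an illustration of the two payload types, the sketch below builds a JSON request and a binary+json request for a single FP32 input. It assumes the request layout of Triton's HTTP inference protocol and its binary tensor data extension (a JSON header whose input carries a binary_data_size parameter, followed by the raw little-endian tensor bytes). The helper names and the tensor name INPUT__0 are hypothetical.

```python
import json
import struct


def build_json_request(name, shape, values, datatype="FP32"):
    """Encode the tensor values inside the JSON document itself."""
    request = {
        "inputs": [
            {"name": name, "shape": list(shape), "datatype": datatype, "data": list(values)}
        ]
    }
    return json.dumps(request).encode("utf-8")


def build_binary_json_request(name, shape, values):
    """Return (body, content_type) for a binary+json request with one FP32 input."""
    # Raw tensor bytes: little-endian 32-bit floats, flattened in row-major order.
    raw = struct.pack("<{}f".format(len(values)), *values)
    header = {
        "inputs": [
            {
                "name": name,
                "shape": list(shape),
                "datatype": "FP32",
                # Tells Triton how many raw bytes after the header belong to this input.
                "parameters": {"binary_data_size": len(raw)},
            }
        ]
    }
    header_bytes = json.dumps(header).encode("utf-8")
    # On SageMaker AI, the header length travels in the Content-Type value.
    content_type = (
        "application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
            len(header_bytes)
        )
    )
    return header_bytes + raw, content_type


json_body = build_json_request("INPUT__0", [1, 4], [1.0, 2.0, 3.0, 4.0])
body, content_type = build_binary_json_request("INPUT__0", [1, 4], [1.0, 2.0, 3.0, 4.0])
```

The body and content_type values would be passed as the Body and ContentType of the InvokeEndpoint call. The json-header-size value must be the byte length of the encoded JSON header, not its character count.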

Using config.pbtxt to set the model config

For Triton Inference Servers on SageMaker AI, each model must include a config.pbtxt file that specifies, at a minimum, the following configurations for the model:

  • name: While this is optional for models running outside of SageMaker AI, we recommend that you always provide a name for the models to be run in Triton on SageMaker AI.

  • platform and/or backend: Setting a backend is essential to specify the type of the model. Some backends have further classification, such as tensorflow_savedmodel or tensorflow_graphdef. Such options can be specified as part of the platform key in addition to the backend key. The most common backends are tensorrt, onnxruntime, tensorflow, pytorch, python, dali, fil, and openvino.

  • input: Specify three attributes for the input: name, data_type and dims (the shape).

  • output: Specify three attributes for the output: name, data_type and dims (the shape).

  • max_batch_size: Set the batch size to a value greater than or equal to 1 that indicates the maximum batch size that Triton should use with the model.
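
The minimum settings listed above can be sketched as a config.pbtxt written into a Triton model repository layout. This is an illustrative example only. The model name, tensor names, shapes, and the choice of the onnxruntime backend are assumptions, and the helper function is hypothetical.

```python
import tempfile
from pathlib import Path

# Illustrative config.pbtxt content. Because max_batch_size is greater than 0,
# dims describes a single item and leaves out the batch dimension.
MINIMAL_CONFIG = """name: "example_model"
backend: "onnxruntime"
max_batch_size: 8
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
"""


def write_model_config(repository, model_name, config_text):
    """Create <repository>/<model_name>/config.pbtxt plus an empty version directory."""
    model_dir = Path(repository) / model_name
    # Triton expects numbered version directories; the model file would go in "1".
    (model_dir / "1").mkdir(parents=True, exist_ok=True)
    config_path = model_dir / "config.pbtxt"
    config_path.write_text(config_text)
    return config_path


repository = tempfile.mkdtemp()
config_path = write_model_config(repository, "example_model", MINIMAL_CONFIG)
```

The resulting directory, with the model file added under the version directory, is what would be packaged into the model's *.tar.gz archive.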

For more details on configuring config.pbtxt, see Triton’s GitHub repository. Triton provides several configurations for tweaking model behavior. Some of the most common and important configuration options are:

  • instance_group: Instance groups specify the number and placement of the instances of a given model. They have the attributes count, kind, and gpus (used when kind is KIND_GPU). The count attribute is equivalent to the number of workers. For regular model serving, each worker has its own copy of the model. Similarly, in Triton, the count specifies the number of model copies per device. For example, if the instance_group kind is KIND_CPU, then the CPU has count number of model copies.

    Note

    On a GPU instance, the instance_group configuration applies per GPU device. For example, count number of model copies are placed on each GPU device unless you explicitly specify which GPU devices should load the model.

  • dynamic_batching and sequence_batching: Dynamic batching is used for stateless models, and sequence batching is used for stateful models (where you want to route a request to the same model instance every time). Batching schedulers enable a per-model queue, which helps increase throughput, depending on the batching configuration.

  • ensemble: An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models. It can be configured by specifying platform as ensemble. The ensemble configuration is just a representation of the model pipeline. On SageMaker AI, all the models under an ensemble are treated as dependents of the ensemble model and are counted as a single model for SageMaker AI metrics, such as LoadedModelCount.
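
To make the per-device meaning of count concrete, the following sketch renders an instance_group block and computes how many model copies result on an instance. It is a worked example under stated assumptions, not Triton code. Both helper functions are hypothetical, and the dynamic_batching fragment uses an arbitrary queue delay.

```python
def render_instance_group(count, kind="KIND_GPU", gpus=None):
    """Render an instance_group block for config.pbtxt (hypothetical helper)."""
    lines = ["instance_group [", "  {", "    count: {}".format(count),
             "    kind: {}".format(kind)]
    if kind == "KIND_GPU" and gpus is not None:
        # Restrict the model to specific GPU devices.
        lines.append("    gpus: [ {} ]".format(", ".join(str(g) for g in gpus)))
    lines += ["  }", "]"]
    return "\n".join(lines)


def total_model_copies(count, kind, gpus_on_instance, gpus=None):
    """Number of model copies implied by an instance_group setting."""
    if kind == "KIND_CPU":
        return count
    # On GPU instances, count applies to each GPU device that loads the model.
    devices = len(gpus) if gpus is not None else gpus_on_instance
    return count * devices


# Example fragment enabling dynamic batching with an illustrative 100 microsecond delay.
DYNAMIC_BATCHING = "dynamic_batching {\n  max_queue_delay_microseconds: 100\n}"

all_gpus = total_model_copies(2, "KIND_GPU", gpus_on_instance=4)            # 2 copies x 4 GPUs
one_gpu = total_model_copies(2, "KIND_GPU", gpus_on_instance=4, gpus=[0])   # 2 copies x 1 GPU
cpu_only = total_model_copies(2, "KIND_CPU", gpus_on_instance=0)            # 2 copies on CPU
block = render_instance_group(2, gpus=[0])
```

With count set to 2 on a four-GPU instance, the model is loaded eight times unless gpus narrows the placement, which matters when estimating GPU memory use.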

Publishing default Triton metrics to Amazon CloudWatch

The NVIDIA Triton Inference Container exposes metrics at port 8002 (configurable) for the different models and GPUs that are utilized in the Triton Inference Server. For full details of the default metrics that are available, see the GitHub page for the Triton Inference Server metrics. These metrics are in Prometheus format and can be scraped using a Prometheus scraper configuration.
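
As a rough illustration of the Prometheus text format, the sketch below parses a few metric lines into name, labels, and value. It is a simplified parser written for this example (it does not handle escaped quotes or braces inside label values), and the sample text is made up rather than captured from a running server.

```python
import re

# metric_name{label="value",...} value
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_prometheus_text(text):
    """Return a list of (name, labels, value) tuples from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Made-up sample in the shape of Triton's default metrics.
SAMPLE_TEXT = """# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet",version="1"} 42
nv_gpu_utilization{gpu_uuid="GPU-0"} 0.25
"""

samples = parse_prometheus_text(SAMPLE_TEXT)

# On a stand-alone server, the text could be fetched with, for example:
#   import urllib.request
#   text = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()
```

In practice a Prometheus scraper or the CloudWatch agent does this parsing for you. The example only shows what the scraped data looks like.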

Starting with version v23.07, the SageMaker AI Triton container supports publishing these metrics to Amazon CloudWatch when you specify a few environment variables. To scrape the Prometheus metrics, the SageMaker AI Triton container uses the Amazon CloudWatch agent.

The required environment variables that you must specify to collect metrics are as follows:

| Environment variable | Description | Example value |
| --- | --- | --- |
| SAGEMAKER_TRITON_ALLOW_METRICS | Specify this option to allow Triton to publish metrics to its Prometheus endpoint. | "true" |
| SAGEMAKER_TRITON_PUBLISH_METRICS_TO_CLOUDWATCH | Specify this option to start the pre-checks necessary to publish metrics to Amazon CloudWatch. | "true" |
| SAGEMAKER_TRITON_CLOUDWATCH_LOG_GROUP | Specify this option to point to the log group to which metrics are written. | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesTest" |
| SAGEMAKER_TRITON_CLOUDWATCH_METRIC_NAMESPACE | Specify this option to point to the metric namespace where you want to see and plot the metrics. | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesPublicTest" |
| SAGEMAKER_TRITON_METRICS_PORT | Specify this as 8002, or any other port. If SageMaker AI has not blocked the specified port, it is used. Otherwise, another non-blocked port is chosen automatically. | "8002" |

When publishing metrics with Triton on SageMaker AI, keep in mind the following limitations:

  • While you can generate custom metrics through the C-API and Python backend (v23.05 onwards), these are currently not supported for publishing to Amazon CloudWatch.

  • In SageMaker AI multi-model endpoints (MME) mode, Triton runs in an environment that requires model namespacing to be enabled because each model (except ensemble models) is treated as if it is in its own model repository. Currently, this creates a limitation for metrics. When model namespacing is enabled, Triton does not distinguish the metrics between two models with the same name belonging to different ensembles. As a workaround, make sure that every model being deployed has a unique name. This also makes it easier to look up your metrics in CloudWatch.
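
The required metrics variables listed earlier in this section can be assembled into the container environment as follows. This is a minimal sketch. The helper function, log group name, and namespace are hypothetical, and all values are passed as strings because container environment variables are string-valued.

```python
def triton_metrics_environment(log_group, namespace, port=8002):
    """Build the environment variables that enable CloudWatch publishing (hypothetical helper)."""
    if not log_group or not namespace:
        # Both are required once publishing to CloudWatch is turned on.
        raise ValueError("log_group and namespace are required for CloudWatch publishing")
    return {
        "SAGEMAKER_TRITON_ALLOW_METRICS": "true",
        "SAGEMAKER_TRITON_PUBLISH_METRICS_TO_CLOUDWATCH": "true",
        "SAGEMAKER_TRITON_CLOUDWATCH_LOG_GROUP": log_group,
        "SAGEMAKER_TRITON_CLOUDWATCH_METRIC_NAMESPACE": namespace,
        "SAGEMAKER_TRITON_METRICS_PORT": str(port),
    }


metrics_env = triton_metrics_environment(
    "/aws/SageMaker/Endpoints/TritonMetrics/ExampleEndpoint",
    "ExampleTritonMetrics",
)

# Merge with the other container settings before calling CreateModel, for example:
#   container["Environment"] = {**container.get("Environment", {}), **metrics_env}
```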

Environment variables

The following table lists the supported environment variables for Triton on SageMaker AI.

| Environment variable | Description | Type | Possible values |
| --- | --- | --- | --- |
| SAGEMAKER_MULTI_MODEL | Allows Triton to operate in SageMaker AI multi-model endpoints mode. | Boolean | true, false |
| SAGEMAKER_TRITON_DEFAULT_MODEL_NAME | Specify the model to be loaded in the SageMaker AI single model (default) mode. For ensemble mode, specify the name of the ensemble proper. | String | <model_name> as specified in config.pbtxt |
| SAGEMAKER_TRITON_PING_MODE | Defaults to ready in SageMaker AI's single model mode and to live in SageMaker AI's multi-model endpoints mode. | String | ready, live |
| SAGEMAKER_TRITON_DISABLE_MODEL_NAMESPACING | In the SageMaker AI Triton container, this is set to true by default. | Boolean | true, false |
| SAGEMAKER_BIND_TO_PORT | On SageMaker AI, the default port is 8080. You can set a different port in multi-container scenarios. | String | <port_number> |
| SAGEMAKER_SAFE_PORT_RANGE | This is set by the SageMaker AI platform when using multi-container mode. | String | <port_1><port_2> |
| SAGEMAKER_TRITON_ALLOW_GRPC | While SageMaker AI doesn't currently support GRPC, you can enable GRPC if you're using Triton in front of a custom reverse proxy. | Boolean | true, false |
| SAGEMAKER_TRITON_GRPC_PORT | The default port for GRPC is 8001, but you can change it. | String | <port_number> |
| SAGEMAKER_TRITON_THREAD_COUNT | Sets the number of default HTTP request handler threads. | String | <number> |
| SAGEMAKER_TRITON_LOG_VERBOSE | true by default on SageMaker AI, but you can selectively turn this option off. | Boolean | true, false |
| SAGEMAKER_TRITON_LOG_INFO | false by default on SageMaker AI. | Boolean | true, false |
| SAGEMAKER_TRITON_LOG_WARNING | false by default on SageMaker AI. | Boolean | true, false |
| SAGEMAKER_TRITON_LOG_ERROR | false by default on SageMaker AI. | Boolean | true, false |
| SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE | Specify the shared memory (shm) size for the Python backend, in bytes. The default value is 16 MB but can be increased. | String | <number> |
| SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE | Specify the shared memory (shm) growth size for the Python backend, in bytes. The default value is 1 MB but can be increased to allow greater increments. | String | <number> |
| SAGEMAKER_TRITON_TENSORFLOW_VERSION | The default value is 2. Starting with v23.04, Triton no longer supports TensorFlow 1. You can configure this variable for previous versions. | String | <number> |
| SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT | Restricts the maximum percentage of GPU memory used for model loading, leaving the remainder for inference requests. | String | <number> |
| SAGEMAKER_TRITON_ALLOW_METRICS | false by default on SageMaker AI. | Boolean | true, false |
| SAGEMAKER_TRITON_METRICS_PORT | The default port is 8002. | String | <number> |
| SAGEMAKER_TRITON_PUBLISH_METRICS_TO_CLOUDWATCH | false by default on SageMaker AI. Set this variable to true to allow pushing Triton default metrics to Amazon CloudWatch. If this option is enabled, you are responsible for CloudWatch costs when metrics are published to your account. | Boolean | true, false |
| SAGEMAKER_TRITON_CLOUDWATCH_LOG_GROUP | Required if you've enabled metrics publishing to CloudWatch. | String | <cloudwatch_log_group_name> |
| SAGEMAKER_TRITON_CLOUDWATCH_METRIC_NAMESPACE | Required if you've enabled metrics publishing to CloudWatch. | String | <cloudwatch_metric_namespace> |
| SAGEMAKER_TRITON_ADDITIONAL_ARGS | Appends any additional arguments when starting the Triton Server. | String | <additional_args> |