Deploy models with TorchServe
TorchServe is the recommended model server for PyTorch, preinstalled in the AWS PyTorch Deep Learning Container (DLC). This powerful tool offers customers a consistent and user-friendly experience, delivering high performance in deploying multiple PyTorch models across various AWS instances, including CPU, GPU, Neuron, and Graviton, regardless of the model size or distribution.
TorchServe supports a wide array of advanced features, including dynamic batching, microbatching, model A/B testing, streaming, Torch XLA, TensorRT, ONNX, and IPEX. Moreover, it seamlessly integrates PiPPy, PyTorch's large model solution, enabling efficient handling of large models. TorchServe also supports popular open-source libraries such as DeepSpeed, Accelerate, Fast Transformers, and more, further expanding its capabilities. With TorchServe, AWS users can confidently deploy and serve their PyTorch models, taking advantage of its versatility and optimized performance across various hardware configurations and model types. For more detailed information, see the PyTorch documentation.
The following table lists the AWS PyTorch DLCs supported by TorchServe.
| Instance type | SageMaker PyTorch DLC link |
|---|---|
| CPU and GPU | |
| Neuron | |
| Graviton | |
The following sections describe the setup to build and test PyTorch DLCs on Amazon SageMaker.
Getting started
To get started, ensure that you have the following prerequisites:
- Ensure that you have access to an AWS account. Set up your environment so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For the purposes of testing in your personal account, you can attach the appropriate AWS managed permissions policies to the IAM role.
- Locally configure your dependencies, as shown in the following example:

    from datetime import datetime
    import os
    import json
    import logging
    import time

    # External Dependencies:
    import boto3
    from botocore.exceptions import ClientError
    import sagemaker

    sess = boto3.Session()
    sm = sess.client("sagemaker")
    region = sess.region_name
    account = boto3.client("sts").get_caller_identity().get("Account")

    smsess = sagemaker.Session(boto_session=sess)
    role = sagemaker.get_execution_role()

    # Configuration:
    bucket_name = smsess.default_bucket()
    prefix = "torchserve"
    output_path = f"s3://{bucket_name}/{prefix}/models"

    print(f"account={account}, region={region}, role={role}")
- Retrieve the PyTorch DLC image, as shown in the following example. SageMaker PyTorch DLC images are available in all AWS Regions. For more information, see the list of DLC container images.

    baseimage = sagemaker.image_uris.retrieve(
        framework="pytorch",
        region="<region>",
        py_version="py310",
        image_scope="inference",
        version="2.0.1",
        instance_type="ml.g4dn.16xlarge",
    )

- Create a local workspace.

    mkdir -p workspace/
Adding a package
The following sections describe how to add and preinstall packages to your PyTorch DLC image.
BYOC use cases
The following steps outline how to add a package to your PyTorch DLC image. For more information about customizing your container, see Building AWS Deep Learning Containers Custom Images.
- Suppose you want to add a package to the PyTorch DLC Docker image. Create a Dockerfile under the docker directory, as shown in the following example:

    mkdir -p workspace/docker
    cat workspace/docker/Dockerfile
    ARG BASE_IMAGE
    FROM $BASE_IMAGE

    # Install any additional libraries
    RUN pip install transformers==4.28.1
- Build and publish the customized Docker image by using the following build_and_push.sh script. (A sketch of how to reference the resulting image follows these steps.)

    # Download the script build_and_push.sh to workspace/docker
    ls workspace/docker
    build_and_push.sh  Dockerfile

    # Build and publish your Docker image
    reponame = "torchserve"
    versiontag = "demo-0.1"
    ./build_and_push.sh {reponame} {versiontag} {baseimage} {region} {account}
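After the script pushes the image, you need its full Amazon ECR URI to use it with SageMaker. The following is a minimal sketch of how you might construct that URI from the variables above; the repository name and tag are the illustrative values used in this example, and the standard ECR URI format is assumed.

    # Sketch: reference the customized image pushed to Amazon ECR (assumes the standard URI format)
    customimage = f"{account}.dkr.ecr.{region}.amazonaws.com/{reponame}:{versiontag}"

    # You can then pass customimage instead of baseimage when creating the SageMaker model later.
    print(customimage)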
SageMaker preinstall use cases
The following example shows you how to preinstall a package in your PyTorch DLC container. You must create a requirements.txt file locally under the workspace/code directory.

    mkdir -p workspace/code
    cat workspace/code/requirements.txt
    transformers==4.28.1
Create TorchServe model artifacts
In the following example, we use the pre-trained MNIST model in workspace/mnist-dev, implement a custom handler in mnist_handler.py, and use torch-model-archiver to build the model artifacts and upload them to Amazon S3.
- Configure the model parameters in model-config.yaml.

    ls -al workspace/mnist-dev
    mnist.py  mnist_handler.py  mnist_cnn.pt  model-config.yaml

    # config the model
    cat workspace/mnist-dev/model-config.yaml
    minWorkers: 1
    maxWorkers: 1
    batchSize: 4
    maxBatchDelay: 200
    responseTimeout: 300
- Build the model artifacts by using torch-model-archiver.

    torch-model-archiver \
        --model-name mnist \
        --version 1.0 \
        --model-file workspace/mnist-dev/mnist.py \
        --serialized-file workspace/mnist-dev/mnist_cnn.pt \
        --handler workspace/mnist-dev/mnist_handler.py \
        --config-file workspace/mnist-dev/model-config.yaml \
        --archive-format tgz

  If you want to preinstall a package, you must include the code directory in the tar.gz file.

    cd workspace
    torch-model-archiver \
        --model-name mnist \
        --version 1.0 \
        --model-file mnist-dev/mnist.py \
        --serialized-file mnist-dev/mnist_cnn.pt \
        --handler mnist-dev/mnist_handler.py \
        --config-file mnist-dev/model-config.yaml \
        --archive-format no-archive
    cd mnist
    mv ../code .
    tar cvzf mnist.tar.gz .
- Upload mnist.tar.gz to Amazon S3. (A Python alternative to the CLI upload is sketched after these steps.)

    # upload mnist.tar.gz to S3
    output_path = f"s3://{bucket_name}/{prefix}/models"
    aws s3 cp mnist.tar.gz {output_path}/mnist.tar.gz
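If you prefer to stay in Python rather than calling the AWS CLI, the SageMaker Python SDK's S3Uploader can perform the same upload. The following is a minimal sketch that reuses the bucket_name, prefix, and smsess variables defined earlier.

    # Sketch: upload the model archive with the SageMaker Python SDK instead of the AWS CLI
    from sagemaker.s3 import S3Uploader

    model_uri = S3Uploader.upload(
        local_path="mnist.tar.gz",
        desired_s3_uri=f"s3://{bucket_name}/{prefix}/models",
        sagemaker_session=smsess,
    )
    print(model_uri)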
Using single model endpoints to deploy with TorchServe
The following example shows you how to create a single model real-time inference endpoint, deploy the model to the endpoint, and test the endpoint by using the Amazon SageMaker Python SDK.

    from sagemaker.model import Model
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    # create the single model endpoint and deploy it on SageMaker
    model = Model(
        model_data=f"{output_path}/mnist.tar.gz",
        image_uri=baseimage,
        role=role,
        predictor_cls=Predictor,
        name="mnist",
        sagemaker_session=smsess,
    )
    endpoint_name = "torchserve-endpoint-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    predictor = model.deploy(
        instance_type="ml.g4dn.xlarge",
        initial_instance_count=1,
        endpoint_name=endpoint_name,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )

    # test the endpoint
    import random
    import numpy as np

    dummy_data = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}
    res = predictor.predict(dummy_data)
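When you finish testing, you can delete the endpoint to stop incurring charges. The following is a minimal cleanup sketch, assuming the predictor object from the previous example is still in scope.

    # clean up: delete the endpoint, its configuration, and the model
    predictor.delete_endpoint()
    predictor.delete_model()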
Using multi-model endpoints to deploy with TorchServe
Multi-model endpoints are a scalable and cost-effective solution for hosting large numbers of models behind one endpoint. They improve endpoint utilization by sharing the same fleet of resources and the same serving container to host all of your models. They also reduce deployment overhead because SageMaker manages dynamically loading and unloading models, as well as scaling resources based on traffic patterns. Multi-model endpoints are particularly useful for deep learning and generative AI models that require accelerated compute power.
By using TorchServe on SageMaker multi-model endpoints, you can speed up your development by using a serving stack that you are familiar with while leveraging the resource sharing and simplified model management that SageMaker multi-model endpoints provide.
The following example shows you how to create a multi-model endpoint, deploy the model to the endpoint, and test the endpoint by using the Amazon SageMaker Python SDK.

    from sagemaker.multidatamodel import MultiDataModel
    from sagemaker.model import Model
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    # create the model and the multi-model endpoint, and deploy it on SageMaker
    model = Model(
        model_data=f"{output_path}/mnist.tar.gz",
        image_uri=baseimage,
        role=role,
        sagemaker_session=smsess,
    )
    endpoint_name = "torchserve-endpoint-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    mme = MultiDataModel(
        name=endpoint_name,
        model_data_prefix=output_path,
        model=model,
        sagemaker_session=smsess,
    )
    mme.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )

    # list models
    list(mme.list_models())

    # create mnist v2 model artifacts (shell command)
    cp mnist.tar.gz mnistv2.tar.gz

    # add mnistv2 to the endpoint
    mme.add_model("mnistv2.tar.gz")

    # list models
    list(mme.list_models())

    predictor = Predictor(
        endpoint_name=mme.endpoint_name,
        sagemaker_session=smsess,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )

    # test the endpoint
    import random
    import numpy as np

    dummy_data = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}
    res = predictor.predict(data=dummy_data, target_model="mnist.tar.gz")
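If your client does not use the SageMaker Python SDK, you can also invoke a specific model on the multi-model endpoint through the boto3 SageMaker runtime client. The following is a minimal sketch, assuming the endpoint and the mnistv2.tar.gz artifact created above; the TargetModel parameter selects which archive under the model data prefix serves the request, and the JSON payload format simply mirrors the dummy data used earlier.

    # Sketch: invoke a specific model on the multi-model endpoint with boto3
    import json

    import boto3
    import numpy as np

    smr = boto3.client("sagemaker-runtime")
    payload = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}

    response = smr.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel="mnistv2.tar.gz",  # which archive under model_data_prefix handles the request
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    print(json.loads(response["Body"].read()))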
Metrics
TorchServe supports both system-level and model-level metrics. You can enable metrics in either log format mode or Prometheus mode through the environment variable TS_METRICS_MODE. You can use the TorchServe central metrics configuration file, metrics.yaml, to specify the types of metrics to be tracked, such as request counts, latency, memory usage, GPU utilization, and more. By referring to this file, you can gain insights into the performance and health of the deployed models and effectively monitor the TorchServe server's behavior in real time. For more detailed information, see the TorchServe metrics documentation.
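As an illustration, one way you might select the metrics mode when deploying with the SageMaker Python SDK is to set the TS_METRICS_MODE environment variable on the model. This is a hedged sketch, assuming the container passes TS_-prefixed environment variables through to the TorchServe configuration as described above.

    # Sketch: switch TorchServe to Prometheus-style metrics via an environment variable
    from sagemaker.model import Model

    model = Model(
        model_data=f"{output_path}/mnist.tar.gz",
        image_uri=baseimage,
        role=role,
        sagemaker_session=smsess,
        env={"TS_METRICS_MODE": "prometheus"},  # TorchServe defaults to "log" mode
    )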
You can access TorchServe metrics logs that are similar to the StatsD format through the Amazon CloudWatch log filter. The following is an example of a TorchServe metrics log:
    CPUUtilization.Percent:0.0|#Level:Host|#hostname:my_machine_name,timestamp:1682098185
    DiskAvailable.Gigabytes:318.0416717529297|#Level:Host|#hostname:my_machine_name,timestamp:1682098185
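The sketch below shows one way you might pull such metric lines out of CloudWatch Logs with boto3, reusing the endpoint_name from the earlier examples. The /aws/sagemaker/Endpoints/<endpoint-name> log group name and the filter pattern are assumptions based on the typical SageMaker endpoint logging layout, so adjust them for your deployment.

    # Sketch: filter TorchServe metric log lines for an endpoint from CloudWatch Logs
    import boto3

    logs = boto3.client("logs")
    response = logs.filter_log_events(
        logGroupName=f"/aws/sagemaker/Endpoints/{endpoint_name}",  # assumed standard log group name
        filterPattern="CPUUtilization",  # match the metric shown in the example above
        limit=20,
    )
    for event in response["events"]:
        print(event["message"])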