Deploy models with TorchServe
TorchServe is the recommended model server for PyTorch, preinstalled in the AWS PyTorch Deep Learning Container (DLC). This powerful tool offers customers a consistent and user-friendly experience, delivering high performance in deploying multiple PyTorch models across various AWS instances, including CPU, GPU, Neuron, and Graviton, regardless of the model size or distribution.
TorchServe supports a wide array of advanced features, including dynamic batching, microbatching, model A/B testing, streaming, Torch XLA, TensorRT, ONNX, and IPEX. Moreover, it seamlessly integrates PiPPy, PyTorch's large model solution, enabling efficient handling of large models. TorchServe also supports popular open-source libraries such as DeepSpeed, Accelerate, Fast Transformers, and more, further expanding its capabilities. With TorchServe, AWS users can confidently deploy and serve their PyTorch models, taking advantage of its versatility and optimized performance across various hardware configurations and model types. For more detailed information, see the PyTorch documentation.
The following table lists the AWS PyTorch DLCs supported by TorchServe.
| Instance type | SageMaker PyTorch DLC link |
|---|---|
| CPU and GPU | |
| Neuron | |
| Graviton | |
The following sections describe the setup to build and test PyTorch DLCs on Amazon SageMaker.
Getting started
To get started, ensure that you have the following prerequisites:
- Ensure that you have access to an AWS account. Set up your environment so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For the purposes of testing in your personal account, you can attach the appropriate AWS managed permissions policies to the IAM role.
- Locally configure your dependencies, as shown in the following example:

    from datetime import datetime
    import os
    import json
    import logging
    import time

    # External Dependencies:
    import boto3
    from botocore.exceptions import ClientError
    import sagemaker

    sess = boto3.Session()
    sm = sess.client("sagemaker")
    region = sess.region_name
    account = boto3.client("sts").get_caller_identity().get("Account")

    smsess = sagemaker.Session(boto_session=sess)
    role = sagemaker.get_execution_role()

    # Configuration:
    bucket_name = smsess.default_bucket()
    prefix = "torchserve"
    output_path = f"s3://{bucket_name}/{prefix}/models"

    print(f"account={account}, region={region}, role={role}")
- Retrieve the PyTorch DLC image, as shown in the following example. SageMaker PyTorch DLC images are available in all AWS Regions. For more information, see the list of DLC container images.

    baseimage = sagemaker.image_uris.retrieve(
        framework="pytorch",
        region="<region>",
        py_version="py310",
        image_scope="inference",
        version="2.0.1",
        instance_type="ml.g4dn.16xlarge",
    )

- Create a local workspace.

    mkdir -p workspace/
Adding a package
The following sections describe how to add and preinstall packages to your PyTorch DLC image.
BYOC use cases
The following steps outline how to add a package to your PyTorch DLC image. For more information about customizing your container, see Building AWS Deep Learning Containers Custom Images.
- Suppose you want to add a package to the PyTorch DLC Docker image. Create a Dockerfile under the docker directory, as shown in the following example:

    mkdir -p workspace/docker
    cat workspace/docker/Dockerfile
    ARG BASE_IMAGE
    FROM $BASE_IMAGE

    # Install any additional libraries
    RUN pip install transformers==4.28.1
- Build and publish the customized Docker image by using the following build_and_push.sh script. (A sketch of how to reference the resulting image follows these steps.)

    # Download the script build_and_push.sh to workspace/docker
    ls workspace/docker
    build_and_push.sh  Dockerfile

    # Build and publish your Docker image
    reponame = "torchserve"
    versiontag = "demo-0.1"
    ./build_and_push.sh {reponame} {versiontag} {baseimage} {region} {account}
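After the script pushes the image, you need its full Amazon ECR URI to use it with SageMaker. The following is a minimal sketch of how you might construct that URI from the variables above; the repository name and tag are the illustrative values used in this example, and the standard ECR URI format is assumed.

    # Sketch: reference the customized image pushed to Amazon ECR (assumes the standard URI format)
    customimage = f"{account}.dkr.ecr.{region}.amazonaws.com/{reponame}:{versiontag}"

    # You can then pass customimage instead of baseimage when creating the SageMaker model later.
    print(customimage)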
SageMaker preinstall use cases
The following example shows you how to preinstall a package in your PyTorch DLC container. You must create a requirements.txt file locally under the workspace/code directory.

    mkdir -p workspace/code
    cat workspace/code/requirements.txt
    transformers==4.28.1
Create TorchServe model artifacts
In the following example, we use the pre-trained MNIST model in workspace/mnist-dev, implement a custom handler in mnist_handler.py, and use torch-model-archiver to build the model artifacts and upload them to Amazon S3.
- Configure the model parameters in model-config.yaml.

    ls -al workspace/mnist-dev
    mnist.py  mnist_handler.py  mnist_cnn.pt  model-config.yaml

    # config the model
    cat workspace/mnist-dev/model-config.yaml
    minWorkers: 1
    maxWorkers: 1
    batchSize: 4
    maxBatchDelay: 200
    responseTimeout: 300
- Build the model artifacts by using torch-model-archiver.

    torch-model-archiver \
        --model-name mnist \
        --version 1.0 \
        --model-file workspace/mnist-dev/mnist.py \
        --serialized-file workspace/mnist-dev/mnist_cnn.pt \
        --handler workspace/mnist-dev/mnist_handler.py \
        --config-file workspace/mnist-dev/model-config.yaml \
        --archive-format tgz

  If you want to preinstall a package, you must include the code directory in the tar.gz file.

    cd workspace
    torch-model-archiver \
        --model-name mnist \
        --version 1.0 \
        --model-file mnist-dev/mnist.py \
        --serialized-file mnist-dev/mnist_cnn.pt \
        --handler mnist-dev/mnist_handler.py \
        --config-file mnist-dev/model-config.yaml \
        --archive-format no-archive
    cd mnist
    mv ../code .
    tar cvzf mnist.tar.gz .
- Upload mnist.tar.gz to Amazon S3. (A Python alternative to the CLI upload is sketched after these steps.)

    # upload mnist.tar.gz to S3
    output_path = f"s3://{bucket_name}/{prefix}/models"
    aws s3 cp mnist.tar.gz {output_path}/mnist.tar.gz
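If you prefer to stay in Python rather than calling the AWS CLI, the SageMaker Python SDK's S3Uploader can perform the same upload. The following is a minimal sketch that reuses the bucket_name, prefix, and smsess variables defined earlier.

    # Sketch: upload the model archive with the SageMaker Python SDK instead of the AWS CLI
    from sagemaker.s3 import S3Uploader

    model_uri = S3Uploader.upload(
        local_path="mnist.tar.gz",
        desired_s3_uri=f"s3://{bucket_name}/{prefix}/models",
        sagemaker_session=smsess,
    )
    print(model_uri)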
Using single model endpoints to deploy with TorchServe
The following example shows you how to create a single model real-time inference endpoint, deploy the model to the endpoint, and test the endpoint by using the Amazon SageMaker Python SDK.

    from sagemaker.model import Model
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    # create the single model endpoint and deploy it on SageMaker
    model = Model(
        model_data=f"{output_path}/mnist.tar.gz",
        image_uri=baseimage,
        role=role,
        predictor_cls=Predictor,
        name="mnist",
        sagemaker_session=smsess,
    )
    endpoint_name = "torchserve-endpoint-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    predictor = model.deploy(
        instance_type="ml.g4dn.xlarge",
        initial_instance_count=1,
        endpoint_name=endpoint_name,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )

    # test the endpoint
    import random
    import numpy as np

    dummy_data = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}
    res = predictor.predict(dummy_data)
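When you finish testing, you can delete the endpoint to stop incurring charges. The following is a minimal cleanup sketch, assuming the predictor object from the previous example is still in scope.

    # clean up: delete the endpoint, its configuration, and the model
    predictor.delete_endpoint()
    predictor.delete_model()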
Using multi-model endpoints to deploy with TorchServe
Multi-model endpoints are a scalable and cost-effective solution for hosting large numbers of models behind one endpoint. They improve endpoint utilization by sharing the same fleet of resources and the same serving container to host all of your models. They also reduce deployment overhead because SageMaker manages dynamically loading and unloading models, as well as scaling resources based on traffic patterns. Multi-model endpoints are particularly useful for deep learning and generative AI models that require accelerated compute power.
By using TorchServe on SageMaker multi-model endpoints, you can speed up your development by using a serving stack that you are familiar with while leveraging the resource sharing and simplified model management that SageMaker multi-model endpoints provide.
The following example shows you how to create a multi-model endpoint, deploy the model to the endpoint, and test the endpoint by using the Amazon SageMaker Python SDK.

    from sagemaker.multidatamodel import MultiDataModel
    from sagemaker.model import Model
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    # create the model and the multi-model endpoint, and deploy it on SageMaker
    model = Model(
        model_data=f"{output_path}/mnist.tar.gz",
        image_uri=baseimage,
        role=role,
        sagemaker_session=smsess,
    )
    endpoint_name = "torchserve-endpoint-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    mme = MultiDataModel(
        name=endpoint_name,
        model_data_prefix=output_path,
        model=model,
        sagemaker_session=smsess,
    )
    mme.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )

    # list models
    list(mme.list_models())

    # create mnist v2 model artifacts (shell command)
    cp mnist.tar.gz mnistv2.tar.gz

    # add mnistv2 to the endpoint
    mme.add_model("mnistv2.tar.gz")

    # list models
    list(mme.list_models())

    predictor = Predictor(
        endpoint_name=mme.endpoint_name,
        sagemaker_session=smsess,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )

    # test the endpoint
    import random
    import numpy as np

    dummy_data = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}
    res = predictor.predict(data=dummy_data, target_model="mnist.tar.gz")
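If your client does not use the SageMaker Python SDK, you can also invoke a specific model on the multi-model endpoint through the boto3 SageMaker runtime client. The following is a minimal sketch, assuming the endpoint and the mnistv2.tar.gz artifact created above; the TargetModel parameter selects which archive under the model data prefix serves the request, and the JSON payload format simply mirrors the dummy data used earlier.

    # Sketch: invoke a specific model on the multi-model endpoint with boto3
    import json

    import boto3
    import numpy as np

    smr = boto3.client("sagemaker-runtime")
    payload = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}

    response = smr.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel="mnistv2.tar.gz",  # which archive under model_data_prefix handles the request
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    print(json.loads(response["Body"].read()))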
Metrics
TorchServe supports both system-level and model-level metrics. You can enable metrics in either log format mode or Prometheus mode through the environment variable TS_METRICS_MODE. You can use the TorchServe central metrics configuration file, metrics.yaml, to specify the types of metrics to be tracked, such as request counts, latency, memory usage, GPU utilization, and more. By referring to this file, you can gain insights into the performance and health of the deployed models and effectively monitor the TorchServe server's behavior in real time. For more detailed information, see the TorchServe metrics documentation.
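As an illustration, one way you might select the metrics mode when deploying with the SageMaker Python SDK is to set the TS_METRICS_MODE environment variable on the model. This is a hedged sketch, assuming the container passes TS_-prefixed environment variables through to the TorchServe configuration as described above.

    # Sketch: switch TorchServe to Prometheus-style metrics via an environment variable
    from sagemaker.model import Model

    model = Model(
        model_data=f"{output_path}/mnist.tar.gz",
        image_uri=baseimage,
        role=role,
        sagemaker_session=smsess,
        env={"TS_METRICS_MODE": "prometheus"},  # TorchServe defaults to "log" mode
    )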
You can access TorchServe metrics logs that are similar to the StatsD format through the Amazon CloudWatch log filter. The following is an example of a TorchServe metrics log:
    CPUUtilization.Percent:0.0|#Level:Host|#hostname:my_machine_name,timestamp:1682098185
    DiskAvailable.Gigabytes:318.0416717529297|#Level:Host|#hostname:my_machine_name,timestamp:1682098185
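The sketch below shows one way you might pull such metric lines out of CloudWatch Logs with boto3, reusing the endpoint_name from the earlier examples. The /aws/sagemaker/Endpoints/<endpoint-name> log group name and the filter pattern are assumptions based on the typical SageMaker endpoint logging layout, so adjust them for your deployment.

    # Sketch: filter TorchServe metric log lines for an endpoint from CloudWatch Logs
    import boto3

    logs = boto3.client("logs")
    response = logs.filter_log_events(
        logGroupName=f"/aws/sagemaker/Endpoints/{endpoint_name}",  # assumed standard log group name
        filterPattern="CPUUtilization",  # match the metric shown in the example above
        limit=20,
    )
    for event in response["events"]:
        print(event["message"])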