Invoke models for real-time inference
After you use Amazon SageMaker AI to deploy a model to an endpoint, you can interact with the model by sending inference requests to it. To send an inference request to a model, you invoke the endpoint that hosts it. You can invoke your endpoints using Amazon SageMaker Studio, the AWS SDKs, or the AWS CLI.
Invoke Your Model Using Amazon SageMaker Studio
After you deploy your model to an endpoint, you can view the endpoint through Amazon SageMaker Studio and test your endpoint by sending single inference requests.
Note
SageMaker AI only supports endpoint testing in Studio for real-time endpoints.
To send a test inference request to your endpoint
- Launch Amazon SageMaker Studio.
- In the navigation pane on the left, choose Deployments.
- From the dropdown, choose Endpoints.
- Find your endpoint by name, and choose the name in the table. The endpoint names listed in the Endpoints panel are defined when you deploy a model. The Studio workspace opens the Endpoint page in a new tab.
- Choose the Test inference tab.
- For Testing Options, select one of the following:
  - Select Test the sample request to immediately send a request to your endpoint. Use the JSON editor to provide sample data in JSON format, and choose Send Request to submit the request to your endpoint. After submitting your request, Studio shows the inference output in a card to the right of the JSON editor.
  - Select Use Python SDK example code to view the code for sending a request to the endpoint. Then, copy the code example from the Example inference request section and run the code from your testing environment.
- The top of the card shows the type of request that was sent to the endpoint (only JSON is accepted). The card shows the following fields:
Status – Displays one of the following status types:
- Success – The request succeeded.
- Failed – The request failed. A response appears under Failure Reason.
- Pending – While the inference request is pending, the status shows a spinning, circular icon.
Execution Length – How long the invocation took (end time minus the start time) in milliseconds.
Request Time – How many minutes have passed since the request was sent.
Result Time – How many minutes have passed since the result was returned.
Invoke Your Model by Using the AWS SDK for Python (Boto3)
If you want to invoke a model endpoint in your application code, you can use one of the AWS SDKs, including the AWS SDK for Python (Boto3). To invoke your endpoint with this SDK, you use one of the following Python methods:
- invoke_endpoint – Sends an inference request to a model endpoint and returns the response that the model generates. This method returns the inference payload as one response after the model finishes generating it. For more information, see invoke_endpoint in the AWS SDK for Python (Boto3) API Reference.
- invoke_endpoint_with_response_stream – Sends an inference request to a model endpoint and streams the response incrementally while the model generates it. With this method, your application receives parts of the response as soon as the parts become available. For more information, see invoke_endpoint_with_response_stream in the AWS SDK for Python (Boto3) API Reference. Use this method only to invoke models that support inference streaming.
Before you can use these methods in your application code, you must initialize a SageMaker AI Runtime client, and you must specify the name of your endpoint. The following example sets up the client and endpoint for the rest of the examples that follow:
import boto3

sagemaker_runtime = boto3.client(
    "sagemaker-runtime", region_name='aws_region')

endpoint_name = 'endpoint-name'
Invoke to Get an Inference Response
The following example uses the invoke_endpoint
method to invoke an
endpoint with the AWS SDK for Python (Boto3):
# Gets inference from the model hosted at the specified endpoint:
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=bytes('{"features": ["This is great!"]}', 'utf-8')
)

# Decodes and prints the response body:
print(response['Body'].read().decode('utf-8'))
This example provides input data in the Body
field for SageMaker AI to pass to
the model. This data must be in the same format that was used for training. The
example assigns the response to the response
variable.
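As a minimal sketch of preparing such a request, the following helper builds the keyword arguments for invoke_endpoint from a JSON-serializable payload. The helper name build_invoke_kwargs is hypothetical, and the sketch assumes that your model container accepts JSON. ContentType and Accept are optional parameters of invoke_endpoint that describe the request and the desired response format.

```python
import json


def build_invoke_kwargs(endpoint_name, payload):
    """Builds keyword arguments for invoke_endpoint (hypothetical helper)."""
    return {
        "EndpointName": endpoint_name,
        # Serializes the payload to JSON text and encodes it as UTF-8 bytes:
        "Body": json.dumps(payload).encode("utf-8"),
        # Tells the model container how to interpret the request body
        # (assumes that the container accepts JSON):
        "ContentType": "application/json",
        # Asks for a JSON response, if the container honors the header:
        "Accept": "application/json",
    }


kwargs = build_invoke_kwargs("endpoint-name", {"features": ["This is great!"]})
# With a real client, you could then call:
# response = sagemaker_runtime.invoke_endpoint(**kwargs)
```

Keeping the serialization in one place makes it easier to match the request format to the format that the model expects.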
The response
variable provides access to the HTTP status, the name of
the deployed model, and other fields. The following snippet prints the HTTP status
code:
print(response["ResponseMetadata"]["HTTPStatusCode"])
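The following sketch collects the commonly used response fields into one dictionary. The helper name summarize_response and the stand-in response are hypothetical. The stand-in mimics the shape of a Boto3 response so that the sketch runs without a deployed endpoint. Note that the Body stream can be read only once.

```python
import io


def summarize_response(response):
    """Collects commonly used fields from an invoke_endpoint response."""
    # The Body is a stream, so read it once and keep the result:
    body = response["Body"].read()
    return {
        # Boto3 places the HTTP status code under ResponseMetadata:
        "status": response["ResponseMetadata"]["HTTPStatusCode"],
        "content_type": response.get("ContentType"),
        "variant": response.get("InvokedProductionVariant"),
        "text": body.decode("utf-8"),
    }


# Stand-in response for illustration only (values are made up):
fake_response = {
    "ResponseMetadata": {"HTTPStatusCode": 200},
    "ContentType": "application/json",
    "InvokedProductionVariant": "AllTraffic",
    "Body": io.BytesIO(b'{"label": "positive"}'),
}
summary = summarize_response(fake_response)
```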
Invoke to Stream an Inference Response
If you deployed a model that supports inference streaming, you can invoke the model to receive its inference payload as a stream of parts. The model delivers these parts incrementally as the model generates them. When an application receives an inference stream, the application doesn't need to wait for the model to generate the whole response payload. Instead, the application immediately receives parts of the response as they become available.
By consuming an inference stream in your application, you can create interactions where your users perceive the inference to be fast because they get the first part immediately. You can implement streaming to support fast interactive experiences, such as chatbots, virtual assistants, and music generators. For example, you could create a chatbot that incrementally shows the text generated by a large language model (LLM).
To get an inference stream, you can use the
invoke_endpoint_with_response_stream
method. In the response body,
the SDK provides an EventStream
object, which gives the inference as a
series of PayloadPart
objects.
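As a hedged sketch, the following generator pulls the raw bytes out of each event. The function name iter_payload_bytes is hypothetical. The Boto3 API reference also lists ModelStreamError and InternalStreamFailure as event types for this stream. Depending on your SDK version, such errors might surface as exceptions instead of events, so treat the else branch as a defensive assumption and not as documented behavior.

```python
def iter_payload_bytes(event_stream):
    """Yields the bytes of each PayloadPart event in an event stream."""
    for event in event_stream:
        if "PayloadPart" in event:
            yield event["PayloadPart"]["Bytes"]
        else:
            # Assumption: any other event (for example, ModelStreamError
            # or InternalStreamFailure) means that the stream failed.
            raise RuntimeError(f"Stream returned a non-payload event: {event}")


# A plain list stands in for the EventStream object in this illustration:
events = [
    {"PayloadPart": {"Bytes": b"Hello"}},
    {"PayloadPart": {"Bytes": b", world"}},
]
chunks = list(iter_payload_bytes(events))
```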
Example Inference Stream
The following example is a stream of PayloadPart
objects:
{'PayloadPart': {'Bytes': b'{"outputs": [" a"]}\n'}}
{'PayloadPart': {'Bytes': b'{"outputs": [" challenging"]}\n'}}
{'PayloadPart': {'Bytes': b'{"outputs": [" problem"]}\n'}}
. . .
In each payload part, the Bytes
field provides a portion of the
inference response from the model. This portion can be any content type that a model
generates, such as text, image, or audio data. In this example, the portions are
JSON objects that contain generated text from an LLM.
Usually, the payload part contains a discrete chunk of data from the model. In this example, the discrete chunks are whole JSON objects. Occasionally, the streaming response splits the chunks over multiple payload parts, or it combines multiple chunks into one payload part. The following example shows a chunk of data in JSON format that's split over two payload parts:
{'PayloadPart': {'Bytes': b'{"outputs": '}}
{'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
When you write application code that processes an inference stream, include logic
that handles these occasional splits and combinations of data. As one strategy, you
could write code that concatenates the contents of Bytes
while your
application receives the payload parts. By concatenating the example JSON data here,
you would combine the data into a newline-delimited JSON body. Then, your code could
process the stream by parsing the whole JSON object on each line.
The following example shows the newline-delimited JSON that you would create when
you concatenate the example contents of Bytes
:
{"outputs": [" a"]}
{"outputs": [" challenging"]}
{"outputs": [" problem"]}
. . .
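To illustrate the concatenation step on its own, the following sketch joins the Bytes contents of the example payload parts, including the split chunk, and parses each resulting line as JSON. The function name parse_payload_parts is hypothetical. The sketch waits for the complete stream, so it shows only how concatenation repairs a split chunk and not how to process parts incrementally.

```python
import json


def parse_payload_parts(events):
    """Concatenates payload parts and parses newline-delimited JSON."""
    # Joins the contents of Bytes from every payload part:
    body = b"".join(event["PayloadPart"]["Bytes"] for event in events)
    # Each non-empty line is one whole JSON object:
    return [json.loads(line) for line in body.splitlines() if line.strip()]


# Example parts, where the last JSON object is split over two parts:
events = [
    {"PayloadPart": {"Bytes": b'{"outputs": [" a"]}\n'}},
    {"PayloadPart": {"Bytes": b'{"outputs": [" challenging"]}\n'}},
    {"PayloadPart": {"Bytes": b'{"outputs": '}},
    {"PayloadPart": {"Bytes": b'[" problem"]}\n'}},
]
objects = parse_payload_parts(events)
text = "".join(obj["outputs"][0] for obj in objects)
```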
Example Code to Process an Inference Stream
The following example Python class, SmrInferenceStream
, demonstrates how
you can process an inference stream that sends text data in JSON format:
import io
import json

# Example class that processes an inference stream:
class SmrInferenceStream:

    def __init__(self, sagemaker_runtime, endpoint_name):
        self.sagemaker_runtime = sagemaker_runtime
        self.endpoint_name = endpoint_name
        # A buffered I/O stream to combine the payload parts:
        self.buff = io.BytesIO()
        self.read_pos = 0

    def stream_inference(self, request_body):
        # Gets a streaming inference response
        # from the specified model endpoint:
        response = self.sagemaker_runtime\
            .invoke_endpoint_with_response_stream(
                EndpointName=self.endpoint_name,
                Body=json.dumps(request_body),
                ContentType="application/json"
        )
        # Gets the EventStream object returned by the SDK:
        event_stream = response['Body']
        for event in event_stream:
            # Passes the contents of each payload part
            # to be concatenated:
            self._write(event['PayloadPart']['Bytes'])
            # Iterates over lines to parse whole JSON objects:
            for line in self._readlines():
                resp = json.loads(line)
                part = resp.get("outputs")[0]
                # Returns parts incrementally:
                yield part

    # Writes to the buffer to concatenate the contents of the parts:
    def _write(self, content):
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    # The JSON objects in buffer end with '\n'.
    # This method reads lines to yield a series of JSON objects:
    def _readlines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # A line without a trailing newline is an incomplete
            # JSON object; leave it buffered until the rest arrives:
            if not line.endswith(b'\n'):
                break
            self.read_pos += len(line)
            yield line[:-1]
This example processes the inference stream by doing the following:
- Initializes a SageMaker AI Runtime client and sets the name of a model endpoint. Before you can get an inference stream, the model that the endpoint hosts must support inference streaming.
- In the example stream_inference method, receives a request body and passes it to the invoke_endpoint_with_response_stream method of the SDK.
- Iterates over each event in the EventStream object that the SDK returns.
- From each event, gets the contents of the Bytes object in the PayloadPart object.
- In the example _write method, writes to a buffer to concatenate the contents of the Bytes objects. The combined contents form a newline-delimited JSON body.
- Uses the example _readlines method to get an iterable series of JSON objects.
- In each JSON object, gets a piece of the inference.
- With the yield expression, returns the pieces incrementally.
The following example creates and uses a SmrInferenceStream
object:
request_body = {"inputs": ["Large model inference is"],
                "parameters": {"max_new_tokens": 100,
                               "enable_sampling": "true"}}

smr_inference_stream = SmrInferenceStream(
    sagemaker_runtime, endpoint_name)

stream = smr_inference_stream.stream_inference(request_body)

for part in stream:
    print(part, end='')
This example passes a request body to the stream_inference
method. It
iterates over the response to print each piece that the inference stream returns.
The example assumes that the model at the specified endpoint is an LLM that generates text. The output from this example is a body of generated text that prints incrementally:
a challenging problem in machine learning. The goal is to . . .
Invoke Your Model by Using the AWS CLI
You can invoke your model endpoint by running commands with the AWS Command Line Interface (AWS CLI). The
AWS CLI supports standard inference requests with the invoke-endpoint
command, and it supports asynchronous inference requests with the
invoke-endpoint-async
command.
Note
The AWS CLI doesn't support streaming inference requests.
The following example uses the invoke-endpoint
command to send an
inference request to a model endpoint:
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name endpoint_name \
    --body fileb://$file_name \
    output_file.txt
For the --endpoint-name
parameter, provide the endpoint name that you
specified when you created the endpoint. For the --body
parameter, provide
input data for SageMaker AI to pass to the model. The data must be in the same format that was
used for training. This example shows how to send binary data to your endpoint.
For more information on when to use file://
over
fileb://
when passing the contents of a file to a parameter of
the AWS CLI, see Best Practices for Local File Parameters.
For more information, and to see additional parameters that you can pass, see invoke-endpoint
in the AWS CLI Command Reference.
If the invoke-endpoint
command succeeds, it returns a response such as the
following:
{
    "ContentType": "<content_type>; charset=utf-8",
    "InvokedProductionVariant": "<Variant>"
}
If the command doesn't succeed, check whether the input payload is in the correct format.
View the output of the invocation by checking the output file (output_file.txt
in this example).
more output_file.txt
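If you prefer to script the same file-based workflow in Python, the following sketch shows one possible approach. It reads the input file in binary mode, which roughly corresponds to the fileb:// prefix, and it writes the response body to an output file. The names invoke_with_file and StubRuntimeClient are hypothetical. The stub stands in for a real SageMaker AI Runtime client so that the sketch runs without a deployed endpoint.

```python
import io
import os
import tempfile


def invoke_with_file(runtime_client, endpoint_name, input_path, output_path):
    """Sends a file's raw bytes to an endpoint and saves the response body."""
    # Binary mode sends the file contents as-is, without text decoding:
    with open(input_path, "rb") as f:
        payload = f.read()
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=payload,
    )
    # Writes the inference payload to the output file:
    with open(output_path, "wb") as out:
        out.write(response["Body"].read())
    # Returns the remaining response fields, such as ContentType:
    return {k: v for k, v in response.items() if k != "Body"}


class StubRuntimeClient:
    """Hypothetical stand-in for the runtime client (offline dry run)."""

    def invoke_endpoint(self, EndpointName, Body):
        # Returns the uppercased input so that the dry run has visible output:
        return {
            "ContentType": "application/octet-stream",
            "InvokedProductionVariant": "AllTraffic",
            "Body": io.BytesIO(Body.upper()),
        }


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.bin")
    out_path = os.path.join(tmp, "output_file.txt")
    with open(in_path, "wb") as f:
        f.write(b"sample payload")
    metadata = invoke_with_file(
        StubRuntimeClient(), "endpoint-name", in_path, out_path)
    with open(out_path, "rb") as f:
        saved_output = f.read()
```

To use the sketch against a real endpoint, pass a client created with boto3.client("sagemaker-runtime") in place of the stub.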