Model inference optimization with Amazon SageMaker

With Amazon SageMaker, you can improve the performance of your generative AI models by applying inference optimization techniques. By optimizing your models, you can attain better cost-performance for your use case. When you optimize a model, you choose which of the supported optimization techniques to apply, including quantization, speculative decoding, and compilation. After your model is optimized, you can run an evaluation to see performance metrics for latency, throughput, and price.

For many models, SageMaker also provides several pre-optimized versions, each of which caters to different application needs for latency and throughput. For these models, you can deploy one of the pre-optimized versions without first optimizing the model yourself.

Optimization techniques

Amazon SageMaker supports the following optimization techniques.

Speculative decoding

Speculative decoding is a technique to speed up the decoding process of large language models (LLMs). It optimizes models for latency without compromising the quality of the generated text.

This technique uses a smaller but faster model called the draft model. The draft model generates candidate tokens, which are then validated by the larger but slower target model. At each iteration, the draft model generates multiple candidate tokens. The target model verifies them, and if it finds that a particular token is not acceptable, it rejects that token and regenerates it itself. So, the target model both verifies tokens and generates a small number of them.

The draft model is significantly faster than the target model. It generates all the tokens quickly and then sends batches of them to the target model for verification. The target model evaluates them all in parallel, which speeds up the final response.
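
The following is a minimal, self-contained sketch of that draft-and-verify cycle. The draft_propose and target_verify functions are hypothetical stand-ins for real models, not the SageMaker implementation; they only illustrate how the target model accepts a prefix of the draft model's candidates and regenerates the first rejected token.

    # Toy sketch of speculative decoding; the "models" are hypothetical stand-ins.

    def draft_propose(prefix, k):
        # Fast draft model stand-in: propose the next k candidate tokens.
        return [(sum(prefix) + i) % 100 for i in range(1, k + 1)]

    def target_verify(prefix, candidates):
        # Slow target model stand-in: check all candidates in one parallel pass.
        # Returns the accepted prefix and, if a token was rejected, a replacement.
        accepted = []
        for token in candidates:
            if token % 7 != 0:  # stand-in acceptance rule
                accepted.append(token)
            else:
                return accepted, (token + 1) % 100
        return accepted, None

    def generate(prompt, max_new_tokens=16, k=4):
        output = list(prompt)
        while len(output) - len(prompt) < max_new_tokens:
            candidates = draft_propose(output, k)  # cheap: many tokens at once
            accepted, replacement = target_verify(output, candidates)
            output.extend(accepted)  # keep the verified prefix
            if replacement is not None:
                output.append(replacement)  # target regenerates the rejected token
        return output[:len(prompt) + max_new_tokens]

    print(generate([3, 1, 4]))

Each accepted run advances generation by several tokens per target pass, which is where the latency gain comes from.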

SageMaker offers a pre-built draft model that you can use, so you don't have to build your own. If you prefer to use your own custom draft model, SageMaker also supports this option.
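
If you work through the SageMaker Python SDK, the choice between the SageMaker draft model and a custom draft model is expressed in the optimization job's configuration. The sketch below assumes the ModelBuilder.optimize() interface and its speculative_decoding_config parameter; the role ARN is hypothetical, and the exact configuration keys should be confirmed against the SDK reference for your version.

    # Sketch only: assumes the SageMaker Python SDK's ModelBuilder.optimize() interface.
    from sagemaker.serve.builder.model_builder import ModelBuilder
    from sagemaker.serve.builder.schema_builder import SchemaBuilder

    sample_input = {"inputs": "What is speculative decoding?", "parameters": {"max_new_tokens": 128}}
    sample_output = [{"generated_text": "Speculative decoding is ..."}]

    model_builder = ModelBuilder(
        model="meta-textgeneration-llama-3-8b-instruct",  # a JumpStart model ID from the table below
        schema_builder=SchemaBuilder(sample_input, sample_output),
        role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",  # hypothetical role
    )

    # Request the SageMaker-provided draft model; the config keys are illustrative.
    optimized_model = model_builder.optimize(
        instance_type="ml.g5.12xlarge",
        accept_eula=True,
        speculative_decoding_config={"ModelProvider": "sagemaker"},
    )

    predictor = optimized_model.deploy(accept_eula=True)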

Quantization

Quantization is a technique to reduce the hardware requirements of a model by using a less precise data type for the weights and activations. After you optimize a model with quantization, you can host it on less expensive and more readily available GPUs. However, the quantized model might be less accurate than the source model that you optimized.

SageMaker supports Activation-aware Weight Quantization (AWQ) for GPUs. AWQ is a quantization technique for LLMs that is efficient, accurate, low-bit, and weight-only.
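
As a sketch of how AWQ might be requested through the same assumed ModelBuilder.optimize() interface, continuing from the model_builder constructed in the speculative decoding sketch above; the OPTION_QUANTIZE key, its value, and the S3 path are illustrative assumptions.

    # Sketch only: assumes optimize() accepts a quantization_config.
    quantized_model = model_builder.optimize(
        instance_type="ml.g5.12xlarge",  # hypothetical target GPU instance
        accept_eula=True,
        quantization_config={
            "OverrideEnvironment": {
                "OPTION_QUANTIZE": "awq",  # request Activation-aware Weight Quantization
            },
        },
        output_path="s3://amzn-s3-demo-bucket/quantized/",  # hypothetical bucket for the optimized artifacts
    )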

Compilation

Compilation optimizes the model for the best available performance on the chosen hardware type without a loss in accuracy. You can apply model compilation to optimize LLMs for accelerated hardware, such as AWS Trainium or AWS Inferentia.

When you optimize a model with compilation, you benefit from ahead-of-time compilation. You reduce the model's deployment time and auto-scaling latency because the model weights don't require just-in-time compilation when the model deploys to a new instance.
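
The sketch below shows how ahead-of-time compilation for an AWS Inferentia instance might be requested through the same assumed interface; the compilation_config keys, the tensor parallel degree, and the S3 path are illustrative assumptions.

    # Sketch only: assumes optimize() accepts a compilation_config.
    compiled_model = model_builder.optimize(
        instance_type="ml.inf2.48xlarge",  # AWS Inferentia2 deployment target
        accept_eula=True,
        compilation_config={
            "OverrideEnvironment": {
                "OPTION_TENSOR_PARALLEL_DEGREE": "24",  # illustrative Neuron compilation setting
            },
        },
        output_path="s3://amzn-s3-demo-bucket/compiled/",  # hypothetical bucket for the compiled artifacts
    )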

Metrics reference for inference performance evaluations

After you successfully evaluate the performance of an optimized model, the evaluation details page in Studio shows the following metrics.

Latency metrics

The Latency section shows the following metrics.

Concurrency

The number of simulated concurrent users invoking the endpoint at the same time during the evaluation.

Time to first token (ms)

The time that elapsed between when a request is sent and when the first token of a streaming response is received.

Inter-token latency (ms)

The time to generate an output token for each request.

Client latency (ms)

The request latency from the time the request is sent to the time the entire response is received.

Input tokens/sec (count)

The total number of input tokens sent, across all requests, divided by the total duration in seconds at the given concurrency.

Output tokens/sec (count)

The total number of output tokens generated, across all requests, divided by the total duration in seconds at the given concurrency.

Client invocations (count)

The total number of inference requests sent to the endpoint across all users at a given concurrency.

Client invocation errors (count)

The total number of inference requests sent to the endpoint across all users at a given concurrency that resulted in an invocation error.

Tokenizer failed (count)

The total number of inference requests where the tokenizer failed to parse the request or the response.

Empty inference response (count)

The total number of inference requests that resulted in zero output tokens or the tokenizer failing to parse the response.
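
The following arithmetic sketch shows how the latency figures relate to one another. The timings, the token count, and the averaging used for inter-token latency are assumptions for illustration, not values that the evaluation produces.

    # Hypothetical measurements for one streamed request.
    request_sent_s = 0.00
    first_token_s = 0.35
    response_done_s = 2.15
    output_tokens = 120

    time_to_first_token_ms = (first_token_s - request_sent_s) * 1000  # 350 ms
    client_latency_ms = (response_done_s - request_sent_s) * 1000     # 2150 ms
    # One common way to derive inter-token latency: generation time after the
    # first token, averaged over the remaining tokens.
    inter_token_latency_ms = (response_done_s - first_token_s) * 1000 / (output_tokens - 1)

    print(time_to_first_token_ms, client_latency_ms, round(inter_token_latency_ms, 2))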

Throughput metrics

The Throughput section shows the following metrics.

Concurrency

The number of simulated concurrent users invoking the endpoint at the same time during the evaluation.

Input tokens/sec/req (count)

The number of input tokens processed per second for each request.

Output tokens/sec/req (count)

The number of output tokens generated per second for each request.

Input tokens (count)

The total number of input tokens per request.

Output tokens (count)

The total number of output tokens generated per request.
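
As a small arithmetic sketch of the per-request rates above, using assumed counts and an assumed request duration:

    # Hypothetical per-request counts and duration.
    input_tokens = 512
    output_tokens = 120
    request_duration_s = 2.15  # client latency for the request

    input_tokens_per_sec_per_req = input_tokens / request_duration_s    # ~238 tokens/sec/req
    output_tokens_per_sec_per_req = output_tokens / request_duration_s  # ~56 tokens/sec/req
    print(round(input_tokens_per_sec_per_req, 1), round(output_tokens_per_sec_per_req, 1))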

Price metrics

The Price section shows the following metrics.

Concurrency

The number of simulated concurrent users invoking the endpoint at the same time during the evaluation.

Price per million input tokens

The cost of processing one million input tokens.

Price per million output tokens

The cost of generating one million output tokens.
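
The price figures tie the hosting cost to the measured token throughput. The sketch below assumes a simple derivation from the instance-hour price; the evaluation's exact formula isn't documented here, and all numbers are hypothetical.

    # Hypothetical inputs: endpoint instance price and measured throughput.
    instance_price_per_hour = 7.09  # USD per hour, hypothetical
    input_tokens_per_sec = 4000.0   # across all concurrent users
    output_tokens_per_sec = 950.0

    price_per_million_input = instance_price_per_hour / (input_tokens_per_sec * 3600) * 1_000_000
    price_per_million_output = instance_price_per_hour / (output_tokens_per_sec * 3600) * 1_000_000
    print(round(price_per_million_input, 2), round(price_per_million_output, 2))  # ~0.49 and ~2.07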

Supported models reference

The following table shows the models for which SageMaker supports inference optimization, along with the optimization techniques that each model supports.

Models that support inference optimization
Model name | JumpStart Model ID | Supports Quantization | Supports Speculative Decoding | Speculative Decoding with SageMaker Draft Model
Falcon | huggingface-llm-falcon-40b-bf16 | Yes | Yes | No
 | huggingface-llm-falcon-40b-instruct-bf16 | Yes | Yes | No
 | huggingface-llm-falcon-180b-chat-bf16 | No | Yes | No
 | huggingface-llm-falcon-180b-bf16 | No | Yes | No
 | huggingface-llm-amazon-falconlite | Yes | Yes | No
 | huggingface-llm-amazon-falconlite2 | Yes | Yes | No
 | huggingface-llm-tiiuae-falcon-rw-1b | Yes | Yes | No
 | huggingface-llm-falcon-7b-bf16 | Yes | Yes | No
 | huggingface-llm-falcon-7b-instruct-bf16 | Yes | Yes | No
 | huggingface-llm-falcon2-11b | Yes | Yes | No
gpt-neox | huggingface-textgeneration2-gpt-neoxt-chat-base-20b-fp16 | Yes | Yes | No
 | huggingface-textgeneration2-gpt-neox-20b-fp16 | Yes | Yes | No
LLaMA | meta-textgeneration-llama-3-70b-instruct | Yes | Yes | Yes
 | meta-textgeneration-llama-3-70b | Yes | Yes | Yes
 | meta-textgeneration-llama-3-8b | Yes | Yes | Yes
 | meta-textgeneration-llama-3-8b-instruct | Yes | Yes | Yes
 | meta-textgeneration-llama-2-7b | Yes | Yes | Yes
 | meta-textgeneration-llama-2-7b-f | Yes | Yes | Yes
 | meta-textgeneration-llama-2-13b | Yes | Yes | Yes
 | meta-textgeneration-llama-2-13b-f | Yes | Yes | Yes
 | meta-textgeneration-llama-2-70b | Yes | Yes | Yes
 | meta-textgeneration-llama-2-70b-f | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-7b | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-7b-instruct | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-7b-python | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-13b | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-13b-instruct | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-13b-python | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-34b | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-34b-instruct | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-34b-python | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-70b | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-70b-instruct | Yes | Yes | Yes
 | meta-textgeneration-llama-codellama-70b-python | Yes | Yes | Yes
 | meta-textgeneration-llama-guard-7b | Yes | Yes | Yes
Bloom | huggingface-textgeneration-bloom-1b7 | Yes | Yes | No
 | huggingface-textgeneration-bloom-1b1 | Yes | Yes | No
 | huggingface-textgeneration-bloom-560m | Yes | Yes | No
 | huggingface-textgeneration-bloomz-560m | Yes | Yes | No
 | huggingface-textgeneration-bloomz-1b1 | Yes | Yes | No
 | huggingface-textgeneration-bloomz-1b7 | Yes | Yes | No
 | huggingface-textgeneration1-bloomz-7b1-fp16 | Yes | Yes | No
 | huggingface-textgeneration1-bloom-7b1 | Yes | Yes | No
 | huggingface-textgeneration1-bloomz-3b-fp16 | Yes | Yes | No
 | huggingface-textgeneration1-bloom-3b | Yes | Yes | No
 | huggingface-textembedding-bloom-7b1 | Yes | Yes | No
 | huggingface-textembedding-bloom-7b1-fp16 | Yes | Yes | No
Cohere | huggingface-llm-cohereforai-c4ai-command-r-plus | Yes | |
Gemma | huggingface-llm-gemma-7b | Yes | Yes | No
 | huggingface-llm-gemma-7b-instruct | Yes | Yes | No
 | huggingface-llm-gemma-2b | Yes | Yes | No
 | huggingface-llm-gemma-2b-instruct | Yes | Yes | No
 | huggingface-llm-zephyr-7b-gemma | Yes | Yes | No
gpt2 | huggingface-textgeneration-gpt2 | Yes | No | No
 | huggingface-textgeneration-distilgpt2 | Yes | No | No
Mistral | huggingface-llm-mistral-7b | Yes | Yes | Yes
 | huggingface-llm-mistral-7b-instruct | Yes | Yes | Yes
 | huggingface-llm-mistral-7b-openorca-gptq | Yes | Yes | Yes
 | huggingface-llm-amazon-mistrallite | Yes | Yes | Yes
 | huggingface-llm-thebloke-mistral-7b-openorca-awq | Yes | Yes | Yes
 | huggingface-llm-huggingfaceh4-mistral-7b-sft-beta | Yes | Yes | Yes
 | huggingface-llm-huggingfaceh4-mistral-7b-sft-alpha | Yes | Yes | Yes
 | huggingface-llm-teknium-openhermes-2-mistral-7b | Yes | Yes | Yes
 | huggingface-llm-nousresearch-yarn-mistral-7b-128k | Yes | Yes | Yes
 | huggingface-llm-dolphin-2-2-1-mistral-7b | Yes | Yes | Yes
 | huggingface-llm-cultrix-mistraltrix-v1 | Yes | Yes | Yes
Mixtral | huggingface-llm-mixtral-8x7b-instruct | Yes | Yes | Yes
 | huggingface-llm-mixtral-8x7b-instruct-gptq | Yes | Yes | Yes
 | huggingface-llm-mixtral-8x7b | Yes | Yes | Yes
 | huggingface-llm-mistralai-mixtral-8x22B-instruct-v0-1 | Yes | Yes | Yes
 | huggingface-llm-dolphin-2-5-mixtral-8x7b | Yes | Yes | Yes
 | huggingface-llm-dolphin-2-7-mixtral-8x7b | Yes | Yes | Yes
Phi | huggingface-llm-phi-2 | Yes | |

Pre-optimized JumpStart models

The following are the JumpStart models that have pre-optimized configurations.

Meta
  • Llama 3 8B Instruct

  • Llama 3 8B

  • Llama 3 70B Instruct

  • Llama 3 70B

  • Llama 2 70B Chat

  • Llama 2 7B Chat

  • Llama 2 13B Chat

HuggingFace
  • Mixtral 8x7B Instruct

  • Mixtral 8x7B

  • Mistral 7B Instruct

  • Mistral 7B

Pre-compiled JumpStart models

For some models and configurations, SageMaker provides models that are pre-compiled for specific AWS Inferentia and AWS Trainium instances. For these models, if you create a compilation or optimization job and choose ml.inf2.48xlarge or ml.trn1.32xlarge as the deployment instance type, SageMaker fetches the pre-compiled artifacts. Because the job uses a model that’s already compiled, it completes quickly without running the compilation from scratch.

The following are the JumpStart models for which SageMaker has pre-compiled models:

Meta
  • Llama3 8B

  • Llama3 70B

  • Llama2 7B

  • Llama2 70B

  • Llama2 13B

  • Code Llama 7B

  • Code Llama 70B

HuggingFace
  • Mistral 7B