Supported models reference
The following tables show the models for which SageMaker AI supports inference optimization, along with the optimization techniques that each model supports.
Meta Llama models

Model Name | Supported Data Formats for Quantization | Supports Speculative Decoding | Supports Fast Model Loading | Libraries Used for Compilation |
---|---|---|---|---|
Meta Llama 2 13B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Llama 2 13B Chat | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Llama 2 70B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Llama 2 70B Chat | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Llama 2 7B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Llama 2 7B Chat | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Llama 3 70B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Llama 3 70B Instruct | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Llama 3 8B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Llama 3 8B Instruct | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Meta Code Llama 13B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 13B Instruct | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 13B Python | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 34B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 34B Instruct | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 34B Python | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 70B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 70B Instruct | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 70B Python | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 7B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 7B Instruct | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Code Llama 7B Python | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Meta Llama 2 13B Neuron | None | No | No | AWS Neuron |
Meta Llama 2 13B Chat Neuron | None | No | No | AWS Neuron |
Meta Llama 2 70B Neuron | None | No | No | AWS Neuron |
Meta Llama 2 70B Chat Neuron | None | No | No | AWS Neuron |
Meta Llama 2 7B Neuron | None | No | No | AWS Neuron |
Meta Llama 2 7B Chat Neuron | None | No | No | AWS Neuron |
Meta Llama 3 70B Neuron | None | No | No | AWS Neuron |
Meta Llama 3 70B Instruct Neuron | None | No | No | AWS Neuron |
Meta Llama 3 8B Neuron | None | No | No | AWS Neuron |
Meta Llama 3 8B Instruct Neuron | None | No | No | AWS Neuron |
Meta Code Llama 70B Neuron | None | No | No | AWS Neuron |
Meta Code Llama 7B Neuron | None | No | No | AWS Neuron |
Meta Code Llama 7B Python Neuron | None | No | No | AWS Neuron |
Meta Llama 3.1 405B FP8 | None | Yes | Yes | None |
Meta Llama 3.1 405B Instruct FP8 | None | Yes | Yes | None |
Meta Llama 3.1 70B | INT4-AWQ, FP8 | Yes | Yes | None |
Meta Llama 3.1 70B Instruct | INT4-AWQ, FP8 | Yes | Yes | None |
Meta Llama 3.1 8B | INT4-AWQ, FP8 | Yes | Yes | None |
Meta Llama 3.1 8B Instruct | INT4-AWQ, FP8 | Yes | Yes | None |
Meta Llama 3.1 70B Neuron | None | No | No | AWS Neuron |
Meta Llama 3.1 70B Instruct Neuron | None | No | No | AWS Neuron |
Meta Llama 3.1 8B Neuron | None | No | No | AWS Neuron |
Meta Llama 3.1 8B Instruct Neuron | None | No | No | AWS Neuron |
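As an example of applying one of the techniques above, the following is a minimal sketch of requesting an INT4-AWQ quantization job for a supported Llama model through the SageMaker Python SDK's `ModelBuilder.optimize()` method. The model ID, IAM role ARN, and S3 output path are placeholders, and the exact configuration keys may vary by SDK version.

```python
# Minimal sketch: INT4-AWQ quantization for a supported Llama model.
# The model ID, role ARN, and S3 output path are placeholders.
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-8b",  # JumpStart model ID (placeholder)
    schema_builder=SchemaBuilder("sample prompt", "sample response"),
    role_arn="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
)

# OPTION_QUANTIZE selects the data format ("awq", "fp8", or "smoothquant"),
# matching the "Supported Data Formats for Quantization" column above.
optimized_model = model_builder.optimize(
    instance_type="ml.g5.12xlarge",
    accept_eula=True,
    quantization_config={
        "OverrideEnvironment": {"OPTION_QUANTIZE": "awq"},
    },
    output_path="s3://amzn-s3-demo-bucket/llama-3-8b-awq/",  # placeholder
)

predictor = optimized_model.deploy()
```

The optimization job writes the quantized artifacts to the given S3 path, and `deploy()` then hosts them on the chosen instance type.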
Mistral models

Model Name | Supported Data Formats for Quantization | Supports Speculative Decoding | Supports Fast Model Loading | Libraries Used for Compilation |
---|---|---|---|---|
Mistral 7B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Mistral 7B Instruct | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | AWS Neuron, TensorRT-LLM |
Mistral 7B Neuron | None | No | No | AWS Neuron |
Mistral 7B Instruct Neuron | None | No | No | AWS Neuron |
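For models in this table that support speculative decoding, the optimization job can instead be configured with a draft model. The sketch below assumes the SageMaker-provided draft model is selected with `"ModelProvider": "sagemaker"`; identifiers are placeholders and the configuration shape may vary by SDK version.

```python
# Minimal sketch: speculative decoding for Mistral 7B using the
# SageMaker-provided draft model. Identifiers are placeholders.
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

model_builder = ModelBuilder(
    model="huggingface-llm-mistral-7b",  # JumpStart model ID (placeholder)
    schema_builder=SchemaBuilder("sample prompt", "sample response"),
    role_arn="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
)

# "sagemaker" selects the SageMaker-provided draft model; a config such as
# {"ModelProvider": "custom", "ModelSource": "s3://..."} would point at
# your own draft model instead.
optimized_model = model_builder.optimize(
    instance_type="ml.g5.12xlarge",
    accept_eula=True,
    speculative_decoding_config={"ModelProvider": "sagemaker"},
)

predictor = optimized_model.deploy()
```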
Mixtral models

Model Name | Supported Data Formats for Quantization | Supports Speculative Decoding | Supports Fast Model Loading | Libraries Used for Compilation |
---|---|---|---|---|
Mixtral-8x22B-Instruct-v0.1 | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Mixtral-8x22B V1 | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Mixtral 8x7B | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
Mixtral 8x7B Instruct | INT4-AWQ, INT8-SmoothQuant, FP8 | Yes | Yes | TensorRT-LLM |
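Fast model loading, which the Mixtral models above support, prepares pre-sharded weights that can be streamed directly onto the accelerators at deployment time. The sketch below assumes a `sharding_config` parameter on `ModelBuilder.optimize()` prepares those artifacts; the configuration key, parallel degree, model ID, and paths are all placeholders or assumptions.

```python
# Minimal sketch: preparing sharded artifacts for fast model loading.
# The sharding_config shape is an assumption; identifiers are placeholders.
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

model_builder = ModelBuilder(
    model="huggingface-llm-mixtral-8x7b",  # JumpStart model ID (placeholder)
    schema_builder=SchemaBuilder("sample prompt", "sample response"),
    role_arn="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
)

optimized_model = model_builder.optimize(
    instance_type="ml.g5.48xlarge",
    accept_eula=True,
    sharding_config={
        # How many ways to pre-shard the weights (assumed key name).
        "OverrideEnvironment": {"OPTION_TENSOR_PARALLEL_DEGREE": "8"},
    },
    output_path="s3://amzn-s3-demo-bucket/mixtral-8x7b-sharded/",  # placeholder
)

predictor = optimized_model.deploy()
```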