Inference Container Features
The Amazon Nova SageMaker inference container includes a set of features that you can enable to customize model behavior during inference. Each feature is introduced in a specific container version and may require environment variables, request parameters, or both to activate.
This page lists the features available in the inference container, describes how to enable each one, and identifies the container version in which the feature was introduced. Use this reference to determine which features are available for your deployment and how to configure them.
Features that are enabled through environment variables are set when you create the SageMaker
model or endpoint configuration. Include them in the Environment parameter of
the CreateModel API
call. Features that are enabled through request parameters are set per invocation in the
request body.
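For example, a CreateModel request that sets container environment variables might look like the following sketch. The model name, image URI, artifact location, and role ARN are placeholders for your own values:
{
  "ModelName": "nova-inference-model",
  "PrimaryContainer": {
    "Image": "<inference-container-image-uri>",
    "ModelDataUrl": "<model-artifact-s3-uri>",
    "Environment": {
      "DEFAULT_TEMPERATURE": "0.7",
      "DEFAULT_MAX_NEW_TOKENS": "512"
    }
  },
  "ExecutionRoleArn": "<execution-role-arn>"
}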
Note
Always use the latest container image to get access to all available features.
The SM-Inference-latest tag currently points to v1.4.
Feature summary
The following table provides a quick reference of all features supported in the Amazon Nova SageMaker inference container.
| Feature | How to enable | Default | Supported models | Introduced in |
|---|---|---|---|---|
| Default temperature | Environment variable | 1.0 | All Amazon Nova models | v1.0 |
| Default top-p | Environment variable | 1.0 | All Amazon Nova models | v1.0 |
| Default top-k | Environment variable | -1 (disabled) | All Amazon Nova models | v1.0 |
| Default max new tokens | Environment variable | Model's max context length | All Amazon Nova models | v1.0 |
| Default logprobs | Environment variable | Disabled | All Amazon Nova models | v1.0 |
| Eagle3 speculative decoding | Enabled by default | Enabled | All Amazon Nova models | v1.0 |
| KV cache data type | Environment variable | Same as model data type | All Amazon Nova models | v1.3 |
| Quantization | Environment variable | Disabled* | All Amazon Nova models | v1.3 |
| Number of speculative tokens | Environment variable | 3 | All Amazon Nova models | v1.4 |
| Suffix decoding | Environment variable | Disabled | All Amazon Nova models | v1.4 |
Important
* FP8 quantization is automatically enabled and cannot be disabled for the following model and instance type combinations:
- Amazon Nova Lite on ml.g6.12xlarge or ml.g6.24xlarge
- Nova 2 Lite on ml.g6.48xlarge

For these configurations, you do not need to set QUANTIZATION_DTYPE. See Quantization for details.
Default temperature
Sets the default sampling temperature for all inference requests sent to the endpoint.
Temperature controls how random or predictable the model's output is. A value of
0 makes the model always pick the most likely next word, producing
consistent and repeatable output. Higher values (up to 2) make the model
more willing to pick less likely words, producing more creative and varied responses.
When to use: Lower the temperature (for example,
0.1 to 0.3) for tasks that need factual, consistent answers
such as classification or data extraction. Raise it (for example, 0.7 to
1.0) for creative tasks such as story writing or brainstorming. Temperature
works together with top-p and top-k — all three control how the model selects tokens,
and you can combine them to fine-tune output behavior.
- Introduced in: v1.0
- Supported models: All Amazon Nova models
- How to enable: Set the DEFAULT_TEMPERATURE environment variable when creating the SageMaker model.
- Default value: 1.0
- Valid values: Float between 0 and 2 (inclusive)
Environment variable
"Environment": { "DEFAULT_TEMPERATURE": "0.7" }
Note
You can override this default on a per-request basis by including the
temperature parameter in the request body.
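For example, a request body like the following lowers the temperature for a single invocation. This is a sketch that assumes the container accepts a messages-style request schema; the prompt is illustrative:
Request body
{
  "messages": [
    { "role": "user", "content": "Classify the sentiment of this review: 'Great product, fast shipping.'" }
  ],
  "temperature": 0.2
}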
Default top-p
Sets the default top-p value for all inference requests. Top-p controls output diversity
by limiting the model's choices to a subset of the most likely words. Specifically, the
model sorts all possible next words by probability and considers only the smallest group
whose combined probability reaches the top-p value. For example, a top-p of
0.9 means the model only considers words that together account for 90% of
the probability, ignoring the remaining unlikely options.
When to use: Use a lower top-p value (for example,
0.5) to make the model stick to high-confidence words, producing more
focused output. Use a higher value (for example, 0.95) to allow more
variety. Top-p is often used as an alternative to temperature — both control output
diversity, but top-p adapts dynamically based on the model's confidence at each step.
You can use both together, in which case the model applies whichever constraint is
more restrictive at each step.
- Introduced in: v1.0
- Supported models: All Amazon Nova models
- How to enable: Set the DEFAULT_TOP_P environment variable when creating the SageMaker model.
- Default value: 1.0
- Valid values: Float between 1e-10 and 1 (inclusive)
Environment variable
"Environment": { "DEFAULT_TOP_P": "0.9" }
Note
You can override this default on a per-request basis by including the
top_p parameter in the request body.
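For example, the following request combines a moderately high temperature with a top-p cap, so the model samples freely but never from the lowest-probability tail. As above, the messages-style schema and prompt are illustrative:
Request body
{
  "messages": [
    { "role": "user", "content": "Suggest three taglines for a neighborhood coffee shop." }
  ],
  "temperature": 0.8,
  "top_p": 0.9
}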
Default top-k
Sets the default top-k value for all inference requests. Top-k limits the model's
choices to a fixed number of the most likely next words. For example, a top-k of
50 means the model only considers the 50 most probable words at each
step, regardless of their individual probabilities. A value of -1
disables this limit, allowing the model to consider all possible words.
When to use: Use top-k when you want a hard cap on
the number of word choices the model considers. Lower values (for example,
10) produce more predictable output, while higher values allow more
variety. Top-k can be combined with temperature and top-p — when multiple sampling
controls are active, the model applies all of them, using whichever is most
restrictive at each step.
- Introduced in: v1.0
- Supported models: All Amazon Nova models
- How to enable: Set the DEFAULT_TOP_K environment variable when creating the SageMaker model.
- Default value: -1 (disabled)
- Valid values: Integer, -1 or greater. Use -1 to consider all tokens.
Environment variable
"Environment": { "DEFAULT_TOP_K": "50" }
Note
You can override this default on a per-request basis by including the
top_k parameter in the request body.
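For example, the following request caps the model at the 10 most probable tokens per step for a single extraction task, again assuming the illustrative messages-style schema:
Request body
{
  "messages": [
    { "role": "user", "content": "Extract all dates from this text: The launch moved from May 3 to June 12." }
  ],
  "top_k": 10
}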
Default max new tokens
Sets the default maximum number of tokens (words or word pieces) the model generates in a response. This value applies to all requests unless overridden. Use this to control response length and manage costs across your endpoint.
When to use: Set this when you want to enforce a
consistent maximum response length across all requests. For example, set it to
256 for short-answer tasks or 2048 for longer content
generation. The maximum allowed value depends on the CONTEXT_LENGTH
configured for your endpoint, because input tokens plus output tokens cannot exceed
the context length.
- Introduced in: v1.0
- Supported models: All Amazon Nova models
- How to enable: Set the DEFAULT_MAX_NEW_TOKENS environment variable when creating the SageMaker model.
- Default value: Model's maximum context length
- Valid values: Integer, 1 or greater
Environment variable
"Environment": { "DEFAULT_MAX_NEW_TOKENS": "512" }
Note
You can override this default on a per-request basis by including the
max_tokens or max_completion_tokens parameter in the
request body. The maximum allowed value depends on the CONTEXT_LENGTH
configured for your endpoint.
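For example, the following request limits a single summarization response to 256 tokens regardless of the endpoint default (messages-style schema assumed for illustration):
Request body
{
  "messages": [
    { "role": "user", "content": "Summarize the meeting notes above in one sentence." }
  ],
  "max_tokens": 256
}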
Default logprobs
Sets the default number of log probabilities to return for each generated token. A log probability is a numerical score that indicates how confident the model was in choosing each word. When enabled, the response includes these scores for each output token, which is useful for evaluating model confidence, comparing alternative word choices, and debugging generation behavior.
When to use: Enable logprobs when you need to assess how confident the model is in its output — for example, to flag low-confidence responses for human review, or to compare the likelihood of different completions. Enabling logprobs may slightly increase response latency and response payload size.
- Introduced in: v1.0
- Supported models: All Amazon Nova models
- How to enable: Set the DEFAULT_LOGPROBS environment variable when creating the SageMaker model.
- Default value: Disabled
- Valid values: Integer between 1 and 20 (inclusive)
Environment variable
"Environment": { "DEFAULT_LOGPROBS": "5" }
Note
You can override this default on a per-request basis by including the
logprobs and top_logprobs parameters in the request
body. Enabling logprobs may slightly increase response latency.
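For example, a request like the following asks for the top 5 alternatives per output token. This sketch assumes OpenAI-style parameters, where logprobs toggles the feature and top_logprobs sets how many alternatives to return; check the request schema for your container version:
Request body
{
  "messages": [
    { "role": "user", "content": "Is this email spam? Answer yes or no." }
  ],
  "logprobs": true,
  "top_logprobs": 5
}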
Eagle3 speculative decoding
Eagle3 speculative decoding is an optimization technique that speeds up text generation. It works by using a smaller, faster draft model to predict several tokens ahead, then checking those predictions against the primary model in a single step. When the predictions are correct, the model effectively generates multiple tokens in the time it would normally take to generate one. The primary model always verifies the draft tokens, so the final output is identical to what the primary model would produce on its own — only the speed changes, not the quality.
When to use: Eagle3 speculative decoding is enabled by default and benefits most workloads. Consider disabling it only if you observe unexpected behavior or need to isolate performance characteristics during debugging.
- Introduced in: v1.0. Support for FP8 quantization with Eagle3 speculative decoding was added in v1.4.
- Supported models: All Amazon Nova models
- How to enable: Eagle3 speculative decoding is enabled by default with no configuration required. Use DISABLE_SPECULATIVE_DECODING to disable it.
- Default value: false (Eagle3 speculative decoding is enabled)
- Valid values: true, false
Environment variable
The following example disables Eagle3 speculative decoding:
"Environment": { "DISABLE_SPECULATIVE_DECODING": "true" }
KV cache data type
Sets the data type for the key-value (KV) cache used during inference. The KV cache stores the model's memory of previous tokens in a conversation, allowing it to generate each new token without reprocessing the entire input. For long sequences, this cache can consume significant GPU memory. Setting the KV cache to a lower-precision data type such as FP8 reduces memory usage and can improve throughput, at the cost of minor numerical differences in output.
When to use: Enable FP8 KV cache when you need to support longer context lengths or higher concurrency on your instance. This is especially useful on GPU instances with limited memory. Test your use case to verify that output quality meets your requirements, as lower precision may produce slightly different results.
- Introduced in: v1.3
- Supported models: All Amazon Nova models
- How to enable: Set the KV_CACHE_DTYPE environment variable when creating the SageMaker model.
- Default value: Same as the model's data type
- Valid values: fp8
Environment variable
"Environment": { "KV_CACHE_DTYPE": "fp8" }
Note
Changing the KV cache data type may produce slightly different outputs compared to the default precision. Test your use case to verify that output quality meets your requirements.
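If you enable the FP8 KV cache specifically to support a longer context window, you might pair it with the CONTEXT_LENGTH setting mentioned earlier. The following is a sketch; the context length shown is illustrative, not a recommended value:
Environment variable
"Environment": { "KV_CACHE_DTYPE": "fp8", "CONTEXT_LENGTH": "32768" }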
Quantization
Sets the quantization data type for model weights. Quantization compresses the model's weights into a lower-precision format (FP8 instead of the default higher precision), which reduces the amount of GPU memory the model requires. This can improve inference throughput and allow larger models to fit on smaller instance types, with minimal impact on output quality.
When to use: Use FP8 quantization when you want to reduce memory usage to support higher concurrency or fit a model on a smaller instance type. Note that some model and instance type combinations require FP8 quantization automatically; see the Important note below.
- Introduced in: v1.3
- Supported models: All Amazon Nova models
- How to enable: Set the QUANTIZATION_DTYPE environment variable when creating the SageMaker model.
- Default value: Disabled. However, FP8 quantization is automatically enabled for certain model and instance type combinations. See the note below.
- Valid values: fp8
Environment variable
"Environment": { "QUANTIZATION_DTYPE": "fp8" }
Important
The following model and instance type combinations require FP8 quantization. For these configurations, quantization is enabled automatically and cannot be disabled or overridden:
- Amazon Nova Lite on ml.g6.12xlarge or ml.g6.24xlarge
- Nova 2 Lite on ml.g6.48xlarge
For all other configurations, see Supported models and instances for details.
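The instance type that triggers automatic FP8 quantization is chosen in the endpoint configuration rather than on the model itself. A minimal CreateEndpointConfig request might look like the following sketch, where the names are placeholders:
{
  "EndpointConfigName": "nova-lite-endpoint-config",
  "ProductionVariants": [
    {
      "VariantName": "AllTraffic",
      "ModelName": "nova-lite-model",
      "InstanceType": "ml.g6.12xlarge",
      "InitialInstanceCount": 1
    }
  ]
}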
Number of speculative tokens
Controls how many tokens the draft model predicts ahead during each Eagle3 speculative decoding step. A higher value means the draft model attempts to predict more tokens at once, which can improve throughput when predictions are accurate. If the draft model's predictions frequently diverge from the primary model, a lower value may be more efficient.
When to use: Increase this value when your workload produces predictable output patterns (for example, structured data or templated text) where the draft model is likely to guess correctly. Decrease it for creative or highly variable output where predictions are less reliable.
- Introduced in: v1.4
- Supported models: All Amazon Nova models
- How to enable: Set the NUM_SPECULATIVE_TOKENS environment variable when creating the SageMaker model.
- Default value: 3
- Valid values: Integer between 1 and 10 (inclusive)
Environment variable
"Environment": { "NUM_SPECULATIVE_TOKENS": "5" }
Note
This setting only applies when Eagle3 speculative decoding is enabled
(DISABLE_SPECULATIVE_DECODING is false). It has no
effect when speculative decoding is disabled or when using suffix decoding.
Suffix decoding
Suffix decoding is an alternative method for speeding up text generation. Instead of using a separate draft model (as Eagle3 does), suffix decoding looks for repeated patterns in the text that has already been generated or in the input prompt, and reuses those patterns to predict future tokens. This approach works well when the output is likely to contain repeated phrases, structured formats, or content that closely mirrors the input.
When to use: Use suffix decoding for tasks where the output contains repetitive patterns, such as generating structured data, filling in templates, or summarizing content that reuses phrases from the source. For general-purpose generation where output is highly varied, the default Eagle3 method typically provides better throughput.
- Introduced in: v1.4
- Supported models: All Amazon Nova models
- How to enable: Set the SPECULATIVE_DECODING_METHOD environment variable to suffix when creating the SageMaker model.
- Default value: eagle3
- Valid values: eagle3, suffix
Environment variable
"Environment": { "SPECULATIVE_DECODING_METHOD": "suffix" }
Note
To use suffix decoding, DISABLE_SPECULATIVE_DECODING must be set
to false (the default). Setting
DISABLE_SPECULATIVE_DECODING to true disables all
speculative decoding methods, including suffix decoding.
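For example, the following environment sketch switches an endpoint to suffix decoding. DISABLE_SPECULATIVE_DECODING is shown explicitly at its default of false only to make the requirement visible; leaving it unset has the same effect:
Environment variable
"Environment": { "SPECULATIVE_DECODING_METHOD": "suffix", "DISABLE_SPECULATIVE_DECODING": "false" }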