

# Core inference
<a name="core-inference"></a>

Inference is the process of sending a request to an Amazon Nova model and receiving a generated response. Amazon Nova models support inference through two API options:
+ **Converse API** ([Converse](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_Converse.html), [ConverseStream](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_ConverseStream.html)): Provides a consistent interface across different models, making it easier to switch between models or build applications that work with multiple models. Recommended for most use cases.
+ **Invoke API** ([InvokeModel](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModel.html), [InvokeModelWithResponseStream](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModelWithResponseStream.html)): Request payloads are structured specifically for each model's native format. Runs inference using the prompt and inference parameters provided in the request body.

Both APIs support the same core features, including:
+  Multi-turn conversations 
+  Multimodal inputs (text, images, video, audio) 
+  Tool use 
+  Guardrails 
+  Streaming responses 

The request structure is nearly identical between the two APIs, differing only in how byte data (documents, images, video, and audio) is encoded. Model request parameters unique to Amazon Nova models, such as `reasoningConfig` and `topK`, are placed within an additional `inferenceConfig` object inside `additionalModelRequestFields` for the Converse API. For `InvokeModel` and `InvokeModelWithResponseStream`, they are top-level parameters in the request body.
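
To make this placement difference concrete, the following sketch builds the same request for both APIs. The message text and parameter values are illustrative placeholders:

```python
# Where Nova-specific parameters such as topK are placed in each API.

# Converse API: nested under additionalModelRequestFields.inferenceConfig.
converse_kwargs = {
    'modelId': 'us.amazon.nova-2-lite-v1:0',
    'messages': [{'role': 'user', 'content': [{'text': 'Hello'}]}],
    'inferenceConfig': {'maxTokens': 512},       # standard parameters
    'additionalModelRequestFields': {
        'inferenceConfig': {'topK': 20}          # Nova-specific parameters
    },
}

# Invoke API: Nova-specific parameters sit alongside the standard ones
# in the request body's top-level inferenceConfig.
invoke_body = {
    'messages': [{'role': 'user', 'content': [{'text': 'Hello'}]}],
    'inferenceConfig': {
        'maxTokens': 512,
        'topK': 20
    },
}
```

The rest of the request structure is identical, which is what makes switching between the two APIs straightforward.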

**Note**  
For a list of code samples that use Nova 2 models, see Code Samples and Troubleshooting.

Set the `modelId` to one of the following to use Amazon Nova models:


| Model | Model ID | 
| --- |--- |
| Nova 2 Lite | global.amazon.nova-2-lite-v1:0 <br> us.amazon.nova-2-lite-v1:0 | 
| Nova 2 Sonic | global.amazon.nova-2-sonic-v1:0 <br> us.amazon.nova-2-sonic-v1:0 | 
| Nova Multimodal Embeddings | amazon.nova-2-multimodal-embeddings-v1:0 | 

## Important: Timeout Configuration
<a name="important-timeout-configuration"></a>

**Important**  
Amazon Nova inference requests can take up to 60 minutes to complete. Configure your client timeout settings accordingly:  
The following example uses Python. For other languages, see the timeout configuration options in that SDK's API documentation.

```
import boto3
from botocore.config import Config

bedrock = boto3.client(
    'bedrock-runtime',
    region_name='us-east-1',
    config=Config(
        read_timeout=3600  # 60 minutes
    )
)
```

## Core Inference Topics
<a name="core-inference-topics"></a>

This section discusses the following topics:
+ Using the Converse API: A consistent interface compatible with most Bedrock models
+ Using the Invoke API: An interface specific to the Nova models' native format and not portable to other Bedrock models
+ Streaming responses: Real-time response generation
+ Using Amazon Nova embeddings: Multimodal embedding capabilities
+ On-demand inference: Pay-per-use inference

# Using the Converse API
<a name="using-converse-api"></a>

The Converse API provides a unified interface for interacting with Amazon Nova models. It abstracts away model-specific details and provides a consistent way to handle multi-turn conversations, system prompts and streaming responses across all Amazon Nova models.

**Topics**
+ [Request structure](#converse-api-request-structure)
+ [Using system prompts](#converse-api-system-prompt)
+ [Inference parameters](#converse-api-inference-params)
+ [Using reasoning](#converse-api-reasoning)

## Request structure
<a name="converse-api-request-structure"></a>

The Converse API supports the following capabilities:
+ **Multi-turn conversations:** Maintain context across multiple exchanges
+ **System prompts:** System instructions such as personas or response guidelines
+ **Document chat:** Interact with and query documents or collections of documents
+ **Vision:** Process and analyze images and video
+ **Tool use:** Enable models to use external tools and APIs
+ **Guardrails:** Apply content filtering and safety controls
+ **Reasoning:** Extended thinking for complex problem-solving

A basic Converse API request includes the model ID and a list of messages:

```
import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse(
    modelId='us.amazon.nova-2-lite-v1:0',
    messages=[
        {
            'role': 'user',
            'content': [{'text': 'What is machine learning?'}]
        }
    ]
)

content_list = response["output"]["message"]["content"]
# Extract the first text block
text = next((item["text"] for item in content_list if "text" in item), None)
if text is not None:
    print(text)
```

## Using system prompts
<a name="converse-api-system-prompt"></a>

System prompts provide context and instructions to the model:

```
import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse(
    modelId='us.amazon.nova-2-lite-v1:0',
    system=[
        {'text': 'You are a helpful AI assistant specializing in cloud computing.'}
    ],
    messages=[
        {
            'role': 'user',
            'content': [{'text': 'Explain serverless computing.'}]
        }
    ]
)

# Print the response text
content_list = response["output"]["message"]["content"]
text = next((item["text"] for item in content_list if "text" in item), None)
if text is not None:
    print(text)
```

## Inference parameters
<a name="converse-api-inference-params"></a>

Control the model's output using inference parameters. The following inference parameters are available:
+ `maxTokens` (integer): Maximum number of tokens to generate (up to 65,000). If not specified, the model uses a dynamic default based on the request context.
+ `temperature` (float): Controls randomness (0.0-1.0, default 0.7). Lower values make output more deterministic.
+ `topP` (float): Nucleus sampling threshold (0-1, default 0.9). Lower values make output more focused.
+ `stopSequences` (array): Sequences of characters that stop generation when encountered.

 Example: 

```
import boto3
bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse(
    modelId='us.amazon.nova-2-lite-v1:0',
    messages=[
        {
            'role': 'user',
            'content': [{'text': 'Write a short story.'}]
        }
    ],
    inferenceConfig={
        'maxTokens': 512,
        'temperature': 0.7,
        'topP': 0.9,
        'stopSequences': ['END']
    }
)

content_list = response["output"]["message"]["content"]
text = next((item["text"] for item in content_list if "text" in item), None)
if text is not None:
    print(text)
```

## Using reasoning
<a name="converse-api-reasoning"></a>

Nova 2 Lite supports extended thinking for complex problem-solving. Enable reasoning with `reasoningConfig`.

By default, reasoning is disabled to optimize for speed and cost on simple queries. When you need to go beyond these straightforward tasks, you can enable reasoning. Nova 2 provides flexible control over reasoning depth through three effort levels:

Low effort (`maxReasoningEffort: "low"`)  
**Best for:** Tasks with added complexity requiring structured thinking. For example, you can use this for code review and improvement suggestions where the model needs to carefully consider existing code quality, analysis tasks that require thoughtful consideration of multiple factors, or problem-solving scenarios that benefit from a methodical approach. Low effort is ideal for compound tasks where basic reasoning improves accuracy without requiring deep multi-step planning.

Medium effort (`maxReasoningEffort: "medium"`)  
**Best for:** Multi-step tasks and coding workflows. For example, you can use this for software development and debugging where the model needs to understand existing code structure before implementing changes, code generation that requires coordination across multiple files or components, multi-step calculations with interdependencies, or planning tasks with multiple constraints. Medium effort is optimal for agentic workflows that coordinate multiple tools and require the model to maintain context across several sequential operations.

High effort (`maxReasoningEffort: "high"`)  
**Best for:** STEM reasoning and advanced problem-solving. For example, you can use this for advanced mathematical problems and proofs that require rigorous step-by-step verification, scientific analysis and research tasks demanding deep investigation, complex system design with architectural considerations across multiple dimensions, or critical decision-making scenarios with significant implications. High effort delivers maximum accuracy for tasks requiring sophisticated reasoning, careful evaluation of alternatives, and thorough validation of conclusions.

The following example shows a request with low reasoning effort:

------
#### [ Low effort ]

```
import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse(
    modelId='us.amazon.nova-2-lite-v1:0',
    system=[{"text": "You are a highly capable personal assistant"}],
    messages=[{
        "role": "user",
        "content": [{"text": "Provide a meal plan for a gluten free family of 4."}]
    }],
    inferenceConfig={
        "temperature": 0.7,
        "topP": 0.9,
        "maxTokens": 10000
    },
    additionalModelRequestFields={
        "reasoningConfig": {
            "type": "enabled",
            "maxReasoningEffort": "low"
        }
    }
)
```

------
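
For comparison, a hypothetical high-effort request might look like the following sketch. Because `temperature`, `topP`, and `topK` cannot be combined with high effort, the sampling parameters are omitted; the prompt text is a placeholder:

```python
# Build a high-effort reasoning request. Note the absence of temperature,
# topP, and topK: these sampling parameters are rejected at high effort.
request_kwargs = {
    'modelId': 'us.amazon.nova-2-lite-v1:0',
    'messages': [{
        'role': 'user',
        'content': [{'text': 'Prove that the square root of 2 is irrational.'}]
    }],
    'inferenceConfig': {'maxTokens': 10000},
    'additionalModelRequestFields': {
        'reasoningConfig': {
            'type': 'enabled',
            'maxReasoningEffort': 'high'
        }
    },
}

# bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
# response = bedrock.converse(**request_kwargs)
```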

**Reasoning parameters:**

The following are the reasoning parameters:
+ `type`: `enabled` or `disabled` (default: `disabled`)
+ `maxReasoningEffort`: `low`, `medium`, or `high`. Required when reasoning is enabled.

**Note**  
`temperature`, `topP`, and `topK` cannot be used when `maxReasoningEffort` is set to `high`. Including them causes an error.

The response includes reasoning content: 

```
{
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {
                    "reasoningContent": {
                        "reasoningText": {
                            "text": "[REDACTED]"
                        }
                    }
                },
                {
                    "text": "Based on the premises, we can conclude..."
                }
            ]
        }
    },
    "stopReason": "end_turn"
}
```

**Note**  
With Amazon Nova 2, reasoning content displays as `[REDACTED]`. You're still charged for reasoning tokens as they contribute to improved output quality. We include this field in the response structure now to preserve the option of exposing reasoning content in the future. We are actively working with customers to determine the best approach for surfacing the model's reasoning process.

# Using the Invoke API
<a name="using-invoke-api"></a>

The Invoke API provides direct access to Amazon Nova models with greater control over the request and response format. Unlike the Converse API, which abstracts model-specific details, the Invoke API lets you work directly with the model's native request and response structures.

**Note**  
The Invoke API supports the same features as the Converse API except for document input modality, which is specific to the Converse API.

## Request structure
<a name="invoke-api-request"></a>

An Invoke API request requires the model ID and a JSON request body:

```
import boto3
import json

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

request_body = {
    'messages': [
        {
            'role': 'user',
            'content': [{'text': 'What is machine learning?'}]
        }
    ],
    'inferenceConfig': {
        'maxTokens': 512,
        'temperature': 0.7
    }
}

response = bedrock.invoke_model(
    modelId='us.amazon.nova-2-lite-v1:0',
    body=json.dumps(request_body)
)

response_body = json.loads(response['body'].read())
content_list = response_body["output"]["message"]["content"]
# Extract the first text block
text_block = next((item for item in content_list if "text" in item), None)
if text_block is not None:
    print(text_block["text"])
```

## Request parameters
<a name="invoke-api-parameters"></a>

The Invoke API supports the following key parameters:
+ `messages`: Array of conversation messages with role and content
+ `system`: Optional system prompt for context and instructions
+ `inferenceConfig`: Parameters controlling model output (temperature, maxTokens, topP, topK, stopSequences, reasoningConfig)
+ `toolConfig`: Tool specifications and tool choice for function calling
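
A fuller request body using these parameters might look like the following sketch. The weather tool is a hypothetical example, and its schema is shaped after the Converse `toolSpec` format, which the Nova native format mirrors; treat the exact field names as an assumption:

```python
# Request body combining the key Invoke API parameters. The get_weather
# tool is a hypothetical placeholder, not a real service.
request_body = {
    'system': [{'text': 'You are a concise weather assistant.'}],
    'messages': [
        {'role': 'user', 'content': [{'text': 'What is the weather in Seattle?'}]}
    ],
    'inferenceConfig': {'maxTokens': 256, 'temperature': 0.2, 'topK': 20},
    'toolConfig': {
        'tools': [{
            'toolSpec': {
                'name': 'get_weather',
                'description': 'Get current weather for a city.',
                'inputSchema': {
                    'json': {
                        'type': 'object',
                        'properties': {'city': {'type': 'string'}},
                        'required': ['city']
                    }
                }
            }
        }]
    }
}
```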

# Streaming responses
<a name="streaming-responses"></a>

Streaming allows you to receive model responses incrementally as they are generated, providing a more interactive user experience. Both the Converse API and Invoke API support streaming.

## Streaming with ConverseStream
<a name="streaming-converse"></a>

Use `ConverseStream` to receive responses as a stream of events:

```
import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse_stream(
    modelId='us.amazon.nova-2-lite-v1:0',
    messages=[
        {
            'role': 'user',
            'content': [{'text': 'Write a short story about AI.'}]
        }
    ]
)

for event in response['stream']:
    if 'contentBlockDelta' in event:
        delta = event['contentBlockDelta']['delta']
        if 'text' in delta:
            print(delta['text'], end='', flush=True)
```

## Streaming with InvokeModelWithResponseStream
<a name="streaming-invoke"></a>

Use `InvokeModelWithResponseStream` for streaming with the Invoke API:

```
import boto3
import json

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

request_body = {
    'messages': [
        {
            'role': 'user',
            'content': [{'text': 'Explain quantum computing.'}]
        }
    ]
}

response = bedrock.invoke_model_with_response_stream(
    modelId='us.amazon.nova-2-lite-v1:0',
    body=json.dumps(request_body)
)

for event in response['body']:
    chunk = json.loads(event['chunk']['bytes'])
    if 'contentBlockDelta' in chunk:
        delta = chunk['contentBlockDelta']['delta']
        if 'text' in delta:
            print(delta['text'], end='', flush=True)
```

## Stream event types
<a name="streaming-events"></a>

Streaming responses include several event types:
+ `messageStart`: Indicates the start of a message
+ `contentBlockStart`: Indicates the start of a content block
+ `contentBlockDelta`: Contains incremental text or data
+ `contentBlockStop`: Indicates the end of a content block
+ `messageStop`: Indicates the end of the message with stop reason
+ `metadata`: Contains usage information (token counts)
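
A small handler sketch shows how these event types fit together. The `events` list below is synthetic, shaped like a real `ConverseStream` response for illustration:

```python
def handle_stream(events):
    """Collect text, stop reason, and usage from Converse stream events."""
    text_parts = []
    stop_reason = None
    usage = None
    for event in events:
        if 'messageStart' in event:
            pass  # message begins; the role is in event['messageStart']['role']
        elif 'contentBlockStart' in event or 'contentBlockStop' in event:
            pass  # block boundaries; useful when handling tool-use blocks
        elif 'contentBlockDelta' in event:
            delta = event['contentBlockDelta']['delta']
            if 'text' in delta:
                text_parts.append(delta['text'])
        elif 'messageStop' in event:
            stop_reason = event['messageStop']['stopReason']
        elif 'metadata' in event:
            usage = event['metadata'].get('usage')
    return ''.join(text_parts), stop_reason, usage

# Synthetic events for illustration:
events = [
    {'messageStart': {'role': 'assistant'}},
    {'contentBlockStart': {'contentBlockIndex': 0}},
    {'contentBlockDelta': {'delta': {'text': 'Hello, '}}},
    {'contentBlockDelta': {'delta': {'text': 'world.'}}},
    {'contentBlockStop': {'contentBlockIndex': 0}},
    {'messageStop': {'stopReason': 'end_turn'}},
    {'metadata': {'usage': {'inputTokens': 5, 'outputTokens': 4}}},
]
text, stop_reason, usage = handle_stream(events)
print(text)         # Hello, world.
print(stop_reason)  # end_turn
```

In a real application, you would pass `response['stream']` from `converse_stream` directly to such a handler.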

# Using Amazon Nova embeddings
<a name="embeddings"></a>

Amazon Nova Multimodal Embeddings is a multimodal embeddings model for agentic RAG and semantic search applications. It supports text, documents, images, video and audio through a single model, enabling cross-modal retrieval. Nova Multimodal Embeddings maps each of these content types into a unified semantic space, enabling you to conduct unimodal, cross-modal and multimodal vector operations.

When a piece of content is passed through Nova embeddings, the model converts that content into a universal numerical format, referred to as a vector. A vector is a set of numerical values that can be used for various search functionalities. Semantically similar content produces vectors that are closer together than dissimilar content.
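
"Closer" is typically measured with cosine similarity. The following minimal sketch uses toy 3-dimensional vectors; real Nova embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; real Nova embeddings have 256-3072 dimensions.
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
car = [0.1, 0.9, 0.3]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```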

 Applications:
+ Semantic Content Retrieval and Recommendation: Generate embeddings for your content, then use them to find similar items or provide personalized recommendations. 
+ Multimodal Search: Combine embeddings from different content types to enable cross-modal search capabilities.
+ RAG: Generate embeddings from multimodal content such as documents with interleaved text and images to power your retrieval workflow for GenAI applications.

## Key features
<a name="embeddings-features"></a>

These key features distinguish Nova Multimodal Embeddings:
+ Support for text, image, document image, video, and audio in a unified semantic space. Maximum context length is 8K tokens, 30 seconds of video, or 30 seconds of audio.
+ Synchronous and asynchronous APIs: The API supports both synchronous and asynchronous use.
+ Large file segmentation: The async API makes it easy to work with large inputs by providing built-in segmentation for long text, video, and audio, controlled by user-defined parameters. The model generates a single embedding for each segment.
+ Video with audio: Process video and its audio track simultaneously. Specify whether you want a single embedding representing both modalities or two separate embeddings.
+ Embedding purpose: Optimize your embeddings for the intended downstream application (retrieval/RAG/search, classification, clustering).
+ Dimension sizes: Four dimension sizes to trade off embedding accuracy against vector storage cost: 3072, 1024, 384, or 256.
+ Input methods: Pass content to be embedded by specifying an S3 URI or inline as a base64 encoding.

## Generating embeddings
<a name="embeddings-examples"></a>

The following examples show how to generate embeddings.

Synchronous embedding for text:

```
import boto3
import json

# Create the Bedrock Runtime client.
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1',
)

# Define the request body.
request_body = {
    'taskType': 'SINGLE_EMBEDDING',
    'singleEmbeddingParams': {
        'embeddingPurpose': 'GENERIC_INDEX',
        'embeddingDimension': 3072,
        'text': {'truncationMode': 'END', 'value': 'Hello, World!'},
    },
}

try:
    # Invoke the Nova Embeddings model.
    response = bedrock_runtime.invoke_model(
        body=json.dumps(request_body),
        modelId='amazon.nova-2-multimodal-embeddings-v1:0',
        accept='application/json',
        contentType='application/json',
    )

    # Print the request ID.
    print('Request ID:', response['ResponseMetadata']['RequestId'])

    # Print the response body.
    response_body = json.loads(response['body'].read())
    print(json.dumps(response_body, indent=2))

except Exception as e:
    # Add your own exception handling here.
    print(e)
```

Asynchronous embedding for video:

```
import boto3

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1',
)

model_input = {
    'taskType': 'SEGMENTED_EMBEDDING',
    'segmentedEmbeddingParams': {
        'embeddingPurpose': 'GENERIC_INDEX',
        'embeddingDimension': 3072,
        'video': {
            'format': 'mp4',
            'embeddingMode': 'AUDIO_VIDEO_COMBINED',
            'source': {
                's3Location': {'uri': 's3://my-bucket/path/to/video.mp4'}
            },
            'segmentationConfig': {
                'durationSeconds': 15  # Segment into 15-second chunks
            },
        },
    },
}

try:
    # Start the asynchronous invocation.
    response = bedrock_runtime.start_async_invoke(
        modelId='amazon.nova-2-multimodal-embeddings-v1:0',
        modelInput=model_input,
        outputDataConfig={
            's3OutputDataConfig': {
                's3Uri': 's3://my-bucket'
            }
        },
    )

    print('Request ID:', response['ResponseMetadata']['RequestId'])
    print('Invocation ARN:', response['invocationArn'])

except Exception as e:
    print(e)
```
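
The invocation ARN can then be used to poll the job's status with the `get_async_invoke` operation. The following sketch separates the polling loop from the AWS call so the loop itself is shown standalone; the canned statuses stand in for a live job:

```python
import time

def wait_for_async_invoke(fetch_status, poll_seconds=15):
    """Poll until the job leaves the InProgress state, then return its status.

    fetch_status is a callable returning the status string, for example:
        lambda: bedrock_runtime.get_async_invoke(invocationArn=arn)['status']
    """
    while True:
        status = fetch_status()
        if status != 'InProgress':
            return status
        time.sleep(poll_seconds)

# Illustration with canned statuses instead of a live job:
statuses = iter(['InProgress', 'InProgress', 'Completed'])
print(wait_for_async_invoke(lambda: next(statuses), poll_seconds=0))  # Completed
```

Once the status is `Completed`, the embeddings are available at the S3 location given in `outputDataConfig`.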

**Batch inference**

 With batch inference, you can submit multiple requests and generate embeddings asynchronously. Batch inference helps you process many requests efficiently by sending a single request and generating responses in an Amazon S3 bucket. 
+  After defining model inputs in files you create, you upload the files to an S3 bucket. 
+  You then submit a batch inference request and specify the S3 bucket. 
+ After the job is complete, you can retrieve the output files from S3. 
+  You can use batch inference to improve the performance of model inference on large datasets. 

**Format and upload your batch inference data**

You must add your batch inference data to an S3 location that you specify when submitting a model invocation job. The S3 location must contain the following items:

At least one JSONL file that defines the model inputs. A JSONL file contains rows of JSON objects. Your JSONL file must end in the extension .jsonl and be in the following format:

```
{
    "recordId": "record001",
    "modelInput": {
        "taskType": "SINGLE_EMBEDDING",
        "singleEmbeddingParams": {
            "embeddingPurpose": "GENERIC_INDEX",
            "embeddingDimension": 3072,
            "text": {
                "source": {
                    "s3Location": {
                        "uri": "s3://batch-inference-input-bucket/text_001.txt",
                        "bucketOwner": "111122223333"
                    }
                },
                "truncationMode": "END"
            }
        }
    }
}
```

**Supported input file types**
+ Image Formats: PNG, JPEG, WEBP, GIF
+ Audio Formats: MP3, WAV, OGG
+ Video Formats: MP4, MOV, MKV, WEBM, FLV, MPEG, MPG, WMV, 3GP
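
Putting the pieces together, a batch job submission might look like the following sketch. The job name, role ARN, and bucket names are placeholders, and the job is created with the control-plane `bedrock` client (`create_model_invocation_job`), not `bedrock-runtime`:

```python
# Sketch: submitting a batch embeddings job. All names below are placeholders.
job_kwargs = {
    'jobName': 'nova-embeddings-batch-001',
    'roleArn': 'arn:aws:iam::111122223333:role/BatchInferenceRole',
    'modelId': 'amazon.nova-2-multimodal-embeddings-v1:0',
    'inputDataConfig': {
        's3InputDataConfig': {'s3Uri': 's3://batch-inference-input-bucket/'}
    },
    'outputDataConfig': {
        's3OutputDataConfig': {'s3Uri': 's3://batch-inference-output-bucket/'}
    },
}

# bedrock = boto3.client('bedrock', region_name='us-east-1')  # control plane
# response = bedrock.create_model_invocation_job(**job_kwargs)
# print('Job ARN:', response['jobArn'])
```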

# On-demand inference
<a name="on-demand-inference"></a>

On-demand inference provides serverless access to Amazon Nova models without requiring provisioned capacity. This mode automatically scales to handle your workload and charges based on usage.

## Benefits
<a name="on-demand-benefits"></a>

On-demand inference offers several advantages:
+ **No capacity planning:** Automatically scales to meet demand
+ **Pay per use:** Charged only for tokens processed
+ **Instant availability:** No provisioning or warm-up time required
+ **Cost effective:** Ideal for variable or unpredictable workloads

## Using on-demand inference
<a name="on-demand-usage"></a>

On-demand inference is the default mode for Amazon Nova models. Simply specify the model ID when making API calls:

```
import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.converse(
    modelId='us.amazon.nova-2-lite-v1:0',
    messages=[
        {
            'role': 'user',
            'content': [{'text': 'Hello, Nova!'}]
        }
    ]
)

# Print the response text
content_list = response["output"]["message"]["content"]
text = next((item["text"] for item in content_list if "text" in item), None)
if text is not None:
    print(text)
```

## Pricing
<a name="on-demand-pricing"></a>

On-demand inference is billed based on the number of input and output tokens processed. For current pricing details, see [Amazon Bedrock pricing](https://aws.amazon.com/bedrock/pricing/).

## Quotas and limits
<a name="on-demand-limits"></a>

On-demand inference has default quotas that vary by model and region. To request quota increases, use the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).