

# Speech-to-Speech (Amazon Nova 2 Sonic)
<a name="using-conversational-speech"></a>

Amazon Nova 2 Sonic enables real-time conversational AI with speech input and output. The following section covers advanced capabilities for building interactive voice assistants, customer service automation, and conversational applications.

## Key features
<a name="sonic-key-features"></a>

Amazon Nova 2 Sonic provides the following capabilities:
+ State-of-the-art streaming speech understanding with a bidirectional streaming API that enables real-time, low-latency, multi-turn conversations.
+ Multilingual support with automatic language detection and switching. Expressive voices are offered, including both masculine-sounding and feminine-sounding voices, in the following languages:
  + English (US, UK, India, Australia)
  + French
  + Italian
  + German
  + Spanish
  + Portuguese
  + Hindi
+ Polyglot voices that can speak any of the supported languages to enable a consistent user experience even when the user switches languages within the same session.
+ Robustness to background noise for real world deployment scenarios.
+ Robustness to different accents for supported languages.
+ Natural, human-like conversational AI experiences with contextual richness across all supported languages.
+ Adaptive speech response that dynamically adjusts delivery based on the prosody of the input speech.
+ Intelligent turn-taking that detects when users finish speaking and when the assistant should respond, creating natural dialogue rhythm.
+ Graceful handling of user interruptions without dropping conversational context.
+ Knowledge grounding with enterprise data using Retrieval Augmented Generation (RAG).
+ Function calling and agentic workflow support for building complex AI applications.
+ Asynchronous tool handling that executes tool calls while maintaining conversation flow, allowing the assistant to continue speaking while tools process in the background.
+ Cross-modal input support for both audio and text inputs within the same conversation, enabling flexible interaction patterns.
+ Connection limit of 8 minutes, with connection renewal and session continuation pattern available in code samples.

# Getting started with speech-to-speech
<a name="sonic-getting-started"></a>

The following sections provide an example and step-by-step explanation of how to implement a simple, real-time audio streaming application using Amazon Nova 2 Sonic. This simplified version demonstrates the core functionality needed to create an audio conversation with the Amazon Nova 2 Sonic model.

You can access the following example in our [Nova samples GitHub repo](https://github.com/aws-samples/amazon-nova-samples).

There is a connection limit of 8 minutes; a connection renewal and conversation continuation pattern is available on [GitHub](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/repeatable-patterns/session-continuation/console-python).

## Set up the imports and configuration
<a name="sonic-imports-config"></a>

This section imports necessary libraries and sets audio configuration parameters:
+ `asyncio`: For asynchronous programming
+ `base64`: For encoding and decoding audio data
+ `pyaudio`: For audio capture and playback
+ Amazon Bedrock SDK components for streaming
+ Audio constants that define the capture and playback formats (16 kHz input, 24 kHz output, 16-bit mono)

```
import os
import asyncio
import base64
import json
import uuid
import pyaudio
from aws_sdk_bedrock_runtime.client import BedrockRuntimeClient, InvokeModelWithBidirectionalStreamOperationInput
from aws_sdk_bedrock_runtime.models import InvokeModelWithBidirectionalStreamInputChunk, BidirectionalInputPayloadPart
from aws_sdk_bedrock_runtime.config import Config, HTTPAuthSchemeResolver, SigV4AuthScheme
from smithy_aws_core.identity import EnvironmentCredentialsResolver

# Audio configuration
INPUT_SAMPLE_RATE = 16000
OUTPUT_SAMPLE_RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK_SIZE = 1024
```
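
For intuition on the streaming cadence, the constants above determine how much speech each captured chunk holds. This is simple arithmetic on the documented values, not additional API behavior:

```python
# Duration of one captured chunk: CHUNK_SIZE frames at INPUT_SAMPLE_RATE frames/second.
CHUNK_SIZE = 1024
INPUT_SAMPLE_RATE = 16000

chunk_duration_ms = CHUNK_SIZE * 1000 / INPUT_SAMPLE_RATE
print(chunk_duration_ms)  # 64.0 -> each chunk carries 64 ms of audio
```

At this rate, roughly 15–16 chunks per second are sent to the model during continuous speech.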

## Define the `SimpleNovaSonic` class
<a name="sonic-define-class"></a>

The `SimpleNovaSonic` class is the main class that handles the Amazon Nova 2 Sonic interaction:
+ `model_id`: The Amazon Nova 2 Sonic model ID (`amazon.nova-2-sonic-v1:0`)
+ `region`: The AWS Region; the default is `us-east-1`
+ Unique IDs for prompt and content tracking
+ An asynchronous queue for audio playback

```
class SimpleNovaSonic:
    def __init__(self, model_id='amazon.nova-2-sonic-v1:0', region='us-east-1'):
        self.model_id = model_id
        self.region = region
        self.client = None
        self.stream = None
        self.response = None
        self.is_active = False
        self.prompt_name = str(uuid.uuid4())
        self.content_name = str(uuid.uuid4())
        self.audio_content_name = str(uuid.uuid4())
        self.audio_queue = asyncio.Queue()
        self.role = None
        self.display_assistant_text = False
```

## Initialize the client
<a name="sonic-initialize-client"></a>

This method configures the Amazon Bedrock client with the following:
+ The appropriate endpoint for the specified region
+ Authentication information using environment variables for AWS credentials
+ The SigV4 authentication scheme for the AWS API calls

```
    def _initialize_client(self):
        """Initialize the Bedrock client."""
        config = Config(
            endpoint_uri=f"https://bedrock-runtime.{self.region}.amazonaws.com",
            region=self.region,
            aws_credentials_identity_resolver=EnvironmentCredentialsResolver(),
            auth_scheme_resolver=HTTPAuthSchemeResolver(),
            auth_schemes={"aws.auth#sigv4": SigV4AuthScheme(service="bedrock")}
        )
        self.client = BedrockRuntimeClient(config=config)
```

## Handle events
<a name="sonic-handle-events"></a>

This helper method sends JSON events to the bidirectional stream, which is used for all communication with the Amazon Nova 2 Sonic model:

```
    async def send_event(self, event_json):
        """Send an event to the stream."""
        event = InvokeModelWithBidirectionalStreamInputChunk(
            value=BidirectionalInputPayloadPart(bytes_=event_json.encode('utf-8'))
        )
        await self.stream.input_stream.send(event)
```
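
The walkthrough builds its event payloads with f-strings for readability. An equivalent approach, sketched below, builds a `dict` and serializes it with `json.dumps`, which handles quoting and escaping automatically (useful when content can contain quotes or newlines). The event shape mirrors the `contentEnd` event used later in this example:

```python
import json
import uuid

prompt_name = str(uuid.uuid4())
content_name = str(uuid.uuid4())

# The same contentEnd event as the f-string version, built as a dict and serialized.
event = {
    "event": {
        "contentEnd": {
            "promptName": prompt_name,
            "contentName": content_name,
        }
    }
}
event_json = json.dumps(event)  # pass this string to send_event()
```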

## Start the session
<a name="sonic-start-session"></a>

This method initiates the session and sends the remaining setup events needed to start audio streaming. These events must be sent in the order shown.

```
    async def start_session(self):
        """Start a new session with Nova Sonic."""
        if not self.client:
            self._initialize_client()
            
        # Initialize the stream
        self.stream = await self.client.invoke_model_with_bidirectional_stream(
            InvokeModelWithBidirectionalStreamOperationInput(model_id=self.model_id)
        )
        self.is_active = True
        
        # Send session start event
        session_start = '''
        {
          "event": {
            "sessionStart": {
              "inferenceConfiguration": {
                "maxTokens": 1024,
                "topP": 0.9,
                "temperature": 0.7
              },
              "turnDetectionConfiguration": {
                "endpointingSensitivity": "HIGH"
              }
            }
          }
        }
        '''
        await self.send_event(session_start)
        
        # Send prompt start event
        prompt_start = f'''
        {{
          "event": {{
            "promptStart": {{
              "promptName": "{self.prompt_name}",
              "textOutputConfiguration": {{
                "mediaType": "text/plain"
              }},
              "audioOutputConfiguration": {{
                "mediaType": "audio/lpcm",
                "sampleRateHertz": 24000,
                "sampleSizeBits": 16,
                "channelCount": 1,
                "voiceId": "matthew",
                "encoding": "base64",
                "audioType": "SPEECH"
              }}
            }}
          }}
        }}
        '''
        await self.send_event(prompt_start)
        
        # Send system prompt
        text_content_start = f'''
        {{
            "event": {{
                "contentStart": {{
                    "promptName": "{self.prompt_name}",
                    "contentName": "{self.content_name}",
                    "type": "TEXT",
                    "interactive": true,
                    "role": "SYSTEM",
                    "textInputConfiguration": {{
                        "mediaType": "text/plain"
                    }}
                }}
            }}
        }}
        '''
        await self.send_event(text_content_start)
        
        system_prompt = "You are a friendly assistant. The user and you will engage in a spoken dialog " \
            "exchanging the transcripts of a natural real-time conversation. Keep your responses short, " \
            "generally two or three sentences for chatty scenarios."
        
        
        text_input = f'''
        {{
            "event": {{
                "textInput": {{
                    "promptName": "{self.prompt_name}",
                    "contentName": "{self.content_name}",
                    "content": "{system_prompt}"
                }}
            }}
        }}
        '''
        await self.send_event(text_input)
        
        text_content_end = f'''
        {{
            "event": {{
                "contentEnd": {{
                    "promptName": "{self.prompt_name}",
                    "contentName": "{self.content_name}"
                }}
            }}
        }}
        '''
        await self.send_event(text_content_end)
        
        # Start processing responses
        self.response = asyncio.create_task(self._process_responses())
```

## Handle audio input
<a name="sonic-handle-audio-input"></a>

These methods handle the audio input lifecycle:
+ `start_audio_input`: Configures and starts the audio input stream
+ `send_audio_chunk`: Encodes and sends audio chunks to the model
+ `end_audio_input`: Properly closes the audio input stream

```
    async def start_audio_input(self):
        """Start audio input stream."""
        audio_content_start = f'''
        {{
            "event": {{
                "contentStart": {{
                    "promptName": "{self.prompt_name}",
                    "contentName": "{self.audio_content_name}",
                    "type": "AUDIO",
                    "interactive": true,
                    "role": "USER",
                    "audioInputConfiguration": {{
                        "mediaType": "audio/lpcm",
                        "sampleRateHertz": 16000,
                        "sampleSizeBits": 16,
                        "channelCount": 1,
                        "audioType": "SPEECH",
                        "encoding": "base64"
                    }}
                }}
            }}
        }}
        '''
        await self.send_event(audio_content_start)
    
    async def send_audio_chunk(self, audio_bytes):
        """Send an audio chunk to the stream."""
        if not self.is_active:
            return
            
        blob = base64.b64encode(audio_bytes)
        audio_event = f'''
        {{
            "event": {{
                "audioInput": {{
                    "promptName": "{self.prompt_name}",
                    "contentName": "{self.audio_content_name}",
                    "content": "{blob.decode('utf-8')}"
                }}
            }}
        }}
        '''
        await self.send_event(audio_event)
    
    async def end_audio_input(self):
        """End audio input stream."""
        audio_content_end = f'''
        {{
            "event": {{
                "contentEnd": {{
                    "promptName": "{self.prompt_name}",
                    "contentName": "{self.audio_content_name}"
                }}
            }}
        }}
        '''
        await self.send_event(audio_content_end)
```
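
As a quick sanity check on the encoding used by `send_audio_chunk`, base64 is a lossless text encoding of the raw PCM bytes, so the service-side decode recovers the exact samples. A minimal round trip:

```python
import base64

# 16-bit mono PCM: 2 bytes per frame; 512 frames = one 1024-byte chunk.
pcm_chunk = b"\x00\x01" * 512

blob = base64.b64encode(pcm_chunk).decode("utf-8")  # what goes into the JSON event
restored = base64.b64decode(blob)                   # what the service decodes

assert restored == pcm_chunk  # lossless round trip
```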

## End the session
<a name="sonic-end-session"></a>

This method properly closes the session by:
+ Sending a `promptEnd` event
+ Sending a `sessionEnd` event
+ Closing the input stream

```
    async def end_session(self):
        """End the session."""
        if not self.is_active:
            return
            
        prompt_end = f'''
        {{
            "event": {{
                "promptEnd": {{
                    "promptName": "{self.prompt_name}"
                }}
            }}
        }}
        '''
        await self.send_event(prompt_end)
        
        session_end = '''
        {
            "event": {
                "sessionEnd": {}
            }
        }
        '''
        await self.send_event(session_end)
        # close the stream
        await self.stream.input_stream.close()
```

## Handle responses
<a name="sonic-handle-responses"></a>

This method continuously processes responses from the model and does the following:
+ Waits for output from the stream.
+ Parses the JSON response.
+ Handles text output (the transcription of user speech and the assistant's response text) by printing it to the console.
+ Handles audio output by decoding and queuing for playback.

```
    async def _process_responses(self):
        """Process responses from the stream."""
        try:
            while self.is_active:
                output = await self.stream.await_output()
                result = await output[1].receive()
                
                if result.value and result.value.bytes_:
                    response_data = result.value.bytes_.decode('utf-8')
                    json_data = json.loads(response_data)
                    
                    if 'event' in json_data:
                        # Handle content start event
                        if 'contentStart' in json_data['event']:
                            content_start = json_data['event']['contentStart'] 
                            # set role
                            self.role = content_start['role']
                            # Check for speculative content
                            if 'additionalModelFields' in content_start:
                                additional_fields = json.loads(content_start['additionalModelFields'])
                                if additional_fields.get('generationStage') == 'SPECULATIVE':
                                    self.display_assistant_text = True
                                else:
                                    self.display_assistant_text = False
                                
                        # Handle text output event
                        elif 'textOutput' in json_data['event']:
                            text = json_data['event']['textOutput']['content']    
                           
                            if (self.role == "ASSISTANT" and self.display_assistant_text):
                                print(f"Assistant: {text}")
                            elif self.role == "USER":
                                print(f"User: {text}")
                        
                        # Handle audio output
                        elif 'audioOutput' in json_data['event']:
                            audio_content = json_data['event']['audioOutput']['content']
                            audio_bytes = base64.b64decode(audio_content)
                            await self.audio_queue.put(audio_bytes)
        except Exception as e:
            print(f"Error processing responses: {e}")
```
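
To make the branching above concrete, the following self-contained sketch parses one `audioOutput` frame the way `_process_responses` does. The payload here is fabricated for illustration; real frames arrive from the stream:

```python
import base64
import json

# A fabricated audioOutput frame, shaped like the events handled above.
raw_bytes = json.dumps({
    "event": {
        "audioOutput": {
            "content": base64.b64encode(b"\x00\x00\x01\x00" * 2).decode("utf-8")
        }
    }
}).encode("utf-8")

json_data = json.loads(raw_bytes.decode("utf-8"))
if "audioOutput" in json_data["event"]:
    audio_content = json_data["event"]["audioOutput"]["content"]
    audio_bytes = base64.b64decode(audio_content)  # would be queued for playback
```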

## Playback audio
<a name="sonic-playback-audio"></a>

This method performs the following tasks:
+ Initializes a `PyAudio` output stream
+ Continuously retrieves audio data from the queue
+ Plays the audio through the speakers
+ Properly cleans up resources when done

```
    async def play_audio(self):
        """Play audio responses."""
        p = pyaudio.PyAudio()
        stream = p.open(
            format=FORMAT,
            channels=CHANNELS,
            rate=OUTPUT_SAMPLE_RATE,
            output=True
        )
        try:
            while self.is_active:
                audio_data = await self.audio_queue.get()
                stream.write(audio_data)
        except Exception as e:
            print(f"Error playing audio: {e}")
        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()
```

## Capture audio
<a name="sonic-capture-audio"></a>

This method performs the following tasks:
+ Initializes a `PyAudio` input stream
+ Starts the audio input session
+ Continuously captures audio chunks from the microphone
+ Sends each chunk to the Amazon Nova 2 Sonic model
+ Properly cleans up resources when done

```
    async def capture_audio(self):
        """Capture audio from microphone and send to Nova Sonic."""
        p = pyaudio.PyAudio()
        stream = p.open(
            format=FORMAT,
            channels=CHANNELS,
            rate=INPUT_SAMPLE_RATE,
            input=True,
            frames_per_buffer=CHUNK_SIZE
        )
        
        print("Starting audio capture. Speak into your microphone...")
        print("Press Enter to stop...")
        
        await self.start_audio_input()
        try:
            while self.is_active:
                audio_data = stream.read(CHUNK_SIZE, exception_on_overflow=False)
                await self.send_audio_chunk(audio_data)
                await asyncio.sleep(0.01)
        except Exception as e:
            print(f"Error capturing audio: {e}")
        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()
            print("Audio capture stopped.")
            await self.end_audio_input()
```

## Run the main function
<a name="sonic-run-main"></a>

The main function orchestrates the entire process by performing the following:
+ Creates an Amazon Nova 2 Sonic client
+ Starts the session
+ Creates concurrent tasks for audio playback and capture
+ Waits for the user to press **Enter** to stop
+ Properly ends the session and cleans up tasks

```
async def main():
    # Create Nova Sonic client
    nova_client = SimpleNovaSonic()
    
    # Start session
    await nova_client.start_session()
    
    # Start audio playback task
    playback_task = asyncio.create_task(nova_client.play_audio())
    
    # Start audio capture task
    capture_task = asyncio.create_task(nova_client.capture_audio())
    
    # Wait for user to press Enter to stop
    await asyncio.get_event_loop().run_in_executor(None, input)
    
    # End session
    nova_client.is_active = False
    
    # First cancel the tasks
    tasks = []
    if not playback_task.done():
        tasks.append(playback_task)
    if not capture_task.done():
        tasks.append(capture_task)
    for task in tasks:
        task.cancel()
    if tasks:
        await asyncio.gather(*tasks, return_exceptions=True)
    
    # cancel the response task
    if nova_client.response and not nova_client.response.done():
        nova_client.response.cancel()
    
    await nova_client.end_session()
    print("Session ended")

if __name__ == "__main__":
    # Set AWS credentials if not using environment variables
    # os.environ['AWS_ACCESS_KEY_ID'] = "your-access-key"
    # os.environ['AWS_SECRET_ACCESS_KEY'] = "your-secret-key"
    # os.environ['AWS_DEFAULT_REGION'] = "us-east-1"

    asyncio.run(main())
```

# Code examples
<a name="sonic-code-examples"></a>

These code examples help you quickly get started with Amazon Nova 2 Sonic. You can access the complete list of examples in the [Amazon Nova Sonic GitHub samples](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic) page.

## Getting started examples
<a name="sonic-getting-started-examples"></a>

For simple examples designed to get you started using Amazon Nova 2 Sonic, refer to the following implementations:
+ [Basic Amazon Nova 2 Sonic implementation (Python)](https://github.com/aws-samples/amazon-nova-samples/blob/main/speech-to-speech/amazon-nova-2-sonic/sample-codes/console-python/nova_sonic_simple.py): A basic implementation that demonstrates how events are structured in the bidirectional streaming API. This version does not support barge-in functionality (interrupting the assistant while it is speaking) and does not implement true bidirectional communication.
+ [Full featured Amazon Nova 2 Sonic implementation (Python)](https://github.com/aws-samples/amazon-nova-samples/blob/main/speech-to-speech/amazon-nova-2-sonic/sample-codes/console-python/nova_sonic_tool_use.py): The full-featured implementation with real bidirectional communication and barge-in support. This allows for more natural conversations where users can interrupt the assistant while it is speaking, similar to human conversations.
+ [Amazon Nova 2 Sonic with tool use (Python)](https://github.com/aws-samples/amazon-nova-samples/blob/main/speech-to-speech/amazon-nova-2-sonic/sample-codes/console-python/nova_sonic_tool_use.py): An advanced implementation that extends the bidirectional communication capabilities with tool use examples. This version demonstrates how Amazon Nova 2 Sonic can interact with external tools and APIs to provide enhanced functionality.
+ [Nova Sonic with text and mixed input (Python)](https://github.com/aws-samples/amazon-nova-samples/blob/main/speech-to-speech/amazon-nova-2-sonic/sample-codes/console-python/nova_sonic_with_text.py): An example implementation that showcases how Amazon Nova 2 Sonic can accept text as an input.
+ [Java WebSocket implementation (Java)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/sample-codes/websocket-java): This example implements a bidirectional WebSocket-based audio streaming application that integrates with Amazon Nova 2 Sonic for real-time speech-to-speech conversation using Java.
+ [NodeJS Websocket implementation (NodeJS)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/sample-codes/websocket-nodejs): This example implements a bidirectional WebSocket-based audio streaming application that integrates with Amazon Nova 2 Sonic for real-time speech-to-speech conversation using NodeJS.
+ [.NET WebSocket implementation (C#)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/sample-codes/websocket-dotnet): This example implements a bidirectional WebSocket-based audio streaming application that integrates with Amazon Nova 2 Sonic for real-time speech-to-speech conversation using .NET.

## Advanced use cases
<a name="sonic-advanced-examples"></a>

For advanced examples demonstrating more complex use cases, refer to the following implementations:
+ [Amazon Bedrock Knowledge Base implementation (NodeJS)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/repeatable-patterns/bedrock-knowledge-base): This example demonstrates how to build an intelligent conversational application by integrating Amazon Nova 2 Sonic with Amazon Bedrock Knowledge Base using NodeJS.
+ [Chat History Management (Python)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/repeatable-patterns/chat-history-logger): This example includes a chat history logging system that captures and preserves all interactions between the user and Amazon Nova 2 Sonic using Python.
+ [Hotel Reservation Cancellation (NodeJS)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/repeatable-patterns/customer-service/hotel-cancellation-websocket): This example demonstrates a practical customer service use case for Amazon Nova 2 Sonic, implementing a hotel reservation cancellation system using NodeJS.
+ [LangChain Knowledge Base integration (Python)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/repeatable-patterns/langchain-knowledge-base): This implementation demonstrates how to integrate Amazon Nova 2 Sonic speech-to-speech capabilities with a LangChain-powered knowledge base for enhanced conversational experiences using Python.
+ [Conversation Resumption (NodeJS)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/repeatable-patterns/resume-conversation): This example demonstrates how to implement conversation resumption capabilities with Amazon Nova 2 Sonic. Using a hotel reservation cancellation scenario as the context, the application shows how to maintain conversation state across sessions, allowing users to seamlessly continue interactions that were previously interrupted using NodeJS.
+ [Nova 2 Sonic Speaks First (NodeJS)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/repeatable-patterns/nova-sonic-speaks-first): This example demonstrates how Amazon Nova 2 Sonic can initiate conversations proactively.
+ [Session Continuation (Python)](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/repeatable-patterns/session-continuation/console-python): This example demonstrates how to enable unlimited conversation length with Amazon Nova 2 Sonic by implementing seamless session transitions. The application automatically creates and switches to new sessions in the background, allowing conversations to continue indefinitely without interruption or context loss.

## Hands-on workshop
<a name="sonic-workshop"></a>

A hands-on workshop is available that guides you through building a voice chat application using Amazon Nova 2 Sonic with a bidirectional streaming interface. You can [access the workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/5238419f-1337-4e0f-8cd7-02239486c40d/en-US) and find the [complete code examples](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/workshops).

# Voice conversation prompts
<a name="sonic-system-prompts"></a>

Nova 2 introduces **Speech Prompts**, a specialized prompting capability designed to control speech-specific transcription formatting for Hindi. Speech prompts work alongside your system prompt but serve a distinct purpose:
+ **System Prompt**: Controls your assistant's behavior, personality, and response style
+ **Speech Prompt**: Controls transcription formatting for Hindi code-switching (Latin/Devanagari/mixed scripts)

## Important Guidelines
<a name="sonic-important-guidelines"></a>

**Speech prompts are pre-configured and should be used exactly as documented.** They are designed for specific transcription formatting needs and should not be modified or customized, as changes may cause unexpected behavior.

**When to use Speech Prompts:**
+ You need to control script output for Hindi code-switching (Latin/Devanagari/mixed)

**When NOT to use Speech Prompts:**
+ For general instructions or assistant behavior (use system prompt instead)
+ If you're not working with Hindi transcription formatting
+ If the specific formatting need doesn't apply to your use case

**Best Practice:** Only include a speech prompt if you specifically need Hindi transcription formatting. All other instructions, including language preferences, response style, verbosity, and reasoning, should go in your system prompt.

**Important:** Speech prompts must be sent **after** the system prompt to the model.
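
In terms of the event flow from the walkthrough above, this ordering means sending the system prompt as one `TEXT` content and the speech prompt as a second `TEXT` content, in that order, before any audio. The sketch below builds both event sequences; the field names mirror the walkthrough, and treating the speech prompt as a second `SYSTEM`-role text content is an assumption of this sketch:

```python
import json

def text_content_events(prompt_name, content_name, role, text):
    """Build the contentStart / textInput / contentEnd triple for one text content."""
    common = {"promptName": prompt_name, "contentName": content_name}
    return [
        json.dumps({"event": {"contentStart": {
            **common, "type": "TEXT", "interactive": True, "role": role,
            "textInputConfiguration": {"mediaType": "text/plain"}}}}),
        json.dumps({"event": {"textInput": {**common, "content": text}}}),
        json.dumps({"event": {"contentEnd": common}}),
    ]

system_prompt = "You are a warm, professional, and helpful AI assistant."
speech_prompt = ("If the input audio/speech contains hindi, then the transcription "
                 "and response should be in All Devanagari script (Hindi).")

# System prompt first, then the speech prompt; the order matters.
events = (
    text_content_events("prompt-1", "content-1", "SYSTEM", system_prompt)
    + text_content_events("prompt-1", "content-2", "SYSTEM", speech_prompt)
)
# Each string in `events` would be passed to send_event() in sequence.
```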

## Recommended Baseline System Prompt for Voice
<a name="sonic-baseline-system-prompt"></a>

```
You are a warm, professional, and helpful AI assistant. Give accurate answers that sound natural, direct, and human. Start by answering the user's question clearly in 1–2 sentences. Then, expand only enough to make the answer understandable, staying within 3–5 short sentences total. Avoid sounding like a lecture or essay.
```

## Speech Prompt Configuration
<a name="sonic-speech-prompt-config"></a>

### Code Switching
<a name="sonic-code-switching"></a>

**Note:** This feature currently applies to Hindi language only.

Choose one of the following prompts based on your desired output script:

**For Latin script output (Romanized Hindi):**

```
If the input audio/speech contains hindi, then the transcription and response should be in All Latin script (romanized Hindi).
```

**For Devanagari script output:**

```
If the input audio/speech contains hindi, then the transcription and response should be in All Devanagari script (Hindi).
```

**For mixed script output (natural code-switching):**

```
If the input audio/speech contains hindi, then the transcription and response can mix Latin and Devanagari scripts naturally for code-switching.
```

## System Prompt Configuration
<a name="sonic-system-prompt-config"></a>

### Controlling Response Verbosity
<a name="sonic-controlling-verbosity"></a>

**Concise, conversational responses:**

```
You are a warm, professional, and helpful AI assistant. Give accurate answers that sound natural, direct, and human. Start by answering the user's question clearly in 1–2 sentences. Then, expand only enough to make the answer understandable, staying within 3–5 short sentences total. Avoid sounding like a lecture or essay.
```

**Detailed, thorough responses:**

```
You are a warm, professional, and helpful AI assistant. Give accurate, complete answers that sound warm, direct, and human. Answer the question directly in the first 1–2 sentences. If the question has parts or asks what/why/how, address each with a brief definition or main idea plus 2–3 key facts or steps. Offer practical, actionable advice. Keep a confident, kind, conversational tone; never robotic or theatrical. Be thorough; add examples or context only when helpful. Prefer accuracy and safety over speculation; if unsure, say so and suggest what to check.
```

### Language Mirroring
<a name="sonic-language-mirroring"></a>

Nova can recognize and respond in the language the user speaks. Use this prompt to maintain language consistency:

```
CRITICAL LANGUAGE MIRRORING RULES:
- Always reply in the language spoken. DO NOT mix with English. However, if the user talks in English, reply in English.
- Please respond in the language the user is talking to you in, If you have a question or suggestion, ask it in the language the user is talking in. I want to ensure that our communication remains in the same language as the user.
```

## Gender Agreement for Gendered Languages
<a name="sonic-gender-agreement"></a>

Some languages require gender agreement in verbs, adjectives, or pronouns when the assistant describes itself. For these languages, specify the assistant's gender in your system prompt to match your selected voice.

**Languages affected:** Hindi, Portuguese, French, Italian, Spanish, Russian, Polish

**When gender agreement matters:**
+ **Hindi:** Always needed; first-person verbs conjugate based on the speaker's gender
+ **Portuguese/French:** Needed when using past participles or adjectives (for example, "I am tired": "Estou cansada/cansado")
+ **Italian/Spanish:** Needed when using adjectives to describe oneself (for example, "I am happy": "Sono contenta/contento")

**Implementation:**

Include the appropriate gender identifier at the start of your system prompt based on your voice selection:

**For feminine-sounding voices (kiara, carolina, ambre, beatrice, lupe, tiffany):**

```
You are a warm, professional, and helpful female AI assistant.
```

**For masculine-sounding voices (arjun, leo, florian, lorenzo, carlos, matthew):**

```
You are a warm, professional, and helpful male AI assistant.
```

**Examples:**

**Hindi with feminine voice (kiara):**

```
You are a warm, professional, and helpful female AI assistant.
```

Result: "मैं अच्छी हूँ" (main achchhi hoon) vs "मैं अच्छा हूँ" (main achchha hoon)

**Italian with masculine voice (lorenzo):**

```
You are a warm, professional, and helpful male AI assistant.
```

Result: "Sono contento" vs "Sono contenta"

## Chain of thought for Speech: Constitutional Reasoning
<a name="sonic-constitutional-reasoning"></a>

Use this prompt when you want the model to show its reasoning for complex problems:

```
You are a friendly assistant. The user will give you a problem. Explain your reasoning following the guidelines given in CONSTITUTION - REASONING, and summarize your decision at the end of your response, in one sentence.

## CONSTITUTION - REASONING
1. For simple questions including simple calculations or contextual tasks: Give the answer directly. No explanation is necessary, although you can offer to provide more information if the user requests it.
2. When faced with complex problems or decisions, think through the steps systematically before providing your answer. Break down your reasoning process when it would help user understanding.
3. For subjective matters or comparisons: explain your thought process step-by-step.
```

**Note:** If you don't want the model to go through reasoning for every request, you can add a couple of few-shot examples to the prompt (see the example below).

```
You are a warm, professional, and helpful AI assistant. You converse in fluid and conversational English. Give accurate, complete answers that sound warm, direct, and human. Answer the question directly in the first 1–2 sentences. Keep a confident, kind, conversational tone; never robotic or theatrical. Avoid formatted lists or numbering and keep your output as a spoken transcript. Be concise but thorough; add examples or context only when helpful. Prefer accuracy and safety over speculation; if unsure, say so and suggest what to check. The user will give you a problem. Explain your reasoning following the guidelines given in CONSTITUTION - REASONING, and summarize your decision at the end of your response in one sentence.

## CONSTITUTION - REASONING
1. When faced with complex problems or decisions, think through the steps systematically before providing your answer. Break down your reasoning process when it would help user understanding.
2. For subjective matters or comparisons: explain your thought process step-by-step.
3. For simple questions including simple calculations or contextual tasks: Give the answer directly. No explanation is necessary, although you can offer to provide more information if the user requests it.

EXAMPLES

User: What is 7 + 5?
Assistant: 12.

User: What is the capital of India?
Assistant: Delhi is the capital of India.

User: I have a $1,000 budget for a trip. Here are my costs... Can I afford it? Please explain your reasoning.
Assistant: (step-by-step breakdown + one-sentence conclusion)
```

## Overuse of suggested phrases
<a name="sonic-overuse-phrases"></a>

**Amazon Nova 2 Sonic is more sensitive to phrase suggestions than the previous Nova Sonic model.** This increased sensitivity isn't inherently good or bad; it depends on your use case. If you want consistent, predictable phrasing, it can be beneficial. However, if you want more natural variation, explicit phrase lists can lead to overuse.

If you include prompts with explicit lists of phrases, the model will use them very frequently:

**Example 1 - Emphasis phrases:**

```
Instead of using bold or italics, emphasize important information by using phrases like "The key thing to remember is," "What's really important here is," or "I want to highlight that."
```

**Example 2 - Conversational fillers:**

```
Include natural speech elements like "Well," "You know," "Actually," "I mean," or "By the way" at appropriate moments to create an authentic, casual conversation flow.
```

**Recommendation:**
+ **If you want consistent phrasing:** Explicit phrase lists work well in Sonic 2 for creating predictable, on-brand responses.
+ **If you want natural variation:** Avoid providing explicit lists of phrases. Instead, use general guidance like "sound natural and conversational" or provide one-shot examples.

**Better approach - Use one-shot examples:**

Instead of listing phrases, provide 1-2 examples demonstrating the desired tone and style:

### Natural, helpful tone
<a name="sonic-example-natural-tone"></a>

```
You are a warm, professional, and helpful AI assistant. Sound natural and conversational in your responses.

Example:
User: How do I reset my password?
Assistant: You can reset your password by clicking the "Forgot Password" link on the login page. You'll get an email with instructions to create a new one. The whole process usually takes just a couple of minutes.
```

### Concise and direct
<a name="sonic-example-concise"></a>

```
You are a helpful AI assistant. Provide clear, direct answers without unnecessary elaboration.

Example:
User: What's the weather like today?
Assistant: It's 72 degrees and sunny with a light breeze. Perfect day to be outside.
```

### Professional with empathy
<a name="sonic-example-empathy"></a>

```
You are a professional and empathetic AI assistant. Acknowledge the user's situation while providing practical solutions.

Example:
User: I'm frustrated because my order hasn't arrived yet.
Assistant: I understand how frustrating that must be, especially when you're waiting for something important. Let me check your order status right now. Can you provide your order number?
```

### Technical but accessible
<a name="sonic-example-technical"></a>

```
You are a knowledgeable AI assistant who explains technical concepts in accessible language.

Example:
User: What is machine learning?
Assistant: Machine learning is when computers learn from examples rather than following strict rules. Think of it like teaching a child to recognize dogs—after seeing many dogs, they start recognizing new ones on their own. The computer does something similar with data.
```

This approach shows the model the desired behavior without triggering repetitive phrase patterns, while still maintaining control over tone and style.

# Core concepts
<a name="sonic-core-concepts"></a>

Amazon Nova 2 Sonic uses a bidirectional streaming architecture with structured events for real-time conversational AI. Understanding these core concepts is essential for building effective voice applications.

# Event lifecycle
<a name="sonic-event-lifecycle"></a>

The following diagram illustrates the complete bi-directional streaming event lifecycle:

![Event lifecycle diagram](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Event-Lifecycle-Diagram_1.png)


The bidirectional streaming event lifecycle follows a structured pattern from session initialization through conversation completion. Each conversation involves input events (from your application) and output events (from Amazon Nova 2 Sonic) that work together to create natural voice interactions.

# Event flow sequence
<a name="sonic-event-flow"></a>

A typical conversation follows this event sequence:

1. **Session Start** - Initialize the conversation session

1. **System Prompt** - Send system instructions

1. **Chat History** (optional) - Provide conversation context

1. **Audio Chunks** - Stream user audio input

1. **Completion Start** - AI begins processing

1. **ASR Transcripts** (USER) - User speech transcription

1. **Tool Use** (optional) - AI requests tool execution

1. **Tool Handling** (optional) - Process and return tool results

1. **Transcript** (ASSISTANT) - SPECULATIVE - Preliminary AI response

1. **Audio Chunks** - Stream AI audio output

1. **Transcript** (ASSISTANT) - FINAL - Final AI response transcript

1. **Content End Audio** - Marks the end of audio content

1. **Prompt End** - Indicates the completion of the prompt processing

1. **Session End** - Close the conversation
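As a client-side sanity check, the sequence above can be validated against the events you observe. The sketch below uses simplified event labels (not the actual wire-format keys) and checks that observed events arrive in lifecycle order, allowing gaps for optional steps.

```python
# Simplified lifecycle labels for the sequence above (not wire-format keys).
EXPECTED_ORDER = [
    "sessionStart", "systemPrompt", "audioInput",
    "completionStart", "asrTranscript", "speculativeTranscript",
    "audioOutput", "finalTranscript", "contentEnd",
    "promptEnd", "sessionEnd",
]

def in_lifecycle_order(observed):
    """True if observed events form a subsequence of the expected order."""
    it = iter(EXPECTED_ORDER)
    # `ev in it` consumes the iterator up to the match, so repeats or
    # out-of-order events fail the check.
    return all(ev in it for ev in observed)
```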

# Handling input events with the bidirectional API
<a name="sonic-input-events"></a>

The bidirectional streaming API uses an event-driven architecture with structured input and output events. Understanding the correct event ordering is crucial for implementing successful conversational applications and maintaining proper conversation state throughout interactions.

## Overview
<a name="sonic-input-overview"></a>

The Nova Sonic conversation follows a structured event sequence. You begin by sending a `sessionStart` event that contains the inference configuration parameters, such as temperature and token limits. Next, you send `promptStart` to define the audio output format and tool configurations, assigning a unique `promptName` identifier that must be included in all subsequent events.

For each interaction type (system prompt, audio, and so on), you follow a three-part pattern: use `contentStart` to define the content type and the role of the content (`SYSTEM`, `USER`, `ASSISTANT`, `TOOL`, `SYSTEM_SPEECH`), then provide the actual content event, and finish with `contentEnd` to close that segment. The `contentStart` event specifies whether you're sending tool results, streaming audio, or a system prompt. The `contentStart` event includes a unique `contentName` identifier.
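The three-part pattern can be sketched as a small generator. The helper name below is illustrative, not part of the API; the event shapes follow the schemas later in this section, shown here for a system prompt.

```python
import uuid

def text_content_events(prompt_name: str, text: str, role: str = "SYSTEM"):
    """Yield contentStart, textInput, and contentEnd for one text block."""
    content_name = str(uuid.uuid4())  # unique identifier for this content block
    yield {"event": {"contentStart": {
        "promptName": prompt_name,
        "contentName": content_name,
        "type": "TEXT",
        "interactive": False,
        "role": role,
        "textInputConfiguration": {"mediaType": "text/plain"},
    }}}
    yield {"event": {"textInput": {
        "promptName": prompt_name,
        "contentName": content_name,
        "content": text,
    }}}
    yield {"event": {"contentEnd": {
        "promptName": prompt_name,
        "contentName": content_name,
    }}}
```

The same `contentName` ties all three events together; reuse the pattern with `role="USER"` or `role="ASSISTANT"` for conversation history.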

## Conversation History
<a name="sonic-conversation-history"></a>

A conversation history can be included only once, after the system prompt and before audio streaming begins. It follows the same `contentStart`/`textInput`/`contentEnd` pattern. The `USER` and `ASSISTANT` roles must be defined in the `contentStart` event for each historical message. This provides essential context for the current conversation but must be completed before any new user input begins.

## Audio Streaming
<a name="sonic-audio-streaming"></a>

Audio streaming operates with continuous microphone sampling. After sending an initial `contentStart`, audio frames (approximately 32ms each) are captured directly from the microphone and immediately sent as `audioInput` events using the same `contentName`. These audio samples should be streamed in real-time as they're captured, maintaining the natural microphone sampling cadence throughout the conversation. All audio frames share a single content container until the conversation ends and it is explicitly closed.
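As a rough sketch of the frame math: at 16 kHz mono with 16-bit samples, a 32 ms frame is 16000 × 2 × 0.032 = 1024 bytes. The helper below (an illustrative name, not SDK code) slices raw PCM into frames and wraps each as an `audioInput` event with base64-encoded content, per the schema later in this section.

```python
import base64

# 16 kHz * 2 bytes per sample * 0.032 s = 1024 bytes per ~32 ms frame
# (assumes 16 kHz, 16-bit mono input; adjust for other sample rates).
FRAME_BYTES = 1024

def audio_input_events(prompt_name, content_name, pcm_bytes):
    """Yield one audioInput event per ~32 ms frame of raw PCM audio."""
    for offset in range(0, len(pcm_bytes), FRAME_BYTES):
        frame = pcm_bytes[offset:offset + FRAME_BYTES]
        yield {"event": {"audioInput": {
            "promptName": prompt_name,
            "contentName": content_name,  # same container for all frames
            "content": base64.b64encode(frame).decode("ascii"),
        }}}
```

In a live client you would send each event as the microphone produces the frame, rather than batching a full buffer as this sketch does.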

## Closing the Session
<a name="sonic-closing-session"></a>

After the conversation ends or needs to be terminated, it's essential to properly close all open streams and end the session in the correct sequence. To properly end a session and avoid resource leaks, you must follow a specific closing sequence:
+ Close any open audio streams with the `contentEnd` event.
+ Send a `promptEnd` event that references the original `promptName`.
+ Send the `sessionEnd` event.

Skipping any of these closing events can result in incomplete conversations or orphaned resources.

These identifiers create a hierarchical structure: the `promptName` ties all conversation events together, while each `contentName` marks the boundaries of specific content blocks. This hierarchy ensures that the model maintains proper context throughout the interaction.

![Closing the session diagram](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Closing-the-session_2.png)


## Input Event Flow
<a name="sonic-input-event-flow"></a>

The structure of the input event flow is provided in this section.

### 1. RequestStartEvent (Session Start)
<a name="sonic-session-start-event"></a>

The session start event initializes the conversation with inference configuration and turn detection settings.

**Inference Configuration:**
+ `maxTokens`: Maximum number of tokens to generate in the response
+ `topP`: Nucleus sampling parameter (0.0 to 1.0) for controlling randomness
+ `temperature`: Controls randomness in generation (0.0 to 1.0)

**Turn Detection Configuration:** The `endpointingSensitivity` parameter controls how quickly Nova Sonic detects when a user has finished speaking:
+ `HIGH`: Detects pauses quickly, enabling faster responses but may cut off slower speakers
+ `MEDIUM`: Balanced sensitivity for most conversational scenarios (recommended default)
+ `LOW`: Waits longer before detecting end of speech, better for thoughtful or hesitant speakers

```
{
    "event": {
        "sessionStart": {
            "inferenceConfiguration": {
                "maxTokens": "int",
                "topP": "float",
                "temperature": "float"
            },
            "turnDetectionConfiguration": {
                "endpointingSensitivity": "HIGH" | "MEDIUM" | "LOW"
            }
        }
    }
}
```

**Example:**

```
{
    "event": {
        "sessionStart": {
            "inferenceConfiguration": {
                "maxTokens": 2048,
                "topP": 0.9,
                "temperature": 0.7
            },
            "turnDetectionConfiguration": {
                "endpointingSensitivity": "MEDIUM"
            }
        }
    }
}
```

### 2. PromptStartEvent
<a name="sonic-prompt-start-event"></a>

The prompt start event defines the conversation configuration including output formats, voice selection, and available tools.

For a list of available voice IDs, refer to [Language support and multilingual capabilities](https://docs.aws.amazon.com/nova/latest/nova2-userguide/sonic-language-support.html).

```
{
    "event": {
        "promptStart": {
            "promptName": "string", // unique identifier, the same across all events (for example, a UUID)
            "textOutputConfiguration": {
                "mediaType": "text/plain"
            },
            "audioOutputConfiguration": {
                "mediaType": "audio/lpcm",
                "sampleRateHertz": 8000 | 16000 | 24000,
                "sampleSizeBits": 16,
                "channelCount": 1,
                "voiceId": "matthew" | "tiffany" | "amy" | "olivia" | "lupe" | "carlos" | "ambre" | "florian" | "lennart" | "beatrice" | "lorenzo" |
                        "tina" | "carolina" | "leo" | "kiara" | "arjun",
                "encoding": "base64",
                "audioType": "SPEECH"
            },
            "toolUseOutputConfiguration": {
                "mediaType": "application/json"
            },
            "toolConfiguration": {
                "tools": [
                    {
                        "toolSpec": {
                            "name": "string",
                            "description": "string",
                            "inputSchema": {
                                "json": "{}"
                            }
                        }
                    }
                ]
            }
        }
    }
}
```
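As a sketch of assembling this event in code, the following builds a `promptStart` payload with one made-up tool (`getWeather` is a hypothetical name used only for illustration). Note that `inputSchema.json` must be a JSON-encoded string, not a nested object.

```python
import json

def prompt_start_event(prompt_name, voice_id="matthew", tools=()):
    """Build a promptStart event following the schema above."""
    return {"event": {"promptStart": {
        "promptName": prompt_name,
        "textOutputConfiguration": {"mediaType": "text/plain"},
        "audioOutputConfiguration": {
            "mediaType": "audio/lpcm",
            "sampleRateHertz": 24000,
            "sampleSizeBits": 16,
            "channelCount": 1,
            "voiceId": voice_id,
            "encoding": "base64",
            "audioType": "SPEECH",
        },
        "toolUseOutputConfiguration": {"mediaType": "application/json"},
        "toolConfiguration": {"tools": list(tools)},
    }}}

# Hypothetical tool spec; the inputSchema value is a JSON string.
weather_tool = {"toolSpec": {
    "name": "getWeather",
    "description": "Look up current weather for a city.",
    "inputSchema": {"json": json.dumps({
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    })},
}}
```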

### 3. InputContentStartEvent
<a name="sonic-content-start-event"></a>

#### Text
<a name="sonic-content-start-text"></a>

The text content start event is used for system prompts, conversation history, and cross-modal text input.

**Interactive Parameter:**
+ `true`: Enables cross-modal input, allowing text messages during an active voice session
+ `false`: Standard text input for system prompts and conversation history

**Role Types:**
+ `SYSTEM`: System instructions and prompts
+ `USER`: User messages in conversation history or cross-modal input
+ `ASSISTANT`: Assistant responses in conversation history
+ `SYSTEM_SPEECH`: Controls transcription formatting for Hindi code-switching (Latin/Devanagari/mixed scripts)

```
{
    "event": {
        "contentStart": {
            "promptName": "string", // same unique identifier from promptStart event
            "contentName": "string", // unique identifier for the content block
            "type": "TEXT",
            "interactive": "boolean", // true for cross-modal input
            "role": "SYSTEM" | "USER" | "ASSISTANT" | "TOOL" | "SYSTEM_SPEECH",
            "textInputConfiguration": {
                "mediaType": "text/plain"
            }
        }
    }
}
```

**Example - System Prompt:**

```
{
    "event": {
        "contentStart": {
            "promptName": "conv-12345",
            "contentName": "system-prompt-1",
            "type": "TEXT",
            "interactive": false,
            "role": "SYSTEM",
            "textInputConfiguration": {
                "mediaType": "text/plain"
            }
        }
    }
}
```

**Example - Cross-modal Input:**

```
{
    "event": {
        "contentStart": {
            "promptName": "conv-12345",
            "contentName": "user-text-1",
            "type": "TEXT",
            "interactive": true,
            "role": "USER",
            "textInputConfiguration": {
                "mediaType": "text/plain"
            }
        }
    }
}
```

#### Audio
<a name="sonic-content-start-audio"></a>

```
{
    "event": {
        "contentStart": {
            "promptName": "string", // same unique identifier from promptStart event
            "contentName": "string", // unique identifier for the content block
            "type": "AUDIO",
            "interactive": true,
            "role": "USER",
            "audioInputConfiguration": {
                "mediaType": "audio/lpcm",
                "sampleRateHertz": 8000 | 16000 | 24000,
                "sampleSizeBits": 16,
                "channelCount": 1,
                "audioType": "SPEECH",
                "encoding": "base64"
            }
        }
    }
}
```

#### Tool
<a name="sonic-content-start-tool"></a>

```
{
    "event": {
        "contentStart": {
            "promptName": "string", // same unique identifier from promptStart event
            "contentName": "string", // unique identifier for the content block
            "interactive": false,
            "type": "TOOL",
            "role": "TOOL",
            "toolResultInputConfiguration": {
                "toolUseId": "string", // existing tool use id
                "type": "TEXT",
                "textInputConfiguration": {
                    "mediaType": "text/plain"
                }
            }
        }
    }
}
```

### 4. TextInputContent
<a name="sonic-text-input-event"></a>

```
{
    "event": {
        "textInput": {
            "promptName": "string", // same unique identifier from promptStart event
            "contentName": "string", // unique identifier for the content block
            "content": "string"
        }
    }
}
```

### 5. AudioInputContent
<a name="sonic-audio-input-event"></a>

```
{
    "event": {
        "audioInput": {
            "promptName": "string", // same unique identifier from promptStart event
            "contentName": "string", // same unique identifier from its contentStart
            "content": "base64EncodedAudioData"
        }
    }
}
```

### 6. ToolResultContentEvent
<a name="sonic-tool-result-event"></a>

```
{
    "event": {
        "toolResult": {
            "promptName": "string", // same unique identifier from promptStart event
            "contentName": "string", // same unique identifier from its contentStart
            "content": "{\"key\": \"value\"}" // stringified JSON object as a tool result
        }
    }
}

### 7. InputContentEndEvent
<a name="sonic-content-end-event"></a>

```
{
    "event": {
        "contentEnd": {
            "promptName": "string", // same unique identifier from promptStart event
            "contentName": "string" // same unique identifier from its contentStart
        }
    }
}
```

### 8. PromptEndEvent
<a name="sonic-prompt-end-event"></a>

```
{
    "event": {
        "promptEnd": {
            "promptName": "string" // same unique identifier from promptStart event
        }
    }
}
```

### 9. RequestEndEvent
<a name="sonic-session-end-event"></a>

```
{
    "event": {
        "sessionEnd": {}
    }
}
```

# Handling output events with the bidirectional API
<a name="sonic-output-events"></a>

When the Amazon Nova Sonic model responds, it follows a structured event sequence. The flow begins with a `completionStart` event that contains unique identifiers like `sessionId`, `promptName`, and `completionId`. These identifiers remain consistent throughout the response cycle and tie together all subsequent response events.

## Overview
<a name="sonic-output-overview"></a>

Each response type follows a consistent three-part pattern: `contentStart` defines the content type and format, the actual content event, and `contentEnd` closes that segment. The response typically includes multiple content blocks in sequence: automatic speech recognition (ASR) transcription (what the user said), optional tool use (when external information is needed), text response (what the model plans to say), and audio response (the spoken output).
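Because every output event arrives wrapped in an `event` object with a single top-level key, a client can route events with a small dispatcher. The handler registry below is an illustrative pattern, not SDK code.

```python
def dispatch(event_wrapper, handlers):
    """Route an output event to a handler keyed by its top-level event name."""
    # Each wrapper holds exactly one key under "event" (e.g. "textOutput").
    (name, payload), = event_wrapper["event"].items()
    handler = handlers.get(name)
    return handler(payload) if handler else None  # ignore unhandled events
```

A real client would register handlers for `contentStart`, `textOutput`, `audioOutput`, `toolUse`, `contentEnd`, `usageEvent`, and `completionEnd`.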

## Response Content Types
<a name="sonic-response-content-types"></a>

### ASR Transcription
<a name="sonic-asr-transcription"></a>

The ASR transcription appears first, delivering the model's understanding of the user's speech with `role: "USER"` and `"additionalModelFields": "{\"generationStage\":\"FINAL\"}"` in the `contentStart`.

### Tool Use
<a name="sonic-tool-use-response"></a>

When the model needs external data, it sends tool-related events with specific tool names and parameters.

### Text Response
<a name="sonic-text-response"></a>

The text response provides a preview of the planned speech with `role: "ASSISTANT"` and `"additionalModelFields": "{\"generationStage\":\"SPECULATIVE\"}"`.

### Audio Response
<a name="sonic-audio-response"></a>

The audio response then delivers base64-encoded speech chunks sharing the same `contentId` throughout the stream.

## Barge-In Support
<a name="sonic-barge-in-support"></a>

During audio generation, Amazon Nova Sonic supports natural conversation flow through its barge-in capability. When a user interrupts Amazon Nova Sonic while it's speaking, Nova Sonic immediately stops generating speech, switches to listening mode, and sends a content notification indicating the interruption has occurred. Because Nova Sonic operates faster than real-time, some audio may have already been delivered but not yet played. The interruption notification enables the client application to clear its audio queue and stop playback immediately, creating a responsive conversational experience.

## Final Transcription
<a name="sonic-final-transcription"></a>

After audio generation completes (or is interrupted via barge-in), Amazon Nova Sonic provides an additional text response that contains a sentence-level transcription of what was actually spoken. This text response includes a `contentStart` event with `role: "ASSISTANT"` and `"additionalModelFields": "{\"generationStage\":\"FINAL\"}"`.
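Since `additionalModelFields` arrives as a JSON-encoded string, distinguishing `SPECULATIVE` from `FINAL` transcripts requires a second parse. A minimal sketch:

```python
import json

def generation_stage(content_start):
    """Extract generationStage ("SPECULATIVE" or "FINAL") from a contentStart payload."""
    fields = content_start.get("additionalModelFields", "{}")
    return json.loads(fields).get("generationStage")
```

Clients typically display the speculative transcript immediately and replace it once the final transcript arrives.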

## Usage Tracking
<a name="sonic-usage-tracking"></a>

Throughout the response handling, `usageEvent` events are sent to track token consumption. These events contain detailed metrics on input tokens and output tokens (both speech and text), and their cumulative totals. Each `usageEvent` maintains the same `sessionId`, `promptName`, and `completionId` as other events in the conversation flow. The details section provides both incremental changes (delta) and running totals of token usage, enabling precise monitoring of the usage during the conversation.
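A client can accumulate the per-event deltas itself for monitoring or billing alerts. The sketch below sums speech and text tokens using the `usageEvent` field names shown in this section.

```python
def accumulate_usage(usage_events):
    """Sum speech and text token deltas for input and output across events."""
    totals = {"input": 0, "output": 0}
    for ev in usage_events:
        delta = ev["usageEvent"]["details"]["delta"]
        for direction in ("input", "output"):
            totals[direction] += delta[direction]["speechTokens"]
            totals[direction] += delta[direction]["textTokens"]
    return totals
```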

## Completion
<a name="sonic-completion"></a>

The model sends a `completionEnd` event with the original identifiers and a `stopReason` that indicates how the conversation ended. This event hierarchy ensures your application can track which parts of the response belong together and process them accordingly, maintaining conversation context throughout multiple turns.

The output event flow proceeds through the response generation phase: automatic speech recognition, optional tool use, the speculative text transcript, audio generation, the final transcript, and session completion.

![Output event flow diagram](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Output-Event-Flow_3.png)


## Output Event Flow
<a name="sonic-output-event-flow"></a>

The structure of the output event flow is described in this section.

### 1. UsageEvent
<a name="sonic-usage-event"></a>

```
"event": {
    "usageEvent": {
        "completionId": "string", // unique identifier for completion
        "details": {
            "delta": { // incremental changes since last event
                "input": {
                    "speechTokens": number, // input speech tokens
                    "textTokens": number // input text tokens
                },
                "output": {
                    "speechTokens": number, // speech tokens generated
                    "textTokens": number // text tokens generated
                }
            },
            "total": { // cumulative counts
                "input": {
                    "speechTokens": number, // total speech tokens processed
                    "textTokens": number // total text tokens processed
                },
                "output": {
                    "speechTokens": number, // total speech tokens generated
                    "textTokens": number // total text tokens generated
                }
            }
        },
        "promptName": "string", // same unique identifier from promptStart event
        "sessionId": "string", // unique identifier
        "totalInputTokens": number, // cumulative input tokens
        "totalOutputTokens": number, // cumulative output tokens
        "totalTokens": number // total tokens in the session
    }
}
```

### 2. CompleteStartEvent
<a name="sonic-completion-start-event"></a>

```
"event": {
    "completionStart": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string" // unique identifier
    }
}
```

### 3. TextOutputContent
<a name="sonic-text-output-content"></a>

#### ContentStart
<a name="sonic-text-output-content-start"></a>

```
"event": {
    "contentStart": {
        "additionalModelFields": "{\"generationStage\":\"FINAL\"}" | "{\"generationStage\":\"SPECULATIVE\"}",
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "contentId": "string", // unique identifier for the content block
        "type": "TEXT",
        "role": "USER" | "ASSISTANT",
        "textOutputConfiguration": {
            "mediaType": "text/plain"
        }
    }
}
```

#### TextOutput
<a name="sonic-text-output"></a>

```
"event": {
    "textOutput": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "contentId": "string", // same unique identifier from its contentStart
        "content": "string" // user transcript (ASR) or assistant text response
    }
}
```

#### ContentEnd
<a name="sonic-text-output-content-end"></a>

```
"event": {
    "contentEnd": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "contentId": "string", // same unique identifier from its contentStart
        "stopReason": "PARTIAL_TURN" | "END_TURN" | "INTERRUPTED",
        "type": "TEXT"
    }
}
```

### 4. ToolUse
<a name="sonic-tool-use-output"></a>

#### ContentStart
<a name="sonic-tool-use-content-start"></a>

```
"event": {
    "contentStart": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "contentId": "string", // unique identifier for the content block
        "type": "TOOL",
        "role": "TOOL",
        "toolUseOutputConfiguration": {
            "mediaType": "application/json"
        }
    }
}
```

#### ToolUse
<a name="sonic-tool-use-event"></a>

```
"event": {
    "toolUse": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "contentId": "string", // same unique identifier from its contentStart
        "content": "json",
        "toolName": "string",
        "toolUseId": "string"
    }
}
```

#### ContentEnd
<a name="sonic-tool-use-content-end"></a>

```
"event": {
    "contentEnd": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "contentId": "string", // same unique identifier from its contentStart
        "stopReason": "TOOL_USE",
        "type": "TOOL"
    }
}
```

### 5. AudioOutputContent
<a name="sonic-audio-output-content"></a>

#### ContentStart
<a name="sonic-audio-output-content-start"></a>

```
"event": {
    "contentStart": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "contentId": "string", // unique identifier for the content block
        "type": "AUDIO",
        "role": "ASSISTANT",
        "audioOutputConfiguration": {
            "mediaType": "audio/lpcm",
            "sampleRateHertz": 8000 | 16000 | 24000,
            "sampleSizeBits": 16,
            "encoding": "base64",
            "channelCount": 1
        }
    }
}
```

#### AudioOutput
<a name="sonic-audio-output-event"></a>

```
"event": {
    "audioOutput": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "contentId": "string", // same unique identifier from its contentStart
        "content": "base64EncodedAudioData" // base64-encoded audio chunk
    }
}
```

#### ContentEnd
<a name="sonic-audio-output-content-end"></a>

```
"event": {
    "contentEnd": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "contentId": "string", // same unique identifier from its contentStart
        "stopReason": "PARTIAL_TURN" | "END_TURN",
        "type": "AUDIO"
    }
}
```

### 6. CompletionEndEvent
<a name="sonic-completion-end-event"></a>

```
"event": {
    "completionEnd": {
        "sessionId": "string", // unique identifier
        "promptName": "string", // same unique identifier from promptStart event
        "completionId": "string", // unique identifier
        "stopReason": "END_TURN" 
    }
}
```

# Barge-in
<a name="sonic-barge-in"></a>

Barge-in allows users to interrupt the AI assistant while it's speaking, just like in natural human conversations. Instead of waiting for the assistant to finish, users can interject with new information, correct or clarify their previous statement, redirect the conversation to a different topic, or simply stop the assistant when they've heard enough. This creates a more natural and responsive conversational experience.

The following diagram illustrates the complete barge-in conversation flow:

![Barge-in conversation flow diagram](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Barge-In-Flow_8.png)


## How Amazon Nova 2 Sonic handles barge-in
<a name="sonic-barge-in-handling"></a>

Amazon Nova 2 Sonic is designed to handle interruptions gracefully. When the user starts speaking during a response, the system immediately stops generating the current response, maintains full conversational context, sends an interruption signal to the client, and begins processing the new user input.

**Context Preservation:** Even when interrupted, Nova Sonic remembers what was said before the interruption, the topic being discussed, the conversation history and any relevant context from previous turns. This ensures the conversation remains coherent and natural.

## Client-side implementation requirements
<a name="sonic-barge-in-client"></a>

While Amazon Nova 2 Sonic handles barge-in on the server side, you need to implement client-side logic for a complete experience.

**The audio queue challenge:** Audio generation is faster than real-time playback. This means:
+ Nova Sonic generates audio chunks quickly
+ Your client receives and queues these chunks
+ The client plays them back at normal speaking speed
+ When a barge-in occurs, there's already audio queued for playback

**Required client-side logic:** Your application must handle four key steps:

1. **Detect the Interruption Signal:** Listen for the interruption event from Nova Sonic and react immediately when received.

1. **Stop Current Playback:** Pause the currently playing audio and stop any audio that's mid-playback.

1. **Clear the Audio Queue:** Remove all queued audio chunks and discard any buffered audio from the interrupted response.

1. **Start New Audio:** Begin playing the newly received audio and resume normal playback flow.
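
The four steps above can be sketched as a small playback controller. This is a minimal illustration, not a complete audio pipeline; `InterruptiblePlayer` and the `output_stream` wrapper are hypothetical names, and detecting the interruption signal (step 1) happens in your event loop, which then calls `handle_interruption`.

```
import queue

class InterruptiblePlayer:
    """Illustrative client-side audio queue with barge-in support."""

    def __init__(self, output_stream):
        self.audio_queue = queue.Queue()
        self.output_stream = output_stream  # your audio device wrapper

    def enqueue_audio(self, chunk):
        # Chunks arrive faster than real time; queue them for playback.
        self.audio_queue.put(chunk)

    def handle_interruption(self):
        # Step 1 happens in your event loop: call this method when the
        # interruption signal is received.
        # Step 2: stop the audio that is mid-playback.
        self.output_stream.stop()
        # Step 3: discard every queued chunk from the interrupted response.
        while True:
            try:
                self.audio_queue.get_nowait()
            except queue.Empty:
                break
        # Step 4: restart the stream; newly received chunks play normally.
        self.output_stream.start()
```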

# Turn-taking controllability
<a name="sonic-turn-taking"></a>

Turn-taking is a fundamental aspect of natural conversation. Amazon Nova 2 Sonic provides fine-grained control over when the AI takes its turn to speak through the turnDetectionConfiguration parameter. This allows you to optimize the conversation flow for different use cases, balancing responsiveness with accuracy. The endpointingSensitivity parameter controls how quickly Amazon Nova 2 Sonic detects the end of a user's turn and begins responding. This setting affects both the latency of responses and the likelihood of interrupting users who are still speaking.

## API configuration
<a name="sonic-turn-taking-config"></a>

Configure turn detection sensitivity in the sessionStart event:

```
{
    "event": {
        "sessionStart": {
            "inferenceConfiguration": {
                "maxTokens": 1000,
                "topP": 0.9,
                "temperature": 0.7
            },
            "turnDetectionConfiguration": {
                "endpointingSensitivity": "HIGH" | "MEDIUM" | "LOW"
            }
        }
    }
}
```

## Sensitivity levels
<a name="sonic-turn-taking-levels"></a>

The endpointingSensitivity parameter accepts three values: HIGH, MEDIUM, and LOW. Each level balances response speed against the risk of interrupting users who are still speaking.

HIGH  
Fastest response time, optimized for latency. Nova Sonic responds as quickly as possible after detecting the end of speech. Pause duration: 1.5 seconds. Best for quick Q and A, command-and-control applications, and time-sensitive interactions.

MEDIUM  
Balanced approach with moderate response time. Reduces false positives while maintaining responsiveness. Pause duration: 1.75 seconds. Best for general conversations, customer service with complex queries, and multi-turn discussions.

LOW  
Slowest response time with maximum patience. Nova Sonic waits the longest before responding, minimizing interruptions of users who pause while thinking. Pause duration: 2 seconds. Best for thoughtful conversations, elderly or speech-impaired users, and complex problem-solving.

## Pause duration reference
<a name="pause-duration-reference"></a>


| Sensitivity level | Pause duration (seconds) | 
| --- | --- | 
| High (fast) | 1.5 | 
| Medium | 1.75 | 
| Low (slow) | 2.0 | 
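
A small helper can make the sensitivity choice explicit and validated at session setup. This sketch only builds the `sessionStart` payload shown above (the inference values are the ones from the earlier example); sending it over the bidirectional stream is left to your client.

```
VALID_SENSITIVITIES = {"HIGH", "MEDIUM", "LOW"}

def build_session_start(sensitivity="MEDIUM"):
    """Build a sessionStart event with the given endpointing sensitivity."""
    if sensitivity not in VALID_SENSITIVITIES:
        raise ValueError(
            f"endpointingSensitivity must be one of {sorted(VALID_SENSITIVITIES)}"
        )
    return {
        "event": {
            "sessionStart": {
                "inferenceConfiguration": {
                    "maxTokens": 1000,
                    "topP": 0.9,
                    "temperature": 0.7,
                },
                "turnDetectionConfiguration": {
                    "endpointingSensitivity": sensitivity,
                },
            }
        }
    }
```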

# Cross-modal input
<a name="sonic-cross-modal"></a>

Amazon Nova 2 Sonic now supports cross-modal input, allowing you to send text messages in addition to voice input during a conversation session. While speech remains the primary mode of interaction, text input provides flexibility for scenarios where typing is more convenient or appropriate.

**Continuous streaming required:** Cross-modal input requires an active streaming session to function properly. The session must maintain continuous streaming like a regular voice session; otherwise, standard session timeouts apply and the connection is terminated.

Cross-modal text input is useful for scenarios such as:
+ Client-side app integration (web and mobile): Allows users to interact with the application using both text and voice, supporting seamless multimodal experiences.
+ "Model-start-first" use case: A text message can be sent immediately after the session starts to prompt the model to begin speaking.
+ Guiding the model during async tool calling: When a toolUse event is triggered and the system begins processing tool calls, the client can send a text message to Sonic to provide a natural response while waiting — such as, “Hold on a second while I process your information. In the meantime, is there anything else I can assist with?” 
+ Telephony DTMF integration: A customer uses the phone keypad to enter sensitive information (such as credit card numbers). Note: Amazon Nova 2 Sonic does not process DTMF tones natively. To support DTMF input, your system must detect the tones, convert them to text (such as "1234"), and send the text to Nova 2 Sonic.

## How it works
<a name="sonic-cross-modal-works"></a>

Cross-modal input uses a three-event sequence similar to audio input:

1. **Content Start Event:** Signals the beginning of text input

1. **Text Input Event:** Contains the actual text message

1. **Content End Event:** Signals the completion of text input

All three events must use the same promptName and contentName to maintain the sequence. A new UUID should be generated for contentName each time you send text input to ensure proper multi-turn conversation tracking.
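
The three-event sequence can be generated with a helper like the following sketch; `text_input_events` is an illustrative name, and the UUID handling follows the guidance above.

```
import uuid

def text_input_events(prompt_name, message):
    """Build the contentStart/textInput/contentEnd sequence for one text message."""
    # A fresh contentName per message keeps multi-turn tracking correct.
    content_name = str(uuid.uuid4())
    return [
        {"event": {"contentStart": {
            "promptName": prompt_name,
            "contentName": content_name,
            "role": "USER",
            "type": "TEXT",
            "interactive": True,
            "textInputConfiguration": {"mediaType": "text/plain"},
        }}},
        {"event": {"textInput": {
            "promptName": prompt_name,
            "contentName": content_name,
            "content": message,
        }}},
        {"event": {"contentEnd": {
            "promptName": prompt_name,
            "contentName": content_name,
        }}},
    ]
```

The same sequence covers the DTMF scenario: after your telephony layer decodes the tones, send the digits as the message text.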

## Event structure
<a name="sonic-cross-modal-events"></a>

### 1. Content Start Event
<a name="cross-modal-events-start-event"></a>

Initiates the text input sequence with configuration details:

```
{
  "event": {
    "contentStart": {
      "promptName": "<prompt_name>",
      "contentName": "<new_content_name>",
      "role": "USER",
      "type": "TEXT",
      "interactive": true,
      "textInputConfiguration": {
        "mediaType": "text/plain"
      }
    }
  }
}
```

Key Parameters:
+ `promptName`: The name of your conversation prompt (consistent across the session)
+ `contentName`: A unique identifier for this text input (generate a new UUID for each message)
+ `role`: Set to `"USER"` to indicate user input
+ `type`: Set to `"TEXT"` for text input
+ `interactive`: Set to `true` to enable interactive mode
+ `mediaType`: Set to `"text/plain"` for plain text content

### 2. Text Input Event
<a name="cross-modal-events-text-input-event"></a>

Contains the actual text message content:

```
{
  "event": {
    "textInput": {
      "promptName": "<prompt_name>",
      "contentName": "<new_content_name>",
      "content": "<your_text_message>"
    }
  }
}
```

Key Parameters:
+ `promptName`: Must match the value from Content Start Event
+ `contentName`: Must match the value from Content Start Event
+ `content`: Your text message string

### 3. Content End Event
<a name="cross-modal-events-content-end-event"></a>

Signals the completion of the text input:

```
{
  "event": {
    "contentEnd": {
      "promptName": "<prompt_name>",
      "contentName": "<new_content_name>"
    }
  }
}
```

Key Parameters:
+ `promptName`: Must match the value from previous events
+ `contentName`: Must match the value from previous events

# Language support and multilingual capabilities
<a name="sonic-language-support"></a>

Amazon Nova 2 Sonic provides a diverse selection of voices across multiple languages, enabling you to create conversational AI applications that feel natural and culturally appropriate for your users. Each language offers both feminine-sounding and masculine-sounding voice options.

The following table lists all available voices and their corresponding language locales:


| Language | Locale | Feminine-sounding Voice ID | Masculine-sounding Voice ID | 
| --- | --- | --- | --- | 
| English (US) | en-US | tiffany | matthew | 
| English (UK) | en-GB | amy | - | 
| English (Australia) | en-AU | olivia | - | 
| English (Indian) | en-IN | kiara | arjun | 
| French | fr-FR | ambre | florian | 
| Italian | it-IT | beatrice | lorenzo | 
| German | de-DE | tina | lennart | 
| Spanish (US) | es-US | lupe | carlos | 
| Portuguese | pt-BR | carolina | leo | 
| Hindi | hi-IN | kiara | arjun | 

## Specifying a voice in your application
<a name="sonic-event-voice"></a>

You can specify the voice ID in the `audioOutputConfiguration` when starting a prompt in the `promptStart` event:

```
"event": {
        "promptStart": {
            "promptName": "string",
            "audioOutputConfiguration": {
                "mediaType": "audio/lpcm",
                "sampleRateHertz": 16000,
                "sampleSizeBits": 16,
                "channelCount": 1,
                "voiceId": "tiffany",
                "encoding": "base64",
                "audioType": "SPEECH"
            }
        }
    }
```

## Multilingual support
<a name="sonic-polyglot-voices"></a>

Amazon Nova 2 Sonic provides powerful multilingual capabilities that enable natural conversations across multiple languages. The service supports both polyglot voices (speaking multiple languages) and code-switching (mixing languages within the same sentence), allowing you to build truly global conversational applications.

Tiffany (en-US, feminine-sounding) and Matthew (en-US, masculine-sounding) are polyglot voices that can speak all supported languages:

1. English

1. French

1. Italian

1. German

1. Spanish

1. Portuguese

1. Hindi

This makes Tiffany and Matthew ideal for applications that need to switch between multiple languages seamlessly.

# Managing chat history
<a name="sonic-chat-history"></a>

Amazon Nova 2 Sonic responses include ASR (Automatic Speech Recognition) transcripts for both user and assistant voices. Storing chat history is a best practice—not only for logging purposes but also for resuming sessions when the connection is unexpectedly closed. This allows the client to send context back to Nova Sonic to continue the conversation seamlessly.

Refer to the following resources for more information on managing chat history:

1. [Logging chat history](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/amazon-nova-2-sonic/repeatable-patterns/chat-history-logger)

1. [Resuming conversations](https://github.com/aws-samples/amazon-nova-samples/tree/main/speech-to-speech/repeatable-patterns/resume-conversation)

## Sending chat history
<a name="sonic-chat-history-sending"></a>

A conversation history can be included only once, after the system/speech prompt and before audio streaming begins. Overall chat history cannot be larger than 40KB. The following diagram shows when chat history is passed in during the event lifecycle:

![Chat history placement in the event lifecycle](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Sending-Chat-History_4.png)


Each historical message requires three events: `contentStart`, `textInput` and `contentEnd`.

**Event schema per message:**
+ `contentStart` - Defines the message role and configuration

  ```
  {
    "event": {
      "contentStart": {
        "promptName": "<prompt-id>",
        "contentName": "<content-id>",
        "type": "TEXT",
        "interactive": true,
        "role": "ASSISTANT",
        "textInputConfiguration": {
          "mediaType": "text/plain"
        }
      }
    }
  }
  ```
+ `textInput` - Contains the actual message content. A single textInput cannot be larger than 1KB; if a message exceeds this, split it into multiple textInput events within the same content block. If the overall conversation exceeds 40KB, trim the chat history.

  ```
  {
    "event": {
      "textInput": {
        "promptName": "<prompt-id>",
        "contentName": "<content-id>",
        "content": "Take your time, Don. I'll be here when you're ready."
      }
    }
  }
  ```
+ `contentEnd` - Marks the end of the message

  ```
  {
    "event": {
      "contentEnd": {
        "promptName": "<prompt-id>",
        "contentName": "<content-id>"
      }
    }
  }
  ```

Repeat these three events for each message in your chat history, alternating between USER and ASSISTANT roles.

**Important considerations:**
+ Chat history can only be included **once** per session
+ Chat history must be sent **after the system prompt** and **before audio streaming begins**
+ All historical messages must be sent before starting the audio streaming
+ Each message must specify either USER or ASSISTANT role
+ Use the stored transcript content from textOutput events as the content value in textInput
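
A sketch of replaying stored history under these constraints follows. `history_events` is an illustrative helper; the trimming strategy (drop oldest messages first) and measuring size on the message text are assumptions you may want to adjust.

```
import uuid

MAX_CHUNK_BYTES = 1024         # one textInput must stay under 1 KB
MAX_HISTORY_BYTES = 40 * 1024  # overall chat history limit

def history_events(prompt_name, messages):
    """Convert [{'role': 'USER'|'ASSISTANT', 'content': str}, ...] into replay events."""
    # Assumption: trim oldest-first, measuring size on the message text.
    while sum(len(m["content"].encode()) for m in messages) > MAX_HISTORY_BYTES:
        messages = messages[1:]

    events = []
    for msg in messages:
        content_name = str(uuid.uuid4())
        events.append({"event": {"contentStart": {
            "promptName": prompt_name,
            "contentName": content_name,
            "type": "TEXT",
            "interactive": True,
            "role": msg["role"],
            "textInputConfiguration": {"mediaType": "text/plain"},
        }}})
        # Split long messages into multiple textInput events in the same block.
        data = msg["content"]
        for i in range(0, max(len(data), 1), MAX_CHUNK_BYTES):
            events.append({"event": {"textInput": {
                "promptName": prompt_name,
                "contentName": content_name,
                "content": data[i:i + MAX_CHUNK_BYTES],
            }}})
        events.append({"event": {"contentEnd": {
            "promptName": prompt_name,
            "contentName": content_name,
        }}})
    return events
```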

## Receiving ASR transcripts
<a name="sonic-chat-history-receiving"></a>

During a conversation, Amazon Nova 2 Sonic sends ASR transcripts through output events. Each transcript is delivered as a sequence of three events: contentStart, textOutput, and contentEnd.

**Example: User speech transcript:**

1. contentStart - Indicates the beginning of a transcript:

```
{
  "event": {
    "contentStart": {
      "additionalModelFields": "{\"generationStage\":\"FINAL\"}",
      "completionId": "<completion-id>",
      "contentId": "<content-id>",
      "promptName": "<prompt-id>",
      "role": "USER",
      "sessionId": "<session-id>",
      "textOutputConfiguration": {
        "mediaType": "text/plain"
      },
      "type": "TEXT"
    }
  }
}
```

2. textOutput - Contains the actual transcript content:

```
{
  "event": {
    "textOutput": {
      "completionId": "<completion-id>",
      "content": "hello how are you",
      "contentId": "<content-id>",
      "promptName": "<prompt-id>",
      "role": "USER",
      "sessionId": "<session-id>"
    }
  }
}
```

3. contentEnd - Marks the end of the transcript:

```
{
  "event": {
    "contentEnd": {
      "completionId": "<completion-id>",
      "contentId": "<content-id>",
      "promptName": "<prompt-id>",
      "sessionId": "<session-id>",
      "stopReason": "PARTIAL_TURN",
      "type": "TEXT"
    }
  }
}
```

The same three-event pattern applies for both USER and ASSISTANT roles. Extract the `content` field from the `textOutput` event and the `role` field from the `contentStart` event to build your chat history.

## Best practices
<a name="sonic-chat-history-best-practices"></a>

Always store chat history to enable:
+ Session resumption across different devices
+ Conversation logging and auditing
+ Context preservation for follow-up interactions

Important: When saving chat history, use text outputs based on their generationStage:
+ Speculative - A preview of what Nova 2 Sonic plans to say, generated before audio synthesis begins
+ Final - The actual sentence-level transcription of what was spoken in the audio response

Always save the FINAL text output to your chat history, as it represents the accurate record of the conversation.

Example of FINAL output (save this to chat history):

```
ContentStart event: { 
  "additionalModelFields": "{\"generationStage\":\"FINAL\"}", 
  "completionId": "<completion-id>", 
  "contentId": "<content-id>", 
  "role": "ASSISTANT", 
  "sessionId": "<session-id>", 
  "type": "TEXT" 
}
```

Example of SPECULATIVE output (optional preview, not for history):

```
ContentStart event: { 
  "additionalModelFields": "{\"generationStage\":\"SPECULATIVE\"}", 
  "completionId": "<completion-id>", 
  "contentId": "<content-id>", 
  "role": "ASSISTANT", 
  "sessionId": "<session-id>", 
  "type": "TEXT" 
}
```
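
Putting the pattern together, a collector can keep only FINAL transcripts while ignoring SPECULATIVE previews. This is a minimal sketch assuming the event shapes shown above; `TranscriptCollector` is an illustrative name.

```
import json

class TranscriptCollector:
    """Accumulate FINAL transcripts from the output event stream into chat history."""

    def __init__(self):
        self.history = []           # [{'role': ..., 'content': ...}, ...]
        self._role_by_content = {}  # contentId -> role, FINAL text blocks only

    def on_event(self, event):
        body = event.get("event", {})
        if "contentStart" in body:
            start = body["contentStart"]
            fields = json.loads(start.get("additionalModelFields") or "{}")
            if start.get("type") == "TEXT" and fields.get("generationStage") == "FINAL":
                self._role_by_content[start["contentId"]] = start["role"]
        elif "textOutput" in body:
            out = body["textOutput"]
            role = self._role_by_content.get(out["contentId"])
            if role is not None:  # ignore SPECULATIVE previews
                self.history.append({"role": role, "content": out["content"]})
```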

# Tool configuration
<a name="sonic-tool-configuration"></a>

Amazon Nova 2 Sonic supports tool use (also known as function calling), allowing the model to request external information or actions during conversations, such as API calls, database queries, or custom code functions. This allows your voice assistant to take actions, retrieve information, and integrate with external services based on user requests.

Nova 2 Sonic features asynchronous tool calling, enabling the AI to continue conversing naturally while tools run in the background, creating a more fluid and responsive user experience.

The following are simplified steps on how to use tools:

1. Define tools: specify available tools with their parameters in the promptStart event

1. User speaks: user makes a request that requires a tool (such as "What's the weather in Seattle?")

1. Tool invocation: Nova 2 Sonic recognizes the need and sends a toolUse event

1. Execute tool: Your application executes the tool and returns results

1. Response Generation: Nova 2 Sonic incorporates the results into its spoken response

The following diagram illustrates how tool use works:

![How tool use works](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/How-tool-use-works_5.png)


## Defining tools
<a name="sonic-tool-defining"></a>

Tools are defined using a JSON schema that describes their purpose, parameters, and expected inputs.

The following are tool definition components and explanations:
+ Name: Unique identifier for the tool (use snake_case)
+ Description: Clear explanation of what the tool does; helps the AI decide when to use it
+ InputSchema: JSON schema defining the parameters the tool accepts
+ Properties: Individual parameters with types and descriptions
+ Required: Array of parameter names that must be provided

### Example of tool definition
<a name="w2aac25c13c23c15b9b1"></a>

Here's a simple weather tool definition:

```
{
  "toolSpec": {
    "name": "get_weather",
    "description": "Get current weather information for a specific location",
    "inputSchema": {
      "json": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "City name or zip code"
          },
          "units": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "Temperature units"
          }
        },
        "required": ["location"]
      }
    }
  }
}
```

### Configuring Tools in PromptStart
<a name="w2aac25c13c23c15b9b3"></a>

Tool configuration is passed to Nova 2 Sonic in the `promptStart` event along with audio and text output settings:

```
{
    "event": {
        "promptStart": {
            "promptName": "<prompt-id>",
            "textOutputConfiguration": {
                "mediaType": "text/plain"
            },
            "audioOutputConfiguration": {
                "mediaType": "audio/lpcm",
                "sampleRateHertz": 16000,
                "sampleSizeBits": 16,
                "channelCount": 1,
                "voiceId": "matthew",
                "encoding": "base64",
                "audioType": "SPEECH"
            },
            "toolUseOutputConfiguration": {
                "mediaType": "application/json"
            },
            "toolConfiguration": {
                "tools": [
                    {
                        "toolSpec": {
                            "name": "get_weather",
                            "description": "Get current weather information for a specific location",
                            "inputSchema": {
                                "json": {
                                    "type": "object",
                                    "properties": {
                                        "location": {
                                            "type": "string",
                                            "description": "City name or zip code"
                                        },
                                        "units": {
                                            "type": "string",
                                            "enum": ["celsius", "fahrenheit"],
                                            "description": "Temperature units"
                                        }
                                    },
                                    "required": ["location"]
                                }
                            }
                        }
                    }
                ],
                "toolChoice": {
                    "auto": {}
                }
            }
        }
    }
}
```

## Tool Choice Parameters
<a name="sonic-tool-choice-parameters"></a>

Nova 2 Sonic supports three tool choice parameters to control when and which tools are used. Specify the toolChoice parameter in your tool configuration:
+ Auto (default): The model decides whether any tools are needed and can call multiple tools if required. Provides maximum flexibility.
+ Any: Ensures at least one of the available tools is called at the beginning of the response, with the model selecting the most appropriate one. Useful when you have multiple knowledge bases or tools and want to guarantee one is used.
+ Tool: Forces a specific named tool to be called exactly once at the beginning of the response. For example, if you specify a knowledge base tool, the model will query it before responding, regardless of whether it thinks the tool is needed.

**Tool Choice Examples**

Auto (default):

```
"toolChoice": { 
    "auto": {} 
}
```

Any:

```
"toolChoice": {
    "any": {}
}
```

Specific Tool:

```
"toolChoice": {
    "tool": {
        "name": "get_weather"
    }
}
```

## Receiving and processing tool use events
<a name="sonic-tool-receiving"></a>

When Amazon Nova 2 Sonic determines that a tool is needed, it sends a `toolUse` event containing:

1. `toolUseId`: Unique identifier for this tool invocation

1. `toolName`: The name of the tool to execute

1. `content`: JSON string containing parameters extracted from the user's request

1. `sessionId`: Current session identifier

1. `role`: Set to "TOOL" for tool use events

Example tool use event

```
{
    "event": {
        "toolUse": {
            "completionId": "<completion-id>",
            "content": "{\"location\": \"Seattle\", \"units\": \"fahrenheit\"}",
            "contentId": "<content-id>",
            "promptName": "<prompt-id>",
            "role": "TOOL",
            "sessionId": "<session-id>",
            "toolName": "get_weather",
            "toolUseId": "<tool-use-id>"
        }
    }
}
```

Processing steps

1. Receive the toolUse event from Nova 2 Sonic

1. Extract the tool name and parameters from the event

1. Execute your tool logic (API call, database query, and so on)

1. Return the result using a toolResult event

Example ToolResult Event

```
{
    "event": {
        "toolResult": {
            "promptName": "<prompt-id>",
            "contentName": "<content-id>",
            "content": "{\"temperature\": 72, \"condition\": \"sunny\", \"humidity\": 45}"
        }
    }
}
```
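
A minimal dispatch loop for these steps might look like the following sketch. The `get_weather` handler and the `TOOLS` registry are illustrative, and the `contentName` wiring is simplified; follow your session's content-block conventions when emitting the result.

```
import json

# Illustrative registry; get_weather is a stand-in for your real implementation.
def get_weather(location, units="fahrenheit"):
    return {"temperature": 72, "condition": "sunny", "humidity": 45}

TOOLS = {"get_weather": get_weather}

def handle_tool_use(tool_use):
    """Execute the requested tool and build the matching toolResult event."""
    handler = TOOLS[tool_use["toolName"]]
    params = json.loads(tool_use["content"])  # parameters extracted by the model
    try:
        result = handler(**params)
    except Exception as exc:
        # Return a meaningful error so the model can explain the failure.
        result = {"error": str(exc)}
    return {"event": {"toolResult": {
        "promptName": tool_use["promptName"],
        "contentName": tool_use["contentId"],
        "content": json.dumps(result),
    }}}
```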

## Best practices
<a name="sonic-tool-best-practices"></a>
+ Clear descriptions: Write detailed tool descriptions to help Nova 2 Sonic understand when to use each tool.
+ Validate parameters: Always validate tool parameters before execution to prevent errors. Define tool parameters using proper JSON schema with structured data types (such as enums, numbers, or booleans) rather than open-ended strings whenever possible.
+ Error handling: Return meaningful error messages in toolResult events when tools fail.
+ Async execution: Take advantage of asynchronous tool calling to maintain conversation flow.
+ Tool naming: Use descriptive, action-oriented names (such as get_weather, search_database, send_email).

# Asynchronous tool calling
<a name="sonic-async-tools"></a>

Unlike traditional synchronous tool calling where the AI waits silently for tool results, Amazon Nova 2 Sonic's asynchronous approach allows it to:
+ Continue accepting user input while tools are running
+ Respond to new questions without waiting for pending tool results
+ Handle multiple tool calls simultaneously
+ Maintain natural conversation flow without awkward pauses

No extra configuration is required; asynchronous tool calling works out of the box.

## How it works
<a name="sonic-async-tools-works"></a>

When Nova 2 Sonic issues a tool call, it doesn't pause the conversation. Instead, it continues listening and responding naturally until the tool result arrives.

![Asynchronous tool calling flow](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Asynchronous-Tool-Calling_6.png)


## Handling user interruptions
<a name="sonic-async-tools-interruptions"></a>

If a user changes their request while a tool is executing, Nova 2 Sonic handles it intelligently without canceling pending tool calls.

![Handling a user interruption during tool execution](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Asynchronous-User-Interruption_7.png)


Example Scenario

```
User: "Can I book a flight from Boston to Chicago?"
Agent: "Sure, let me look that up for you."
Agent: [initiates tool call for Chicago flights]
User: "Actually, I want to go to Seattle"
Agent: "OK, let me update that search"
Agent: [initiates tool call for Seattle flights]
[First tool returns with Chicago flight results]
Agent: [receives Chicago results and processes them contextually]
```

## How interrupted tool calls are resolved
<a name="sonic-async-tools-how-it-works"></a>

Tool results are always delivered: When a tool call completes, its result is always sent to the model, even if the user has changed their request. The model uses its reasoning capabilities to determine how to handle the information.

Context-aware processing: The model understands the conversation context and can appropriately handle outdated tool results. For example:
+ If the user says "thank you" after changing their mind, the model still needs the original results for context
+ If the user changes their request, the model can acknowledge the original results while focusing on the new request

No automatic cancellation: The system does not automatically cancel or ignore tool calls based on new user input. This ensures the model has complete information to make intelligent decisions about how to respond.
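
Client-side, the asynchronous pattern amounts to launching tool work as a background task and always sending the result when it completes, as this sketch illustrates (`execute_tool` and `send_event` are stand-ins for your tool dispatch and event-sending coroutine):

```
import asyncio
import json

def execute_tool(tool_use):
    """Stand-in for your tool dispatch; returns a toolResult event."""
    return {"event": {"toolResult": {
        "promptName": tool_use["promptName"],
        "contentName": tool_use["contentId"],
        "content": json.dumps({"status": "ok"}),
    }}}

async def run_tool_async(tool_use, send_event):
    # Run blocking tool work off the event loop so streaming continues.
    loop = asyncio.get_running_loop()
    result_event = await loop.run_in_executor(None, execute_tool, tool_use)
    # The result is always sent, even if the user has since changed their
    # request; the model reasons about stale results in context.
    await send_event(result_event)

def on_tool_use(tool_use, send_event):
    # Fire-and-forget: the conversation loop keeps streaming while this runs.
    return asyncio.create_task(run_tool_async(tool_use, send_event))
```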

# Integrations
<a name="sonic-integrations"></a>

Amazon Nova 2 Sonic can be integrated with various frameworks and platforms to build conversational AI applications. These integrations provide pre-built components and simplified APIs for common use cases.

## Strands Agents
<a name="sonic-strands-agents"></a>

Strands Agents is a simple yet powerful SDK that takes a model-driven approach to building and running AI agents. From simple conversational assistants to complex autonomous workflows, from local development to production deployment, Strands Agents scales with your needs.

For comprehensive documentation on the Strands framework, visit the [official Strands documentation](https://strandsagents.com/latest/documentation/docs/user-guide/quickstart/).

The Strands BidiAgent provides real-time audio and text interaction through persistent streaming connections. Unlike traditional request-response patterns, this agent maintains long-running conversations with support for interruptions, concurrent processing and continuous audio responses.

**Prerequisites:**
+ Python 3.8 or later installed
+ AWS credentials configured with access to Amazon Bedrock
+ Basic familiarity with Python async/await syntax

### Code example
<a name="w2aac25c15b5c15b1"></a>

**Installation:**

 Install the required packages:

```
pip install strands-agents strands-agents-tools
```

Run this example:

```
import asyncio
from strands.experimental.bidi.agent import BidiAgent
from strands.experimental.bidi.io.audio import BidiAudioIO
from strands.experimental.bidi.io.text import BidiTextIO
from strands.experimental.bidi.models.novasonic import BidiNovaSonicModel
from strands_tools import calculator

async def main():
    """Test the BidirectionalAgent API."""
    # Audio and Text input/output utility
    audio_io = BidiAudioIO(audio_config={})
    text_io = BidiTextIO()
    
    # Nova Sonic model
    model = BidiNovaSonicModel(region="us-east-1")
    
    async with BidiAgent(model=model, tools=[calculator]) as agent:
        print("New BidiAgent Experience")
        print("Try asking: 'What is 25 times 8?' or 'Calculate the square root of 144'")
        
        await agent.run(
            inputs=[audio_io.input()],
            outputs=[audio_io.output(), text_io.output()]
        )

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nConversation ended by user")
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()
```

### 1. Import: Required Models
<a name="w2aac25c15b5c15b3"></a>

```
from strands.experimental.bidi.agent import BidiAgent 
from strands.experimental.bidi.io.audio import BidiAudioIO 
from strands.experimental.bidi.io.text import BidiTextIO 
from strands.experimental.bidi.models.novasonic import BidiNovaSonicModel 
from strands_tools import calculator
```
+ BidiAgent: The main agent class that orchestrates bidirectional conversations
+ BidiAudioIO: Handles audio input and output for speech interactions
+ BidiTextIO: Provides text output for transcriptions and responses
+ BidiNovaSonicModel: The Nova 2 Sonic model wrapper
+ calculator: A pre-built tool for mathematical operations

### 2. Configure Audio and Text I/O
<a name="w2aac25c15b5c15b5"></a>

```
audio_io = BidiAudioIO(audio_config={}) 
text_io = BidiTextIO()
```

The BidiAudioIO manages microphone input and speaker output, while BidiTextIO displays text transcriptions and responses in the console.

### 3. Initialize the Model
<a name="w2aac25c15b5c15b7"></a>

```
model = BidiNovaSonicModel(region="us-east-1")
```

Create a Nova Sonic model instance. The region parameter specifies the AWS region where the model is deployed.

### 4. Create and Run the Agent
<a name="w2aac25c15b5c15b9"></a>

```
async with BidiAgent(model=model, tools=[calculator]) as agent: 
    await agent.run( 
        inputs=[audio_io.input()],  
        outputs=[audio_io.output(), text_io.output()] 
    )
```

The agent is created with:
+ Model: The Nova 2 Sonic model to use
+ Tools: List of tools the agent can call (like calculator)
+ Inputs: Audio input from the microphone
+ Outputs: Audio output to speakers and text output to console

## Framework integrations
<a name="sonic-framework-integrations"></a>

Amazon Nova 2 Sonic can be integrated with various frameworks and platforms to build sophisticated voice applications. The following examples demonstrate integration patterns with popular frameworks.

### Amazon Bedrock AgentCore
<a name="sonic-agentcore"></a>

Amazon Bedrock AgentCore provides a managed runtime environment for deploying Nova 2 Sonic applications with enterprise-grade security and scalability. AgentCore simplifies the deployment of real-time voice AI applications by handling infrastructure, authentication, and WebSocket connectivity.

![AgentCore architecture overview](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Agentcore-Architecture-Overview_11.png)


**Key features:**
+ Bidirectional streaming - Native support for Nova Sonic's full-duplex streaming interface with real-time event processing and low-latency communication.
+ WebSocket infrastructure - Production-ready WebSocket servers with automatic scaling, connection management, and error recovery.
+ Container deployment - Deploy Nova Sonic applications as containers to managed infrastructure with horizontal scaling and independent versioning.
+ Enterprise security - Fine-grained authentication via IAM and SigV4, VPC isolation, and comprehensive audit logging.

The architecture shows how client applications connect to AgentCore Runtime via WebSocket with SigV4 authentication. The containerized environment includes your WebSocket server, application logic, and Nova Sonic client, all communicating with Nova Sonic through the bidirectional streaming API.

**Benefits:**
+ Simplified operations: Focus on application logic while AgentCore manages infrastructure, scaling, and reliability.
+ Enterprise security: Built-in authentication, authorization, and compliance features for production deployments.
+ Cost efficiency: Pay only for what you use with automatic scaling and resource optimization.
+ Developer productivity: Reduce time to production with managed WebSocket infrastructure and container deployment.

**Use cases**
+ Customer service voice assistants with secure authentication
+ Enterprise voice applications requiring IAM integration
+ Multi-tenant voice platforms with isolated deployments
+ Voice-enabled applications requiring compliance and audit trails

For detailed documentation on deploying Nova Sonic with AgentCore, visit the [Amazon Bedrock AgentCore Documentation](https://aws.amazon.com/bedrock/agentcore/).

### LiveKit
<a name="sonic-livekit"></a>

LiveKit is an open-source platform for building real-time audio and video applications. Integration with Amazon Nova 2 Sonic enables developers to build conversational voice interfaces without managing complex audio pipelines or signaling protocols.

For detailed implementation examples and code samples, visit the [LiveKit AWS Integration Documentation](https://docs.livekit.io/agents/integrations/aws/).

![LiveKit architecture overview](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/LiveKit-Architecture-Overview_9.png)


**How it works:**
+ Client layer: Web, mobile, or desktop applications connect using LiveKit's client SDKs, which handle audio capture, WebRTC streaming, and playback.
+ LiveKit Server: Acts as the real-time communication hub, managing WebRTC connections, routing audio streams, and handling session state with low-latency optimization.
+ LiveKit Agent: Python-based agent that receives audio from the server, processes it through the Nova Sonic plugin, and streams responses back. Includes built-in features like voice activity detection and turn management.
+ Amazon Nova 2 Sonic: Processes the audio stream through bidirectional streaming API, performing speech recognition, natural language understanding, and generating conversational responses with synthesized speech.
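The full-duplex pattern the agent implements — streaming captured audio upstream while response events stream downstream, and discarding queued playback on barge-in — can be sketched with plain asyncio. Queues stand in for the network, and the event shapes are illustrative, not the LiveKit or Nova Sonic wire format:

```python
import asyncio

async def duplex_session(mic, uplink, downlink, speaker):
    """Run upstream and downstream concurrently, as a bidirectional
    streaming client must: neither direction blocks the other."""
    async def send_audio():
        async for chunk in mic:
            await uplink.put(chunk)      # forward each captured audio chunk
        await uplink.put(None)           # end-of-input marker

    async def receive_events():
        while True:
            event = await downlink.get()
            if event is None:            # session closed
                break
            if event["type"] == "audio":
                speaker.append(event["data"])
            elif event["type"] == "interrupted":
                speaker.clear()          # barge-in: drop not-yet-played audio

    await asyncio.gather(send_audio(), receive_events())
```

The interruption branch is what makes barge-in feel natural: buffered assistant speech is dropped the moment the user starts talking, while the upstream task keeps sending their audio uninterrupted.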

### Pipecat
<a name="sonic-pipecat"></a>

Pipecat is a framework for building voice and multimodal conversational AI applications. It provides a modular, pipeline-based architecture that orchestrates multiple components to create intelligent voice applications with Amazon Nova Sonic and other AWS services.

For detailed implementation examples and code samples, visit the [Pipecat AWS Integration Documentation](https://docs.pipecat.ai/server/services/s2s/aws).

**Key features:**
+ Pipeline architecture: Modular Python-based framework for composing voice AI components including ASR, NLU, TTS, and more.
+ Pipecat flows: State management framework for building complex conversational logic and tool execution.
+ WebRTC Support: Built-in integration with Daily and other WebRTC providers for real-time audio streaming.
+ AWS Integration: Native support for Amazon Bedrock, Amazon Transcribe, and Amazon Polly.

![Pipecat architecture overview](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Pipecat-Architecture-Overview_10.png)


The architecture includes:
+ WebRTC Transport: Real-time audio streaming between client devices and application server.
+ Voice activity detection (VAD): Silero VAD with configurable speech detection and noise suppression.
+ Speech recognition: Amazon Transcribe for accurate, real-time speech-to-text conversion.
+ Natural language understanding: Amazon Nova Pro on Bedrock with latency-optimized inference.
+ Tool execution: Pipecat Flows for API integration and backend service calls.
+ Response generation: Amazon Nova Pro for coherent, context-aware responses.
+ Text-to-speech: Amazon Polly with generative voices for lifelike speech output.
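Conceptually, each stage above consumes the previous stage's output, and any stage may swallow a frame (for example, VAD dropping silence). The following minimal stand-in shows that composition pattern; it is not the Pipecat API, and the stage functions are illustrative placeholders for the real services:

```python
class Pipeline:
    """Chain processors so each stage's output feeds the next,
    mirroring the VAD -> ASR -> LLM -> TTS flow described above."""
    def __init__(self, *stages):
        self.stages = stages

    def process(self, frame):
        for stage in self.stages:
            frame = stage(frame)
            if frame is None:   # a stage may swallow a frame (e.g. VAD on silence)
                return None
        return frame

# Illustrative stand-ins for the real services
def vad(frame):                 # drop low-energy (silent) frames
    return frame if frame.get("energy", 0) > 0.1 else None

def asr(frame):                 # audio -> text (Amazon Transcribe in the real stack)
    return {"text": frame["transcript"]}

def llm(frame):                 # text -> reply (Amazon Nova Pro in the real stack)
    return {"reply": f"You said: {frame['text']}"}

pipeline = Pipeline(vad, asr, llm)
```

Because each stage is an independent unit with a uniform interface, swapping one service for another (or inserting a new stage such as noise suppression) does not disturb the rest of the pipeline — the property that makes the modular architecture practical.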

### Deploy on AWS
<a name="sonic-deploy-aws"></a>

Deploy your Nova Sonic applications to AWS using infrastructure as code with AWS CDK (Cloud Development Kit). This approach provides repeatable, version-controlled deployments with best practices built in.

**Deployment options:**
+ Amazon ECS (Elastic Container Service): Fully managed container orchestration with Application Load Balancer integration, auto-scaling, and serverless Fargate execution.
+ Amazon EKS (Elastic Kubernetes Service): Managed Kubernetes for complex orchestration, advanced networking, multi-region deployments, and an extensive tooling ecosystem.
+ AWS CDK: Define your cloud infrastructure in familiar programming languages for repeatable, version-controlled deployments.

For a complete, production-ready example of deploying Nova Sonic with AWS CDK, see the [Speech-to-Speech CDK Sample](https://github.com/aws-samples/generative-ai-cdk-constructs-samples/tree/main/samples/speech-to-speech) on GitHub.

![Speech-to-speech CDK architecture](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/cdk-12.png)

This sample demonstrates:
+ Complete CDK infrastructure setup with TypeScript
+ WebSocket server implementation for real-time communication
+ Container deployment with ECS and Fargate
+ Application Load Balancer configuration for WebSocket support
+ VPC networking and security group setup
+ CloudWatch monitoring and logging
+ Best practices for production deployments

### Multi-agentic systems
<a name="sonic-multi-agent"></a>

Multi-agent architecture is a widely used pattern for designing AI assistants that handle complex tasks. In a voice assistant powered by Nova 2 Sonic, this architecture coordinates multiple specialized agents, where each agent operates independently to enable parallel processing, modular design, and scalable solutions.

Nova Sonic serves as the orchestrator in a multi-agent system, performing two key functions:
+ Conversation flow management: Ensures all necessary information is collected before proceeding to the next step in the conversation.
+ Intent classification: Analyzes user inquiries and routes them to the appropriate specialized sub-agent.

![Banking assistant multi-agent architecture](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Banking-Assistant_13.png)


The diagram above shows a banking voice assistant that uses a multi-agent architecture. The conversation flow begins with a greeting and collecting the user's name, then handles inquiries related to banking or mortgages through specialized sub-agents.

Conversation flow example:

1. User connects to voice assistant.

1. Nova 2 Sonic: "Hello! What's your name?"

1. User: "My name is John"

1. Nova 2 Sonic: "Hi John, how can I help you today?"

1. User: "I want to check my account balance"

1. Nova 2 Sonic: [Routes to Authentication Agent]

1. Authentication Agent: "Please provide your account ID"

1. User: "12345"

1. Authentication Agent: [Verifies identity]

1. Nova 2 Sonic: [Routes to Banking Agent]

1. Banking Agent: "Your current balance is \$15,431.10"

While this example demonstrates sub-agents using the Strands Agents framework deployed on Amazon Bedrock AgentCore, the architecture is flexible. You can choose:
+ Your preferred agent framework
+ Any LLM provider
+ Custom hosting options
+ Different orchestration patterns
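The routing step can be sketched as a plain dispatch table: classify the inquiry, then hand it to the matching sub-agent handler. The keyword rules and handler names below are illustrative stand-ins — in a real deployment Nova 2 Sonic performs classification through function calling, and each handler would invoke an agent built with your chosen framework:

```python
def classify_intent(utterance: str) -> str:
    """Toy keyword classifier standing in for the orchestrator's
    intent-classification step."""
    text = utterance.lower()
    if any(w in text for w in ("balance", "account", "transfer")):
        return "banking"
    if any(w in text for w in ("mortgage", "rate", "refinance")):
        return "mortgage"
    return "fallback"

# Each entry stands in for a specialized sub-agent
SUB_AGENTS = {
    "banking": lambda q: f"[Banking Agent] handling: {q}",
    "mortgage": lambda q: f"[Mortgage Agent] handling: {q}",
    "fallback": lambda q: "Sorry, could you rephrase that?",
}

def route(utterance: str) -> str:
    """Orchestrator: classify the inquiry, dispatch to the sub-agent."""
    return SUB_AGENTS[classify_intent(utterance)](utterance)
```

Adding a new domain means adding one classifier rule and one handler entry, leaving existing agents untouched — the modularity and scalability benefits listed below fall directly out of this dispatch structure.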

**Benefits:**
+ Modularity: Each agent focuses on a specific domain, making the system easier to maintain and update.
+ Scalability: Add new agents without modifying existing ones, allowing your system to grow with your needs.
+ Parallel processing: Multiple agents can work simultaneously, improving response times for complex queries.
+ Specialization: Each agent can be optimized for its specific task, using the most appropriate tools and knowledge bases.
+ Fault isolation: If one agent fails, others continue to function, improving overall system reliability.

Refer to [this blog post](https://aws.amazon.com/blogs/machine-learning/building-a-multi-agent-voice-assistant-with-amazon-nova-sonic-and-amazon-bedrock-agentcore/) for more details and code examples.

See the [Nova Sonic Workshop Multi-Agent Lab](https://catalog.workshops.aws/amazon-nova-sonic-s2s/en-US/02-repeatable-pattern/05-multi-agent-agentcore) for hands-on samples.

### Telephony integration
<a name="sonic-telephony"></a>

Amazon Nova 2 Sonic integrates with telephony providers to enable AI-powered voice applications accessible via phone calls. This guide covers integration with Twilio, Vonage, and other SIP-based systems for building contact center solutions and voice agents.

**Twilio**: Cloud communications platform with programmable voice and media streaming capabilities.

**Vonage**: Global communications APIs with voice, WebSocket audio streaming, and SIP connectivity.

AWS provides a comprehensive sample implementation demonstrating Nova Sonic in a contact center environment with real-time analytics and telephony integration.

Repository: [Sample Sonic Contact Center with Telephony](https://github.com/aws-samples/sample-sonic-contact-center-with-telephony)
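A telephony bridge must also transcode audio: Twilio Media Streams, for example, deliver base64-encoded 8 kHz G.711 μ-law audio in JSON `media` events over a WebSocket, while Nova Sonic's streaming API consumes linear PCM (check the Nova Sonic documentation for supported sample rates). A dependency-free sketch of that decode step, with the message shape following Twilio's documented media event:

```python
import base64
import json

def mulaw_to_pcm16(sample: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~sample & 0xFF                  # mu-law bytes are transmitted inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def decode_twilio_media(message: str) -> list[int]:
    """Extract and decode the audio payload from a Twilio 'media' event;
    other event types ('start', 'stop', ...) carry no audio."""
    event = json.loads(message)
    if event.get("event") != "media":
        return []
    ulaw = base64.b64decode(event["media"]["payload"])
    return [mulaw_to_pcm16(b) for b in ulaw]
```

The resulting samples would still need resampling to the model's expected rate and packing into the input-audio events of the bidirectional stream; production bridges typically delegate both the decode and the resampling to an audio library rather than hand-rolling them.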