

# Lambda durable functions
<a name="durable-functions"></a>

Lambda durable functions enable you to build resilient multi-step applications and AI workflows that can execute for up to one year while maintaining reliable progress despite interruptions. The complete lifecycle of a running durable function is called a durable execution. A durable execution uses checkpoints to track progress and recovers from failures automatically through replay: Lambda re-executes your code from the beginning while skipping completed work.

Within each function, you use durable operations as fundamental building blocks. Steps execute business logic with built-in retries and progress tracking, while waits suspend execution without incurring compute charges, making them ideal for long-running processes like human-in-the-loop workflows or polling external dependencies. Whether you're processing orders, coordinating microservices, or orchestrating agentic AI applications, durable functions maintain state automatically and recover from failures while you write code in familiar programming languages.

## Key benefits
<a name="durable-functions-benefits"></a>

**Write resilient code naturally:** With familiar programming constructs, you write code that handles failures automatically. Built-in checkpointing, transparent retries, and automatic recovery mean your business logic stays clean and focused.

**Pay only for what you use:** During wait operations, your function suspends without incurring compute charges. For long-running workflows that wait hours or days, you pay only for actual processing time, not idle waiting.

**Operational simplicity:** With Lambda's serverless model, you get automatic scaling, including scale-to-zero, without managing infrastructure. Durable functions handle state management, retry logic, and failure recovery automatically, reducing operational overhead.

## When to use durable functions
<a name="durable-functions-use-cases"></a>

**Short-lived coordination:** Coordinate payments, inventory, and shipping across multiple services with automatic rollback on failures. Process orders through validation, payment authorization, inventory allocation, and fulfillment with guaranteed completion.

**Process payments with confidence:** Build resilient payment flows that maintain transaction state through failures and handle retries automatically. Coordinate multi-step authorization, fraud checks, and settlement across payment providers with full auditability across steps.

**Build reliable AI workflows:** Create multi-step AI workflows that chain model calls, incorporate human feedback, and handle long-running tasks deterministically during failures. Automatically resume after suspension, and only pay for active execution time.

**Orchestrate complex order fulfillment:** Coordinate order processing across inventory, payment, shipping, and notification systems with built-in resilience. Automatically handle partial failures, preserve order state despite interruptions, and efficiently wait for external events without consuming compute resources.

**Automate multi-step business workflows:** Build reliable workflows for employee onboarding, loan approvals, and compliance processes that span days or weeks. Maintain workflow state across human approvals, system integrations, and scheduled tasks while providing full visibility into process status and history.

### How durable functions compare to Step Functions
<a name="durable-functions-vs-step-functions"></a>

Both durable functions and Step Functions provide workflow orchestration with automatic state management. The key differences are where they run and how you define workflows:
+ **Durable functions:** Run within Lambda, use standard programming languages, managed within the Lambda environment
+ **Step Functions:** Standalone service, graph-based DSL or visual designer, fully managed with zero maintenance

Durable functions are ideal for application development in Lambda where workflows are tightly coupled with business logic. Step Functions excels at workflow orchestration across AWS services where you need visual design, native integrations to 220+ services, and zero-maintenance infrastructure.

For a detailed comparison, see [Durable functions or Step Functions](durable-step-functions.md).

## How it works
<a name="durable-functions-how-it-works"></a>

Under the hood, durable functions are regular Lambda functions that use a checkpoint-and-replay mechanism, commonly referred to as durable execution, to track progress and support long-running operations through user-defined suspension points. After your function resumes from a pause or interruption, the system performs replay. During replay, your code runs from the beginning but skips over completed checkpoints, using stored results instead of re-executing completed operations. This replay mechanism ensures consistency while enabling long-running executions.

To harness this checkpoint-and-replay mechanism in your applications, Lambda provides a durable execution SDK. The SDK abstracts away the complexity of managing checkpoints and replay, exposing simple primitives called durable operations that you use in your code. The SDK is available for JavaScript, TypeScript, Python, and Java, integrating seamlessly with your existing Lambda development workflow.

With the SDK, you wrap your Lambda event handler, which then provides a DurableContext alongside your event. This context gives you access to durable operations like steps and waits. You write your function logic as normal sequential code, but instead of calling services directly, you wrap those calls in steps for automatic checkpointing and retries. When you need to pause execution, you add waits that suspend your function without incurring charges. The SDK handles all the complexity of state management and replay behind the scenes, so your code remains clean and readable.

![\[Diagram showing how durable functions work\]](http://docs.aws.amazon.com/lambda/latest/dg/images/how_durable_works.png)

## Next steps
<a name="durable-functions-next-steps"></a>
+ [Get started with durable functions](durable-getting-started.md)
+ [Explore the durable execution SDK](durable-execution-sdk.md)
+ [Durable functions or Step Functions](durable-step-functions.md)
+ [Monitor and debug durable functions](durable-monitoring.md)
+ [Review security and permissions](durable-security.md)
+ [Follow best practices](durable-best-practices.md)

# Basic concepts
<a name="durable-basic-concepts"></a>

Lambda provides durable execution SDKs for JavaScript, TypeScript, and Python. These SDKs are the foundation for building durable functions, providing the primitives you need to checkpoint progress, handle retries, and manage execution flow. For complete SDK documentation and examples, see the [JavaScript/TypeScript SDK](https://github.com/aws/aws-durable-execution-sdk-js) and [Python SDK](https://github.com/aws/aws-durable-execution-sdk-python) on GitHub.

## Durable execution
<a name="durable-execution-concept"></a>

A **durable execution** represents the complete lifecycle of a Lambda durable function, using a checkpoint and replay mechanism to track business logic progress, suspend execution, and recover from failures. When functions resume after suspension or interruptions, previously completed checkpoints are replayed and the function continues execution.

The lifecycle may include multiple invocations of a Lambda function to complete the execution, particularly after suspensions or failure recovery. This approach enables your function to run for extended periods (up to one year) while maintaining reliable progress despite interruptions.

**How replay works**  
Lambda keeps a running log of all durable operations (steps, waits, and other operations) as your function executes. When your function needs to pause or encounters an interruption, Lambda saves this checkpoint log and stops the execution. When it's time to resume, Lambda invokes your function again from the beginning and replays the checkpoint log, substituting stored values for completed operations. This means your code runs again, but previously completed steps don't re-execute. Their stored results are used instead.
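
The replay behavior described above can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not the Lambda SDK: `ToyContext`, `Interrupted`, and `run_workflow` are hypothetical names, and the checkpoint log is an in-memory list rather than state managed by Lambda.

```python
# Toy model of checkpoint and replay. This is NOT the Lambda SDK; every name
# here (ToyContext, Interrupted, run_workflow) is hypothetical.


class Interrupted(Exception):
    """Simulates an interruption partway through an invocation."""


class ToyContext:
    def __init__(self, log):
        self.log = log        # checkpoint log that survives across invocations
        self.position = 0     # index of the next durable operation
        self.executed = []    # names of steps that actually ran this invocation

    def step(self, name, fn):
        index = self.position
        self.position += 1
        if index < len(self.log):
            # Completed in an earlier invocation: reuse the stored result.
            return self.log[index]
        result = fn()
        self.log.append(result)   # checkpoint the result
        self.executed.append(name)
        return result


def run_workflow(ctx, fail_after_payment):
    order = ctx.step("validate", lambda: {"orderId": "order-12345"})
    payment = ctx.step("pay", lambda: {**order, "status": "paid"})
    if fail_after_payment:
        raise Interrupted()
    return ctx.step("confirm", lambda: {**payment, "status": "confirmed"})


checkpoint_log = []

# First invocation: two steps complete, then the invocation is interrupted.
first = ToyContext(checkpoint_log)
try:
    run_workflow(first, fail_after_payment=True)
except Interrupted:
    pass

# Second invocation: the code runs from the top, but only "confirm" executes.
second = ToyContext(checkpoint_log)
final_result = run_workflow(second, fail_after_payment=False)

print(first.executed)    # ['validate', 'pay']
print(second.executed)   # ['confirm']
print(final_result)      # {'orderId': 'order-12345', 'status': 'confirmed'}
```

The second invocation re-runs the whole function body, yet the first two steps return their stored results without calling their functions again.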

This replay mechanism is fundamental to understanding durable functions. Your code must be deterministic during replay, meaning it produces the same results given the same inputs. Keep non-deterministic operations (like generating random numbers or getting the current time) and operations with side effects inside steps. Outside of steps, they run again on every replay and can produce different values, which causes non-deterministic behavior.
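
To see why this matters, the following sketch compares a value read outside a step with a value captured inside one. It is a simplified model, not the SDK: `replayable_step` is a hypothetical helper, and `now()` is a counter standing in for a clock or random source so that the behavior is easy to follow.

```python
import itertools

# Stand-in for a non-deterministic source such as a clock or random generator:
# every call returns a different value.
_ticks = itertools.count(start=1000)


def now():
    return next(_ticks)


def replayable_step(log, index, fn):
    """Hypothetical helper: reuse the stored result if present, else run and store."""
    if index < len(log):
        return log[index]
    log.append(fn())
    return log[index]


def invocation(log):
    outside = now()                         # re-evaluated on every replay
    inside = replayable_step(log, 0, now)   # captured once, then replayed
    return outside, inside


log = []
first_outside, first_inside = invocation(log)
second_outside, second_inside = invocation(log)

print(first_inside == second_inside)     # True: the step result is stable
print(first_outside == second_outside)   # False: the outside value changed
```

Any branch that depends on the outside value could take a different path on replay, while logic that depends on the step result behaves the same on every invocation.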

## DurableContext
<a name="durable-context-concept"></a>

**DurableContext** is the context object your durable function receives. It provides methods for durable operations like steps and waits that create checkpoints and manage execution flow.

Your durable function receives a `DurableContext` instead of the default Lambda context:

------
#### [ TypeScript ]

```
import {
  DurableContext,
  withDurableExecution,
} from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const result = await context.step(async () => {
      return "step completed";
    });
    return result;
  },
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import (
    DurableContext,
    durable_execution,
    durable_step,
)

@durable_step
def my_step(step_context, data):
    # Your business logic goes here; this placeholder returns the input unchanged
    result = data
    return result

@durable_execution
def handler(event, context: DurableContext):
    result = context.step(my_step(event["data"]))
    return result
```

------

The Python SDK for durable functions uses synchronous methods and doesn't support `await`. The TypeScript SDK uses `async/await`.

## Steps
<a name="steps-concept"></a>

**Steps** run business logic with built-in retries and automatic checkpointing. Each step saves its result, ensuring your function can resume from any completed step after interruptions.

------
#### [ TypeScript ]

```
// Each step is automatically checkpointed
const order = await context.step(async () => processOrder(event));
const payment = await context.step(async () => processPayment(order));
const result = await context.step(async () => completeOrder(payment));
```

------
#### [ Python ]

```
# Each step is automatically checkpointed
order = context.step(lambda: process_order(event))
payment = context.step(lambda: process_payment(order))
result = context.step(lambda: complete_order(payment))
```

------
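
The idea of retrying a step until it succeeds can be sketched as a simple loop. This is only an illustration: it does not reflect the SDK's actual retry policy, backoff, or configuration options, and `run_step_with_retries` and `max_attempts` are hypothetical names.

```python
def run_step_with_retries(fn, max_attempts=3):
    """Call fn until it succeeds or max_attempts is reached.

    Returns (result, attempts_used). Hypothetical helper, not an SDK API.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error


calls = {"count": 0}


def flaky_payment():
    # Fails twice with a transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("payment provider unavailable")
    return "charged"


result, attempts = run_step_with_retries(flaky_payment)
print(result, attempts)   # charged 3
```

In a durable function, only the final result of the step would need to be checkpointed, so a later replay would not repeat any of the attempts.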

## Wait states
<a name="wait-states-concept"></a>

**Wait states** are planned pauses where your function stops running (and stops charging) until it's time to continue. Use them to wait for time periods, external callbacks, or specific conditions.

------
#### [ TypeScript ]

```
// Wait for 1 hour without consuming resources
await context.wait({ seconds: 3600 });

// Wait for external callback
const approval = await context.waitForCallback(
  async (callbackId) => sendApprovalRequest(callbackId)
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python.config import Duration

# Wait for 1 hour without consuming resources
context.wait(Duration.from_seconds(3600))

# Wait for external callback
approval = context.wait_for_callback(
    lambda callback_id: send_approval_request(callback_id)
)
```

------

When your function encounters a wait or needs to pause, Lambda saves the checkpoint log and stops the execution. When it's time to resume, Lambda invokes your function again and replays the checkpoint log, substituting stored values for completed operations.
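
The suspend-and-resume cycle can be modeled with a small simulation. The sketch below is not how Lambda implements waits: `Suspend`, `ToyContext`, and `drive` are hypothetical, and the toy assumes the timer has already elapsed whenever the handler is invoked again.

```python
class Suspend(Exception):
    """Hypothetical signal: the invocation ends and resumes later."""

    def __init__(self, seconds):
        super().__init__(seconds)
        self.seconds = seconds


class ToyContext:
    def __init__(self, log):
        self.log = log        # checkpoint log shared across invocations
        self.position = 0     # index of the next durable operation

    def _next_index(self):
        index = self.position
        self.position += 1
        return index

    def step(self, fn):
        index = self._next_index()
        if index == len(self.log):
            self.log.append({"type": "step", "result": fn()})
        return self.log[index]["result"]

    def wait(self, seconds):
        index = self._next_index()
        if index == len(self.log):
            # First encounter: checkpoint the wait and end this invocation.
            self.log.append({"type": "wait", "seconds": seconds})
            raise Suspend(seconds)
        # On replay the wait is already in the log, so execution continues.


def handler(ctx):
    paid = ctx.step(lambda: "paid")
    ctx.wait(10)
    return ctx.step(lambda: f"{paid} and confirmed")


def drive(log):
    """Invoke the handler repeatedly, as a scheduler might, until it returns."""
    invocations = 0
    while True:
        invocations += 1
        try:
            return handler(ToyContext(log)), invocations
        except Suspend:
            continue  # a real service would re-invoke only after the wait elapses


result, invocations = drive([])
print(result, invocations)   # paid and confirmed 2
```

No code runs between the two invocations, which is why a wait does not accrue compute time.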

For more complex workflows, durable Lambda functions also come with advanced operations like `parallel()` for concurrent execution, `map()` for processing arrays, `runInChildContext()` for nested operations, and `waitForCondition()` for polling. See [Examples](durable-examples.md) for detailed examples and guidance on when to use each operation.

## Invoking other functions
<a name="invoke-concept"></a>

**Invoke** allows a durable function to call other Lambda functions and wait for their results. The calling function suspends while the invoked function executes, creating a checkpoint that preserves the result. This enables you to build modular workflows where specialized functions handle specific tasks.

Use `context.invoke()` to call other functions from within your durable function. The invocation is checkpointed, so if your function is interrupted after the invoked function completes, it resumes with the stored result without re-invoking the function.

------
#### [ TypeScript ]

```
// Invoke another function and wait for result
const customerData = await context.invoke(
  'validate-customer',
  'arn:aws:lambda:us-east-1:123456789012:function:customer-service:1',
  { customerId: event.customerId }
);

// Use the result in subsequent steps
const order = await context.step(async () => {
  return processOrder(customerData);
});
```

------
#### [ Python ]

```
# Invoke another function and wait for result
customer_data = context.invoke(
    'arn:aws:lambda:us-east-1:123456789012:function:customer-service:1',
    {'customerId': event['customerId']},
    name='validate-customer'
)

# Use the result in subsequent steps
order = context.step(
    lambda: process_order(customer_data),
    name='process-order'
)
```

------

The invoked function can be either a durable or standard Lambda function. If you invoke a durable function, the calling function waits for the complete durable execution to finish. This pattern is common in microservices architectures where each function handles a specific domain, allowing you to compose complex workflows from specialized, reusable functions.

**Note**  
Cross-account invocations are not supported. The invoked function must be in the same AWS account as the calling function.
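
Because the invoked function must be in the same account, a check before deployment can catch a misconfigured ARN early. The following sketch compares the account IDs embedded in two function ARNs. The helper names and the caller ARN are hypothetical, and the check relies only on the standard Lambda function ARN layout.

```python
def lambda_arn_account(arn):
    """Return the account ID from a Lambda function ARN."""
    parts = arn.split(":")
    # Layout: arn:partition:lambda:region:account-id:function:name[:qualifier]
    if (
        len(parts) < 7
        or parts[0] != "arn"
        or parts[2] != "lambda"
        or parts[5] != "function"
    ):
        raise ValueError(f"not a Lambda function ARN: {arn}")
    return parts[4]


def same_account(caller_arn, target_arn):
    """True when both functions belong to the same AWS account."""
    return lambda_arn_account(caller_arn) == lambda_arn_account(target_arn)


caller = "arn:aws:lambda:us-east-1:123456789012:function:order-workflow:3"
target = "arn:aws:lambda:us-east-1:123456789012:function:customer-service:1"
other = "arn:aws:lambda:us-east-1:210987654321:function:customer-service:1"

print(same_account(caller, target))   # True
print(same_account(caller, other))    # False
```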

## Durable function configuration
<a name="durable-configuration-basic"></a>

Durable functions have specific configuration settings that control execution behavior and data retention. These settings are separate from standard Lambda function configuration and apply to the entire durable execution lifecycle.

The **DurableConfig** object defines the configuration for durable functions:

```
{
  "ExecutionTimeout": Integer,
  "RetentionPeriodInDays": Integer
}
```

### Execution timeout
<a name="durable-execution-timeout"></a>

The **execution timeout** controls how long a durable execution can run from start to completion. This is different from the Lambda function timeout, which controls how long a single function invocation can run.

A durable execution can span multiple Lambda function invocations as it progresses through checkpoints, waits, and replays. The execution timeout applies to the total elapsed time of the durable execution, not to individual function invocations.

**Understanding the difference**  
The Lambda function timeout (maximum 15 minutes) limits each individual invocation of your function. The durable execution timeout (maximum 1 year) limits the total time from when the execution starts until it completes, fails, or times out. During this period, your function may be invoked multiple times as it processes steps, waits, and recovers from failures.

For example, if you set a durable execution timeout of 24 hours and a Lambda function timeout of 5 minutes:
+ Each function invocation must complete within 5 minutes
+ The entire durable execution can run for up to 24 hours
+ Your function can be invoked many times during those 24 hours
+ Wait operations don't count against the Lambda function timeout but do count against the execution timeout
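
The arithmetic behind this example can be sketched as a small planning check. The plan format and `check_plan` are hypothetical, and the model is deliberately simplified: it assumes every wait ends the current invocation and it ignores replay overhead.

```python
FUNCTION_TIMEOUT_S = 5 * 60           # per-invocation limit from the example
EXECUTION_TIMEOUT_S = 24 * 60 * 60    # whole-execution limit from the example


def check_plan(plan):
    """plan is a list of ("step" | "wait", seconds) tuples (hypothetical format)."""
    longest_invocation = 0
    current_invocation = 0
    total_elapsed = 0
    for kind, seconds in plan:
        total_elapsed += seconds          # everything counts toward the execution
        if kind == "wait":
            current_invocation = 0        # waits don't count toward an invocation
        else:
            current_invocation += seconds
            longest_invocation = max(longest_invocation, current_invocation)
    return {
        "fits_function_timeout": longest_invocation <= FUNCTION_TIMEOUT_S,
        "fits_execution_timeout": total_elapsed <= EXECUTION_TIMEOUT_S,
        "total_elapsed_s": total_elapsed,
    }


ok_plan = [("step", 120), ("wait", 6 * 3600), ("step", 200), ("wait", 12 * 3600), ("step", 60)]
long_wait = [("step", 60), ("wait", 25 * 3600)]
long_steps = [("step", 200), ("step", 200)]

print(check_plan(ok_plan))      # both checks pass, 65180 seconds in total
print(check_plan(long_wait))    # exceeds the 24-hour execution timeout
print(check_plan(long_steps))   # 400 seconds of work in a single invocation
```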

You can configure the execution timeout when creating a durable function using the Lambda console, AWS CLI, or AWS SAM. In the Lambda console, choose your function, then Configuration, Durable execution. Set the Execution timeout value in seconds (default: 86400 seconds / 24 hours, minimum: 60 seconds, maximum: 31536000 seconds / 1 year).

**Note**  
The execution timeout and Lambda function timeout are different settings. The Lambda function timeout controls how long each individual invocation can run (maximum 15 minutes). The execution timeout controls the total elapsed time for the entire durable execution (maximum 1 year).

### Retention period
<a name="durable-retention-period"></a>

The **retention period** controls how long Lambda retains execution history and checkpoint data after a durable execution completes. This data includes step results, execution state, and the complete checkpoint log.

After the retention period expires, Lambda deletes the execution history and checkpoint data. You can no longer retrieve execution details or replay the execution. The retention period starts when the execution reaches a terminal state (SUCCEEDED, FAILED, STOPPED, or TIMED\_OUT).
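
As a rough illustration of when data becomes eligible for deletion, the following sketch adds the retention period to the time an execution reached a terminal state. `retention_expiry` is a hypothetical helper, and the exact moment Lambda deletes data may differ from this calculation.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMED_OUT"}


def retention_expiry(status, ended_at, retention_days=14):
    """Return when retention ends, or None if the execution is still active."""
    if status not in TERMINAL_STATES:
        return None  # the retention period has not started yet
    return ended_at + timedelta(days=retention_days)


ended = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(retention_expiry("SUCCEEDED", ended))   # 2025-01-15 12:00:00+00:00
print(retention_expiry("RUNNING", ended))     # None
```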

You can configure the retention period when creating a durable function using the Lambda console, AWS CLI, or AWS SAM. In the Lambda console, choose your function, then Configuration, Durable execution. Set the Retention period value in days (default: 14 days, minimum: 1 day, maximum: 90 days).

Choose a retention period based on your compliance requirements, debugging needs, and cost considerations. Longer retention periods provide more time for debugging and auditing but increase storage costs.
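
The documented limits for both settings can be checked before you deploy. The sketch below validates a `DurableConfig`-shaped dictionary against the ranges and defaults stated in this section. `validate_durable_config` is a hypothetical helper, and treating omitted keys as defaults is an assumption about how you might model the configuration locally.

```python
# Ranges and defaults taken from the limits stated in this section.
LIMITS = {
    "ExecutionTimeout": (60, 31_536_000),   # seconds
    "RetentionPeriodInDays": (1, 90),       # days
}
DEFAULTS = {"ExecutionTimeout": 86_400, "RetentionPeriodInDays": 14}


def validate_durable_config(config):
    """Return a list of problems; an empty list means the config is in range."""
    merged = {**DEFAULTS, **config}   # assumption: omitted keys use defaults
    errors = []
    for key, (low, high) in LIMITS.items():
        value = merged[key]
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{key} must be an integer")
        elif not low <= value <= high:
            errors.append(f"{key} must be between {low} and {high}")
    return errors


print(validate_durable_config({"ExecutionTimeout": 3600, "RetentionPeriodInDays": 30}))   # []
print(validate_durable_config({"ExecutionTimeout": 30}))
print(validate_durable_config({"RetentionPeriodInDays": 91}))
```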

## See also
<a name="durable-basic-concepts-see-also"></a>
+ [Durable functions or Step Functions](durable-step-functions.md) – Compare durable functions with Step Functions to understand when each approach is most effective.

# Creating Lambda durable functions
<a name="durable-getting-started"></a>

To get started with Lambda durable functions, use the Lambda console to create a durable function. In a few minutes, you can create and deploy a durable function that uses steps and waits to demonstrate checkpoint-based execution.

As you carry out the tutorial, you'll learn fundamental durable function concepts, like how to use the `DurableContext` object, create checkpoints with steps, and pause execution with waits. You'll also learn how replay works when your function resumes after a wait.

To keep things simple, this tutorial shows you how to create your function using either the Python or Node.js runtime. With these interpreted languages, you can edit function code directly in the console's built-in code editor.

**Note**  
Durable functions currently support Python, Node.js (JavaScript/TypeScript), and Java runtimes, as well as container images (OCI). For a complete list of supported runtime versions and container image options, see [Supported runtimes for durable functions](durable-supported-runtimes.md). For more information about using container images with Lambda, see [Creating Lambda container images](https://docs.aws.amazon.com/lambda/latest/dg/images-create.html) in the Lambda Developer Guide.

**Tip**  
To learn how to build **serverless solutions**, check out the [Serverless Developer Guide](https://docs.aws.amazon.com/serverless/latest/devguide/).

## Prerequisites
<a name="durable-getting-started-prerequisites"></a>

### Sign up for an AWS account
<a name="sign-up-for-aws"></a>

If you do not have an AWS account, complete the following steps to create one.

**To sign up for an AWS account**

1. Open [https://portal.aws.amazon.com/billing/signup](https://portal.aws.amazon.com/billing/signup).

1. Follow the online instructions.

   Part of the sign-up procedure involves receiving a phone call or text message and entering a verification code on the phone keypad.

   When you sign up for an AWS account, an *AWS account root user* is created. The root user has access to all AWS services and resources in the account. As a security best practice, assign administrative access to a user, and use only the root user to perform [tasks that require root user access](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_root-user.html#root-user-tasks).

AWS sends you a confirmation email after the sign-up process is complete. At any time, you can view your current account activity and manage your account by going to [https://aws.amazon.com/](https://aws.amazon.com/) and choosing **My Account**.

### Create a user with administrative access
<a name="create-an-admin"></a>

After you sign up for an AWS account, secure your AWS account root user, enable AWS IAM Identity Center, and create an administrative user so that you don't use the root user for everyday tasks.

**Secure your AWS account root user**

1. Sign in to the [AWS Management Console](https://console.aws.amazon.com/) as the account owner by choosing **Root user** and entering your AWS account email address. On the next page, enter your password.

   For help signing in by using root user, see [Signing in as the root user](https://docs.aws.amazon.com/signin/latest/userguide/console-sign-in-tutorials.html#introduction-to-root-user-sign-in-tutorial) in the *AWS Sign-In User Guide*.

1. Turn on multi-factor authentication (MFA) for your root user.

   For instructions, see [Enable a virtual MFA device for your AWS account root user (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/enable-virt-mfa-for-root.html) in the *IAM User Guide*.

**Create a user with administrative access**

1. Enable IAM Identity Center.

   For instructions, see [Enabling AWS IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/get-set-up-for-idc.html) in the *AWS IAM Identity Center User Guide*.

1. In IAM Identity Center, grant administrative access to a user.

   For a tutorial about using the IAM Identity Center directory as your identity source, see [Configure user access with the default IAM Identity Center directory](https://docs.aws.amazon.com/singlesignon/latest/userguide/quick-start-default-idc.html) in the *AWS IAM Identity Center User Guide*.

**Sign in as the user with administrative access**
+ To sign in with your IAM Identity Center user, use the sign-in URL that was sent to your email address when you created the IAM Identity Center user.

  For help signing in using an IAM Identity Center user, see [Signing in to the AWS access portal](https://docs.aws.amazon.com/signin/latest/userguide/iam-id-center-sign-in-tutorial.html) in the *AWS Sign-In User Guide*.

**Assign access to additional users**

1. In IAM Identity Center, create a permission set that follows the best practice of applying least-privilege permissions.

   For instructions, see [Create a permission set](https://docs.aws.amazon.com/singlesignon/latest/userguide/get-started-create-a-permission-set.html) in the *AWS IAM Identity Center User Guide*.

1. Assign users to a group, and then assign single sign-on access to the group.

   For instructions, see [Add groups](https://docs.aws.amazon.com/singlesignon/latest/userguide/addgroups.html) in the *AWS IAM Identity Center User Guide*.

## Create a Lambda durable function with the console
<a name="getting-started-create-durable-function"></a>

In this example, your durable function processes an order through multiple steps with automatic checkpointing. The function takes a JSON object containing an order ID, validates the order, processes payment, and confirms the order. Each step is automatically checkpointed, so if the function is interrupted, it resumes from the last completed step.

Your function also demonstrates a wait operation, pausing execution for a short period to simulate waiting for external confirmation.

**To create a durable function with the console**

1. Open the [Functions page](https://console.aws.amazon.com/lambda/home#/functions) of the Lambda console.

1. Choose **Create function**.

1. Select **Author from scratch**.

1. In the **Basic information** pane, for **Function name**, enter `myDurableFunction`.

1. For **Runtime**, choose either **Node.js 24** or **Python 3.14**.

1. Select **Enable durable execution**.

Lambda creates your durable function with an [execution role](lambda-intro-execution-role.md) that includes permissions for checkpoint operations (`lambda:CheckpointDurableExecution` and `lambda:GetDurableExecutionState`).

**Note**  
Lambda runtimes include the Durable Execution SDK, so you can test durable functions without packaging dependencies. However, we recommend including the SDK in your deployment package for production. This ensures version consistency and avoids potential runtime updates that might affect your function.

Use the console's built-in code editor to add your durable function code.

------
#### [ Node.js ]

**To modify the code in the console**

1. Choose the **Code** tab.

   In the console's built-in code editor, you should see the function code that Lambda created. If you don't see the **index.mjs** tab in the code editor, select **index.mjs** in the file explorer as shown in the following image.  
![\[\]](http://docs.aws.amazon.com/lambda/latest/dg/images/durable-nodejs.png)

1. Paste the following code into the **index.mjs** tab, replacing the code that Lambda created.

   ```
   import {
     withDurableExecution,
   } from "@aws/durable-execution-sdk-js";
   
   export const handler = withDurableExecution(
     async (event, context) => {
       const orderId = event.orderId;
       
       // Step 1: Validate order
       const validationResult = await context.step(async (stepContext) => {
         stepContext.logger.info(`Validating order ${orderId}`);
         return { orderId, status: "validated" };
       });
       
       // Step 2: Process payment
       const paymentResult = await context.step(async (stepContext) => {
         stepContext.logger.info(`Processing payment for order ${orderId}`);
         return { orderId, status: "paid", amount: 99.99 };
       });
       
       // Wait for 10 seconds to simulate external confirmation
       await context.wait({ seconds: 10 });
       
       // Step 3: Confirm order
       const confirmationResult = await context.step(async (stepContext) => {
         stepContext.logger.info(`Confirming order ${orderId}`);
         return { orderId, status: "confirmed" };
       });
           
       return {
         orderId: orderId,
         status: "completed",
         steps: [validationResult, paymentResult, confirmationResult]
       };
     }
   );
   ```

1. In the **DEPLOY** section, choose **Deploy** to update your function's code:  
![\[\]](http://docs.aws.amazon.com/lambda/latest/dg/images/getting-started-tutorial/deploy-console.png)

**Understanding your durable function code**  
Before you move to the next step, let's look at the function code and understand key durable function concepts.
+ The `withDurableExecution` wrapper:

  Your durable function is wrapped with `withDurableExecution`. This wrapper enables durable execution by providing the `DurableContext` object and managing checkpoint operations.
+ The `DurableContext` object:

  Instead of the standard Lambda context, your function receives a `DurableContext`. This object provides methods for durable operations like `step()` and `wait()` that create checkpoints.
+ Steps and checkpoints:

  Each `context.step()` call creates a checkpoint before and after execution. If your function is interrupted, it resumes from the last completed checkpoint. The function doesn't re-execute completed steps. It uses their stored results instead.
+ Wait operations:

  The `context.wait()` call pauses execution without consuming compute resources. When the wait completes, Lambda invokes your function again and replays the checkpoint log, substituting stored values for completed steps.
+ Replay mechanism:

  When your function resumes after a wait or interruption, Lambda runs your code from the beginning. However, completed steps don't re-execute. Lambda replays their results from the checkpoint log. This is why your code must be deterministic.

------
#### [ Python ]

**To modify the code in the console**

1. Choose the **Code** tab.

   In the console's built-in code editor, you should see the function code that Lambda created. If you don't see the **lambda\_function.py** tab in the code editor, select **lambda\_function.py** in the file explorer as shown in the following image.  
![\[\]](http://docs.aws.amazon.com/lambda/latest/dg/images/durable-python.png)

1. Paste the following code into the **lambda\_function.py** tab, replacing the code that Lambda created.

   ```
   from aws_durable_execution_sdk_python import (
       DurableContext,
       durable_execution,
       durable_step,
   )
   from aws_durable_execution_sdk_python.config import Duration
   
   @durable_step
   def validate_order(step_context, order_id):
       step_context.logger.info(f"Validating order {order_id}")
       return {"orderId": order_id, "status": "validated"}
   
   @durable_step
   def process_payment(step_context, order_id):
       step_context.logger.info(f"Processing payment for order {order_id}")
       return {"orderId": order_id, "status": "paid", "amount": 99.99}
   
   @durable_step
   def confirm_order(step_context, order_id):
       step_context.logger.info(f"Confirming order {order_id}")
       return {"orderId": order_id, "status": "confirmed"}
   
   @durable_execution
   def lambda_handler(event, context: DurableContext):
       order_id = event['orderId']
       
       # Step 1: Validate order
       validation_result = context.step(validate_order(order_id))
       
       # Step 2: Process payment
       payment_result = context.step(process_payment(order_id))
       
       # Wait for 10 seconds to simulate external confirmation
       context.wait(Duration.from_seconds(10))
       
       # Step 3: Confirm order
       confirmation_result = context.step(confirm_order(order_id))
           
       return {
           "orderId": order_id,
           "status": "completed",
           "steps": [validation_result, payment_result, confirmation_result]
       }
   ```

1. In the **DEPLOY** section, choose **Deploy** to update your function's code:  
![\[\]](http://docs.aws.amazon.com/lambda/latest/dg/images/getting-started-tutorial/deploy-console.png)

**Understanding your durable function code**  
Before you move to the next step, let's look at the function code and understand key durable function concepts.
+ The `@durable_execution` decorator:

  Your handler function is decorated with `@durable_execution`. This decorator enables durable execution by providing the `DurableContext` object and managing checkpoint operations.
+ The `@durable_step` decorator:

  Each step function is decorated with `@durable_step`. This decorator marks the function as a durable step that creates checkpoints.
+ The `DurableContext` object:

  Instead of the standard Lambda context, your function receives a `DurableContext`. This object provides methods for durable operations like `step()` and `wait()` that create checkpoints.
+ Steps and checkpoints:

  Each `context.step()` call creates a checkpoint before and after execution. If your function is interrupted, it resumes from the last completed checkpoint. The function doesn't re-execute completed steps. It uses their stored results instead.
+ Wait operations:

  The `context.wait()` call pauses execution without consuming compute resources. When the wait completes, Lambda invokes your function again and replays the checkpoint log, substituting stored values for completed steps.
+ Python SDK is synchronous:

  Note that the Python SDK doesn't use `await`. All durable operations are synchronous method calls.

------

## Invoke the durable function using the console code editor
<a name="get-started-invoke-durable-manually"></a>

When no explicit version is specified (or published), the console invokes the durable function using the `$LATEST` version qualifier. However, for deterministic execution of your code, always use a qualified ARN that points to a stable version. A published version is immutable, so every replay runs the same code that started the execution.

**To publish a version of your function**

1. Choose the **Versions** tab.

1. Choose **Publish new version**.

1. (Optional) For **Version description**, enter **Initial version**.

1. Choose **Publish**.

1. Lambda creates version 1 of your function. Note that the function ARN now includes `:1` at the end, indicating this is version 1.

Now create a test event to send to your function. The event is a JSON-formatted document containing an order ID.

**To create the test event**

1. In the **TEST EVENTS** section of the console code editor, choose **Create test event**.  
![\[\]](http://docs.aws.amazon.com/lambda/latest/dg/images/getting-started-tutorial/test-event.png)

1. For **Event Name**, enter **myTestEvent**.

1. In the **Event JSON** section, replace the default JSON with the following:

   ```
   {
     "orderId": "order-12345"
   }
   ```

1. Choose **Save**.

**To test your durable function and view execution**

In the **TEST EVENTS** section of the console code editor, choose the run icon next to your test event:

![\[\]](http://docs.aws.amazon.com/lambda/latest/dg/images/getting-started-tutorial/run-test-event.png)


Your durable function starts executing. Because it includes a 10-second wait, the initial invocation completes quickly, and the function resumes after the wait period. You can view the execution progress in the **Durable executions** tab.

**To view your durable function execution**

1. Choose the **Durable executions** tab.

1. Find your execution in the list. The execution shows the current status (Running, Succeeded, or Failed).

1. Choose the execution ID to view details, including:
   + Execution timeline showing when each step completed
   + Checkpoint history
   + Wait periods
   + Step results

You can also view your function's logs in CloudWatch Logs to see the console output from each step.

**To view your function's invocation records in CloudWatch Logs**

1. Open the [Log groups](https://console.aws.amazon.com/cloudwatch/home#logs:) page of the CloudWatch console.

1. Choose the log group for your function (`/aws/lambda/myDurableFunction`).

1. Scroll down and choose the **Log stream** for the function invocations you want to look at.  
![\[\]](http://docs.aws.amazon.com/lambda/latest/dg/images/log-stream.png)

   You should see log entries for each invocation of your function, including the initial execution and the replay after the wait.

**Note**  
When you use the logger from the `DurableContext` (such as `context.logger` or `stepContext.logger`), logs also appear in the durable execution and step views in the Lambda console. These logs may take a moment to load.

## Clean up
<a name="gettingstarted-durable-cleanup"></a>

When you're finished working with the example durable function, delete it. You can also delete the log group that stores the function's logs, and the [execution role](lambda-intro-execution-role.md) that the console created.

**To delete the Lambda function**

1. Open the [Functions page](https://console.aws.amazon.com/lambda/home#/functions) of the Lambda console.

1. Select the function that you created.

1. Choose **Actions**, **Delete**.

1. Type **confirm** in the text input field and choose **Delete**.

**To delete the log group**

1. Open the [Log groups](https://console.aws.amazon.com/cloudwatch/home#logs:) page of the CloudWatch console.

1. Select the function's log group (`/aws/lambda/myDurableFunction`).

1. Choose **Actions**, **Delete log group(s)**.

1. In the **Delete log group(s)** dialog box, choose **Delete**.

**To delete the execution role**

1. Open the [Roles page](https://console.aws.amazon.com/iam/home?#/roles) of the AWS Identity and Access Management (IAM) console.

1. Select the function's execution role (for example, `myDurableFunction-role-31exxmpl`).

1. Choose **Delete**.

1. In the **Delete role** dialog box, enter the role name, and then choose **Delete**.

## Additional resources and next steps
<a name="durable-getting-started-more-resources"></a>

Now that you've created and tested a simple durable function using the console, take these next steps:
+ Learn about common use cases for durable functions, including distributed transactions, order processing, and human review workflows. See [Examples](durable-examples.md).
+ Understand how to monitor durable function executions with CloudWatch metrics and execution history. See [Monitoring and debugging](durable-monitoring.md).
+ Learn about invoking durable functions synchronously and asynchronously, and managing long-running executions. See [Invoking durable functions](durable-invoking.md).
+ Follow best practices for writing deterministic code, managing checkpoint sizes, and optimizing costs. See [Best practices](durable-best-practices.md).
+ Learn how to test durable functions locally and in the cloud. See [Testing durable functions](durable-testing.md).
+ Compare durable functions with Step Functions to understand when each approach is most effective. See [Durable functions or Step Functions](durable-step-functions.md).

# Deploy and invoke Lambda durable functions with the AWS CLI
<a name="durable-getting-started-cli"></a>

Use the AWS CLI to create and deploy Lambda durable functions with imperative commands. This approach gives you direct control over each step of the deployment process.

## Prerequisites
<a name="durable-cli-prerequisites"></a>
+ Install and configure the AWS CLI. For instructions, see [Installing the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
+ Create a deployment package with your function code and the durable execution SDK.
+ Create an IAM execution role with checkpoint permissions.

## Create the execution role
<a name="durable-cli-create-role"></a>

Create an IAM role with permissions for basic Lambda execution and checkpoint operations.

**To create the execution role**

1. Create a trust policy document that allows Lambda to assume the role. Save this as `trust-policy.json`:

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": "lambda.amazonaws.com"
         },
         "Action": "sts:AssumeRole"
       }
     ]
   }
   ```

1. Create the role:

   ```
   aws iam create-role \
     --role-name durable-function-role \
     --assume-role-policy-document file://trust-policy.json
   ```

1. Attach the durable execution policy for checkpoint operations and basic execution:

   ```
   aws iam attach-role-policy \
     --role-name durable-function-role \
     --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicDurableExecutionRolePolicy
   ```

The `AWSLambdaBasicDurableExecutionRolePolicy` managed policy includes the required permissions for checkpoint operations (`lambda:CheckpointDurableExecution` and `lambda:GetDurableExecutionState`) and basic Lambda execution.

## Create the durable function
<a name="durable-cli-create-function"></a>

Create your durable function with the `--durable-config` parameter.

**To create a durable function**

1. Package your function code with dependencies into a .zip file:

   ```
   zip -r function.zip index.mjs node_modules/
   ```
**Note**  
For Java-based durable functions, you need to compile your function code and dependencies into a single .zip file or Java Archive (JAR) file. For more information, see [Deploy Java Lambda functions with .zip or JAR file archives](java-package.md).

1. Create the function with durable execution enabled:

   ```
   aws lambda create-function \
     --function-name myDurableFunction \
     --runtime nodejs22.x \
     --role arn:aws:iam::123456789012:role/durable-function-role \
     --handler index.handler \
     --zip-file fileb://function.zip \
     --durable-config '{"ExecutionTimeout": 3600, "RetentionPeriodInDays": 7}'
   ```

**Note**  
You can only enable durable execution when creating the function. You cannot enable it on existing functions.

## Publish a version
<a name="durable-cli-publish-version"></a>

While durable functions can be invoked using the `$LATEST` version qualifier, you must always use a qualified ARN pointing to a stable version to ensure deterministic execution of your code.

```
aws lambda publish-version \
  --function-name myDurableFunction \
  --description "Initial version"
```

The command returns the version ARN. Note the version number (for example, `:1`) at the end of the ARN.
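
If you script this step, you may want to pull the version number out of the returned ARN. The following Python sketch assumes the standard Lambda function ARN layout (`arn:partition:lambda:region:account-id:function:name[:qualifier]`). The helper name `qualifier_from_arn` is hypothetical.

```python
from typing import Optional


def qualifier_from_arn(arn: str) -> Optional[str]:
    """Return the version or alias at the end of a Lambda function ARN, if any."""
    parts = arn.split(":")
    # Expected fields: arn, partition, lambda, region, account, function, name[, qualifier]
    if len(parts) not in (7, 8) or parts[2] != "lambda" or parts[5] != "function":
        raise ValueError(f"not a Lambda function ARN: {arn!r}")
    return parts[7] if len(parts) == 8 else None


version = qualifier_from_arn(
    "arn:aws:lambda:us-east-1:123456789012:function:myDurableFunction:1"
)
```

An unqualified ARN yields `None`, which a deployment script can treat as a signal that no version has been published yet.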

Optionally, create an alias that points to the version:

```
aws lambda create-alias \
  --function-name myDurableFunction \
  --name prod \
  --function-version 1
```

## Invoke the durable function
<a name="durable-cli-invoke"></a>

Invoke your durable function using the qualified ARN (version or alias).

**Note**  
**Idempotent invocations:** To prevent duplicate executions when retrying failed invocations, you can provide an execution name that ensures at-most-once execution semantics. See [Idempotency](durable-execution-idempotency.md) for details.

**Synchronous invocation**  
For executions that complete within 15 minutes, use synchronous invocation:

```
aws lambda invoke \
  --function-name myDurableFunction:1 \
  --payload '{"orderId": "order-12345"}' \
  --cli-binary-format raw-in-base64-out \
  response.json
```

Or using an alias:

```
aws lambda invoke \
  --function-name myDurableFunction:prod \
  --payload '{"orderId": "order-12345"}' \
  --cli-binary-format raw-in-base64-out \
  response.json
```

**Asynchronous invocation**  
For long-running executions, use asynchronous invocation:

```
aws lambda invoke \
  --function-name myDurableFunction:prod \
  --invocation-type Event \
  --payload '{"orderId": "order-12345"}' \
  --cli-binary-format raw-in-base64-out \
  response.json
```

With asynchronous invocation, Lambda returns immediately. The function continues executing in the background.

**Note**  
You can use `$LATEST` for prototyping and testing in the console. For production workloads, use a published version or alias.
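
If you invoke from Python rather than the CLI, the same choices map onto the `FunctionName`, `Qualifier`, `InvocationType`, and `Payload` parameters of the boto3 Lambda `invoke` call. The helper below is a sketch only: `build_invoke_kwargs` is a hypothetical name, and rejecting `$LATEST` is a policy choice of this example that follows the guidance above, not a restriction that Lambda imposes.

```python
import json
from typing import Any, Dict


def build_invoke_kwargs(
    function_name: str,
    qualifier: str,
    payload: Dict[str, Any],
    asynchronous: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for a boto3 Lambda client invoke() call."""
    if not qualifier or qualifier == "$LATEST":
        raise ValueError("use a published version or alias, not $LATEST")
    return {
        "FunctionName": function_name,
        "Qualifier": qualifier,  # version number or alias name
        # "Event" returns immediately; "RequestResponse" waits for the result
        "InvocationType": "Event" if asynchronous else "RequestResponse",
        "Payload": json.dumps(payload).encode("utf-8"),
    }


kwargs = build_invoke_kwargs(
    "myDurableFunction", "prod", {"orderId": "order-12345"}, asynchronous=True
)
# With boto3 installed and credentials configured, you could then call:
#   boto3.client("lambda").invoke(**kwargs)
```

Keeping the argument construction separate from the network call makes the qualifier check easy to unit test without AWS credentials.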

## Manage durable executions
<a name="durable-cli-manage-executions"></a>

Use the following commands to manage and monitor durable function executions.

**List executions**  
List all executions for a durable function:

```
aws lambda list-durable-executions-by-function \
  --function-name myDurableFunction
```

**Get execution details**  
Get details about a specific execution:

```
aws lambda get-durable-execution \
  --durable-execution-arn arn:aws:lambda:us-east-1:123456789012:function:myDurableFunction:my-function-version/durable-execution/my-execution-name/my-execution-id
```

**Get execution history**  
View the checkpoint history for an execution:

```
aws lambda get-durable-execution-history \
  --durable-execution-arn arn:aws:lambda:us-east-1:123456789012:function:myDurableFunction:my-function-version/durable-execution/my-execution-name/my-execution-id
```

**Stop an execution**  
Stop a running durable execution:

```
aws lambda stop-durable-execution \
  --durable-execution-arn arn:aws:lambda:us-east-1:123456789012:function:myDurableFunction:my-function-version/durable-execution/my-execution-name/my-execution-id
```
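
The execution ARNs in these commands carry the function name, qualifier, execution name, and execution ID. If a script needs those pieces, a parser along the following lines might work. The layout is inferred only from the placeholder ARNs shown in this section, so treat it as an assumption rather than a documented format. `DurableExecutionRef` and `parse_durable_execution_arn` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurableExecutionRef:
    region: str
    account_id: str
    function_name: str
    qualifier: str
    execution_name: str
    execution_id: str


def parse_durable_execution_arn(arn: str) -> DurableExecutionRef:
    """Split a durable execution ARN into its parts (layout assumed from the examples)."""
    parts = arn.split(":", 7)
    if len(parts) != 8 or parts[2] != "lambda" or parts[5] != "function":
        raise ValueError(f"not a Lambda ARN: {arn!r}")
    # Assumed tail: <qualifier>/durable-execution/<execution-name>/<execution-id>
    segments = parts[7].split("/")
    if len(segments) != 4 or segments[1] != "durable-execution":
        raise ValueError(f"not a durable execution ARN: {arn!r}")
    qualifier, _, execution_name, execution_id = segments
    return DurableExecutionRef(
        region=parts[3],
        account_id=parts[4],
        function_name=parts[6],
        qualifier=qualifier,
        execution_name=execution_name,
        execution_id=execution_id,
    )


ref = parse_durable_execution_arn(
    "arn:aws:lambda:us-east-1:123456789012:function:myDurableFunction:"
    "my-function-version/durable-execution/my-execution-name/my-execution-id"
)
```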

## Update function code
<a name="durable-cli-update-function"></a>

Update your durable function code and publish a new version:

**To update and publish a new version**

1. Update the function code:

   ```
   aws lambda update-function-code \
     --function-name myDurableFunction \
     --zip-file fileb://function.zip
   ```

1. Wait for the update to complete:

   ```
   aws lambda wait function-updated \
     --function-name myDurableFunction
   ```

1. Publish a new version:

   ```
   aws lambda publish-version \
     --function-name myDurableFunction \
     --description "Updated order processing logic"
   ```

1. Update the alias to point to the new version:

   ```
   aws lambda update-alias \
     --function-name myDurableFunction \
     --name prod \
     --function-version 2
   ```

**Important**  
Running executions continue using the version they started with. New invocations use the updated alias version.

## View function logs
<a name="durable-cli-view-logs"></a>

View your durable function's logs in CloudWatch Logs:

```
aws logs tail /aws/lambda/myDurableFunction --follow
```

Filter logs for a specific execution:

```
aws logs filter-log-events \
  --log-group-name /aws/lambda/myDurableFunction \
  --filter-pattern "exec-abc123"
```

## Clean up resources
<a name="durable-cli-cleanup"></a>

Delete your durable function and associated resources:

```
# Delete the function
aws lambda delete-function --function-name myDurableFunction

# Detach the managed policy from the role
aws iam detach-role-policy \
  --role-name durable-function-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicDurableExecutionRolePolicy

# Delete the role
aws iam delete-role --role-name durable-function-role
```

## Next steps
<a name="durable-cli-next-steps"></a>

After deploying your durable function with the AWS CLI:
+ Monitor executions using the `list-durable-executions-by-function` and `get-durable-execution` commands
+ View checkpoint operations in AWS CloudTrail data events
+ Set up CloudWatch alarms for execution failures or long-running executions
+ Automate deployments using shell scripts or CI/CD pipelines

For more information about AWS CLI commands for Lambda, see the [AWS CLI Command Reference](https://docs.aws.amazon.com/cli/latest/reference/lambda/index.html).

# Deploy Lambda durable functions with Infrastructure as Code
<a name="durable-getting-started-iac"></a>

You can deploy Lambda durable functions using Infrastructure as Code (IaC) tools like AWS CloudFormation, AWS CDK, AWS Serverless Application Model, or Terraform. These tools let you define your function, execution role, and permissions in code, making deployments repeatable and version-controlled.

All of these tools require you to:
+ Enable durable execution on the function
+ Grant checkpoint permissions to the execution role
+ Publish a version or create an alias (durable functions require qualified ARNs)

## Durable functions from a ZIP
<a name="durable-iac-zip"></a>

### AWS CloudFormation
<a name="durable-iac-cloudformation"></a>

Use CloudFormation to define your durable function in a template. The following example creates a durable function with the required permissions.

```
AWSTemplateFormatVersion: '2010-09-09'
Description: Lambda durable function example

Resources:
  DurableFunctionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicDurableExecutionRolePolicy

  DurableFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: myDurableFunction
      Runtime: nodejs22.x
      Handler: index.handler
      Role: !GetAtt DurableFunctionRole.Arn
      Code:
        ZipFile: |
          // Your durable function code here
          export const handler = async (event, context) => {
            return { statusCode: 200 };
          };
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7

  DurableFunctionVersion:
    Type: AWS::Lambda::Version
    Properties:
      FunctionName: !Ref DurableFunction
      Description: Initial version

  DurableFunctionAlias:
    Type: AWS::Lambda::Alias
    Properties:
      FunctionName: !Ref DurableFunction
      FunctionVersion: !GetAtt DurableFunctionVersion.Version
      Name: prod

Outputs:
  FunctionArn:
    Description: Durable function ARN
    Value: !GetAtt DurableFunction.Arn
  AliasArn:
    Description: Function alias ARN (use this for invocations)
    Value: !Ref DurableFunctionAlias
```

**To deploy the template**

```
aws cloudformation deploy \
  --template-file template.yaml \
  --stack-name my-durable-function-stack \
  --capabilities CAPABILITY_IAM
```

### AWS CDK
<a name="durable-iac-cdk"></a>

AWS CDK lets you define infrastructure using programming languages. The following examples show how to create a durable function using TypeScript and Python.

------
#### [ TypeScript ]

```
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class DurableFunctionStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create the durable function
    const durableFunction = new lambda.Function(this, 'DurableFunction', {
      runtime: lambda.Runtime.NODEJS_22_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      functionName: 'myDurableFunction',
      durableConfig: { executionTimeout: cdk.Duration.hours(1), retentionPeriod: cdk.Duration.days(30) },
    });

    // Create version and alias
    const version = durableFunction.currentVersion;
    const alias = new lambda.Alias(this, 'ProdAlias', {
      aliasName: 'prod',
      version: version,
    });

    // Output the alias ARN
    new cdk.CfnOutput(this, 'FunctionAliasArn', {
      value: alias.functionArn,
      description: 'Use this ARN to invoke the durable function',
    });
  }
}
```

------
#### [ Python ]

```
from aws_cdk import (
    Stack,
    Duration,
    aws_lambda as lambda_,
    aws_iam as iam,
    CfnOutput,
)
from constructs import Construct

class DurableFunctionStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs):
        super().__init__(scope, id, **kwargs)

        # Create the durable function
        durable_function = lambda_.Function(
            self, 'DurableFunction',
            runtime=lambda_.Runtime.NODEJS_22_X,
            handler='index.handler',
            code=lambda_.Code.from_asset('lambda'),
            function_name='myDurableFunction',
            durable_config={'execution_timeout': Duration.hours(1), 'retention_period': Duration.days(30)},
        )

        # Add durable execution managed policy for checkpoint permissions
        durable_function.role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name('service-role/AWSLambdaBasicDurableExecutionRolePolicy')
        )

        # Create version and alias
        version = durable_function.current_version
        alias = lambda_.Alias(
            self, 'ProdAlias',
            alias_name='prod',
            version=version
        )

        # Output the alias ARN
        CfnOutput(
            self, 'FunctionAliasArn',
            value=alias.function_arn,
            description='Use this ARN to invoke the durable function'
        )
```

------

**To deploy the CDK stack**

```
cdk deploy
```

### AWS Serverless Application Model
<a name="durable-iac-sam"></a>

AWS SAM simplifies CloudFormation templates for serverless applications. The following template creates a durable function with AWS SAM.

```
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Lambda durable function with SAM

Resources:
  DurableFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: myDurableFunction
      Runtime: nodejs22.x
      Handler: index.handler
      CodeUri: ./src
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7
      Policies:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicDurableExecutionRolePolicy
      AutoPublishAlias: prod

Outputs:
  FunctionArn:
    Description: Durable function ARN
    Value: !GetAtt DurableFunction.Arn
  AliasArn:
    Description: Function alias ARN (use this for invocations)
    Value: !Ref DurableFunction.Alias
```

**To deploy the SAM template**

```
sam build
sam deploy --guided
```

### Terraform
<a name="durable-iac-terraform"></a>

Terraform is a popular open-source IaC tool that supports AWS resources. The following example creates a durable function with Terraform using the AWS provider version 6.25.0 or later.

```
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 6.25.0"
    }
  }
}

provider "aws" {
  region = "us-east-2"
}

# IAM Role for Lambda Function
resource "aws_iam_role" "lambda_role" {
  name = "durable-function-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "lambda.amazonaws.com"
      }
    }]
  })
}

# Attach durable execution policy for checkpoint operations
resource "aws_iam_role_policy_attachment" "lambda_durable" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicDurableExecutionRolePolicy"
  role       = aws_iam_role.lambda_role.name
}

# Lambda Function with Durable Execution enabled
resource "aws_lambda_function" "durable_function" {
  filename      = "function.zip"
  function_name = "myDurableFunction"
  role          = aws_iam_role.lambda_role.arn
  handler       = "index.handler"
  runtime       = "nodejs22.x"
  timeout       = 30
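  memory_size   = 512
  publish       = true # Publish a numbered version so the alias doesn't target $LATEST

  durable_config {
    execution_timeout = 900
    retention_period  = 7
  }
}

# Alias that points to the published version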
resource "aws_lambda_alias" "prod" {
  name             = "prod"
  function_name    = aws_lambda_function.durable_function.function_name
  function_version = aws_lambda_function.durable_function.version
}

output "function_arn" {
  description = "ARN of the Lambda function"
  value       = aws_lambda_function.durable_function.arn
}

output "alias_arn" {
  description = "ARN of the function alias (use this for invocations)"
  value       = aws_lambda_alias.prod.arn
}
```

**To deploy with Terraform**

```
terraform init
terraform plan
terraform apply
```

**Note**  
Terraform support for Lambda durable functions requires AWS provider version 6.25.0 or later. Update your provider version if you're using an older version.

## Durable functions from an OCI container image
<a name="durable-iac-oci"></a>

You can also create durable functions from container images. For instructions on how to build a container image, see [Supported runtimes for durable functions](durable-supported-runtimes.md).

### AWS CDK
<a name="durable-iac-oci-cdk"></a>

AWS CDK lets you define infrastructure using programming languages. The following example shows how to create a durable function from a container image using TypeScript.

```
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class DurableFunctionStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create the durable function
    const durableFunction = new lambda.DockerImageFunction(this, 'DurableFunction', {
      code: lambda.DockerImageCode.fromImageAsset('./lambda', {
        platform: cdk.aws_ecr_assets.Platform.LINUX_AMD64,
      }),
      functionName: 'myDurableFunction',
      memorySize: 512,
      timeout: cdk.Duration.seconds(30),
      durableConfig: { executionTimeout: cdk.Duration.hours(1), retentionPeriod: cdk.Duration.days(30) },
    });

    // Create version and alias
    const version = durableFunction.currentVersion;
    const alias = new lambda.Alias(this, 'ProdAlias', {
      aliasName: 'prod',
      version: version,
    });

    // Output the alias ARN
    new cdk.CfnOutput(this, 'FunctionAliasArn', {
      value: alias.functionArn,
      description: 'Use this ARN to invoke the durable function',
    });
  }
}
```

**To deploy the CDK stack**

```
cdk deploy
```

### AWS Serverless Application Model
<a name="durable-iac-oci-sam"></a>

AWS SAM simplifies CloudFormation templates for serverless applications. The following template creates a durable function from a container image with AWS SAM.

```
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Lambda durable function with SAM

Resources:
  DurableFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: myDurableFunction
      PackageType: Image
      ImageUri: ./src
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7
      Policies:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicDurableExecutionRolePolicy
      AutoPublishAlias: prod
    Metadata:
      DockerTag: latest
      DockerContext: ./src
      Dockerfile: Dockerfile

Outputs:
  FunctionArn:
    Description: Durable function ARN
    Value: !GetAtt DurableFunction.Arn
  AliasArn:
    Description: Function alias ARN (use this for invocations)
    Value: !Ref DurableFunction.Alias
```

**To deploy the SAM template**

```
sam build
sam deploy --guided
```

## Common configuration patterns
<a name="durable-iac-common-patterns"></a>

Regardless of which IaC tool you use, follow these patterns for durable functions:

**Enable durable execution**  
Set the `DurableConfig` property on your function to enable durable execution. This property is only available when creating the function. You cannot enable durable execution on existing functions.

**Grant checkpoint permissions**  
Attach the `AWSLambdaBasicDurableExecutionRolePolicy` managed policy to the execution role. This policy includes the required `lambda:CheckpointDurableExecution` and `lambda:GetDurableExecutionState` permissions.

**Use qualified ARNs**  
Create a version or alias for your function. Durable functions require qualified ARNs (with version or alias) for invocation. Use `AutoPublishAlias` in AWS SAM or create explicit versions in CloudFormation, AWS CDK, and Terraform.

**Package dependencies**  
Include the durable execution SDK in your deployment package. For Node.js, install `@aws/durable-execution-sdk-js`. For Python, install `aws-durable-execution-sdk-python`.

## Next steps
<a name="durable-iac-next-steps"></a>

After deploying your durable function:
+ Test your function using the qualified ARN (version or alias)
+ Monitor execution progress in the Lambda console under the Durable executions tab
+ View checkpoint operations in AWS CloudTrail data events
+ Review CloudWatch Logs for function output and replay behavior

For more information about deploying Lambda functions with IaC tools, see:
+ [CloudFormation AWS::Lambda::Function reference](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html)
+ [AWS CDK Lambda module documentation](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_lambda-readme.html)
+ [AWS SAM Developer Guide](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html)

# Configure Lambda durable functions
<a name="durable-configuration"></a>

Durable execution settings control how long your Lambda function can run and how long the service retains execution history. Configure these settings to enable durable execution for your function.

## Enable durable execution
<a name="durable-config-settings"></a>

Configure the `DurableConfig` object when creating your function to set execution timeout and history retention. You can only enable durable execution when creating a function. You cannot enable it on existing functions.

------
#### [ AWS CLI ]

```
aws lambda create-function \
  --function-name my-durable-function \
  --runtime nodejs24.x \
  --role arn:aws:iam::123456789012:role/my-durable-role \
  --handler index.handler \
  --zip-file fileb://function.zip \
  --durable-config '{"ExecutionTimeout": 3600, "RetentionPeriodInDays": 30}'
```

------
#### [ CloudFormation ]

```
Resources:
  MyDurableFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: my-durable-function
      Runtime: nodejs24.x
      Handler: index.handler
      Role: arn:aws:iam::123456789012:role/my-durable-role
      Code:
        ZipFile: |
          // Your durable function code
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 30
```

------

**Configuration parameters:**
+ `ExecutionTimeout` – The maximum time in seconds that a durable execution can run before Lambda stops the execution. This timeout applies to the entire durable execution, not individual function invocations. Valid range: 1–31622400.
+ `RetentionPeriodInDays` – The number of days to retain execution history after a durable execution completes. After this period, execution history is no longer available through the `GetDurableExecutionHistory` API. Valid range: 1–90.

For the full API reference, see [DurableConfig](https://docs.aws.amazon.com/lambda/latest/api/API_DurableConfig.html) in the Lambda API Reference.
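
Because both parameters have fixed valid ranges, it can help to check values before you pass them to `create-function`. The following Python sketch applies the ranges listed above. The helper name `build_durable_config` is hypothetical, and the range check is a local convenience that mirrors this page, not a replacement for the service's own validation.

```python
import json
from datetime import timedelta
from typing import Dict

MAX_EXECUTION_TIMEOUT_SECONDS = 31_622_400  # upper bound listed above (366 days)
MAX_RETENTION_DAYS = 90                     # upper bound listed above


def build_durable_config(execution_timeout: timedelta, retention_days: int) -> Dict[str, int]:
    """Return a DurableConfig-shaped dict after checking the documented ranges."""
    seconds = int(execution_timeout.total_seconds())
    if not 1 <= seconds <= MAX_EXECUTION_TIMEOUT_SECONDS:
        raise ValueError(f"ExecutionTimeout out of range: {seconds}")
    if not 1 <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError(f"RetentionPeriodInDays out of range: {retention_days}")
    return {"ExecutionTimeout": seconds, "RetentionPeriodInDays": retention_days}


config = build_durable_config(timedelta(hours=1), retention_days=30)
# JSON string suitable for the --durable-config option of the AWS CLI
cli_argument = json.dumps(config)
```

Expressing the timeout as a `timedelta` keeps the unit conversion to seconds in one place, which avoids mistakes when workflows are specified in hours or days.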

## Configuration best practices
<a name="durable-config-best-practices"></a>

Follow these best practices when configuring durable functions for production use:
+ **Set appropriate execution timeouts** – Configure `ExecutionTimeout` based on your workflow's maximum expected duration. Do not set unnecessarily long timeouts as they affect cost and resource allocation.
+ **Balance retention with storage costs** – Set `RetentionPeriodInDays` based on your debugging and audit requirements. Longer retention periods increase storage costs.
+ **Monitor state size** – Large state objects increase storage costs and can impact performance. Keep state minimal and use external storage for large data.
+ **Configure appropriate logging** – Enable detailed logging for troubleshooting long-running workflows, but consider the impact on log volume and costs.

**Production configuration example:**

```
{
  "ExecutionTimeout": 86400,
  "RetentionPeriodInDays": 7
}
```

This example sets a 24-hour (86,400-second) execution timeout with a 7-day retention period, which balances debugging visibility with storage costs for most production workloads.

# Durable functions or Step Functions
<a name="durable-step-functions"></a>

Both Lambda durable functions and AWS Step Functions enable reliable workflow orchestration with automatic state management and failure recovery. They serve different developer preferences and architectural patterns. Durable functions are optimized for application development within Lambda, while Step Functions is built for workflow orchestration across AWS services.

## When to use durable functions
<a name="durable-sfn-when-durable"></a>

Use durable functions when:
+ Your team prefers standard programming languages and familiar development tools
+ Your application logic is primarily within Lambda functions
+ You want fine-grained control over execution state in code
+ You're building Lambda-centric applications with tight coupling between workflow and business logic
+ You want to iterate quickly without switching between code and visual/JSON designers

## When to use Step Functions
<a name="durable-sfn-when-step"></a>

Use Step Functions when:
+ You need visual workflow representation for cross-team visibility
+ You're orchestrating multiple AWS services and want native integrations without custom SDK code
+ You require zero-maintenance infrastructure (no patching, runtime updates)
+ Non-technical stakeholders need to understand and validate workflow logic

## Decision framework
<a name="durable-sfn-decision-framework"></a>

Use the following questions to determine which service fits your use case:
+ **What's your primary focus?** Application development in Lambda → durable functions. Workflow orchestration across AWS → Step Functions.
+ **What's your preferred programming model?** Standard programming languages → durable functions. Graph-based DSL or visual designer → Step Functions.
+ **How many AWS services are involved?** Primarily Lambda → durable functions. Multiple AWS services → Step Functions.
+ **What development tools do you use?** Lambda developer experience, IDE with LLM agent, programming language-specific unit test frameworks, AWS SAM, AWS CDK, AWS Toolkit → durable functions. Visual workflow builder, AWS CDK to model workflows → Step Functions.
+ **Who manages the infrastructure?** Want flexibility within Lambda → durable functions. Want fully managed, zero-maintenance → Step Functions.

## Feature comparison
<a name="durable-sfn-comparison"></a>

The following table compares key features between Step Functions and Lambda durable functions:


| Feature | AWS Step Functions | Lambda durable functions | 
| --- | --- | --- | 
| Primary focus | Workflow orchestration across AWS | Application development in Lambda | 
| Service type | Standalone, dedicated workflow service | Runs within Lambda | 
| Programming model | Graph-based, Amazon States Language DSL or AWS CDK | Standard programming languages (JavaScript/TypeScript, Python) | 
| Development tools | Visual builder in Console / AWS Toolkit IDE extension, AWS CDK | Lambda DX within IDE and LLM agents, unit test frameworks, AWS SAM, AWS Toolkit IDE extension | 
| Integrations | 220+ AWS services, 16k APIs | Lambda event-driven programming model extension (event sources) | 
| Management | Fully managed, runtime agnostic, zero maintenance (no patching, runtime updates) | Managed within Lambda environment | 
| Best for | Business process and IT automation, data processing, AI workflows | Distributed transactions, stateful application logic, function orchestration, data processing, AI workflows | 

## Hybrid architectures
<a name="durable-sfn-hybrid"></a>

Many applications benefit from using both services. A common pattern is using durable functions for application-level logic within Lambda, while Step Functions coordinates high-level workflows across multiple AWS services beyond Lambda functions.

## Migration considerations
<a name="durable-sfn-migration"></a>

**Starting simple, evolving complex:** Begin with durable functions for Lambda-centric workflows. Add Step Functions when you need multi-service orchestration or visual workflow design.

**Existing Step Functions users:** Keep Step Functions for established cross-service workflows. Consider durable functions for new Lambda application logic that needs reliability.

## Related resources
<a name="durable-sfn-related"></a>
+ [Lambda durable functions](durable-functions.md)
+ [Orchestrating Lambda functions with Step Functions](with-step-functions.md)
+ [Getting started with durable functions](durable-getting-started.md)

# Examples and use cases
<a name="durable-examples"></a>

Lambda durable functions enable you to build fault-tolerant, multi-step applications using durable operations like steps and waits. With automatic checkpointing and a checkpoint-replay model, where execution restarts from the beginning after a failure but skips completed checkpoints, your functions can recover from failures and resume execution without losing progress.
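The checkpoint-replay model can be pictured with a small in-memory simulation. The sketch below is not the durable execution SDK: `ToyContext`, its `checkpoints` dictionary, and the `reserve` and `flaky_charge` functions are hypothetical names used only to illustrate how a replay skips steps that already completed.

```python
from typing import Any, Callable, Dict


class ToyContext:
    """Illustrative stand-in for a durable context (not the real SDK)."""

    def __init__(self, checkpoints: Dict[str, Any]):
        # Checkpoints outlive a single invocation, so they are passed in.
        self.checkpoints = checkpoints

    def step(self, func: Callable[[], Any], name: str) -> Any:
        # On replay, a completed step returns its stored result without running.
        if name in self.checkpoints:
            return self.checkpoints[name]
        result = func()
        self.checkpoints[name] = result
        return result


calls = {"reserve": 0, "charge": 0}


def reserve() -> str:
    calls["reserve"] += 1
    return "reservation-1"


def flaky_charge() -> str:
    # Fails on the first attempt to simulate an interruption.
    calls["charge"] += 1
    if calls["charge"] == 1:
        raise RuntimeError("simulated interruption")
    return "payment-1"


def workflow(context: ToyContext) -> dict:
    reservation = context.step(reserve, name="reserve-inventory")
    payment = context.step(flaky_charge, name="process-payment")
    return {"reservation": reservation, "payment": payment}


store: Dict[str, Any] = {}
try:
    workflow(ToyContext(store))       # first invocation fails at the payment step
except RuntimeError:
    pass
result = workflow(ToyContext(store))  # replay starts again from the beginning
```

After the replay, `reserve` has run once and `flaky_charge` twice: the second invocation re-enters the workflow at the top, receives the stored reservation from the checkpoint, and only re-executes the step that had not completed.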

## Short-lived fault-tolerant processes
<a name="durable-examples-short-lived"></a>

Use durable functions to build reliable operations that typically complete within minutes. While these processes are shorter than long-running workflows, they still benefit from automatic checkpointing and fault tolerance across distributed systems. Durable functions help ensure your multi-step processes complete successfully even when individual service calls fail, without requiring complex error handling or state management code.

Common scenarios include hotel booking systems, restaurant reservation platforms, ride-sharing trip requests, event ticket purchases, and SaaS subscription upgrades. These scenarios share common characteristics: multiple service calls that must complete together, the need for automatic retry on transient failures, and the requirement to maintain consistent state across distributed systems.

### Distributed transactions across microservices
<a name="durable-examples-distributed-transactions"></a>

Coordinate payments, inventory, and shipping across multiple services with automatic rollback on failures. Each service operation is wrapped in a step, ensuring the transaction can recover from any point if a service fails.

------
#### [ TypeScript ]

```
import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const { orderId, amount, items } = event;
    
    // Reserve inventory across multiple warehouses
    const inventory = await context.step("reserve-inventory", async () => {
      return await inventoryService.reserve(items);
    });
    
    // Process payment
    const payment = await context.step("process-payment", async () => {
      return await paymentService.charge(amount);
    });
    
    // Create shipment
    const shipment = await context.step("create-shipment", async () => {
      return await shippingService.createShipment(orderId, inventory);
    });
    
    return { orderId, status: 'completed', shipment };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import DurableContext, durable_execution

@durable_execution
def lambda_handler(event, context: DurableContext):
    order_id = event['orderId']
    amount = event['amount']
    items = event['items']
    
    # Reserve inventory across multiple warehouses
    inventory = context.step(
        lambda _: inventory_service.reserve(items),
        name='reserve-inventory'
    )
    
    # Process payment
    payment = context.step(
        lambda _: payment_service.charge(amount),
        name='process-payment'
    )
    
    # Create shipment
    shipment = context.step(
        lambda _: shipping_service.create_shipment(order_id, inventory),
        name='create-shipment'
    )
    
    return {'orderId': order_id, 'status': 'completed', 'shipment': shipment}
```

------

If any step fails, the function automatically retries from the last successful checkpoint. The inventory reservation persists even if payment processing fails temporarily. When the function retries, it skips the completed inventory step and proceeds directly to payment processing. This eliminates duplicate reservations and ensures consistent state across your distributed system.
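The examples above show retry and replay, but they do not show what rollback might look like when a later step fails permanently. One common approach is to register a compensating action for each completed operation and run those actions in reverse order on failure. The following sketch uses plain Python with hypothetical in-memory stand-ins (`reserved`, `reserve_inventory`, `release_inventory`, `charge_payment`); it does not use the durable execution SDK, and in a durable function you would likely wrap each operation and each compensation in its own step.

```python
from typing import Callable, List

reserved: List[str] = []  # hypothetical inventory state


def reserve_inventory(items: List[str]) -> List[str]:
    reserved.extend(items)
    return list(items)


def release_inventory(items: List[str]) -> None:
    # Compensating action: undo the reservation.
    for item in items:
        reserved.remove(item)


def charge_payment(amount: int) -> str:
    # Simulate a permanent (non-retryable) failure.
    raise ValueError("card declined")


def place_order(items: List[str], amount: int) -> dict:
    compensations: List[Callable[[], None]] = []
    try:
        held = reserve_inventory(items)
        compensations.append(lambda: release_inventory(held))
        payment_id = charge_payment(amount)
        return {"status": "completed", "payment": payment_id}
    except ValueError as error:
        # Undo completed work in reverse order.
        for undo in reversed(compensations):
            undo()
        return {"status": "rolled_back", "reason": str(error)}


outcome = place_order(["sku-1", "sku-2"], 50)
```

In this sketch the declined payment triggers the release of both reserved items, so the inventory state ends up as it was before the order started.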

### Order processing with multiple steps
<a name="durable-examples-order-processing"></a>

Process orders through validation, payment authorization, inventory allocation, and fulfillment with automatic retry and recovery. Each step is checkpointed, ensuring the order progresses even if individual steps fail and retry.

------
#### [ TypeScript ]

```
import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const { orderId, customerId, items } = event;
    
    // Validate order details
    const validation = await context.step("validate-order", async () => {
      const customer = await customerService.validate(customerId);
      const itemsValid = await inventoryService.validateItems(items);
      return { customer, itemsValid };
    });
    
    if (!validation.itemsValid) {
      return { orderId, status: 'rejected', reason: 'invalid_items' };
    }
    
    // Authorize payment
    const authorization = await context.step("authorize-payment", async () => {
      return await paymentService.authorize(
        validation.customer.paymentMethod,
        calculateTotal(items)
      );
    });
    
    // Allocate inventory
    const allocation = await context.step("allocate-inventory", async () => {
      return await inventoryService.allocate(items);
    });
    
    // Fulfill order
    const fulfillment = await context.step("fulfill-order", async () => {
      return await fulfillmentService.createShipment({
        orderId,
        items: allocation.allocatedItems,
        address: validation.customer.shippingAddress
      });
    });
    
    return {
      orderId,
      status: 'completed',
      trackingNumber: fulfillment.trackingNumber
    };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import DurableContext, durable_execution

@durable_execution
def lambda_handler(event, context: DurableContext):
    order_id = event['orderId']
    customer_id = event['customerId']
    items = event['items']
    
    # Validate order details
    def validate_order(_):
        customer = customer_service.validate(customer_id)
        items_valid = inventory_service.validate_items(items)
        return {'customer': customer, 'itemsValid': items_valid}
    
    validation = context.step(validate_order, name='validate-order')
    
    if not validation['itemsValid']:
        return {'orderId': order_id, 'status': 'rejected', 'reason': 'invalid_items'}
    
    # Authorize payment
    authorization = context.step(
        lambda _: payment_service.authorize(
            validation['customer']['paymentMethod'],
            calculate_total(items)
        ),
        name='authorize-payment'
    )
    
    # Allocate inventory
    allocation = context.step(
        lambda _: inventory_service.allocate(items),
        name='allocate-inventory'
    )
    
    # Fulfill order
    fulfillment = context.step(
        lambda _: fulfillment_service.create_shipment({
            'orderId': order_id,
            'items': allocation['allocatedItems'],
            'address': validation['customer']['shippingAddress']
        }),
        name='fulfill-order'
    )
    
    return {
        'orderId': order_id,
        'status': 'completed',
        'trackingNumber': fulfillment['trackingNumber']
    }
```

------

This pattern ensures orders never get stuck in intermediate states. If validation fails, the order is rejected before payment authorization. If payment authorization fails, inventory isn't allocated. Each step builds on the previous one with automatic retry and recovery.

**Note**  
The conditional check `if (!validation.itemsValid)` (`if not validation['itemsValid']` in the Python example) is outside a step and re-executes during replay. This is safe because the check is deterministic: it always produces the same result given the same checkpointed validation object.

## Long-running processes
<a name="durable-examples-long-running"></a>

Use durable functions for processes that span hours, days, or weeks. Wait operations suspend execution without incurring compute charges, making long-running processes cost-effective. During wait periods, your function stops running and Lambda recycles the execution environment. When it's time to resume, Lambda invokes your function again and replays it from the beginning, returning checkpointed results for completed operations and continuing after the wait.

This execution model makes durable functions ideal for processes that need to pause for extended periods, whether waiting for human decisions, external system responses, scheduled processing windows, or time-based delays. You pay only for active compute time, not for waiting.

Common scenarios include document approval processes, scheduled batch processing, multi-day onboarding processes, subscription trial processes, and delayed notification systems. These scenarios share common characteristics: extended wait periods measured in hours or days, the need to maintain execution state across those waits, and cost-sensitive requirements where paying for idle compute time is prohibitive.

### Human-in-the-loop approvals
<a name="durable-examples-human-in-loop"></a>

Pause execution for document reviews, approvals, or decisions while maintaining execution state. The function waits for external callbacks without consuming resources, resuming automatically when approval is received.

This pattern is essential for processes that require human judgment or external validation. The function suspends at the callback point, incurring no compute charges while waiting. When someone submits their decision via API, Lambda invokes your function again and replays from the checkpoint, continuing with the approval result.

------
#### [ TypeScript ]

```
import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const { documentId, reviewers } = event;
    
    // Step 1: Prepare document for review
    const prepared = await context.step("prepare-document", async () => {
      return await documentService.prepare(documentId);
    });
    
    // Step 2: Request approval with callback
    const approval = await context.waitForCallback(
      "approval-callback",
      async (callbackId) => {
        await notificationService.sendApprovalRequest({
          documentId,
          reviewers,
          callbackId,
          expiresIn: 86400
        });
      },
      {
        timeout: { seconds: 86400 }
      }
    );
    
    // Function resumes here when approval is received
    if (approval?.approved) {
      const finalized = await context.step("finalize-document", async () => {
        return await documentService.finalize(documentId, approval.comments);
      });
      
      return {
        status: 'approved',
        documentId,
        finalizedAt: finalized.timestamp
      };
    }
    
    // Handle rejection
    await context.step("archive-rejected", async () => {
      await documentService.archive(documentId, approval?.reason);
    });
    
    return {
      status: 'rejected',
      documentId,
      reason: approval?.reason
    };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import DurableContext, durable_execution, WaitConfig

@durable_execution
def lambda_handler(event, context: DurableContext):
    document_id = event['documentId']
    reviewers = event['reviewers']
    
    # Step 1: Prepare document for review
    prepared = context.step(
        lambda _: document_service.prepare(document_id),
        name='prepare-document'
    )
    
    # Step 2: Request approval with callback
    def send_approval_request(callback_id):
        notification_service.send_approval_request({
            'documentId': document_id,
            'reviewers': reviewers,
            'callbackId': callback_id,
            'expiresIn': 86400
        })
    
    approval = context.wait_for_callback(
        send_approval_request,
        name='approval-callback',
        config=WaitConfig(timeout=86400)
    )
    
    # Function resumes here when approval is received
    if approval and approval.get('approved'):
        finalized = context.step(
            lambda _: document_service.finalize(document_id, approval.get('comments')),
            name='finalize-document'
        )
        
        return {
            'status': 'approved',
            'documentId': document_id,
            'finalizedAt': finalized['timestamp']
        }
    
    # Handle rejection
    context.step(
        lambda _: document_service.archive(document_id, approval.get('reason') if approval else None),
        name='archive-rejected'
    )
    
    return {
        'status': 'rejected',
        'documentId': document_id,
        'reason': approval.get('reason') if approval else None
    }
```

------

When the callback is received and your function resumes, it replays from the beginning. The prepare-document step returns its checkpointed result instantly. The waitForCallback operation also returns instantly with the stored approval result instead of waiting again. Execution then continues to the finalization or archival steps.
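The suspend-and-resume behavior described above can be modeled with a toy context. This is a simplified simulation, not the durable execution SDK: `ToyContext`, the `Suspended` exception, and the use of the operation name as the callback ID are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List


class Suspended(Exception):
    """Signals that the toy invocation ended while waiting for a callback."""


class ToyContext:
    """Illustrative stand-in for a durable context (not the real SDK)."""

    def __init__(self, checkpoints: Dict[str, Any]):
        self.checkpoints = checkpoints

    def step(self, func: Callable[[], Any], name: str) -> Any:
        if name not in self.checkpoints:
            self.checkpoints[name] = func()
        return self.checkpoints[name]

    def wait_for_callback(self, submitter: Callable[[str], None], name: str) -> Any:
        if name in self.checkpoints:
            return self.checkpoints[name]      # stored callback result
        requested = "requested:" + name
        if requested not in self.checkpoints:
            self.checkpoints[requested] = True
            submitter(name)                    # the toy uses the name as the callback ID
        raise Suspended(name)                  # end this invocation until a result arrives


prepare_calls: List[str] = []
requests_sent: List[str] = []


def prepare_document() -> str:
    prepare_calls.append("prepare")
    return "prepared"


def workflow(context: ToyContext) -> str:
    context.step(prepare_document, name="prepare-document")
    approval = context.wait_for_callback(requests_sent.append, name="approval-callback")
    return "approved" if approval["approved"] else "rejected"


store: Dict[str, Any] = {}
try:
    workflow(ToyContext(store))                    # first invocation suspends at the callback
except Suspended:
    pass

store["approval-callback"] = {"approved": True}    # a reviewer submits a decision
outcome = workflow(ToyContext(store))              # second invocation replays from the top
```

Across both invocations, the document is prepared once and the approval request is sent once. The second invocation passes straight through both operations using stored results and reaches the branch that handles the decision.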

### Multi-stage data pipelines
<a name="durable-examples-data-pipelines"></a>

Process large datasets through extraction, transformation, and loading phases with checkpoints between stages. Each stage can take hours to complete, and checkpoints enable the pipeline to resume from any stage if interrupted.

This pattern is ideal for ETL workflows, data migrations, or batch processing jobs where you need to process data in stages with recovery points between them. If a stage fails, the pipeline resumes from the last completed stage rather than restarting from the beginning. You can also use wait operations to pause between stages, for example to respect rate limits, wait for downstream systems to be ready, or schedule processing during off-peak hours.

------
#### [ TypeScript ]

```
import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const { datasetId, batchSize } = event;
    
    // Stage 1: Extract data from source
    const extracted = await context.step("extract-data", async () => {
      const records = await sourceDatabase.extractRecords(datasetId);
      return { recordCount: records.length, records };
    });
    
    // Wait 5 minutes to respect source system rate limits
    await context.wait({ seconds: 300 });
    
    // Stage 2: Transform data in batches
    const transformed = await context.step("transform-data", async () => {
      const batches = chunkArray(extracted.records, batchSize);
      const results = [];
      
      for (const batch of batches) {
        const transformed = await transformService.processBatch(batch);
        results.push(transformed);
      }
      
      return { batchCount: batches.length, results };
    });
    
    // Wait until off-peak hours (e.g., 2 AM)
    const now = new Date();
    const targetHour = 2;
    const msUntilTarget = calculateMsUntilHour(now, targetHour);
    await context.wait({ seconds: Math.floor(msUntilTarget / 1000) });
    
    // Stage 3: Load data to destination
    const loaded = await context.step("load-data", async () => {
      let loadedCount = 0;
      
      for (const result of transformed.results) {
        await destinationDatabase.loadBatch(result);
        loadedCount += result.length;
      }
      
      return { loadedCount };
    });
    
    // Stage 4: Verify and finalize
    const verified = await context.step("verify-pipeline", async () => {
      const verification = await destinationDatabase.verifyRecords(datasetId);
      await pipelineService.markComplete(datasetId, verification);
      return verification;
    });
    
    return {
      datasetId,
      recordsProcessed: extracted.recordCount,
      batchesProcessed: transformed.batchCount,
      recordsLoaded: loaded.loadedCount,
      verified: verified.success
    };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import DurableContext, durable_execution
from aws_durable_execution_sdk_python.config import Duration
from datetime import datetime

@durable_execution
def lambda_handler(event, context: DurableContext):
    dataset_id = event['datasetId']
    batch_size = event['batchSize']
    
    # Stage 1: Extract data from source
    def extract_data(_):
        records = source_database.extract_records(dataset_id)
        return {'recordCount': len(records), 'records': records}
    
    extracted = context.step(extract_data, name='extract-data')
    
    # Wait 5 minutes to respect source system rate limits
    context.wait(Duration.from_seconds(300))
    
    # Stage 2: Transform data in batches
    def transform_data(_):
        batches = chunk_array(extracted['records'], batch_size)
        results = []
        
        for batch in batches:
            transformed = transform_service.process_batch(batch)
            results.append(transformed)
        
        return {'batchCount': len(batches), 'results': results}
    
    transformed = context.step(transform_data, name='transform-data')
    
    # Wait until off-peak hours (e.g., 2 AM)
    now = datetime.now()
    target_hour = 2
    ms_until_target = calculate_ms_until_hour(now, target_hour)
    context.wait(Duration.from_seconds(ms_until_target // 1000))
    
    # Stage 3: Load data to destination
    def load_data(_):
        loaded_count = 0
        
        for result in transformed['results']:
            destination_database.load_batch(result)
            loaded_count += len(result)
        
        return {'loadedCount': loaded_count}
    
    loaded = context.step(load_data, name='load-data')
    
    # Stage 4: Verify and finalize
    def verify_pipeline(_):
        verification = destination_database.verify_records(dataset_id)
        pipeline_service.mark_complete(dataset_id, verification)
        return verification
    
    verified = context.step(verify_pipeline, name='verify-pipeline')
    
    return {
        'datasetId': dataset_id,
        'recordsProcessed': extracted['recordCount'],
        'batchesProcessed': transformed['batchCount'],
        'recordsLoaded': loaded['loadedCount'],
        'verified': verified['success']
    }
```

------

Each stage is wrapped in a step, creating a checkpoint that allows the pipeline to resume from any stage if interrupted. The 5-minute wait between extract and transform respects source system rate limits without consuming compute resources, while the wait until 2 AM schedules the expensive load operation during off-peak hours.
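The pipeline examples call `chunk_array` and `calculate_ms_until_hour` (`chunkArray` and `calculateMsUntilHour` in TypeScript) without defining them. The following is one possible Python implementation of both helpers, offered as a sketch: the signatures are inferred from how the examples call them, and the time calculation assumes a naive `datetime` value, ignoring time zones and daylight saving transitions.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_array(records: Sequence[T], batch_size: int) -> List[List[T]]:
    """Split records into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        list(records[start:start + batch_size])
        for start in range(0, len(records), batch_size)
    ]


def calculate_ms_until_hour(now: datetime, target_hour: int) -> int:
    """Return the milliseconds from now until the next target_hour:00."""
    target = now.replace(hour=target_hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # The target hour has already passed today, so use tomorrow.
        target += timedelta(days=1)
    return (target - now) // timedelta(milliseconds=1)


batches = chunk_array([1, 2, 3, 4, 5], 2)
delay_ms = calculate_ms_until_hour(datetime(2024, 1, 1, 23, 0, 0), 2)
```

Here `batches` holds three lists, with the last one containing the single leftover record, and `delay_ms` is the three hours between 11 PM and 2 AM the next day expressed in milliseconds.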

**Note**  
The `new Date()` call and the `calculateMsUntilHour()` function (`datetime.now()` and `calculate_ms_until_hour()` in the Python example) run outside steps, so they re-execute during replay and can return a different value each time. For time-based values that must be consistent across replays, calculate the timestamp inside a step, or use the value only for wait durations, which are checkpointed.
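To see why reading the clock inside a step matters, consider the following toy simulation. It is not the durable execution SDK: `ToyContext` is a hypothetical stand-in that stores step results in a dictionary, and `read_clock` is a fake clock that returns a later value on every call so the effect is reproducible.

```python
from itertools import count
from typing import Any, Callable, Dict, Tuple


class ToyContext:
    """Illustrative stand-in for a durable context (not the real SDK)."""

    def __init__(self, checkpoints: Dict[str, Any]):
        self.checkpoints = checkpoints

    def step(self, func: Callable[[], Any], name: str) -> Any:
        # Run the function once, then serve the stored result on replay.
        if name not in self.checkpoints:
            self.checkpoints[name] = func()
        return self.checkpoints[name]


clock = count(start=1000)  # fake clock: every read returns a later time


def read_clock() -> int:
    return next(clock)


def workflow(context: ToyContext) -> Tuple[int, int]:
    outside = read_clock()                                  # re-executes on every replay
    inside = context.step(read_clock, name="capture-time")  # checkpointed once
    return outside, inside


store: Dict[str, Any] = {}
first = workflow(ToyContext(store))   # original invocation
second = workflow(ToyContext(store))  # replay
```

The value read outside the step changes between the first invocation and the replay, while the value captured inside the step stays the same because it comes from the checkpoint.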

## Advanced patterns
<a name="durable-examples-advanced"></a>

Use durable functions to build complex multi-step applications that combine multiple durable operations, parallel execution, array processing, conditional logic, and polling. These patterns let you build sophisticated applications that coordinate many tasks while maintaining fault tolerance and automatic recovery.

Advanced patterns go beyond simple sequential steps. You can run operations concurrently with `parallel()`, process arrays with `map()`, wait for external conditions with `waitForCondition()`, and combine these primitives to build reliable applications. Each durable operation creates its own checkpoints, so your application can recover from any point if interrupted.

### User onboarding processes
<a name="durable-examples-user-onboarding"></a>

Guide users through registration, email verification, profile setup, and initial configuration with retry handling. This example combines sequential steps, callbacks, and conditional logic to create a complete onboarding process.

------
#### [ TypeScript ]

```
import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const { userId, email } = event;
    
    // Step 1: Create user account
    const user = await context.step("create-account", async () => {
      return await userService.createAccount(userId, email);
    });
    
    // Step 2: Send verification email
    await context.step("send-verification", async () => {
      return await emailService.sendVerification(email);
    });
    
    // Step 3: Wait for email verification (up to 48 hours)
    const verified = await context.waitForCallback(
      "email-verification",
      async (callbackId) => {
        await notificationService.sendVerificationLink({
          email,
          callbackId,
          expiresIn: 172800
        });
      },
      {
        timeout: { seconds: 172800 }
      }
    );
    
    if (!verified) {
      await context.step("send-reminder", async () => {
        await emailService.sendReminder(email);
      });
      
      return {
        status: "verification_timeout",
        userId,
        message: "Email verification not completed within 48 hours"
      };
    }
    
    // Step 4: Initialize user profile in parallel
    const setupResults = await context.parallel("profile-setup", [
      async (ctx: DurableContext) => {
        return await ctx.step("create-preferences", async () => {
          return await preferencesService.createDefaults(userId);
        });
      },
      
      async (ctx: DurableContext) => {
        return await ctx.step("setup-notifications", async () => {
          return await notificationService.setupDefaults(userId);
        });
      },
      
      async (ctx: DurableContext) => {
        return await ctx.step("create-welcome-content", async () => {
          return await contentService.createWelcome(userId);
        });
      }
    ]);
    
    // Step 5: Send welcome email
    await context.step("send-welcome", async () => {
      const [preferences, notifications, content] = setupResults.getResults();
      return await emailService.sendWelcome({
        email,
        preferences,
        notifications,
        content
      });
    });
    
    return {
      status: "onboarding_complete",
      userId,
      completedAt: new Date().toISOString()
    };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import DurableContext, durable_execution, WaitConfig
from datetime import datetime

@durable_execution
def lambda_handler(event, context: DurableContext):
    user_id = event['userId']
    email = event['email']
    
    # Step 1: Create user account
    user = context.step(
        lambda _: user_service.create_account(user_id, email),
        name='create-account'
    )
    
    # Step 2: Send verification email
    context.step(
        lambda _: email_service.send_verification(email),
        name='send-verification'
    )
    
    # Step 3: Wait for email verification (up to 48 hours)
    def send_verification_link(callback_id):
        notification_service.send_verification_link({
            'email': email,
            'callbackId': callback_id,
            'expiresIn': 172800
        })
    
    verified = context.wait_for_callback(
        send_verification_link,
        name='email-verification',
        config=WaitConfig(timeout=172800)
    )
    
    if not verified:
        context.step(
            lambda _: email_service.send_reminder(email),
            name='send-reminder'
        )
        
        return {
            'status': 'verification_timeout',
            'userId': user_id,
            'message': 'Email verification not completed within 48 hours'
        }
    
    # Step 4: Initialize user profile in parallel
    def create_preferences(ctx: DurableContext):
        return ctx.step(
            lambda _: preferences_service.create_defaults(user_id),
            name='create-preferences'
        )
    
    def setup_notifications(ctx: DurableContext):
        return ctx.step(
            lambda _: notification_service.setup_defaults(user_id),
            name='setup-notifications'
        )
    
    def create_welcome_content(ctx: DurableContext):
        return ctx.step(
            lambda _: content_service.create_welcome(user_id),
            name='create-welcome-content'
        )
    
    setup_results = context.parallel(
        [create_preferences, setup_notifications, create_welcome_content],
        name='profile-setup'
    )
    
    # Step 5: Send welcome email
    def send_welcome(_):
        results = setup_results.get_results()
        preferences, notifications, content = results[0], results[1], results[2]
        return email_service.send_welcome({
            'email': email,
            'preferences': preferences,
            'notifications': notifications,
            'content': content
        })
    
    context.step(send_welcome, name='send-welcome')
    
    return {
        'status': 'onboarding_complete',
        'userId': user_id,
        'completedAt': datetime.now().isoformat()
    }
```

------

The process combines sequential steps with checkpoints for account creation and email sending, then pauses for up to 48 hours waiting for email verification without consuming resources. Conditional logic handles different paths based on whether verification completes or times out. Profile setup tasks run concurrently using parallel operations to reduce total execution time, and each step retries automatically on transient failures to help ensure the onboarding completes reliably.
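The fan-out and ordered collection used for profile setup can be illustrated with ordinary threads. This sketch only mimics the shape of the parallel operation: `run_parallel` is a hypothetical helper built on the standard library's `ThreadPoolExecutor`, the three branch functions return made-up values, and nothing here is checkpointed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_parallel(branches: List[Callable[[], dict]]) -> List[dict]:
    """Run branches concurrently and return their results in branch order."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch) for branch in branches]
        # Collecting in submission order keeps results aligned with branches.
        return [future.result() for future in futures]


def create_preferences() -> dict:
    return {"theme": "default"}


def setup_notifications() -> dict:
    return {"email": True}


def create_welcome_content() -> dict:
    return {"title": "Welcome"}


preferences, notifications, content = run_parallel(
    [create_preferences, setup_notifications, create_welcome_content]
)
```

Because results are collected in the order the branches were submitted, they can be unpacked positionally, which mirrors how the onboarding examples read the three setup results before sending the welcome email.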

### Chained invocations across functions
<a name="durable-examples-chained-invocations"></a>

Invoke other Lambda functions from within a durable function using `context.invoke()`. The calling function suspends while waiting for the invoked function to complete, creating a checkpoint that preserves the result. If the calling function is interrupted after the invoked function completes, it resumes with the stored result without re-invoking the function.

Use this pattern when you have specialized functions that handle specific domains (customer validation, payment processing, inventory management) and need to coordinate them in a workflow. Each function maintains its own logic and can be invoked by multiple orchestrator functions, avoiding code duplication.

------
#### [ TypeScript ]

```
import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

// Main orchestrator function
export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const { orderId, customerId } = event;
    
    // Step 1: Validate customer by invoking customer service function
    const customer = await context.invoke(
      "validate-customer",
      "arn:aws:lambda:us-east-1:123456789012:function:customer-service:1",
      { customerId }
    );
    
    if (!customer.isValid) {
      return { orderId, status: "rejected", reason: "invalid_customer" };
    }
    
    // Step 2: Check inventory by invoking inventory service function
    const inventory = await context.invoke(
      "check-inventory",
      "arn:aws:lambda:us-east-1:123456789012:function:inventory-service:1",
      { orderId, items: event.items }
    );
    
    if (!inventory.available) {
      return { orderId, status: "rejected", reason: "insufficient_inventory" };
    }
    
    // Step 3: Process payment by invoking payment service function
    const payment = await context.invoke(
      "process-payment",
      "arn:aws:lambda:us-east-1:123456789012:function:payment-service:1",
      {
        customerId,
        amount: inventory.totalAmount,
        paymentMethod: customer.paymentMethod
      }
    );
    
    // Step 4: Create shipment by invoking fulfillment service function
    const shipment = await context.invoke(
      "create-shipment",
      "arn:aws:lambda:us-east-1:123456789012:function:fulfillment-service:1",
      {
        orderId,
        items: inventory.allocatedItems,
        address: customer.shippingAddress
      }
    );
    
    return {
      orderId,
      status: "completed",
      trackingNumber: shipment.trackingNumber,
      estimatedDelivery: shipment.estimatedDelivery
    };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import DurableContext, durable_execution

# Main orchestrator function
@durable_execution
def lambda_handler(event, context: DurableContext):
    order_id = event['orderId']
    customer_id = event['customerId']
    
    # Step 1: Validate customer by invoking customer service function
    customer = context.invoke(
        'arn:aws:lambda:us-east-1:123456789012:function:customer-service:1',
        {'customerId': customer_id},
        name='validate-customer'
    )
    
    if not customer['isValid']:
        return {'orderId': order_id, 'status': 'rejected', 'reason': 'invalid_customer'}
    
    # Step 2: Check inventory by invoking inventory service function
    inventory = context.invoke(
        'arn:aws:lambda:us-east-1:123456789012:function:inventory-service:1',
        {'orderId': order_id, 'items': event['items']},
        name='check-inventory'
    )
    
    if not inventory['available']:
        return {'orderId': order_id, 'status': 'rejected', 'reason': 'insufficient_inventory'}
    
    # Step 3: Process payment by invoking payment service function
    payment = context.invoke(
        'arn:aws:lambda:us-east-1:123456789012:function:payment-service:1',
        {
            'customerId': customer_id,
            'amount': inventory['totalAmount'],
            'paymentMethod': customer['paymentMethod']
        },
        name='process-payment'
    )
    
    # Step 4: Create shipment by invoking fulfillment service function
    shipment = context.invoke(
        'arn:aws:lambda:us-east-1:123456789012:function:fulfillment-service:1',
        {
            'orderId': order_id,
            'items': inventory['allocatedItems'],
            'address': customer['shippingAddress']
        },
        name='create-shipment'
    )
    
    return {
        'orderId': order_id,
        'status': 'completed',
        'trackingNumber': shipment['trackingNumber'],
        'estimatedDelivery': shipment['estimatedDelivery']
    }
```

------

Each invocation creates a checkpoint in the orchestrator function. If the orchestrator is interrupted after the customer validation completes, it resumes from that checkpoint with the stored customer data, skipping the validation invocation. This prevents duplicate calls to downstream services and ensures consistent execution across interruptions.

The invoked functions can be either durable or standard Lambda functions. If you invoke a durable function, it can have its own multi-step workflow with waits and checkpoints. The orchestrator simply waits for the complete durable execution to finish, receiving the final result.

**Note**  
Cross-account invocations are not supported. All invoked functions must be in the same AWS account as the calling function.

### Batch processing with checkpoints
<a name="durable-examples-batch-processing"></a>

Process millions of records with automatic recovery from the last successful checkpoint after failures. This example demonstrates how durable functions combine `map()` operations with chunking and rate limiting to handle large-scale data processing.

------
#### [ TypeScript ]

```
import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

interface Batch {
  batchIndex: number;
  recordIds: string[];
}

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const { datasetId, batchSize = 1000 } = event;
    
    // Step 1: Get all record IDs to process
    const recordIds = await context.step("fetch-record-ids", async () => {
      return await dataService.getRecordIds(datasetId);
    });
    
    // Step 2: Split into batches
    const batches: Batch[] = [];
    for (let i = 0; i < recordIds.length; i += batchSize) {
      batches.push({
        batchIndex: Math.floor(i / batchSize),
        recordIds: recordIds.slice(i, i + batchSize)
      });
    }
    
    // Step 3: Process batches with controlled concurrency
    const batchResults = await context.map(
      "process-batches",
      batches,
      async (ctx: DurableContext, batch: Batch, index: number) => {
        const processed = await ctx.step(`batch-${batch.batchIndex}`, async () => {
          const results = [];
          for (const recordId of batch.recordIds) {
            const result = await recordService.process(recordId);
            results.push(result);
          }
          return results;
        });
        
        const validated = await ctx.step(`validate-${batch.batchIndex}`, async () => {
          return await validationService.validateBatch(processed);
        });
        
        return {
          batchIndex: batch.batchIndex,
          recordCount: batch.recordIds.length,
          successCount: validated.successCount,
          failureCount: validated.failureCount
        };
      },
      {
        maxConcurrency: 5
      }
    );
    
    // Step 4: Aggregate results
    const summary = await context.step("aggregate-results", async () => {
      const results = batchResults.getResults();
      const totalSuccess = results.reduce((sum, r) => sum + r.successCount, 0);
      const totalFailure = results.reduce((sum, r) => sum + r.failureCount, 0);
      
      return {
        datasetId,
        totalRecords: recordIds.length,
        batchesProcessed: batches.length,
        successCount: totalSuccess,
        failureCount: totalFailure,
        completedAt: new Date().toISOString()
      };
    });
    
    return summary;
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import DurableContext, durable_execution, MapConfig
from datetime import datetime
from typing import Dict

@durable_execution
def lambda_handler(event, context: DurableContext):
    dataset_id = event['datasetId']
    batch_size = event.get('batchSize', 1000)
    
    # Step 1: Get all record IDs to process
    record_ids = context.step(
        lambda _: data_service.get_record_ids(dataset_id),
        name='fetch-record-ids'
    )
    
    # Step 2: Split into batches
    batches = []
    for i in range(0, len(record_ids), batch_size):
        batches.append({
            'batchIndex': i // batch_size,
            'recordIds': record_ids[i:i + batch_size]
        })
    
    # Step 3: Process batches with controlled concurrency
    def process_batch(ctx: DurableContext, batch: Dict, index: int):
        batch_index = batch['batchIndex']
        
        def process_records(_):
            results = []
            for record_id in batch['recordIds']:
                result = record_service.process(record_id)
                results.append(result)
            return results
        
        processed = ctx.step(process_records, name=f'batch-{batch_index}')
        
        validated = ctx.step(
            lambda _: validation_service.validate_batch(processed),
            name=f'validate-{batch_index}'
        )
        
        return {
            'batchIndex': batch_index,
            'recordCount': len(batch['recordIds']),
            'successCount': validated['successCount'],
            'failureCount': validated['failureCount']
        }
    
    batch_results = context.map(
        process_batch,
        batches,
        name='process-batches',
        config=MapConfig(max_concurrency=5)
    )
    
    # Step 4: Aggregate results
    def aggregate_results(_):
        results = batch_results.get_results()
        total_success = sum(r['successCount'] for r in results)
        total_failure = sum(r['failureCount'] for r in results)
        
        return {
            'datasetId': dataset_id,
            'totalRecords': len(record_ids),
            'batchesProcessed': len(batches),
            'successCount': total_success,
            'failureCount': total_failure,
            'completedAt': datetime.now().isoformat()
        }
    
    summary = context.step(aggregate_results, name='aggregate-results')
    
    return summary
```

------

Records are split into manageable batches to avoid overwhelming memory or downstream services, and multiple batches are then processed concurrently, with `maxConcurrency` (`max_concurrency` in Python) controlling the parallelism. Each batch has its own checkpoint, so a failure retries only the affected batch rather than reprocessing all records. This pattern is ideal for ETL jobs, data migrations, or bulk operations where processing can take hours.

## Next steps
<a name="durable-examples-next-steps"></a>
+ Explore [basic concepts](durable-basic-concepts.md) to understand DurableContext, steps, and waits
+ Review [best practices](durable-best-practices.md) for writing deterministic code and optimizing performance
+ Learn about [testing durable functions](durable-testing.md) locally and in the cloud
+ Compare durable functions with Step Functions to understand when each approach is most effective. See [Durable functions or Step Functions](durable-step-functions.md).

# Security and permissions for Lambda durable functions
<a name="durable-security"></a>

Lambda durable functions require specific IAM permissions to manage checkpoint operations. Follow the principle of least privilege by granting only the permissions your function needs.

## Execution role permissions
<a name="durable-execution-role"></a>

Your durable function's execution role needs permissions to create checkpoints and retrieve execution state. The following policy shows the minimum required permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:CheckpointDurableExecution",
                "lambda:GetDurableExecutionState"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name:*"
        }
    ]
}
```

When you create a durable function using the console, Lambda automatically adds these permissions to the execution role. If you create the function using the AWS CLI or AWS CloudFormation, add these permissions to your execution role.

For Lambda to assume your execution role, the role's trust policy must specify the Lambda service principal (`lambda.amazonaws.com`) as a trusted service. The following example shows a trust policy that grants Lambda permission to assume the role.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

**Least privilege principle**  
Scope the `Resource` element to specific function ARNs instead of using wildcards. This limits the execution role to checkpoint operations for only the functions that need them.

**Example: Scoped permissions for multiple functions**

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:CheckpointDurableExecution",
                "lambda:GetDurableExecutionState"
            ],
            "Resource": [
                "arn:aws:lambda:us-east-1:123456789012:function:orderProcessor:*",
                "arn:aws:lambda:us-east-1:123456789012:function:paymentHandler:*"
            ]
        }
    ]
}
```
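If you manage several durable functions, you might generate the scoped policy rather than write it by hand. The following is a minimal sketch, not an official tool: the helper name `build_checkpoint_policy` is hypothetical, and the Region, account ID, and function names are the placeholder values from the example above.

```python
import json
from typing import Dict, List


def build_checkpoint_policy(region: str, account_id: str, function_names: List[str]) -> Dict:
    """Build a policy document scoped to the listed functions (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "lambda:CheckpointDurableExecution",
                    "lambda:GetDurableExecutionState",
                ],
                # One ARN per function; the trailing ":*" matches the
                # qualified ARN pattern used in the policies above.
                "Resource": [
                    f"arn:aws:lambda:{region}:{account_id}:function:{name}:*"
                    for name in function_names
                ],
            }
        ],
    }


policy = build_checkpoint_policy(
    "us-east-1", "123456789012", ["orderProcessor", "paymentHandler"]
)
policy_json = json.dumps(policy, indent=4)
```

The resulting `policy_json` string can be attached to the execution role with whichever deployment tooling you already use.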

Alternatively, you can use the AWS managed policy `AWSLambdaBasicDurableExecutionRolePolicy`, which includes the required durable execution permissions along with basic Lambda execution permissions for Amazon CloudWatch Logs.

## State encryption
<a name="durable-state-encryption"></a>

Lambda durable functions automatically enable encryption at rest using AWS owned keys at no charge. Each function execution maintains isolated state that other executions cannot access. Customer managed keys (CMK) are not supported.

Checkpoint data includes:
+ Step results and return values
+ Execution progress and timeline
+ Wait state information

All data is encrypted in transit using TLS when Lambda reads or writes checkpoint data.

### Custom encryption with custom serializers and deserializers
<a name="durable-custom-encryption"></a>

For critical security requirements, you can implement your own encryption and decryption mechanism by using custom serializers and deserializers (SerDes) with the durable execution SDK. This approach gives you full control over the encryption keys and algorithms used to protect checkpoint data.

**Important**  
When you use custom encryption, you lose visibility of operation results in the Lambda console and API responses. Checkpoint data appears encrypted in execution history and cannot be inspected without decryption.

Your function's execution role needs `kms:Encrypt` and `kms:Decrypt` permissions for the AWS KMS key used in the custom SerDes implementation.

## CloudTrail logging
<a name="durable-cloudtrail-logging"></a>

Lambda logs checkpoint operations as data events in AWS CloudTrail. You can use CloudTrail to audit when checkpoints are created, track execution state changes, and monitor access to durable execution data.

Checkpoint operations appear in CloudTrail logs with the following event names:
+ `CheckpointDurableExecution` - Logged when a step completes and creates a checkpoint
+ `GetDurableExecutionState` - Logged when Lambda retrieves execution state during replay

To enable data event logging for durable functions, configure a CloudTrail trail to log Lambda data events. For more information, see [Logging data events](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-data-events-with-cloudtrail.html) in the CloudTrail User Guide.

**Example: CloudTrail log entry for checkpoint operation**

```
{
    "eventVersion": "1.08",
    "eventTime": "2024-11-16T10:30:45Z",
    "eventName": "CheckpointDurableExecution",
    "eventSource": "lambda.amazonaws.com",
    "requestParameters": {
        "functionName": "myDurableFunction",
        "executionId": "exec-abc123",
        "stepId": "step-1"
    },
    "responseElements": null,
    "eventType": "AwsApiCall"
}
```

## Cross-account considerations
<a name="durable-cross-account-access"></a>

If you invoke durable functions across AWS accounts, the calling account needs `lambda:InvokeFunction` permission, but checkpoint operations always use the execution role in the function's account. The calling account cannot access checkpoint data or execution state directly.

This isolation ensures that checkpoint data remains secure within the function's account, even when invoked from external accounts.

## Inherited Lambda security features
<a name="durable-inherited-security"></a>

Durable functions inherit all security, governance, and compliance features from Lambda, including VPC connectivity, environment variable encryption, dead letter queues, reserved concurrency, function URLs, code signing, and compliance certifications (SOC, PCI DSS, HIPAA, etc.).

For detailed information about Lambda security features, see [Security in AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/lambda-security.html) in the Lambda Developer Guide. The only additional security considerations for durable functions are the checkpoint permissions documented in this guide.

# Durable execution SDK
<a name="durable-execution-sdk"></a>

The durable execution SDK is the foundation for building durable functions. It provides the primitives you need to checkpoint progress, handle retries, and manage execution flow. The SDK abstracts the complexity of checkpoint management and replay, letting you write sequential code that automatically becomes fault-tolerant.

The SDK is available for JavaScript, TypeScript, Python, and Java. For complete API documentation and examples, see the [JavaScript/TypeScript SDK](https://github.com/aws/aws-durable-execution-sdk-js), [Python SDK](https://github.com/aws/aws-durable-execution-sdk-python), and [Java SDK](https://github.com/aws/aws-durable-execution-sdk-java) on GitHub.

## DurableContext
<a name="durable-sdk-context"></a>

The SDK provides your function with a `DurableContext` object that exposes all durable operations. This context replaces the standard Lambda context and provides methods for creating checkpoints, managing execution flow, and coordinating with external systems.

To use the SDK, wrap your Lambda handler with the durable execution wrapper:

------
#### [ TypeScript ]

```
import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Your function receives DurableContext instead of Lambda context
    // Use context.step(), context.wait(), etc.
    return result;
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import durable_execution, DurableContext

@durable_execution
def handler(event: dict, context: DurableContext):
    # Your function receives DurableContext
    # Use context.step(), context.wait(), etc.
    return result
```

------
#### [ Java ]

```
import software.amazon.lambda.durable.DurableContext;
import software.amazon.lambda.durable.DurableHandler;

public class Handler extends DurableHandler<Object, String> {
    @Override
    public String handleRequest(Object input, DurableContext context) {
        // Your function receives DurableContext
        // Use context.step(), context.wait(), etc.
        return result;
    }
}
```

------

The wrapper intercepts your function invocation, loads any existing checkpoint log, and provides the `DurableContext` that manages replay and checkpointing.

## What the SDK does
<a name="durable-sdk-what-it-does"></a>

The SDK handles three critical responsibilities that enable durable execution:

**Checkpoint management:** The SDK automatically creates checkpoints as your function executes durable operations. Each checkpoint records the operation type, inputs, and results. When your function completes a step, the SDK persists the checkpoint before continuing. This ensures your function can resume from any completed operation if interrupted.

**Replay coordination:** When your function resumes after a pause or interruption, the SDK performs replay. It runs your code from the beginning but skips completed operations, using stored checkpoint results instead of re-executing them. The SDK ensures replay is deterministic: given the same inputs and checkpoint log, your function produces the same results.

**State isolation:** The SDK maintains execution state separately from your business logic. Each durable execution has its own checkpoint log that other executions cannot access. The SDK encrypts checkpoint data at rest and ensures state remains consistent across replays.

## How checkpointing works
<a name="durable-sdk-how-checkpointing-works"></a>

When you call a durable operation, the SDK follows this sequence:

1. **Check for existing checkpoint:** The SDK checks if this operation already completed in a previous invocation. If a checkpoint exists, the SDK returns the stored result without re-executing the operation.

1. **Execute the operation:** If no checkpoint exists, the SDK executes your operation code. For steps, this means calling your function. For waits, this means scheduling resumption.

1. **Create checkpoint:** After the operation completes, the SDK serializes the result and creates a checkpoint. The checkpoint includes the operation type, name, inputs, result, and timestamp.

1. **Persist checkpoint:** The SDK calls the Lambda checkpoint API to persist the checkpoint. This ensures the checkpoint is durable before continuing execution.

1. **Return result:** The SDK returns the operation result to your code, which continues to the next operation.

This sequence ensures that once an operation completes, its result is safely stored. If your function is interrupted at any point, the SDK can replay up to the last completed checkpoint.
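The sequence above can be sketched with a toy, in-memory model. This is an illustration only, not the SDK's implementation: `ToyDurableContext` is a hypothetical class, the checkpoint log is a plain dictionary instead of the Lambda checkpoint API, and results are serialized with JSON as an assumed default.

```python
import json
from typing import Any, Callable, Dict, List


class ToyDurableContext:
    """In-memory stand-in for DurableContext (hypothetical, illustration only)."""

    def __init__(self, checkpoint_log: Dict[str, str]):
        # Maps operation name -> serialized result. The real SDK persists
        # checkpoints through the Lambda checkpoint API instead.
        self.checkpoint_log = checkpoint_log

    def step(self, func: Callable[[Any], Any], name: str) -> Any:
        # 1. Check for an existing checkpoint and return the stored result.
        if name in self.checkpoint_log:
            return json.loads(self.checkpoint_log[name])
        # 2. No checkpoint exists, so execute the operation code.
        result = func(None)
        # 3 and 4. Serialize the result and persist the checkpoint.
        self.checkpoint_log[name] = json.dumps(result)
        # 5. Return the result to the calling code.
        return result


calls: List[str] = []  # records which step bodies actually ran


def reserve(_: Any) -> Dict[str, int]:
    calls.append("reserve")
    return {"reserved": 3}


def charge(_: Any) -> Dict[str, int]:
    calls.append("charge")
    return {"charged": 42}


def handler(context: ToyDurableContext) -> int:
    reserved = context.step(reserve, name="reserve-inventory")
    payment = context.step(charge, name="charge-card")
    return reserved["reserved"] + payment["charged"]


log: Dict[str, str] = {}
first = handler(ToyDurableContext(log))   # both step bodies execute
second = handler(ToyDurableContext(log))  # replay: results come from the log
```

After the second call, `calls` still contains only one entry per step, because the replayed invocation reads both results from the log instead of running the step bodies again.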

## Replay behavior
<a name="durable-sdk-replay-behavior"></a>

When your function resumes after a pause or interruption, the SDK performs replay:

1. **Load checkpoint log:** The SDK retrieves the checkpoint log for this execution from Lambda.

1. **Run from beginning:** The SDK invokes your handler function from the start, not from where it paused.

1. **Skip completed durable operations:** As your code calls durable operations, the SDK checks each against the checkpoint log. For completed durable operations, the SDK returns the stored result without executing the operation code.
**Note**  
If a child context's result was larger than the maximum checkpoint size (256 KB), the context's code is executed again during replay. This allows you to construct large results from the durable operations that ran inside the context, which will be looked up from the checkpoint log. Therefore it is imperative to only run deterministic code in the context itself. When using child contexts with large results, it is a best practice to perform long-running or non-deterministic work inside of steps and only perform short-running tasks which combine the results in the context itself.

1. **Resume at interruption point:** When the SDK reaches an operation without a checkpoint, it executes normally and creates new checkpoints as durable operations complete.

This replay mechanism requires your code to be deterministic. Given the same inputs and checkpoint log, your function must make the same sequence of durable operation calls. The SDK enforces this by validating that operation names and types match the checkpoint log during replay.
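To see why determinism matters, consider the following toy model of replay validation. It is a sketch under stated assumptions rather than the SDK's actual logic: `ReplayingContext` and `NonDeterministicReplayError` are hypothetical names, and the checkpoint log is an ordered in-memory list that is matched by operation name only.

```python
from typing import Any, Callable, List, Tuple


class NonDeterministicReplayError(Exception):
    """Hypothetical error raised when replay diverges from the log."""


class ReplayingContext:
    """Toy context that validates operation names during replay."""

    def __init__(self, log: List[Tuple[str, Any]]):
        self.log = log      # ordered (operation name, result) pairs
        self.position = 0   # index of the next operation to match

    def step(self, func: Callable[[Any], Any], name: str) -> Any:
        if self.position < len(self.log):
            recorded_name, recorded_result = self.log[self.position]
            if recorded_name != name:
                raise NonDeterministicReplayError(
                    f"expected '{recorded_name}' but code called '{name}'"
                )
            self.position += 1
            return recorded_result  # skip execution, reuse stored result
        # Past the end of the log: execute normally and record a checkpoint.
        result = func(None)
        self.log.append((name, result))
        self.position += 1
        return result


def handler(context: ReplayingContext, use_express: bool) -> str:
    context.step(lambda _: "ok", name="validate-order")
    # use_express stands in for any value that can change between
    # invocations, such as a random number read outside of a step.
    shipping_step = "ship-express" if use_express else "ship-standard"
    return context.step(lambda _: shipping_step, name=shipping_step)


log: List[Tuple[str, Any]] = []
handler(ReplayingContext(log), use_express=True)             # initial run
replayed = handler(ReplayingContext(log), use_express=True)  # same sequence

try:
    handler(ReplayingContext(log), use_express=False)        # diverges
    mismatch_detected = False
except NonDeterministicReplayError:
    mismatch_detected = True
```

The third call fails because the second operation name no longer matches the log. Moving the branching decision inside a step would make its result part of the checkpoint log and keep the sequence stable.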

## Available durable operations
<a name="durable-sdk-operations"></a>

The `DurableContext` provides operations for different coordination patterns. Each durable operation creates checkpoints automatically, ensuring your function can resume from any point.

### Steps
<a name="durable-sdk-op-step"></a>

Executes business logic with automatic checkpointing and retry. Use steps for operations that call external services, perform calculations, or execute any logic that should be checkpointed. The SDK creates a checkpoint before and after the step, storing the result for replay.

------
#### [ TypeScript ]

```
const result = await context.step('process-payment', async () => {
  return await paymentService.charge(amount);
});
```

------
#### [ Python ]

```
result = context.step(
    lambda _: payment_service.charge(amount),
    name='process-payment'
)
```

------
#### [ Java ]

```
var result = context.step("process-payment", Payment.class, 
    () -> paymentService.charge(amount)
);
```

------

Steps support configurable retry strategies, execution semantics (at-most-once or at-least-once), and custom serialization.

### Waits
<a name="durable-sdk-op-wait"></a>

Pauses execution for a specified duration without consuming compute resources. The SDK creates a checkpoint, terminates the function invocation, and schedules resumption. When the wait completes, Lambda invokes your function again and the SDK replays to the wait point before continuing.

------
#### [ TypeScript ]

```
// Wait 1 hour without charges
await context.wait({ seconds: 3600 });
```

------
#### [ Python ]

```
# Wait 1 hour without charges
context.wait(Duration.from_seconds(3600))
```

------
#### [ Java ]

```
// Wait 1 hour without charges
context.wait(Duration.ofHours(1));
```

------

### Callbacks
<a name="durable-sdk-op-callback"></a>

Callbacks enable your function to pause and wait for external systems to provide input. When you create a callback, the SDK generates a unique callback ID and creates a checkpoint. Your function then suspends (terminates the invocation) without incurring compute charges. External systems submit callback results using the `SendDurableExecutionCallbackSuccess` or `SendDurableExecutionCallbackFailure` Lambda APIs. When a callback is submitted, Lambda invokes your function again, the SDK replays to the callback point, and your function continues with the callback result.

The SDK provides two methods for working with callbacks:

**createCallback:** Creates a callback and returns both a promise and a callback ID. You send the callback ID to an external system, which submits the result using the Lambda API.

------
#### [ TypeScript ]

```
const [promise, callbackId] = await context.createCallback('approval', {
  timeout: { hours: 24 }
});

await sendApprovalRequest(callbackId, requestData);
const approval = await promise;
```

------
#### [ Python ]

```
callback = context.create_callback(
    name='approval',
    config=CallbackConfig(timeout_seconds=86400)
)

context.step(
    lambda _: send_approval_request(callback.callback_id),
    name='send_request'
)

approval = callback.result()
```

------
#### [ Java ]

```
var config = CallbackConfig.builder()
    .timeout(Duration.ofHours(24))
    .build();

var callback = context.createCallback("approval", String.class, config);

context.step("send-request", String.class, () -> {
    notificationService.sendApprovalRequest(callback.callbackId(), requestData);
    return "request-sent";
});

// Blocks until the callback finishes or times out
String approval = callback.get();
```

------

**waitForCallback:** Simplifies callback handling by combining callback creation and submission in one operation. The SDK creates the callback, executes your submitter function with the callback ID, and waits for the result.

------
#### [ TypeScript ]

```
const result = await context.waitForCallback(
  'external-api',
  async (callbackId, ctx) => {
    await submitToExternalAPI(callbackId, requestData);
  },
  { timeout: { minutes: 30 } }
);
```

------
#### [ Python ]

```
result = context.wait_for_callback(
    lambda callback_id: submit_to_external_api(callback_id, request_data),
    name='external-api',
    config=WaitForCallbackConfig(timeout_seconds=1800)
)
```

------
#### [ Java ]

```
var result = context.waitForCallback(
    "external-api",
    String.class,
    (callbackId, ctx) -> {
        submitToExternalAPI(callbackId, requestData);
    },
    WaitForCallbackConfig.builder()
        .callbackConfig(CallbackConfig.builder()
            .timeout(Duration.ofMinutes(30))
            .build())
        .build());
```

------

Configure timeouts to prevent functions from waiting indefinitely. If a callback times out, the SDK throws a `CallbackError` and your function can handle the timeout case. Use heartbeat timeouts for long-running callbacks to detect when external systems stop responding.

Use callbacks for human-in-the-loop workflows, external system integration, webhook responses, or any scenario where execution must pause for external input.

### Parallel execution
<a name="durable-sdk-op-parallel"></a>

Executes multiple operations concurrently with optional concurrency control. The SDK manages parallel execution, creates checkpoints for each operation, and handles failures according to your completion policy.

------
#### [ TypeScript ]

```
const results = await context.parallel([
  async (ctx) => ctx.step('task1', async () => processTask1()),
  async (ctx) => ctx.step('task2', async () => processTask2()),
  async (ctx) => ctx.step('task3', async () => processTask3())
]);
```

------
#### [ Python ]

```
results = context.parallel([
    lambda ctx: ctx.step(lambda _: process_task1(), name='task1'),
    lambda ctx: ctx.step(lambda _: process_task2(), name='task2'),
    lambda ctx: ctx.step(lambda _: process_task3(), name='task3')
])
```

------
#### [ Java ]

```
DurableFuture<String> f1;
DurableFuture<Integer> f2;
DurableFuture<Boolean> f3;
try (var parallel = context.parallel("tasks")) {
    f1 = parallel.branch("string-task",  String.class,  ctx -> ctx.step("string-task",  String.class,  s -> processString()));
    f2 = parallel.branch("integer-task", Integer.class, ctx -> ctx.step("integer-task", Integer.class, s -> processInteger()));
    f3 = parallel.branch("boolean-task", Boolean.class, ctx -> ctx.step("boolean-task", Boolean.class, s -> processBoolean()));
}
String stringResult = f1.get();
int integerResult = f2.get();
boolean booleanResult = f3.get();
```

------

Use `parallel` to execute independent operations concurrently.

### Map
<a name="durable-sdk-op-map"></a>

Concurrently execute an operation on each item in an array with optional concurrency control. The SDK manages concurrent execution, creates checkpoints for each operation, and handles failures according to your completion policy.

------
#### [ TypeScript ]

```
const results = await context.map(itemArray, async (ctx, item, index) =>
  ctx.step('task', async () => processItem(item, index))
);
```

------
#### [ Python ]

```
results = context.map(
    item_array,
    lambda ctx, item, index: ctx.step(
        lambda _: process_item(item, index),
        name='task'
    )
)
```

------
#### [ Java ]

```
var results = context.map(
    "process-items",
    itemArray,
    String.class,
    (item, index, ctx) -> ctx.step("task", String.class, s -> processItem(item, index)));
```

------

Use `map` to process arrays with concurrency control.

### Child contexts
<a name="durable-sdk-op-child-context"></a>

Creates an isolated execution context for grouping operations. Child contexts have their own checkpoint log and can contain multiple steps, waits, and other operations. The SDK treats the entire child context as a single unit for retry and recovery.

Use child contexts to organize complex workflows, implement sub-workflows, or isolate operations that should retry together.

------
#### [ TypeScript ]

```
const result = await context.runInChildContext(
  'batch-processing',
  async (childCtx) => {
    return await processBatch(childCtx, items);
  }
);
```

------
#### [ Python ]

```
result = context.run_in_child_context(
    lambda child_ctx: process_batch(child_ctx, items),
    name='batch-processing'
)
```

------
#### [ Java ]

```
var result = context.runInChildContext(
    "batch-processing", 
    String.class, 
    childCtx -> processBatch(childCtx, items)
);
```

------

The replay mechanism requires that durable operations happen in a deterministic order. With multiple child contexts, you can run multiple streams of work concurrently, and the determinism requirement applies separately within each context. This allows you to build high-performance functions that efficiently use multiple CPU cores.

For example, suppose you start two child contexts, A and B. On the initial invocation, the steps within the contexts run in this order, with the A steps running concurrently with the B steps: A1, B1, B2, A2, A3. On replay, execution is much faster because results are retrieved from the checkpoint log, and the steps happen to be encountered in a different order: B1, A1, A2, B2, A3. Because the A steps were encountered in the correct order (A1, A2, A3) and the B steps were encountered in the correct order (B1, B2), the determinism requirement is satisfied.
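The ordering rule in this example can be checked mechanically. The following sketch is not part of the SDK: the helper names are hypothetical, and it assumes each operation ID starts with a single-letter context prefix, as in the A and B example.

```python
from typing import Dict, List


def per_context_order(operations: List[str]) -> Dict[str, List[str]]:
    """Group operation IDs such as 'A1' by context prefix, preserving order."""
    grouped: Dict[str, List[str]] = {}
    for op in operations:
        grouped.setdefault(op[0], []).append(op)
    return grouped


def replay_is_consistent(original: List[str], replay: List[str]) -> bool:
    """Replay is consistent when every context sees its operations in the same order."""
    return per_context_order(original) == per_context_order(replay)


original_run = ["A1", "B1", "B2", "A2", "A3"]
replay_run = ["B1", "A1", "A2", "B2", "A3"]   # interleaving differs, per-context order holds
bad_replay = ["A2", "A1", "B1", "B2", "A3"]   # A2 is encountered before A1

consistent = replay_is_consistent(original_run, replay_run)
inconsistent = replay_is_consistent(original_run, bad_replay)
```

Here `consistent` is `True` because only the interleaving between contexts changed, while `inconsistent` is `False` because context A encountered its own steps out of order.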

### Conditional waits
<a name="durable-sdk-op-wait-condition"></a>

Polls for a condition with automatic checkpointing between attempts. The SDK executes your check function, creates a checkpoint with the result, waits according to your strategy, and repeats until the condition is met.

------
#### [ TypeScript ]

```
const result = await context.waitForCondition(
  async (state, ctx) => {
    const status = await checkJobStatus(state.jobId);
    return { ...state, status };
  },
  {
    initialState: { jobId: 'job-123', status: 'pending' },
    waitStrategy: (state) => 
      state.status === 'completed' 
        ? { shouldContinue: false }
        : { shouldContinue: true, delay: { seconds: 30 } }
  }
);
```

------
#### [ Python ]

```
result = context.wait_for_condition(
    lambda state, ctx: {**state, 'status': check_job_status(state['jobId'])},
    config=WaitForConditionConfig(
        initial_state={'jobId': 'job-123', 'status': 'pending'},
        wait_strategy=lambda state, attempt: 
            {'should_continue': False} if state['status'] == 'completed'
            else {'should_continue': True, 'delay': 30}
    )
)
```

------
#### [ Java ]

```
record JobState(String jobId, String status) {}

var result = context.waitForCondition(
    "check-job",
    JobState.class,
    (state, ctx) -> {
        var status = checkJobStatus(state.jobId());
        var updatedState = new JobState(state.jobId(), status);
        if ("completed".equals(status)) {
            return WaitForConditionResult.stopPolling(updatedState);
        }
        return WaitForConditionResult.continuePolling(updatedState);
    },
    WaitForConditionConfig.<JobState>builder()
        .initialState(new JobState("job-123", "pending"))
        .waitStrategy((state, attempt) -> Duration.ofSeconds(30))
        .build());
```

------

Use `waitForCondition` for polling external systems, waiting for resources to be ready, or implementing retry with backoff.

### Function invocation
<a name="durable-sdk-op-invoke"></a>

Invokes another Lambda function and waits for its result. The SDK creates a checkpoint, invokes the target function, and resumes your function when the invocation completes. This enables function composition and workflow decomposition.

------
#### [ TypeScript ]

```
const result = await context.invoke(
  'invoke-processor',
  'arn:aws:lambda:us-east-1:123456789012:function:processor:1',
  { data: inputData }
);
```

------
#### [ Python ]

```
result = context.invoke(
    'arn:aws:lambda:us-east-1:123456789012:function:processor:1',
    {'data': input_data},
    name='invoke-processor'
)
```

------
#### [ Java ]

```
var result = context.invoke(
    "invoke-processor", 
    "arn:aws:lambda:us-east-1:123456789012:function:processor:1",
    inputData,
    Result.class, 
    InvokeConfig.builder().build()
);
```

------

## How durable operations are metered
<a name="durable-operations-checkpoint-consumption"></a>

Each durable operation you call through `DurableContext` creates checkpoints to track execution progress and store state data. These operations incur charges based on their usage, and the checkpoints may contain data that contributes to your data write and retention costs. Stored data includes invocation event data, payloads returned from steps, and data passed when completing callbacks. Understanding how durable operations are metered helps you estimate execution costs and optimize your workflows. For details on pricing, see the [Lambda pricing page](https://aws.amazon.com/lambda/pricing/).

Payload size refers to the size of the serialized data that a durable operation persists. The data is measured in bytes and the size can vary depending on the serializer used by the operation. The payload of an operation could be the result itself for successful completions, or the serialized error object if the operation failed.
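
To get a rough feel for payload size, you can serialize a value yourself and count the bytes. The following is a minimal sketch that assumes compact UTF-8 JSON. The serializer that a given operation actually uses may produce a different size, and the helper name `payload_size_bytes` and the sample values are illustrative only.

```python
import json


def payload_size_bytes(value) -> int:
    """Return the size in bytes of a value serialized as compact UTF-8 JSON.

    Assumption: JSON stands in for whatever serializer the operation uses.
    """
    serialized = json.dumps(value, separators=(",", ":"))
    return len(serialized.encode("utf-8"))


# Hypothetical payloads: a successful step result and a serialized error object
step_result = {"orderId": "12345", "status": "validated"}
error_object = {"errorType": "TimeoutError", "errorMessage": "upstream timed out"}

result_size = payload_size_bytes(step_result)
error_size = payload_size_bytes(error_object)
print(f"result: {result_size} bytes, error: {error_size} bytes")
```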

### Basic operations
<a name="durable-operations-basic"></a>

Basic operations are the fundamental building blocks for durable functions:


| Operation | Checkpoint timing | Number of operations | Data persisted | 
| --- | --- | --- | --- | 
| Execution | Started | 1 | Input payload size | 
| Execution | Completed (Succeeded/Failed/Stopped) | 0 | Output payload size | 
| Step | Retry/Succeeded/Failed | 1 + N retries | Returned payload size from each attempt | 
| Wait | Started | 1 | N/A | 
| WaitForCondition | Each poll attempt | 1 + N polls | Returned payload size from each poll attempt | 
| Invocation-level Retry | Started | 1 | Payload for error object | 

### Callback operations
<a name="durable-operations-callbacks"></a>

Callback operations enable your function to pause and wait for external systems to provide input. These operations create checkpoints when the callback is created and when it's completed:


| Operation | Checkpoint timing | Number of operations | Data persisted | 
| --- | --- | --- | --- | 
| CreateCallback | Started | 1 | N/A | 
| Callback completion via API call | Completed | 0 | Callback payload | 
| WaitForCallback | Started | 3 + N retries (context + callback + step) | Payloads returned by submitter step attempts, plus two copies of the callback payload | 

### Compound operations
<a name="durable-operations-compound"></a>

Compound operations combine multiple durable operations to handle complex coordination patterns like parallel execution, array processing, and nested contexts:


| Operation | Checkpoint timing | Number of operations | Data persisted | 
| --- | --- | --- | --- | 
| Parallel | Started | 1 + N branches (1 parent context + N child contexts) | Up to two copies of the returned payload size from each branch, plus the statuses of each branch | 
| Map | Started | 1 + N branches (1 parent context + N child contexts) | Up to two copies of the returned payload size from each iteration, plus the statuses of each iteration | 
| Promise helpers | Completed | 1 | Returned payload size from the promise | 
| RunInChildContext | Succeeded/Failed | 1 | Returned payload size from the child context | 

For contexts, such as from `runInChildContext` or used internally by compound operations, results smaller than 256 KB are checkpointed directly. Larger results aren't stored—instead, they're reconstructed during replay by re-processing the context's operations.
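
The operation counts in the preceding tables can be combined into a rough estimate for a whole workflow. The following sketch only encodes the "Number of operations" column. The function names and the sample workflow are hypothetical, and operations that run inside each parallel or map branch must be added separately.

```python
def step_ops(retries: int = 0) -> int:
    """Step: 1 operation plus 1 per retry."""
    return 1 + retries


def wait_ops() -> int:
    """Wait: 1 operation."""
    return 1


def wait_for_condition_ops(polls: int) -> int:
    """WaitForCondition: 1 operation plus 1 per poll."""
    return 1 + polls


def wait_for_callback_ops(submitter_retries: int = 0) -> int:
    """WaitForCallback: context + callback + step, plus submitter step retries."""
    return 3 + submitter_retries


def fan_out_ops(branches: int) -> int:
    """Parallel or Map: 1 parent context plus 1 child context per branch."""
    return 1 + branches


# Hypothetical workflow: execution start, a step that retries twice, one wait,
# one callback wait without retries, and a map over 10 items
total_ops = (
    1                            # execution started
    + step_ops(retries=2)        # 3
    + wait_ops()                 # 1
    + wait_for_callback_ops()    # 3
    + fan_out_ops(branches=10)   # 11
)
print(total_ops)
```

This estimate covers operation counts only. Data write and retention costs depend on the payload sizes that each operation persists.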

# Supported runtimes for durable functions
<a name="durable-supported-runtimes"></a>

Durable functions are available on selected managed runtimes, and on OCI container images when you need additional runtime version flexibility. You can create durable functions for Node.js, Python, and Java using managed runtimes directly in the console or programmatically through infrastructure-as-code.

## Lambda managed runtimes
<a name="durable-managed-runtimes"></a>

The following managed runtimes support durable functions when you create functions in the Lambda console or using the AWS CLI with the `--durable-config '{"ExecutionTimeout": 3600, "RetentionPeriodInDays": 7}'` parameter. For complete information about Lambda runtimes, see [Lambda runtimes](lambda-runtimes.md).


| Language | Runtime | 
| --- | --- | 
| Node.js | nodejs22.x | 
| Node.js | nodejs24.x | 
| Python | python3.13 | 
| Python | python3.14 | 
| Java | java17 | 
| Java | java21 | 
| Java | java25 | 

**Note**  
Lambda Node.js and Python runtimes include the durable execution SDK for testing and development. However, we recommend including the SDK in your deployment package for production. This ensures version consistency and avoids potential runtime updates that might affect your function behavior. Because Java is a compiled language, Lambda Java runtimes do not include the durable execution SDK, so it must be included in your deployment package.

### Node.js
<a name="durable-runtime-nodejs"></a>

Install the SDK in your Node.js project:

```
npm install @aws/durable-execution-sdk-js
```

The SDK supports JavaScript and TypeScript. For TypeScript projects, the SDK includes type definitions.

### Python
<a name="durable-runtime-python"></a>

Install the SDK in your Python project:

```
pip install aws-durable-execution-sdk-python
```

The Python SDK uses synchronous methods and doesn't require `async/await`.

### Java
<a name="durable-runtime-java"></a>

Add a dependency to `pom.xml`:

```
<dependency>
    <groupId>software.amazon.lambda.durable</groupId>
    <artifactId>aws-durable-execution-sdk-java</artifactId>
    <version>VERSION</version>
</dependency>
```

Install the SDK in your Java project:

```
mvn install
```

The Java SDK provides both synchronous and asynchronous versions of each method.

## Container images
<a name="durable-container-images"></a>

You can use durable functions with container images to support additional runtime versions or custom runtime configurations. Container images let you use runtime versions not available as managed runtimes or customize your runtime environment.

To create a durable function using a container image:

1. Create a Dockerfile based on a Lambda base image

1. Install the durable execution SDK in your container

1. Build and push the container image to Amazon Elastic Container Registry

1. Create the Lambda function from the container image with durable execution enabled

### Container example
<a name="durable-container-python"></a>

Create a Dockerfile:

------
#### [ Python ]

Create a Dockerfile for Python 3.11:

```
FROM public.ecr.aws/lambda/python:3.11

# Copy requirements file
COPY requirements.txt ${LAMBDA_TASK_ROOT}/

# Install dependencies including durable SDK
RUN pip install -r requirements.txt

# Copy function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}/

# Set the handler
CMD [ "lambda_function.handler" ]
```

Create a `requirements.txt` file:

```
aws-durable-execution-sdk-python
```

------
#### [ Java ]

Create a Dockerfile for Java 25:

```
FROM --platform=linux/amd64 public.ecr.aws/lambda/java:25

# Install Maven
RUN dnf install -y maven

WORKDIR /var/task

# Copy Maven configuration and source code
COPY pom.xml .
COPY src ./src

# Build
RUN mvn clean package -DskipTests

# Move JAR to lib directory (create it first, because it might not exist in the base image)
RUN mkdir -p lib && mv target/*.jar lib/

# Set the handler
CMD ["src.path.to.lambdaFunction::handler"]
```

------

Build and push the image:

```
# Build the image
docker build -t my-durable-function .

# Tag for ECR
docker tag my-durable-function:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-durable-function:latest

# Push to ECR
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-durable-function:latest
```

Create the function with durable execution enabled:

```
aws lambda create-function \
  --function-name myDurableFunction \
  --package-type Image \
  --code ImageUri=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-durable-function:latest \
  --role arn:aws:iam::123456789012:role/lambda-execution-role \
  --durable-config '{"ExecutionTimeout": 3600, "RetentionPeriodInDays": 7}'
```

For more information about using container images with Lambda, see [Creating Lambda container images](https://docs.aws.amazon.com/lambda/latest/dg/images-create.html) in the Lambda Developer Guide.

## Runtime considerations
<a name="durable-runtime-considerations"></a>

**SDK version management:** Include the durable execution SDK in your deployment package or container image. This ensures your function uses a specific SDK version and isn't affected by runtime updates. Pin SDK versions in your `package.json`, `requirements.txt`, or `pom.xml` to control when you upgrade.

**Runtime updates:** AWS updates managed runtimes to include security patches and bug fixes. These updates may include new SDK versions. To avoid unexpected behavior, include the SDK in your deployment package and test thoroughly before deploying to production.

**Container image size:** Container images have a maximum uncompressed size of 10 GB. The durable execution SDK adds minimal size to your image. Optimize your container by using multi-stage builds and removing unnecessary dependencies.

**Cold start performance:** Container images may have longer cold start times than managed runtimes. The durable execution SDK has minimal impact on cold start performance. Use provisioned concurrency if cold start latency is critical for your application.

# Invoking durable Lambda functions
<a name="durable-invoking"></a>

Durable Lambda functions support the same invocation methods as standard Lambda functions. You can invoke durable functions synchronously, asynchronously, or through event source mappings. The invocation process is identical to standard functions, but durable functions provide additional capabilities for long-running executions and automatic state management.

## Invocation methods
<a name="durable-invoking-methods"></a>

**Synchronous invocation:** Invoke a durable function and wait for the response. Lambda limits synchronous invocations to 15 minutes (or less, depending on the configured function and execution timeouts). Use synchronous invocation when you need immediate results or when integrating with APIs and services that expect a response. You can use wait operations for efficient computation without disrupting the caller: the invocation stays open until the entire durable execution completes. For idempotent execution starts, use the execution name parameter as described in [Idempotency](durable-execution-idempotency.md).

```
aws lambda invoke \
  --function-name my-durable-function:1 \
  --cli-binary-format raw-in-base64-out \
  --payload '{"orderId": "12345"}' \
  response.json
```

**Asynchronous invocation:** Queue an event for processing without waiting for a response. Lambda places the event in a queue and returns immediately. Asynchronous invocations support execution durations up to 1 year. Use asynchronous invocation for fire-and-forget scenarios or when processing can happen in the background. For idempotent execution starts, use the execution name parameter as described in [Idempotency](durable-execution-idempotency.md).

```
aws lambda invoke \
  --function-name my-durable-function:1 \
  --invocation-type Event \
  --cli-binary-format raw-in-base64-out \
  --payload '{"orderId": "12345"}' \
  response.json
```

**Event source mappings:** Configure Lambda to automatically invoke your durable function when records are available from stream or queue-based services like Amazon SQS, Kinesis, or DynamoDB. Event source mappings poll the event source and invoke your function with batches of records. For details about using event source mappings with durable functions, including execution duration limits, see [Event source mappings with durable functions](durable-invoking-esm.md).

For complete details about each invocation method, see [synchronous invocation](invocation-sync.md) and [asynchronous invocation](invocation-async.md).

**Note**  
Durable functions support dead-letter queues (DLQs) for error handling, but don't support Lambda destinations. Configure a DLQ to capture records from failed invocations.

## Qualified ARNs requirement
<a name="durable-invoking-qualified-arns"></a>

Durable functions require qualified identifiers for invocation. You must invoke durable functions using a version number, alias, or `$LATEST`. You can use either a qualified ARN or a function name with a version or alias suffix. You cannot use an unqualified identifier (one without a version or alias suffix).

**Valid invocations:**

```
# Using full ARN with version number
arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:1

# Using full ARN with alias
arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:prod

# Using full ARN with $LATEST
arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:$LATEST

# Using function name with version number
my-durable-function:1

# Using function name with alias
my-durable-function:prod
```

**Invalid invocations:**

```
# Unqualified ARN (not allowed)
arn:aws:lambda:us-east-1:123456789012:function:my-durable-function

# Unqualified function name (not allowed)
my-durable-function
```

This requirement ensures that durable executions remain consistent throughout their lifecycle. When a durable execution starts, it's pinned to the specific function version. If your function pauses and resumes hours or days later, Lambda invokes the same version that started the execution, ensuring code consistency across the entire workflow.
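
If your application builds function identifiers dynamically, a client-side check can catch unqualified identifiers before you call Lambda. The following is a minimal sketch. The helper `is_qualified` is hypothetical, and it only recognizes the two forms shown in this section (full ARN and function name), not partial ARNs.

```python
def is_qualified(identifier: str) -> bool:
    """Return True if the identifier ends with a version, alias, or $LATEST suffix.

    Handles only two forms:
      arn:aws:lambda:<region>:<account>:function:<name>:<qualifier>
      <name>:<qualifier>
    """
    parts = identifier.split(":")
    if parts[0] == "arn":
        # A qualified function ARN has exactly eight colon-separated fields
        return len(parts) == 8 and parts[5] == "function" and parts[7] != ""
    # A qualified function name has exactly one colon with text on both sides
    return len(parts) == 2 and all(parts)


print(is_qualified("my-durable-function:prod"))
print(is_qualified("my-durable-function"))
```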

**Best practice**  
Use numbered versions or aliases for production durable functions rather than `$LATEST`. Numbered versions are immutable and support deterministic replay. Optionally, aliases provide a stable reference that you can update to point to new versions without changing invocation code. When you update an alias, new executions use the new version, while in-progress executions continue with their original version. You may use `$LATEST` for prototyping or to shorten deployment times during development, understanding that executions might not replay correctly (or even fail) if the underlying code changes during running executions.

## Understanding execution lifecycle
<a name="durable-invoking-execution-lifecycle"></a>

When you invoke a durable function, Lambda creates a durable execution that can span multiple function invocations:

1. **Initial invocation:** Your invocation request creates a new durable execution. Lambda assigns a unique execution ID and starts processing.

1. **Execution and checkpointing:** As your function executes durable operations, the SDK creates checkpoints that track progress.

1. **Suspension (if needed):** If your function uses durable waits, such as `wait` or `waitForCallback`, or automatic step retries, Lambda suspends the execution and stops charging for compute time.

1. **Resumption:** When it's time to resume (including after retries), Lambda invokes your function again. The SDK replays the checkpoint log and continues from where execution paused.

1. **Completion:** When your function returns a final result or throws an unhandled error, the durable execution completes.

For synchronous invocations, the caller waits for the entire durable execution to complete, including any wait operations. If the execution exceeds the invocation timeout (15 minutes or less), the invocation times out. For asynchronous invocations, Lambda returns immediately and the execution continues independently. Use the durable execution APIs to track execution status and retrieve final results.
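
The lifecycle can be illustrated with a toy model of checkpointing and replay. This sketch is not the durable execution SDK: `ToyContext`, `Suspend`, and the in-memory `checkpoints` dictionary are stand-ins invented for illustration, and the model treats a wait as elapsed by the time the next invocation starts.

```python
class Suspend(Exception):
    """Raised by the toy model to end an invocation while a wait is pending."""


class ToyContext:
    """In-memory stand-in for a durable context (not the real SDK)."""

    def __init__(self, checkpoints: dict):
        self.checkpoints = checkpoints  # survives across invocations
        self.executed = []              # steps that actually ran in this invocation

    def step(self, name, fn):
        if name in self.checkpoints:
            return self.checkpoints[name]  # replay: reuse the stored result
        result = fn()
        self.checkpoints[name] = result    # checkpoint the completed step
        self.executed.append(name)
        return result

    def wait(self, name):
        if name in self.checkpoints:
            return                         # wait already recorded, continue
        self.checkpoints[name] = "elapsed"
        raise Suspend()                    # end this invocation


def workflow(ctx):
    order = ctx.step("validate", lambda: {"orderId": "12345"})
    ctx.wait("cool-down")
    return ctx.step("complete", lambda: {**order, "status": "done"})


checkpoints = {}

first = ToyContext(checkpoints)   # initial invocation
try:
    workflow(first)
except Suspend:
    pass

second = ToyContext(checkpoints)  # resumption: the handler runs from the top
final = workflow(second)
print(first.executed, second.executed, final)
```

In the second invocation the handler runs from the beginning, but `validate` returns its checkpointed result instead of executing again.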

## Invoking from application code
<a name="durable-invoking-with-sdk"></a>

Use the AWS SDKs to invoke durable functions from your application code. The invocation process is identical to standard functions:

------
#### [ TypeScript ]

```
import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';

const client = new LambdaClient({});

// Synchronous invocation
const response = await client.send(new InvokeCommand({
  FunctionName: 'arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:1',
  Payload: JSON.stringify({ orderId: '12345' })
}));

const result = JSON.parse(Buffer.from(response.Payload!).toString());

// Asynchronous invocation
await client.send(new InvokeCommand({
  FunctionName: 'arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:1',
  InvocationType: 'Event',
  Payload: JSON.stringify({ orderId: '12345' })
}));
```

------
#### [ Python ]

```
import boto3
import json

client = boto3.client('lambda')

# Synchronous invocation
response = client.invoke(
    FunctionName='arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:1',
    Payload=json.dumps({'orderId': '12345'})
)

result = json.loads(response['Payload'].read())

# Asynchronous invocation
client.invoke(
    FunctionName='arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:1',
    InvocationType='Event',
    Payload=json.dumps({'orderId': '12345'})
)
```

------

## Chained invocations
<a name="durable-invoking-chained"></a>

Durable functions can invoke other durable and non-durable functions using the `invoke` operation from `DurableContext`. This creates a chained invocation where the calling function waits (suspends) for the invoked function to complete:

------
#### [ TypeScript ]

```
export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Invoke another durable function and wait for result
    const result = await context.invoke(
      'process-order',
      'arn:aws:lambda:us-east-1:123456789012:function:order-processor:1',
      { orderId: event.orderId }
    );
    
    return { statusCode: 200, body: JSON.stringify(result) };
  }
);
```

------
#### [ Python ]

```
@durable_execution
def handler(event, context: DurableContext):
    # Invoke another durable function and wait for result
    result = context.invoke(
        'arn:aws:lambda:us-east-1:123456789012:function:order-processor:1',
        {'orderId': event['orderId']},
        name='process-order'
    )
    
    return {'statusCode': 200, 'body': json.dumps(result)}
```

------

Chained invocations create a checkpoint in the calling function. If the calling function is interrupted, it resumes from the checkpoint with the invoked function's result, without re-invoking the function.

**Note**  
Cross-account chained invocations are not supported. The invoked function must be in the same AWS account as the calling function.

# Event source mappings with durable functions
<a name="durable-invoking-esm"></a>

Durable functions work with all Lambda event source mappings. Configure event source mappings for durable functions the same way you configure them for standard functions. Event source mappings automatically poll event sources like Amazon SQS, Kinesis, and DynamoDB Streams, and invoke your function with batches of records.

Event source mappings are useful for durable functions that process streams or queues with complex, multi-step workflows. For example, you can create a durable function that processes Amazon SQS messages with retries, external API calls, and human approvals.

## How event source mappings invoke durable functions
<a name="durable-esm-invocation-behavior"></a>

Event source mappings invoke durable functions synchronously, waiting for the complete durable execution to finish before processing the next batch or marking records as processed. If the total durable execution time exceeds 15 minutes, the execution times out and fails. The event source mapping receives a timeout exception and handles it according to its retry configuration.

## 15-minute execution limit
<a name="durable-esm-duration-limit"></a>

When durable functions are invoked by event source mappings, the total durable execution duration cannot exceed 15 minutes. This limit applies to the entire durable execution from start to completion, not just individual function invocations.

This 15-minute limit is separate from the Lambda function timeout (also 15 minutes maximum). The function timeout controls how long each individual invocation can run, while the durable execution timeout controls the total elapsed time from execution start to completion.

**Example scenarios:**
+ **Valid:** A durable function processes an Amazon SQS message with three steps, each taking 2 minutes, then waits 5 minutes before completing a final step. Total execution time: 11 minutes. This works because the total is under 15 minutes.
+ **Invalid:** A durable function processes an Amazon SQS message, completes initial processing in 2 minutes, then waits 20 minutes for an external callback before completing. Total execution time: 22 minutes. This exceeds the 15-minute limit and will fail.
+ **Invalid:** A durable function processes a Kinesis record with multiple wait operations totaling 30 minutes between steps. Even though each individual invocation completes quickly, the total execution time exceeds 15 minutes.

**Important**  
Set your durable execution timeout to 15 minutes or less when using event source mappings. Otherwise, creating the event source mapping fails. If your workflow requires longer execution times, use the intermediary function pattern described below.
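
To check a planned workflow against this limit, add up every phase of the execution, including waits. The following sketch reproduces the arithmetic of the first two example scenarios. The helper names are hypothetical, and the phase durations are estimates that you supply.

```python
from datetime import timedelta

# Limit for durable executions started by an event source mapping
ESM_EXECUTION_LIMIT = timedelta(minutes=15)


def total_execution_time(phases) -> timedelta:
    """Sum every phase of the execution: steps and waits alike."""
    return sum(phases, timedelta())


def fits_esm_limit(phases) -> bool:
    """Return True if the summed duration does not exceed the limit."""
    return total_execution_time(phases) <= ESM_EXECUTION_LIMIT


# Three 2-minute steps followed by a 5-minute wait: 11 minutes in total
valid_plan = [timedelta(minutes=2)] * 3 + [timedelta(minutes=5)]

# 2 minutes of processing followed by a 20-minute callback wait: 22 minutes
invalid_plan = [timedelta(minutes=2), timedelta(minutes=20)]

print(fits_esm_limit(valid_plan), fits_esm_limit(invalid_plan))
```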

## Configuring event source mappings
<a name="durable-esm-configuration"></a>

Configure event source mappings for durable functions using the Lambda console, AWS CLI, or AWS SDKs. All standard event source mapping properties apply to durable functions:

```
aws lambda create-event-source-mapping \
  --function-name arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:1 \
  --event-source-arn arn:aws:sqs:us-east-1:123456789012:my-queue \
  --batch-size 10 \
  --maximum-batching-window-in-seconds 5
```

Remember to use a qualified ARN (with version number or alias) when configuring event source mappings for durable functions.

## Error handling with event source mappings
<a name="durable-esm-error-handling"></a>

Event source mappings provide built-in error handling that works with durable functions:
+ **Retry behavior:** If the initial invocation fails, the event source mapping retries according to its retry configuration. Configure maximum retry attempts and retry intervals based on your requirements.
+ **Dead-letter queues:** Configure a dead-letter queue to capture records that fail after all retries. This prevents message loss and enables manual inspection of failed records.
+ **Partial batch failures:** For Amazon SQS and Kinesis, use partial batch failure reporting to process records individually and only retry failed records.
+ **Bisect on error:** For Kinesis and DynamoDB Streams, enable bisect on error to split failed batches and isolate problematic records.

**Note**  
Durable functions support dead-letter queues (DLQs) for error handling, but don't support Lambda destinations. Configure a DLQ to capture records from failed invocations.

For complete information about event source mapping error handling, see [event source mappings](invocation-eventsourcemapping.md).
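
As an illustration of partial batch failure reporting, the following sketch shows the response shape that a standard Lambda handler returns for Amazon SQS when `ReportBatchItemFailures` is enabled on the event source mapping. The `process_message` function is hypothetical business logic. Whether a durable handler returns this shape unchanged is an assumption that you should verify for your SDK version.

```python
import json


def process_message(body: dict) -> None:
    """Hypothetical business logic that rejects messages without an order ID."""
    if "orderId" not in body:
        raise ValueError("missing orderId")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process_message(json.loads(record["body"]))
        except Exception:
            # Report only this record as failed so that the rest of the batch succeeds
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list tells the event source mapping that the whole batch succeeded.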

## Using an intermediary function for long-running workflows
<a name="durable-esm-intermediary-function"></a>

If your workflow requires more than 15 minutes to complete, use an intermediary standard Lambda function between the event source mapping and your durable function. The intermediary function receives events from the event source mapping and invokes the durable function asynchronously, removing the 15-minute execution limit.

This pattern decouples the event source mapping's synchronous invocation model from the durable function's long-running execution model. The event source mapping invokes the intermediary function, which quickly returns after starting the durable execution. The durable function then runs independently for as long as needed (up to 1 year).

### Architecture
<a name="durable-esm-intermediary-architecture"></a>

The intermediary function pattern uses three components:

1. **Event source mapping:** Polls the event source (Amazon SQS, Kinesis, DynamoDB Streams) and invokes the intermediary function synchronously with batches of records.

1. **Intermediary function:** A standard Lambda function that receives events from the event source mapping, validates and transforms the data if needed, and invokes the durable function asynchronously. This function completes quickly (typically under 1 second) and returns control to the event source mapping.

1. **Durable function:** Processes the event with complex, multi-step logic that can run for extended periods. Invoked asynchronously, so it's not constrained by the 15-minute limit.

### Implementation
<a name="durable-esm-intermediary-implementation"></a>

The intermediary function receives the entire event from the event source mapping and invokes the durable function asynchronously. Use the execution name parameter to ensure idempotent execution starts, preventing duplicate processing if the event source mapping retries:

------
#### [ TypeScript ]

```
import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';
import { SQSEvent } from 'aws-lambda';
import { createHash } from 'crypto';

const lambda = new LambdaClient({});

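export const handler = async (event: SQSEvent) => {
  // Derive a deterministic execution name from the batch's message IDs so that
  // a retried batch maps to the same name (one possible scheme, shown as an example)
  const executionName = createHash('sha256')
    .update(event.Records.map((record) => record.messageId).join(','))
    .digest('hex');

  // Invoke durable function asynchronously with execution name
  await lambda.send(new InvokeCommand({
    FunctionName: 'arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:1',
    InvocationType: 'Event',
    Payload: JSON.stringify({
      executionName,
      event: event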
    })
  }));
  
  return { statusCode: 200 };
};
```

------
#### [ Python ]

```
import boto3
import json
import hashlib

lambda_client = boto3.client('lambda')

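def handler(event, context):
    # Derive a deterministic execution name from the batch's message IDs so that
    # a retried batch maps to the same name (one possible scheme, shown as an example)
    message_ids = ','.join(record['messageId'] for record in event['Records'])
    execution_name = hashlib.sha256(message_ids.encode()).hexdigest()

    # Invoke durable function asynchronously with execution name
    lambda_client.invoke(
        FunctionName='arn:aws:lambda:us-east-1:123456789012:function:my-durable-function:1',
        InvocationType='Event',
        Payload=json.dumps({
            'executionName': execution_name,
            'event': event
        })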
    )
    
    return {'statusCode': 200}
```

------

For idempotency in the intermediary function itself, use [Powertools for AWS Lambda](https://docs.aws.amazon.com/powertools/) to prevent duplicate invocations of the durable function if the event source mapping retries the intermediary function.

The durable function receives the payload with the execution name and processes all records with long-running logic:

------
#### [ TypeScript ]

```
import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (payload: any, context: DurableContext) => {
    const sqsEvent = payload.event;
    
    // Process each record with complex, multi-step logic
    const results = await context.map(
      sqsEvent.Records,
      async (ctx, record) => {
        const validated = await ctx.step('validate', async () => {
          return validateOrder(JSON.parse(record.body));
        });
        
        // Wait for external approval (could take hours or days)
        const approval = await ctx.waitForCallback(
          'approval',
          async (callbackId) => {
            await requestApproval(callbackId, validated);
          },
          { timeout: { hours: 48 } }
        );
        
        // Complete processing
        return await ctx.step('complete', async () => {
          return completeOrder(validated, approval);
        });
      }
    );
    
    return { statusCode: 200, processed: results.getResults().length };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import durable_execution, DurableContext
from aws_durable_execution_sdk_python.config import Duration, WaitForCallbackConfig
from collections.abc import Sequence
import json

def validate_order(order_data: dict) -> dict:
    """Validate order data - always passes."""
    return order_data

def request_approval(callback_id: str, validated_order: dict) -> None:
    """Request approval for the order - always passes."""
    pass

def complete_order(validated_order: dict, approval_result: str) -> dict:
    """Complete the order processing - always passes."""
    return validated_order

@durable_execution
def lambda_handler(payload, context: DurableContext):
    sqs_event = payload['event']

    def process_record(
        ctx: DurableContext, 
        record: dict, 
        index: int, 
        items: Sequence[dict]
    ) -> dict:
        validated = ctx.step(
            lambda _: validate_order(json.loads(record['body'])),
            name=f'validate-{index}'
        )

        approval = ctx.wait_for_callback(
            submitter=lambda callback_id, wait_ctx: request_approval(callback_id, validated),
            name=f'approval-{index}',
            config=WaitForCallbackConfig(timeout=Duration.from_seconds(172800))
        )

        return ctx.step(
            lambda _: complete_order(validated, approval),
            name=f'complete-{index}'
        )

    results = context.map(
        inputs=sqs_event['Records'],
        func=process_record,
        name='process-records'
    )

    return {
        'statusCode': 200, 
        'started': results.started_count,
        'completed': results.success_count,
        'failed': results.failure_count,
        'total': results.total_count
    }
```

------

### Key considerations
<a name="durable-esm-intermediary-tradeoffs"></a>

This pattern removes the 15-minute execution limit by decoupling the event source mapping from the durable execution. The intermediary function returns immediately after starting the durable execution, allowing the event source mapping to continue processing. The durable function then runs independently for as long as needed.

The intermediary function succeeds when it invokes the durable function, not when the durable execution completes. If the durable execution fails later, the event source mapping won't retry because it already processed the batch successfully. Implement error handling in the durable function and configure dead-letter queues for failed executions.

Use the execution name parameter to ensure idempotent execution starts. If the event source mapping retries the intermediary function, the durable function won't start a duplicate execution because the execution name already exists.

## Supported event sources
<a name="durable-esm-supported-sources"></a>

Durable functions support all Lambda event sources that use event source mappings:
+ Amazon SQS queues (standard and FIFO)
+ Kinesis streams
+ DynamoDB Streams
+ Amazon Managed Streaming for Apache Kafka (Amazon MSK)
+ Self-managed Apache Kafka
+ Amazon MQ (ActiveMQ and RabbitMQ)
+ Amazon DocumentDB change streams

All event source types are subject to the 15-minute durable execution limit when invoking durable functions.

# Retries for Lambda durable functions
<a name="durable-execution-sdk-retries"></a>

Durable functions provide automatic retry capabilities that make your applications resilient to transient failures. The SDK handles retries at two levels: step retries for business logic failures and backend retries for infrastructure failures.

## Step retries
<a name="durable-step-retries"></a>

When an uncaught exception occurs within a step, the SDK automatically retries the step based on the configured retry strategy. Step retries are checkpointed operations that allow the SDK to suspend execution and resume later without losing progress.

### Step retry behavior
<a name="durable-step-retry-behavior"></a>

The following table describes how the SDK handles exceptions within steps:


| Scenario | What happens | Metering impact | 
| --- | --- | --- | 
| Exception in step with remaining retry attempts | The SDK creates a checkpoint for the retry and suspends the function. On the next invocation, the step retries with the configured backoff delay. | 1 operation + error payload size | 
| Exception in step with no remaining retry attempts | The step fails and throws an exception. If your handler code doesn't catch this exception, the entire execution fails. | 1 operation + error payload size | 

When a step needs to retry, the SDK checkpoints the retry state and exits the Lambda invocation if no other work is running. This allows the SDK to implement backoff delays without consuming compute resources. The function resumes automatically after the backoff period.

### Configuring step retry strategies
<a name="durable-step-retry-configuration"></a>

Configure retry strategies to control how steps handle failures. You can specify maximum attempts, backoff intervals, and conditions for retrying.

**Exponential backoff with max attempts:**

------
#### [ TypeScript ]

```
const result = await context.step('call-api', async () => {
  const response = await fetch('https://api.example.com/data');
  if (!response.ok) throw new Error(`API error: ${response.status}`);
  return await response.json();
}, {
  retryStrategy: (error, attemptCount) => {
    if (attemptCount >= 5) {
      return { shouldRetry: false };
    }
    // Exponential backoff: 2s, 4s, 8s, 16s between the five attempts (delay capped at 300s)
    const delay = Math.min(2 * Math.pow(2, attemptCount - 1), 300);
    return { shouldRetry: true, delay: { seconds: delay } };
  }
});
```

------
#### [ Python ]

```
def retry_strategy(error, attempt_count):
    if attempt_count >= 5:
        return RetryDecision(should_retry=False)
    # Exponential backoff: 2s, 4s, 8s, 16s between the five attempts (delay capped at 300s)
    delay = min(2 * (2 ** (attempt_count - 1)), 300)
    return RetryDecision(should_retry=True, delay=delay)

result = context.step(
    lambda _: call_external_api(),
    name='call-api',
    config=StepConfig(retry_strategy=retry_strategy)
)
```

------
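
Before you deploy a retry strategy, it can help to list the delays it would actually request. The following sketch is plain Python with no SDK dependency. `Decision` is a local stand-in for the SDK's retry decision type, `exponential_backoff` and `preview_delays` are hypothetical helpers, and the preview assumes that `attempt_count` is 1 on the first failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Decision:
    """Local stand-in for the SDK's retry decision (not the SDK class)."""
    should_retry: bool
    delay: Optional[int] = None  # seconds


def exponential_backoff(
    max_attempts: int, base: int, cap: int
) -> Callable[[Exception, int], Decision]:
    """Build a strategy with the same shape as the exponential backoff example."""
    def strategy(error: Exception, attempt_count: int) -> Decision:
        if attempt_count >= max_attempts:
            return Decision(should_retry=False)
        return Decision(True, min(base * 2 ** (attempt_count - 1), cap))
    return strategy


def preview_delays(strategy, limit: int = 20) -> List[int]:
    """List the delays a strategy requests, assuming attempt_count starts at 1."""
    delays = []
    for attempt in range(1, limit + 1):
        decision = strategy(RuntimeError("simulated failure"), attempt)
        if not decision.should_retry:
            break
        delays.append(decision.delay)
    return delays


print(preview_delays(exponential_backoff(max_attempts=5, base=2, cap=300)))
# [2, 4, 8, 16]
```

With a maximum of five attempts, the strategy requests four delays, because the fifth failure ends the step instead of scheduling another retry.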

**Fixed interval backoff:**

------
#### [ TypeScript ]

```
const orders = await context.step('query-orders', async () => {
  return await queryDatabase(event.userId);
}, {
  retryStrategy: (error, attemptCount) => {
    if (attemptCount >= 3) {
      return { shouldRetry: false };
    }
    return { shouldRetry: true, delay: { seconds: 5 } };
  }
});
```

------
#### [ Python ]

```
def retry_strategy(error, attempt_count):
    if attempt_count >= 3:
        return RetryDecision(should_retry=False)
    return RetryDecision(should_retry=True, delay=5)

orders = context.step(
    lambda _: query_database(event['userId']),
    name='query-orders',
    config=StepConfig(retry_strategy=retry_strategy)
)
```

------

**Conditional retry (retry only specific errors):**

------
#### [ TypeScript ]

```
const result = await context.step('call-rate-limited-api', async () => {
  const response = await fetch('https://api.example.com/data');
  
  if (response.status === 429) throw new Error('RATE_LIMIT');
  if (response.status === 504) throw new Error('TIMEOUT');
  if (!response.ok) throw new Error(`API_ERROR_${response.status}`);
  
  return await response.json();
}, {
  retryStrategy: (error, attemptCount) => {
    // Only retry rate limits and timeouts
    const isRetryable = error.message === 'RATE_LIMIT' || error.message === 'TIMEOUT';
    
    if (!isRetryable || attemptCount >= 3) {
      return { shouldRetry: false };
    }
    
    // Exponential backoff: 1s, 2s between the three attempts (delay capped at 30s)
    const delay = Math.min(Math.pow(2, attemptCount - 1), 30);
    return { shouldRetry: true, delay: { seconds: delay } };
  }
});
```

------
#### [ Python ]

```
def retry_strategy(error, attempt_count):
    # Only retry rate limits and timeouts
    is_retryable = str(error) in ['RATE_LIMIT', 'TIMEOUT']
    
    if not is_retryable or attempt_count >= 3:
        return RetryDecision(should_retry=False)
    
    # Exponential backoff: 1s, 2s between the three attempts (delay capped at 30s)
    delay = min(2 ** (attempt_count - 1), 30)
    return RetryDecision(should_retry=True, delay=delay)

result = context.step(
    lambda _: call_rate_limited_api(),
    name='call-rate-limited-api',
    config=StepConfig(retry_strategy=retry_strategy)
)
```

------
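
Matching on error message strings can break when a message changes. An alternative is to classify errors by exception type. The following Python sketch assumes that the retry strategy receives the original exception object, which you should confirm for your SDK version. `Decision` is a local stand-in for the SDK's retry decision type, and `RateLimitError`, `UpstreamTimeout`, and `retry_transient_only` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    """Local stand-in for the SDK's retry decision (not the SDK class)."""
    should_retry: bool
    delay: Optional[int] = None  # seconds


class RateLimitError(Exception):
    """Raised by the step body when the API responds with 429."""


class UpstreamTimeout(Exception):
    """Raised by the step body when the API responds with 504."""


# Only these exception types are treated as transient.
RETRYABLE = (RateLimitError, UpstreamTimeout)


def retry_transient_only(
    error: Exception, attempt_count: int, max_attempts: int = 3, cap: int = 30
) -> Decision:
    """Retry transient errors with exponential backoff; fail fast otherwise."""
    if not isinstance(error, RETRYABLE) or attempt_count >= max_attempts:
        return Decision(should_retry=False)
    return Decision(True, min(2 ** (attempt_count - 1), cap))
```

Subclasses of the retryable types are retried as well, so you can add more specific errors later without changing the strategy.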

**Disable retries:**

------
#### [ TypeScript ]

```
const isDuplicate = await context.step('check-duplicate', async () => {
  return await checkIfOrderExists(event.orderId);
}, {
  retryStrategy: () => ({ shouldRetry: false })
});
```

------
#### [ Python ]

```
is_duplicate = context.step(
    lambda _: check_if_order_exists(event['orderId']),
    name='check-duplicate',
    config=StepConfig(
        retry_strategy=lambda error, attempt: RetryDecision(should_retry=False)
    )
)
```

------

When the retry strategy returns `shouldRetry: false`, the step fails immediately without retries. Use this for operations that should not be retried, such as idempotency checks or operations with side effects that cannot be safely repeated.

## Exceptions outside steps
<a name="durable-handler-exceptions"></a>

When an uncaught exception occurs in your handler code but outside any step, the SDK marks the execution as failed. This ensures errors in your application logic are properly captured and reported.


| Scenario | What happens | Metering impact | 
| --- | --- | --- | 
| Exception in handler code outside any step | The SDK marks the execution as FAILED and returns the error. The exception is not automatically retried. | Error payload size | 

To enable automatic retry for error-prone code, wrap it in a step with a retry strategy. Steps provide automatic retry with configurable backoff, while code outside steps fails immediately.

## Invocation retries
<a name="durable-invocation-retries"></a>

Invocation-level retries depend on how the durable function is invoked. The following table describes how each invocation type affects invocation-level retries.


| Invocation type | What happens | 
| --- | --- | 
| Synchronous invocation |  Lambda does not automatically retry the invocation on an error during durable function execution. Retries on invocation failures will depend on the source of the synchronous invocation. For example, using the AWS SDK, InternalFailure and ThrottlingException are by default retried automatically.  | 
| Asynchronous invocation |  If a durable function execution fails (for example, it enters a FAILED, STOPPED, or TIMED_OUT status), Lambda does not retry the execution. This is different from standard Lambda functions, where Lambda retries the function on asynchronous invocation failures. The MaximumRetryAttempts setting for asynchronous invocations does not apply to durable executions. If you configure a dead-letter queue (DLQ) for the function, Lambda sends the triggering event to the DLQ.  | 
| ESM (Event Source Mapping) |  Lambda by default retries the entire batch until it succeeds. For stream sources (DynamoDB and Kinesis), you can configure the maximum number of times that Lambda retries when your function returns an error. See [event source mappings batching](invocation-eventsourcemapping.md#invocation-eventsourcemapping-batching). For Amazon SQS ESM, you may configure max retries via a DLQ on the original Amazon SQS queue. See [configure Amazon SQS ESM](services-sqs-configure.md). Alternatively, you may consider a DLQ at the function level and Lambda will send the failing triggering event to the DLQ. See [function DLQ](invocation-async-retain-records.md#invocation-dlq). If you are interested in receiving a record of events that failed all processing attempts, or events for successful processing attempts, you may configure destinations for ESM. See [invocation async destinations](invocation-async-retain-records.md#invocation-async-destinations).  | 
| Direct trigger |  Retry behavior depends on the triggering service. For example, Lambda processes functions triggered by Amazon S3 event notifications asynchronously. See [Process Amazon S3 event notifications with Lambda](with-s3.md). Lambda also processes functions triggered by Amazon SNS notifications asynchronously. See [Invoking Lambda functions with Amazon SNS notifications](with-sns.md). For both services, the retry behavior described in the "Asynchronous invocation" row applies. If Amazon SNS can't reach Lambda or the message is rejected, Amazon SNS retries at increasing intervals over several hours. For details, see [Reliability](https://aws.amazon.com/sns/faqs/#Reliability) in the Amazon SNS FAQs. API Gateway invokes Lambda synchronously and returns the error response to the requester, so the retry behavior described in the "Synchronous invocation" row applies. See [invocation retries](invocation-retries.md). For more details, see [each direct trigger](invocation-eventsourcemapping.md#eventsourcemapping-trigger-difference).  | 

## Backend retries
<a name="durable-backend-retries"></a>

Backend retries occur when Lambda encounters infrastructure failures or runtime errors, or when the SDK cannot communicate with the durable execution service. Lambda automatically retries these failures so that your durable functions can recover from transient infrastructure issues.

### Backend retry scenarios
<a name="durable-backend-retry-scenarios"></a>

Lambda automatically retries your function when it encounters the following scenarios:
+ **Internal service errors** - When Lambda or the durable execution service returns a 5xx error, indicating a temporary service issue.
+ **Throttling** - When your function is throttled due to concurrency limits or service quotas.
+ **Timeouts** - When the SDK cannot reach the durable execution service within the timeout period.
+ **Sandbox initialization failures** - When Lambda cannot initialize the execution environment.
+ **Runtime errors** - When the Lambda runtime encounters errors outside your function code, such as out-of-memory errors or process crashes.
+ **Invalid checkpoint token errors** - When the checkpoint token is no longer valid, typically due to service-side state changes.

The following table describes how the SDK handles these scenarios:


| Scenario | What happens | Metering impact | 
| --- | --- | --- | 
| Runtime error outside durable handler (OOM, timeout, crash) | Lambda automatically retries the invocation. The SDK replays from the last checkpoint, skipping completed steps. | Error payload size + 1 operation per retry | 
| Service error (5xx) or timeout when calling CheckpointDurableExecution / GetDurableExecutionState APIs | Lambda automatically retries the invocation. The SDK replays from the last checkpoint. | Error payload size + 1 operation per retry | 
| Throttling (429) or invalid checkpoint token when calling CheckpointDurableExecution / GetDurableExecutionState APIs | Lambda automatically retries the invocation with exponential backoff. The SDK replays from the last checkpoint. | Error payload size + 1 operation per retry | 
| Client error (4xx, except 429 and invalid token) when calling CheckpointDurableExecution / GetDurableExecutionState APIs | The SDK marks the execution as FAILED. No automatic retry occurs because the error indicates a permanent issue. | Error payload size | 

Backend retries use exponential backoff and continue until the function succeeds or the execution timeout is reached. During replay, the SDK skips completed checkpoints and continues execution from the last successful operation, ensuring your function doesn't re-execute completed work.

## Retry best practices
<a name="durable-retry-best-practices"></a>

Follow these best practices when configuring retry strategies:
+ **Configure explicit retry strategies** - Don't rely on default retry behavior in production. Configure explicit retry strategies with appropriate max attempts and backoff intervals for your use case.
+ **Use conditional retries** - Implement `shouldRetry` logic to retry only transient errors (rate limits, timeouts) and fail fast on permanent errors (validation failures, not found).
+ **Set appropriate max attempts** - Balance between resilience and execution time. Too many retries can delay failure detection, while too few can cause unnecessary failures.
+ **Use exponential backoff** - Exponential backoff reduces load on downstream services and increases the likelihood of recovery from transient failures.
+ **Wrap error-prone code in steps** - Code outside steps cannot be automatically retried. Wrap external API calls, database queries, and other error-prone operations in steps with retry strategies.
+ **Monitor retry metrics** - Track step retry operations and execution failures in Amazon CloudWatch to identify patterns and optimize retry strategies.

# Idempotency
<a name="durable-execution-idempotency"></a>

Durable functions provide built-in idempotency for execution starts through execution names. When you provide an execution name, Lambda uses it to prevent duplicate executions and enable safe retries of invocation requests. Steps have at-least-once execution semantics by default—during replay, the SDK returns checkpointed results without re-executing completed steps, but your business logic must be idempotent to handle potential retries before completion.

**Note**  
Lambda event source mappings (ESM) don't support idempotency at launch. Therefore, each invocation (including retries) starts a new durable execution. To ensure idempotent execution with event source mappings, either implement idempotency logic in your function code, for example with [Powertools for AWS Lambda](https://docs.aws.amazon.com/powertools/), or use a regular Lambda function as a proxy (dispatcher) that invokes the durable function with an idempotency key (the execution name parameter).

## Execution names
<a name="durable-idempotency-execution-names"></a>

You can provide an execution name when invoking a durable function. The execution name acts as an idempotency key, allowing you to safely retry invocation requests without creating duplicate executions. If you don't provide a name, Lambda generates a unique execution ID automatically.

Execution names must be unique within your account and region. When you invoke a function with an execution name that already exists, Lambda behavior depends on the existing execution's state and whether the payload matches.

## Idempotency behavior
<a name="durable-idempotency-behavior"></a>

The following table describes how Lambda handles invocation requests based on whether you provide an execution name, the existing execution state, and whether the payload matches:


| Scenario | Name provided? | Existing execution status | Payload identical? | Behavior | 
| --- | --- | --- | --- | --- | 
| 1 | No | N/A | N/A | New execution started: Lambda generates a unique execution ID and starts a new execution | 
| 2 | Yes | Never existed or retention expired | N/A | New execution started: Lambda starts a new execution with the provided name | 
| 3 | Yes | Running | Yes | Idempotent start: Lambda returns the existing execution information without starting a duplicate. For synchronous invocations, this acts as a reattach to the running execution | 
| 4 | Yes | Running | No | Error: Lambda returns DurableExecutionAlreadyExists error because an execution with this name is already running with different payload | 
| 5 | Yes | Closed (succeeded, failed, stopped, or timed out) | Yes | Idempotent start: Lambda returns the existing execution information without starting a new execution. The closed execution result is returned | 
| 6 | Yes | Closed (succeeded, failed, stopped, or timed out) | No | Error: Lambda returns DurableExecutionAlreadyExists error because an execution with this name already completed with different payload | 

**Note**  
Scenarios 3 and 5 demonstrate idempotent behavior where Lambda safely handles duplicate invocation requests by returning existing execution information instead of creating duplicates.
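
The six scenarios reduce to two checks: whether a retained execution with the same name exists, and whether the payload matches. The following Python sketch models that decision logic for reasoning and tests. It is not how Lambda implements the behavior, and `Outcome` and `resolve_start` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    NEW_EXECUTION = "new execution started"
    IDEMPOTENT_START = "existing execution returned"
    ALREADY_EXISTS = "DurableExecutionAlreadyExists error"


def resolve_start(
    name: Optional[str],
    existing_status: Optional[str],
    payload_identical: bool,
) -> Outcome:
    """Model the idempotency table.

    existing_status is None when no execution with this name exists
    or its retention period has expired.
    """
    if name is None or existing_status is None:
        return Outcome.NEW_EXECUTION       # scenarios 1 and 2
    if payload_identical:
        return Outcome.IDEMPOTENT_START    # scenarios 3 and 5
    return Outcome.ALREADY_EXISTS          # scenarios 4 and 6
```

In this model, the status of the existing execution (running or closed) doesn't change the outcome. It only determines whether the caller reattaches to a running execution or receives the closed execution's result.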

## Step idempotency
<a name="durable-idempotency-steps"></a>

Steps have at-least-once execution semantics by default. When your function replays after a wait, callback, or failure, the SDK checks each step against the checkpoint log. For steps that already completed, the SDK returns the checkpointed result without re-executing the step logic. However, if a step fails or the function is interrupted before the step completes, the step may execute multiple times.

Your business logic wrapped in steps must be idempotent to handle potential retries. Use idempotency keys to ensure operations like payments or database writes execute only once, even if the step retries.

**Example: Using idempotency keys in steps**

------
#### [ TypeScript ]

```
import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';
import { randomUUID } from 'crypto';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Generate idempotency key once
    const idempotencyKey = await context.step('generate-key', async () => {
      return randomUUID();
    });
    
    // Use idempotency key in payment API to prevent duplicate charges
    const payment = await context.step('process-payment', async () => {
      return paymentAPI.charge({
        amount: event.amount,
        idempotencyKey: idempotencyKey
      });
    });
    
    return { statusCode: 200, payment };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import durable_execution, DurableContext
import uuid

@durable_execution
def handler(event, context: DurableContext):
    # Generate idempotency key once
    idempotency_key = context.step(
        lambda _: str(uuid.uuid4()),
        name='generate-key'
    )
    
    # Use idempotency key in payment API to prevent duplicate charges
    payment = context.step(
        lambda _: payment_api.charge(
            amount=event['amount'],
            idempotency_key=idempotency_key
        ),
        name='process-payment'
    )
    
    return {'statusCode': 200, 'payment': payment}
```

------

You can configure steps to use at-most-once execution semantics by setting the execution mode to `AT_MOST_ONCE_PER_RETRY`. This ensures the step executes at most once per retry attempt, but may not execute at all if the function is interrupted before the step completes.

The SDK enforces deterministic replay by validating that step names and order match the checkpoint log during replay. If your code attempts to execute steps in a different order or with different names, the SDK throws a `NonDeterministicExecutionError`.

**How replay works with completed steps:**

1. First invocation: Function executes step A, creates checkpoint, then waits

1. Second invocation (after wait): Function replays from beginning, step A returns checkpointed result instantly without re-executing, then continues to step B

1. Third invocation (after another wait): Function replays from beginning, steps A and B return checkpointed results instantly, then continues to step C

This replay mechanism ensures that completed steps don't re-execute, but your business logic must still be idempotent to handle retries before completion.
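
The three invocations above can be reproduced with a toy model. The following Python sketch is not the SDK. `ToyContext`, `Suspend`, and `run_to_completion` are hypothetical stand-ins that keep checkpoints in a dictionary and treat every wait as elapsed by the time the next invocation starts.

```python
from typing import Any, Callable, Dict, List


class Suspend(Exception):
    """Raised by the toy context to end an invocation at a wait."""


class ToyContext:
    """Toy model of checkpoint and replay; not the SDK's DurableContext."""

    def __init__(self, checkpoints: Dict[str, Any], log: List[str]):
        self.checkpoints = checkpoints  # survives across invocations
        self.log = log                  # records which step bodies actually ran

    def step(self, func: Callable[[], Any], name: str) -> Any:
        if name in self.checkpoints:
            return self.checkpoints[name]  # replay: return the stored result
        result = func()
        self.log.append(name)
        self.checkpoints[name] = result    # checkpoint after the step completes
        return result

    def wait(self, name: str) -> None:
        if name in self.checkpoints:
            return                         # replay: the wait already elapsed
        self.checkpoints[name] = True      # simplification: mark as elapsed
        raise Suspend(name)


def handler(ctx: ToyContext) -> str:
    a = ctx.step(lambda: "A-done", name="step-a")
    ctx.wait("wait-1")
    b = ctx.step(lambda: "B-done", name="step-b")
    ctx.wait("wait-2")
    c = ctx.step(lambda: "C-done", name="step-c")
    return f"{a},{b},{c}"


def run_to_completion(handler_fn):
    """Invoke the handler from the beginning until it returns."""
    checkpoints: Dict[str, Any] = {}
    log: List[str] = []
    invocations = 0
    while True:
        invocations += 1
        try:
            return handler_fn(ToyContext(checkpoints, log)), invocations, log
        except Suspend:
            continue  # the service would re-invoke after the wait elapses


result, invocations, executed = run_to_completion(handler)
print(result, invocations, executed)
# A-done,B-done,C-done 3 ['step-a', 'step-b', 'step-c']
```

The handler runs from the top three times, but each step body runs only once. If a step body raised an exception before its checkpoint was written, the toy model would run it again on the next invocation, which is why step logic still needs to be idempotent.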

# Testing Lambda durable functions
<a name="durable-testing"></a>

AWS provides dedicated testing SDKs for durable functions that let you run and inspect executions both locally and in the cloud. Install the testing SDK for your language:

------
#### [ TypeScript ]

```
npm install --save-dev @aws/durable-execution-sdk-js-testing
```

For complete documentation and examples, see the [TypeScript testing SDK](https://github.com/aws/aws-durable-execution-sdk-js/tree/development/packages/aws-durable-execution-sdk-js-testing) on GitHub.

------
#### [ Python ]

```
pip install aws-durable-execution-sdk-python-testing
```

For complete documentation and examples, see the [Python testing SDK](https://github.com/aws/aws-durable-execution-sdk-python-testing) on GitHub.

------

The testing SDK provides two testing modes: local testing for fast unit tests, and cloud testing for integration tests against deployed functions.

## Local testing
<a name="durable-local-testing"></a>

Local testing runs your durable functions in your development environment without requiring deployed resources. The test runner runs your function code directly and captures all operations for inspection.

Use local testing for unit tests, test-driven development, and CI/CD pipelines. Tests run locally without network latency or additional costs.

**Example test:**

------
#### [ TypeScript ]

```
import { withDurableExecution } from '@aws/durable-execution-sdk-js';
import { DurableFunctionTestRunner } from '@aws/durable-execution-sdk-js-testing';

const handler = withDurableExecution(async (event, context) => {
  const result = await context.step('calculate', async () => {
    return event.a + event.b;
  });
  return result;
});

test('addition works correctly', async () => {
  const runner = new DurableFunctionTestRunner({ handler });
  const result = await runner.run({ a: 5, b: 3 });
  
  expect(result.status).toBe('SUCCEEDED');
  expect(result.result).toBe(8);
  
  const step = result.getStep('calculate');
  expect(step.result).toBe(8);
});
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import durable_execution, DurableContext
from aws_durable_execution_sdk_python_testing import DurableFunctionTestRunner
from aws_durable_execution_sdk_python.execution import InvocationStatus

@durable_execution
def handler(event: dict, context: DurableContext) -> int:
    result = context.step(lambda _: event["a"] + event["b"], name="calculate")
    return result

def test_addition():
    runner = DurableFunctionTestRunner(handler=handler)
    with runner:
        result = runner.run(input={"a": 5, "b": 3}, timeout=10)
    
    assert result.status is InvocationStatus.SUCCEEDED
    assert result.result == 8
    
    step = result.get_step("calculate")
    assert step.result == 8
```

------

The test runner captures execution state including the final result, individual step results, wait operations, callbacks, and any errors. You can inspect operations by name or iterate through all operations to verify execution behavior.

### Execution stores
<a name="durable-execution-stores"></a>

The testing SDK uses execution stores to persist test execution data. By default, tests use an in-memory store that's fast and requires no cleanup. For debugging or analyzing execution history, you can use a filesystem store that saves executions as JSON files.

**In-memory store (default):**

The in-memory store keeps execution data in memory during test runs. Data is lost when tests complete, making it ideal for standard unit tests and CI/CD pipelines where you don't need to inspect executions after tests finish.

**Filesystem store:**

The filesystem store persists execution data to disk as JSON files. Each execution is saved in a separate file, making it easy to inspect execution history after tests complete. Use the filesystem store when debugging complex test failures or analyzing execution patterns over time.

Configure the store using environment variables:

```
# Use filesystem store
export AWS_DEX_STORE_TYPE=filesystem
export AWS_DEX_STORE_PATH=./test-executions

# Run tests
pytest tests/
```

Execution files are stored with sanitized names and contain the complete execution state including operations, checkpoints, and results. The filesystem store automatically creates the storage directory if it doesn't exist.
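
To inspect saved executions after a test run, you can load the files with standard tooling. The following Python sketch assumes only that the store directory contains one `.json` file per execution directly under the configured path. It makes no assumption about the schema inside each file, and `load_executions` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_executions(store_path: str) -> Dict[str, Any]:
    """Load every JSON file in the store directory, keyed by file name."""
    root = Path(store_path)
    if not root.is_dir():
        return {}  # no executions have been saved yet
    executions = {}
    for path in sorted(root.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            executions[path.name] = json.load(handle)
    return executions


if __name__ == "__main__":
    # Matches the AWS_DEX_STORE_PATH value used in the example above.
    for file_name in load_executions("./test-executions"):
        print(file_name)
```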

## Cloud testing
<a name="durable-cloud-testing"></a>

Cloud testing invokes deployed durable functions in AWS and retrieves their execution history using the Lambda API. Use cloud testing to verify behavior in production-like environments with real AWS services and configurations.

Cloud testing requires a deployed function and AWS credentials with permissions to invoke functions and read execution history:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction",
                "lambda:GetDurableExecution",
                "lambda:GetDurableExecutionHistory"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
```

**Example cloud test:**

------
#### [ TypeScript ]

```
import { DurableFunctionCloudTestRunner } from '@aws/durable-execution-sdk-js-testing';

test('deployed function processes orders', async () => {
  const runner = new DurableFunctionCloudTestRunner({
    functionName: 'order-processor',
    region: 'us-east-1'
  });
  
  const result = await runner.run({ orderId: 'order-123' });
  
  expect(result.status).toBe('SUCCEEDED');
  expect(result.result.status).toBe('completed');
});
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python.execution import InvocationStatus
from aws_durable_execution_sdk_python_testing import (
    DurableFunctionCloudTestRunner,
    DurableFunctionCloudTestRunnerConfig
)

def test_deployed_function():
    config = DurableFunctionCloudTestRunnerConfig(
        function_name="order-processor",
        region="us-east-1"
    )
    runner = DurableFunctionCloudTestRunner(config=config)
    
    result = runner.run(input={"orderId": "order-123"})
    
    assert result.status is InvocationStatus.SUCCEEDED
    assert result.result["status"] == "completed"
```

------

Cloud tests invoke the actual deployed function and retrieve execution history from AWS. This lets you verify integration with other AWS services, validate performance characteristics, and test with production-like data and configurations.

## What to test
<a name="durable-testing-patterns"></a>

Test durable functions by verifying execution outcomes, operation behavior, and error handling. Focus on business logic correctness rather than implementation details.

**Verify execution results:** Check that functions return the expected values for given inputs. Test both successful executions and error cases to ensure functions handle invalid input appropriately.

**Inspect operation execution:** Verify that steps, waits, and callbacks execute as expected. Check step results to ensure intermediate operations produce correct values. Validate that wait operations are configured with appropriate timeouts and that callbacks are created with correct settings.

**Test error handling:** Verify functions fail correctly with descriptive error messages when given invalid input. Test retry behavior by simulating transient failures and confirming operations retry appropriately. Check that permanent failures don't trigger unnecessary retries.

**Validate workflows:** For multi-step workflows, verify operations execute in the correct order. Test conditional branching to ensure different execution paths work correctly. Validate parallel operations execute concurrently and produce expected results.

The SDK documentation repositories contain extensive examples of testing patterns including multi-step workflows, error scenarios, timeout handling, and polling patterns.
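
To simulate transient failures, one option is a test double that fails a fixed number of times before it succeeds. The following Python sketch is independent of the testing SDK. `FlakyCall` is a hypothetical helper, and `call_with_retries` is a minimal local stand-in for whatever drives the retries in your test, such as a step with a retry strategy.

```python
from typing import Any, Callable, Optional


class FlakyCall:
    """Test double that fails a fixed number of times, then returns a result."""

    def __init__(self, failures: int, result: Any, error: Optional[Exception] = None):
        self.remaining = failures
        self.result = result
        self.error = error or ConnectionError("simulated transient failure")
        self.calls = 0  # lets a test assert how many attempts were made

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise self.error
        return self.result


def call_with_retries(func: Callable[[], Any], max_attempts: int) -> Any:
    """Stand-in retry driver: call func until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise


flaky = FlakyCall(failures=2, result="ok")
outcome = call_with_retries(flaky, max_attempts=3)
print(outcome, flaky.calls)
# ok 3
```

In a durable function test, you could call a `FlakyCall` instance from the step body and then assert on `calls` to confirm that transient errors were retried and permanent errors were not.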

## Testing strategy
<a name="durable-testing-strategy"></a>

Use local testing for unit tests during development and in CI/CD pipelines. Local tests run fast, don't require AWS credentials, and provide immediate feedback on code changes. Write local tests to verify business logic, error handling, and operation behavior.

Use cloud testing for integration tests before deploying to production. Cloud tests verify behavior with real AWS services and configurations, validate performance characteristics, and test end-to-end workflows. Run cloud tests in staging environments to catch integration issues before they reach production.

Mock external dependencies in local tests to isolate function logic and keep tests fast. Use cloud tests to verify actual integration with external services like databases, APIs, and other AWS services.

Write focused tests that verify one specific behavior. Use descriptive test names that explain what's being tested. Group related tests together and use test fixtures for common setup code. Keep tests simple and avoid complex test logic that's hard to understand.

## Debugging failures
<a name="durable-testing-debugging"></a>

When tests fail, inspect the execution result to understand what went wrong. Check the execution status to see if the function succeeded, failed, or timed out. Read error messages to understand the failure cause.

Inspect individual operation results to find where behavior diverged from expectations. Check step results to see what values were produced. Verify operation ordering to confirm operations executed in the expected sequence. Count operations to ensure the right number of steps, waits, and callbacks were created.

Common issues include non-deterministic code that produces different results on replay, shared state through global variables that breaks during replay, and missing operations due to conditional logic errors. Use standard debuggers and logging to step through function code and track execution flow.

For cloud tests, inspect execution history in CloudWatch Logs to see detailed operation logs. Use tracing to track execution flow across services and identify bottlenecks.

# Monitoring durable functions
<a name="durable-monitoring"></a>

You can monitor your durable functions using CloudWatch metrics, CloudWatch Logs, and tracing. Because durable functions can run for extended periods and span multiple function invocations, monitoring them requires understanding their unique execution patterns, including checkpoints, state transitions, and replay behavior.

## CloudWatch metrics
<a name="durable-monitoring-metrics"></a>

Lambda automatically publishes metrics to CloudWatch at no additional charge. Durable functions provide additional metrics beyond standard Lambda metrics to help you monitor long-running workflows, state management, and resource utilization.

### Durable execution metrics
<a name="durable-monitoring-execution-metrics"></a>

Lambda emits the following metrics for durable executions:


| Metric | Description | 
| --- | --- | 
| ApproximateRunningDurableExecutions | Number of durable executions in the RUNNING state | 
| ApproximateRunningDurableExecutionsUtilization | Percentage of your account's maximum running durable executions quota currently in use | 
| DurableExecutionDuration | Elapsed wall-clock time in milliseconds that a durable execution remained in the RUNNING state | 
| DurableExecutionStarted | Number of durable executions that started | 
| DurableExecutionStopped | Number of durable executions stopped using the StopDurableExecution API | 
| DurableExecutionSucceeded | Number of durable executions that completed successfully | 
| DurableExecutionFailed | Number of durable executions that completed with a failure | 
| DurableExecutionTimedOut | Number of durable executions that exceeded their configured execution timeout | 
| DurableExecutionOperations | Cumulative number of operations performed within a durable execution (max: 3,000) | 
| DurableExecutionStorageWrittenBytes | Cumulative amount of data in bytes persisted by a durable execution (max: 100 MB) | 

### Standard Lambda metrics
<a name="durable-monitoring-standard-metrics"></a>

Lambda emits standard invocation, performance, and concurrency metrics for durable functions. Because a durable execution can span multiple function invocations as it progresses through checkpoints and replays, these metrics behave differently than for standard functions:
+ **Invocations:** Counts each function invocation, including replays. A single durable execution can generate multiple invocation data points.
+ **Duration:** Measures each function invocation separately. Use `DurableExecutionDuration` for total time taken by a single durable execution.
+ **Errors:** Tracks function invocation failures. Use `DurableExecutionFailed` for execution-level failures.

For a complete list of standard Lambda metrics, see [Types of metrics for Lambda functions](https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics-types.html).

### Creating CloudWatch alarms
<a name="durable-monitoring-alarms"></a>

Create CloudWatch alarms to notify you when metrics exceed thresholds. Common alarms include:
+ `ApproximateRunningDurableExecutionsUtilization` exceeds 80% of your quota
+ `DurableExecutionFailed` increases above a threshold
+ `DurableExecutionTimedOut` indicates executions are timing out
+ `DurableExecutionStorageWrittenBytes` approaches storage limits

For more information, see [Using CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html).
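
As a rough sketch, the first alarm in the list could be defined through the CloudWatch `PutMetricAlarm` API with boto3. The `AWS/Lambda` namespace, the lack of dimensions, the 0-100 scale of the metric, the alarm name, and the SNS topic ARN are all assumptions for illustration; confirm the metric's namespace and dimensions in the CloudWatch console before relying on this.

```python
def build_utilization_alarm(topic_arn: str, threshold_percent: float = 80.0) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Assumptions (not confirmed by this guide): the metric is published in
    the AWS/Lambda namespace, needs no dimensions, and uses a 0-100 scale.
    """
    return {
        "AlarmName": "durable-executions-utilization-high",  # hypothetical name
        "Namespace": "AWS/Lambda",  # assumed namespace
        "MetricName": "ApproximateRunningDurableExecutionsUtilization",
        "Statistic": "Maximum",
        "Period": 300,  # evaluate 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def create_alarm(topic_arn: str) -> None:
    """Create the alarm; requires boto3 and configured AWS credentials."""
    # Imported lazily so the parameter builder can be used without boto3.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_utilization_alarm(topic_arn))
```

Keeping the parameters in a separate function makes the thresholds easy to review and unit test without calling AWS.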

## EventBridge events
<a name="durable-monitoring-eventbridge"></a>

Lambda publishes durable execution status change events to EventBridge. You can use these events to trigger workflows, send notifications, or track execution lifecycle changes across your durable functions.

### Durable execution status change events
<a name="durable-eventbridge-status-changes"></a>

Lambda emits an event to EventBridge whenever a durable execution changes status. These events have the following characteristics:
+ **Source:** `aws.lambda`
+ **Detail type:** `Durable Execution Status Change`

Status change events are published for the following execution states:
+ `RUNNING` - Execution started
+ `SUCCEEDED` - Execution completed successfully
+ `STOPPED` - Execution stopped using the StopDurableExecution API
+ `FAILED` - Execution failed with an error
+ `TIMED_OUT` - Execution exceeded the configured timeout

The following example shows a durable execution status change event:

```
{
  "version": "0",
  "id": "d019b03c-a8a3-9d58-85de-241e96206538",
  "detail-type": "Durable Execution Status Change",
  "source": "aws.lambda",
  "account": "123456789012",
  "time": "2025-11-20T13:08:22Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "durableExecutionArn": "arn:aws:lambda:us-east-1:123456789012:function:my-function:$LATEST/durable-execution/090c4189-b18b-4296-9d0c-cfd01dc3a122/9f7d84c9-ea3d-3ffc-b3e5-5ec51c34ffc9",
    "durableExecutionName": "order-123",
    "functionArn": "arn:aws:lambda:us-east-1:123456789012:function:my-function:2",
    "status": "RUNNING",
    "startTimestamp": "2025-11-20T13:08:22.345Z"
  }
}
```

For terminal states (`SUCCEEDED`, `STOPPED`, `FAILED`, `TIMED_OUT`), the event includes an `endTimestamp` field indicating when the execution completed.
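
A target that receives these events might reduce them to a few fields. The following is a minimal Python sketch, not an official handler: the function name `summarize_status_change` is hypothetical, and it assumes terminal events carry `startTimestamp` alongside `endTimestamp` in the same format as the example above.

```python
from datetime import datetime
from typing import Optional

TERMINAL_STATES = {"SUCCEEDED", "STOPPED", "FAILED", "TIMED_OUT"}


def _parse_timestamp(value: str) -> datetime:
    # Normalize the trailing "Z" so older Python versions can parse it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_status_change(event: dict) -> dict:
    """Extract the fields of interest from a status change event."""
    detail = event["detail"]
    status = detail["status"]
    duration: Optional[float] = None
    # endTimestamp is only present for terminal states.
    if status in TERMINAL_STATES and "endTimestamp" in detail:
        started = _parse_timestamp(detail["startTimestamp"])
        ended = _parse_timestamp(detail["endTimestamp"])
        duration = (ended - started).total_seconds()
    return {
        "name": detail.get("durableExecutionName"),
        "status": status,
        "terminal": status in TERMINAL_STATES,
        "duration_seconds": duration,
    }
```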

### Creating EventBridge rules
<a name="durable-eventbridge-rules"></a>

Create rules to route durable execution status change events to targets like Amazon Simple Notification Service, Amazon Simple Queue Service, or other Lambda functions.

The following example creates a rule that matches all durable execution status changes:

```
{
  "source": ["aws.lambda"],
  "detail-type": ["Durable Execution Status Change"]
}
```

The following example creates a rule that matches only failed executions:

```
{
  "source": ["aws.lambda"],
  "detail-type": ["Durable Execution Status Change"],
  "detail": {
    "status": ["FAILED"]
  }
}
```

The following example creates a rule that matches status changes for a specific function:

```
{
  "source": ["aws.lambda"],
  "detail-type": ["Durable Execution Status Change"],
  "detail": {
    "functionArn": [{
      "prefix": "arn:aws:lambda:us-east-1:123456789012:function:my-function"
    }]
  }
}
```

For more information about creating rules, see [Amazon EventBridge tutorials](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-tutorial.html) in the EventBridge User Guide.
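
Event patterns like the ones above can also be generated in code. The sketch below builds the same JSON patterns and, assuming boto3 is installed and credentials are configured, registers a rule with the EventBridge `PutRule` API. The helper names `build_status_change_pattern` and `create_rule` are hypothetical.

```python
import json
from typing import Optional, Sequence


def build_status_change_pattern(
    statuses: Optional[Sequence[str]] = None,
    function_arn_prefix: Optional[str] = None,
) -> str:
    """Return an EventBridge event pattern as a JSON string."""
    pattern: dict = {
        "source": ["aws.lambda"],
        "detail-type": ["Durable Execution Status Change"],
    }
    detail: dict = {}
    if statuses:
        detail["status"] = list(statuses)
    if function_arn_prefix:
        detail["functionArn"] = [{"prefix": function_arn_prefix}]
    if detail:
        # Only add "detail" when at least one filter was requested.
        pattern["detail"] = detail
    return json.dumps(pattern)


def create_rule(rule_name: str, event_pattern: str) -> str:
    """Create or update a rule and return its ARN; requires boto3."""
    import boto3  # imported lazily so the pattern builder works without it

    events = boto3.client("events")
    response = events.put_rule(
        Name=rule_name, EventPattern=event_pattern, State="ENABLED"
    )
    return response["RuleArn"]
```

For example, `build_status_change_pattern(statuses=["FAILED"])` produces the failed-executions pattern shown earlier. A rule still needs at least one target before it delivers events anywhere.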

## AWS X-Ray tracing
<a name="durable-monitoring-xray"></a>

You can enable X-Ray tracing on your durable functions. Lambda passes the X-Ray trace header to the durable execution, allowing you to trace requests across your workflow.

To enable X-Ray tracing using the Lambda console, choose your function, then choose Configuration, Monitoring and operations tools, and turn on Active tracing under X-Ray.

To enable X-Ray tracing using the AWS CLI:

```
aws lambda update-function-configuration \
    --function-name my-durable-function \
    --tracing-config Mode=Active
```

To enable AWS X-Ray tracing using AWS SAM:

```
Resources:
  MyDurableFunction:
    Type: AWS::Serverless::Function
    Properties:
      Tracing: Active
      DurableConfig:
        ExecutionTimeout: 3600
```

For more information about X-Ray, see the [AWS X-Ray Developer Guide](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html).

# Best practices for Lambda durable functions
<a name="durable-best-practices"></a>

Durable functions use a replay-based execution model that requires different patterns than traditional Lambda functions. Follow these best practices to build reliable, cost-effective workflows.

## Write deterministic code
<a name="durable-determinism"></a>

During replay, your function runs from the beginning and must follow the same execution path as the original run. Code outside durable operations must be deterministic, producing the same results given the same inputs.

**Wrap non-deterministic operations in steps:**
+ Random number generation and UUIDs
+ Current time or timestamps
+ External API calls and database queries
+ File system operations

------
#### [ TypeScript ]

```
import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';
import { randomUUID } from 'crypto';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Generate transaction ID inside a step
    const transactionId = await context.step('generate-transaction-id', async () => {
      return randomUUID();
    });
    
    // Use the same ID throughout execution, even during replay
    const payment = await context.step('process-payment', async () => {
      return processPayment(event.amount, transactionId);
    });
    
    return { statusCode: 200, transactionId, payment };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import durable_execution, DurableContext
import uuid

@durable_execution
def handler(event, context: DurableContext):
    # Generate transaction ID inside a step
    transaction_id = context.step(
        lambda _: str(uuid.uuid4()),
        name='generate-transaction-id'
    )
    
    # Use the same ID throughout execution, even during replay
    payment = context.step(
        lambda _: process_payment(event['amount'], transaction_id),
        name='process-payment'
    )
    
    return {'statusCode': 200, 'transactionId': transaction_id, 'payment': payment}
```

------
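
To see why the step matters, it can help to model replay with a toy checkpoint store. The following sketch is not the SDK's implementation: `ToyContext` is a hypothetical stand-in that caches step results by name, which is enough to show that a value generated inside a step survives replay while a value generated outside changes on every run.

```python
import uuid
from typing import Any, Callable, Dict


class ToyContext:
    """A toy stand-in for a durable context; not the SDK's implementation."""

    def __init__(self, checkpoints: Dict[str, Any]):
        self.checkpoints = checkpoints  # survives across "invocations"

    def step(self, func: Callable[[Any], Any], name: str) -> Any:
        if name in self.checkpoints:
            return self.checkpoints[name]  # replay: skip completed work
        result = func(None)
        self.checkpoints[name] = result  # first run: record the result
        return result


def handler(context: ToyContext) -> dict:
    outside = str(uuid.uuid4())  # non-deterministic: differs on every run
    inside = context.step(lambda _: str(uuid.uuid4()), name="generate-id")
    return {"outside": outside, "inside": inside}


checkpoints: Dict[str, Any] = {}
first_run = handler(ToyContext(checkpoints))
replay = handler(ToyContext(checkpoints))  # same checkpoints, fresh invocation

print(first_run["inside"] == replay["inside"])    # True
print(first_run["outside"] == replay["outside"])  # False
```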

**Important**  
Don't use global variables or closures to share state between steps. Pass data through return values. Global state breaks during replay because steps return cached results but global variables reset.

**Avoid closure mutations:** Variables captured in closures can lose mutations during replay. Steps return cached results, but variable updates outside the step aren't replayed.

------
#### [ TypeScript ]

```
// ❌ WRONG: Mutations lost on replay
export const handler = withDurableExecution(async (event, context) => {
  let total = 0;
  
  for (const item of event.items) {
    await context.step(async () => {
      total += item.price; // ⚠️ Mutation lost on replay!
      return saveItem(item);
    });
  }
  
  return { total }; // Inconsistent value!
});

// ✅ CORRECT: Accumulate with return values
export const handler = withDurableExecution(async (event, context) => {
  let total = 0;
  
  for (const item of event.items) {
    total = await context.step(async () => {
      const newTotal = total + item.price;
      await saveItem(item);
      return newTotal; // Return updated value
    });
  }
  
  return { total }; // Consistent!
});

// ✅ EVEN BETTER: Use map for parallel processing
export const handler = withDurableExecution(async (event, context) => {
  const results = await context.map(
    event.items,
    async (ctx, item) => {
      await ctx.step(async () => saveItem(item));
      return item.price;
    }
  );
  
  const total = results.getResults().reduce((sum, price) => sum + price, 0);
  return { total };
});
```

------
#### [ Python ]

```
# ❌ WRONG: Mutations lost on replay
@durable_execution
def handler(event, context: DurableContext):
    total = 0
    
    for item in event['items']:
        context.step(
            lambda _: save_item_and_mutate(item, total),  # ⚠️ Mutation lost on replay!
            name=f'save-item-{item["id"]}'
        )
    
    return {'total': total}  # Inconsistent value!

# ✅ CORRECT: Accumulate with return values
@durable_execution
def handler(event, context: DurableContext):
    total = 0
    
    for item in event['items']:
        total = context.step(
            lambda _: save_item_and_return_total(item, total),
            name=f'save-item-{item["id"]}'
        )
    
    return {'total': total}  # Consistent!

# ✅ EVEN BETTER: Use map for parallel processing
@durable_execution
def handler(event, context: DurableContext):
    def process_item(ctx, item):
        ctx.step(lambda _: save_item(item))
        return item['price']
    
    results = context.map(event['items'], process_item)
    total = sum(results.get_results())
    
    return {'total': total}
```

------
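
The lost mutation can be reproduced without the SDK. The sketch below is a simplified model under stated assumptions: `make_step` is a hypothetical helper that caches results by step name, standing in for checkpointing. Running each handler twice against the same store imitates an initial run followed by a replay.

```python
from typing import Any, Callable, Dict, List

Items = List[Dict[str, Any]]


def make_step(checkpoints: Dict[str, Any]) -> Callable[..., Any]:
    """Return a toy step function backed by a checkpoint store."""
    def step(func: Callable[[], Any], name: str) -> Any:
        if name not in checkpoints:  # first run: execute and record
            checkpoints[name] = func()
        return checkpoints[name]  # replay: cached result, func skipped
    return step


def mutating_handler(items: Items, step: Callable[..., Any]) -> int:
    total = 0

    def add(price: int) -> None:
        nonlocal total
        total += price  # side effect on a captured variable

    for item in items:
        step(lambda: add(item["price"]), name=f"save-item-{item['id']}")
    return total


def returning_handler(items: Items, step: Callable[..., Any]) -> int:
    total = 0
    for item in items:
        # The running total travels through the step's return value.
        total = step(lambda: total + item["price"], name=f"save-item-{item['id']}")
    return total


items = [{"id": "a", "price": 5}, {"id": "b", "price": 7}]

store: Dict[str, Any] = {}
mutating_first = mutating_handler(items, make_step(store))   # 12
mutating_replay = mutating_handler(items, make_step(store))  # 0: add() never ran

store = {}
returning_first = returning_handler(items, make_step(store))   # 12
returning_replay = returning_handler(items, make_step(store))  # 12
```

On replay the mutating handler skips every step body, so `total` stays at its initial value, while the returning handler rebuilds the total from the cached results.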

## Design for idempotency
<a name="durable-idempotency"></a>

Operations may execute multiple times due to retries or replay. Non-idempotent operations cause duplicate side effects like charging customers twice or sending multiple emails.

**Use idempotency tokens:** Generate tokens inside steps and include them with external API calls to prevent duplicate operations.

------
#### [ TypeScript ]

```
import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';
import { randomUUID } from 'crypto';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Generate idempotency token once
    const idempotencyToken = await context.step('generate-idempotency-token', async () => {
      return randomUUID();
    });
    
    // Use token to prevent duplicate charges
    const charge = await context.step('charge-payment', async () => {
      return paymentService.charge({
        amount: event.amount,
        cardToken: event.cardToken,
        idempotencyKey: idempotencyToken
      });
    });
    
    return { statusCode: 200, charge };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import durable_execution, DurableContext
import uuid

@durable_execution
def handler(event, context: DurableContext):
    # Generate idempotency token once
    idempotency_token = context.step(
        lambda _: str(uuid.uuid4()),
        name='generate-idempotency-token'
    )
    
    # Use token to prevent duplicate charges
    def charge_payment(_):
        return payment_service.charge(
            amount=event['amount'],
            card_token=event['cardToken'],
            idempotency_key=idempotency_token
        )
    
    charge = context.step(charge_payment, name='charge-payment')
    
    return {'statusCode': 200, 'charge': charge}
```

------

**Use at-most-once semantics:** For critical operations that must never duplicate (financial transactions, inventory deductions), configure at-most-once execution mode.

------
#### [ TypeScript ]

```
// Critical operation that must not duplicate
await context.step('deduct-inventory', async () => {
  return inventoryService.deduct(event.productId, event.quantity);
}, {
  executionMode: 'AT_MOST_ONCE_PER_RETRY'
});
```

------
#### [ Python ]

```
# Critical operation that must not duplicate
context.step(
    lambda _: inventory_service.deduct(event['productId'], event['quantity']),
    name='deduct-inventory',
    config=StepConfig(execution_mode='AT_MOST_ONCE_PER_RETRY')
)
```

------

**Database idempotency:** Use check-before-write patterns, conditional updates, or upsert operations to prevent duplicate records.
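
As one illustration of the conditional-write pattern, the sketch below uses SQLite's `INSERT OR IGNORE` with the idempotency key as the primary key, so a retried step cannot create a second row. The table layout and the `record_charge` helper are hypothetical; other databases offer equivalents such as conditional puts or `ON CONFLICT` clauses.

```python
import sqlite3


def record_charge(conn: sqlite3.Connection, idempotency_key: str, amount_cents: int) -> bool:
    """Insert a charge once; return True only if a new row was written."""
    with conn:  # commit on success, roll back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO charges (idempotency_key, amount_cents) VALUES (?, ?)",
            (idempotency_key, amount_cents),
        )
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges (idempotency_key TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
)

first = record_charge(conn, "order-123", 4999)   # writes the row
second = record_charge(conn, "order-123", 4999)  # duplicate attempt is ignored
row_count = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```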

## Manage state efficiently
<a name="durable-state-management"></a>

Every checkpoint saves state to persistent storage. Large state objects increase costs, slow checkpointing, and impact performance. Store only essential workflow coordination data.

**Keep state minimal:**
+ Store IDs and references, not full objects
+ Fetch detailed data within steps as needed
+ Use Amazon S3 or DynamoDB for large data and pass references in state
+ Avoid passing large payloads between steps

------
#### [ TypeScript ]

```
import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Store only the order ID, not the full order object
    const orderId = event.orderId;
    
    // Fetch data within each step as needed
    await context.step('validate-order', async () => {
      const order = await orderService.getOrder(orderId);
      return validateOrder(order);
    });
    
    await context.step('process-payment', async () => {
      const order = await orderService.getOrder(orderId);
      return processPayment(order);
    });
    
    return { statusCode: 200, orderId };
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import durable_execution, DurableContext

@durable_execution
def handler(event, context: DurableContext):
    # Store only the order ID, not the full order object
    order_id = event['orderId']
    
    # Fetch data within each step as needed
    context.step(
        lambda _: validate_order(order_service.get_order(order_id)),
        name='validate-order'
    )
    
    context.step(
        lambda _: process_payment(order_service.get_order(order_id)),
        name='process-payment'
    )
    
    return {'statusCode': 200, 'orderId': order_id}
```

------

## Design effective steps
<a name="durable-step-design"></a>

Steps are the fundamental unit of work in durable functions. Well-designed steps make workflows easier to understand, debug, and maintain.

**Step design principles:**
+ **Use descriptive names** - Names like `validate-order` instead of `step1` make logs and errors easier to understand
+ **Keep names static** - Don't use dynamic names with timestamps or random values. Step names must be deterministic for replay
+ **Balance granularity** - Break complex operations into focused steps, but avoid excessive tiny steps that increase checkpoint overhead
+ **Group related operations** - Operations that should succeed or fail together belong in the same step

## Use wait operations efficiently
<a name="durable-wait-operations"></a>

Wait operations suspend execution without incurring compute charges. Use them instead of keeping a function invocation running while it idles.

**Time-based waits:** Use `context.wait()` for delays instead of `setTimeout` or `sleep`.

**External callbacks:** Use `context.waitForCallback()` when waiting for external systems. Always set timeouts to prevent indefinite waits.

**Polling:** Use `context.waitForCondition()` with exponential backoff to poll external services without overwhelming them.

------
#### [ TypeScript ]

```
// Wait 24 hours without cost
await context.wait({ seconds: 86400 });

// Wait for external callback with timeout
const result = await context.waitForCallback(
  'external-job',
  async (callbackId) => {
    await externalService.submitJob({
      data: event.data,
      webhookUrl: `https://api.example.com/callbacks/${callbackId}`
    });
  },
  { timeout: { seconds: 3600 } }
);
```

------
#### [ Python ]

```
# Wait 24 hours without cost
context.wait(86400)

# Wait for external callback with timeout
result = context.wait_for_callback(
    lambda callback_id: external_service.submit_job(
        data=event['data'],
        webhook_url=f'https://api.example.com/callbacks/{callback_id}'
    ),
    name='external-job',
    config=WaitForCallbackConfig(timeout_seconds=3600)
)
```

------
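
The polling guidance above mentions exponential backoff. The sketch below shows only the delay schedule such a strategy might produce; it does not use the SDK's `waitForCondition` API, and the function name and default values are assumptions chosen for illustration.

```python
from typing import List


def backoff_delays(
    attempts: int,
    initial_seconds: float = 2.0,
    backoff_rate: float = 2.0,
    max_seconds: float = 30.0,
) -> List[float]:
    """Return the delay before each polling attempt, capped at max_seconds."""
    return [
        min(initial_seconds * backoff_rate ** attempt, max_seconds)
        for attempt in range(attempts)
    ]


delays = backoff_delays(6)
print(delays)  # [2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Capping the delay keeps the workflow responsive once the external service finishes, while the early short delays catch fast completions.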

## Additional considerations
<a name="durable-additional-considerations"></a>

**Error handling:** Retry transient failures like network timeouts and rate limits. Don't retry permanent failures like invalid input or authentication errors. Configure retry strategies with appropriate max attempts and backoff rates. For detailed examples, see [Error handling and retries](durable-execution-sdk-retries.md).
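
One way to express the transient-versus-permanent distinction is a small predicate that a retry strategy consults. This is a generic sketch, not the SDK's retry configuration: the exception classes and `should_retry` are hypothetical names.

```python
class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def should_retry(error: Exception, attempt: int, max_attempts: int = 3) -> bool:
    """Decide whether a failed attempt (1-based) should be retried."""
    if attempt >= max_attempts:
        return False  # retry budget exhausted
    if isinstance(error, PermanentError):
        return False  # invalid input, authentication errors, and similar
    # Built-in timeout and connection errors are treated as transient here.
    return isinstance(error, (TransientError, TimeoutError, ConnectionError))
```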

**Performance:** Minimize checkpoint size by storing references instead of full payloads. Use `context.parallel()` and `context.map()` to execute independent operations concurrently. Batch related operations to reduce checkpoint overhead.

**Versioning:** Invoke functions with version numbers or aliases to pin executions to specific code versions. Ensure new code versions can handle state from older versions. Don't rename steps or change their behavior in ways that break replay.

**Serialization:** Use JSON-compatible types for operation inputs and results. Convert dates to ISO strings and custom objects to plain objects before passing them to durable operations.
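
A minimal sketch of that conversion in Python, assuming a hypothetical `Shipment` dataclass as the custom object: the datetime becomes an ISO 8601 string and the object becomes a plain dict before it is returned from a step.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class Shipment:  # hypothetical custom object
    order_id: str
    shipped_at: datetime


def to_step_result(shipment: Shipment) -> dict:
    """Convert a custom object into a JSON-compatible plain dict."""
    plain = asdict(shipment)
    plain["shipped_at"] = shipment.shipped_at.isoformat()  # datetime -> ISO string
    return plain


def from_step_result(data: dict) -> Shipment:
    """Rebuild the custom object from a step result."""
    return Shipment(
        order_id=data["order_id"],
        shipped_at=datetime.fromisoformat(data["shipped_at"]),
    )


shipment = Shipment("order-123", datetime(2025, 11, 20, 13, 8, 22, tzinfo=timezone.utc))
result = to_step_result(shipment)
encoded = json.dumps(result)  # succeeds because every value is JSON-compatible
```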

**Monitoring:** Enable structured logging with execution IDs and step names. Set up CloudWatch alarms for error rates and execution duration. Use tracing to identify bottlenecks. For detailed guidance, see [Monitoring and debugging](durable-monitoring.md).

**Testing:** Test happy path, error handling, and replay behavior. Test timeout scenarios for callbacks and waits. Use local testing to reduce iteration time. For detailed guidance, see [Testing durable functions](durable-testing.md).

**Common mistakes to avoid:** Don't nest `context.step()` calls; use child contexts instead. Wrap non-deterministic operations in steps. Always set timeouts for callbacks. Balance step granularity with checkpoint overhead. Store references instead of large objects in state.

## Additional resources
<a name="durable-additional-resources"></a>
+ [Python SDK documentation](https://github.com/aws/aws-durable-execution-sdk-python/tree/main/docs) - Complete API reference, testing patterns, and advanced examples
+ [TypeScript SDK documentation](https://github.com/aws/aws-durable-execution-sdk-js/tree/main/docs) - Complete API reference, testing patterns, and advanced examples