# Durable execution SDK
<a name="durable-execution-sdk"></a>

The durable execution SDK is the foundation for building durable functions. It provides the primitives you need to checkpoint progress, handle retries, and manage execution flow. The SDK abstracts the complexity of checkpoint management and replay, letting you write sequential code that automatically becomes fault-tolerant.

The SDK is available for JavaScript, TypeScript, Python, and Java. For complete API documentation and examples, see the [JavaScript/TypeScript SDK](https://github.com/aws/aws-durable-execution-sdk-js), [Python SDK](https://github.com/aws/aws-durable-execution-sdk-python), and [Java SDK](https://github.com/aws/aws-durable-execution-sdk-java) on GitHub.

## DurableContext
<a name="durable-sdk-context"></a>

The SDK provides your function with a `DurableContext` object that exposes all durable operations. This context replaces the standard Lambda context and provides methods for creating checkpoints, managing execution flow, and coordinating with external systems.

To use the SDK, wrap your Lambda handler with the durable execution wrapper:

------
#### [ TypeScript ]

```
import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Your function receives DurableContext instead of Lambda context
    // Use context.step(), context.wait(), etc.
    return result;
  }
);
```

------
#### [ Python ]

```
from aws_durable_execution_sdk_python import durable_execution, DurableContext

@durable_execution
def handler(event: dict, context: DurableContext):
    # Your function receives DurableContext
    # Use context.step(), context.wait(), etc.
    return result
```

------
#### [ Java ]

```
import software.amazon.lambda.durable.DurableContext;
import software.amazon.lambda.durable.DurableHandler;

public class Handler extends DurableHandler<Object, String> {
    @Override
    public String handleRequest(Object input, DurableContext context) {
        // Your function receives DurableContext
        // Use context.step(), context.wait(), etc.
        return result;
    }
}
```

------

The wrapper intercepts your function invocation, loads any existing checkpoint log, and provides the `DurableContext` that manages replay and checkpointing.

## What the SDK does
<a name="durable-sdk-what-it-does"></a>

The SDK handles three critical responsibilities that enable durable execution:

**Checkpoint management:** The SDK automatically creates checkpoints as your function executes durable operations. Each checkpoint records the operation type, inputs, and results. When your function completes a step, the SDK persists the checkpoint before continuing. This ensures your function can resume from any completed operation if interrupted.

**Replay coordination:** When your function resumes after a pause or interruption, the SDK performs replay. It runs your code from the beginning but skips completed operations, using stored checkpoint results instead of re-executing them. The SDK ensures replay is deterministic—given the same inputs and checkpoint log, your function produces the same results.

**State isolation:** The SDK maintains execution state separately from your business logic. Each durable execution has its own checkpoint log that other executions cannot access. The SDK encrypts checkpoint data at rest and ensures state remains consistent across replays.

## How checkpointing works
<a name="durable-sdk-how-checkpointing-works"></a>

When you call a durable operation, the SDK follows this sequence:

1. **Check for existing checkpoint:** The SDK checks if this operation already completed in a previous invocation. If a checkpoint exists, the SDK returns the stored result without re-executing the operation.

1. **Execute the operation:** If no checkpoint exists, the SDK executes your operation code. For steps, this means calling your function. For waits, this means scheduling resumption.

1. **Create checkpoint:** After the operation completes, the SDK serializes the result and creates a checkpoint. The checkpoint includes the operation type, name, inputs, result, and timestamp.

1. **Persist checkpoint:** The SDK calls the Lambda checkpoint API to persist the checkpoint. This ensures the checkpoint is durable before continuing execution.

1. **Return result:** The SDK returns the operation result to your code, which continues to the next operation.

This sequence ensures that once an operation completes, its result is safely stored. If your function is interrupted at any point, the SDK can replay up to the last completed checkpoint.
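
The sequence above can be modeled with a small in-memory sketch. This is not the SDK's implementation: `ToyCheckpointLog`, `run_step`, and the `charge` function are hypothetical names used only for illustration, and a plain dictionary stands in for the Lambda checkpoint API.

```python
import json
from typing import Any, Callable


class ToyCheckpointLog:
    """In-memory stand-in for a durable checkpoint store (hypothetical)."""

    def __init__(self) -> None:
        self.entries: dict[str, str] = {}

    def run_step(self, name: str, fn: Callable[[], Any]) -> Any:
        # 1. Check for an existing checkpoint and reuse its stored result.
        if name in self.entries:
            return json.loads(self.entries[name])
        # 2. No checkpoint yet: execute the operation code.
        result = fn()
        # 3-4. Serialize the result and persist it before continuing.
        self.entries[name] = json.dumps(result)
        # 5. Return the result to the caller.
        return result


calls: list[str] = []


def charge() -> dict:
    """Pretend side effect; records each real execution."""
    calls.append("charge")
    return {"status": "charged", "amount": 42}


log = ToyCheckpointLog()
first = log.run_step("process-payment", charge)
second = log.run_step("process-payment", charge)  # served from the checkpoint
```

In this toy model the second call never runs `charge` again, which is the property that lets a resumed function skip work it already completed.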

## Replay behavior
<a name="durable-sdk-replay-behavior"></a>

When your function resumes after a pause or interruption, the SDK performs replay:

1. **Load checkpoint log:** The SDK retrieves the checkpoint log for this execution from Lambda.

1. **Run from beginning:** The SDK invokes your handler function from the start, not from where it paused.

1. **Skip completed durable operations:** As your code calls durable operations, the SDK checks each against the checkpoint log. For completed durable operations, the SDK returns the stored result without executing the operation code.
**Note**  
If a child context's result exceeded the maximum checkpoint size (256 KB), the SDK runs the context's code again during replay. This lets you construct a large result from the durable operations that ran inside the context, whose results are looked up from the checkpoint log. Code that runs in the context itself must therefore be deterministic. When child contexts produce large results, perform long-running or non-deterministic work inside steps, and keep the context body to short-running tasks that combine the step results.

1. **Resume at interruption point:** When the SDK reaches an operation without a checkpoint, it executes normally and creates new checkpoints as durable operations complete.

This replay mechanism requires your code to be deterministic. Given the same inputs and checkpoint log, your function must make the same sequence of durable operation calls. The SDK enforces this by validating that operation names and types match the checkpoint log during replay.
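
As a rough illustration of replay, the following toy model walks an ordered log and raises an error when the code diverges from it. The names `ToyReplayContext` and `NonDeterministicReplayError` are assumptions made for this sketch, not SDK types, and the sketch compares operation names only, whereas the text above says the SDK also validates operation types.

```python
from typing import Any, Callable


class NonDeterministicReplayError(Exception):
    """Raised when replayed code diverges from the recorded operations."""


class ToyReplayContext:
    """Hypothetical context that replays an ordered list of (name, result)."""

    def __init__(self, log: list[tuple[str, Any]]) -> None:
        self.log = log
        self.position = 0  # index of the next durable operation

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.log):
            recorded_name, recorded_result = self.log[self.position]
            if recorded_name != name:
                raise NonDeterministicReplayError(
                    f"expected {recorded_name!r}, got {name!r}"
                )
            self.position += 1
            return recorded_result  # replay: skip execution
        result = fn()  # past the end of the log: execute normally
        self.log.append((name, result))
        self.position += 1
        return result


executed: list[str] = []


def create_order() -> str:
    executed.append("create-order")
    return "order-1"


def charge() -> str:
    executed.append("charge")
    return "receipt-9"


def handler(ctx: ToyReplayContext) -> str:
    order = ctx.step("create-order", create_order)
    receipt = ctx.step("charge", charge)
    return f"{order}:{receipt}"


# Pretend an earlier invocation completed only the first step.
log: list[tuple[str, Any]] = [("create-order", "order-1")]
output = handler(ToyReplayContext(log))
```

Here the handler runs from the top, the first step is answered from the log, and only the second step executes. Renaming or reordering the steps between invocations would trip the name check.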

## Available durable operations
<a name="durable-sdk-operations"></a>

The `DurableContext` provides operations for different coordination patterns. Each durable operation creates checkpoints automatically, ensuring your function can resume from any point.

### Steps
<a name="durable-sdk-op-step"></a>

Executes business logic with automatic checkpointing and retry. Use steps for operations that call external services, perform calculations, or execute any logic that should be checkpointed. The SDK creates a checkpoint before and after the step, storing the result for replay.

------
#### [ TypeScript ]

```
const result = await context.step('process-payment', async () => {
  return await paymentService.charge(amount);
});
```

------
#### [ Python ]

```
result = context.step(
    lambda _: payment_service.charge(amount),
    name='process-payment'
)
```

------
#### [ Java ]

```
var result = context.step("process-payment", Payment.class, 
    () -> paymentService.charge(amount)
);
```

------

Steps support configurable retry strategies, execution semantics (at-most-once or at-least-once), and custom serialization.

### Waits
<a name="durable-sdk-op-wait"></a>

Pauses execution for a specified duration without consuming compute resources. The SDK creates a checkpoint, terminates the function invocation, and schedules resumption. When the wait completes, Lambda invokes your function again and the SDK replays to the wait point before continuing.

------
#### [ TypeScript ]

```
// Wait 1 hour without charges
await context.wait({ seconds: 3600 });
```

------
#### [ Python ]

```
# Wait 1 hour without charges
context.wait(Duration.from_seconds(3600))
```

------
#### [ Java ]

```
// Wait 1 hour without charges
context.wait(Duration.ofHours(1));
```

------

### Callbacks
<a name="durable-sdk-op-callback"></a>

Callbacks enable your function to pause and wait for external systems to provide input. When you create a callback, the SDK generates a unique callback ID and creates a checkpoint. Your function then suspends (terminates the invocation) without incurring compute charges. External systems submit callback results using the `SendDurableExecutionCallbackSuccess` or `SendDurableExecutionCallbackFailure` Lambda APIs. When a callback is submitted, Lambda invokes your function again, the SDK replays to the callback point, and your function continues with the callback result.

The SDK provides two methods for working with callbacks:

**createCallback:** Creates a callback and returns both a promise and a callback ID. You send the callback ID to an external system, which submits the result using the Lambda API.

------
#### [ TypeScript ]

```
const [promise, callbackId] = await context.createCallback('approval', {
  timeout: { hours: 24 }
});

await sendApprovalRequest(callbackId, requestData);
const approval = await promise;
```

------
#### [ Python ]

```
callback = context.create_callback(
    name='approval',
    config=CallbackConfig(timeout_seconds=86400)
)

context.step(
    lambda _: send_approval_request(callback.callback_id),
    name='send_request'
)

approval = callback.result()
```

------
#### [ Java ]

```
var config = CallbackConfig.builder()
    .timeout(Duration.ofHours(24))
    .build();

var callback = context.createCallback("approval", String.class, config);

context.step("send-request", String.class, () -> {
    notificationService.sendApprovalRequest(callback.callbackId(), requestData);
    return "request-sent";
});

// Blocks until the callback finishes or times out
String approval = callback.get();
```

------

**waitForCallback:** Simplifies callback handling by combining callback creation and submission in one operation. The SDK creates the callback, executes your submitter function with the callback ID, and waits for the result.

------
#### [ TypeScript ]

```
const result = await context.waitForCallback(
  'external-api',
  async (callbackId, ctx) => {
    await submitToExternalAPI(callbackId, requestData);
  },
  { timeout: { minutes: 30 } }
);
```

------
#### [ Python ]

```
result = context.wait_for_callback(
    lambda callback_id: submit_to_external_api(callback_id, request_data),
    name='external-api',
    config=WaitForCallbackConfig(timeout_seconds=1800)
)
```

------
#### [ Java ]

```
var result = context.waitForCallback(
    "external-api",
    String.class,
    (callbackId, ctx) -> {
        submitToExternalAPI(callbackId, requestData);
    },
    WaitForCallbackConfig.builder()
        .callbackConfig(CallbackConfig.builder()
            .timeout(Duration.ofMinutes(30))
            .build())
        .build());
```

------

Configure timeouts to prevent functions from waiting indefinitely. If a callback times out, the SDK throws a `CallbackError` and your function can handle the timeout case. Use heartbeat timeouts for long-running callbacks to detect when external systems stop responding.

Use callbacks for human-in-the-loop workflows, external system integration, webhook responses, or any scenario where execution must pause for external input.

### Parallel execution
<a name="durable-sdk-op-parallel"></a>

Executes multiple operations concurrently with optional concurrency control. The SDK manages parallel execution, creates checkpoints for each operation, and handles failures according to your completion policy.

------
#### [ TypeScript ]

```
const results = await context.parallel([
  async (ctx) => ctx.step('task1', async () => processTask1()),
  async (ctx) => ctx.step('task2', async () => processTask2()),
  async (ctx) => ctx.step('task3', async () => processTask3())
]);
```

------
#### [ Python ]

```
results = context.parallel([
    lambda ctx: ctx.step(lambda _: process_task1(), name='task1'),
    lambda ctx: ctx.step(lambda _: process_task2(), name='task2'),
    lambda ctx: ctx.step(lambda _: process_task3(), name='task3')
])
```

------
#### [ Java ]

```
DurableFuture<String> f1;
DurableFuture<Integer> f2;
DurableFuture<Boolean> f3;
try (var parallel = context.parallel("tasks")) {
    f1 = parallel.branch("string-task",  String.class,  ctx -> ctx.step("string-task",  String.class,  s -> processString()));
    f2 = parallel.branch("integer-task", Integer.class, ctx -> ctx.step("integer-task", Integer.class, s -> processInteger()));
    f3 = parallel.branch("boolean-task", Boolean.class, ctx -> ctx.step("boolean-task", Boolean.class, s -> processBoolean()));
}
String stringResult = f1.get();
int integerResult = f2.get();
boolean booleanResult = f3.get();
```

------

Use `parallel` to execute independent operations concurrently.

### Map
<a name="durable-sdk-op-map"></a>

Concurrently execute an operation on each item in an array with optional concurrency control. The SDK manages concurrent execution, creates checkpoints for each operation, and handles failures according to your completion policy.

------
#### [ TypeScript ]

```
const results = await context.map(itemArray, async (ctx, item, index) =>
  ctx.step('task', async () => processItem(item, index))
);
```

------
#### [ Python ]

```
results = context.map(
    item_array,
    lambda ctx, item, index: ctx.step(
        lambda _: process_item(item, index),
        name='task'
    )
)
```

------
#### [ Java ]

```
var results = context.map(
    "process-items",
    itemArray,
    String.class,
    (item, index, ctx) -> ctx.step("task", String.class, s -> processItem(item, index)));
```

------

Use `map` to process arrays with concurrency control.
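
To make the idea of concurrency control concrete, here is a plain-Python sketch of a bounded map that uses only the standard library. It does not use the durable execution SDK and creates no checkpoints; `bounded_map` and `max_concurrency` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def bounded_map(
    items: Sequence[T], fn: Callable[[T, int], R], max_concurrency: int
) -> list[R]:
    """Apply fn(item, index) to each item, at most max_concurrency at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = [
            pool.submit(fn, item, index) for index, item in enumerate(items)
        ]
        # Collect in submission order so results line up with the input.
        return [future.result() for future in futures]


results = bounded_map(
    ["a", "b", "c"],
    lambda item, index: f"{index}:{item.upper()}",
    max_concurrency=2,
)
```

The results come back in input order even though at most two items are processed at a time.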

### Child contexts
<a name="durable-sdk-op-child-context"></a>

Creates an isolated execution context for grouping operations. Child contexts have their own checkpoint log and can contain multiple steps, waits, and other operations. The SDK treats the entire child context as a single unit for retry and recovery.

Use child contexts to organize complex workflows, implement sub-workflows, or isolate operations that should retry together.

------
#### [ TypeScript ]

```
const result = await context.runInChildContext(
  'batch-processing',
  async (childCtx) => {
    return await processBatch(childCtx, items);
  }
);
```

------
#### [ Python ]

```
result = context.run_in_child_context(
    lambda child_ctx: process_batch(child_ctx, items),
    name='batch-processing'
)
```

------
#### [ Java ]

```
var result = context.runInChildContext(
    "batch-processing", 
    String.class, 
    childCtx -> processBatch(childCtx, items)
);
```

------

The replay mechanism requires durable operations to happen in a deterministic order. With multiple child contexts, you can run several streams of work concurrently, and the determinism requirement applies separately within each context. This allows you to build high-performance functions that efficiently use multiple CPU cores.

For example, imagine we start two child contexts, A and B. On the initial invocation, the steps within the contexts ran in this order, with the 'A' steps running concurrently with the 'B' steps: A1, B1, B2, A2, A3. Upon replay, execution is much faster because results are retrieved from the checkpoint log, and the steps happen to be encountered in a different order: B1, A1, A2, B2, A3. Because the 'A' steps were encountered in the correct order (A1, A2, A3) and the 'B' steps were encountered in the correct order (B1, B2), the determinism requirement is satisfied.
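
The A/B example can be checked mechanically. The sketch below is an illustration only, with a hypothetical helper named `per_context_order`. It groups the encountered steps by context and compares the per-context sequences instead of the global interleaving.

```python
from collections import defaultdict


def per_context_order(events: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (context, step) events by context, keeping encounter order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for context_name, step_name in events:
        grouped[context_name].append(step_name)
    return dict(grouped)


initial = [("A", "A1"), ("B", "B1"), ("B", "B2"), ("A", "A2"), ("A", "A3")]
replay = [("B", "B1"), ("A", "A1"), ("A", "A2"), ("B", "B2"), ("A", "A3")]

# The global interleaving differs, but each context sees its steps in order.
replay_is_valid = per_context_order(initial) == per_context_order(replay)

# A replay that swaps two steps inside context A is not equivalent.
swapped = [("A", "A2"), ("A", "A1"), ("B", "B1"), ("B", "B2"), ("A", "A3")]
swapped_is_valid = per_context_order(initial) == per_context_order(swapped)
```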

### Conditional waits
<a name="durable-sdk-op-wait-condition"></a>

Polls for a condition with automatic checkpointing between attempts. The SDK executes your check function, creates a checkpoint with the result, waits according to your strategy, and repeats until the condition is met.

------
#### [ TypeScript ]

```
const result = await context.waitForCondition(
  async (state, ctx) => {
    const status = await checkJobStatus(state.jobId);
    return { ...state, status };
  },
  {
    initialState: { jobId: 'job-123', status: 'pending' },
    waitStrategy: (state) => 
      state.status === 'completed' 
        ? { shouldContinue: false }
        : { shouldContinue: true, delay: { seconds: 30 } }
  }
);
```

------
#### [ Python ]

```
result = context.wait_for_condition(
    # Return the updated state so the wait strategy can read state['status']
    lambda state, ctx: {**state, 'status': check_job_status(state['jobId'])},
    config=WaitForConditionConfig(
        initial_state={'jobId': 'job-123', 'status': 'pending'},
        wait_strategy=lambda state, attempt: 
            {'should_continue': False} if state['status'] == 'completed'
            else {'should_continue': True, 'delay': 30}
    )
)
```

------
#### [ Java ]

```
record JobState(String jobId, String status) {}

var result = context.waitForCondition(
    "check-job",
    JobState.class,
    (state, ctx) -> {
        var status = checkJobStatus(state.jobId());
        var updatedState = new JobState(state.jobId(), status);
        if ("completed".equals(status)) {
            return WaitForConditionResult.stopPolling(updatedState);
        }
        return WaitForConditionResult.continuePolling(updatedState);
    },
    WaitForConditionConfig.<JobState>builder()
        .initialState(new JobState("job-123", "pending"))
        .waitStrategy((state, attempt) -> Duration.ofSeconds(30))
        .build());
```

------

Use `waitForCondition` for polling external systems, waiting for resources to be ready, or implementing retry with backoff.
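
For the backoff case, a wait strategy can be written as a pure function of the attempt number. The following sketch simulates the polling loop without the SDK and without sleeping. `backoff_delay`, `simulate_polling`, the 1-based attempt counter, and the 5-second base and 60-second cap are all assumptions made for illustration.

```python
from typing import Callable


def backoff_delay(attempt: int, base_seconds: int = 5, cap_seconds: int = 60) -> int:
    """Delay before the next poll: doubles per attempt (1-based), up to a cap."""
    return min(base_seconds * 2 ** (attempt - 1), cap_seconds)


def simulate_polling(
    check: Callable[[int], str], max_attempts: int = 10
) -> tuple[str, int, int]:
    """Poll until check reports 'completed'.

    Returns (last status, attempts made, total simulated wait in seconds).
    """
    status = "pending"
    waited = 0
    for attempt in range(1, max_attempts + 1):
        status = check(attempt)
        if status == "completed":
            return status, attempt, waited
        # A durable wait would suspend the invocation here instead of adding up.
        waited += backoff_delay(attempt)
    return status, max_attempts, waited


# Hypothetical job that finishes on the fourth poll.
status, attempts, waited = simulate_polling(
    lambda attempt: "completed" if attempt >= 4 else "pending"
)
```

Keeping the strategy a pure function of state and attempt number makes it easy to unit test and keeps it deterministic across replays.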

### Function invocation
<a name="durable-sdk-op-invoke"></a>

Invokes another Lambda function and waits for its result. The SDK creates a checkpoint, invokes the target function, and resumes your function when the invocation completes. This enables function composition and workflow decomposition.

------
#### [ TypeScript ]

```
const result = await context.invoke(
  'invoke-processor',
  'arn:aws:lambda:us-east-1:123456789012:function:processor:1',
  { data: inputData }
);
```

------
#### [ Python ]

```
result = context.invoke(
    'arn:aws:lambda:us-east-1:123456789012:function:processor:1',
    {'data': input_data},
    name='invoke-processor'
)
```

------
#### [ Java ]

```
var result = context.invoke(
    "invoke-processor", 
    "arn:aws:lambda:us-east-1:123456789012:function:processor:1",
    inputData,
    Result.class, 
    InvokeConfig.builder().build()
);
```

------

## How durable operations are metered
<a name="durable-operations-checkpoint-consumption"></a>

Each durable operation you call through `DurableContext` creates checkpoints to track execution progress and store state data. These operations incur charges based on their usage, and the checkpoints may contain data that contributes to your data write and retention costs. Stored data includes invocation event data, payloads returned from steps, and data passed when completing callbacks. Understanding how durable operations are metered helps you estimate execution costs and optimize your workflows. For details on pricing, see the [Lambda pricing page](https://aws.amazon.com/lambda/pricing/).

Payload size refers to the size of the serialized data that a durable operation persists. The data is measured in bytes and the size can vary depending on the serializer used by the operation. The payload of an operation could be the result itself for successful completions, or the serialized error object if the operation failed.
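
To see how the serializer affects payload size, the sketch below measures the same value under two JSON encodings from the Python standard library. The payload is made up, and the serializer your SDK version uses may differ, so treat the numbers as illustrative only.

```python
import json

# Hypothetical step result.
payload = {"orderId": "o-1", "items": [1, 2, 3]}

# Default JSON encoding adds a space after each ',' and ':'.
default_size = len(json.dumps(payload).encode("utf-8"))

# Compact separators drop those spaces, so the same data takes fewer bytes.
compact_size = len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))
```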

### Basic operations
<a name="durable-operations-basic"></a>

Basic operations are the fundamental building blocks for durable functions:


| Operation | Checkpoint timing | Number of operations | Data persisted | 
| --- | --- | --- | --- | 
| Execution | Started | 1 | Input payload size | 
| Execution | Completed (Succeeded/Failed/Stopped) | 0 | Output payload size | 
| Step | Retry/Succeeded/Failed | 1 + N retries | Returned payload size from each attempt | 
| Wait | Started | 1 | N/A | 
| WaitForCondition | Each poll attempt | 1 + N polls | Returned payload size from each poll attempt | 
| Invocation-level Retry | Started | 1 | Payload for error object | 
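
As a back-of-the-envelope aid, the operation counts in this table can be added up in a few lines. This is a sketch under stated assumptions: it reads the step and `WaitForCondition` rows as one operation plus one per retry or poll, it ignores invocation-level retries, and `count_basic_operations` is a hypothetical helper, not a pricing calculator. Refer to the Lambda pricing page for authoritative figures.

```python
def count_basic_operations(
    step_retries: list[int], waits: int, condition_polls: list[int]
) -> int:
    """Estimate metered operations for one execution of basic operations.

    step_retries: number of retries for each step.
    waits: number of wait operations.
    condition_polls: N for each conditional wait, counted as 1 + N.
    """
    total = 1  # execution start; completion is counted as 0
    total += sum(1 + retries for retries in step_retries)
    total += waits  # one operation per wait
    total += sum(1 + polls for polls in condition_polls)
    return total


# Three steps (one retried twice), one wait, one conditional wait with N = 4.
estimate = count_basic_operations([0, 2, 0], waits=1, condition_polls=[4])
```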

### Callback operations
<a name="durable-operations-callbacks"></a>

Callback operations enable your function to pause and wait for external systems to provide input. These operations create checkpoints when the callback is created and when it's completed:


| Operation | Checkpoint timing | Number of operations | Data persisted | 
| --- | --- | --- | --- | 
| CreateCallback | Started | 1 | N/A | 
| Callback completion via API call | Completed | 0 | Callback payload | 
| WaitForCallback | Started | 3 + N retries (context + callback + step) | Payloads returned by submitter step attempts, plus two copies of the callback payload | 

### Compound operations
<a name="durable-operations-compound"></a>

Compound operations combine multiple durable operations to handle complex coordination patterns like parallel execution, array processing, and nested contexts:


| Operation | Checkpoint timing | Number of operations | Data persisted | 
| --- | --- | --- | --- | 
| Parallel | Started | 1 + N branches (1 parent context + N child contexts) | Up to two copies of the returned payload size from each branch, plus the statuses of each branch | 
| Map | Started | 1 + N branches (1 parent context + N child contexts) | Up to two copies of the returned payload size from each iteration, plus the statuses of each iteration | 
| Promise helpers | Completed | 1 | Returned payload size from the promise | 
| RunInChildContext | Succeeded/Failed | 1 | Returned payload size from the child context | 

For contexts, such as from `runInChildContext` or used internally by compound operations, results smaller than 256 KB are checkpointed directly. Larger results aren't stored—instead, they're reconstructed during replay by re-processing the context's operations.
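
The size rule can be approximated before deployment by measuring a representative result. The sketch below assumes JSON serialization and treats 256 KB as 256 × 1024 bytes. Both are assumptions, and `serialized_size` and `is_checkpointed_directly` are hypothetical helper names.

```python
import json

# Limit quoted in this guide; the 1024-byte KB is an assumption.
MAX_CHECKPOINT_BYTES = 256 * 1024


def serialized_size(payload: object) -> int:
    """Size in bytes of the JSON-serialized payload, encoded as UTF-8."""
    return len(json.dumps(payload).encode("utf-8"))


def is_checkpointed_directly(payload: object) -> bool:
    """True if the result is under the limit and would be stored as-is."""
    return serialized_size(payload) < MAX_CHECKPOINT_BYTES


small_result = {"ids": list(range(100))}
large_result = {"blob": "x" * (300 * 1024)}

small_fits = is_checkpointed_directly(small_result)
large_fits = is_checkpointed_directly(large_result)
```

A result that fails this check is a candidate for the reconstruction path described above, so the code in that context should only combine step results and stay deterministic.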

# Supported runtimes for durable functions
<a name="durable-supported-runtimes"></a>

Durable functions are available for selected managed runtimes, and for OCI container images when you need additional runtime version flexibility. You can create durable functions for Node.js, Python, and Java using managed runtimes, either directly in the console or programmatically through infrastructure as code.

## Lambda managed runtimes
<a name="durable-managed-runtimes"></a>

The following managed runtimes support durable functions when you create functions in the Lambda console or using the AWS CLI with the `--durable-config '{"ExecutionTimeout": 3600, "RetentionPeriodInDays": 7}'` parameter. For complete information about Lambda runtimes, see [Lambda runtimes](lambda-runtimes.md).


| Language | Runtime | 
| --- | --- | 
| Node.js | nodejs22.x | 
| Node.js | nodejs24.x | 
| Python | python3.13 | 
| Python | python3.14 | 
| Java | java17 | 
| Java | java21 | 
| Java | java25 | 

**Note**  
Lambda Node.js and Python runtimes include the durable execution SDK for testing and development. However, we recommend including the SDK in your deployment package for production. This ensures version consistency and avoids potential runtime updates that might affect your function behavior. Because Java is a compiled language, Lambda Java runtimes do not include the durable execution SDK, so it must be included in your deployment package.

### Node.js
<a name="durable-runtime-nodejs"></a>

Install the SDK in your Node.js project:

```
npm install @aws/durable-execution-sdk-js
```

The SDK supports JavaScript and TypeScript. For TypeScript projects, the SDK includes type definitions.

### Python
<a name="durable-runtime-python"></a>

Install the SDK in your Python project:

```
pip install aws-durable-execution-sdk-python
```

The Python SDK uses synchronous methods and doesn't require `async/await`.

### Java
<a name="durable-runtime-java"></a>

Add a dependency to `pom.xml`:

```
<dependency>
    <groupId>software.amazon.lambda.durable</groupId>
    <artifactId>aws-durable-execution-sdk-java</artifactId>
    <version>VERSION</version>
</dependency>
```

Install the SDK in your Java project:

```
mvn install
```

The Java SDK provides both synchronous and asynchronous versions of each method.

## Container images
<a name="durable-container-images"></a>

You can use durable functions with container images to support additional runtime versions or custom runtime configurations. Container images let you use runtime versions not available as managed runtimes or customize your runtime environment.

To create a durable function using a container image:

1. Create a Dockerfile based on a Lambda base image

1. Install the durable execution SDK in your container

1. Build and push the container image to Amazon Elastic Container Registry

1. Create the Lambda function from the container image with durable execution enabled

### Container example
<a name="durable-container-python"></a>

Create a Dockerfile:

------
#### [ Python ]

Create a Dockerfile for Python 3.11:

```
FROM public.ecr.aws/lambda/python:3.11

# Copy requirements file
COPY requirements.txt ${LAMBDA_TASK_ROOT}/

# Install dependencies including durable SDK
RUN pip install -r requirements.txt

# Copy function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}/

# Set the handler
CMD [ "lambda_function.handler" ]
```

Create a `requirements.txt` file:

```
aws-durable-execution-sdk-python
```

------
#### [ Java ]

Create a Dockerfile for Java 25:

```
FROM --platform=linux/amd64 public.ecr.aws/lambda/java:25

# Install Maven
RUN dnf install -y maven

WORKDIR /var/task

# Copy Maven configuration and source code
COPY pom.xml .
COPY src ./src

# Build
RUN mvn clean package -DskipTests

# Move JAR to lib directory (create it first, because it may not exist in the image)
RUN mkdir -p lib && mv target/*.jar lib/

# Set the handler
CMD ["src.path.to.lambdaFunction::handler"]
```

------

Build and push the image:

```
# Build the image
docker build -t my-durable-function .

# Tag for ECR
docker tag my-durable-function:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-durable-function:latest

# Push to ECR
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-durable-function:latest
```

Create the function with durable execution enabled:

```
aws lambda create-function \
  --function-name myDurableFunction \
  --package-type Image \
  --code ImageUri=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-durable-function:latest \
  --role arn:aws:iam::123456789012:role/lambda-execution-role \
  --durable-config '{"ExecutionTimeout": 3600, "RetentionPeriodInDays": 7}'
```

For more information about using container images with Lambda, see [Creating Lambda container images](https://docs.aws.amazon.com/lambda/latest/dg/images-create.html) in the Lambda Developer Guide.

## Runtime considerations
<a name="durable-runtime-considerations"></a>

**SDK version management:** Include the durable execution SDK in your deployment package or container image. This ensures your function uses a specific SDK version and isn't affected by runtime updates. Pin SDK versions in your `package.json`, `requirements.txt`, or `pom.xml` to control when you upgrade.

**Runtime updates:** AWS updates managed runtimes to include security patches and bug fixes. These updates may include new SDK versions. To avoid unexpected behavior, include the SDK in your deployment package and test thoroughly before deploying to production.

**Container image size:** Container images have a maximum uncompressed size of 10 GB. The durable execution SDK adds minimal size to your image. Optimize your container by using multi-stage builds and removing unnecessary dependencies.

**Cold start performance:** Container images may have longer cold start times than managed runtimes. The durable execution SDK has minimal impact on cold start performance. Use provisioned concurrency if cold start latency is critical for your application.