

# Using DynamoDB with other AWS services
<a name="OtherServices"></a>

Amazon DynamoDB is integrated with other AWS services, letting you automate recurring tasks or build applications that span multiple services. 

**Topics**
+ [Configuring AWS credentials using Amazon Cognito for DynamoDB](Cognito.Credentials.md)
+ [Integrating with Amazon Redshift](RedshiftforDynamoDB.md)
+ [Processing DynamoDB data with Apache Hive on Amazon EMR](EMRforDynamoDB.md)
+ [Integrating DynamoDB with Amazon S3](S3forDynamoDB.md)
+ [DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse](amazon-sagemaker-lakehouse-for-DynamoDB.md)
+ [DynamoDB zero-ETL integration with Amazon OpenSearch Service](OpenSearchIngestionForDynamoDB.md)
+ [Integrating DynamoDB with Amazon EventBridge](eventbridge-for-dynamodb.md)
+ [Integrating DynamoDB with Amazon Managed Streaming for Apache Kafka](msk-for-dynamodb.md)
+ [Best practices for integrating with DynamoDB](bp-integration.md)

# Configuring AWS credentials using Amazon Cognito for DynamoDB
<a name="Cognito.Credentials"></a>

The recommended way to obtain AWS credentials for your web and mobile applications is to use Amazon Cognito. Amazon Cognito helps you avoid hardcoding your AWS credentials in your files. It uses AWS Identity and Access Management (IAM) roles to generate temporary credentials for your application's authenticated and unauthenticated users.

 For example, to configure your JavaScript files to use an Amazon Cognito unauthenticated role to access the Amazon DynamoDB web service, do the following.

**To configure credentials to integrate with Amazon Cognito**

1.  Create an Amazon Cognito identity pool that allows unauthenticated identities. 

   ```
   aws cognito-identity create-identity-pool \
       --identity-pool-name DynamoPool \
       --allow-unauthenticated-identities \
       --output json
   {
       "IdentityPoolId": "us-west-2:12345678-1ab2-123a-1234-a12345ab12",
       "AllowUnauthenticatedIdentities": true,
       "IdentityPoolName": "DynamoPool"
   }
   ```

1.  Copy the following policy into a file named `myCognitoPolicy.json`. Replace the identity pool ID (*us-west-2:12345678-1ab2-123a-1234-a12345ab12*) with your own `IdentityPoolId` obtained in the previous step.

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Federated": "cognito-identity.amazonaws.com"
         },
         "Action": "sts:AssumeRoleWithWebIdentity",
         "Condition": {
           "StringEquals": {
             "cognito-identity.amazonaws.com:aud": "us-west-2:12345678-1ab2-123a-1234-a12345ab12"
           },
           "ForAnyValue:StringLike": {
             "cognito-identity.amazonaws.com:amr": "unauthenticated"
           }
         }
       }
     ]
   }
   ```

------

1.  Create an IAM role and attach the preceding policy document as its trust policy. In this way, Amazon Cognito becomes a trusted entity that can assume the `Cognito_DynamoPoolUnauth` role. 

   ```
   aws iam create-role --role-name Cognito_DynamoPoolUnauth \
   --assume-role-policy-document file://PathToFile/myCognitoPolicy.json --output json
   ```

1.  Grant the `Cognito_DynamoPoolUnauth` role full access to DynamoDB by attaching a managed policy (`AmazonDynamoDBFullAccess`). 

   ```
   aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess \
   --role-name Cognito_DynamoPoolUnauth
   ```
**Note**  
 Alternatively, you can grant fine-grained access to DynamoDB. For more information, see [Using IAM policy conditions for fine-grained access control](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/specifying-conditions.html). 

1.  Obtain and copy the IAM role Amazon Resource Name (ARN).

   ```
   aws iam get-role --role-name Cognito_DynamoPoolUnauth --output json 
   ```

1.  Add the `Cognito_DynamoPoolUnauth` role to the `DynamoPool` identity pool. The format to specify is `KeyName=string`, where `KeyName` is `unauthenticated` and the string is the role ARN obtained in the previous step. 

   ```
   aws cognito-identity set-identity-pool-roles \
   --identity-pool-id "us-west-2:12345678-1ab2-123a-1234-a12345ab12" \
   --roles unauthenticated=arn:aws:iam::123456789012:role/Cognito_DynamoPoolUnauth --output json
   ```

1.  Specify the Amazon Cognito credentials in your files. Modify the `IdentityPoolId` and `RoleArn` accordingly. 

   ```
   AWS.config.credentials = new AWS.CognitoIdentityCredentials({
   IdentityPoolId: "us-west-2:12345678-1ab2-123a-1234-a12345ab12",
   RoleArn: "arn:aws:iam::123456789012:role/Cognito_DynamoPoolUnauth"
   });
   ```

 You can now run your JavaScript programs against the DynamoDB web service using Amazon Cognito credentials. For more information, see [Setting credentials in a web browser](https://docs.aws.amazon.com/sdk-for-javascript/latest/developer-guide/setting-credentials-browser.html) in the *AWS SDK for JavaScript Getting Started Guide*. 
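
The trust policy from step 2 can also be generated programmatically, which avoids copy-paste errors in the identity pool ID. Here is a minimal Python sketch; the pool ID is the example value from this walkthrough, not a real pool, and the output file matches the name used in step 3:

```python
import json

def cognito_unauth_trust_policy(identity_pool_id: str) -> str:
    """Build the trust policy that lets Amazon Cognito assume the
    unauthenticated role for the given identity pool."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": "cognito-identity.amazonaws.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        "cognito-identity.amazonaws.com:aud": identity_pool_id
                    },
                    "ForAnyValue:StringLike": {
                        "cognito-identity.amazonaws.com:amr": "unauthenticated"
                    },
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)

# Write the file that step 3 passes to `aws iam create-role`.
with open("myCognitoPolicy.json", "w") as f:
    f.write(cognito_unauth_trust_policy("us-west-2:12345678-1ab2-123a-1234-a12345ab12"))
```

You would still run the `aws iam create-role` command from step 3 against the generated file; only the policy authoring is automated here.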

# Integrating with Amazon Redshift
<a name="RedshiftforDynamoDB"></a>

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools.

DynamoDB and Amazon Redshift can be used together to address different data storage and processing needs within an application or data ecosystem.

See the following topics for details on how to integrate DynamoDB with Amazon Redshift.

**Topics**
+ [Cross-account integration considerations with CMK](cross-account-integration-considerations.md)
+ [DynamoDB zero-ETL integration with Amazon Redshift](RedshiftforDynamoDB-zero-etl.md)
+ [Loading data from DynamoDB into Amazon Redshift with the COPY command](RedshiftforDynamoDB-copy-data.md)

# Cross-account integration considerations with CMK
<a name="cross-account-integration-considerations"></a>

When you integrate from DynamoDB to Amazon Redshift, the initial action is launched from Amazon Redshift. Without the proper permissions, this action can fail silently. The following sections detail the permissions required for this cross-account integration.

## Required AWS KMS policies and permissions
<a name="required-kms-policies-permissions"></a>

Replace the following placeholders in the examples:
+ `111122223333`: The AWS account ID where Amazon Redshift is hosted
+ `444455556666`: The AWS account ID where DynamoDB is hosted
+ `REDSHIFT_ROLE_NAME`: The IAM role name used by Amazon Redshift
+ `REGION`: The AWS Region where your resources are located
+ `TABLE_NAME`: The name of your DynamoDB table
+ `KMS_KEY_ID`: The ID of your KMS key

### KMS key policy in the DynamoDB account
<a name="kms-key-policy-dynamodb-account"></a>

The following AWS KMS key policy enables cross-account access between your DynamoDB and Amazon Redshift services. In this example, account 444455556666 contains the DynamoDB table and AWS KMS key, while account 111122223333 contains the Amazon Redshift cluster that needs access to decrypt the data.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Enable IAM User Permissions",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::444455556666:root"
            },
            "Action": "kms:*",
            "Resource": "*"
        },
        {
            "Sid": "Allow Redshift to use the key",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/REDSHIFT_ROLE_NAME"
            },
            "Action": [
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:GenerateDataKey",
                "kms:GenerateDataKeyWithoutPlaintext"
            ],
            "Resource": "*"
        }
    ]
}
```

------

### IAM Policy for the Amazon Redshift role (in Amazon Redshift account)
<a name="iam-policy-redshift-role"></a>

The following IAM policy allows the Amazon Redshift service to access DynamoDB tables and their associated AWS KMS encryption keys in a cross-account scenario. In this example, account 444455556666 contains the DynamoDB resources and AWS KMS keys that the Amazon Redshift service needs to access.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDynamoDBAccess",
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:BatchGetItem",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:GetItem",
                "dynamodb:GetRecords",
                "dynamodb:GetShardIterator",
                "dynamodb:DescribeStream",
                "dynamodb:ListStreams"
            ],
            "Resource": [
                "arn:aws:dynamodb:*:444455556666:table/TABLE_NAME",
                "arn:aws:dynamodb:*:444455556666:table/TABLE_NAME/stream/*"
            ]
        },
        {
            "Sid": "AllowKMSAccess",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:GenerateDataKey",
                "kms:GenerateDataKeyWithoutPlaintext"
            ],
            "Resource": "arn:aws:kms:REGION:444455556666:key/KMS_KEY_ID"
        }
    ]
}
```

------

### Trust relationship for the Amazon Redshift role
<a name="trust-relationship-redshift-role"></a>

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

------

### DynamoDB Table policy (if using resource-based policies)
<a name="dynamodb-table-policy"></a>

The following resource-based policy allows the Amazon Redshift service in account 111122223333 to access DynamoDB tables and streams in account 444455556666. Attach this policy to your DynamoDB table to enable cross-account access.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowRedshiftAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/REDSHIFT_ROLE_NAME"
            },
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:BatchGetItem",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:GetItem",
                "dynamodb:GetRecords",
                "dynamodb:GetShardIterator",
                "dynamodb:DescribeStream",
                "dynamodb:ListStreams"
            ],
            "Resource": [
                "arn:aws:dynamodb:*:444455556666:table/TABLE_NAME",
                "arn:aws:dynamodb:*:444455556666:table/TABLE_NAME/stream/*"
            ]
        }
    ]
}
```

------

## Important considerations
<a name="important-considerations"></a>

1. Ensure that the AWS KMS key is in the same Region as your DynamoDB table.

1. The KMS key must be a customer managed key (CMK), not an AWS managed key.

1. If you're using DynamoDB global tables, configure permissions for all relevant Regions.

1. Consider adding condition statements to restrict access based on VPC endpoints or IP ranges.

1. For enhanced security, consider using the `aws:PrincipalOrgID` condition key to restrict access to your organization.

1. Monitor KMS key usage through CloudTrail and CloudWatch metrics.
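
The Region check in the first consideration can be automated by parsing ARNs. Here is a quick sketch; the ARNs are the illustrative placeholders used throughout this page:

```python
def arn_region(arn: str) -> str:
    # ARN layout: arn:partition:service:region:account-id:resource
    return arn.split(":")[3]

def same_region(table_arn: str, key_arn: str) -> bool:
    """Check the first consideration above: the KMS key must live in the
    same Region as the DynamoDB table it protects."""
    return arn_region(table_arn) == arn_region(key_arn)

table = "arn:aws:dynamodb:us-east-1:444455556666:table/TABLE_NAME"
key = "arn:aws:kms:us-east-1:444455556666:key/KMS_KEY_ID"
print(same_region(table, key))  # True
```

A check like this fits naturally in a deployment pipeline, before policies are applied, where a mismatch would otherwise surface only as a silent integration failure.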

# DynamoDB zero-ETL integration with Amazon Redshift
<a name="RedshiftforDynamoDB-zero-etl"></a>

Amazon DynamoDB zero-ETL integration with Amazon Redshift enables seamless analytics on DynamoDB data without any coding. This fully managed feature automatically replicates DynamoDB tables into an Amazon Redshift database, so users can run SQL queries and analytics on their DynamoDB data without having to set up complex ETL processes. The integration works by replicating data from the DynamoDB table to the Amazon Redshift database. 

To set up the integration, simply specify a DynamoDB table as the source and an Amazon Redshift database as the target. On activation, the integration exports the full DynamoDB table to populate the Amazon Redshift database. The time it takes for this initial process to complete depends on the DynamoDB table size. The zero-ETL integration then incrementally replicates updates from DynamoDB to Amazon Redshift every 15-30 minutes using DynamoDB incremental exports. This means the replicated DynamoDB data in Amazon Redshift is kept up-to-date automatically. 

Once configured, users can analyze the DynamoDB data in Amazon Redshift using standard SQL clients and tools, without impacting DynamoDB table performance. By eliminating cumbersome ETL, this zero-ETL integration provides a fast, easy way to unlock insights from DynamoDB through Amazon Redshift analytics and machine learning capabilities. 

**Topics**
+ [Prerequisites before creating a DynamoDB zero-ETL integration with Amazon Redshift](#RedshiftforDynamoDB-zero-etl-prereqs)
+ [Limitations when using DynamoDB zero-ETL integrations with Amazon Redshift](#RedshiftforDynamoDB-zero-etl-limitations)
+ [Creating a DynamoDB zero-ETL integration with Amazon Redshift](RedshiftforDynamoDB-zero-etl-getting-started.md)
+ [Viewing DynamoDB zero-ETL integrations with Amazon Redshift](RedshiftforDynamoDB-zero-etl-viewing.md)
+ [Deleting DynamoDB zero-ETL integrations with Amazon Redshift](RedshiftforDynamoDB-zero-etl-deleting.md)

## Prerequisites before creating a DynamoDB zero-ETL integration with Amazon Redshift
<a name="RedshiftforDynamoDB-zero-etl-prereqs"></a>

1.  You must have your source DynamoDB table and target Amazon Redshift cluster created before creating an integration. This information is covered in [Step 1: Configuring a source DynamoDB table](RedshiftforDynamoDB-zero-etl-getting-started.md#RedshiftforDynamoDB-zero-etl-getting-started-configuring) and [Step 2: Creating an Amazon Redshift data warehouse](RedshiftforDynamoDB-zero-etl-getting-started.md#RedshiftforDynamoDB-zero-etl-getting-started-creating). 

1.  A zero-ETL integration between Amazon DynamoDB and Amazon Redshift requires your source DynamoDB table to have [Point-in-time recovery (PITR)](Point-in-time-recovery.md) enabled.

1. For **resource-based policies**, the zero-ETL integration requires a resource-based policy attached directly to your DynamoDB table. This inline policy grants the Amazon Redshift service permission to access your table data for replication. For more information about resource-based policies for DynamoDB, see [Using resource-based policies for DynamoDB](access-control-resource-based.md).

   If you create the integration where your DynamoDB table and Amazon Redshift data warehouse are in the same account, you can use the **Fix it for me** option during the create integration step to automatically apply the required resource policies to both DynamoDB and Amazon Redshift.

   If you create an integration where your DynamoDB table and Amazon Redshift data warehouse are in different AWS accounts, you will need to manually apply the following resource policy on your DynamoDB table.

------
#### [ JSON ]


   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "StatementthatallowsAmazonRedshiftservicetoDescribeTableandExportTable",
               "Effect": "Allow",
               "Principal": {
                   "Service": "redshift.amazonaws.com"
               },
               "Action": [
                   "dynamodb:ExportTableToPointInTime",
                   "dynamodb:DescribeTable"
               ],
               "Resource": "*",
               "Condition": {
                   "StringEquals": {
                       "aws:SourceAccount": "111122223333"
                   },
                   "ArnEquals": {
                       "aws:SourceArn": "arn:aws:redshift:us-east-1:111122223333:integration:*"
                   }
               }
           },
           {
               "Sid": "StatementthatallowsAmazonRedshiftservicetoDescribeExport",
               "Effect": "Allow",
               "Principal": {
                   "Service": "redshift.amazonaws.com"
               },
               "Action": "dynamodb:DescribeExport",
               "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/table-name/export/*",
               "Condition": {
                   "StringEquals": {
                       "aws:SourceAccount": "111122223333"
                   },
                   "ArnEquals": {
                       "aws:SourceArn": "arn:aws:redshift:us-east-1:111122223333:integration:*"
                   }
               }
           }
       ]
   }
   ```

------

   You may also need to configure the resource policy on your Amazon Redshift data warehouse. For more information, see [Configure authorization using the Amazon Redshift API](https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-using.redshift-iam.html#zero-etl-using.resource-policies).

1. For **identity-based policies**, the following permissions are required:

   1.  The user creating the integration requires an identity-based policy that authorizes the following actions: `GetResourcePolicy`, `PutResourcePolicy`, and `UpdateContinuousBackups`.
**Note**  
 The following policy examples show the resource as `arn:aws:redshift{-serverless}`. This indicates that the ARN can be either `arn:aws:redshift` or `arn:aws:redshift-serverless`, depending on whether your namespace is an Amazon Redshift cluster or an Amazon Redshift Serverless namespace. 

------
#### [ JSON ]


      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "dynamodb:ListTables"
                  ],
                  "Resource": "*"
              },
              {
                  "Effect": "Allow",
                  "Action": [
                      "dynamodb:GetResourcePolicy",
                      "dynamodb:PutResourcePolicy",
                      "dynamodb:UpdateContinuousBackups"
                  ],
                  "Resource": [
                      "arn:aws:dynamodb:us-east-1:111122223333:table/table-name"
                  ]
              },
              {
                  "Sid": "AllowRedshiftDescribeIntegration",
                  "Effect": "Allow",
                  "Action": [
                      "redshift:DescribeIntegrations"
                  ],
                  "Resource": "*"
              },
              {
                  "Sid": "AllowRedshiftCreateIntegration",
                  "Effect": "Allow",
                  "Action": "redshift:CreateIntegration",
                  "Resource": "arn:aws:redshift:us-east-1:111122223333:integration:*"
              },
              {
                  "Sid": "AllowRedshiftModifyDeleteIntegration",
                  "Effect": "Allow",
                  "Action": [
                      "redshift:ModifyIntegration",
                      "redshift:DeleteIntegration"
                  ],
                  "Resource": "arn:aws:redshift:us-east-1:111122223333:integration:uuid"
              },
              {
                  "Sid": "AllowRedshiftCreateInboundIntegration",
                  "Effect": "Allow",
                  "Action": "redshift:CreateInboundIntegration",
                  "Resource": "arn:aws:redshift:us-east-1:111122223333:namespace:uuid"
              }
          ]
      }
      ```

------

   1.  The user responsible for configuring the destination Amazon Redshift namespace requires an identity-based policy that authorizes the following actions: `PutResourcePolicy`, `DeleteResourcePolicy`, and `GetResourcePolicy`.

------
#### [ JSON ]


      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "redshift:PutResourcePolicy",
                      "redshift:DeleteResourcePolicy",
                      "redshift:GetResourcePolicy"
                  ],
                  "Resource": [
                      "arn:aws:redshift:us-east-1:111122223333:cluster:cluster-name"
                  ]
              },
              {
                  "Effect": "Allow",
                  "Action": [
                      "redshift:DescribeInboundIntegrations"
                  ],
                  "Resource": [
                      "arn:aws:redshift:us-east-1:111122223333:cluster:cluster-name"
                  ]
              }
          ]
      }
      ```

------

1. For **encryption key permissions**, if the source DynamoDB table is encrypted with a customer managed AWS KMS key, you need to add the following policy statement to your KMS key. It allows Amazon Redshift to export data from your encrypted table using your KMS key.

   ```
   {
       "Sid": "Statement to allow Amazon Redshift service to perform Decrypt operation on the source DynamoDB Table",
       "Effect": "Allow",
       "Principal": {
           "Service": [
               "redshift.amazonaws.com"
           ]
       },
       "Action": "kms:Decrypt",
       "Resource": "*",
       "Condition": {
           "StringEquals": {
               "aws:SourceAccount": "<account>"
           },
           "ArnEquals": {
               "aws:SourceArn": "arn:aws:redshift:<region>:<account>:integration:*"
           }
       }
   }
   ```

 You can also follow the steps in [Getting started with zero-ETL integrations](https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-using.setting-up.html#zero-etl-using.redshift-iam) in the *Amazon Redshift Management Guide* to configure the permissions of the Amazon Redshift namespace. 
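
For the cross-account case, the table resource policy above varies only in the account, Region, and table name, so templating it can reduce copy-paste mistakes. Below is a hedged sketch; the Sids are descriptive stand-ins rather than required values, and you should verify the output against your own accounts before attaching it (for example with the `aws dynamodb put-resource-policy` command):

```python
import json

def redshift_zero_etl_table_policy(account: str, region: str, table: str) -> str:
    """Build a DynamoDB resource policy granting the Amazon Redshift service
    principal the describe/export actions a zero-ETL integration needs."""
    condition = {
        "StringEquals": {"aws:SourceAccount": account},
        "ArnEquals": {
            "aws:SourceArn": f"arn:aws:redshift:{region}:{account}:integration:*"
        },
    }
    statements = [
        {
            "Sid": "AllowRedshiftDescribeTableAndExport",
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": ["dynamodb:ExportTableToPointInTime", "dynamodb:DescribeTable"],
            "Resource": "*",
            "Condition": condition,
        },
        {
            "Sid": "AllowRedshiftDescribeExport",
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "dynamodb:DescribeExport",
            "Resource": f"arn:aws:dynamodb:{region}:{account}:table/{table}/export/*",
            "Condition": condition,
        },
    ]
    return json.dumps({"Version": "2012-10-17", "Statement": statements}, indent=4)

# Example values from this page; substitute your own account, Region, and table.
print(redshift_zero_etl_table_policy("111122223333", "us-east-1", "table-name"))
```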

## Limitations when using DynamoDB zero-ETL integrations with Amazon Redshift
<a name="RedshiftforDynamoDB-zero-etl-limitations"></a>

 The following general limitations apply to the current release of this integration. These limitations can change in subsequent releases. 

**Note**  
In addition to the limitations below, also review the general considerations for zero-ETL integrations in [Considerations when using zero-ETL integrations with Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl.reqs-lims.html) in the *Amazon Redshift Management Guide*.
+ The DynamoDB table and Amazon Redshift cluster must be in the same AWS Region.
+ The source DynamoDB table must be encrypted with either an AWS owned key or a customer managed AWS KMS key. AWS managed keys are not supported for the source DynamoDB table.

# Creating a DynamoDB zero-ETL integration with Amazon Redshift
<a name="RedshiftforDynamoDB-zero-etl-getting-started"></a>

 Before creating a zero-ETL integration, you must first set up your source DynamoDB table and then the target Amazon Redshift data warehouse. 

## Step 1: Configuring a source DynamoDB table
<a name="RedshiftforDynamoDB-zero-etl-getting-started-configuring"></a>

 To create a zero-ETL integration with Amazon Redshift, you need to enable point-in-time recovery (PITR) on your table. If you do not have PITR turned on, the console can fix this for you during the integration setup process. For details on how to enable PITR, see [Point-in-time recovery](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery_Howitworks.html). 

## Step 2: Creating an Amazon Redshift data warehouse
<a name="RedshiftforDynamoDB-zero-etl-getting-started-creating"></a>

If you don't already have an Amazon Redshift data warehouse, you can create one. To create an Amazon Redshift Serverless workgroup, see [Creating a workgroup with a namespace](https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-console-workgroups-create-workgroup-wizard.html). To create an Amazon Redshift cluster, see [Creating a cluster](https://docs.aws.amazon.com/redshift/latest/mgmt/create-cluster.html). 

 The target Amazon Redshift workgroup or cluster must have the `enable_case_sensitive_identifier` parameter turned on for the integration to be successful. For more information on enabling case sensitivity, see [Turn on case sensitivity for your data warehouse](https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-setting-up.case-sensitivity.html) in the *Amazon Redshift Management Guide*. 

 After the Amazon Redshift workgroup or cluster setup is complete, you need to configure your data warehouse. See [Zero-ETL integrations](https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-using.html) in the *Amazon Redshift Management Guide* for more information.

## Step 3: Creating a DynamoDB zero-ETL integration
<a name="RedshiftforDynamoDB-zero-etl-getting-started-creating-zetl"></a>

Before you create a zero-ETL integration, make sure to complete the tasks in [Prerequisites before creating a DynamoDB zero-ETL integration with Amazon Redshift](RedshiftforDynamoDB-zero-etl.md#RedshiftforDynamoDB-zero-etl-prereqs). Creating an integration between DynamoDB and Amazon Redshift is a two-step process: first create the integration from the DynamoDB side, and then attach an Amazon Redshift database to the newly created integration. 

**Create a zero-ETL integration**

1.  Sign in to the AWS Management Console and open the Amazon DynamoDB console at [https://console.aws.amazon.com/dynamodbv2](https://console.aws.amazon.com/dynamodbv2). 

1.  In the navigation pane, choose **Integrations**. 

1. Select **Create zero-ETL integration** and choose **Amazon Redshift**.

1. This will take you to the **Amazon Redshift console**. To continue with the procedure, see the **DynamoDB section** in [Create a zero-ETL integration for DynamoDB](https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-setting-up.create-integration-ddb.html).

# Viewing DynamoDB zero-ETL integrations with Amazon Redshift
<a name="RedshiftforDynamoDB-zero-etl-viewing"></a>

 You can view the details of a zero-ETL integration to see its configuration information and current status. 

**To view the details of a zero-ETL integration in the Amazon DynamoDB console:**

1.  Sign in to the AWS Management Console and open the Amazon DynamoDB console at [https://console.aws.amazon.com/dynamodbv2](https://console.aws.amazon.com/dynamodbv2). 

1.  In the DynamoDB console, choose **Integrations**. 

1.  In the **Zero-ETL integration** pane, select the zero-ETL integration you want to view. 

**To view the details of a zero-ETL integration in the Amazon Redshift console:**

1.  Sign in to the AWS Management Console and open the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2](https://console.aws.amazon.com/redshiftv2). 

1. Follow the steps at [Viewing zero-ETL integrations](https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-using.describing.html).

**Note**  
 The possible statuses of a zero-ETL integration with Amazon Redshift are listed in [Viewing zero-ETL integrations](https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-using.describing.html) in the *Amazon Redshift Management Guide*.

# Deleting DynamoDB zero-ETL integrations with Amazon Redshift
<a name="RedshiftforDynamoDB-zero-etl-deleting"></a>

 When you delete a zero-ETL integration, your data isn't deleted from DynamoDB or Amazon Redshift, but DynamoDB stops sending data from your source table to the Amazon Redshift target. 

**To delete a zero-ETL integration**

1.  Sign in to the AWS Management Console and open the Amazon DynamoDB console at [https://console.aws.amazon.com/dynamodbv2](https://console.aws.amazon.com/dynamodbv2). 

1.  In the DynamoDB console, choose **Integrations**. 

1.  In the **Zero-ETL integration** pane, select the zero-ETL integration you want to delete. 

1.  Choose **Manage**. This will take you to the integration details page.

1.  To confirm the deletion, choose **Delete**. 

# Loading data from DynamoDB into Amazon Redshift with the COPY command
<a name="RedshiftforDynamoDB-copy-data"></a>



Amazon Redshift complements Amazon DynamoDB with advanced business intelligence capabilities and a powerful SQL-based interface. When you copy data from a DynamoDB table into Amazon Redshift, you can perform complex data analysis queries on that data, including joins with other tables in your Amazon Redshift cluster.

In terms of provisioned throughput, a copy operation from a DynamoDB table counts against that table's read capacity. After the data is copied, your SQL queries in Amazon Redshift do not affect DynamoDB in any way. This is because your queries act upon a copy of the data from DynamoDB, rather than upon DynamoDB itself.

Before you can load data from a DynamoDB table, you must first create an Amazon Redshift table to serve as the destination for the data. Keep in mind that you are copying data from a NoSQL environment into a SQL environment, and that there are certain rules in one environment that do not apply in the other. Here are some of the differences to consider:
+ DynamoDB table names can contain up to 255 characters, including '.' (dot) and '-' (dash) characters, and are case sensitive. Amazon Redshift table names are limited to 127 characters, cannot contain dots or dashes, and are not case sensitive. In addition, table names cannot conflict with any Amazon Redshift reserved words.
+ DynamoDB does not support the SQL concept of NULL. You need to specify how Amazon Redshift interprets empty or blank attribute values in DynamoDB, treating them either as NULLs or as empty fields.
+ DynamoDB data types do not correspond directly with those of Amazon Redshift. You need to ensure that each column in the Amazon Redshift table is of the correct data type and size to accommodate the data from DynamoDB.

Here is an example COPY command from Amazon Redshift SQL:

```
copy favoritemovies from 'dynamodb://my-favorite-movies-table'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
readratio 50;
```

In this example, the source table in DynamoDB is `my-favorite-movies-table`. The target table in Amazon Redshift is `favoritemovies`. The `readratio 50` clause regulates the percentage of provisioned throughput that is consumed; in this case, the COPY command will use no more than 50 percent of the read capacity units provisioned for `my-favorite-movies-table`. We highly recommend setting this ratio to a value less than the average unused provisioned throughput.
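
The effect of `readratio` is straightforward arithmetic; the following sketch makes the capacity ceiling explicit (the numbers are illustrative, not from the example table):

```python
def copy_read_ceiling(provisioned_rcu: float, readratio_percent: int) -> float:
    """Upper bound on the read capacity units a COPY command will consume,
    given the table's provisioned RCUs and the readratio setting."""
    return provisioned_rcu * readratio_percent / 100

# A table provisioned at 200 RCUs copied with `readratio 50` leaves at
# least 100 RCUs available for the application's own reads.
print(copy_read_ceiling(200, 50))  # 100.0
```

Working through this before running a COPY helps you pick a `readratio` that leaves enough headroom for your application's live traffic.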

For detailed instructions on loading data from DynamoDB into Amazon Redshift, refer to the following topics in the *Amazon Redshift Database Developer Guide*:
+ [Loading data from a DynamoDB table](https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-dynamodb.html)
+ [The COPY command](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html)
+ [COPY examples](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html)

# Processing DynamoDB data with Apache Hive on Amazon EMR
<a name="EMRforDynamoDB"></a>

Amazon DynamoDB is integrated with Apache Hive, a data warehousing application that runs on Amazon EMR. Hive can read and write data in DynamoDB tables, allowing you to:
+ Query live DynamoDB data using a SQL-like language (HiveQL).
+ Copy data from a DynamoDB table to an Amazon S3 bucket, and vice-versa.
+ Copy data from a DynamoDB table into Hadoop Distributed File System (HDFS), and vice-versa.
+ Perform join operations on DynamoDB tables.

**Topics**
+ [Overview](#EMRforDynamoDB.Overview)
+ [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md)
+ [Creating an external table in Hive](EMRforDynamoDB.ExternalTableForDDB.md)
+ [Processing HiveQL statements](EMRforDynamoDB.ProcessingHiveQL.md)
+ [Querying data in DynamoDB](EMRforDynamoDB.Querying.md)
+ [Copying data to and from Amazon DynamoDB](EMRforDynamoDB.CopyingData.md)
+ [Performance tuning](EMRforDynamoDB.PerformanceTuning.md)

## Overview
<a name="EMRforDynamoDB.Overview"></a>

Amazon EMR is a service that makes it easy to quickly and cost-effectively process vast amounts of data. To use Amazon EMR, you launch a managed cluster of Amazon EC2 instances running the Hadoop open source framework. *Hadoop* is a distributed application that implements the MapReduce algorithm, where a task is mapped to multiple nodes in the cluster. Each node processes its designated work, in parallel with the other nodes. Finally, the outputs are reduced on a single node, yielding the final result.

You can choose to launch your Amazon EMR cluster so that it is persistent or transient:
+ A *persistent* cluster runs until you shut it down. Persistent clusters are ideal for data analysis, data warehousing, or any other interactive use.
+ A *transient* cluster runs long enough to process a job flow, and then shuts down automatically. Transient clusters are ideal for periodic processing tasks, such as running scripts.

For information about Amazon EMR architecture and administration, see the [Amazon EMR Management Guide](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide).

When you launch an Amazon EMR cluster, you specify the initial number and type of Amazon EC2 instances. You also specify other distributed applications (in addition to Hadoop itself) that you want to run on the cluster. These applications include Hue, Mahout, Pig, Spark, and more.

For information about applications for Amazon EMR, see the [Amazon EMR Release Guide](https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide).

Depending on the cluster configuration, you might have one or more of the following node types:
+ Leader node — Manages the cluster, coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups. It also tracks the status of each task performed and monitors the health of the instance groups. There is only one leader node in a cluster.
+ Core nodes — Run MapReduce tasks and store data using the Hadoop Distributed File System (HDFS).
+ Task nodes (optional) — Run MapReduce tasks.

# Tutorial: Working with Amazon DynamoDB and Apache Hive
<a name="EMRforDynamoDB.Tutorial"></a>

In this tutorial, you will launch an Amazon EMR cluster, and then use Apache Hive to process data stored in a DynamoDB table.

*Hive* is a data warehouse application for Hadoop that allows you to process and analyze data from multiple sources. Hive provides a SQL-like language, *HiveQL*, that lets you work with data stored locally in the Amazon EMR cluster or in an external data source (such as Amazon DynamoDB).

For more information, see the [Hive Tutorial](https://cwiki.apache.org/confluence/display/Hive/Tutorial).

**Topics**
+ [Before you begin](#EMRforDynamoDB.Tutorial.BeforeYouBegin)
+ [Step 1: Create an Amazon EC2 key pair](EMRforDynamoDB.Tutorial.EC2KeyPair.md)
+ [Step 2: Launch an Amazon EMR cluster](EMRforDynamoDB.Tutorial.LaunchEMRCluster.md)
+ [Step 3: Connect to the Leader node](EMRforDynamoDB.Tutorial.ConnectToLeaderNode.md)
+ [Step 4: Load data into HDFS](EMRforDynamoDB.Tutorial.LoadDataIntoHDFS.md)
+ [Step 5: Copy data to DynamoDB](EMRforDynamoDB.Tutorial.CopyDataToDDB.md)
+ [Step 6: Query the data in the DynamoDB table](EMRforDynamoDB.Tutorial.QueryDataInDynamoDB.md)
+ [Step 7: (Optional) clean up](EMRforDynamoDB.Tutorial.CleanUp.md)

## Before you begin
<a name="EMRforDynamoDB.Tutorial.BeforeYouBegin"></a>

For this tutorial, you will need the following:
+ An AWS account. If you do not have one, see [Signing up for AWS](SettingUp.DynamoWebService.md#SettingUp.DynamoWebService.SignUpForAWS).
+ An SSH client (Secure Shell). You use the SSH client to connect to the leader node of the Amazon EMR cluster and run interactive commands. SSH clients are available by default on most Linux, Unix, and Mac OS X installations. Windows users can download and install the [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/) client, which has SSH support.

**Next step**  
[Step 1: Create an Amazon EC2 key pair](EMRforDynamoDB.Tutorial.EC2KeyPair.md)

# Step 1: Create an Amazon EC2 key pair
<a name="EMRforDynamoDB.Tutorial.EC2KeyPair"></a>

In this step, you will create the Amazon EC2 key pair you need to connect to an Amazon EMR leader node and run Hive commands.

1. Sign in to the AWS Management Console and open the Amazon EC2 console at [https://console.aws.amazon.com/ec2/](https://console.aws.amazon.com/ec2/).

1. Choose a region (for example, `US West (Oregon)`). This should be the same region in which your DynamoDB table is located.

1. In the navigation pane, choose **Key Pairs**.

1. Choose **Create Key Pair**. 

1. In **Key pair name**, type a name for your key pair (for example, `mykeypair`), and then choose **Create**. 

1. Download the private key file. The file name will end with `.pem` (such as `mykeypair.pem`). Keep this private key file in a safe place. You will need it to access any Amazon EMR cluster that you launch with this key pair. 
**Important**  
If you lose the key pair, you cannot connect to the leader node of your Amazon EMR cluster.

   For more information about key pairs, see [Amazon EC2 Key Pairs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) in the *Amazon EC2 User Guide*. 

**Next step**  
[Step 2: Launch an Amazon EMR cluster](EMRforDynamoDB.Tutorial.LaunchEMRCluster.md)

# Step 2: Launch an Amazon EMR cluster
<a name="EMRforDynamoDB.Tutorial.LaunchEMRCluster"></a>

In this step, you will configure and launch an Amazon EMR cluster. Hive and a storage handler for DynamoDB will already be installed on the cluster.

1. Open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr/).

1. Choose **Create Cluster**.

1. On the **Create Cluster - Quick Options** page, do the following:

   1. In **Cluster name**, type a name for your cluster (for example: `My EMR cluster`).

   1. In **EC2 key pair**, choose the key pair you created earlier.

   Leave the other settings at their defaults.

1. Choose **Create cluster**.

It will take several minutes to launch your cluster. You can use the **Cluster Details** page in the Amazon EMR console to monitor its progress.

When the status changes to `Waiting`, the cluster is ready for use.

## Cluster log files and Amazon S3
<a name="EMRforDynamoDB.Tutorial.LaunchEMRCluster.LogFilesAndS3"></a>

An Amazon EMR cluster generates log files that contain information about the cluster status and debugging information. The default settings for **Create Cluster - Quick Options** include setting up Amazon EMR logging.

If one does not already exist, the AWS Management Console creates an Amazon S3 bucket. The bucket name is `aws-logs-account-id-region`, where `account-id` is your AWS account number and `region` is the region in which you launched the cluster (for example, `aws-logs-123456789012-us-west-2`).

**Note**  
You can use the Amazon S3 console to view the log files. For more information, see [View Log Files](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-manage-view-web-log-files.html) in the *Amazon EMR Management Guide*.

You can use this bucket for purposes in addition to logging. For example, you can use the bucket as a location for storing a Hive script or as a destination when exporting data from Amazon DynamoDB to Amazon S3.

**Next step**  
[Step 3: Connect to the Leader node](EMRforDynamoDB.Tutorial.ConnectToLeaderNode.md)

# Step 3: Connect to the Leader node
<a name="EMRforDynamoDB.Tutorial.ConnectToLeaderNode"></a>

When the status of your Amazon EMR cluster changes to `Waiting`, you will be able to connect to the leader node using SSH and perform command line operations.

1. In the Amazon EMR console, choose your cluster's name to view its status.

1. On the **Cluster Details** page, find the **Leader public DNS** field. This is the public DNS name for the leader node of your Amazon EMR cluster.

1. To the right of the DNS name, choose the **SSH** link.

1. Follow the instructions in **Connect to the Leader Node Using SSH**.

   Depending on your operating system, choose the **Windows** tab or the **Mac/Linux** tab, and follow the instructions for connecting to the leader node.

After you connect to the leader node using either SSH or PuTTY, you should see a command prompt similar to the following:

```
[hadoop@ip-192-0-2-0 ~]$ 
```

**Next step**  
[Step 4: Load data into HDFS](EMRforDynamoDB.Tutorial.LoadDataIntoHDFS.md)

# Step 4: Load data into HDFS
<a name="EMRforDynamoDB.Tutorial.LoadDataIntoHDFS"></a>

In this step, you will copy a data file into Hadoop Distributed File System (HDFS), and then create an external Hive table that maps to the data file.

**Download the sample data**

1. Download the sample data archive (`features.zip`):

   ```
   wget https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/samples/features.zip
   ```

1. Extract the `features.txt` file from the archive:

   ```
   unzip features.zip
   ```

1. View the first few lines of the `features.txt` file:

   ```
   head features.txt
   ```

   The result should look similar to this:

   ```
   1535908|Big Run|Stream|WV|38.6370428|-80.8595469|794
   875609|Constable Hook|Cape|NJ|40.657881|-74.0990309|7
   1217998|Gooseberry Island|Island|RI|41.4534361|-71.3253284|10
   26603|Boone Moore Spring|Spring|AZ|34.0895692|-111.410065|3681
   1506738|Missouri Flat|Flat|WA|46.7634987|-117.0346113|2605
   1181348|Minnow Run|Stream|PA|40.0820178|-79.3800349|1558
   1288759|Hunting Creek|Stream|TN|36.343969|-83.8029682|1024
   533060|Big Charles Bayou|Bay|LA|29.6046517|-91.9828654|0
   829689|Greenwood Creek|Stream|NE|41.596086|-103.0499296|3671
   541692|Button Willow Island|Island|LA|31.9579389|-93.0648847|98
   ```

   The `features.txt` file contains a subset of data from the United States Board on Geographic Names ([http://geonames.usgs.gov/domestic/download_data.htm](http://geonames.usgs.gov/domestic/download_data.htm)). The fields in each line represent the following:
   + Feature ID (unique identifier)
   + Name
   + Class (lake; forest; stream; and so on)
   + State
   + Latitude (degrees)
   + Longitude (degrees)
   + Height (in feet)

1. At the command prompt, enter the following command:

   ```
   hive
   ```

   The command prompt changes to this: `hive>` 

1. Enter the following HiveQL statement to create a native Hive table:

   ```
   CREATE TABLE hive_features
       (feature_id             BIGINT,
       feature_name            STRING,
       feature_class           STRING,
       state_alpha             STRING,
       prim_lat_dec            DOUBLE,
       prim_long_dec           DOUBLE,
       elev_in_ft              BIGINT)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '\n';
   ```

1. Enter the following HiveQL statement to load the table with data:

   ```
   LOAD DATA
   LOCAL
   INPATH './features.txt'
   OVERWRITE
   INTO TABLE hive_features;
   ```

1. You now have a native Hive table populated with data from the `features.txt` file. To verify, enter the following HiveQL statement:

   ```
   SELECT state_alpha, COUNT(*)
   FROM hive_features
   GROUP BY state_alpha;
   ```

   The output should show a list of states and the number of geographic features in each.

**Next step**  
[Step 5: Copy data to DynamoDB](EMRforDynamoDB.Tutorial.CopyDataToDDB.md)

# Step 5: Copy data to DynamoDB
<a name="EMRforDynamoDB.Tutorial.CopyDataToDDB"></a>

In this step, you will copy data from the Hive table (`hive_features`) to a new table in DynamoDB.

1. Open the DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

1. Choose **Create Table**.

1. On the **Create DynamoDB table** page, do the following:

   1. In **Table name**, type **Features**.

   1. For **Primary key**, in the **Partition key** field, type **Id**. Set the data type to **Number**.

      Clear **Use Default Settings**. For **Provisioned Capacity**, type the following:
      + **Read Capacity Units**—`10`
      + **Write Capacity Units**—`10`

   Choose **Create**.

1. At the Hive prompt, enter the following HiveQL statement: 

   ```
   CREATE EXTERNAL TABLE ddb_features
       (feature_id   BIGINT,
       feature_name  STRING,
       feature_class STRING,
       state_alpha   STRING,
       prim_lat_dec  DOUBLE,
       prim_long_dec DOUBLE,
       elev_in_ft    BIGINT)
   STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
   TBLPROPERTIES(
       "dynamodb.table.name" = "Features",
       "dynamodb.column.mapping"="feature_id:Id,feature_name:Name,feature_class:Class,state_alpha:State,prim_lat_dec:Latitude,prim_long_dec:Longitude,elev_in_ft:Elevation"
   );
   ```

   You have now established a mapping between Hive and the Features table in DynamoDB.

1. Enter the following HiveQL statement to import data to DynamoDB:

   ```
   INSERT OVERWRITE TABLE ddb_features
   SELECT
       feature_id,
       feature_name,
       feature_class,
       state_alpha,
       prim_lat_dec,
       prim_long_dec,
       elev_in_ft
   FROM hive_features;
   ```

   Hive will submit a MapReduce job, which will be processed by your Amazon EMR cluster. It will take several minutes to complete the job.

1. Verify that the data has been loaded into DynamoDB:

   1. In the DynamoDB console navigation pane, choose **Tables**.

   1. Choose the Features table, and then choose the **Items** tab to view the data.

**Next step**  
[Step 6: Query the data in the DynamoDB table](EMRforDynamoDB.Tutorial.QueryDataInDynamoDB.md)

# Step 6: Query the data in the DynamoDB table
<a name="EMRforDynamoDB.Tutorial.QueryDataInDynamoDB"></a>

In this step, you will use HiveQL to query the Features table in DynamoDB. Try the following Hive queries:

1. All of the feature types (`feature_class`) in alphabetical order:

   ```
   SELECT DISTINCT feature_class
   FROM ddb_features
   ORDER BY feature_class;
   ```

1. All of the lakes that begin with the letter "M":

   ```
   SELECT feature_name, state_alpha
   FROM ddb_features
   WHERE feature_class = 'Lake'
   AND feature_name LIKE 'M%'
   ORDER BY feature_name;
   ```

1. States with at least three features of the same class that are higher than one mile (5,280 feet):

   ```
   SELECT state_alpha, feature_class, COUNT(*)
   FROM ddb_features
   WHERE elev_in_ft > 5280
   GROUP BY state_alpha, feature_class
   HAVING COUNT(*) >= 3
   ORDER BY state_alpha, feature_class;
   ```

**Next step**  
[Step 7: (Optional) clean up](EMRforDynamoDB.Tutorial.CleanUp.md)

# Step 7: (Optional) clean up
<a name="EMRforDynamoDB.Tutorial.CleanUp"></a>

Now that you have completed the tutorial, you can continue reading this section to learn more about working with DynamoDB data in Amazon EMR. You might decide to keep your Amazon EMR cluster up and running while you do this.

If you don't need the cluster anymore, you should terminate it and remove any associated resources. This will help you avoid being charged for resources you don't need.

1. Terminate the Amazon EMR cluster:

   1. Open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr/).

   1. Choose the Amazon EMR cluster, choose **Terminate**, and then confirm.

1. Delete the Features table in DynamoDB:

   1. Open the DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

   1. In the navigation pane, choose **Tables**.

   1. Choose the Features table. From the **Actions** menu, choose **Delete Table**.

1. Delete the Amazon S3 bucket containing the Amazon EMR log files:

   1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

   1. From the list of buckets, choose `aws-logs-accountID-region`, where *accountID* is your AWS account number and *region* is the region in which you launched the cluster.

   1. From the **Action** menu, choose **Delete**.

# Creating an external table in Hive
<a name="EMRforDynamoDB.ExternalTableForDDB"></a>

In [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md), you created an external Hive table that mapped to a DynamoDB table. When you issued HiveQL statements against the external table, the read and write operations were passed through to the DynamoDB table.

You can think of an external table as a pointer to a data source that is managed and stored elsewhere. In this case, the underlying data source is a DynamoDB table. (The table must already exist. You cannot create, update, or delete a DynamoDB table from within Hive.) You use the `CREATE EXTERNAL TABLE` statement to create the external table. After that, you can use HiveQL to work with data in DynamoDB, as if that data were stored locally within Hive.

**Note**  
You can use `INSERT` statements to insert data into an external table and `SELECT` statements to select data from it. However, you cannot use `UPDATE` or `DELETE` statements to manipulate data in the table.

If you no longer need the external table, you can remove it using the `DROP TABLE` statement. In this case, `DROP TABLE` only removes the external table in Hive. It does not affect the underlying DynamoDB table or any of its data.
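For example, the following statement removes the `ddb_features` mapping created in the tutorial; the `Features` table in DynamoDB and all of its data remain intact:

```
DROP TABLE ddb_features;
```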

**Topics**
+ [CREATE EXTERNAL TABLE syntax](#EMRforDynamoDB.ExternalTableForDDB.Syntax)
+ [Data type mappings](#EMRforDynamoDB.ExternalTableForDDB.DataTypes)

## CREATE EXTERNAL TABLE syntax
<a name="EMRforDynamoDB.ExternalTableForDDB.Syntax"></a>

The following shows the HiveQL syntax for creating an external Hive table that maps to a DynamoDB table:

```
CREATE EXTERNAL TABLE hive_table
    (hive_column1_name hive_column1_datatype, hive_column2_name hive_column2_datatype...)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "dynamodb_table",
    "dynamodb.column.mapping" = "hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name..."
);
```

Line 1 is the start of the `CREATE EXTERNAL TABLE` statement, where you provide the name of the Hive table (*hive_table*) you want to create.

Line 2 specifies the columns and data types for *hive_table*. You need to define columns and data types that correspond to the attributes in the DynamoDB table.

Line 3 is the `STORED BY` clause, where you specify a class that handles data management between Hive and the DynamoDB table. For DynamoDB, `STORED BY` should be set to `'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'`.

Line 4 is the start of the `TBLPROPERTIES` clause, where you define the following parameters for `DynamoDBStorageHandler`:
+ `dynamodb.table.name`—the name of the DynamoDB table.
+ `dynamodb.column.mapping`—pairs of column names in the Hive table and their corresponding attributes in the DynamoDB table. Each pair is of the form *hive_column_name:dynamodb_attribute_name*, and the pairs are separated by commas.

Note the following:
+ The Hive table name does not have to be the same as the DynamoDB table name.
+ The Hive table column names do not have to be the same as those in the DynamoDB table.
+ The table specified by `dynamodb.table.name` must exist in DynamoDB.
+ For `dynamodb.column.mapping`:
  + You must map the key schema attributes for the DynamoDB table. This includes the partition key and the sort key (if present).
  + You do not have to map the non-key attributes of the DynamoDB table. However, you will not see any data from those attributes when you query the Hive table.
  + If the data types of a Hive table column and a DynamoDB attribute are incompatible, you will see `NULL` in these columns when you query the Hive table.

**Note**  
The `CREATE EXTERNAL TABLE` statement does not perform any validation on the `TBLPROPERTIES` clause. The values you provide for `dynamodb.table.name` and `dynamodb.column.mapping` are only evaluated by the `DynamoDBStorageHandler` class when you attempt to access the table.

## Data type mappings
<a name="EMRforDynamoDB.ExternalTableForDDB.DataTypes"></a>

The following table shows DynamoDB data types and compatible Hive data types:


****  

| DynamoDB Data Type | Hive Data Type | 
| --- | --- | 
|  String  |  `STRING`  | 
|  Number  |  `BIGINT` or `DOUBLE`  | 
|  Binary  |  `BINARY`  | 
|  String Set  |  `ARRAY<STRING>`  | 
|  Number Set  |  `ARRAY<BIGINT>` or `ARRAY<DOUBLE>`  | 
|  Binary Set  |  `ARRAY<BINARY>`  | 

**Note**  
The following DynamoDB data types are not supported by the `DynamoDBStorageHandler` class, so they cannot be used with `dynamodb.column.mapping`: Map, List, Boolean, and Null.  
However, if you need to work with these data types, you can create a single entity called `item` that represents the entire DynamoDB item as a map of strings for both the keys and the values. For more information, see [Copying data without a column mapping](EMRforDynamoDB.CopyingData.S3.md#EMRforDynamoDB.CopyingData.S3.NoColumnMapping).
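A minimal sketch of that approach, using the tutorial's `Features` table (the table and column names here are illustrative). Declaring a single `item` column of type `map<string,string>` and omitting `dynamodb.column.mapping` exposes each DynamoDB item as one map:

```
CREATE EXTERNAL TABLE ddb_features_no_mapping
    (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Features");
```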

If you want to map a DynamoDB attribute of type Number, you must choose an appropriate Hive type:
+ The Hive `BIGINT` type is for 8-byte signed integers. It is the same as the `long` data type in Java.
+ The Hive `DOUBLE` type is for 8-byte double-precision floating point numbers. It is the same as the `double` type in Java.

If you have numeric data stored in DynamoDB that has a higher precision than the Hive data type you choose, then accessing the DynamoDB data could cause a loss of precision. 

If you export data of type Binary from DynamoDB to Amazon S3 or HDFS, the data is stored as a Base64-encoded string. If you import data from Amazon S3 or HDFS into the DynamoDB Binary type, you must ensure the data is encoded as a Base64 string.
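If you need to convert such values within Hive itself, the built-in `base64` and `unbase64` functions can help. A hedged sketch, assuming a hypothetical table `s3_import` with a Base64-encoded `STRING` column named `payload`:

```
SELECT unbase64(payload)    -- decodes the Base64 STRING to Hive BINARY
FROM s3_import;
```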

# Processing HiveQL statements
<a name="EMRforDynamoDB.ProcessingHiveQL"></a>

Hive is an application that runs on Hadoop, which is a batch-oriented framework for running MapReduce jobs. When you issue a HiveQL statement, Hive determines whether it can return the results immediately or whether it must submit a MapReduce job.

For example, consider the *ddb_features* table (from [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md)). The following Hive query prints state abbreviations and the number of summits in each:

```
SELECT state_alpha, count(*)
FROM ddb_features
WHERE feature_class = 'Summit'
GROUP BY state_alpha;
```

Hive does not return the results immediately. Instead, it submits a MapReduce job, which is processed by the Hadoop framework. Hive will wait until the job is complete before it shows the results from the query:

```
AK  2
AL  2
AR  2
AZ  3
CA  7
CO  2
CT  2
ID  1
KS  1
ME  2
MI  1
MT  3
NC  1
NE  1
NM  1
NY  2
OR  5
PA  1
TN  1
TX  1
UT  4
VA  1
VT  2
WA  2
WY  3
Time taken: 8.753 seconds, Fetched: 25 row(s)
```

## Monitoring and canceling jobs
<a name="EMRforDynamoDB.MonitorAndCancelJob"></a>

When Hive launches a Hadoop job, it prints output from that job. The job completion status is updated as the job progresses. In some cases, the status might not be updated for a long time. (This can happen when you are querying a large DynamoDB table that has a low provisioned read capacity setting.)
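If the bottleneck is the table's provisioned read capacity, one option is to let Hive consume a larger share of it before running the query; the `dynamodb.throughput.read.percent` setting is described in [Performance tuning](EMRforDynamoDB.PerformanceTuning.md). For example:

```
SET dynamodb.throughput.read.percent=1.0;
```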

If you need to cancel the job before it is complete, you can type **Ctrl+C** at any time.

# Querying data in DynamoDB
<a name="EMRforDynamoDB.Querying"></a>

The following examples show some ways that you can use HiveQL to query data stored in DynamoDB.

These examples refer to the *ddb_features* table in the tutorial ([Step 5: Copy data to DynamoDB](EMRforDynamoDB.Tutorial.CopyDataToDDB.md)).

**Topics**
+ [Using aggregate functions](#EMRforDynamoDB.Querying.AggregateFunctions)
+ [Using the GROUP BY and HAVING clauses](#EMRforDynamoDB.Querying.GroupByAndHaving)
+ [Joining two DynamoDB tables](#EMRforDynamoDB.Querying.JoiningTwoTables)
+ [Joining tables from different sources](#EMRforDynamoDB.Querying.JoiningTablesFromDifferentSources)

## Using aggregate functions
<a name="EMRforDynamoDB.Querying.AggregateFunctions"></a>

HiveQL provides built-in functions for summarizing data values. For example, you can use the `MAX` function to find the largest value for a selected column. The following example returns the elevation of the highest feature in the state of Colorado.

```
SELECT MAX(elev_in_ft)
FROM ddb_features
WHERE state_alpha = 'CO';
```

## Using the GROUP BY and HAVING clauses
<a name="EMRforDynamoDB.Querying.GroupByAndHaving"></a>

You can use the `GROUP BY` clause to collect data across multiple records. This is often used with an aggregate function such as `SUM`, `COUNT`, `MIN`, or `MAX`. You can also use the `HAVING` clause to discard any results that do not meet certain criteria.

The following example returns the highest elevation for each state that has at least five features in the *ddb_features* table.

```
SELECT state_alpha, max(elev_in_ft)
FROM ddb_features
GROUP BY state_alpha
HAVING count(*) >= 5;
```

## Joining two DynamoDB tables
<a name="EMRforDynamoDB.Querying.JoiningTwoTables"></a>

The following example maps another Hive table (*ddb_east_coast_states*) to a table in DynamoDB. The `SELECT` statement is a join across these two tables. The join is computed on the cluster and returned; it does not take place in DynamoDB.

Consider a DynamoDB table named EastCoastStates that contains the following data:

```
StateName       StateAbbrev

Maine           ME
New Hampshire   NH
Massachusetts   MA
Rhode Island    RI
Connecticut     CT
New York        NY
New Jersey      NJ
Delaware        DE
Maryland        MD
Virginia        VA
North Carolina  NC
South Carolina  SC
Georgia         GA
Florida         FL
```

Let's assume the table is available as a Hive external table named *ddb_east_coast_states*:

```
CREATE EXTERNAL TABLE ddb_east_coast_states (state_name STRING, state_alpha STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "EastCoastStates",
"dynamodb.column.mapping" = "state_name:StateName,state_alpha:StateAbbrev");
```

The following join returns the East Coast states that have at least three features of the same class:

```
SELECT ecs.state_name, f.feature_class, COUNT(*)
FROM ddb_east_coast_states ecs
JOIN ddb_features f on ecs.state_alpha = f.state_alpha
GROUP BY ecs.state_name, f.feature_class
HAVING COUNT(*) >= 3;
```

## Joining tables from different sources
<a name="EMRforDynamoDB.Querying.JoiningTablesFromDifferentSources"></a>

In the following example, *s3_east_coast_states* is a Hive table associated with a CSV file stored in Amazon S3. The *ddb_features* table is associated with data in DynamoDB. The query below joins these two tables, returning the geographic features from states whose names begin with "New."

```
CREATE EXTERNAL TABLE s3_east_coast_states (state_name STRING, state_alpha STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';
```

```
SELECT ecs.state_name, f.feature_name, f.feature_class
FROM s3_east_coast_states ecs
JOIN ddb_features f
ON ecs.state_alpha = f.state_alpha
WHERE ecs.state_name LIKE 'New%';
```

# Copying data to and from Amazon DynamoDB
<a name="EMRforDynamoDB.CopyingData"></a>

In the [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md), you copied data from a native Hive table into an external DynamoDB table, and then queried the external DynamoDB table. The table is external because it exists outside of Hive. Even if you drop the Hive table that maps to it, the table in DynamoDB is not affected.

Hive is an excellent solution for copying data among DynamoDB tables, Amazon S3 buckets, native Hive tables, and Hadoop Distributed File System (HDFS). This section provides examples of these operations.

**Topics**
+ [Copying data between DynamoDB and a native Hive table](EMRforDynamoDB.CopyingData.NativeHive.md)
+ [Copying data between DynamoDB and Amazon S3](EMRforDynamoDB.CopyingData.S3.md)
+ [Copying data between DynamoDB and HDFS](EMRforDynamoDB.CopyingData.HDFS.md)
+ [Using data compression](EMRforDynamoDB.CopyingData.Compression.md)
+ [Reading non-printable UTF-8 character data](EMRforDynamoDB.CopyingData.NonPrintableData.md)

# Copying data between DynamoDB and a native Hive table
<a name="EMRforDynamoDB.CopyingData.NativeHive"></a>

If you have data in a DynamoDB table, you can copy the data to a native Hive table. This will give you a snapshot of the data, as of the time you copied it. 

You might decide to do this if you need to perform many HiveQL queries, but do not want to consume provisioned throughput capacity from DynamoDB. Because the data in the native Hive table is a copy of the data from DynamoDB, and not "live" data, your queries should not expect that the data is up-to-date.

**Note**  
The examples in this section are written with the assumption you followed the steps in [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md) and have an external table in DynamoDB named *ddb_features*. 

**Example From DynamoDB to native Hive table**  
You can create a native Hive table and populate it with data from *ddb_features*, like this:  

```
CREATE TABLE features_snapshot AS
SELECT * FROM ddb_features;
```
You can then refresh the data at any time:  

```
INSERT OVERWRITE TABLE features_snapshot
SELECT * FROM ddb_features;
```
In these examples, the subquery `SELECT * FROM ddb_features` will retrieve all of the data from *ddb_features*. If you only want to copy a subset of the data, you can use a `WHERE` clause in the subquery.  
The following example creates a native Hive table, containing only some of the attributes for lakes and summits:  

```
CREATE TABLE lakes_and_summits AS
SELECT feature_name, feature_class, state_alpha
FROM ddb_features
WHERE feature_class IN ('Lake','Summit');
```

**Example From native Hive table to DynamoDB**  
Use the following HiveQL statement to copy the data from the native Hive table to *ddb_features*:  

```
INSERT OVERWRITE TABLE ddb_features
SELECT * FROM features_snapshot;
```

# Copying data between DynamoDB and Amazon S3
<a name="EMRforDynamoDB.CopyingData.S3"></a>

If you have data in a DynamoDB table, you can use Hive to copy the data to an Amazon S3 bucket.

You might do this if you want to create an archive of data in your DynamoDB table. For example, suppose you have a test environment where you need to work with a baseline set of test data in DynamoDB. You can copy the baseline data to an Amazon S3 bucket, and then run your tests. Afterward, you can reset the test environment by restoring the baseline data from the Amazon S3 bucket to DynamoDB.
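The restore step can be sketched with the same external-table technique used elsewhere in this tutorial. The table names and the S3 path below are hypothetical, and *ddb_baseline* is assumed to be a DynamoDB-backed external table with matching columns; the examples later in this section show the exact file formats involved:

```
-- Map a Hive table to baseline data previously exported to Amazon S3
CREATE EXTERNAL TABLE s3_baseline (feature_id BIGINT, feature_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/baseline/';

-- Copy it back into the DynamoDB-backed external table
INSERT OVERWRITE TABLE ddb_baseline
SELECT * FROM s3_baseline;
```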

If you worked through [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md), then you already have an Amazon S3 bucket that contains your Amazon EMR logs. You can use this bucket for the examples in this section, if you know the root path for the bucket:

1. Open the Amazon EMR console at [https://console.aws.amazon.com/emr/](https://console.aws.amazon.com/emr/).

1. For **Name**, choose your cluster.

1. The URI is listed in **Log URI** under **Configuration Details**.

1. Make a note of the root path of the bucket. The naming convention is:

   `s3://aws-logs-accountID-region`

   where *accountID* is your AWS account ID and *region* is the AWS Region for the bucket.

**Note**  
For these examples, we will use a subpath within the bucket, as in this example:  
 `s3://aws-logs-123456789012-us-west-2/hive-test`

The following procedures are written with the assumption you followed the steps in the tutorial and have an external table in DynamoDB named *ddb\_features*.

**Topics**
+ [Copying data using the Hive default format](#EMRforDynamoDB.CopyingData.S3.DefaultFormat)
+ [Copying data with a user-specified format](#EMRforDynamoDB.CopyingData.S3.UserSpecifiedFormat)
+ [Copying data without a column mapping](#EMRforDynamoDB.CopyingData.S3.NoColumnMapping)
+ [Viewing the data in Amazon S3](#EMRforDynamoDB.CopyingData.S3.ViewingData)

## Copying data using the Hive default format
<a name="EMRforDynamoDB.CopyingData.S3.DefaultFormat"></a>

**Example From DynamoDB to Amazon S3**  
Use an `INSERT OVERWRITE` statement to write directly to Amazon S3.  

```
INSERT OVERWRITE DIRECTORY 's3://aws-logs-123456789012-us-west-2/hive-test'
SELECT * FROM ddb_features;
```
The data file in Amazon S3 looks like this:  

```
920709^ASoldiers Farewell Hill^ASummit^ANM^A32.3564729^A-108.3300461^A6135
1178153^AJones Run^AStream^APA^A41.2120086^A-79.2592078^A1260
253838^ASentinel Dome^ASummit^ACA^A37.7229821^A-119.58433^A8133
264054^ANeversweet Gulch^AValley^ACA^A41.6565269^A-122.8361432^A2900
115905^AChacaloochee Bay^ABay^AAL^A30.6979676^A-87.9738853^A0
```
Each field is separated by an SOH character (start of heading, 0x01). In the file, SOH appears as **^A**.
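If you need to process a default-format file outside of Hive, splitting each line on the SOH byte recovers the columns. The following is a minimal Python sketch; the field order follows the `ddb_features` schema used in these examples:

```python
# Parse one line of Hive's default-format output, where fields are
# separated by the SOH control character (0x01).
SOH = "\x01"

line = ("920709" + SOH + "Soldiers Farewell Hill" + SOH + "Summit" + SOH +
        "NM" + SOH + "32.3564729" + SOH + "-108.3300461" + SOH + "6135")

fields = line.split(SOH)
feature_id, feature_name, feature_class, state_alpha = fields[:4]
prim_lat_dec, prim_long_dec, elev_in_ft = fields[4:]

print(feature_name, state_alpha, elev_in_ft)  # Soldiers Farewell Hill NM 6135
```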

**Example From Amazon S3 to DynamoDB**  

1. Create an external table pointing to the unformatted data in Amazon S3.

   ```
   CREATE EXTERNAL TABLE s3_features_unformatted
       (feature_id       BIGINT,
       feature_name      STRING ,
       feature_class     STRING ,
       state_alpha       STRING,
       prim_lat_dec      DOUBLE ,
       prim_long_dec     DOUBLE ,
       elev_in_ft        BIGINT)
   LOCATION 's3://aws-logs-123456789012-us-west-2/hive-test';
   ```

1. Copy the data to DynamoDB.

   ```
   INSERT OVERWRITE TABLE ddb_features
   SELECT * FROM s3_features_unformatted;
   ```

## Copying data with a user-specified format
<a name="EMRforDynamoDB.CopyingData.S3.UserSpecifiedFormat"></a>

If you want to specify your own field separator character, you can create an external table that maps to the Amazon S3 bucket. You might use this technique for creating data files with comma-separated values (CSV).

**Example From DynamoDB to Amazon S3**  

1. Create a Hive external table that maps to Amazon S3. When you do this, ensure that the data types are consistent with those of the DynamoDB external table.

   ```
   CREATE EXTERNAL TABLE s3_features_csv
       (feature_id       BIGINT,
       feature_name      STRING,
       feature_class     STRING,
       state_alpha       STRING,
       prim_lat_dec      DOUBLE,
       prim_long_dec     DOUBLE,
       elev_in_ft        BIGINT)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   LOCATION 's3://aws-logs-123456789012-us-west-2/hive-test';
   ```

1. Copy the data from DynamoDB.

   ```
   INSERT OVERWRITE TABLE s3_features_csv
   SELECT * FROM ddb_features;
   ```
The data file in Amazon S3 looks like this:  

```
920709,Soldiers Farewell Hill,Summit,NM,32.3564729,-108.3300461,6135
1178153,Jones Run,Stream,PA,41.2120086,-79.2592078,1260
253838,Sentinel Dome,Summit,CA,37.7229821,-119.58433,8133
264054,Neversweet Gulch,Valley,CA,41.6565269,-122.8361432,2900
115905,Chacaloochee Bay,Bay,AL,30.6979676,-87.9738853,0
```
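If you want to work with the exported file outside of Hive, standard CSV tooling can read it directly. The following is a minimal Python sketch; the column order follows the `s3_features_csv` table definition above:

```python
import csv
import io

# Read the CSV export into typed rows. In practice you would open the
# downloaded data file instead of this inline sample.
data = """920709,Soldiers Farewell Hill,Summit,NM,32.3564729,-108.3300461,6135
115905,Chacaloochee Bay,Bay,AL,30.6979676,-87.9738853,0
"""

rows = []
for rec in csv.reader(io.StringIO(data)):
    feature_id, name, feature_class, state, lat, lon, elev = rec
    rows.append((int(feature_id), name, feature_class, state,
                 float(lat), float(lon), int(elev)))

print(rows[0][1], rows[0][6])  # Soldiers Farewell Hill 6135
```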

**Example From Amazon S3 to DynamoDB**  
With a single HiveQL statement, you can populate the DynamoDB table using the data from Amazon S3:  

```
INSERT OVERWRITE TABLE ddb_features
SELECT * FROM s3_features_csv;
```

## Copying data without a column mapping
<a name="EMRforDynamoDB.CopyingData.S3.NoColumnMapping"></a>

You can copy data from DynamoDB in a raw format and write it to Amazon S3 without specifying any data types or column mapping. You can use this method to create an archive of DynamoDB data and store it in Amazon S3.



**Example From DynamoDB to Amazon S3**  

1. Create an external table associated with your DynamoDB table. (There is no `dynamodb.column.mapping` in this HiveQL statement.)

   ```
   CREATE EXTERNAL TABLE ddb_features_no_mapping
       (item MAP<STRING, STRING>)
   STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
   TBLPROPERTIES ("dynamodb.table.name" = "Features");
   ```

   

1. Create another external table associated with your Amazon S3 bucket.

   ```
   CREATE EXTERNAL TABLE s3_features_no_mapping
       (item MAP<STRING, STRING>)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t'
   LINES TERMINATED BY '\n'
   LOCATION 's3://aws-logs-123456789012-us-west-2/hive-test';
   ```

1. Copy the data from DynamoDB to Amazon S3.

   ```
   INSERT OVERWRITE TABLE s3_features_no_mapping
   SELECT * FROM ddb_features_no_mapping;
   ```
The data file in Amazon S3 looks like this:  

```
Name^C{"s":"Soldiers Farewell Hill"}^BState^C{"s":"NM"}^BClass^C{"s":"Summit"}^BElevation^C{"n":"6135"}^BLatitude^C{"n":"32.3564729"}^BId^C{"n":"920709"}^BLongitude^C{"n":"-108.3300461"}
Name^C{"s":"Jones Run"}^BState^C{"s":"PA"}^BClass^C{"s":"Stream"}^BElevation^C{"n":"1260"}^BLatitude^C{"n":"41.2120086"}^BId^C{"n":"1178153"}^BLongitude^C{"n":"-79.2592078"}
Name^C{"s":"Sentinel Dome"}^BState^C{"s":"CA"}^BClass^C{"s":"Summit"}^BElevation^C{"n":"8133"}^BLatitude^C{"n":"37.7229821"}^BId^C{"n":"253838"}^BLongitude^C{"n":"-119.58433"}
Name^C{"s":"Neversweet Gulch"}^BState^C{"s":"CA"}^BClass^C{"s":"Valley"}^BElevation^C{"n":"2900"}^BLatitude^C{"n":"41.6565269"}^BId^C{"n":"264054"}^BLongitude^C{"n":"-122.8361432"}
Name^C{"s":"Chacaloochee Bay"}^BState^C{"s":"AL"}^BClass^C{"s":"Bay"}^BElevation^C{"n":"0"}^BLatitude^C{"n":"30.6979676"}^BId^C{"n":"115905"}^BLongitude^C{"n":"-87.9738853"}
```
Each field begins with an STX character (start of text, 0x02) and ends with an ETX character (end of text, 0x03). In the file, STX appears as **^B** and ETX appears as **^C**.
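If you need to read these raw-format files outside of Hive, you can split on the same control characters. The following is a minimal Python sketch, assuming Hive's default `MAP` delimiters (map entries separated by 0x02, keys separated from values by 0x03):

```python
import json

# Hive serializes a MAP<STRING, STRING> column with default control
# characters: STX (0x02) between map entries, ETX (0x03) between each
# attribute name and its JSON-encoded DynamoDB value.
STX, ETX = "\x02", "\x03"

line = (
    "Name" + ETX + '{"s":"Soldiers Farewell Hill"}' + STX +
    "State" + ETX + '{"s":"NM"}' + STX +
    "Elevation" + ETX + '{"n":"6135"}'
)

item = {}
for entry in line.split(STX):
    key, raw_value = entry.split(ETX, 1)
    item[key] = json.loads(raw_value)  # {"s": ...} for strings, {"n": ...} for numbers

print(item["Name"]["s"], item["Elevation"]["n"])  # Soldiers Farewell Hill 6135
```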

**Example From Amazon S3 to DynamoDB**  
With a single HiveQL statement, you can populate the DynamoDB table using the data from Amazon S3:  

```
INSERT OVERWRITE TABLE ddb_features_no_mapping
SELECT * FROM s3_features_no_mapping;
```

## Viewing the data in Amazon S3
<a name="EMRforDynamoDB.CopyingData.S3.ViewingData"></a>

If you use SSH to connect to the leader node, you can use the AWS Command Line Interface (AWS CLI) to access the data that Hive wrote to Amazon S3.

The following steps are written with the assumption you have copied data from DynamoDB to Amazon S3 using one of the procedures in this section.

1. If you are currently at the Hive command prompt, exit to the Linux command prompt.

   ```
   hive> exit;
   ```

1. List the contents of the hive-test directory in your Amazon S3 bucket. (This is where Hive copied the data from DynamoDB.)

   ```
   aws s3 ls s3://aws-logs-123456789012-us-west-2/hive-test/
   ```

   The response should look similar to this:

   `2016-11-01 23:19:54 81983 000000_0` 

   The file name (*000000\_0*) is system-generated.

1. (Optional) You can copy the data file from Amazon S3 to the local file system on the leader node. After you do this, you can use standard Linux command line utilities to work with the data in the file.

   ```
   aws s3 cp s3://aws-logs-123456789012-us-west-2/hive-test/000000_0 .
   ```

   The response should look similar to this:

   `download: s3://aws-logs-123456789012-us-west-2/hive-test/000000_0 to ./000000_0`
**Note**  
The local file system on the leader node has limited capacity. Do not use this command with files that are larger than the available space in the local file system.

# Copying data between DynamoDB and HDFS
<a name="EMRforDynamoDB.CopyingData.HDFS"></a>

If you have data in a DynamoDB table, you can use Hive to copy the data to the Hadoop Distributed File System (HDFS).

You might do this if you are running a MapReduce job that requires data from DynamoDB. If you copy the data from DynamoDB into HDFS, Hadoop can process it using all of the available nodes in the Amazon EMR cluster in parallel. When the MapReduce job is complete, you can then write the results from HDFS to DynamoDB.

In the following examples, Hive will read from and write to the following HDFS directory: `/user/hadoop/hive-test`

**Note**  
The examples in this section are written with the assumption you followed the steps in [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md) and you have an external table in DynamoDB named *ddb\_features*. 

**Topics**
+ [Copying data using the Hive default format](#EMRforDynamoDB.CopyingData.HDFS.DefaultFormat)
+ [Copying data with a user-specified format](#EMRforDynamoDB.CopyingData.HDFS.UserSpecifiedFormat)
+ [Copying data without a column mapping](#EMRforDynamoDB.CopyingData.HDFS.NoColumnMapping)
+ [Accessing the data in HDFS](#EMRforDynamoDB.CopyingData.HDFS.ViewingData)

## Copying data using the Hive default format
<a name="EMRforDynamoDB.CopyingData.HDFS.DefaultFormat"></a>

**Example From DynamoDB to HDFS**  
Use an `INSERT OVERWRITE` statement to write directly to HDFS.  

```
INSERT OVERWRITE DIRECTORY 'hdfs:///user/hadoop/hive-test'
SELECT * FROM ddb_features;
```
The data file in HDFS looks like this:  

```
920709^ASoldiers Farewell Hill^ASummit^ANM^A32.3564729^A-108.3300461^A6135
1178153^AJones Run^AStream^APA^A41.2120086^A-79.2592078^A1260
253838^ASentinel Dome^ASummit^ACA^A37.7229821^A-119.58433^A8133
264054^ANeversweet Gulch^AValley^ACA^A41.6565269^A-122.8361432^A2900
115905^AChacaloochee Bay^ABay^AAL^A30.6979676^A-87.9738853^A0
```
Each field is separated by an SOH character (start of heading, 0x01). In the file, SOH appears as **^A**.

**Example From HDFS to DynamoDB**  

1. Create an external table that maps to the unformatted data in HDFS.

   ```
   CREATE EXTERNAL TABLE hdfs_features_unformatted
       (feature_id       BIGINT,
       feature_name      STRING ,
       feature_class     STRING ,
       state_alpha       STRING,
       prim_lat_dec      DOUBLE ,
       prim_long_dec     DOUBLE ,
       elev_in_ft        BIGINT)
   LOCATION 'hdfs:///user/hadoop/hive-test';
   ```

1. Copy the data to DynamoDB.

   ```
   INSERT OVERWRITE TABLE ddb_features
   SELECT * FROM hdfs_features_unformatted;
   ```

## Copying data with a user-specified format
<a name="EMRforDynamoDB.CopyingData.HDFS.UserSpecifiedFormat"></a>

If you want to use a different field separator character, you can create an external table that maps to the HDFS directory. You might use this technique for creating data files with comma-separated values (CSV).

**Example From DynamoDB to HDFS**  

1. Create a Hive external table that maps to HDFS. When you do this, ensure that the data types are consistent with those of the DynamoDB external table.

   ```
   CREATE EXTERNAL TABLE hdfs_features_csv
       (feature_id       BIGINT,
       feature_name      STRING ,
       feature_class     STRING ,
       state_alpha       STRING,
       prim_lat_dec      DOUBLE ,
       prim_long_dec     DOUBLE ,
       elev_in_ft        BIGINT)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   LOCATION 'hdfs:///user/hadoop/hive-test';
   ```

1. Copy the data from DynamoDB.

   ```
   INSERT OVERWRITE TABLE hdfs_features_csv
   SELECT * FROM ddb_features;
   ```
The data file in HDFS looks like this:  

```
920709,Soldiers Farewell Hill,Summit,NM,32.3564729,-108.3300461,6135
1178153,Jones Run,Stream,PA,41.2120086,-79.2592078,1260
253838,Sentinel Dome,Summit,CA,37.7229821,-119.58433,8133
264054,Neversweet Gulch,Valley,CA,41.6565269,-122.8361432,2900
115905,Chacaloochee Bay,Bay,AL,30.6979676,-87.9738853,0
```

**Example From HDFS to DynamoDB**  
With a single HiveQL statement, you can populate the DynamoDB table using the data from HDFS:  

```
INSERT OVERWRITE TABLE ddb_features
SELECT * FROM hdfs_features_csv;
```

## Copying data without a column mapping
<a name="EMRforDynamoDB.CopyingData.HDFS.NoColumnMapping"></a>

You can copy data from DynamoDB in a raw format and write it to HDFS without specifying any data types or column mapping. You can use this method to create an archive of DynamoDB data and store it in HDFS.



**Note**  
If your DynamoDB table contains attributes of type Map, List, Boolean, or Null, then this is the only way you can use Hive to copy data from DynamoDB to HDFS.

**Example From DynamoDB to HDFS**  

1. Create an external table associated with your DynamoDB table. (There is no `dynamodb.column.mapping` in this HiveQL statement.)

   ```
   CREATE EXTERNAL TABLE ddb_features_no_mapping
       (item MAP<STRING, STRING>)
   STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
   TBLPROPERTIES ("dynamodb.table.name" = "Features");
   ```

   

1. Create another external table associated with your HDFS directory.

   ```
   CREATE EXTERNAL TABLE hdfs_features_no_mapping
       (item MAP<STRING, STRING>)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t'
   LINES TERMINATED BY '\n'
   LOCATION 'hdfs:///user/hadoop/hive-test';
   ```

1. Copy the data from DynamoDB to HDFS.

   ```
   INSERT OVERWRITE TABLE hdfs_features_no_mapping
   SELECT * FROM ddb_features_no_mapping;
   ```
The data file in HDFS looks like this:  

```
Name^C{"s":"Soldiers Farewell Hill"}^BState^C{"s":"NM"}^BClass^C{"s":"Summit"}^BElevation^C{"n":"6135"}^BLatitude^C{"n":"32.3564729"}^BId^C{"n":"920709"}^BLongitude^C{"n":"-108.3300461"}
Name^C{"s":"Jones Run"}^BState^C{"s":"PA"}^BClass^C{"s":"Stream"}^BElevation^C{"n":"1260"}^BLatitude^C{"n":"41.2120086"}^BId^C{"n":"1178153"}^BLongitude^C{"n":"-79.2592078"}
Name^C{"s":"Sentinel Dome"}^BState^C{"s":"CA"}^BClass^C{"s":"Summit"}^BElevation^C{"n":"8133"}^BLatitude^C{"n":"37.7229821"}^BId^C{"n":"253838"}^BLongitude^C{"n":"-119.58433"}
Name^C{"s":"Neversweet Gulch"}^BState^C{"s":"CA"}^BClass^C{"s":"Valley"}^BElevation^C{"n":"2900"}^BLatitude^C{"n":"41.6565269"}^BId^C{"n":"264054"}^BLongitude^C{"n":"-122.8361432"}
Name^C{"s":"Chacaloochee Bay"}^BState^C{"s":"AL"}^BClass^C{"s":"Bay"}^BElevation^C{"n":"0"}^BLatitude^C{"n":"30.6979676"}^BId^C{"n":"115905"}^BLongitude^C{"n":"-87.9738853"}
```
Each field begins with an STX character (start of text, 0x02) and ends with an ETX character (end of text, 0x03). In the file, STX appears as **^B** and ETX appears as **^C**.

**Example From HDFS to DynamoDB**  
With a single HiveQL statement, you can populate the DynamoDB table using the data from HDFS:  

```
INSERT OVERWRITE TABLE ddb_features_no_mapping
SELECT * FROM hdfs_features_no_mapping;
```

## Accessing the data in HDFS
<a name="EMRforDynamoDB.CopyingData.HDFS.ViewingData"></a>

HDFS is a distributed file system, accessible to all of the nodes in the Amazon EMR cluster. If you use SSH to connect to the leader node, you can use command line tools to access the data that Hive wrote to HDFS.

HDFS is not the same thing as the local file system on the leader node. You cannot work with files and directories in HDFS using standard Linux commands (such as `cat`, `cp`, `mv`, or `rm`). Instead, you perform these tasks using the `hadoop fs` command.

The following steps are written with the assumption you have copied data from DynamoDB to HDFS using one of the procedures in this section.

1. If you are currently at the Hive command prompt, exit to the Linux command prompt.

   ```
   hive> exit;
   ```

1. List the contents of the /user/hadoop/hive-test directory in HDFS. (This is where Hive copied the data from DynamoDB.)

   ```
   hadoop fs -ls /user/hadoop/hive-test
   ```

   The response should look similar to this:

   ```
   Found 1 items
   -rw-r--r-- 1 hadoop hadoop 29504 2016-06-08 23:40 /user/hadoop/hive-test/000000_0
   ```

   The file name (*000000\_0*) is system-generated.

1. View the contents of the file:

   ```
   hadoop fs -cat /user/hadoop/hive-test/000000_0
   ```
**Note**  
In this example, the file is relatively small (approximately 29 KB). Be careful when you use this command with files that are very large or contain non-printable characters.

1. (Optional) You can copy the data file from HDFS to the local file system on the leader node. After you do this, you can use standard Linux command line utilities to work with the data in the file.

   ```
   hadoop fs -get /user/hadoop/hive-test/000000_0
   ```

   This command will not overwrite an existing file. If a file named `000000_0` already exists in the local directory, the command fails.
**Note**  
The local file system on the leader node has limited capacity. Do not use this command with files that are larger than the available space in the local file system.

# Using data compression
<a name="EMRforDynamoDB.CopyingData.Compression"></a>

When you use Hive to copy data among different data sources, you can request on-the-fly data compression. Hive provides several compression codecs. You can choose one during your Hive session. When you do this, the data is compressed in the specified format. 

The following example compresses data using the Lempel-Ziv-Oberhumer (LZO) algorithm. 

```
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;

CREATE EXTERNAL TABLE lzo_compression_table (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE lzo_compression_table SELECT *
FROM hiveTableName;
```

The resulting file in Amazon S3 will have a system-generated name with `.lzo` at the end (for example, `8d436957-57ba-4af7-840c-96c2fc7bb6f5-000000.lzo`).

The available compression codecs are:
+ `org.apache.hadoop.io.compress.GzipCodec`
+ `org.apache.hadoop.io.compress.DefaultCodec`
+ `com.hadoop.compression.lzo.LzoCodec`
+ `com.hadoop.compression.lzo.LzopCodec`
+ `org.apache.hadoop.io.compress.BZip2Codec`
+ `org.apache.hadoop.io.compress.SnappyCodec`

# Reading non-printable UTF-8 character data
<a name="EMRforDynamoDB.CopyingData.NonPrintableData"></a>

To read and write non-printable UTF-8 character data, you can use the `STORED AS SEQUENCEFILE` clause when you create a Hive table. A SequenceFile is a Hadoop binary file format. You need to use Hadoop to read this file. The following example shows how to export data from DynamoDB into Amazon S3. You can use this functionality to handle non-printable UTF-8 encoded characters. 

```
CREATE EXTERNAL TABLE s3_export(a_col string, b_col bigint, c_col array<string>)
STORED AS SEQUENCEFILE
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3_export SELECT *
FROM hiveTableName;
```

# Performance tuning
<a name="EMRforDynamoDB.PerformanceTuning"></a>

When you create a Hive external table that maps to a DynamoDB table, you do not consume any read or write capacity from DynamoDB. However, read and write activity on the Hive table (such as `INSERT` or `SELECT`) translates directly into read and write operations on the underlying DynamoDB table.

Apache Hive on Amazon EMR implements its own logic for balancing the I/O load on the DynamoDB table and seeks to minimize the possibility of exceeding the table's provisioned throughput. At the end of each Hive query, Amazon EMR returns runtime metrics, including the number of times your provisioned throughput was exceeded. You can use this information, together with CloudWatch metrics on your DynamoDB table, to improve performance in subsequent requests.

The Amazon EMR console provides basic monitoring tools for your cluster. For more information, see [View and Monitor a Cluster](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-manage-view.html) in the *Amazon EMR Management Guide*.

You can also monitor your cluster and Hadoop jobs using web-based tools, such as Hue, Ganglia, and the Hadoop web interface. For more information, see [View Web Interfaces Hosted on Amazon EMR Clusters](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-web-interfaces.html) in the *Amazon EMR Management Guide*.

This section describes steps you can take to performance-tune Hive operations on external DynamoDB tables. 

**Topics**
+ [DynamoDB provisioned throughput](EMRforDynamoDB.PerformanceTuning.Throughput.md)
+ [Adjusting the mappers](EMRforDynamoDB.PerformanceTuning.Mappers.md)
+ [Additional topics](EMRforDynamoDB.PerformanceTuning.Misc.md)

# DynamoDB provisioned throughput
<a name="EMRforDynamoDB.PerformanceTuning.Throughput"></a>

When you issue HiveQL statements against the external DynamoDB table, the `DynamoDBStorageHandler` class makes the appropriate low-level DynamoDB API requests, which consume provisioned throughput. If there is not enough read or write capacity on the DynamoDB table, the request will be throttled, resulting in slow HiveQL performance. For this reason, you should ensure that the table has enough throughput capacity.

For example, suppose that you have provisioned 100 read capacity units for your DynamoDB table. This will let you read 409,600 bytes per second (100 × 4 KB read capacity unit size). Now suppose that the table contains 20 GB of data (21,474,836,480 bytes) and you want to use the `SELECT` statement to select all of the data using HiveQL. You can estimate how long the query will take to run like this:

*21,474,836,480 / 409,600 = 52,429 seconds = 14.56 hours*

In this scenario, the DynamoDB table is a bottleneck. It won't help to add more Amazon EMR nodes, because the Hive throughput is constrained to only 409,600 bytes per second. The only way to decrease the time required for the `SELECT` statement is to increase the provisioned read capacity of the DynamoDB table. 

You can perform a similar calculation to estimate how long it would take to bulk-load data into a Hive external table mapped to a DynamoDB table. Determine the number of write capacity units required per item (an item of less than 1 KB requires 1 unit, an item of 1-2 KB requires 2 units, and so on), and multiply that by the number of items to load. This gives you the total number of write capacity units required. Divide that total by the number of write capacity units allocated per second to estimate the number of seconds the load will take.
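The two estimates can be sketched in Python. The read-side numbers come from the scenario above; the item count, item size, and write capacity used in the write-side estimate are hypothetical:

```python
import math

# Read estimate: full table scan through Hive.
read_capacity_units = 100
bytes_per_rcu = 4096                 # 4 KB per read capacity unit
table_size_bytes = 20 * 1024**3      # 20 GB
read_bytes_per_sec = read_capacity_units * bytes_per_rcu
scan_seconds = table_size_bytes / read_bytes_per_sec
print(f"Scan: {scan_seconds:.0f} s ({scan_seconds / 3600:.2f} hours)")

# Write estimate: bulk load of 10 million items averaging 1.5 KB each
# (hypothetical numbers). A 1.5 KB item consumes 2 write capacity units.
items = 10_000_000
wcus_per_item = math.ceil(1.5)
write_capacity_units = 1000
load_seconds = items * wcus_per_item / write_capacity_units
print(f"Load: {load_seconds:.0f} s ({load_seconds / 3600:.2f} hours)")
```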

You should regularly monitor the CloudWatch metrics for your table. For a quick overview in the DynamoDB console, choose your table and then choose the **Metrics** tab. From here, you can view read and write capacity units consumed and read and write requests that have been throttled.

## Read capacity
<a name="EMRforDynamoDB.PerformanceTuning.Throughput.ReadCapacity"></a>

Amazon EMR manages the request load against your DynamoDB table, according to the table's provisioned throughput settings. However, if you notice a large number of `ProvisionedThroughputExceeded` messages in the job output, you can adjust the default read rate. To do this, you can modify the `dynamodb.throughput.read.percent` configuration variable. You can use the `SET` command to set this variable at the Hive command prompt:

```
SET dynamodb.throughput.read.percent=1.0;
```

This variable persists for the current Hive session only. If you exit Hive and return to it later, `dynamodb.throughput.read.percent` will return to its default value.

The value of `dynamodb.throughput.read.percent` can be between `0.1` and `1.5`, inclusive. `0.5` represents the default read rate, meaning that Hive will attempt to consume half of the read capacity of the table. If you increase the value above `0.5`, Hive will increase the request rate; decreasing the value below `0.5` decreases the read request rate. (The actual read rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)

If you notice that Hive is frequently depleting the provisioned read capacity of the table, or if your read requests are being throttled too much, try reducing `dynamodb.throughput.read.percent` below `0.5`. If you have sufficient read capacity in the table and want more responsive HiveQL operations, you can set the value above `0.5`.

## Write capacity
<a name="EMRforDynamoDB.PerformanceTuning.Throughput.WriteCapacity"></a>

Amazon EMR manages the request load against your DynamoDB table, according to the table's provisioned throughput settings. However, if you notice a large number of `ProvisionedThroughputExceeded` messages in the job output, you can adjust the default write rate. To do this, you can modify the `dynamodb.throughput.write.percent` configuration variable. You can use the `SET` command to set this variable at the Hive command prompt:

```
SET dynamodb.throughput.write.percent=1.0;
```

This variable persists for the current Hive session only. If you exit Hive and return to it later, `dynamodb.throughput.write.percent` will return to its default value.

The value of `dynamodb.throughput.write.percent` can be between `0.1` and `1.5`, inclusive. `0.5` represents the default write rate, meaning that Hive will attempt to consume half of the write capacity of the table. If you increase the value above `0.5`, Hive will increase the request rate; decreasing the value below `0.5` decreases the write request rate. (The actual write rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)

If you notice that Hive is frequently depleting the provisioned write capacity of the table, or if your write requests are being throttled too much, try reducing `dynamodb.throughput.write.percent` below `0.5`. If you have sufficient capacity in the table and want more responsive HiveQL operations, you can set the value above `0.5`.

When you write data to DynamoDB using Hive, ensure that the number of write capacity units is greater than the number of mappers in the cluster. For example, consider an Amazon EMR cluster consisting of 10 *m1.xlarge* nodes. The *m1.xlarge* node type provides 8 mapper tasks, so the cluster would have a total of 80 mappers (10 × 8). If your DynamoDB table has fewer than 80 write capacity units, then a Hive write operation could consume all of the write throughput for that table.

To determine the number of mappers for Amazon EMR node types, see [Task Configuration](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html) in the *Amazon EMR Developer Guide*.

For more information on mappers, see [Adjusting the mappers](EMRforDynamoDB.PerformanceTuning.Mappers.md).

# Adjusting the mappers
<a name="EMRforDynamoDB.PerformanceTuning.Mappers"></a>

When Hive launches a Hadoop job, the job is processed by one or more mapper tasks. Assuming that your DynamoDB table has sufficient throughput capacity, you can modify the number of mappers in the cluster, potentially improving performance.

**Note**  
The number of mapper tasks used in a Hadoop job is influenced by *input splits*, where Hadoop subdivides the data into logical blocks. If Hadoop does not perform enough input splits, then your write operations might not be able to consume all the write throughput available in the DynamoDB table. 

## Increasing the number of mappers
<a name="EMRforDynamoDB.PerformanceTuning.Mappers.Increasing"></a>

Each mapper in an Amazon EMR cluster has a maximum read rate of 1 MiB per second. The number of mappers in a cluster depends on the size of the nodes in your cluster. (For information about node sizes and the number of mappers per node, see [Task Configuration](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html) in the *Amazon EMR Developer Guide*.)

If your DynamoDB table has ample throughput capacity for reads, you can try increasing the number of mappers by doing one of the following:
+ Increase the size of the nodes in your cluster. For example, if your cluster is using *m1.large* nodes (three mappers per node), you can try upgrading to *m1.xlarge* nodes (eight mappers per node).
+ Increase the number of nodes in your cluster. For example, if you have a three-node cluster of *m1.xlarge* nodes, you have a total of 24 mappers available. If you were to double the size of the cluster, with the same type of node, you would have 48 mappers.

You can use the AWS Management Console to manage the size or the number of nodes in your cluster. (You might need to restart the cluster for these changes to take effect.)

Another way to increase the number of mappers is to modify the `mapred.tasktracker.map.tasks.maximum` Hadoop configuration parameter. (This is a Hadoop parameter, not a Hive parameter, so you cannot modify it interactively from the command prompt.) If you increase the value of `mapred.tasktracker.map.tasks.maximum`, you can increase the number of mappers without increasing the size or number of nodes. However, the cluster nodes can run out of memory if you set the value too high.

You set the value for `mapred.tasktracker.map.tasks.maximum` as a bootstrap action when you first launch your Amazon EMR cluster. For more information, see [(Optional) Create Bootstrap Actions to Install Additional Software](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-bootstrap.html) in the *Amazon EMR Management Guide*.

## Decreasing the number of mappers
<a name="EMRforDynamoDB.PerformanceTuning.Mappers.Decreasing"></a>

If you use the `SELECT` statement to select data from an external Hive table that maps to DynamoDB, the Hadoop job can use as many tasks as necessary, up to the maximum number of mappers in the cluster. In this scenario, it is possible that a long-running Hive query can consume all of the provisioned read capacity of the DynamoDB table, negatively impacting other users.

You can use the `dynamodb.max.map.tasks` parameter to set an upper limit for map tasks:

```
SET dynamodb.max.map.tasks=1;
```

This value must be equal to or greater than 1. When Hive processes your query, the resulting Hadoop job will use no more than `dynamodb.max.map.tasks` map tasks when reading from the DynamoDB table.

# Additional topics
<a name="EMRforDynamoDB.PerformanceTuning.Misc"></a>

The following are some more ways to tune applications that use Hive to access DynamoDB.

## Retry duration
<a name="EMRforDynamoDB.PerformanceTuning.Misc.RetryDuration"></a>

By default, Hive will rerun a Hadoop job if it has not returned any results from DynamoDB within two minutes. You can adjust this interval by modifying the `dynamodb.retry.duration` parameter:

```
SET dynamodb.retry.duration=2;
```

The value must be a nonzero integer, representing the number of minutes in the retry interval. The default for `dynamodb.retry.duration` is 2 (minutes).

## Parallel data requests
<a name="EMRforDynamoDB.PerformanceTuning.Misc.ParallelDataRequests"></a>

Multiple data requests to a single table, whether from more than one user or more than one application, can drain provisioned read throughput and slow performance. 

## Process duration
<a name="EMRforDynamoDB.PerformanceTuning.Misc.ProcessDuration"></a>

Data consistency in DynamoDB depends on the order of read and write operations on each node. While a Hive query is in progress, another application might load new data into the DynamoDB table or modify or delete existing data. In this case, the results of the Hive query might not reflect changes made to the data while the query was running. 

## Request time
<a name="EMRforDynamoDB.PerformanceTuning.Misc.RequestTime"></a>

Scheduling Hive queries that access a DynamoDB table when there is lower demand on the DynamoDB table improves performance. For example, if most of your application's users live in San Francisco, you might choose to export daily data at 4:00 A.M. PST when the majority of users are asleep and not updating records in your DynamoDB database. 



# Integrating DynamoDB with Amazon S3
<a name="S3forDynamoDB"></a>

Amazon DynamoDB import and export capabilities provide a simple and efficient way to move data between Amazon S3 and DynamoDB tables without writing any code.

DynamoDB import and export features help you move, transform, and copy DynamoDB table data. You can import from your Amazon S3 sources, and you can export your DynamoDB table data to Amazon S3 and use AWS services such as Athena, Amazon SageMaker AI, and AWS Lake Formation to analyze your data and extract actionable insights. You can also import data directly into new DynamoDB tables to build new applications with single-digit millisecond performance at scale, facilitate data sharing between tables and accounts, and simplify your disaster recovery and business continuity plans.

**Topics**
+ [Import from Amazon S3](S3DataImport.HowItWorks.md)
+ [Export to Amazon S3](S3DataExport.HowItWorks.md)

# DynamoDB data import from Amazon S3: how it works
<a name="S3DataImport.HowItWorks"></a>

To import data into DynamoDB, your data must be in an Amazon S3 bucket in CSV, DynamoDB JSON, or Amazon Ion format. Data can be compressed in ZSTD or GZIP format, or can be directly imported in uncompressed form. Source data can either be a single Amazon S3 object or multiple Amazon S3 objects that use the same prefix.

Your data will be imported into a new DynamoDB table, which will be created when you initiate the import request. You can create this table with secondary indexes, then query and update your data across all primary and secondary indexes as soon as the import is complete. You can also add a global table replica after the import is complete. 

**Note**  
During the Amazon S3 import process, DynamoDB creates a new target table into which the data is imported. Import into existing tables is not currently supported by this feature.

Import from Amazon S3 does not consume write capacity on the new table, so you do not need to provision any extra capacity for importing data into DynamoDB. Data import pricing is based on the uncompressed size of the source data in Amazon S3 that is processed as a result of the import. Items that are processed but fail to load into the table due to formatting or other inconsistencies in the source data are also billed as part of the import process. See [Amazon DynamoDB pricing](https://aws.amazon.com/dynamodb/pricing) for details.

You can import data from an Amazon S3 bucket owned by a different account if you have the correct permissions to read from that specific bucket. The new table may also be in a different Region from the source Amazon S3 bucket. For more information, see [Amazon Simple Storage Service setup and permissions ](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access.html).

Import times are directly related to your data’s characteristics in Amazon S3. This includes data size, data format, compression scheme, uniformity of data distribution, number of Amazon S3 objects, and other related variables. In particular, data sets with uniformly distributed keys will be faster to import than skewed data sets. For example, if your secondary index's key is using the month of the year for partitioning, and all your data is from the month of December, then importing this data may take significantly longer. 

The attributes associated with keys are expected to be unique on the base table. If any keys are not unique, the import will overwrite the associated items until only the last overwrite remains. For example, if the primary key is the month and multiple items are set to the month of September, each new item will overwrite the previously written items and only one item with the primary key of "month" set to September will remain. In such cases, the number of items processed in the import table description will not match the number of items in the target table. 
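This last-writer-wins behavior can be sketched in a few lines. The following is an illustration only, not the actual import implementation:

```python
# Sketch of the import's overwrite behavior for duplicate keys.
def simulate_import(items, key_attr):
    """Load items keyed by their primary key; later duplicates overwrite earlier ones."""
    table = {}
    for item in items:
        table[item[key_attr]] = item  # a duplicate key overwrites the prior item
    return table

source = [
    {"month": "September", "value": 1},
    {"month": "October", "value": 2},
    {"month": "September", "value": 3},  # overwrites the first September item
]
result = simulate_import(source, "month")
# 3 items are processed, but only 2 remain in the table
```

This is why the processed item count in the import description can exceed the item count of the target table.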

AWS CloudTrail logs all console and API actions for table import. For more information, see [Logging DynamoDB operations by using AWS CloudTrail](logging-using-cloudtrail.md).

The following video is an introduction to importing directly from Amazon S3 into DynamoDB.

[![AWS Videos](http://img.youtube.com/vi/fqq0CMOnOaI/0.jpg)](http://www.youtube.com/watch?v=fqq0CMOnOaI)


**Topics**
+ [Requesting a table import in DynamoDB](S3DataImport.Requesting.md)
+ [Amazon S3 import formats for DynamoDB](S3DataImport.Format.md)
+ [Import format quotas and validation](S3DataImport.Validation.md)
+ [Best practices for importing from Amazon S3 into DynamoDB](S3DataImport.BestPractices.md)

# Requesting a table import in DynamoDB
<a name="S3DataImport.Requesting"></a>

DynamoDB import allows you to import data from an Amazon S3 bucket to a new DynamoDB table. You can request a table import using the [DynamoDB console](https://console.aws.amazon.com/), the [CLI](AccessingDynamoDB.md#Tools.CLI), [CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-dynamodb-table.html), or the [DynamoDB API](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/Welcome.html).

If you want to use the AWS CLI, you must configure it first. For more information, see [Accessing DynamoDB](AccessingDynamoDB.md).

**Note**  
The Import Table feature interacts with multiple AWS services, such as Amazon S3 and CloudWatch. Before you begin an import, make sure that the user or role that invokes the import APIs has permissions for all of the services and resources that the feature depends on. 
 Do not modify the Amazon S3 objects while the import is in progress, as this can cause the operation to fail or be cancelled.
For more information on errors and troubleshooting, see [Import format quotas and validation](S3DataImport.Validation.md).

**Topics**
+ [Setting up IAM permissions](#DataImport.Requesting.Permissions)
+ [Requesting an import using the AWS Management Console](#S3DataImport.Requesting.Console)
+ [Getting details about past imports in the AWS Management Console](#S3DataImport.Requesting.Console.Details)
+ [Requesting an import using the AWS CLI](#S3DataImport.Requesting.CLI)
+ [Getting details about past imports in the AWS CLI](#S3DataImport.Requesting.CLI.Details)

## Setting up IAM permissions
<a name="DataImport.Requesting.Permissions"></a>

You can import data from any Amazon S3 bucket you have permission to read from. The source bucket does not need to be in the same Region or have the same owner as the new table. Your AWS Identity and Access Management (IAM) policy must include the relevant actions on the source Amazon S3 bucket, and the CloudWatch permissions required to provide debugging information. An example policy is shown below.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Sid": "AllowDynamoDBImportAction",
      "Effect": "Allow",
      "Action": [
        "dynamodb:ImportTable",
        "dynamodb:DescribeImport"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/my-table*"
    },
    {
      "Sid": "AllowS3Access",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket/*",
        "arn:aws:s3:::your-bucket"
      ]
    },
    {
      "Sid": "AllowCloudwatchAccess",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
        "logs:PutLogEvents",
        "logs:PutRetentionPolicy"
      ],
      "Resource": "arn:aws:logs:us-east-1:111122223333:log-group/aws-dynamodb/*"
    },
    {
      "Sid": "AllowDynamoDBListImports",
      "Effect": "Allow",
      "Action": "dynamodb:ListImports",
      "Resource": "*"
    }
  ]
}
```

------

### Amazon S3 permissions
<a name="DataImport.Requesting.Permissions.s3"></a>

When starting an import from an Amazon S3 bucket that is owned by another account, ensure that the role or user has access to the Amazon S3 objects. You can verify this by running an Amazon S3 `GetObject` command with the same credentials. When using the API, the Amazon S3 bucket owner parameter defaults to the current user's account ID. For cross-account imports, ensure that this parameter is populated with the bucket owner's account ID. The following code is an example Amazon S3 bucket policy in the source account.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {"Sid": "ExampleStatement",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:user/Dave"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
        }
    ]
}
```

------

### AWS Key Management Service
<a name="DataImport.Requesting.Permissions.kms"></a>

When creating the new table for the import, if you select an encryption at rest key that is not owned by DynamoDB, you must provide the AWS KMS permissions required to operate a DynamoDB table encrypted with customer managed keys. For more information, see [Authorizing use of your AWS KMS key](encryption.usagenotes.html#dynamodb-kms-authz). If the Amazon S3 objects are encrypted with server-side encryption using AWS KMS keys (SSE-KMS), ensure that the role or user initiating the import can decrypt using the AWS KMS key. This feature does not support Amazon S3 objects encrypted with customer-provided keys (SSE-C). 

### CloudWatch permissions
<a name="DataImport.Requesting.Permissions.cw"></a>

The role or user that initiates the import needs permissions to create and manage the log group and log streams associated with the import. 

## Requesting an import using the AWS Management Console
<a name="S3DataImport.Requesting.Console"></a>

The following example demonstrates how to use the DynamoDB console to import existing data to a new table named `MusicCollection`.

**To request a table import**

1. Sign in to the AWS Management Console and open the DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

1. In the navigation pane on the left side of the console, choose **Import from S3**.

1. Choose **Import from S3**.

1. In **Source S3 URL**, enter the Amazon S3 source URL.

   If you own the source bucket, choose **Browse S3** to search for it. Alternatively, enter the bucket's URL in the following format – `s3://bucket/prefix`. The `prefix` is an Amazon S3 key prefix. It's either the Amazon S3 object name that you want to import or the key prefix shared by all the Amazon S3 objects that you want to import.
**Note**  
You can't use the same prefix as your DynamoDB export request. The export feature creates a folder structure and manifest files for all the exports. If you use the same Amazon S3 path, it will result in an error.   
Instead, point the import at the folder that contains data from that specific export. The format of the correct path in this case is `s3://bucket/prefix/AWSDynamoDB/<XXXXXXXX-XXXXXX>/data/`, where `XXXXXXXX-XXXXXX` is the export ID. You can find the export ID in the export ARN, which has the following format – `arn:aws:dynamodb:<Region>:<AccountID>:table/<TableName>/export/<XXXXXXXX-XXXXXX>`. For example, `arn:aws:dynamodb:us-east-1:123456789012:table/ProductCatalog/export/01234567890123-a1b2c3d4`.

1. Specify if you are the **S3 bucket owner**. If the source bucket is owned by a different account, select **A different AWS account**. Then enter the account ID of the bucket owner.

1. Under **Import file compression**, select **No compression**, **GZIP**, or **ZSTD**, as appropriate.

1. Select the appropriate **Import file format**. The options are **DynamoDB JSON**, **Amazon Ion**, or **CSV**. If you select **CSV**, you have two additional options: **CSV header** and **CSV delimiter character**.

   For **CSV header**, choose whether the header is taken from the first line of the file or customized. If you select **Customize your headers**, you can specify the header values to import with. CSV headers specified by this method are case sensitive and are expected to contain the keys of the target table. 

   For **CSV delimiter character**, you set the character that separates values within an item. Comma is selected by default. If you select **Custom delimiter character**, the delimiter must match the regex pattern `[,;:|\t ]`.

1. Choose **Next** and select the options for the new table that will be created to store your data. 
**Note**  
The partition key and sort key must match the attribute names in the file, or the import will fail. The attribute names are case sensitive.

1. Choose **Next** again to review your import options, then choose **Import** to begin the import task. You will first see your new table listed in the **Tables** pane with the status **Creating**. At this time the table is not accessible.

1. When the import completes, the status shows as **Active** and you can start using the table.
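Building the import path for a previous export, as described in the note above, can be done mechanically from the export ARN. The following is a hypothetical helper (not part of any AWS SDK) that sketches the derivation:

```python
# Hypothetical helper: derive the S3 import prefix for a previous export,
# based on the ARN and data-path formats described above.
def export_data_prefix(bucket, prefix, export_arn):
    """Return s3://bucket/prefix/AWSDynamoDB/<export-id>/data/ for an export ARN."""
    # The export ID is the final path element of the export ARN.
    export_id = export_arn.rsplit("/", 1)[-1]
    return f"s3://{bucket}/{prefix}/AWSDynamoDB/{export_id}/data/"

arn = "arn:aws:dynamodb:us-east-1:123456789012:table/ProductCatalog/export/01234567890123-a1b2c3d4"
print(export_data_prefix("bucket", "prefix", arn))
# s3://bucket/prefix/AWSDynamoDB/01234567890123-a1b2c3d4/data/
```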

## Getting details about past imports in the AWS Management Console
<a name="S3DataImport.Requesting.Console.Details"></a>

You can find information about import tasks you've run in the past by choosing **Import from S3** in the navigation sidebar, then selecting the **Imports** tab. The import panel contains a list of all imports you've created in the past 90 days. Selecting the ARN of a task listed in the **Imports** tab retrieves information about that import, including any advanced configuration settings you chose.

## Requesting an import using the AWS CLI
<a name="S3DataImport.Requesting.CLI"></a>

The following example imports CSV-formatted data from an S3 bucket named `bucket` with a prefix of `prefix` to a new table named `target-table`.

```
aws dynamodb import-table --s3-bucket-source S3Bucket=bucket,S3KeyPrefix=prefix \
    --input-format CSV \
    --table-creation-parameters '{"TableName":"target-table","KeySchema":[{"AttributeName":"hk","KeyType":"HASH"}],"AttributeDefinitions":[{"AttributeName":"hk","AttributeType":"S"}],"BillingMode":"PAY_PER_REQUEST"}' \
    --input-format-options '{"Csv": {"HeaderList": ["hk", "title", "artist", "year_of_release"], "Delimiter": ";"}}'
```

**Note**  
If you choose to encrypt your import using a key protected by AWS Key Management Service (AWS KMS), the key must be in the same Region as the destination Amazon S3 bucket.

## Getting details about past imports in the AWS CLI
<a name="S3DataImport.Requesting.CLI.Details"></a>

You can find information about import tasks you've run in the past by using the `list-imports` command. This command returns a list of all imports you've created in the past 90 days. Note that although import task metadata expires after 90 days and jobs older than that are no longer found on this list, DynamoDB does not delete any of the objects in your Amazon S3 bucket or the table created during import.

```
aws dynamodb list-imports
```

To retrieve detailed information about a specific import task, including any advanced configuration settings, use the `describe-import` command.

```
aws dynamodb describe-import \
    --import-arn arn:aws:dynamodb:us-east-1:123456789012:table/ProductCatalog/exp
```

# Amazon S3 import formats for DynamoDB
<a name="S3DataImport.Format"></a>

DynamoDB can import data in three formats: CSV, DynamoDB JSON, and Amazon Ion.

**Topics**
+ [CSV](#S3DataImport.Requesting.Formats.CSV)
+ [DynamoDB JSON](#S3DataImport.Requesting.Formats.DDBJson)
+ [Amazon Ion](#S3DataImport.Requesting.Formats.Ion)

## CSV
<a name="S3DataImport.Requesting.Formats.CSV"></a>

A file in CSV format consists of multiple items delimited by newlines. By default, DynamoDB interprets the first line of an import file as the header and expects columns to be delimited by commas. You can also define headers that will be applied, as long as they match the number of columns in the file. If you define headers explicitly, the first line of the file will be imported as values. 

**Note**  
When importing from CSV files, all columns other than the hash and range keys of your base table and secondary indexes are imported as DynamoDB strings.

**Escaping double quotes**

Any double quote characters in the CSV file must be escaped. If they are not escaped, as in the following example, the import will fail: 

```
id,value
"123",Women's Full "Length" Dress
```

This same import will succeed if the quotes are escaped with two sets of double quotes:

```
id,value
"""123""","Women's Full ""Length"" Dress"
```

Once the text has been properly escaped and imported, it will appear as it did in the original CSV file:

```
id,value
"123",Women's Full "Length" Dress
```
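Doubling embedded quotes is standard CSV escaping, and most CSV libraries produce it automatically. This Python sketch (an illustration only, not part of the import feature) shows the round trip:

```python
import csv
import io

# Writing a value that contains double quotes: the csv module doubles them
# and wraps the field in quotes, matching the escaped form shown above.
buf = io.StringIO()
csv.writer(buf).writerow(["123", 'Women\'s Full "Length" Dress'])
escaped = buf.getvalue().strip()
print(escaped)  # 123,"Women's Full ""Length"" Dress"

# Reading the escaped line recovers the original text.
row = next(csv.reader(io.StringIO(escaped)))
print(row[1])  # Women's Full "Length" Dress
```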

**Importing heterogeneous item types**

You can use a single CSV file to import different item types into one table. Define a header row that includes all attributes across your item types, and leave columns empty for attributes that don't apply to a given item. Empty columns are omitted from the imported item rather than stored as empty strings.

```
PK,SK,EntityType,Name,Email,OrderDate,Amount,ProductName,Quantity
USER#1,PROFILE,User,Alice,alice@example.com,,,,
USER#1,ORDER#2024-01-15,Order,,,2024-01-15,99.99,,
USER#1,ORDER#2024-02-10,Order,,,2024-02-10,149.50,,
PRODUCT#101,METADATA,Product,,,,,Laptop,50
PRODUCT#102,METADATA,Product,,,,,Mouse,200
USER#2,PROFILE,User,Bob,bob@example.com,,,,
USER#2,ORDER#2024-01-20,Order,,,2024-01-20,75.00,,
PRODUCT#103,METADATA,Product,,,,,Keyboard,150
USER#3,PROFILE,User,Charlie,charlie@example.com,,,,
PRODUCT#104,METADATA,Product,,,,,Monitor,30
```

In this example, user profiles, orders, and products share the same table. Each item type uses only the columns relevant to it.
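Conceptually, the import drops empty columns rather than storing empty strings. A minimal Python sketch of that behavior (an illustration, not the actual import implementation):

```python
import csv
import io

# A small slice of the heterogeneous CSV above.
csv_data = """PK,SK,EntityType,Name,Email
USER#1,PROFILE,User,Alice,alice@example.com
PRODUCT#101,METADATA,Product,,
"""

# Empty columns are omitted from each item, mirroring the import behavior.
items = [
    {k: v for k, v in row.items() if v != ""}
    for row in csv.DictReader(io.StringIO(csv_data))
]
print(items[1])  # {'PK': 'PRODUCT#101', 'SK': 'METADATA', 'EntityType': 'Product'}
```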

## DynamoDB JSON
<a name="S3DataImport.Requesting.Formats.DDBJson"></a>

A file in DynamoDB JSON format can consist of multiple Item objects. Each individual object is in DynamoDB's standard marshalled JSON format, and newlines are used as item delimiters. As an added feature, point-in-time exports are supported as an import source by default.

**Note**  
New lines are used as item delimiters for a file in DynamoDB JSON format and shouldn't be used within an item object.

```
{"Item": {"Authors": {"SS": ["Author1", "Author2"]}, "Dimensions": {"S": "8.5 x 11.0 x 1.5"}, "ISBN": {"S": "333-3333333333"}, "Id": {"N": "103"}, "InPublication": {"BOOL": false}, "PageCount": {"N": "600"}, "Price": {"N": "2000"}, "ProductCategory": {"S": "Book"}, "Title": {"S": "Book 103 Title"}}}
{"Item": {"Authors": {"SS": ["Author1", "Author2"]}, "Dimensions": {"S": "8.5 x 11.0 x 1.5"}, "ISBN": {"S": "444-444444444"}, "Id": {"N": "104"}, "InPublication": {"BOOL": false}, "PageCount": {"N": "600"}, "Price": {"N": "2000"}, "ProductCategory": {"S": "Book"}, "Title": {"S": "Book 104 Title"}}}
{"Item": {"Authors": {"SS": ["Author1", "Author2"]}, "Dimensions": {"S": "8.5 x 11.0 x 1.5"}, "ISBN": {"S": "555-5555555555"}, "Id": {"N": "105"}, "InPublication": {"BOOL": false}, "PageCount": {"N": "600"}, "Price": {"N": "2000"}, "ProductCategory": {"S": "Book"}, "Title": {"S": "Book 105 Title"}}}
```
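Because each line is a self-contained marshalled item, a file in this format can be parsed line by line with an ordinary JSON parser. A Python sketch (an illustration, not the import implementation):

```python
import json

# Two newline-delimited marshalled items, as in the example above.
ddb_json_lines = """\
{"Item": {"Id": {"N": "103"}, "Title": {"S": "Book 103 Title"}}}
{"Item": {"Id": {"N": "104"}, "Title": {"S": "Book 104 Title"}}}
"""

# Each line is one item object; parse them independently.
items = [json.loads(line)["Item"] for line in ddb_json_lines.splitlines() if line]
print(items[0]["Title"]["S"])  # Book 103 Title
```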

## Amazon Ion
<a name="S3DataImport.Requesting.Formats.Ion"></a>

[Amazon Ion](https://amzn.github.io/ion-docs/) is a richly-typed, self-describing, hierarchical data serialization format built to address rapid development, decoupling, and efficiency challenges faced every day while engineering large-scale, service-oriented architectures.

When you import data in Ion format, the Ion datatypes are mapped to DynamoDB datatypes in the new DynamoDB table.


| Ion data type | DynamoDB representation | 
| --- | --- | 
| `string` | String (S) | 
| `bool` | Boolean (BOOL) | 
| `decimal` | Number (N) | 
| `blob` | Binary (B) | 
| `list` (with type annotation `$dynamodb_SS`, `$dynamodb_NS`, or `$dynamodb_BS`) | Set (SS, NS, BS) | 
| `list` | List | 
| `struct` | Map | 

Items in an Ion file are delimited by newlines. Each line begins with an Ion version marker, followed by an item in Ion format.

**Note**  
In the following example, we've formatted items from an Ion-formatted file on multiple lines to improve readability.

```
$ion_1_0
[
  {
    Item:{
      Authors:$dynamodb_SS::["Author1","Author2"],
      Dimensions:"8.5 x 11.0 x 1.5",
      ISBN:"333-3333333333",
      Id:103.,
      InPublication:false,
      PageCount:6d2,
      Price:2d3,
      ProductCategory:"Book",
      Title:"Book 103 Title"
    }
  },
  {
    Item:{
      Authors:$dynamodb_SS::["Author1","Author2"],
      Dimensions:"8.5 x 11.0 x 1.5",
      ISBN:"444-4444444444",
      Id:104.,
      InPublication:false,
      PageCount:6d2,
      Price:2d3,
      ProductCategory:"Book",
      Title:"Book 104 Title"
    }
  },
  {
    Item:{
      Authors:$dynamodb_SS::["Author1","Author2"],
      Dimensions:"8.5 x 11.0 x 1.5",
      ISBN:"555-5555555555",
      Id:105.,
      InPublication:false,
      PageCount:6d2,
      Price:2d3,
      ProductCategory:"Book",
      Title:"Book 105 Title"
    }
  }
]
```

# Import format quotas and validation
<a name="S3DataImport.Validation"></a>

## Import quotas
<a name="S3DataImport.Validation.limits"></a>

DynamoDB import from Amazon S3 supports up to 50 concurrent import jobs, with a total import source object size of 15 TB at a time, in the us-east-1, us-west-2, and eu-west-1 Regions. In all other Regions, up to 50 concurrent import tasks with a total size of 1 TB are supported. Each import job can include up to 50,000 Amazon S3 objects in all Regions. These default quotas apply to every account. If you need higher quotas, contact your account team; requests are considered on a case-by-case basis. For more details on DynamoDB limits, see [Service Quotas](ServiceQuotas.html).

## Validation errors
<a name="S3DataImport.Validation.Errors"></a>

During the import process, DynamoDB may encounter errors while parsing your data. For each error, DynamoDB emits a CloudWatch log and keeps a count of the total number of errors encountered. If the Amazon S3 object itself is malformed, or if its contents cannot form a DynamoDB item, DynamoDB might skip processing the remaining portion of the object.

**Note**  
If the Amazon S3 data source has multiple items that share the same key, the items overwrite each other until only one remains. This can appear as if one item was imported and the others were ignored. Duplicate items are overwritten in random order, are not counted as errors, and are not emitted to CloudWatch logs.  
Once the import is complete, you can see the total count of items imported, the total count of errors, and the total count of items processed. For further troubleshooting, you can also check the total size of items imported and the total size of data processed.

There are three categories of import errors: API validation errors, data validation errors, and configuration errors.

### API validation errors
<a name="S3DataImport.Validation.Errors.API"></a>

API validation errors are item-level errors from the sync API. Common causes are permissions issues, missing required parameters, and parameter validation failures. Details on why the API call failed are contained in the exceptions thrown by the `ImportTable` request.

### Data validation errors
<a name="S3DataImport.Validation.Errors.Data"></a>

Data validation errors can occur at either the item level or the file level. During import, items are validated based on DynamoDB rules before being imported into the target table. When an item fails validation and is not imported, the import job skips that item and continues with the next one. At the end of the job, the import status is set to FAILED with a `FailureCode` of `ItemValidationError` and the `FailureMessage` "Some of the items failed validation checks and were not imported. Please check CloudWatch error logs for more details."

 Common causes for data validation errors include objects that cannot be parsed, objects in the incorrect format (the input specifies DYNAMODB_JSON but the object is not in DYNAMODB_JSON), and a mismatch between the schema and the specified source table keys.

### Configuration errors
<a name="S3DataImport.Validation.Errors.Configuration"></a>

Configuration errors are typically workflow errors due to permission validation. The Import workflow checks some permissions after accepting the request. If there are issues calling any of the required dependencies like Amazon S3 or CloudWatch the process marks the import status as FAILED. The `failureCode` and `failureMessage` point to the reason for failure. Where applicable, the failure message also contains the request id that you can use to investigate the reason for failure in CloudTrail.

Common configuration errors include having the wrong URL for the Amazon S3 bucket, and not having permission to access the Amazon S3 bucket, CloudWatch Logs, or the AWS KMS keys used to decrypt the Amazon S3 object. For more information, see [Using AWS KMS keys and data keys](encryption.usagenotes.html#dynamodb-kms). 

### Validating source Amazon S3 objects
<a name="S3DataImport.Validation.Errors.S3Objects"></a>

To validate source S3 objects, take the following steps.

1. Validate the data format and compression type 
   + Make sure that all matching Amazon S3 objects under the specified prefix have the same format (DYNAMODB_JSON, DYNAMODB_ION, CSV)
   + Make sure that all matching Amazon S3 objects under the specified prefix are compressed the same way (GZIP, ZSTD, NONE)
**Note**  
The Amazon S3 objects do not need to have the file extension (.csv, .json, .ion, .gz, .zstd, and so on) corresponding to the input format, because the input format specified in the `ImportTable` call takes precedence.

1. Validate that the import data conforms to the desired table schema
   + Make sure that each item in the source data has the primary key. A sort key is optional for imports.
   + Make sure that the attribute type associated with the primary key and any sort key matches the attribute type in the Table and the GSI schema, as specified in table creation parameters

### Troubleshooting
<a name="S3DataImport.Validation.Troubleshooting"></a>

#### CloudWatch logs
<a name="S3DataImport.Validation.Troubleshooting.Cloudwatch"></a>

For import jobs that fail, detailed error messages are posted to CloudWatch Logs. To access these logs, first retrieve the `ImportArn` from the output of the `import-table` request and run `describe-import` with this command:

```
aws dynamodb describe-import --import-arn arn:aws:dynamodb:us-east-1:ACCOUNT:table/target-table/import/01658528578619-c4d4e311
```

Example output:

```
aws dynamodb describe-import --import-arn "arn:aws:dynamodb:us-east-1:531234567890:table/target-table/import/01658528578619-c4d4e311"
{
    "ImportTableDescription": {
        "ImportArn": "arn:aws:dynamodb:us-east-1:ACCOUNT:table/target-table/import/01658528578619-c4d4e311",
        "ImportStatus": "FAILED",
        "TableArn": "arn:aws:dynamodb:us-east-1:ACCOUNT:table/target-table",
        "TableId": "7b7ecc22-302f-4039-8ea9-8e7c3eb2bcb8",
        "ClientToken": "30f8891c-e478-47f4-af4a-67a5c3b595e3",
        "S3BucketSource": {
            "S3BucketOwner": "ACCOUNT",
            "S3Bucket": "my-import-source",
            "S3KeyPrefix": "import-test"
        },
        "ErrorCount": 1,
        "CloudWatchLogGroupArn": "arn:aws:logs:us-east-1:ACCOUNT:log-group:/aws-dynamodb/imports:*",
        "InputFormat": "CSV",
        "InputCompressionType": "NONE",
        "TableCreationParameters": {
            "TableName": "target-table",
            "AttributeDefinitions": [
                {
                    "AttributeName": "pk",
                    "AttributeType": "S"
                }
            ],
            "KeySchema": [
                {
                    "AttributeName": "pk",
                    "KeyType": "HASH"
                }
            ],
            "BillingMode": "PAY_PER_REQUEST"
        },
        "StartTime": 1658528578.619,
        "EndTime": 1658528750.628,
        "ProcessedSizeBytes": 70,
        "ProcessedItemCount": 1,
        "ImportedItemCount": 0,
        "FailureCode": "ItemValidationError",
        "FailureMessage": "Some of the items failed validation checks and were not imported. Please check CloudWatch error logs for more details."
    }
}
```

Retrieve the log group and the import ID from the above response and use them to retrieve the error logs. The import ID is the last path element of the `ImportArn` field. The log group name is `/aws-dynamodb/imports`. The error log stream name is `import-id/error`. For this example, it would be `01658528578619-c4d4e311/error`. 
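Deriving the error log stream name from an import ARN is mechanical. The following hypothetical helper (not part of any AWS SDK) applies the naming convention described above:

```python
# Hypothetical helper: build the CloudWatch error log stream name for an
# import, following the <import-id>/error convention described above.
def error_log_stream(import_arn):
    """The import ID is the last path element of the ARN; the stream is <id>/error."""
    import_id = import_arn.rsplit("/", 1)[-1]
    return f"{import_id}/error"

arn = "arn:aws:dynamodb:us-east-1:ACCOUNT:table/target-table/import/01658528578619-c4d4e311"
print(error_log_stream(arn))  # 01658528578619-c4d4e311/error
```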

#### Missing the key pk in the item
<a name="S3DataImport.Validation.Troubleshooting.Missing"></a>

If the source S3 object does not contain the primary key that was provided as a parameter, the import will fail. For example, suppose you define the primary key for the import as the column named “pk”.

```
aws dynamodb import-table --s3-bucket-source S3Bucket=my-import-source,S3KeyPrefix=import-test.csv \
    --input-format CSV \
    --table-creation-parameters '{"TableName":"target-table","KeySchema":[{"AttributeName":"pk","KeyType":"HASH"}],"AttributeDefinitions":[{"AttributeName":"pk","AttributeType":"S"}],"BillingMode":"PAY_PER_REQUEST"}'
```

The column “pk” is missing from the source object `import-test.csv`, which has the following contents:

```
title,artist,year_of_release
The Dark Side of the Moon,Pink Floyd,1973
```

This import will fail with an item validation error because the primary key is missing from the data source.

Example CloudWatch error log:

```
aws logs get-log-events --log-group-name /aws-dynamodb/imports --log-stream-name 01658528578619-c4d4e311/error
{
    "events": [
        {
            "timestamp": 1658528745319,
            "message": "{\"itemS3Pointer\":{\"bucket\":\"my-import-source\",\"key\":\"import-test.csv\",\"itemIndex\":0},\"importArn\":\"arn:aws:dynamodb:us-east-1:531234567890:table/target-table/import/01658528578619-c4d4e311\",\"errorMessages\":[\"One or more parameter values were invalid: Missing the key pk in the item\"]}",
            "ingestionTime": 1658528745414
        }
    ],
    "nextForwardToken": "f/36986426953797707963335499204463414460239026137054642176/s",
    "nextBackwardToken": "b/36986426953797707963335499204463414460239026137054642176/s"
}
```

This error log indicates “One or more parameter values were invalid: Missing the key pk in the item”. Because this import job failed, the table “target-table” now exists and is empty, as no items were imported. The first item was processed, and the object failed item validation. 

To fix the issue, first delete “target-table” if it is no longer needed. Then either use a primary key column name that exists in the source object, or update the source data to:

```
pk,title,artist,year_of_release
Albums::Rock::Classic::1973::AlbumId::ALB25,The Dark Side of the Moon,Pink Floyd,1973
```

#### Target table exists
<a name="S3DataImport.Validation.Troubleshooting.TargetTable"></a>

You might start an import job and receive the following response:

```
An error occurred (ResourceInUseException) when calling the ImportTable operation: Table already exists: target-table
```

To fix this error, choose a table name that doesn't already exist and retry the import. 

#### The specified bucket does not exist
<a name="S3DataImport.Validation.Troubleshooting.Bucket"></a>

If the source bucket does not exist, the import will fail and log the error message details in CloudWatch. 

Example describe import:

```
aws dynamodb --endpoint-url $ENDPOINT describe-import --import-arn "arn:aws:dynamodb:us-east-1:531234567890:table/target-table/import/01658530687105-e6035287"
{
    "ImportTableDescription": {
        "ImportArn": "arn:aws:dynamodb:us-east-1:ACCOUNT:table/target-table/import/01658530687105-e6035287",
        "ImportStatus": "FAILED",
        "TableArn": "arn:aws:dynamodb:us-east-1:ACCOUNT:table/target-table",
        "TableId": "e1215a82-b8d1-45a8-b2e2-14b9dd8eb99c",
        "ClientToken": "3048e16a-069b-47a6-9dfb-9c259fd2fb6f",
        "S3BucketSource": {
            "S3BucketOwner": "531234567890",
            "S3Bucket": "BUCKET_DOES_NOT_EXIST",
            "S3KeyPrefix": "import-test"
        },
        "ErrorCount": 0,
        "CloudWatchLogGroupArn": "arn:aws:logs:us-east-1:ACCOUNT:log-group:/aws-dynamodb/imports:*",
        "InputFormat": "CSV",
        "InputCompressionType": "NONE",
        "TableCreationParameters": {
            "TableName": "target-table",
            "AttributeDefinitions": [
                {
                    "AttributeName": "pk",
                    "AttributeType": "S"
                }
            ],
            "KeySchema": [
                {
                    "AttributeName": "pk",
                    "KeyType": "HASH"
                }
            ],
            "BillingMode": "PAY_PER_REQUEST"
        },
        "StartTime": 1658530687.105,
        "EndTime": 1658530701.873,
        "ProcessedSizeBytes": 0,
        "ProcessedItemCount": 0,
        "ImportedItemCount": 0,
        "FailureCode": "S3NoSuchBucket",
        "FailureMessage": "The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: Q4W6QYYFDWY6WAKH; S3 Extended Request ID: ObqSlLeIMJpQqHLRX2C5Sy7n+8g6iGPwy7ixg7eEeTuEkg/+chU/JF+RbliWytMlkUlUcuCLTrI=; Proxy: null)"
    }
}
```

The `FailureCode` is `S3NoSuchBucket`, and the `FailureMessage` contains details such as the request ID and the service that returned the error. Because the error was caught before any data was imported into the table, a new DynamoDB table is not created. In some cases, when these errors are encountered after the data import has started, the table with partially imported data is retained. 

To fix this error, make sure that the source Amazon S3 bucket exists and then restart the import process.

# Best practices for importing from Amazon S3 into DynamoDB
<a name="S3DataImport.BestPractices"></a>

The following are the best practices for importing data from Amazon S3 into DynamoDB.

## Stay under the limit of 50,000 S3 objects
<a name="S3DataImport.BestPractices.S3Limit"></a>

Each import job supports a maximum of 50,000 S3 objects. If your dataset contains more than 50,000 objects, consider consolidating them into larger objects.
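To check how many objects a prefix holds before starting an import, you can sum the counts across `list_objects_v2` pages. This sketch keeps the counting logic separate from the AWS call; the bucket and prefix names are hypothetical:

```python
def count_objects(pages):
    """Sum object counts across list_objects_v2 result pages."""
    return sum(page.get("KeyCount", 0) for page in pages)

# Example (assumes boto3 and configured credentials):
# import boto3
# s3 = boto3.client("s3")
# pages = s3.get_paginator("list_objects_v2").paginate(
#     Bucket="my-import-source", Prefix="import-data/")
# if count_objects(pages) > 50000:
#     print("Too many objects; consolidate before importing")
```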

## Avoid excessively large S3 objects
<a name="S3DataImport.BestPractices.AvoidLargeObjects"></a>

S3 objects are imported in parallel. Having numerous mid-sized S3 objects allows for parallel execution without excessive overhead. For items under 1 KB, consider placing 4,000,000 items into each S3 object. If you have a larger average item size, place proportionally fewer items into each S3 object.
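The guideline above can be scaled for other item sizes. The following is a rough heuristic based on that guidance, not an official formula:

```python
def items_per_object(avg_item_bytes, baseline_items=4_000_000, baseline_item_bytes=1024):
    """Scale the ~4,000,000-items-per-object guideline for sub-1 KB items
    proportionally to the average item size (a heuristic, not an AWS rule)."""
    return max(1, baseline_items * baseline_item_bytes // avg_item_bytes)

# items_per_object(1024) -> 4_000_000
# items_per_object(4096) -> 1_000_000
```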

## Randomize sorted data
<a name="S3DataImport.BestPractices.RandomizeSortedData"></a>

If an S3 object holds data in sorted order, it can create a *rolling hot partition*. This is a situation where one partition receives all the activity, and then the next partition after that, and so on. Data in sorted order is defined as items in sequence in the S3 object that will be written to the same target partition during the import. One common situation where data is in sorted order is a CSV file where items are sorted by partition key so that repeated items share the same partition key.

To avoid a rolling hot partition, we recommend that you randomize the order in these cases. This can improve performance by spreading the write operations. For more information, see [Distributing write activity efficiently during data upload in DynamoDB](bp-partition-key-data-upload.md).
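One simple way to break up sorted input is to shuffle the data rows of each CSV object before upload, keeping the header row first. A minimal sketch:

```python
import random

def shuffle_data_rows(lines, seed=None):
    """Shuffle CSV data rows while keeping the header line first."""
    header, rows = lines[0], lines[1:]
    random.Random(seed).shuffle(rows)
    return [header] + rows

csv_lines = ["pk,title", "A::1,x", "A::2,y", "B::1,z"]
shuffled = shuffle_data_rows(csv_lines, seed=42)
# shuffled[0] is still "pk,title"; the remaining rows are in random order
```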

## Compress data to keep the total S3 object size below the Regional limit
<a name="S3DataImport.BestPractices.CompressData"></a>

In the [import from S3 process](S3DataImport.Requesting.md), there is a limit on the sum total size of the S3 object data to be imported. The limit is 15 TB in the us-east-1, us-west-2, and eu-west-1 Regions, and 1 TB in all other Regions. The limit is based on the raw S3 object sizes.

Compression allows more raw data to fit within the limit. If compression alone isn’t sufficient to fit the import within the limit, you can also contact [AWS Premium Support](https://aws.amazon.com/premiumsupport/) for a quota increase.
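For example, you can gzip each CSV object before uploading it, and then set the input compression type to `GZIP` when requesting the import. A sketch using only the standard library:

```python
import gzip
import shutil

def gzip_file(src, dst):
    """Compress src into dst using gzip."""
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# gzip_file("import-test.csv", "import-test.csv.gz")
# Then upload the .gz object and pass --input-compression-type GZIP
# to the import-table request.
```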

## Be aware of how item size impacts performance
<a name="S3DataImport.BestPractices.ItemSize"></a>

If your average item size is very small (below 200 bytes), the import process might take a little longer than for larger item sizes.

## Do not modify S3 objects during active imports
<a name="S3DataImport.BestPractices.NoModification"></a>

Ensure that your source S3 objects remain unchanged while an import operation is in progress. If an S3 object is modified during an import, the operation will fail with error code `ObjectModifiedInS3DuringImport` and the message "The S3 object could not be imported because it was overwritten."

If you encounter this error, restart the import operation with a stable version of your S3 object. To avoid this issue, wait for the current import to complete before making changes to the source files.

## Consider importing without any Global Secondary Indexes
<a name="S3DataImport.BestPractices.GSI"></a>

The duration of an import task may depend on the presence of one or multiple global secondary indexes (GSIs). If you plan to establish indexes with partition keys that have low cardinality, you may see a faster import if you defer index creation until after the import task is finished (rather than including them in the import job).

**Note**  
Creating a GSI does not incur write charges, whether it is created during or after the import.

# DynamoDB data export to Amazon S3: how it works
<a name="S3DataExport.HowItWorks"></a>

DynamoDB export to S3 is a fully managed solution for exporting your DynamoDB data to an Amazon S3 bucket at scale. Using DynamoDB export to S3, you can export data from an Amazon DynamoDB table from any time within your [point-in-time recovery (PITR)](Point-in-time-recovery.md) window to an Amazon S3 bucket. You need to enable PITR on your table to use the export functionality. This feature enables you to perform analytics and complex queries on your data using other AWS services such as Athena, AWS Glue, Amazon SageMaker AI, Amazon EMR, and AWS Lake Formation.

DynamoDB export to S3 allows you to export both full and incremental data from your DynamoDB table. Exports are asynchronous and don't consume [read capacity units (RCUs)](provisioned-capacity-mode.md), so they have no impact on table performance or availability. The supported export file formats are DynamoDB JSON and Amazon Ion. You can also export data to an S3 bucket owned by another AWS account and to a different AWS Region. Your data is always encrypted end-to-end.

DynamoDB full exports are charged based on the size of the DynamoDB table (table data and local secondary indexes) at the point in time for which the export is done. DynamoDB incremental exports are charged based on the size of data processed from your continuous backups for the time period being exported. Incremental export has a minimum charge of 10 MB. Additional charges apply for storing exported data in Amazon S3 and for `PUT` requests made against your Amazon S3 bucket. For more information about these charges, see [Amazon DynamoDB pricing](https://aws.amazon.com/dynamodb/pricing/) and [Amazon S3 pricing](https://aws.amazon.com/s3/pricing/).

For specifics on service quotas, see [Table export to Amazon S3](ServiceQuotas.md#limits-table-export). 

**Topics**
+ [Requesting a table export in DynamoDB](S3DataExport_Requesting.md)
+ [DynamoDB table export output format](S3DataExport.Output.md)

# Requesting a table export in DynamoDB
<a name="S3DataExport_Requesting"></a>

DynamoDB table exports allow you to export table data to an Amazon S3 bucket, enabling you to perform analytics and complex queries on your data using other AWS services such as Athena, AWS Glue, Amazon SageMaker AI, Amazon EMR, and AWS Lake Formation. You can request a table export using the AWS Management Console, the AWS CLI, or the DynamoDB API.

**Note**  
Requester pays Amazon S3 buckets aren't supported.

DynamoDB supports both full export and incremental export:
+ With **full exports**, you can export a full snapshot of your table from any point in time within the point-in-time recovery (PITR) window to your Amazon S3 bucket.
+ With **incremental exports**, you can export data from your DynamoDB table that was changed, updated, or deleted between a specified time period, within your PITR window, to your Amazon S3 bucket. 

**Topics**
+ [Prerequisites](#S3DataExport_Requesting_Permissions)
+ [Requesting an export using the AWS Management Console](#S3DataExport_Requesting_Console)
+ [Getting details about past exports in the AWS Management Console](#S3DataExport_Requesting_Console_Details)
+ [Requesting an export using the AWS CLI and AWS SDKs](#S3DataExport_Requesting_CLI)
+ [Getting details about past exports using the AWS CLI and AWS SDKs](#S3DataExport_Requesting_CLI_Details)

## Prerequisites
<a name="S3DataExport_Requesting_Permissions"></a>

**Enable PITR**

To use the export to S3 feature, you must enable PITR on your table. For details about how to enable PITR, see [Point-in-time recovery](PointInTimeRecovery_Howitworks.md). If you request an export for a table that doesn't have PITR enabled, your request fails with the exception message "An error occurred (PointInTimeRecoveryUnavailableException) when calling the `ExportTableToPointInTime` operation: Point in time recovery is not enabled for table 'my-dynamodb-table'". You can only request an export from a point in time that is within your configured PITR `RecoveryPeriodInDays`.
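You can check the PITR status programmatically before requesting an export. The following sketch separates the status check from the AWS call; the table name is hypothetical:

```python
def pitr_enabled(response):
    """True if a describe_continuous_backups response shows PITR enabled."""
    desc = response["ContinuousBackupsDescription"]
    return desc["PointInTimeRecoveryDescription"]["PointInTimeRecoveryStatus"] == "ENABLED"

# Example (assumes boto3 and configured credentials):
# import boto3
# client = boto3.client("dynamodb")
# response = client.describe_continuous_backups(TableName="my-dynamodb-table")
# if not pitr_enabled(response):
#     print("Enable PITR before requesting an export")
```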

**Set up S3 permissions**

You can export your table data to any Amazon S3 bucket you have permission to write to. The destination bucket doesn't need to be in the same AWS Region as the source table, or to have the same owner. Your AWS Identity and Access Management (IAM) policy must allow you to perform the S3 actions (`s3:AbortMultipartUpload`, `s3:PutObject`, and `s3:PutObjectAcl`) and the DynamoDB export action (`dynamodb:ExportTableToPointInTime`). Here's a sample policy that grants your user permissions to perform exports to an S3 bucket.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "AllowDynamoDBExportAction",
            "Effect": "Allow",
            "Action": "dynamodb:ExportTableToPointInTime",
            "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/my-table"
        },
        {
            "Sid": "AllowS3BucketWrites",
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
        }
    ]
}
```

------

If you need to write to an Amazon S3 bucket that is in another account or you don't have permissions to write to, the Amazon S3 bucket owner must add a bucket policy to allow you to export from DynamoDB to that bucket. Here's an example policy on the target Amazon S3 bucket.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "ExampleStatement",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:user/Dave"
            },
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
        }
    ]
}
```

------

Revoking these permissions while an export is in progress will result in partial files.

**Note**  
If the table you're exporting, or the bucket you're exporting to, is encrypted with customer managed keys, that KMS key's policy must give DynamoDB permission to use it. This permission is granted through the IAM user or role that triggers the export job. For more information on encryption, including best practices, see [How DynamoDB uses AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/services-dynamodb.html) and [Using a custom KMS key](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/encryption.howitworks.html#managed-key-customer-managed).

## Requesting an export using the AWS Management Console
<a name="S3DataExport_Requesting_Console"></a>

The following example demonstrates how to use the DynamoDB console to export an existing table named `MusicCollection`.

**Note**  
This procedure assumes that you have enabled point-in-time recovery. To enable it for the `MusicCollection` table, on the table's **Overview** tab, in the **Table details** section, choose **Enable** for **Point-in-time recovery**.

**To request a table export**

1. Sign in to the AWS Management Console and open the DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

1. In the navigation pane on the left side of the console, choose **Exports to S3**.

1. Select the **Export to S3** button.

1. Choose a source table and destination S3 bucket. If the destination bucket is owned by your account, you can use the **Browse S3** button to find it. Otherwise, enter the URL of the bucket using the format `s3://bucketname/prefix`. The **prefix** is an optional folder that helps keep your destination bucket organized.

1. Choose **Full export** or **Incremental export**. A **full export** outputs the full snapshot of your table as it was at the point in time you specify. An **incremental export** outputs the changes made to your table during the specified export period. The output is compacted so that it contains only the final state of each item from the export period; an item appears only once in the export even if it was updated multiple times within that period.

------
#### [ Full export ]

   1. Select the point in time you want to export the full table snapshot from. This can be any point in time within the PITR window. Alternatively, you can select **Current time** to export the latest snapshot.

   1. For **Exported file format**, choose between **DynamoDB JSON** and **Amazon Ion**. By default, your table will be exported in DynamoDB JSON format from the latest restorable time in the point in time recovery window and encrypted using an Amazon S3 key (SSE-S3). You can change these export settings if necessary. 
**Note**  
If you choose to encrypt your export using a key protected by AWS Key Management Service (AWS KMS), the key must be in the same Region as the destination S3 bucket.

------
#### [ Incremental export ]

   1. Select the **Export period** you want to export the incremental data for. Pick a start time within the PITR window. The export period duration must be at least 15 minutes and be no longer than 24 hours. The export period's start time is inclusive and the end time is exclusive.

   1. Choose between **Absolute mode** or **Relative mode**.

      1. **Absolute mode** will export incremental data for the time period you specify.

      1. **Relative mode** will export incremental data for an export period that is relative to your export job submission time.

   1. For **Exported file format**, choose between **DynamoDB JSON** and **Amazon Ion**. By default, your table will be exported in DynamoDB JSON format from the latest restorable time in the point in time recovery window and encrypted using an Amazon S3 key (SSE-S3). You can change these export settings if necessary.
**Note**  
If you choose to encrypt your export using a key protected by AWS Key Management Service (AWS KMS), the key must be in the same Region as the destination S3 bucket.

   1. For **Export view type**, select either **New and old images** or **New images only**. New image provides the latest state of the item. Old image provides the state of the item right before the specified “start date and time”. The default setting is **New and old images**. For more information on new images and old images, see [Incremental export output](S3DataExport.Output.md#incremental-export-output).

------

1. Choose **Export** to begin.

Exported data isn't transactionally consistent. A transaction can be torn between two export outputs: a subset of the items modified by the transaction might be reflected in one export, while other modifications from the same transaction aren't reflected in that same export request. However, exports are eventually consistent. If a transaction is torn during an export, the remaining part of the transaction appears in your next contiguous export, without duplicates. The time periods used for exports are based on an internal system clock and can vary by up to one minute from your application's local clock.

## Getting details about past exports in the AWS Management Console
<a name="S3DataExport_Requesting_Console_Details"></a>

You can find information about export tasks you've run in the past by choosing the **Exports to S3** section in the navigation sidebar. This section contains a list of all exports you've created in the past 90 days. Select the ARN of a task listed in the **Exports** tab to retrieve information about that export, including any advanced configuration settings you chose. Note that although export task metadata expires after 90 days and jobs older than that are no longer found in this list, the objects in your S3 bucket remain as long as their bucket policies allow. DynamoDB never deletes any of the objects it creates in your S3 bucket during an export.

## Requesting an export using the AWS CLI and AWS SDKs
<a name="S3DataExport_Requesting_CLI"></a>

The following examples show how to export an existing table to an S3 bucket. 

**Note**  
This procedure assumes that you have enabled point-in-time recovery. To enable it for the `MusicCollection` table, run the following command.  

```
aws dynamodb update-continuous-backups \
    --table-name MusicCollection \
    --point-in-time-recovery-specification PointInTimeRecoveryEnabled=True
```

**Full export**

------
#### [ AWS CLI ]

**Note**  
If requesting a cross-account table export, make sure to include the `--s3-bucket-owner` option.

```
aws dynamodb export-table-to-point-in-time \
  --table-arn arn:aws:dynamodb:us-west-2:111122223333:table/MusicCollection \
  --s3-bucket ddb-export-musiccollection-9012345678 \
  --s3-prefix 2020-Nov \
  --export-format DYNAMODB_JSON \
  --export-time 1604632434 \
  --s3-bucket-owner 9012345678 \
  --s3-sse-algorithm AES256
```

------
#### [ Python ]

```
import boto3
from datetime import datetime

client = boto3.client('dynamodb')

client.export_table_to_point_in_time(
    TableArn='arn:aws:dynamodb:us-east-1:111122223333:table/TABLE',
    ExportTime=datetime(2023, 9, 20, 12, 0, 0),
    S3Bucket='bucket',
    S3Prefix='prefix',
    S3SseAlgorithm='AES256',
    ExportFormat='DYNAMODB_JSON'
)
```

------
#### [ Java ]

```
DynamoDbClient client = DynamoDbClient.create();

client.exportTableToPointInTime(b -> b
    .tableArn("arn:aws:dynamodb:us-east-1:111122223333:table/TABLE")
    .exportTime(Instant.parse("2023-09-20T12:00:00Z"))
    .s3Bucket("bucket")
    .s3Prefix("prefix")
    .s3SseAlgorithm(S3SseAlgorithm.AES256)
    .exportFormat(ExportFormat.DYNAMODB_JSON));
```

------
#### [ .NET ]

```
var client = new AmazonDynamoDBClient();

await client.ExportTableToPointInTimeAsync(new ExportTableToPointInTimeRequest
{
    TableArn = "arn:aws:dynamodb:us-east-1:111122223333:table/TABLE",
    ExportTime = new DateTime(2023, 9, 20, 12, 0, 0, DateTimeKind.Utc),
    S3Bucket = "bucket",
    S3Prefix = "prefix",
    S3SseAlgorithm = S3SseAlgorithm.AES256,
    ExportFormat = ExportFormat.DYNAMODB_JSON
});
```

------
#### [ JavaScript ]

```
import { DynamoDBClient, ExportTableToPointInTimeCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient();

await client.send(new ExportTableToPointInTimeCommand({
    TableArn: "arn:aws:dynamodb:us-east-1:111122223333:table/TABLE",
    ExportTime: new Date("2023-09-20T12:00:00Z"),
    S3Bucket: "bucket",
    S3Prefix: "prefix",
    S3SseAlgorithm: "AES256",
    ExportFormat: "DYNAMODB_JSON"
}));
```

------
#### [ Go ]

```
cfg, _ := config.LoadDefaultConfig(context.TODO())
client := dynamodb.NewFromConfig(cfg)

exportTime := time.Date(2023, 9, 20, 12, 0, 0, 0, time.UTC)
client.ExportTableToPointInTime(context.TODO(), &dynamodb.ExportTableToPointInTimeInput{
    TableArn:       aws.String("arn:aws:dynamodb:us-east-1:111122223333:table/TABLE"),
    ExportTime:     &exportTime,
    S3Bucket:       aws.String("bucket"),
    S3Prefix:       aws.String("prefix"),
    S3SseAlgorithm: types.S3SseAlgorithmAes256,
    ExportFormat:   types.ExportFormatDynamodbJson,
})
```

------

**Incremental export**

------
#### [ AWS CLI ]

```
aws dynamodb export-table-to-point-in-time \
  --table-arn arn:aws:dynamodb:REGION:ACCOUNT:table/TABLENAME \
  --s3-bucket BUCKET --s3-prefix PREFIX \
  --incremental-export-specification ExportFromTime=1693569600,ExportToTime=1693656000,ExportViewType=NEW_AND_OLD_IMAGES \
  --export-type INCREMENTAL_EXPORT
```

------
#### [ Python ]

```
import boto3
from datetime import datetime

client = boto3.client('dynamodb')

client.export_table_to_point_in_time(
    TableArn='arn:aws:dynamodb:us-east-1:111122223333:table/TABLE',
    IncrementalExportSpecification={
      'ExportFromTime': datetime(2023, 9, 20, 12, 0, 0),
      'ExportToTime': datetime(2023, 9, 20, 13, 0, 0),
      'ExportViewType': 'NEW_AND_OLD_IMAGES'
    },
    ExportType='INCREMENTAL_EXPORT',
    S3Bucket='bucket',
    S3Prefix='prefix',
    S3SseAlgorithm='AES256',
    ExportFormat='DYNAMODB_JSON'
)
```

------
#### [ Java ]

```
DynamoDbClient client = DynamoDbClient.create();

client.exportTableToPointInTime(b -> b
    .tableArn("arn:aws:dynamodb:us-east-1:111122223333:table/TABLE")
    .exportType(ExportType.INCREMENTAL_EXPORT)
    .incrementalExportSpecification(i -> i
        .exportFromTime(Instant.parse("2023-09-20T12:00:00Z"))
        .exportToTime(Instant.parse("2023-09-20T13:00:00Z"))
        .exportViewType(ExportViewType.NEW_AND_OLD_IMAGES))
    .s3Bucket("bucket")
    .s3Prefix("prefix")
    .s3SseAlgorithm(S3SseAlgorithm.AES256)
    .exportFormat(ExportFormat.DYNAMODB_JSON));
```

------
#### [ .NET ]

```
var client = new AmazonDynamoDBClient();

await client.ExportTableToPointInTimeAsync(new ExportTableToPointInTimeRequest
{
    TableArn = "arn:aws:dynamodb:us-east-1:111122223333:table/TABLE",
    ExportType = ExportType.INCREMENTAL_EXPORT,
    IncrementalExportSpecification = new IncrementalExportSpecification
    {
        ExportFromTime = new DateTime(2023, 9, 20, 12, 0, 0, DateTimeKind.Utc),
        ExportToTime = new DateTime(2023, 9, 20, 13, 0, 0, DateTimeKind.Utc),
        ExportViewType = ExportViewType.NEW_AND_OLD_IMAGES
    },
    S3Bucket = "bucket",
    S3Prefix = "prefix",
    S3SseAlgorithm = S3SseAlgorithm.AES256,
    ExportFormat = ExportFormat.DYNAMODB_JSON
});
```

------
#### [ JavaScript ]

```
import { DynamoDBClient, ExportTableToPointInTimeCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient();

await client.send(new ExportTableToPointInTimeCommand({
    TableArn: "arn:aws:dynamodb:us-east-1:111122223333:table/TABLE",
    ExportType: "INCREMENTAL_EXPORT",
    IncrementalExportSpecification: {
        ExportFromTime: new Date("2023-09-20T12:00:00Z"),
        ExportToTime: new Date("2023-09-20T13:00:00Z"),
        ExportViewType: "NEW_AND_OLD_IMAGES"
    },
    S3Bucket: "bucket",
    S3Prefix: "prefix",
    S3SseAlgorithm: "AES256",
    ExportFormat: "DYNAMODB_JSON"
}));
```

------
#### [ Go ]

```
cfg, _ := config.LoadDefaultConfig(context.TODO())
client := dynamodb.NewFromConfig(cfg)

fromTime := time.Date(2023, 9, 20, 12, 0, 0, 0, time.UTC)
toTime := time.Date(2023, 9, 20, 13, 0, 0, 0, time.UTC)
client.ExportTableToPointInTime(context.TODO(), &dynamodb.ExportTableToPointInTimeInput{
    TableArn:   aws.String("arn:aws:dynamodb:us-east-1:111122223333:table/TABLE"),
    ExportType: types.ExportTypeIncrementalExport,
    IncrementalExportSpecification: &types.IncrementalExportSpecification{
        ExportFromTime: &fromTime,
        ExportToTime:   &toTime,
        ExportViewType: types.ExportViewTypeNewAndOldImages,
    },
    S3Bucket:       aws.String("bucket"),
    S3Prefix:       aws.String("prefix"),
    S3SseAlgorithm: types.S3SseAlgorithmAes256,
    ExportFormat:   types.ExportFormatDynamodbJson,
})
```

------

**Note**  
If you choose to encrypt your export using a key protected by AWS Key Management Service (AWS KMS), the key must be in the same Region as the destination S3 bucket.

## Getting details about past exports using the AWS CLI and AWS SDKs
<a name="S3DataExport_Requesting_CLI_Details"></a>

You can find information about export requests you've run in the past by using the `list-exports` command. This command returns a list of all exports you've created in the past 90 days. Note that although export task metadata expires after 90 days and jobs older than that are no longer returned by the `list-exports` command, the objects in your S3 bucket remain as long as their bucket policies allow. DynamoDB never deletes any of the objects it creates in your S3 bucket during an export.

Exports have a status of `PENDING` until they either succeed or fail. If they succeed, the status changes to `COMPLETED`. If they fail, the status changes to `FAILED`, along with a failure code and failure message that explain the cause.
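Because exports are asynchronous, a common pattern is to poll `describe-export` until the job reaches a terminal status. A minimal polling sketch (the ARN in the usage comment is a placeholder):

```python
import time

def wait_for_export(client, export_arn, poll_seconds=30):
    """Poll DescribeExport until the export completes or fails; return the final status."""
    while True:
        desc = client.describe_export(ExportArn=export_arn)["ExportDescription"]
        if desc["ExportStatus"] in ("COMPLETED", "FAILED"):
            return desc["ExportStatus"]
        time.sleep(poll_seconds)

# Example (assumes boto3 and configured credentials):
# import boto3
# client = boto3.client("dynamodb")
# status = wait_for_export(client,
#     "arn:aws:dynamodb:us-east-1:111122223333:table/TABLE/export/01695353076000-06e2188f")
```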

**List exports**

------
#### [ AWS CLI ]

```
aws dynamodb list-exports \
    --table-arn arn:aws:dynamodb:us-east-1:111122223333:table/ProductCatalog
```

------
#### [ Python ]

```
import boto3

client = boto3.client('dynamodb')

print(
  client.list_exports(
     TableArn='arn:aws:dynamodb:us-east-1:111122223333:table/TABLE',
  )
)
```

------
#### [ Java ]

```
DynamoDbClient client = DynamoDbClient.create();

ListExportsResponse response = client.listExports(b -> b
    .tableArn("arn:aws:dynamodb:us-east-1:111122223333:table/TABLE"));

response.exportSummaries().forEach(System.out::println);
```

------
#### [ .NET ]

```
var client = new AmazonDynamoDBClient();

var response = await client.ListExportsAsync(new ListExportsRequest
{
    TableArn = "arn:aws:dynamodb:us-east-1:111122223333:table/TABLE"
});

response.ExportSummaries.ForEach(Console.WriteLine);
```

------
#### [ JavaScript ]

```
import { DynamoDBClient, ListExportsCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient();

const response = await client.send(new ListExportsCommand({
    TableArn: "arn:aws:dynamodb:us-east-1:111122223333:table/TABLE"
}));

console.log(response.ExportSummaries);
```

------
#### [ Go ]

```
cfg, _ := config.LoadDefaultConfig(context.TODO())
client := dynamodb.NewFromConfig(cfg)

response, _ := client.ListExports(context.TODO(), &dynamodb.ListExportsInput{
    TableArn: aws.String("arn:aws:dynamodb:us-east-1:111122223333:table/TABLE"),
})

fmt.Println(response.ExportSummaries)
```

------

**Describe export**

------
#### [ AWS CLI ]

```
aws dynamodb describe-export \
    --export-arn arn:aws:dynamodb:us-east-1:111122223333:table/ProductCatalog/export/01695353076000-a1b2c3d4
```

------
#### [ Python ]

```
import boto3

client = boto3.client('dynamodb')

print(
  client.describe_export(
     ExportArn='arn:aws:dynamodb:us-east-1:111122223333:table/TABLE/export/01695353076000-06e2188f',
  )['ExportDescription']
)
```

------
#### [ Java ]

```
DynamoDbClient client = DynamoDbClient.create();

DescribeExportResponse response = client.describeExport(b -> b
    .exportArn("arn:aws:dynamodb:us-east-1:111122223333:table/TABLE/export/01695353076000-06e2188f"));

System.out.println(response.exportDescription());
```

------
#### [ .NET ]

```
var client = new AmazonDynamoDBClient();

var response = await client.DescribeExportAsync(new DescribeExportRequest
{
    ExportArn = "arn:aws:dynamodb:us-east-1:111122223333:table/TABLE/export/01695353076000-06e2188f"
});

Console.WriteLine(response.ExportDescription);
```

------
#### [ JavaScript ]

```
import { DynamoDBClient, DescribeExportCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient();

const response = await client.send(new DescribeExportCommand({
    ExportArn: "arn:aws:dynamodb:us-east-1:111122223333:table/TABLE/export/01695353076000-06e2188f"
}));

console.log(response.ExportDescription);
```

------
#### [ Go ]

```
cfg, _ := config.LoadDefaultConfig(context.TODO())
client := dynamodb.NewFromConfig(cfg)

response, _ := client.DescribeExport(context.TODO(), &dynamodb.DescribeExportInput{
    ExportArn: aws.String("arn:aws:dynamodb:us-east-1:111122223333:table/TABLE/export/01695353076000-06e2188f"),
})

fmt.Println(response.ExportDescription)
```

------

# DynamoDB table export output format
<a name="S3DataExport.Output"></a>

A DynamoDB table export includes manifest files in addition to the files containing your table data. These files are all saved in the Amazon S3 bucket that you specify in your [export request](S3DataExport_Requesting.md). The following sections describe the format and contents of each output object.

**Topics**
+ [Full export output](#full-export-output)
+ [Incremental export output](#incremental-export-output)

## Full export output
<a name="full-export-output"></a>

### Manifest files
<a name="S3DataExport.Output_Manifest"></a>

DynamoDB creates manifest files, along with their checksum files, in the specified S3 bucket for each export request.

```
export-prefix/AWSDynamoDB/ExportId/manifest-summary.json
export-prefix/AWSDynamoDB/ExportId/manifest-summary.checksum
export-prefix/AWSDynamoDB/ExportId/manifest-files.json
export-prefix/AWSDynamoDB/ExportId/manifest-files.checksum
```

You choose an **export-prefix** when you request a table export. This helps you keep files in the destination S3 bucket organized. The **ExportId** is a unique token generated by the service to ensure that multiple exports to the same S3 bucket and `export-prefix` don't overwrite each other.

The export creates at least one file per partition. For partitions that are empty, your export request creates an empty file. All of the items in each file are from that particular partition's hashed keyspace.

**Note**  
DynamoDB also creates an empty file named `_started` in the same directory as the manifest files. This file verifies that the destination bucket is writable and that the export has begun. It can safely be deleted.

#### The summary manifest
<a name="S3DataExport.Output_Manifest_Summary"></a>

The `manifest-summary.json` file contains summary information about the export job, which lets you identify the data files in the shared data folder that are associated with this export. Its format is as follows:

```
{
   "version": "2020-06-30",
   "exportArn": "arn:aws:dynamodb:us-east-1:123456789012:table/ProductCatalog/export/01234567890123-a1b2c3d4",
   "startTime": "2020-11-04T07:28:34.028Z",
   "endTime": "2020-11-04T07:33:43.897Z",
   "tableArn": "arn:aws:dynamodb:us-east-1:123456789012:table/ProductCatalog",
   "tableId": "12345a12-abcd-123a-ab12-1234abc12345",
   "exportTime": "2020-11-04T07:28:34.028Z",
   "s3Bucket": "ddb-productcatalog-export",
   "s3Prefix": "2020-Nov",
   "s3SseAlgorithm": "AES256",
   "s3SseKmsKeyId": null,
   "manifestFilesS3Key": "AWSDynamoDB/01693685827463-2d8752fd/manifest-files.json",
   "billedSizeBytes": 0,
   "itemCount": 8,
   "outputFormat": "DYNAMODB_JSON",
   "exportType": "FULL_EXPORT"
}
```

#### The files manifest
<a name="S3DataExport.Output_Manifest_Files"></a>

The `manifest-files.json` file contains information about the files that contain your exported table data. The file is in [JSON lines](https://jsonlines.org/) format, so newlines are used as item delimiters. In the following example, the details of one data file from a files manifest are formatted on multiple lines for the sake of readability.

```
{
"itemCount": 8,
   "md5Checksum": "sQMSpEILNgoQmarvDFonGQ==",
   "etag": "af83d6f217c19b8b0fff8023d8ca4716-1",
   "dataFileS3Key": "AWSDynamoDB/01693685827463-2d8752fd/data/asdl123dasas.json.gz"
}
```
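Because the files manifest uses newline-delimited JSON, a custom reader can process it one line at a time. A minimal Python sketch (the entry values below are illustrative, taken from the example above):

```python
import json

def parse_files_manifest(text: str) -> list[dict]:
    """Parse a manifest-files.json body: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# A single-entry manifest matching the example above.
manifest_text = (
    '{"itemCount": 8, "md5Checksum": "sQMSpEILNgoQmarvDFonGQ==", '
    '"etag": "af83d6f217c19b8b0fff8023d8ca4716-1", '
    '"dataFileS3Key": "AWSDynamoDB/01693685827463-2d8752fd/data/asdl123dasas.json.gz"}'
)
for entry in parse_files_manifest(manifest_text):
    print(entry["dataFileS3Key"], entry["itemCount"])
```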

### Data files
<a name="S3DataExport.Output_Data"></a>

DynamoDB can export your table data in two formats: DynamoDB JSON and Amazon Ion. Regardless of the format you choose, your data will be written to multiple compressed files named by the keys. These files are also listed in the `manifest-files.json` file.

The directory structure of your Amazon S3 bucket after a full export will contain all of your manifest files and data files under the export Id folder.

```
amzn-s3-demo-bucket/DestinationPrefix
.
└── AWSDynamoDB
    ├── 01693685827463-2d8752fd     // the single full export
    │   ├── manifest-files.json     // manifest points to files under 'data' subfolder
    │   ├── manifest-files.checksum
    │   ├── manifest-summary.json   // stores metadata about request
    │   ├── manifest-summary.md5  
    │   ├── data                    // The data exported by full export
    │   │   ├── asdl123dasas.json.gz
    │   │   ...
    │   └── _started                // empty file for permission check
```

#### DynamoDB JSON
<a name="S3DataExport.Output_Data_DDB-JSON"></a>

A table export in DynamoDB JSON format consists of multiple `Item` objects. Each individual object is in DynamoDB's standard marshalled JSON format.

When creating custom parsers for DynamoDB JSON export data, the format is [JSON lines](https://jsonlines.org/). This means that newlines are used as item delimiters. Many AWS services, such as Athena and AWS Glue, will parse this format automatically.

In the following example, a single item from a DynamoDB JSON export has been formatted on multiple lines for the sake of readability.

```
{
    "Item":{
        "Authors":{
            "SS":[
                "Author1",
                "Author2"
            ]
        },
        "Dimensions":{
            "S":"8.5 x 11.0 x 1.5"
        },
        "ISBN":{
            "S":"333-3333333333"
        },
        "Id":{
            "N":"103"
        },
        "InPublication":{
            "BOOL":false
        },
        "PageCount":{
            "N":"600"
        },
        "Price":{
            "N":"2000"
        },
        "ProductCategory":{
            "S":"Book"
        },
        "Title":{
            "S":"Book 103 Title"
        }
    }
}
```
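When writing a custom parser, each attribute value is wrapped in a type descriptor (`S`, `N`, `BOOL`, `SS`, and so on). The sketch below unmarshals a few of these descriptors by hand; in practice you might prefer the type deserializer that ships with an AWS SDK. It covers only the types used in the example above.

```python
import json
from decimal import Decimal

def unmarshal(av: dict):
    """Convert a single DynamoDB-JSON attribute value to a plain Python value.
    A full parser would also handle B, NS, BS, L, M, and NULL."""
    ((type_code, value),) = av.items()
    if type_code == "S":
        return value
    if type_code == "N":
        return Decimal(value)  # numbers are serialized as strings
    if type_code == "BOOL":
        return value
    if type_code == "SS":
        return set(value)
    raise ValueError(f"unhandled type descriptor: {type_code}")

line = ('{"Item": {"Id": {"N": "103"}, "Title": {"S": "Book 103 Title"}, '
        '"InPublication": {"BOOL": false}, "Authors": {"SS": ["Author1", "Author2"]}}}')
item = {name: unmarshal(av) for name, av in json.loads(line)["Item"].items()}
print(item)
```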

#### Amazon Ion
<a name="S3DataExport.Output_Data_ION"></a>

[Amazon Ion](http://amzn.github.io/ion-docs/) is a richly-typed, self-describing, hierarchical data serialization format built to address rapid development, decoupling, and efficiency challenges faced every day while engineering large-scale, service-oriented architectures. DynamoDB supports exporting table data in Ion's [text format](http://amzn.github.io/ion-docs/docs/spec.html), which is a superset of JSON.

When you export a table to Ion format, the DynamoDB datatypes used in the table are mapped to [Ion datatypes](http://amzn.github.io/ion-docs/docs/spec.html). DynamoDB sets use [Ion type annotations](http://amzn.github.io/ion-docs/docs/spec.html#annot) to disambiguate the datatype used in the source table.

The following table lists the mapping of DynamoDB data types to Ion data types:


| DynamoDB data type | Ion representation | 
| --- | --- | 
| String (S) | string | 
| Boolean (BOOL) | bool | 
| Number (N) | decimal | 
| Binary (B) | blob | 
| Set (SS, NS, BS) | list (with type annotation `$dynamodb_SS`, `$dynamodb_NS`, or `$dynamodb_BS`) | 
| List | list | 
| Map | struct | 

Items in an Ion export are delimited by newlines. Each line begins with an Ion version marker, followed by an item in Ion format. In the following example, an item from an Ion export has been formatted on multiple lines for the sake of readability. Note that Ion decimals use `d` exponent notation, so `6d2` represents 6 × 10² (600) and `2d3` represents 2 × 10³ (2,000).

```
$ion_1_0 {
    Item:{
        Authors:$dynamodb_SS::["Author1","Author2"],
        Dimensions:"8.5 x 11.0 x 1.5",
        ISBN:"333-3333333333",
        Id:103.,
        InPublication:false,
        PageCount:6d2,
        Price:2d3,
        ProductCategory:"Book",
        Title:"Book 103 Title"
    }
}
```

## Incremental export output
<a name="incremental-export-output"></a>

### Manifest files
<a name="S3DataIncrementalExport.Output_Manifest"></a>

DynamoDB creates manifest files, along with their checksum files, in the specified S3 bucket for each export request.

```
export-prefix/AWSDynamoDB/ExportId/manifest-summary.json
export-prefix/AWSDynamoDB/ExportId/manifest-summary.checksum
export-prefix/AWSDynamoDB/ExportId/manifest-files.json
export-prefix/AWSDynamoDB/ExportId/manifest-files.checksum
```

You choose an **export-prefix** when you request a table export. This helps you keep files in the destination S3 bucket organized. The **ExportId** is a unique token generated by the service to ensure that multiple exports to the same S3 bucket and `export-prefix` don't overwrite each other.

The export creates at least 1 file per partition. For partitions that are empty, your export request will create an empty file. All of the items in each file are from that particular partition's hashed keyspace.

**Note**  
DynamoDB also creates an empty file named `_started` in the same directory as the manifest files. This file verifies that the destination bucket is writable and that the export has begun. It can safely be deleted.

#### The summary manifest
<a name="S3DataIncrementalExport.Output_Manifest_Summary"></a>

The `manifest-summary.json` file contains summary information about the export job. This allows you to know which data files in the shared data folder are associated with this export. Its format is as follows:

```
{
 "version": "2023-08-01",
 "exportArn": "arn:aws:dynamodb:us-east-1:599882009758:table/export-test/export/01695097218000-d6299cbd",
 "startTime": "2023-09-19T04:20:18.000Z",
 "endTime": "2023-09-19T04:40:24.780Z",
 "tableArn": "arn:aws:dynamodb:us-east-1:599882009758:table/export-test",
 "tableId": "b116b490-6460-4d4a-9a6b-5d360abf4fb3",
 "exportFromTime": "2023-09-18T17:00:00.000Z",
 "exportToTime": "2023-09-19T04:00:00.000Z",
 "s3Bucket": "jason-exports",
 "s3Prefix": "20230919-prefix",
 "s3SseAlgorithm": "AES256",
 "s3SseKmsKeyId": null,
 "manifestFilesS3Key": "20230919-prefix/AWSDynamoDB/01693685934212-ac809da5/manifest-files.json",
 "billedSizeBytes": 20901239349,
 "itemCount": 169928274,
 "outputFormat": "DYNAMODB_JSON",
 "outputView": "NEW_AND_OLD_IMAGES",
 "exportType": "INCREMENTAL_EXPORT"
}
```

#### The files manifest
<a name="S3DataIncrementalExport.Output_Manifest_Files"></a>

The `manifest-files.json` file contains information about the files that contain your exported table data. The file is in [JSON lines](https://jsonlines.org/) format, so newlines are used as item delimiters. In the following example, the details of one data file from a files manifest are formatted on multiple lines for the sake of readability.

```
{
"itemCount": 8,
   "md5Checksum": "sQMSpEILNgoQmarvDFonGQ==",
   "etag": "af83d6f217c19b8b0fff8023d8ca4716-1",
   "dataFileS3Key": "AWSDynamoDB/data/sgad6417s6vss4p7owp0471bcq.json.gz"
}
```

### Data files
<a name="S3DataIncrementalExport.Output_Data"></a>

DynamoDB can export your table data in two formats: DynamoDB JSON and Amazon Ion. Regardless of the format you choose, your data will be written to multiple compressed files named by the keys. These files are also listed in the `manifest-files.json` file.

The data files for incremental exports are all contained in a common data folder in your S3 bucket. Your manifest files are under your export ID folder.

```
amzn-s3-demo-bucket/DestinationPrefix
.
└── AWSDynamoDB
    ├── 01693685934212-ac809da5     // an incremental export ID
    │   ├── manifest-files.json     // manifest points to files under 'data' folder
    │   ├── manifest-files.checksum
    │   ├── manifest-summary.json   // stores metadata about request
    │   ├── manifest-summary.md5  
    │   └── _started                // empty file for permission check
    ├── 01693686034521-ac809da5
    │   ├── manifest-files.json
    │   ├── manifest-files.checksum
    │   ├── manifest-summary.json
    │   ├── manifest-summary.md5
    │   └── _started
    ├── data                        // stores all the data files for incremental exports
    │   ├── sgad6417s6vss4p7owp0471bcq.json.gz 
    │   ...
```

In your export files, each item’s output includes a timestamp that represents when that item was updated in your table, along with a data structure that indicates whether the operation was an `insert`, `update`, or `delete`. The timestamp is based on an internal system clock and can vary from your application clock. For incremental exports, you can choose between two export view types for your output structure: **new and old images** or **new images only**.
+ **New image** provides the latest state of the item.
+ **Old image** provides the state of the item immediately before the specified **start date and time**.

View types can be helpful if you want to see how an item changed within the export period. They can also be useful for efficiently updating your downstream systems, especially if those systems use a partition key that differs from your DynamoDB partition key. 

You can infer whether an item in your incremental export output was an `insert`, `update`, or `delete` by looking at the structure of the output. The incremental export structure and its corresponding operations are summarized in the table below for both export view types.


| Operation | New images only | New and old images | 
| --- | --- | --- | 
|  Insert  |  Keys + new image  | Keys + new image | 
|  Update  | Keys + new image | Keys + new image + old image | 
| Delete | Keys | Keys + old image | 
| Insert + delete | No output | No output | 
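This inference can be sketched in Python for the new-and-old-images view (record shapes follow the DynamoDB JSON examples in the following section; under the new-images-only view, a delete carries keys only, which this sketch does not cover):

```python
from datetime import datetime, timezone

def classify(record: dict) -> str:
    """Infer the operation for a record exported with the new-and-old-images view."""
    has_new, has_old = "NewImage" in record, "OldImage" in record
    if has_new and has_old:
        return "update"
    if has_new:
        return "insert"
    if has_old:
        return "delete"
    raise ValueError("record carries neither a new nor an old image")

def write_time(record: dict) -> datetime:
    """Convert WriteTimestampMicros (microseconds since the epoch) to a datetime."""
    micros = int(record["Metadata"]["WriteTimestampMicros"])
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)

record = {
    "Metadata": {"WriteTimestampMicros": "1680109764000000"},
    "Keys": {"PK": {"S": "CUST#100"}},
    "NewImage": {"PK": {"S": "CUST#100"}, "FirstName": {"S": "John"}},
}
print(classify(record), write_time(record).isoformat())
```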

#### DynamoDB JSON
<a name="S3DataIncrementalExport.Output_Data_DDB-JSON"></a>

A table export in DynamoDB JSON format consists of a metadata timestamp that indicates the write time of the item, followed by the item's keys and values. The following shows example DynamoDB JSON output using the **New and old images** export view type.

```
// Ex 1: Insert
//   An insert means the item did not exist before the incremental export window
//   and was added during the incremental export window

{
   "Metadata": {
     "WriteTimestampMicros": "1680109764000000"
   },
   "Keys": {
     "PK": { 
       "S": "CUST#100" 
     }
   },
   "NewImage": {
     "PK": { 
       "S": "CUST#100" 
     },
     "FirstName": {
       "S": "John"
     },
     "LastName": {
       "S": "Don"
     }
   }
}

// Ex 2: Update
//   An update means the item existed before the incremental export window
//   and was updated during the incremental export window. 
//   The OldImage would not be present if choosing "New images only".

{
   "Metadata": {
     "WriteTimestampMicros": "1680109764000000"
   },
   "Keys": {
     "PK": { 
       "S": "CUST#200" 
     }
   },
   "OldImage": {
     "PK": { 
       "S": "CUST#200" 
     },
     "FirstName": {
       "S": "Mary"
     },
     "LastName": {
       "S": "Grace"
     }
   },
   "NewImage": {
     "PK": { 
       "S": "CUST#200" 
     },
     "FirstName": {
       "S": "Mary"
     },
     "LastName": {
       "S": "Smith"
     }
   }
}

// Ex 3: Delete
//   A delete means the item existed before the incremental export window
//   and was deleted during the incremental export window
//   The OldImage would not be present if choosing "New images only".

{
   "Metadata": {
     "WriteTimestampMicros": "1680109764000000"
   },
   "Keys": {
     "PK": { 
       "S": "CUST#300" 
     }
   },
   "OldImage": {
     "PK": { 
       "S": "CUST#300" 
     },
     "FirstName": {
       "S": "Jose"
     },
     "LastName": {
       "S": "Hernandez"
     }
   }
}

// Ex 4: Insert + Delete
//   Nothing is exported if an item is inserted and deleted within the 
//   incremental export window.
```

#### Amazon Ion
<a name="S3DataIncrementalExport.Output_Data_ION"></a>

[Amazon Ion](http://amzn.github.io/ion-docs/) is a richly-typed, self-describing, hierarchical data serialization format built to address rapid development, decoupling, and efficiency challenges faced every day while engineering large-scale, service-oriented architectures. DynamoDB supports exporting table data in Ion's [text format](http://amzn.github.io/ion-docs/docs/spec.html), which is a superset of JSON.

When you export a table to Ion format, the DynamoDB datatypes used in the table are mapped to [Ion datatypes](http://amzn.github.io/ion-docs/docs/spec.html). DynamoDB sets use [Ion type annotations](http://amzn.github.io/ion-docs/docs/spec.html#annot) to disambiguate the datatype used in the source table.

The following table lists the mapping of DynamoDB data types to Ion data types:


| DynamoDB data type | Ion representation | 
| --- | --- | 
| String (S) | string | 
| Boolean (BOOL) | bool | 
| Number (N) | decimal | 
| Binary (B) | blob | 
| Set (SS, NS, BS) | list (with type annotation `$dynamodb_SS`, `$dynamodb_NS`, or `$dynamodb_BS`) | 
| List | list | 
| Map | struct | 

Items in an Ion export are delimited by newlines. Each line begins with an Ion version marker, followed by an item in Ion format. In the following example, an item from an Ion export has been formatted on multiple lines for the sake of readability. Note that Ion decimals use `d` exponent notation, so `6d2` represents 6 × 10² (600) and `2d3` represents 2 × 10³ (2,000).

```
$ion_1_0 {
    Record:{
        Keys:{
             ISBN:"333-3333333333"
         },
        Metadata:{
            WriteTimestampMicros:1684374845117899.
        },
        OldImage:{
            Authors:$dynamodb_SS::["Author1","Author2"],
            ISBN:"333-3333333333",
            Id:103.,
            InPublication:false,
            ProductCategory:"Book",
            Title:"Book 103 Title"
        },
        NewImage:{
            Authors:$dynamodb_SS::["Author1","Author2"],
            Dimensions:"8.5 x 11.0 x 1.5",
            ISBN:"333-3333333333",
            Id:103.,
            InPublication:true,
            PageCount:6d2,
            Price:2d3,
            ProductCategory:"Book",
            Title:"Book 103 Title"
        }
    }
}
```

# DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse
<a name="amazon-sagemaker-lakehouse-for-DynamoDB"></a>

DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse eliminates the need to build custom data movement pipelines by automatically replicating DynamoDB data to Amazon SageMaker Lakehouse. This no-code integration helps customers run analytics workloads on their DynamoDB data using Amazon SageMaker Lakehouse without consuming any DynamoDB table capacity. The integration automatically exports data from your table and keeps the target fresh, typically within 15 to 30 minutes.

**Topics**
+ [DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse](amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl.md)

# DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse
<a name="amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl"></a>

Setting up an integration between a DynamoDB table and Amazon SageMaker Lakehouse requires prerequisites such as configuring the IAM roles that AWS Glue uses to read data from the source and write to the target, and the KMS keys used to encrypt data in the intermediate and target locations.

**Topics**
+ [Prerequisites before creating a DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse](#amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl-prereqs)
+ [Creating a DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse](amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl-getting-started.md)
+ [Viewing CloudWatch metrics for integration](#amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl-cloudwatch-metrics)

## Prerequisites before creating a DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse
<a name="amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl-prereqs"></a>

To configure a zero-ETL integration with a DynamoDB source, you need to set up a resource-based policy that allows AWS Glue to access and export data from the DynamoDB table. The policy should include specific permissions such as `ExportTableToPointInTime`, `DescribeTable`, and `DescribeExport`, with conditions restricting access to a specific AWS account and Region. For more information, see [Configuring an Amazon DynamoDB source](https://docs.aws.amazon.com/glue/latest/dg/zero-etl-sources.html#zero-etl-config-source-dynamodb).

Point-in-time recovery (PITR) must be enabled for the table, and you can apply the policy using AWS CLI commands. The policy can be further refined by specifying the full integration ARN for more restrictive access control. For more information, see [Prerequisites for setting up a zero-ETL integration](https://docs.aws.amazon.com/glue/latest/dg/zero-etl-prerequisites.html).
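As an illustrative sketch only, a resource-based policy attached to the source table might take roughly the following shape. The account ID, Region, table name, and condition values are placeholders, and the authoritative statement is defined in the AWS Glue documentation linked above.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "glue.amazonaws.com" },
            "Action": [
                "dynamodb:ExportTableToPointInTime",
                "dynamodb:DescribeTable",
                "dynamodb:DescribeExport"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-east-1:111122223333:table/MyTable",
                "arn:aws:dynamodb:us-east-1:111122223333:table/MyTable/export/*"
            ],
            "Condition": {
                "StringEquals": { "aws:SourceAccount": "111122223333" }
            }
        }
    ]
}
```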

# Creating a DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse
<a name="amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl-getting-started"></a>

After completing integration prerequisites, you can create, modify, or delete the zero-ETL integration following the guidance below:

## Creating an integration
<a name="amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl-getting-started-creating"></a>

**To create an integration**

1. Sign in to the AWS Management Console and open the Amazon DynamoDB console at [https://console.aws.amazon.com/dynamodbv2](https://console.aws.amazon.com/dynamodbv2).

1. In the navigation pane, choose **Integrations**. 

1. Select **Create zero-ETL integration with Amazon SageMaker Lakehouse**, and then choose **Next**.

1. To create an integration, see [Creating an integration](https://docs.aws.amazon.com/glue/latest/dg/zero-etl-common-integration-tasks.html#zero-etl-creating).

1. To modify an integration, see [Modifying an integration](https://docs.aws.amazon.com/glue/latest/dg/zero-etl-common-integration-tasks.html#zero-etl-modifying).

1. To delete an integration, see [Deleting an integration](https://docs.aws.amazon.com/glue/latest/dg/zero-etl-common-integration-tasks.html#zero-etl-deleting).

1. To set up a cross-account integration, see [Setting up cross-account integration](https://docs.aws.amazon.com/glue/latest/dg/zero-etl-prerequisites.html#zero-etl-setup-cross-account-integration).

## Enabling compaction on target Amazon S3 tables
<a name="amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl-enabling-compaction"></a>

You can enable compaction to improve query performance in Amazon Athena.

First, complete the prerequisite setup for compaction resources, including configuring the necessary IAM role. For detailed IAM role configuration steps, see [Optimizing tables for compaction](https://docs.aws.amazon.com/lake-formation/latest/dg/data-compaction.html) in the Lake Formation documentation.

To enable compaction on the AWS Glue table created during integration, follow the Lake Formation compaction enabling process. This will help optimize your table's performance and query efficiency.

## Viewing CloudWatch metrics for integration
<a name="amazon-sagemaker-lakehouse-for-DynamoDB-zero-etl-cloudwatch-metrics"></a>

After an integration completes, you can view the CloudWatch metrics and EventBridge notifications that are generated in your account for each AWS Glue job. For more information, see [Monitoring an integration](https://docs.aws.amazon.com/glue/latest/dg/zero-etl-monitoring.html).

# DynamoDB zero-ETL integration with Amazon OpenSearch Service
<a name="OpenSearchIngestionForDynamoDB"></a>

Amazon DynamoDB offers a zero-ETL integration with Amazon OpenSearch Service through the **DynamoDB plugin for OpenSearch Ingestion**. Amazon OpenSearch Ingestion offers a fully managed, no-code experience for ingesting data into Amazon OpenSearch Service. 

With the DynamoDB plugin for OpenSearch Ingestion, you can use one or more DynamoDB tables as a source for ingestion to one or more OpenSearch Service indexes. You can browse and configure your OpenSearch Ingestion pipelines with DynamoDB as a source from either OpenSearch Ingestion or DynamoDB Integrations in the AWS Management Console.
+ Get started with OpenSearch Ingestion by following along in the [OpenSearch Ingestion getting started guide](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-getting-started-tutorials.html).
+ Learn about the prerequisites and all the configuration options for the DynamoDB plugin at [DynamoDB plugin for OpenSearch Ingestion documentation](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-ddb.html).

## How it works
<a name="opensearch-for-dynamodb-how-it-works"></a>

The plugin uses [DynamoDB export to Amazon S3](S3DataExport.HowItWorks.md) to create an initial snapshot to load into OpenSearch. After the snapshot has been loaded, the plugin uses DynamoDB Streams to replicate any further changes in near real time. Every item is processed as an event in OpenSearch Ingestion and can be modified with processor plugins. You can drop attributes or create composite attributes and send them to different indexes through routes.

You must have [point-in-time recovery (PITR)](Point-in-time-recovery.md) enabled to use export to Amazon S3. You must also have [DynamoDB Streams](streamsmain.md) enabled (with the **new & old images** option selected) to replicate ongoing changes. It's possible to create a pipeline without taking a snapshot by excluding export settings. 

You can also create a pipeline with only a snapshot and no updates by excluding streams settings. The plugin does not use read or write throughput on your table, so it is safe to use without impacting your production traffic. There are limits to the number of parallel consumers on a stream that you should consider before creating this or other integrations. For other considerations, see [Best practices for integrating with DynamoDB](bp-integration.md).

For simple pipelines, a single OpenSearch Compute Unit (OCU) can process about 1 MB per second of writes. This is the equivalent of about 1000 write request units (WCU). Depending on your pipeline's complexity and other factors, you might achieve more or less than this.
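As a rough sizing sketch based on these figures (treating 1 OCU ≈ 1 MB/s ≈ 1,000 WCU as a planning estimate rather than a guarantee):

```python
import math

def estimate_ocus(write_mb_per_sec: float, mb_per_sec_per_ocu: float = 1.0) -> int:
    """Back-of-the-envelope OCU count for a simple pipeline; complex pipelines
    may need more capacity per MB, so benchmark before relying on this."""
    return max(1, math.ceil(write_mb_per_sec / mb_per_sec_per_ocu))

# A table sustaining ~3.5 MB/s of writes (roughly 3,500 WCU):
print(estimate_ocus(3.5))  # 4
```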

OpenSearch Ingestion supports a dead-letter queue (DLQ) for events that cause unrecoverable errors. Additionally, the pipeline can resume from where it left off without user intervention even if there's an interruption of service with either DynamoDB, the pipeline, or Amazon OpenSearch Service. 

If an interruption lasts longer than 24 hours, it can cause a loss of updates. However, the pipeline continues to process whatever updates are still available once availability is restored. Unless the dropped events were captured in the dead-letter queue, you would need to rebuild the index from a fresh export to fix any resulting irregularities.

For all the settings and details for the plugin, see [OpenSearch Ingestion DynamoDB plugin documentation](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-ddb.html).

## Integrated create experience through the console
<a name="opensearch-for-dynamodb-console-create"></a>

DynamoDB and OpenSearch Service have an integrated experience in the AWS Management Console, which streamlines the getting started process. When you go through these steps, the service will automatically select the DynamoDB blueprint and add the appropriate DynamoDB information for you.

To create an integration, follow along in the [OpenSearch Ingestion getting started guide](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-get-started.html). When you get to [Step 3: Create a pipeline](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-get-started.html#osis-get-started-pipeline), replace Steps 1 and 2 with the following steps:

1. Navigate to the DynamoDB console.

1. In the left-hand navigation pane, choose **Integration**.

1. Select the DynamoDB table that you'd like to replicate to OpenSearch.

1. Choose **Create**.

From here, you can continue on with the rest of the tutorial.

## Next steps
<a name="opensearch-for-dynamodb-next-steps"></a>

For a better understanding of how DynamoDB integrates with OpenSearch Service, see the following:
+ [Getting started with Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-getting-started-tutorials.html)
+ [DynamoDB plugin configuration and requirements](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-ddb.html)

# Handling breaking changes to your index
<a name="opensearch-for-dynamodb-change-index"></a>

OpenSearch can dynamically add new attributes to your index. However, after your mapping template has been set for a given key, you’ll need to take additional action to change it. Additionally, if your change requires you to reprocess all the data in your DynamoDB table, you’ll need to take steps to initiate a fresh export.

**Note**  
In all these options, you might still run into issues if your DynamoDB table has type conflicts with the mapping template you’ve specified. Ensure that you have a dead-letter queue (DLQ) enabled (even in development). This makes it easier to understand what might be wrong with the record that causes a conflict when it's being indexed into your index on OpenSearch.

**Topics**
+ [How it works](#opensearch-for-dynamodb-change-index-howitworks)
+ [Delete your index and reset the pipeline (pipeline-centric option)](#opensearch-for-dynamodb-change-index-delete)
+ [Recreate your index and reset the pipeline (index-centric option)](#opensearch-for-dynamodb-change-index-recreate)
+ [Create a new index and sink (online option)](#opensearch-for-dynamodb-change-index-create)
+ [Best practices for avoiding and debugging type conflicts](#opensearch-for-dynamodb-change-index-bp)

## How it works
<a name="opensearch-for-dynamodb-change-index-howitworks"></a>

Here's a quick overview of the actions taken when handling breaking changes to your index. See the step-by-step procedures in the sections that follow.
+ **Stop and start the pipeline**: This option resets the pipeline’s state, and the pipeline will restart with a new full export. It is non-destructive, so it does **not** delete your index or any data in DynamoDB. If you don’t create a fresh index before you do this, you might see a high number of errors from version conflicts because the export tries to insert older documents than the current `_version` in the index. You can safely ignore these errors. You will not be billed for the pipeline while it is stopped.
+ **Update the pipeline**: This option updates the configuration in the pipeline with a [blue/green](https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/bluegreen-deployments.html) approach, without losing any state. If you make significant changes to your pipeline (such as adding new routes, indexes, or keys to existing indexes), you might need to do a full reset of the pipeline and recreate your index. This option does **not** perform a full export.
+ **Delete and recreate the index**: This option removes your data and mapping settings on your index. You should do this before making any breaking changes to your mappings. It will break any applications that rely on the index until the index is recreated and synchronized. Deleting the index does **not** initiate a fresh export. You should delete your index only after you’ve updated your pipeline. Otherwise, your index might be recreated before you update your settings.

## Delete your index and reset the pipeline (pipeline-centric option)
<a name="opensearch-for-dynamodb-change-index-delete"></a>

This method is often the fastest option if you’re still in development. You’ll delete your index in OpenSearch Service, and then [stop and start](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline--stop-start.html) your pipeline to initiate a fresh export of all your data. This ensures that there are no mapping template conflicts with existing indexes, and no loss of data from an incomplete processed table.

1. Stop the pipeline either through the AWS Management Console, or by using the `StopPipeline` API operation with the AWS CLI or an SDK.

1. [Update your pipeline configuration](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/update-pipeline.html) with your new changes.

1. Delete your index in OpenSearch Service, either through a `REST` API call or your OpenSearch Dashboard.

1. Start the pipeline either through the console, or by using the `StartPipeline` API operation with the AWS CLI or an SDK.
**Note**  
This initiates a fresh full export, which will incur additional costs.

1. Monitor for any unexpected issues because a fresh export is generated to create the new index.

1. Confirm that the index matches your expectations in OpenSearch Service.

After the export has completed and it resumes reading from the stream, your DynamoDB table data will now be available in the index.

## Recreate your index and reset the pipeline (index-centric option)
<a name="opensearch-for-dynamodb-change-index-recreate"></a>

This method works well if you need to do a lot of iterations on the index design in OpenSearch Service before resuming the pipeline from DynamoDB. This can be useful for development when you want to iterate very quickly on your search patterns, and want to avoid waiting on fresh exports to complete between each iteration.

1. Stop the pipeline either through the AWS Management Console, or by calling the `StopPipeline` API operation with the AWS CLI or an SDK.

1. Delete and recreate your index in OpenSearch with the mapping template you want to use. You can manually insert some sample data to confirm that your searches are working as intended. If your sample data might conflict with any data from DynamoDB, be sure to delete it before moving onto the next step.

1. If you have an indexing template in your pipeline, remove it or replace it with the one you’ve created already in OpenSearch Service. Ensure that the name of your index matches the name in the pipeline.

1. Start the pipeline either through the console, or by calling the `StartPipeline` API operation with the AWS CLI or an SDK.
**Note**  
This will initiate a fresh full export, which will incur additional costs.

1. Monitor for any unexpected issues because a fresh export is generated to create the new index.

After the export has completed and the pipeline resumes reading from the stream, your DynamoDB table data will be available in the index.

## Create a new index and sink (online option)
<a name="opensearch-for-dynamodb-change-index-create"></a>

This method works well if you need to update your mapping template but are currently using your index in production. This creates a brand new index, which you’ll need to move your application over to after it’s synchronized and validated.

**Note**  
This will create another consumer on the stream. This can be an issue if you also have other consumers like AWS Lambda or global tables. You might need to pause updates to your existing pipeline to create capacity to load the new index.

1. [Create a new pipeline](OpenSearchIngestionForDynamoDB.md#opensearch-for-dynamodb-console-create) with new settings and a different index name.

1. Monitor the new index for any unexpected issues.

1. Swap the application over to the new index.

1. Stop and delete the old pipeline after validating that everything is working correctly.

## Best practices for avoiding and debugging type conflicts
<a name="opensearch-for-dynamodb-change-index-bp"></a>
+ Always use a dead-letter queue (DLQ) to make it easier to debug when there are type conflicts.
+ Always use an index template with mappings and set `include_keys`. While OpenSearch Service dynamically maps new keys, this can cause issues with unexpected behaviors (such as expecting something to be a `GeoPoint`, but it’s created as a `string` or `object`) or errors (such as having a `number` that is a mix of `long` and `float` values).
+ If you need to keep your existing index working in production, you can also replace any of the previous [delete index steps](#opensearch-for-dynamodb-change-index-delete) with just renaming your index in your pipeline config file. This creates a brand new index. Your application will then need to be updated to point to the new index after it's complete.
+ If you have a type conversion issue that you fix with a processor, you can test this with `UpdatePipeline`. To do this, you’ll need to do a stop and start or [process your dead-letter queues](https://opensearch.org/docs/latest/data-prepper/pipelines/dlq/) to fix any previously skipped documents that had errors.

# Best practices for working with DynamoDB zero-ETL integration and OpenSearch Service
<a name="bp-integration-opensearch"></a>

DynamoDB has a [zero-ETL integration with Amazon OpenSearch Service](OpenSearchIngestionForDynamoDB.md). For more information, see the [DynamoDB plugin for OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-ddb.html) and [specific best practices for Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html).

## Configuration
<a name="bp-integration-opensearch-configuration"></a>
+ Only index data that you need to perform searches on. Always use a mapping template (`template_type: index_template` and `template_content`) and `include_keys` to implement this.
+ Monitor your logs for errors that are related to type conflicts. OpenSearch Service expects all values for a given key to have the same type. It generates exceptions if there's a mismatch. If you encounter one of these errors, you can add a processor to ensure that a given key always has the same type.
+ Generally use the `primary_key` metadata value for the `document_id` value. In OpenSearch Service, the document ID is the equivalent of the primary key in DynamoDB. Using the primary key will make it easy to find your document and ensure that updates are consistently replicated to it without conflicts. 

  You can use the helper function `getMetadata` to get your primary key (for example, `document_id: "${getMetadata('primary_key')}"`). If you're using a composite primary key, the helper function will concatenate them together for you.
+ In general, use the `opensearch_action` metadata value for the `action` setting. This will ensure that updates are replicated in such a way that the data in OpenSearch Service matches the latest state in DynamoDB. 

  You can use the helper function `getMetadata` to get this value (for example, `action: "${getMetadata('opensearch_action')}"`). You can also get the stream event type through `dynamodb_event_name` for use cases like filtering. However, you should typically not use it for the `action` setting.
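
As an illustration of how a composite primary key collapses into a single document ID, the sketch below mimics that concatenation. The function name and the `"|"` separator are assumptions for demonstration only; in a real pipeline, the `getMetadata('primary_key')` helper performs the concatenation for you.

```
# Illustrative only: how a composite DynamoDB primary key can map to a single
# OpenSearch document ID. The separator is an assumption for demonstration;
# the pipeline's getMetadata('primary_key') helper handles the real case.
def to_document_id(partition_key, sort_key=None, separator="|"):
    if sort_key is None:
        return str(partition_key)
    return f"{partition_key}{separator}{sort_key}"
```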

## Observability
<a name="bp-integration-opensearch-observability"></a>
+ Always use a dead-letter queue (DLQ) on your OpenSearch sinks to handle dropped events. DynamoDB is generally less structured than OpenSearch Service, and it's always possible for something unexpected to happen. With a dead-letter queue, you can recover individual events, and even automate the recovery process. This will help you to avoid needing to rebuild your entire index.
+ Always set alerts so that you're notified when your replication delay goes over an expected amount. One minute is typically a safe threshold that won't make the alert too noisy, but this can vary depending on how spiky your write traffic is and your OpenSearch Compute Unit (OCU) settings on the pipeline. 

  If your replication delay goes over 24 hours, your stream will start to drop events, and you'll have accuracy issues unless you do a full rebuild of your index from scratch.

## Scaling
<a name="bp-integration-opensearch-scaling"></a>
+ Use auto scaling for pipelines to help scale up or down the OCUs to best fit the workload.
+ For provisioned throughput tables without auto scaling, we recommend setting OCUs based on your write capacity units (WCUs) divided by 1000. Set the minimum to 1 OCU below that amount (but at least 1), and set the maximum to at least 1 OCU above that amount.
  + **Formula:**

    ```
    OCU_minimum = GREATEST((table_WCU / 1000) - 1, 1)
    OCU_maximum = (table_WCU / 1000) + 1
    ```
  + **Example:** Your table has 25000 WCUs provisioned. Your pipeline's OCUs should be set with a minimum of 24 (25000/1000 - 1) and maximum of at least 26 (25000/1000 + 1).
+ For provisioned throughput tables with auto scaling, we recommend setting OCUs based on your minimum and maximum WCUs, divided by 1000. Set the minimum to 1 OCU below the minimum from DynamoDB, and set the maximum to at least 1 OCU above the maximum from DynamoDB.
  + **Formula:**

    ```
    OCU_minimum = GREATEST((table_minimum_WCU / 1000) - 1, 1)
    OCU_maximum = (table_maximum_WCU / 1000) + 1
    ```
  + **Example:** Your table has an auto scaling policy with a minimum of 8000 and maximum of 14000. Your pipeline's OCUs should be set with a minimum of 7 (8000/1000 - 1) and a maximum of 15 (14000/1000 + 1).
+ For on-demand throughput tables, we recommend setting OCUs based on your typical peak and valley for write request units per second. You might need to average over a longer time period, depending on the aggregation that's available to you. Set the minimum to 1 OCU below the minimum from DynamoDB, and set the maximum to at least 1 OCU above the maximum from DynamoDB.
  + **Formula:**

    ```
    # Assuming we have writes aggregated at the minute level
    OCU_minimum = GREATEST((min(table_writes_1min) / (60 * 1000)) - 1, 1)
    OCU_maximum = (max(table_writes_1min) / (60 * 1000)) + 1
    ```
  + **Example:** Your table has an average valley of 300 write request units per second and an average peak of 4300. Your pipeline's OCUs should be set with a minimum of 1 (300/1000 - 1, but at least 1) and a maximum of 5 (4300/1000 + 1).
+ Follow best practices on scaling your destination OpenSearch Service indexes. If your indexes are under-scaled, it will slow down ingestion from DynamoDB, and might cause delays.

**Note**  
[`GREATEST`](https://docs.aws.amazon.com/redshift/latest/dg/r_GREATEST_LEAST.html) is a SQL function that, given a set of arguments, returns the argument with the greatest value.
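
The sizing guidance above can be expressed as a small helper. This is a sketch of the formulas in this section, not an official sizing tool; `wcu_min` and `wcu_max` stand for your table's provisioned (or observed per-second) write rates.

```
def ocu_bounds(wcu_min, wcu_max=None):
    # Implements the formulas above: minimum OCUs are one below WCU/1000
    # (but at least 1); maximum OCUs are one above WCU/1000.
    if wcu_max is None:
        wcu_max = wcu_min
    minimum = max(wcu_min // 1000 - 1, 1)
    maximum = wcu_max // 1000 + 1
    return minimum, maximum
```

For the examples above, `ocu_bounds(25000)` yields `(24, 26)`, `ocu_bounds(8000, 14000)` yields `(7, 15)`, and `ocu_bounds(300, 4300)` yields `(1, 5)`.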

# Integrating DynamoDB with Amazon EventBridge
<a name="eventbridge-for-dynamodb"></a>

Amazon DynamoDB offers DynamoDB Streams for change data capture, enabling the capture of item-level changes in DynamoDB tables. DynamoDB Streams can invoke Lambda functions to process those changes, allowing event-driven integration with other services and applications. DynamoDB Streams also supports filtering, which allows for efficient and targeted event processing.

DynamoDB Streams supports up to [two simultaneous consumers](ServiceQuotas.md#limits-dynamodb-streams) per shard and supports filtering through [Lambda event filtering](https://docs.aws.amazon.com/lambda/latest/dg/invocation-eventfiltering.html) so that only items that match specific criteria are processed. Some customers might have requirements to support more than two consumers. Others might need to enrich change events before they are processed, or use more advanced filtering and routing.

Integrating DynamoDB with EventBridge can support those requirements.

Amazon EventBridge is a serverless service that uses events to connect application components together, making it easier for you to build scalable event-driven applications. EventBridge offers native integration with Amazon DynamoDB through EventBridge Pipes, enabling seamless data flow from DynamoDB to an EventBridge bus. That bus can then fan-out to multiple applications and services through a set of rules and targets.

**Topics**
+ [How it works](#eventbridge-for-dynamodb-how-it-works)
+ [Creating an integration through the console](#eventbridge-for-dynamodb-create-integration-console)
+ [Next steps](#eventbridge-for-dynamodb-next-steps)

## How it works
<a name="eventbridge-for-dynamodb-how-it-works"></a>

The integration between DynamoDB and EventBridge Pipes uses DynamoDB Streams to capture a time-ordered sequence of item-level changes in a DynamoDB table. Each record captured this way contains the data modified in the table.

![\[Image showing how DynamoDB Streams integrate with an Amazon EventBridge bus.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/eventbridge-dynamodb.png)


An EventBridge pipe consumes events from DynamoDB Streams and routes them to a target such as an EventBridge bus (an event bus is a router that receives events and delivers them to destinations, also called targets). Delivery is based on which rules match the contents of the event. Optionally, the pipe also includes the ability to filter for specific events and perform enrichments on the event data before sending it to the target.
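
As a sketch of the enrichment step, EventBridge Pipes can invoke a Lambda function with a batch (a list) of stream records and expects the enriched batch back. The `routingHint` field and the sort-key attribute name `SK` below are assumptions for illustration, not part of the event format.

```
def enrich_handler(events, context):
    # Pipes passes a list of stream records; return the (possibly modified)
    # list so the pipe forwards the enriched events to the target.
    for event in events:
        sk = event.get("dynamodb", {}).get("Keys", {}).get("SK", {}).get("S", "")
        # Hypothetical enrichment: expose the sort-key prefix for rule matching.
        event["routingHint"] = sk.split("#", 1)[0] if "#" in sk else "unknown"
    return events
```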

While EventBridge supports [multiple target types](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-targets.html), a common choice when implementing a fan-out design is to use a Lambda function as the target. The following example demonstrates an integration with a Lambda function target.

## Creating an integration through the console
<a name="eventbridge-for-dynamodb-create-integration-console"></a>

Follow the steps below to create an integration through the AWS Management Console.

1. Enable DynamoDB Streams on the source table by following the steps in the [Enabling a stream](Streams.md#Streams.Enabling) section of the DynamoDB developer guide. If DynamoDB Streams is already enabled on the source table, verify that there are currently fewer than two consumers. Consumers could be Lambda functions, DynamoDB Global Tables, Amazon DynamoDB zero-ETL integrations with Amazon OpenSearch Service, or applications that read directly from streams such as through the DynamoDB Streams Kinesis adapter.

1. Create an EventBridge event bus by following the steps in the [Creating an Amazon EventBridge event bus](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-event-bus.html) section of the EventBridge user guide. 

   1. When creating the event bus, enable **Schema discovery**.

1. Create an EventBridge pipe by following the steps in the [Creating an Amazon EventBridge pipe](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-pipes-create.html) section of the EventBridge user guide.

   1. When configuring the source, in the **Source** field select *DynamoDB* and in the **DynamoDB Streams** field select the name of the source table stream.

   1. When configuring the target, in the **Target service** field select *EventBridge event bus* and in the **Event bus as target** field select the event bus created in step 2.

1. Write an example item to the source DynamoDB table to trigger an event. This will allow EventBridge to infer schema from the example item. This schema can be used to create rules for routing events. For example, if you are implementing a design pattern that involves [overloading attributes](bp-gsi-overloading.md), you may want to trigger different rules depending on the value of your sort key. Details on how to write an item to DynamoDB can be found in the [Working with items and attributes](WorkingWithItems.md#WorkingWithItems.WritingData) section of the DynamoDB developer guide.

1. Create an example Python Lambda function to be used as a target by following the steps in the [Building Lambda functions with Python](https://docs.aws.amazon.com/lambda/latest/dg/lambda-python.html) section of the Lambda developer guide. When creating your function, you can use the example code below to demonstrate the integration. When invoked, it will print the `NewImage` and `OldImage` received with the event, which can be viewed in CloudWatch Logs.

   ```
   import json
   
   def lambda_handler(event, context):
       dynamodb = event.get('detail', {}).get('dynamodb', {})
       new_image = dynamodb.get('NewImage')
       old_image = dynamodb.get('OldImage')
       
       if new_image:
           print("NewImage:", json.dumps(new_image, indent=2))
       if old_image:
           print("OldImage:", json.dumps(old_image, indent=2))
       
       return {'statusCode': 200, 'body': json.dumps(event)}
   ```

1. Create an EventBridge rule that will route events to your new Lambda function by following the steps in the [Create a rule that reacts to events](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule.html) section of the EventBridge user guide.

   1. When defining the rule detail, select the name of the event bus you created in step 2 as the **Event bus**.

   1. When building the event pattern, follow the guide for **Existing schema**. Here, you can select the *discovered-schemas* registry and the discovered schema for your event. This allows you to configure an event pattern specific to your use case that only routes messages that match specific attributes. For example, if you wanted to match only on DynamoDB items where the SK begins with `user#`, you’d use a configuration like this.  
![\[Image showing an EventBridge rule where only DynamoDB items that have a sort key beginning with "user#" is displayed.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/eventbridge-rule-example.png)

   1. Click **Generate event pattern in JSON** after you’ve finished designing a pattern against your schema. If you instead want to match all events that appear on DynamoDB Streams, use the following JSON for the event pattern.

      ```
      {
        "source": ["aws.dynamodb"]
      }
      ```

   1. When selecting targets, follow the guide for **AWS service**. In the **Select a target** field, choose *Lambda function*. In the **Function** field, select the Lambda function you created in step 5.

1. You can now stop schema discovery on your event bus by following the steps in the [Starting or stopping schema discovery on event buses](https://docs.aws.amazon.com/eventbridge/latest/userguide/event-bus-update.html#event-bus-update-schema) section of the EventBridge user guide.

1. Write a second example item to the source DynamoDB table to trigger an event. Validate that the event was successfully processed at each step.

   1. View the CloudWatch metric **PutEventsApproximateSuccessCount** for your event bus by following the [Monitoring Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-monitoring.html) section of the EventBridge user guide. 

   1. View function logs for your Lambda function by following the [Monitoring and troubleshooting Lambda functions](https://docs.aws.amazon.com/lambda/latest/dg/lambda-monitoring.html) section of the Lambda developer guide. If your Lambda function uses the example code provided, you should see the `NewImage` and `OldImage` from DynamoDB Streams printed in the CloudWatch Logs log group.

   1. View the **Error count and success rate (%)** metric for your Lambda function by following the [Monitoring and troubleshooting Lambda functions](https://docs.aws.amazon.com/lambda/latest/dg/lambda-monitoring.html) section of the Lambda developer guide.
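
The catch-all rule from step 6 can also be created with the AWS SDK instead of the console, as in the hedged sketch below. The bus, rule, and target names are assumptions, and the `put_rule`/`put_targets` calls from boto3's EventBridge (`events`) client are shown only as comments because they require AWS credentials.

```
import json

def build_rule(bus_name, rule_name):
    # Catch-all pattern that matches every event DynamoDB Streams puts on the bus.
    return {
        "Name": rule_name,
        "EventBusName": bus_name,
        "EventPattern": json.dumps({"source": ["aws.dynamodb"]}),
        "State": "ENABLED",
    }

# Usage (requires AWS credentials; not run here):
#   import boto3
#   events = boto3.client("events")
#   events.put_rule(**build_rule("my-event-bus", "dynamodb-all-events"))
#   events.put_targets(Rule="dynamodb-all-events", EventBusName="my-event-bus",
#                      Targets=[{"Id": "target-fn", "Arn": function_arn}])
```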

## Next steps
<a name="eventbridge-for-dynamodb-next-steps"></a>

This example provides a basic integration with a single Lambda function as a target. For a better understanding of more complex configurations, such as creating multiple rules, creating multiple targets, integrating with other services, and enriching events see the complete EventBridge user guide: [Getting started with EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-get-started.html).

**Note**  
Be aware of any EventBridge quotas that might be relevant to your application. While DynamoDB Streams capacity scales with your table, EventBridge quotas are separate. Common quotas to be aware of in a large application would be **Invocations throttle limit in transactions per second** and **PutEvents throttle limit in transactions per second**. These quotas specify the number of invocations that can be sent to targets and the number of events that can be put into the bus per second.

# Integrating DynamoDB with Amazon Managed Streaming for Apache Kafka
<a name="msk-for-dynamodb"></a>

[Amazon Managed Streaming for Apache Kafka (Amazon MSK)](https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html) makes it easy to ingest and process streaming data in real time with a fully managed, highly available Apache Kafka service.

[Apache Kafka](https://kafka.apache.org/) is a distributed data store optimized for ingesting and processing streaming data in real-time. Kafka can process streams of records, effectively store streams of records in the order in which records were generated, and publish and subscribe to streams of records. 

Because of these features, Apache Kafka is often used to build real-time streaming data pipelines. A *data pipeline* reliably processes and moves data from one system to another and can be an important part of adopting a purpose-built database strategy by facilitating the use of multiple databases which each support different use cases.

Amazon DynamoDB is a common target in these data pipelines to support applications that use key-value or document data models and need limitless scalability with consistent single-digit millisecond performance. 

**Topics**
+ [How it works](#msk-for-dynamodb-how-it-works)
+ [Set up an integration between Amazon MSK and DynamoDB](#msk-for-dynamodb-example)
+ [Next steps](#msk-for-dynamodb-next-steps)

## How it works
<a name="msk-for-dynamodb-how-it-works"></a>

An integration between Amazon MSK and DynamoDB uses a [Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) function to consume records from Amazon MSK and write them to DynamoDB.

![\[Diagram showing an integration between Amazon MSK and DynamoDB, and how Amazon MSK uses a Lambda function to consume records and write them to DynamoDB.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-diagram.png)


Lambda internally polls for new messages from Amazon MSK and then synchronously invokes the target Lambda function. The Lambda function’s event payload contains batches of messages from Amazon MSK. For the integration between Amazon MSK and DynamoDB, the Lambda function writes these messages to DynamoDB.
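
A minimal sketch of such a function is shown below. Lambda delivers MSK events as a map of topic-partition keys to lists of records whose `value` fields are base64-encoded. The DynamoDB write itself is shown only as a comment, because the table name (this sketch assumes the `device_status` table from the walkthrough that follows) and item shape depend on your setup.

```
import base64
import json

def parse_msk_records(event):
    # Lambda's MSK event maps "topic-partition" keys to lists of records;
    # each record's value is a base64-encoded Kafka message.
    items = []
    for records in event.get("records", {}).values():
        for record in records:
            payload = base64.b64decode(record["value"]).decode("utf-8")
            items.append(json.loads(payload))
    return items

def lambda_handler(event, context):
    items = parse_msk_records(event)
    # In the real function, write each item to DynamoDB, for example:
    #   table = boto3.resource("dynamodb").Table("device_status")
    #   for item in items: table.put_item(Item=item)
    return {"parsed": len(items)}
```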

## Set up an integration between Amazon MSK and DynamoDB
<a name="msk-for-dynamodb-example"></a>

**Note**  
You can download the resources used in this example at the following [GitHub repository](https://github.com/aws-samples/serverless-streaming-datastore-sample).

The steps below show how to set up a sample integration between Amazon MSK and Amazon DynamoDB. The example represents data generated by Internet of Things (IoT) devices and ingested into Amazon MSK. As data is ingested into Amazon MSK, it can be integrated with analytics services or third-party tools compatible with Apache Kafka, enabling various analytics use cases. Integrating DynamoDB as well provides key-value lookup of individual device records. 

This example will demonstrate how a Python script writes IoT sensor data to Amazon MSK. Then, a Lambda function writes items with the partition key "`deviceid`" to DynamoDB.

The provided CloudFormation template will create the following resources: an Amazon S3 bucket, an Amazon VPC, an Amazon MSK cluster, and an AWS CloudShell environment for testing data operations. 

To generate test data, create an Amazon MSK topic and then create a DynamoDB table. You can use AWS CloudShell from the management console to run the Python scripts. 

After running the CloudFormation template, you can finish building this architecture by performing the following operations. 

1. Run the CloudFormation template `S3bucket.yaml` to create an S3 bucket. Run any subsequent scripts or operations in the same Region. Enter `ForMSKTestS3` as the CloudFormation stack name.  
![\[Image showing the CloudFormation console stack creation screen.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-create-stack.png)

   After this is completed, note down the S3 bucket name output under *Outputs*. You will need the name in Step 3.  
![\[CloudFormation stack outputs showing S3 bucket name for MSK and DynamoDB sample.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-bucket-name.png)

1. Upload the downloaded ZIP file `fromMSK.zip` to the S3 bucket you just created.  
![\[Image showing where you can upload files in the S3 console.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-zip.png)

1. Run the CloudFormation template `VPC.yaml` to create a VPC, Amazon MSK cluster, and Lambda function. On the parameter input screen, enter the S3 bucket name you created in Step 1 where it asks for the S3 bucket. Set the CloudFormation stack name to `ForMSKTestVPC`.   
![\[Image showing the fields you need to fill out when specifying the CloudFormation stack details.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-vpc.png)

1. Prepare the environment for running Python scripts in CloudShell. You can use CloudShell on the AWS Management Console. For more information on using CloudShell, see [Getting started with AWS CloudShell](https://docs.aws.amazon.com/cloudshell/latest/userguide/getting-started.html). After starting CloudShell, create a CloudShell environment that belongs to the VPC you just created so that it can connect to the Amazon MSK cluster. Create the CloudShell environment in a private subnet. Fill in the following fields: 

   1. **Name** - can be set to any name. An example is **MSK-VPC**

   1. **VPC** - select **MSKTest**

   1. **Subnet** - select **MSKTest Private Subnet (AZ1)**

   1. **SecurityGroup** - select **ForMSKSecurityGroup**  
![\[CloudShell interface showing ap-southeast-1 environment and option to create VPC environment.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-cshell-1.png)  
![\[Image showing a CloudShell environment with the fields you have to specify.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-cshell-2.png)

   Once the CloudShell belonging to the Private Subnet has started, run the following command:

   ```
   pip install boto3 kafka-python aws-msk-iam-sasl-signer-python
   ```

1. Download Python scripts from the S3 bucket.

   ```
   aws s3 cp s3://[YOUR-BUCKET-NAME]/pythonScripts.zip ./
   unzip pythonScripts.zip
   ```

1. Check the Amazon MSK cluster broker endpoint in the management console, and use it to set the environment variables for the broker URL and Region that the Python scripts use.  
![\[Amazon MSK cluster summary showing active status, serverless type, and creation details.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-view-client-1.png)  
![\[Amazon MSK cluster client information showing the broker endpoint.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-view-client-2.png)

1. Set the environment variables on the CloudShell, substituting your cluster's broker endpoint. For example, if you are using the US West (Oregon) Region:

   ```
   export AWS_REGION="us-west-2"
   export MSK_BROKER="boot-YOURMSKCLUSTER.c3.kafka-serverless.us-west-2.amazonaws.com:9098"
   ```

1. Run the following Python scripts.

   Create an Amazon MSK topic:

   ```
   python ./createTopic.py
   ```

   Create a DynamoDB table:

   ```
   python ./createTable.py
   ```

   Write test data to the Amazon MSK topic:

   ```
   python ./kafkaDataGen.py
   ```

1. Check the CloudWatch metrics for the created Amazon MSK, Lambda, and DynamoDB resources, and verify the data stored in the `device_status` table using the DynamoDB Data Explorer to ensure all processes ran correctly. If each process ran without error, you can confirm that the test data written from CloudShell to Amazon MSK was also written to DynamoDB.  
![\[Image showing the DynamoDB console and how there are now items returned when you perform a scan.\]](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/images/msk-dynamodb-explore.png)

1. When you're done with this example, delete the resources created in this tutorial. Delete the two CloudFormation stacks: `ForMSKTestS3` and `ForMSKTestVPC`. If the stack deletion completes successfully, all resources will be deleted. 

## Next steps
<a name="msk-for-dynamodb-next-steps"></a>

**Note**  
If you created resources while following along with this example, please remember to delete them to avoid any unexpected charges.

This integration demonstrates an architecture that links Amazon MSK and DynamoDB to enable streaming data to support OLTP workloads. From here, you can enable more complex searches by linking [DynamoDB with OpenSearch Service](OpenSearchIngestionForDynamoDB.md). Consider integrating with EventBridge for more complex event-driven needs, and extensions such as [Amazon Managed Service for Apache Flink](https://docs.aws.amazon.com/managed-flink/latest/java/what-is.html) for higher throughput and lower latency requirements.

# Best practices for integrating with DynamoDB
<a name="bp-integration"></a>

When integrating DynamoDB with other services, you should always follow the best practices for using each individual service. However, there are some best practices specific to integration that you should consider.

**Topics**
+ [Creating a snapshot in DynamoDB](#bp-integration-snapshot)
+ [Capturing data change in DynamoDB](#bp-integration-change-data-capture)

## Creating a snapshot in DynamoDB
<a name="bp-integration-snapshot"></a>
+ Generally, we recommend using [export to Amazon S3](S3DataExport.HowItWorks.md) to create snapshots for initial replication. It is cost effective and won't compete with your application's traffic for throughput. You can also consider a backup and restore to a new table followed by a scan operation. This avoids competing for throughput with your application, but is generally substantially less cost effective than an export.
+ Always set a `StartTime` when doing an export. This makes it easy to determine where you'll start your change data capture (CDC) from.
+ When using export to S3, set a lifecycle action on the S3 bucket. Typically, an expiration action set at 7 days is safe, but you should follow any guidelines that your company might have. Even if you explicitly delete your items after ingestion, this action can help catch issues, which helps reduce unnecessary costs and prevents policy violations.
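
As a sketch of the first two points, boto3's DynamoDB client exposes `export_table_to_point_in_time`, which accepts an explicit `ExportTime`. The helper below only assembles the request (the ARN, bucket, and timestamp are placeholders); the actual call is shown as a comment because it requires AWS credentials.

```
from datetime import datetime, timezone

def build_export_request(table_arn, bucket, export_time):
    # Recording export_time tells you exactly where change data capture
    # should resume after the snapshot is loaded.
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "ExportTime": export_time,
        "ExportFormat": "DYNAMODB_JSON",
    }

# Usage (requires AWS credentials; not run here):
#   import boto3
#   request = build_export_request(
#       "arn:aws:dynamodb:us-west-2:123456789012:table/MyTable",
#       "my-export-bucket",
#       datetime(2024, 1, 1, tzinfo=timezone.utc))
#   boto3.client("dynamodb").export_table_to_point_in_time(**request)
```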

## Capturing data change in DynamoDB
<a name="bp-integration-change-data-capture"></a>
+ If you need near real-time CDC, use [DynamoDB Streams](streamsmain.md) or [Amazon Kinesis Data Streams (KDS)](kds.md). When you're deciding which one to use, generally consider which is easiest to use with the downstream service. If you need to provide in-order event processing at a partition-key level, or if you have items that are exceptionally large, use DynamoDB Streams.
+ If you don't need near real-time CDC, you can use [export to Amazon S3 with incremental exports](S3DataExport.HowItWorks.md) to export only the changes that have happened between two points in time. 

  If you used export to S3 for generating a snapshot, this can be especially helpful because you can use similar code to process incremental exports. Typically, export to S3 is slightly cheaper than the previous streaming options, but cost is typically not the main factor for which option to use.
+ You can generally only have two simultaneous consumers of a DynamoDB stream. Consider this when planning your integration strategy.
+ Don't use scans to detect changes. This might work on a small scale, but becomes impractical fairly quickly.