Create an S3 event integration to automatically copy files from Amazon S3 buckets
Note
The preview release for auto-copy has ended. Consequently, preview clusters will be automatically removed 30 days after the end of the preview period. If you plan to continue using auto-copy, we recommend re-creating your existing auto-copy jobs on another Amazon Redshift cluster. Upgrading a preview cluster to the latest Amazon Redshift version is not supported.
You can use an auto-copy job to load data into your Amazon Redshift tables from files that are stored in Amazon S3. Amazon Redshift detects when new Amazon S3 files are added to the path specified in your COPY command. A COPY command is then automatically run without you having to create an external data ingestion pipeline. Amazon Redshift keeps track of which files have been loaded. Amazon Redshift determines the number of files batched together per COPY command. You can see the resulting COPY commands in system views.
The first step to create an automatic COPY JOB is to create an S3 event integration. When a new file appears in the Amazon S3 source bucket, Amazon Redshift then manages loading the files into your database using the COPY command.
Prerequisites to creating an S3 event integration
To set up your s3 event integration, confirm the following prerequisites are completed.
Your Amazon S3 bucket must have a bucket policy that allows several Amazon S3 permissions. For example, the following example policy allows permissions for the resource bucket
amzn-s3-demo-bucket
that is hosted inus-east-1
. Both the Amazon S3 bucket and the integration are in the same AWS Region.{ "Version": "2012-10-17", "Statement": [ { "Sid": "Auto-Copy-Policy-01", "Effect": "Allow", "Principal": { "Service": "redshift.amazonaws.com" }, "Action": [ "s3:GetBucketNotification", "s3:PutBucketNotification", "s3:GetBucketLocation" ], "Resource": "arn:aws:s3:::
amzn-s3-demo-bucket
", "Condition": { "StringLike": { "aws:SourceArn": "arn:aws:redshift:us-east-1
:123456789012
:integration:*", "aws:SourceAccount": "123456789012
" } } } ] }Your target Amazon Redshift provisioned cluster or Redshift Serverless namespace must have permission to the bucket. Confirm an IAM role that is associated with your cluster or serverless namesspace has a IAM policy that allows the proper permissions. The policy must allow both
s3:GetObject
for a bucket resource such as
andamzn-s3-demo-bucket
s3:ListBucket
for a bucket resource and its contents such as
.amzn-s3-demo-bucket
/*{ "Version": "2012-10-17", "Statement": [ { "Sid": "AutoCopyReadId", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::
amzn-s3-demo-bucket
", "arn:aws:s3:::amzn-s3-demo-bucket
/*" ] } ] }Add your policy to an IAM role that has a trust relationship for the role is as follows.
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "redshift.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ] }
If your target data warehouse is a provisioned cluster, you can associate an IAM role to your provisioned cluster using the Amazon Redshift console, Cluster permissions tab in your cluster details. For information about how to associate a role to your provisioned cluster, see Associating IAM roles with clusters in the Amazon Redshift Management Guide.
If your target data warehouse is Redshift Serverless, you can associate an IAM role to your serverless namespace using the Redshift Serverless console, Security and encryption tab in your namespace details. For information about how to associate a role to your serverless namespace, see Granting permissions to Amazon Redshift Serverless in the Amazon Redshift Management Guide.
Your Amazon Redshift data warehouse must also have a resource policy that allows the Amazon S3 bucket. If you use the Amazon Redshift console, when you create the s3 event integration, Amazon Redshift provides the option Fix it for me to add this policy to your Amazon Redshift data warehouse. To update a resource policy yourself, you can use the put-resource-policy AWS CLI command. For example, to attach a resource policy to your Amazon Redshift provisioned cluster for an S3 event integration with an Amazon S3 bucket, run a AWS CLI command similar to the following. The following example shows a policy for a provisioned cluster namespace in the
us-east-1
AWS Region for user account123456789012
. The bucket is namedamzn-s3-demo-bucket
.aws redshift put-resource-policy \ --policy file://rs-rp.json \ --resource-arn "arn:aws:redshift:
us-east-1
:123456789012
:namespace/cc4ffe56-ad2c-4fd1-a5a2-f29124a56433"Where
rs-rp.json
contains:{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "redshift.amazonaws.com" }, "Action": "redshift:AuthorizeInboundIntegration", "Resource": "arn:aws:redshift:
us-east-1
:123456789012
:namespace/cc4ffe56-ad2c-4fd1-a5a2-f29124a56433", "Condition": { "StringEquals": { "aws:SourceArn": "arn:aws:s3:::amzn-s3-demo-bucket
" } } }, { "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::123456789012
:role/myRedshiftRole" }, "Action": "redshift:CreateInboundIntegration", "Resource": "arn:aws:redshift:us-east-1
:123456789012
:namespace/cc4ffe56-ad2c-4fd1-a5a2-f29124a56433", "Condition": { "StringEquals": { "aws:SourceArn": "arn:aws:s3:::amzn-s3-demo-bucket
" } } } ] }To attach a resource policy to your Redshift Serverless namespace for an S3 event integration with an Amazon S3 bucket, run a AWS CLI command similar to the following. The following example shows a policy for a serverless namespace in the
us-east-1
AWS Region for user account123456789012
. The bucket is namedamzn-s3-demo-bucket
.aws redshift put-resource-policy \ --policy file://rs-rp.json \ --resource-arn "arn:aws:redshift-serverless:
us-east-1
:123456789012
:namespace/namespace-1"Where
rs-rp.json
contains:{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "redshift.amazonaws.com" }, "Action": "redshift:AuthorizeInboundIntegration", "Resource": "arn:aws:redshift-serverless:
us-east-1
:123456789012
:namespace/namespace-1", "Condition": { "StringEquals": { "aws:SourceArn": "arn:aws:s3:::amzn-s3-demo-bucket
" } } }, { "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::123456789012
:user/myUser" }, "Action": "redshift:CreateInboundIntegration", "Resource": "arn:aws:redshift-serverless:us-east-1
:123456789012
:namespace/namespace-1", "Condition": { "StringEquals": { "aws:SourceArn": "arn:aws:s3:::amzn-s3-demo-bucket
" } } } ] }
Create an S3 event integration
To set up your copy job, you first define an S3 event integration.
Amazon Redshift then creates an S3 event integration with its associated source and target, status, and information about the status of an associated auto-copy job. You can view information about an S3 event integration on the Amazon Redshift console by choosing S3 event integrations, and choosing the integration to show its details. Integrations are separated by those created In my account and From other accounts. The In my account list shows integrations where the source and target are in the same account. The From other accounts list shows integrations where the source is owned by another account.
If you delete an S3 event integration, the corresponding COPY JOB status changes from 1
(active) to 0
(inactive/pending).
However, the corresponding COPY JOB is not automatically dropped. If later you try to create a COPY JOB with the same name, there might be a conflict.
Create and monitor a COPY JOB
After the integration is created, on the S3 event integration details page for the integration you created,
choose Create autocopy job to go to Amazon Redshift query editor v2 where you can create the auto-copy job for the integration.
Amazon Redshift matches the bucket in the FROM clause in the COPY JOB CREATE statement to the bucket used in S3 event integration.
For information about how to use Amazon Redshift query editor v2, see
Querying a database using the Amazon Redshift query editor v2 in the Amazon Redshift Management Guide.
For example, run the following COPY command in query editor v2 to create an automatic COPY JOB that matches
the Amazon S3 bucket s3://amzn-s3-demo-bucket/staging-folder
to an Amazon S3 event integration.
COPY public.target_table FROM 's3://amzn-s3-demo-bucket/staging-folder' IAM_ROLE 'arn:aws:iam::123456789012:role/MyLoadRoleName' JOB CREATE my_copy_job_name AUTO ON;
You define a COPY JOB one time. The same parameters are used for future runs.
To define and manage a COPY JOB, you must have permission. For information about granting and revoking permission on a COPY JOB, see GRANT and REVOKE. For more information about granting and revoking scoped permissions for a COPY JOB, see Granting scoped permissions and Revoking scoped permissions.
You manage the load operations using options to CREATE, LIST, SHOW, DROP, ALTER, and RUN jobs. For more information, see COPY JOB.
You can query system views to see the COPY JOB status and progress. Views are provided as follows:
SYS_COPY_JOB – contains a row for each currently defined COPY JOB.
SYS_COPY_JOB_DETAIL – contains details on pending, error, and ingested files for each COPY JOB.
SYS_COPY_JOB_INFO – contains messages logged about a COPY JOB.
SYS_LOAD_HISTORY – contains details of COPY commands.
SYS_LOAD_ERROR_DETAIL – contains details of COPY command errors.
SVV_COPY_JOB_INTEGRATIONS – contains details of S3 event integrations.
STL_LOAD_ERRORS – contains errors from COPY commands.
STL_LOAD_COMMITS – contains information used to troubleshoot a COPY command data load.
For information about troubleshooting S3 event integration errors, see Troubleshooting S3 event integration and COPY JOB errors.
To get the list of files loaded by a COPY JOB, run the following SQL, but first replace <job_id>
:
SELECT job_id, job_name, data_source, copy_query, filename, status, curtime FROM sys_copy_job copyjob JOIN stl_load_commits loadcommit ON copyjob.job_id = loadcommit.copy_job_id WHERE job_id =
<job_id>
;
Limitations when creating S3 event integration for auto-copy
Consider the following when using auto-copy.
You can create a maximum of 200 COPY JOBS for each cluster or workgroup in an AWS account.
You can create a maximum of 35 copy jobs for each Amazon S3 bucket.
You can create a maximum of 50 S3 event integrations for each Amazon Redshift target.
You can't create an S3 event integration with a source Amazon S3 bucket which has a period (.) in the bucket name.