Unload data from an Amazon Redshift cluster across accounts to Amazon S3
Created by Andrew Kamel (AWS)
Code repository: aws-unload-redshift-to-s3-python | Environment: Production | Technologies: Databases; Analytics; Serverless |
Workload: Open-source | AWS services: AWS Lambda; Amazon Redshift; Amazon S3; AWS Secrets Manager |
Summary
When you test applications, it's helpful to have production data in your test environment. Using production data can give you a more accurate assessment of the application that you're developing.
This pattern extracts data from an Amazon Redshift cluster in a production environment to an Amazon Simple Storage Service (Amazon S3) bucket in a development environment on Amazon Web Services (AWS).
The pattern steps through the setup of both DEV and PROD accounts, including the following:
Required resources
AWS Identity and Access Management (IAM) roles
Network adjustments to subnets, security groups, and the virtual private cloud (VPC) to support the Amazon Redshift connection
An example AWS Lambda function with a Python runtime for testing the architecture
To grant access to the Amazon Redshift cluster, the pattern uses AWS Secrets Manager to store the relevant credentials. The benefit is that the Lambda function has all the information it needs to connect directly to the Amazon Redshift cluster without knowing where the cluster resides. Additionally, you can monitor use of the secret.
The secret saved in Secrets Manager includes the Amazon Redshift cluster's host, database name, port, and relevant credentials.
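For illustration, the secret's JSON payload might have the following shape. The key names are assumptions used in this pattern's examples, not values mandated by Secrets Manager:

```python
# Hypothetical shape of the Amazon Redshift secret stored in Secrets Manager.
# The key names (host, port, dbname, username, password) and all values are
# illustrative placeholders.
redshift_secret = {
    "host": "prod-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "port": 5439,
    "dbname": "prod_db",
    "username": "unload_user",
    "password": "example-password",
}
```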
For information about security considerations when using this pattern, see the Best practices section.
Prerequisites and limitations
Prerequisites
An Amazon Redshift cluster running in the PROD account
An S3 bucket created in the DEV account
VPC peering between the DEV and PROD accounts, with route tables adjusted accordingly
DNS hostnames and DNS resolution enabled for both peered VPCs
Limitations
Depending on the amount of data that you want to query, the Lambda function might time out. If your run takes more time than the maximum Lambda timeout (15 minutes), use an asynchronous approach for your Lambda code. The code example for this pattern uses the psycopg2 library for Python, which doesn't currently support asynchronous processing.
Some AWS services aren't available in all AWS Regions. For Region availability, see AWS services by Region. For specific endpoints, see the Service endpoints and quotas page, and choose the link for the service.
Architecture
The following diagram shows the target architecture, with DEV and PROD accounts.
The diagram shows the following workflow:
The Lambda function in the DEV account assumes the IAM role that's required to access the Amazon Redshift credentials in Secrets Manager in the PROD account.
The Lambda function then retrieves the Amazon Redshift cluster secret.
The Lambda function in the DEV account uses the information to connect to the Amazon Redshift cluster in the PROD account through the peered VPCs.
The Lambda function then sends an unload command to query the Amazon Redshift cluster in the PROD account.
The Amazon Redshift cluster in the PROD account assumes the relevant IAM role to access the S3 bucket in the DEV account.
The Amazon Redshift cluster unloads the queried data to the S3 bucket in the DEV account.
Querying data from Amazon Redshift
The following diagram shows the roles that are used to retrieve the Amazon Redshift credentials and connect to the Amazon Redshift cluster. The workflow is initiated by the Lambda function.
The diagram shows the following workflow:
The CrossAccount-SM-Read-Role in the DEV account assumes the SM-Read-Role in the PROD account.
The SM-Read-Role role uses the attached policy to retrieve the secret from Secrets Manager.
The credentials are used to access the Amazon Redshift cluster.
Uploading data to Amazon S3
The following diagram shows the cross-account read-write process for extracting data and uploading it to Amazon S3. The workflow is initiated by the Lambda function. The pattern chains IAM roles in Amazon Redshift. The unload command that comes from the Amazon Redshift cluster assumes the CrossAccount-S3-Write-Role, and then assumes the S3-Write-Role. This role chaining gives Amazon Redshift access to Amazon S3. An example unload statement that uses chained roles appears after the following workflow steps.
The workflow includes the following steps:
The CrossAccount-SM-Read-Role in the DEV account assumes the SM-Read-Role in the PROD account.
The SM-Read-Role retrieves the Amazon Redshift credentials from Secrets Manager.
The Lambda function connects to the Amazon Redshift cluster and sends a query.
The Amazon Redshift cluster assumes the CrossAccount-S3-Write-Role.
The CrossAccount-S3-Write-Role assumes the S3-Write-Role in the DEV account.
The query results are unloaded to the S3 bucket in the DEV account.
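The chained roles appear as a comma-separated list in the unload command's IAM_ROLE clause. The following is a minimal sketch; the account IDs, bucket, and table are hypothetical placeholders:

```python
# Sketch of an UNLOAD statement that chains IAM roles. Amazon Redshift assumes
# CrossAccount-S3-Write-Role first, and then S3-Write-Role, which grants write
# access to the S3 bucket in the DEV account. Account IDs (111111111111 for
# PROD, 222222222222 for DEV), bucket, and table names are hypothetical.
unload_statement = """
    UNLOAD ('SELECT * FROM public.example_table')
    TO 's3://dev-account-unload-bucket/redshift-export/'
    IAM_ROLE 'arn:aws:iam::111111111111:role/CrossAccount-S3-Write-Role,arn:aws:iam::222222222222:role/S3-Write-Role'
    FORMAT AS PARQUET;
"""
```

Amazon Redshift assumes each role in the chain from left to right, which is what gives the cluster in the PROD account write access to the bucket in the DEV account.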
Tools
AWS services
AWS Key Management Service (AWS KMS) helps you create and control cryptographic keys to help protect your data.
AWS Lambda is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.
Amazon Redshift is a managed petabyte-scale data warehouse service in the AWS Cloud.
AWS Secrets Manager helps you replace hardcoded credentials in your code, including passwords, with an API call to Secrets Manager to retrieve the secret programmatically.
Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
Code repository
The code for this pattern is available in the GitHub unload-redshift-to-s3-python repository.
Best practices
Security disclaimer
Before you implement this solution, consider the following important security recommendations:
Remember that connecting development and production accounts can increase the scope of access and lower your overall security posture. We recommend deploying this solution only temporarily, extracting the required portion of data, and then immediately destroying the deployed resources. To destroy the resources, delete the Lambda function, remove any IAM roles and policies created for this solution, and revoke any network access that was granted between the accounts.
Consult your security and compliance teams before copying any data from production to development environments. Personally identifiable information (PII), protected health information (PHI), and other confidential or regulated data should generally not be copied in this manner. Copy only publicly available, non-confidential information (for example, public stock data from a shop frontend). Whenever possible, consider tokenizing or anonymizing data, or generating synthetic test data, instead of using production data. One of the AWS security principles is to keep people away from data. In other words, developers should not perform operations in the production account.
Restrict access to the Lambda function in the development account because it can read data from the Amazon Redshift cluster in the production environment.
To avoid disrupting the production environment, implement the following recommendations:
Use a separate, dedicated development account for testing and development activities.
Implement strict network access controls and limit traffic between accounts to only what is necessary.
Monitor and audit access to the production environment and data sources.
Implement least-privilege access controls for all resources and services involved.
Regularly review and rotate credentials, such as AWS Secrets Manager secrets and IAM role access keys.
Refer to the security documentation for each of the services that are used in this pattern.
Security is a top priority when accessing production data and resources. Always follow best practices, implement least-privilege access controls, and regularly review and update your security measures.
Epics
Task | Description | Skills required |
---|---|---|
Create a secret for the Amazon Redshift cluster. | In the PROD account, create a Secrets Manager secret that stores the Amazon Redshift cluster's host, database name, port, and credentials (see the example sketch after this table). | DevOps engineer |
Create a role to access Secrets Manager. | In the PROD account, create the SM-Read-Role with a permissions policy that allows reading the secret and a trust policy that allows the CrossAccount-SM-Read-Role in the DEV account to assume it (see the example sketch after this table). | DevOps engineer |
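A minimal boto3 sketch of this epic, run with credentials for the PROD account. The role names follow the Architecture section; the account IDs, secret name, and values are hypothetical placeholders:

```python
import json

import boto3

secretsmanager = boto3.client("secretsmanager")
iam = boto3.client("iam")

# Store the cluster connection details as a secret.
secretsmanager.create_secret(
    Name="prod/redshift/cluster-credentials",  # hypothetical secret name
    SecretString=json.dumps({
        "host": "prod-cluster.abc123.us-east-1.redshift.amazonaws.com",
        "port": 5439,
        "dbname": "prod_db",
        "username": "unload_user",
        "password": "example-password",
    }),
)

# Create SM-Read-Role, trusted by the DEV account's CrossAccount-SM-Read-Role.
iam.create_role(
    RoleName="SM-Read-Role",
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::222222222222:role/CrossAccount-SM-Read-Role"},
            "Action": "sts:AssumeRole",
        }],
    }),
)

# Allow SM-Read-Role to read the cluster secret.
iam.put_role_policy(
    RoleName="SM-Read-Role",
    PolicyName="ReadRedshiftSecret",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:us-east-1:111111111111:secret:prod/redshift/cluster-credentials-*",
        }],
    }),
)
```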
Task | Description | Skills required |
---|---|---|
Create a role to access the S3 bucket. | In the DEV account, create the S3-Write-Role with a permissions policy that allows writing to the S3 bucket and a trust policy that allows the CrossAccount-S3-Write-Role in the PROD account to assume it (see the example sketch after this table). | DevOps engineer |
Create the Amazon Redshift role. | In the PROD account, create the CrossAccount-S3-Write-Role, allow it to assume the S3-Write-Role in the DEV account, and attach it to the Amazon Redshift cluster (see the example sketch after this table). | DevOps engineer |
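A sketch of this epic follows, assuming hypothetical account IDs (111111111111 for PROD, 222222222222 for DEV), a hypothetical bucket, and a hypothetical cluster identifier. The DEV-side calls must run with DEV-account credentials and the PROD-side calls with PROD-account credentials:

```python
import json

import boto3

# --- DEV account: S3-Write-Role, writable on the unload bucket ---
dev_iam = boto3.client("iam")  # run with DEV-account credentials

dev_iam.create_role(
    RoleName="S3-Write-Role",
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111111111111:role/CrossAccount-S3-Write-Role"},
            "Action": "sts:AssumeRole",
        }],
    }),
)
dev_iam.put_role_policy(
    RoleName="S3-Write-Role",
    PolicyName="WriteUnloadBucket",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetBucketLocation"],
            "Resource": [
                "arn:aws:s3:::dev-account-unload-bucket",
                "arn:aws:s3:::dev-account-unload-bucket/*",
            ],
        }],
    }),
)

# --- PROD account: CrossAccount-S3-Write-Role, assumed by Amazon Redshift ---
prod_iam = boto3.client("iam")  # run with PROD-account credentials

prod_iam.create_role(
    RoleName="CrossAccount-S3-Write-Role",
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)
prod_iam.put_role_policy(
    RoleName="CrossAccount-S3-Write-Role",
    PolicyName="AssumeDevS3WriteRole",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::222222222222:role/S3-Write-Role",
        }],
    }),
)

# Attach the role to the cluster so the unload command can reference it.
boto3.client("redshift").modify_cluster_iam_roles(
    ClusterIdentifier="prod-cluster",  # hypothetical cluster identifier
    AddIamRoles=["arn:aws:iam::111111111111:role/CrossAccount-S3-Write-Role"],
)
```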
Task | Description | Skills required |
---|---|---|
Deploy the Lambda function. | Deploy the Lambda function into the peered VPC, attached to subnets and security groups that can reach the Amazon Redshift cluster in the PROD account (see the example sketch after this table). | DevOps engineer |
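A minimal deployment sketch with boto3. The function name, role, subnets, and security group are hypothetical placeholders. Because the function uses psycopg2, that library must be included in the deployment package or provided by a Lambda layer:

```python
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Deploy the function into subnets that are routable to the PROD VPC
# through the peering connection. All identifiers are placeholders.
with open("lambda_function.zip", "rb") as package:
    lambda_client.create_function(
        FunctionName="unload-redshift-to-s3",
        Runtime="python3.12",
        Role="arn:aws:iam::222222222222:role/CrossAccount-SM-Read-Role",
        Handler="lambda_function.lambda_handler",
        Code={"ZipFile": package.read()},
        Timeout=900,  # maximum Lambda timeout (15 minutes)
        VpcConfig={
            "SubnetIds": ["subnet-0example1", "subnet-0example2"],
            "SecurityGroupIds": ["sg-0example"],
        },
    )
```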
Task | Description | Skills required |
---|---|---|
Import the required resources. | Install and import the libraries that the function depends on, such as boto3 and the psycopg2 library for Python (see the combined example after this table). | App developer |
Run the Lambda handler function. | The Lambda function uses AWS Security Token Service (AWS STS) for cross-account access and temporary credential management. The function uses the AssumeRole API operation to temporarily assume the permissions of the SM-Read-Role in the PROD account. See the combined example after this table. | App developer |
Get the secret. | Use the temporary credentials to retrieve the Amazon Redshift secret from Secrets Manager (see the combined example after this table). | App developer |
Run the unload command. | Connect to the Amazon Redshift cluster and run an unload statement that writes the query results to the S3 bucket in the DEV account (see the combined example after this table). | App developer |
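The tasks in this epic are combined into the following sketch of a complete handler. It assumes hypothetical ARNs, names, and a placeholder query; the secret's key names match the illustrative shape shown in the Summary section:

```python
import json

import boto3
import psycopg2  # bundled with the deployment package or a Lambda layer

# All ARNs, names, and the SQL statement are hypothetical placeholders.
PROD_SM_READ_ROLE_ARN = "arn:aws:iam::111111111111:role/SM-Read-Role"
SECRET_ID = "prod/redshift/cluster-credentials"
UNLOAD_STATEMENT = """
    UNLOAD ('SELECT * FROM public.example_table')
    TO 's3://dev-account-unload-bucket/redshift-export/'
    IAM_ROLE 'arn:aws:iam::111111111111:role/CrossAccount-S3-Write-Role,arn:aws:iam::222222222222:role/S3-Write-Role'
    FORMAT AS PARQUET;
"""


def get_redshift_secret():
    # Assume the SM-Read-Role in the PROD account for temporary credentials.
    sts = boto3.client("sts")
    assumed = sts.assume_role(
        RoleArn=PROD_SM_READ_ROLE_ARN,
        RoleSessionName="redshift-secret-read",
    )
    credentials = assumed["Credentials"]

    # Use the temporary credentials to read the secret in the PROD account.
    secretsmanager = boto3.client(
        "secretsmanager",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )
    response = secretsmanager.get_secret_value(SecretId=SECRET_ID)
    return json.loads(response["SecretString"])


def lambda_handler(event, context):
    secret = get_redshift_secret()

    # Connect to the PROD cluster through the peered VPCs.
    connection = psycopg2.connect(
        host=secret["host"],
        port=secret["port"],
        dbname=secret["dbname"],
        user=secret["username"],
        password=secret["password"],
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute(UNLOAD_STATEMENT)
        connection.commit()
    finally:
        connection.close()

    return {"statusCode": 200, "body": "Unload statement submitted."}
```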
Task | Description | Skills required |
---|---|---|
Delete the Lambda function. | To avoid incurring unplanned costs, remove the resources and the connection between the DEV and PROD accounts. Start by deleting the Lambda function in the DEV account (see the cleanup sketch after this table). | DevOps engineer |
Remove the IAM roles and policies. | In the DEV account, remove the CrossAccount-SM-Read-Role and the S3-Write-Role and their policies. In the PROD account, remove the SM-Read-Role and the CrossAccount-S3-Write-Role and their policies. | DevOps engineer |
Delete the secret in Secrets Manager. | Delete the Amazon Redshift cluster secret in the PROD account (see the cleanup sketch after this table). | DevOps engineer |
Remove VPC peering and security group rules. | Delete the VPC peering connection between the DEV and PROD accounts, remove the associated route table entries, and revoke the security group rules that allowed cross-account traffic. | DevOps engineer |
Remove data from the S3 bucket. | Delete the unloaded data from the S3 bucket in the DEV account (see the cleanup sketch after this table). | DevOps engineer |
Clean up AWS KMS keys. | If you created any custom AWS KMS keys for encryption, schedule them for deletion. | DevOps engineer |
Review and delete Amazon CloudWatch logs. | Delete the Lambda function's CloudWatch log groups if you no longer need them. | DevOps engineer |
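A hedged cleanup sketch with boto3 follows. All identifiers are placeholders, and each call must run with credentials for the account that owns the resource:

```python
import boto3

# DEV account: delete the Lambda function.
boto3.client("lambda").delete_function(FunctionName="unload-redshift-to-s3")

# DEV account: delete the unloaded data from the S3 bucket prefix.
s3 = boto3.resource("s3")
s3.Bucket("dev-account-unload-bucket").objects.filter(
    Prefix="redshift-export/"
).delete()

# PROD account: delete the secret (scheduled with a recovery window).
boto3.client("secretsmanager").delete_secret(
    SecretId="prod/redshift/cluster-credentials",
    RecoveryWindowInDays=7,
)

# Remove the VPC peering connection between the accounts.
boto3.client("ec2").delete_vpc_peering_connection(
    VpcPeeringConnectionId="pcx-0example"
)

# Delete each IAM role after deleting or detaching its policies;
# SM-Read-Role is shown here, and the other roles follow the same steps.
iam = boto3.client("iam")
iam.delete_role_policy(RoleName="SM-Read-Role", PolicyName="ReadRedshiftSecret")
iam.delete_role(RoleName="SM-Read-Role")
```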
Related resources
Additional information
After you unload the data from Amazon Redshift to Amazon S3, you can analyze it by using Amazon Athena.
Amazon Athena is a serverless, interactive query service that's beneficial when you need to analyze large volumes of data. You can use Athena without having to provision servers or databases. Athena supports complex queries, and you can run those queries against many different objects.
As with most AWS services, the main benefit of using Athena is the flexibility that it provides in how you run queries, without added operational complexity. When you use Athena, you can query different data types, such as CSV and JSON, in Amazon S3 without changing the data format. You can query data from various sources, including sources outside AWS. Athena reduces complexity because you don't have to manage servers, and it reads data directly from Amazon S3 without loading or changing the data before you run a query.
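For example, the following sketch submits a query against the unloaded data. The database, table, and output location are hypothetical; the table would first be defined over the unloaded S3 prefix (for example, with a CREATE EXTERNAL TABLE statement or an AWS Glue crawler):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Query the unloaded data in place. All identifiers are placeholders.
response = athena.start_query_execution(
    QueryString="SELECT * FROM example_table LIMIT 10;",
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={
        "OutputLocation": "s3://dev-account-unload-bucket/athena-results/"
    },
)
print(response["QueryExecutionId"])
```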