Execute Amazon Redshift SQL queries by using Terraform
Created by Sylvia Qi (AWS) and Aditya Ambati (AWS)
Code repository: terraform-execute-redshift-sql | Environment: PoC or pilot | Technologies: DevOps; Analytics; Big data; Data lakes |
Workload: Open-source | AWS services: Amazon Redshift; AWS CLI |
Summary
Using infrastructure as code (IaC) for the deployment and management of Amazon Redshift is a prevalent practice within DevOps. IaC facilitates the deployment and configuration of various Amazon Redshift resources, such as clusters, snapshots, and parameter groups. However, IaC doesn’t extend to the management of database resources like tables, schemas, views, and stored procedures. These database elements are managed through SQL queries and are not directly supported by IaC tools. Although solutions and tools exist for managing these resources, you might prefer not to introduce additional tools into your technology stack.
This pattern outlines a methodology that uses Terraform to deploy Amazon Redshift database resources, including tables, schemas, views, and stored procedures. The pattern distinguishes between two types of SQL queries:
Nonrepeatable queries – These queries are executed once during the initial Amazon Redshift deployment to establish the essential database components.
Repeatable queries – These queries are idempotent, so they can be rerun without adversely affecting the database. The solution uses Terraform to monitor changes in repeatable queries and apply them accordingly, as the example after this list illustrates.
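For illustration only, the following snippet contrasts the two query types. The table name comes from a sample file path shown later in this pattern's example output; the view is a hypothetical repeatable query.

-- Nonrepeatable: creates a table once during the initial deployment.
-- Rerunning it as-is would fail because the table already exists.
CREATE TABLE admin.application_family (
    family_id   INTEGER,
    family_name VARCHAR(100)
);

-- Repeatable: CREATE OR REPLACE is idempotent, so the statement can be rerun
-- safely whenever the view definition changes. (Hypothetical view name.)
CREATE OR REPLACE VIEW admin.v_application_family AS
SELECT family_id, family_name
FROM admin.application_family;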
For more details, see Solution walkthrough in Additional information.
Prerequisites and limitations
Prerequisites
You must have an active AWS account and install the following on your deployment machine:
AWS Command Line Interface (AWS CLI)
An AWS CLI profile configured with Amazon Redshift read/write permissions
Terraform version 1.6.2 or later
Limitations
This solution supports a single Amazon Redshift database because Terraform only allows for the creation of one database during cluster creation.
This pattern doesn’t include tests to validate changes to the repeatable queries before applying them. We recommend that you incorporate such tests for enhanced reliability.
To illustrate the solution, this pattern provides a sample redshift.tf file that uses a local Terraform state file. However, for production environments, we strongly recommend that you employ a remote state file with a locking mechanism for enhanced stability and collaboration.

Some AWS services aren’t available in all AWS Regions. For Region availability, see AWS services by Region. For specific endpoints, see Service endpoints and quotas, and choose the link for the service.
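Regarding the remote state recommendation above, one common approach (not part of the sample code) is an Amazon S3 backend with DynamoDB state locking; the bucket, key, and table names below are placeholders.

terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"   # placeholder bucket name
    key            = "redshift/terraform.tfstate"  # placeholder state file key
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"        # placeholder table used for state locking
    encrypt        = true
  }
}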
Product versions
This solution was developed and tested on Amazon Redshift patch 179.
Code repository
The code for this pattern is available in the GitHub amazon-redshift-sql-deploy-terraform repository.
Architecture
The following diagram illustrates how Terraform manages the Amazon Redshift database resources by handling both nonrepeatable and repeatable SQL queries.
The diagram shows the following steps:
Terraform applies nonrepeatable SQL queries during the initial Amazon Redshift cluster deployment.
The developer commits changes to the repeatable SQL queries.
Terraform monitors changes in the repeatable SQL queries.
Terraform applies repeatable SQL queries to the Amazon Redshift database.
The solution provided by this pattern is built on the Terraform module for Amazon Redshift and uses terraform_data resources, which invoke a custom Python script to execute SQL queries through the Amazon Redshift ExecuteStatement API operation. As a result, the module can do the following (a sketch of a module invocation follows this list):
Deploy any number of database resources by using SQL queries after the database is provisioned.
Monitor continuously for changes in the repeatable SQL queries and apply those changes using Terraform.
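For orientation, a minimal sketch of a module invocation is shown below. Only the module name, the run_*_queries and sql_path_* settings, database_name, and redshift_secret_arn come from elsewhere in this pattern; the module source path and the values are assumptions for illustration.

module "redshift" {
  source = "./modules/redshift"   # assumed local path to the module

  database_name       = "db1"     # example database name from the sample output
  redshift_secret_arn = "arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-admin"   # placeholder

  # Control which SQL query folders are executed (see the variables described
  # in Additional information).
  run_nonrepeatable_queries = true
  run_repeatable_queries    = true
  sql_path_bootstrap        = "src/redshift/bootstrap"
  sql_path_nonrepeatable    = "src/redshift/nonrepeatable"
  sql_path_repeatable       = "src/redshift/repeatable"
  sql_path_finalize         = "src/redshift/finalize"
}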
For more details, see Solution walkthrough in Additional information.
Tools
AWS services
Amazon Redshift is a fully managed petabyte-scale data warehouse service in the AWS Cloud.
Other tools
Best practices
Epics
Task | Description | Skills required |
---|---|---|
Clone the repository. | To clone the Git repository containing the Terraform code for provisioning an Amazon Redshift cluster, use the git clone command (see the example commands after this table). | DevOps engineer |
Update the Terraform variables. | To customize the Amazon Redshift cluster deployment according to your specific requirements, update the deployment parameters in the Terraform variables file. | DevOps engineer |
Deploy the resources using Terraform. | Initialize Terraform and apply the configuration to provision the Amazon Redshift cluster and its database resources. | DevOps engineer |
(Optional) Execute additional SQL queries. | The sample repository provides several SQL queries for demo purposes. To execute your own SQL queries, add them to the corresponding SQL query folders (bootstrap, nonrepeatable, repeatable, finalize). | DevOps engineer |
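The following commands sketch the end-to-end workflow for these tasks. The repository URL is an assumption based on the repository name given in this pattern, and the variables file name is a common Terraform convention rather than something confirmed by the pattern; adjust both to match the actual repository.

# Clone the repository (assumed URL; replace with the actual repository URL).
git clone https://github.com/aws-samples/amazon-redshift-sql-deploy-terraform.git
cd amazon-redshift-sql-deploy-terraform

# Update the Terraform variables (file name is a convention, not confirmed by this pattern).
vi terraform.tfvars

# Deploy the resources.
terraform init
terraform plan
terraform apply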
Task | Description | Skills required |
---|---|---|
Monitor the deployment of SQL statements. | You can monitor the results of the SQL executions against the Amazon Redshift cluster. For examples of output that show a failed and a successful SQL execution, see Example SQL statements in Additional information. | DBA, DevOps engineer |
Clean up resources. | To delete all the resources deployed by Terraform, run the terraform destroy command (see the example after this table). | DevOps engineer |
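The clean-up step is the standard Terraform destroy workflow; run it from the directory that contains the Terraform configuration.

# Destroy all resources that this Terraform configuration manages,
# including the Amazon Redshift cluster.
terraform destroy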
Task | Description | Skills required |
---|---|---|
Validate the data in the Amazon Redshift cluster. | Connect to the Amazon Redshift database and confirm that the schemas, tables, views, and stored procedures defined in your SQL queries were created as expected. | DBA, AWS DevOps |
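As one way to check, assuming you have query access to the cluster (for example, through the Amazon Redshift query editor), you can list the user-defined tables and views with a query such as the following.

-- List user-defined tables and views outside of the system schemas.
SELECT table_schema, table_name, table_type
FROM svv_tables
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name;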
Related resources
AWS documentation
Other resources
Command: apply (Terraform documentation)
Additional information
Solution walkthrough
To use the solution, you must organize your Amazon Redshift SQL queries in a specific way. All SQL queries must be stored in files with a .sql extension.
In the code example provided with this pattern, the SQL queries are organized in the following folder structure. You can modify the code (sql-queries.tf and sql-queries.py) to work with any structure that fits your unique use case.
/bootstrap
    |- Any # of files
    |- Any # of sub-folders
/nonrepeatable
    |- Any # of files
    |- Any # of sub-folders
/repeatable
    /udf
        |- Any # of files
        |- Any # of sub-folders
    /table
        |- Any # of files
        |- Any # of sub-folders
    /view
        |- Any # of files
        |- Any # of sub-folders
    /stored-procedure
        |- Any # of files
        |- Any # of sub-folders
/finalize
    |- Any # of files
    |- Any # of sub-folders
Given the preceding folder structure, during Amazon Redshift cluster deployment, Terraform executes the queries in the following order:
/bootstrap
/nonrepeatable
/repeatable
/finalize
The /repeatable folder contains four subfolders: /udf, /table, /view, and /stored-procedure. These subfolders indicate the order in which Terraform executes the SQL queries.
The Python script that executes the SQL queries is sql-queries.py. First, the script reads all the files and subfolders of a specific source directory, for example, the directory specified by the sql_path_bootstrap parameter. Then the script executes the queries by calling the Amazon Redshift ExecuteStatement API operation. A file might contain one or more SQL queries. The following code snippet shows the Python function that executes SQL statements stored in a file against an Amazon Redshift cluster.
def execute_sql_statement(filename, cluster_id, db_name, secret_arn, aws_region):
    """Execute SQL statements in a file"""
    redshift_client = boto3.client(
        'redshift-data', region_name=aws_region)
    contents = get_contents_from_file(filename),
    response = redshift_client.execute_statement(
        Sql=contents[0],
        ClusterIdentifier=cluster_id,
        Database=db_name,
        WithEvent=True,
        StatementName=filename,
        SecretArn=secret_arn
    )
    ...
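The ExecuteStatement operation is asynchronous. The example output later in this section shows a Status and an error message for each statement; the following sketch shows one way such status information can be retrieved with the Redshift Data API DescribeStatement operation. This helper is illustrative and not part of the sample repository.

import time
import boto3

def wait_for_statement(statement_id, aws_region):
    """Poll the Redshift Data API until a statement finishes.

    Illustrative helper, not part of the sample repository. Returns the final
    status ('FINISHED', 'FAILED', or 'ABORTED') and the error message, if any.
    """
    client = boto3.client('redshift-data', region_name=aws_region)
    while True:
        description = client.describe_statement(Id=statement_id)
        status = description['Status']
        if status in ('FINISHED', 'FAILED', 'ABORTED'):
            return status, description.get('Error')
        time.sleep(1)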
The Terraform script sql-queries.tf creates the terraform_data resources that run the sql-queries.py script. There is a terraform_data resource for each of the four folders: /bootstrap, /nonrepeatable, /repeatable, and /finalize. The following code snippet shows the terraform_data resource that executes the SQL queries in the /bootstrap folder.
locals {
  program               = "${path.module}/sql-queries.py"
  redshift_cluster_name = try(aws_redshift_cluster.this[0].id, null)
}

resource "terraform_data" "run_bootstrap_queries" {
  count      = var.create && var.run_nonrepeatable_queries && (var.sql_path_bootstrap != "") && (var.snapshot_identifier == null) ? 1 : 0
  depends_on = [aws_redshift_cluster.this[0]]

  provisioner "local-exec" {
    command = "python3 ${local.program} ${var.sql_path_bootstrap} ${local.redshift_cluster_name} ${var.database_name} ${var.redshift_secret_arn} ${local.aws_region}"
  }
}
You can control whether to run these queries by using the following variables. If you don’t want to run queries in sql_path_bootstrap, sql_path_nonrepeatable, sql_path_repeatable, or sql_path_finalize, set their values to "".
run_nonrepeatable_queries = true
run_repeatable_queries    = true
sql_path_bootstrap        = "src/redshift/bootstrap"
sql_path_nonrepeatable    = "src/redshift/nonrepeatable"
sql_path_repeatable       = "src/redshift/repeatable"
sql_path_finalize         = "src/redshift/finalize"
When you run terraform apply, Terraform considers the terraform_data resource added after the script is completed, regardless of the results of the script. If some SQL queries failed and you want to rerun them, you can manually remove the resource from the Terraform state and run terraform apply again. For example, the following command removes the run_bootstrap_queries resource from the Terraform state.
terraform state rm module.redshift.terraform_data.run_bootstrap_queries[0]
The following code example shows how the run_repeatable_queries resource monitors changes in the repeatable folder. The resource computes a SHA-256 hash over every file in the folder; when any file changes, the hash changes, so Terraform replaces the resource and reruns the queries on the next terraform apply.
resource "terraform_data" "run_repeatable_queries" { count = var.create_redshift && var.run_repeatable_queries && (var.sql_path_repeatable != "") ? 1 : 0 depends_on = [terraform_data.run_nonrepeatable_queries] # Continuously monitor and apply changes in the repeatable folder triggers_replace = { dir_sha256 = sha256(join("", [for f in fileset("${var.sql_path_repeatable}", "**") : filesha256("${var.sql_path_repeatable}/${f}")])) } provisioner "local-exec" { command = "python3 ${local.sql_queries} ${var.sql_path_repeatable} ${local.redshift_cluster_name} ${var.database_name} ${var.redshift_secret_arn}" } }
To refine the code, you can implement a mechanism that detects and applies changes only to the files that have been updated within the repeatable folder, rather than applying changes to all files indiscriminately. A sketch of one such refinement follows.
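The following is a minimal sketch of that refinement, assuming sql-queries.py is adjusted to accept a single .sql file as its first argument in addition to a directory. The variable and local names follow the earlier snippets, but this resource is not part of the sample repository.

# One terraform_data resource per repeatable .sql file, so only files whose
# hash changed are rerun. (Sketch only; assumes sql-queries.py can take a
# single file path as its first argument.)
resource "terraform_data" "run_repeatable_query_per_file" {
  for_each   = fileset(var.sql_path_repeatable, "**/*.sql")
  depends_on = [terraform_data.run_nonrepeatable_queries]

  # Rerun only when this specific file changes.
  triggers_replace = {
    file_sha256 = filesha256("${var.sql_path_repeatable}/${each.value}")
  }

  provisioner "local-exec" {
    command = "python3 ${local.program} ${var.sql_path_repeatable}/${each.value} ${local.redshift_cluster_name} ${var.database_name} ${var.redshift_secret_arn} ${local.aws_region}"
  }
}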
Example SQL statements
The following output shows a failed SQL execution, along with an error message.
module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): Executing: ["/bin/sh" "-c" "python3 modules/redshift/sql-queries.py src/redshift/nonrepeatable testcluster-1 db1 arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:/redshift/master_user/password-8RapGH us-east-1"]
module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): -------------------------------------------------------------------
module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): src/redshift/nonrepeatable/table/admin/admin.application_family.sql
module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): -------------------------------------------------------------------
module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): Status: FAILED
module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): SQL execution failed.
module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): Error message: ERROR: syntax error at or near ")"
module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): Position: 244
module.redshift.terraform_data.run_nonrepeatable_queries[0]: Creation complete after 3s [id=ee50ba6c-11ae-5b64-7e2f-86fd8caa8b76]
The following output shows a successful SQL execution.
module.redshift.terraform_data.run_bootstrap_queries[0]: Provisioning with 'local-exec'...
module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): Executing: ["/bin/sh" "-c" "python3 modules/redshift/sql-queries.py src/redshift/bootstrap testcluster-1 db1 arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:/redshift/master_user/password-8RapGH us-east-1"]
module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): -------------------------------------------------------------------
module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): src/redshift/bootstrap/db.sql
module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): -------------------------------------------------------------------
module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): Status: FINISHED
module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): SQL execution successful.
module.redshift.terraform_data.run_bootstrap_queries[0]: Creation complete after 2s [id=d565ef6d-be86-8afd-8e90-111e5ea4a1be]