DR Orchestrator Framework overview
DR Orchestrator Framework provides a one-click solution to orchestrate and automate
cross-Region DR for AWS databases. It uses AWS Step Functions and AWS Lambda to perform the required steps during the
failover and failback. The Step Functions state machines provide the basis for decision making
within the orchestrator design. The API operations for performing failover or failback actions
are coded into Lambda functions that are called from within the state machine. The Lambda
functions use the AWS SDK for Python (Boto3).
DR Orchestrator Framework contains two main state machines that correspond to the failover and failback phases.
For Amazon RDS, the failover phase promotes a cross-Region RDS read replica into a standalone DB instance. For Amazon Aurora, when the primary Region is down during a rare, unexpected outage, its writer node isn't available. Replication between the writer node and the secondary clusters stops. You must detach the secondary cluster from the global database and promote it as a standalone cluster. Applications can connect and send write traffic to the standalone cluster. You can use this same process to switch over the primary DB cluster of the global database to the secondary Regions. Use this approach for controlled scenarios such as the following:
- Operational maintenance
- Planned operational procedures
- Promotion of an Amazon ElastiCache (Redis OSS) secondary cluster as your new primary cluster
The failback phase establishes live replication of data between a live primary Region and a new secondary Region.
It's critical to understand that DR Orchestrator applies to databases only. All the applications that reference these databases and are in the same Region might need a separate, tandem failover solution. After the databases fail over to the secondary Region, the applications need to be updated to connect to the new database instances, which will serve as the data source.
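The two promotion paths described in this overview each map onto a single call in the AWS SDK for Python (Boto3). The following is a minimal sketch, not the framework's actual Lambda code: the function name `promote_secondary`, the `engine` labels, and the keys in `params` are hypothetical. The client is passed in so that the logic can be exercised without AWS credentials. `promote_read_replica` and `remove_from_global_cluster` are existing operations on the Boto3 RDS client.

```python
def promote_secondary(rds_client, engine, params):
    """Promote the secondary database to a standalone, writable database.

    rds_client is expected to be boto3.client("rds") created in the DR Region.
    The engine labels and the params keys are illustrative assumptions.
    """
    if engine == "rds":
        # Promote a cross-Region read replica into a standalone DB instance.
        return rds_client.promote_read_replica(
            DBInstanceIdentifier=params["DBInstanceIdentifier"]
        )
    if engine == "aurora":
        # Detach the secondary cluster from the global database. The detached
        # cluster becomes a standalone cluster that accepts write traffic.
        return rds_client.remove_from_global_cluster(
            GlobalClusterIdentifier=params["GlobalClusterIdentifier"],
            DbClusterIdentifier=params["DbClusterArn"],
        )
    raise ValueError(f"Unsupported engine: {engine!r}")
```

In a Lambda function, the client would typically be created once outside the handler and reused across invocations.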
The failover process
To perform a failover, run the DR Orchestrator FAILOVER state machine. At this stage, a secondary database is already present in the secondary Region, either as a read replica (Amazon RDS) or as a secondary cluster (Amazon Aurora). When you run the DR Orchestrator FAILOVER state machine, it promotes the secondary database to become the primary.
DR Orchestrator FAILOVER architecture
The following diagram shows the concepts of the failover process for Amazon Aurora when using DR Orchestrator. Amazon Aurora and Amazon ElastiCache use the same workflow but with different state machines and Lambda functions.
- The DR Orchestrator FAILOVER state machine reads the input JSON parameters.
- Based on the resourceType parameter, the state machine calls other state machines: Promote RDS Read Replica, Failover Aurora Cluster, or Failover ElastiCache. If more than one resource is passed in the input, these state machines run in parallel.
- The Failover Aurora Cluster state machine calls Lambda functions in each of the following three steps.
- The Resolve imports Lambda function resolves "! import <export-variable-name>" with the actual values from the App-Stack AWS CloudFormation template.
- The Failover Aurora Cluster Lambda function promotes the read replica as a standalone DB instance.
- The Check Failover Status Lambda function checks the status of the promoted DB instance. After the status is AVAILABLE, the Lambda function sends a success token back to the calling state machine and completes.
- You can redirect your applications to the standalone database in the DR Region (us-west-2), which is now the primary database.
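The import-resolution step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the framework's Resolve imports code: the helper names `load_exports` and `resolve_imports` are hypothetical, and the sketch assumes that the placeholders refer to CloudFormation export names in the current Region. `list_exports` is an existing CloudFormation operation with a Boto3 paginator.

```python
import re

# Matches "!Import name" as well as "! import name" (case-insensitive).
_IMPORT_PATTERN = re.compile(r"^!\s*import\s+(\S+)$", re.IGNORECASE)


def load_exports(cfn_client):
    """Collect all CloudFormation exports in the Region as {name: value}.

    cfn_client is expected to be boto3.client("cloudformation").
    """
    exports = {}
    for page in cfn_client.get_paginator("list_exports").paginate():
        for export in page.get("Exports", []):
            exports[export["Name"]] = export["Value"]
    return exports


def resolve_imports(parameters, exports):
    """Return a copy of parameters with import placeholders replaced."""
    resolved = {}
    for key, value in parameters.items():
        match = _IMPORT_PATTERN.match(value) if isinstance(value, str) else None
        if match is None:
            # Literal value: pass it through unchanged.
            resolved[key] = value
            continue
        export_name = match.group(1)
        if export_name not in exports:
            raise KeyError(f"No CloudFormation export named {export_name!r}")
        resolved[key] = exports[export_name]
    return resolved
```

Failing fast on a missing export keeps a typo in the input JSON from reaching the promotion step.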
The failback process
After your former primary Region (us-east-1) is up again, you can fail back to it, so that the database in us-east-1 becomes the primary again. To start the failback, run the DR Orchestrator FAILBACK state machine. As the name indicates, this state machine starts replicating changes in your new primary Region (us-west-2) back to the former primary Region (us-east-1), which acts as the current secondary.
After replication is established between the two Regions, you can initiate the failback. To fail back and return to your original primary Region (us-east-1), run the DR Orchestrator FAILOVER state machine in the current secondary Region (us-east-1) to promote it to the primary Region.
DR Orchestrator FAILBACK architecture
The following diagram shows the concepts of the failback process for Amazon Aurora when using DR Orchestrator.
- Before beginning failback, take a manual DB snapshot to use when performing root cause analysis (RCA). Also, disable DeletionProtection for the Aurora cluster in the previous primary Region (us-east-1).
- The DR Orchestrator FAILBACK state machine reads the input JSON parameters.
- Based on the resourceType, the DR Orchestrator FAILBACK state machine calls the Create Aurora Secondary DB Cluster state machine.
- The Create Aurora Secondary DB Cluster state machine calls Lambda functions in each of the following five steps.
- The Resolve imports Lambda function resolves "! import <export-variable-name>" with the actual values from the App-Stack CloudFormation template.
- The Delete DB Instance Lambda function deletes the former primary instance.
- The Check DB instance status Lambda function checks the Delete DB Instance status until the DB instance is deleted.
- The Create Read Replica Lambda function creates a read replica in the secondary Region from the DB instance that's in the new primary Region.
- The Check DB instance status Lambda function checks the read replica DB instance status. When the status is AVAILABLE, the Lambda function sends a success token back to the calling state machine, which then completes.
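The final status check in the list above can be sketched as follows. This is a hedged illustration, not the framework's Lambda code: the function name `report_if_available` is hypothetical, and the sketch assumes that the state machine uses the Step Functions callback pattern, in which the task passes a task token to the Lambda function. `describe_db_instances` and `send_task_success` are existing Boto3 operations. The RDS API reports the status in lowercase (`available`), so the comparison ignores case.

```python
import json


def report_if_available(rds_client, sfn_client, db_instance_id, task_token):
    """Send a success token to Step Functions once the instance is available.

    rds_client: boto3.client("rds"); sfn_client: boto3.client("stepfunctions").
    Returns True when the token was sent, False when the caller should retry.
    """
    response = rds_client.describe_db_instances(
        DBInstanceIdentifier=db_instance_id
    )
    status = response["DBInstances"][0]["DBInstanceStatus"]
    if status.lower() != "available":
        # Not ready yet: the caller (for example, a retry loop in the
        # state machine) is expected to invoke this check again later.
        return False
    sfn_client.send_task_success(
        taskToken=task_token,
        output=json.dumps(
            {"DBInstanceIdentifier": db_instance_id, "status": status}
        ),
    )
    return True
```

Returning a Boolean instead of sleeping inside the function keeps each invocation short and leaves the wait interval to the caller.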
DR Orchestrator FAILOVER
Use the DR Orchestrator FAILOVER state machine in a DR event when the primary Region (us-east-1) is down, or during planned events such as operational maintenance.
The state machine can be called to fail over a single database or multiple databases in parallel.
The state machine accepts parameters in the JSON format as shown in the following code:
    {
      "StatePayload": [
        {
          "layer": 1,
          "resources": [
            {
              "resourceType": "PromoteRDSReadReplica",
              "resourceName": "Promote RDS MySQL Read Replica",
              "parameters": {
                "RDSInstanceIdentifier": "!Import rds-mysql-instance-identifier",
                "TargetClusterIdentifier": "!Import rds-mysql-instance-global-arn"
              }
            },
            {
              "resourceType": "FailoverElastiCacheCluster",
              "resourceName": "Failover ElastiCache Cluster",
              "parameters": {
                "GlobalReplicationGroupId": "!Import demo-redis-cluster-global-replication-group-id",
                "TargetRegion": "!Import demo-redis-cluster-target-region",
                "TargetReplicationGroupId": "!Import demo-redis-cluster-target-replication-group-id"
              }
            }
          ]
        }
      ]
    }
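One way to pass this JSON input to the state machine is the Step Functions `start_execution` operation in Boto3. The following is a minimal sketch: the helper names `build_payload` and `start_state_machine` are hypothetical, and the state machine ARN must come from your own deployment. You can also start the execution from the console with the same JSON.

```python
import json


def build_payload(resources, layer=1):
    """Wrap a list of resource dictionaries in the StatePayload structure."""
    return {"StatePayload": [{"layer": layer, "resources": list(resources)}]}


def start_state_machine(sfn_client, state_machine_arn, payload):
    """Start one execution and return its ARN.

    sfn_client is expected to be boto3.client("stepfunctions") created in
    the Region where the state machine is deployed.
    """
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # Step Functions expects the input as a JSON string.
        input=json.dumps(payload),
    )
    return response["executionArn"]


# Example resource, copied from the JSON input for an RDS read replica.
promote_rds = {
    "resourceType": "PromoteRDSReadReplica",
    "resourceName": "Promote RDS MySQL Read Replica",
    "parameters": {
        "RDSInstanceIdentifier": "!Import rds-mysql-instance-identifier",
        "TargetClusterIdentifier": "!Import rds-mysql-instance-global-arn",
    },
}
failover_payload = build_payload([promote_rds])
```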
Parameter details
The following table shows the parameters used by the DR Orchestrator FAILOVER state machine.
Parameter name | Description | Expected values |
---|---|---|
layer (required: number) | The processing sequence. All the resources defined in layer 1 must be run before the layer 2 resources are run. | 1 or 2, and so on |
resources (required: array of dictionary) | All the resources within a single layer run in parallel. | |
resourceType (required: string) | Type of the resource to identify the resource | PromoteRDSReadReplica or FailoverElastiCacheCluster |
resourceName (optional: string) | To identify which application portfolio these resources belong to | Promote RDS for MySQL Read Replica |
parameters (required: array of dictionary) | List of parameters required to fail over or fail back the AWS database | |
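The layer semantics in the table can be checked locally before you start an execution. The following sketch is an assumption about how you might validate an input, not part of the framework: the function name `plan_layers` is hypothetical. It sorts the layers in processing order, verifies that each resource carries the required keys, and returns the resource types that would run in parallel within each layer.

```python
REQUIRED_RESOURCE_KEYS = ("resourceType", "parameters")


def plan_layers(payload):
    """Return [(layer, [resourceType, ...]), ...] in processing order.

    Raises ValueError if a resource is missing a required key.
    """
    plan = []
    # Lower layer numbers are processed first.
    for entry in sorted(payload["StatePayload"], key=lambda e: e["layer"]):
        for resource in entry["resources"]:
            missing = [k for k in REQUIRED_RESOURCE_KEYS if k not in resource]
            if missing:
                raise ValueError(
                    f"Layer {entry['layer']}: resource is missing {missing}"
                )
        # Resources listed together in one layer run in parallel.
        plan.append(
            (entry["layer"], [r["resourceType"] for r in entry["resources"]])
        )
    return plan
```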
DR Orchestrator FAILBACK
Use the DR Orchestrator FAILBACK state machine after the DR event, when the former primary Region (us-east-1) is up. You can create the read replica for Amazon RDS in the former primary Region from the new primary Region (us-west-2) to be compliant with your DR strategy. Because this is a planned event, you can schedule this activity over the weekend or during off-peak business hours with an estimated downtime.
The state machine accepts parameters in the JSON format as shown in the following code:
    {
      "StatePayload": [
        {
          "layer": 1,
          "resources": [
            {
              "resourceType": "CreateRDSReadReplica",
              "resourceName": "Create RDS for MySQL Read Replica",
              "parameters": {
                "RDSInstanceIdentifier": "!Import rds-mysql-instance-identifier",
                "TargetClusterIdentifier": "!Import rds-mysql-instance-global-arn",
                "SourceRDSInstanceIdentifier": "!Import rds-mysql-instance-source-identifier",
                "SourceRegion": "!Import rds-mysql-instance-SourceRegion",
                "MultiAZ": "!Import rds-mysql-instance-MultiAZ",
                "DBInstanceClass": "!Import rds-mysql-instance-DBInstanceClass",
                "DBSubnetGroup": "!Import rds-mysql-instance-DBSubnetGroup",
                "DBSecurityGroup": "!Import rds-mysql-instance-DBSecurityGroup",
                "KmsKeyId": "!Import rds-mysql-instance-KmsKeyId",
                "BackupRetentionPeriod": "7",
                "MonitoringInterval": "60",
                "StorageEncrypted": "True",
                "EnableIAMDatabaseAuthentication": "True",
                "DeletionProtection": "True",
                "CopyTagsToSnapshot": "True",
                "AutoMinorVersionUpgrade": "True",
                "Port": "!Import rds-mysql-instance-DBPortNumber",
                "MonitoringRoleArn": "!Import rds-mysql-instance-RDSMonitoringRole"
              }
            }
          ]
        }
      ]
    }
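The FAILBACK input passes every value as a string, including numbers such as "7" and Booleans such as "True". Boto3 operations expect native integers and Booleans, so a consuming function might convert the values first. The source doesn't show how the framework handles this, so the following is only a sketch under that assumption, and the helper names `coerce_value` and `coerce_parameters` are hypothetical.

```python
def coerce_value(value):
    """Convert "True"/"False" and plain digit strings to native types."""
    if not isinstance(value, str):
        return value
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    # Restrict to ASCII digits so that int() cannot fail on the input.
    if value.isascii() and value.isdigit():
        return int(value)
    return value


def coerce_parameters(parameters):
    """Return a copy of the parameters dictionary with coerced values."""
    return {key: coerce_value(value) for key, value in parameters.items()}
```

Run the conversion after the import placeholders are resolved, so that imported values such as the port number are converted as well.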