Best practices for Elastic Disaster Recovery

For a more complete discussion of best practices for planning, implementing, and maintaining disaster recovery for on-premises applications using AWS, see this white paper.

Planning

  1. Being ready for a real recovery event requires pre-planning. Simply having your servers replicating to AWS, and even having launched them once, is not enough. You should have a written recovery plan that documents what to do when a real recovery event occurs. To learn more, read this Checklist for your IT disaster recovery plan.

  2. Once your source servers have reached the Healthy state (after the initial sync has completed), you should launch Drill instances for each of your applications and ensure that each application as a whole works as expected when running in your recovery AWS Region. As you go through this process, you will likely create the necessary network resources (together with security groups and other related resources). While you can keep these recovery networks (and related resources) up and running even when not in use, it is recommended that once you have them set up properly, you create a CloudFormation template that can create them on demand, should the need arise. You should also discover the order in which servers and applications need to be launched, and record this in the recovery plan.
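For illustration only, here is a minimal sketch of deploying such a recovery-network template on demand with boto3. The template file, stack name, and parameter shown are hypothetical placeholders and are not part of DRS itself:

    import boto3

    # Deploy a pre-built CloudFormation template that creates the recovery
    # networks (VPC, subnets, security groups, and related resources).
    cfn = boto3.client("cloudformation", region_name="us-east-1")  # your recovery Region

    with open("recovery-network.yaml") as f:  # hypothetical template file
        template_body = f.read()

    cfn.create_stack(
        StackName="drs-recovery-network",
        TemplateBody=template_body,
        Parameters=[{"ParameterKey": "VpcCidr", "ParameterValue": "10.10.0.0/16"}],
    )

    # Wait until the stack is ready before launching drill or Recovery instances.
    cfn.get_waiter("stack_create_complete").wait(StackName="drs-recovery-network")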

Drilling

Regular drills are an integral part of any Disaster Recovery solution. With DRS, drilling is simple and nondisruptive (both to the servers at the source, and to the replication process itself). We recommend drilling as often as is practical, and at least several times a year, and updating the recovery plan with any findings and required changes. Testing and understanding failback is also important. Be sure to include it in your initial drill, and in at least some of your regular drills.

Regular testing can help ensure that your resources are properly prepared for both disasters and scheduled drills. Before conducting large-scale scheduled drills, make sure you meet all the prerequisites and run the required tests. To allow our support team to assist you in case of misconfiguration or other issues, conduct the preliminary testing a week or two before the scheduled drill.

Note

While your drill instances are up and running, you are paying for them as per your standard Amazon EC2 rates. Make sure to terminate the drill instances when the drill is done, and include this as a step in your recovery plan.
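If you prefer to start drills programmatically rather than from the console, the following is a hedged sketch using the boto3 drs client. The source server IDs are placeholders, and you should verify the request and response fields against the current DRS API reference:

    import boto3

    drs = boto3.client("drs", region_name="us-east-1")  # your recovery Region

    # Launch drill instances for the servers that make up one application.
    # isDrill=True launches drill (not recovery) instances and does not
    # disrupt replication at the source.
    response = drs.start_recovery(
        isDrill=True,
        sourceServers=[
            {"sourceServerID": "s-1111111111example"},
            {"sourceServerID": "s-2222222222example"},
        ],
    )
    print(response["job"]["jobID"])  # track the launch job

    # Remember to terminate the drill instances once the drill is done,
    # as noted above, to stop accruing EC2 charges.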

Monitoring

You can monitor the health of the ongoing replication using the DRS console or programmatically. In the AWS DRS console, go to the Servers list page and look at the Ready for recovery column. Any server that is not showing as Ready with a green checkmark may require attention. Servers that show Stalled in the Data replication status column require your intervention to resolve. Servers that show Lag may resolve themselves (unless they are also stalled); you should monitor them to determine whether the lag is a persistent problem (for example, due to insufficient network bandwidth). You can use a scripted solution and the DRS API to respond to servers becoming stalled or going into lag, or you can use Amazon EventBridge and the EventBridge events generated by AWS DRS.
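As a starting point for such a scripted solution, here is a hedged sketch using the boto3 drs client. The response field names shown (dataReplicationInfo, dataReplicationState, lagDuration) should be verified against the current DescribeSourceServers API reference:

    import boto3

    drs = boto3.client("drs", region_name="us-east-1")  # your recovery Region

    next_token = None
    while True:
        kwargs = {"filters": {}}
        if next_token:
            kwargs["nextToken"] = next_token
        page = drs.describe_source_servers(**kwargs)
        for server in page["items"]:
            info = server.get("dataReplicationInfo", {})
            state = info.get("dataReplicationState")
            lag = info.get("lagDuration")
            if state == "STALLED":
                print(f"{server['sourceServerID']}: stalled - intervention required")
            elif lag:
                print(f"{server['sourceServerID']}: lagging by {lag} - keep monitoring")
        next_token = page.get("nextToken")
        if not next_token:
            break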

Limits

Due to Amazon EBS limits on the rate at which EBS snapshots can be taken, the number of servers that can be replicated using DRS in a single AWS account is limited to 300. To replicate more than this number of servers, use multiple AWS accounts or multiple target AWS Regions (you will need to set up DRS separately for each account or Region).

Protecting Point-In-Time snapshots

DRS uses EBS snapshots to maintain recovery Points-In-Time. If these are deleted, you can only recover from the latest state, as maintained on the replication server (and if that is terminated, you can no longer recover at all). In the event of a breach that includes not just corruption of your data at the source but also access to your AWS account, a malicious actor could delete your Point-In-Time snapshots, unless you take extra measures to protect them.
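One possible extra measure, shown here as a sketch under assumptions rather than a complete solution, is an explicit IAM deny policy on EBS snapshot deletion, attached to day-to-day roles or adapted into a Service Control Policy. The policy name below is illustrative:

    import json
    import boto3

    iam = boto3.client("iam")

    # An explicit deny always overrides allows, so identities carrying this
    # policy cannot delete EBS snapshots (including DRS Point-In-Time snapshots).
    deny_snapshot_deletion = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyEbsSnapshotDeletion",
                "Effect": "Deny",
                "Action": "ec2:DeleteSnapshot",
                "Resource": "*",
            }
        ],
    }

    iam.create_policy(
        PolicyName="deny-ebs-snapshot-deletion",  # illustrative name
        PolicyDocument=json.dumps(deny_snapshot_deletion),
    )

Keep in mind that DRS manages the lifecycle of these snapshots through its own roles, so scope any such deny carefully (for example, to human or automation roles) rather than applying it to everything in the account.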

Controlling agent installation permissions

You should control who can install the AWS Replication Agent in your account. Once an agent is installed, you immediately begin accruing charges for DRS and for replication resources (such as EBS volumes). Agent installation permissions should be as limited as is practical. The recommended way to control who can install agents is to create an IAM role and to allow users to assume that role.

  1. Create an IAM role (IAM docs link | IAM console link), based on the DRS managed policy for agent installation. If this role is to be used by someone outside of your AWS account, make sure to use the external ID functionality. Send the role ARN to the users who need to install agents (the ARN is not secret and can be sent via email). Use permission boundaries to further limit what can be done using that role. For example, you can control which AWS Region the role can be used in, how long the temporary credentials created with the role remain valid, which tags must (or must not) be provided during agent installation, and more.

  2. Users who install the agents assume that role (they must be users of an AWS account, either yours or another; you configure who can assume the role in step 1). This creates temporary IAM credentials for that user, which are used for agent installation. These credentials are limited to only the permissions required for agent installation (and further limited by the permission boundaries you defined), yet are associated with the user (for example, so their usage can be tracked using CloudTrail).
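A minimal sketch of step 2 follows, using boto3 and the AWS STS AssumeRole API. The role ARN, external ID, and session name are placeholders, and how you pass the resulting temporary credentials to the agent installer depends on the installer options documented for your platform:

    import boto3

    sts = boto3.client("sts")

    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/drs-agent-installation",  # the ARN you sent
        RoleSessionName="drs-agent-install-jdoe",  # ties usage to the user in CloudTrail
        ExternalId="your-external-id",             # only if the role requires one
        DurationSeconds=3600,                      # bounded by the role's maximum session duration
    )["Credentials"]

    # Temporary, installation-scoped credentials to supply to the
    # AWS Replication Agent installer.
    print(creds["AccessKeyId"])
    print(creds["SecretAccessKey"])
    print(creds["SessionToken"])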

Recovery best practices

  1. Overview: DRS makes successful failover possible by handling ongoing replication and the on-demand launching of actual Recovery instances. The re-routing of data is not done via DRS, and should be done using your preferred DNS routing service, such as Amazon Route 53. Your recovery plan should include details of which service to use, who in your organization owns this service, and what conditions must be met to perform the re-routing (for example: launch Recovery instances using DRS, perform a successful launch-validation test, wait for systems X, Y, and Z to also launch and pass tests, then re-route).

  2. Termination protection for recovery instances: When you launch recovery instances in case of a real event, you should prevent them from being inadvertently terminated. This should be done after you have performed the launch-validation test, and before data re-routing. You can turn on termination protection directly from the Amazon EC2 console by selecting the instances and, from the Actions menu, choosing Instance settings, Change termination protection, and then choosing Yes, Enable. You can also do this programmatically, as in the sketch after this list. You should document this step in your recovery plan. Learn more about termination protection.

  3. Understanding failover costs: Your EC2 recovery instances are created according to the launch settings you have configured for each source server. Recovery instances accrue EC2 and EBS charges as per AWS rates for your account in the target AWS Region. While you use the Recovery instances, you also continue paying for DRS, and the replication resources it created.

  4. Failover dos and don’ts: Do not use the Disconnect from AWS action in the DRS console for servers for which you launched Recovery instances, even in the case of a real recovery event. Performing a disconnect will terminate all replication resources related to these source servers, including your Point-In-Time (PIT) recovery points. You may need these PITs while you are in failover state, for regulatory reasons, or to re-launch a Recovery instance for any reason (for instance, if you discover that the PIT from which you launched includes corrupt or malicious data, and you want to relaunch from an earlier PIT). You should realize that while you use your Recovery instances as your primary, and new data is presumably written to them, these Recovery instances are not themselves being replicated, and you are not creating any new PITs for these changes. It is possible to configure the Recovery instances as new source servers and replicate them cross-Region, so as to have DR for your recovery site (this carries with it additional costs, as is detailed in the linked page).

  5. Using recovery for migration: Once you launch and use Recovery instances on AWS for a real event, you may wish to go on using them permanently, instead of your original servers. The primary additional steps you need to take are:

    1. Set up cross-Region replication, so that these Recovery instances become new source servers;

    2. Wait for these new source servers to accumulate the full number of daily PITs that you need to maintain;

    3. Perform the Disconnect from AWS action on the original source servers, so as to avoid confusion, and to stop paying for DRS and related replication resources for these original source servers. You can also then choose Delete from the Actions menu, which will cause DRS to forget everything it knows about these source servers, so that they no longer appear in the Elastic Disaster Recovery console.

  6. Recover into existing instance: If you would like to recover into an instance that already exists, instead of launching a new one for recovery, drill, or failback, the instance to recover into must be of the same operating system platform (Linux or Windows) as the source instance, it must be stopped, and it must have the tag key AWSDRS with the tag value AllowLaunchingIntoThisInstance. Learn more about recovering into an existing instance.
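As referenced in item 2 above, here is a short sketch of enabling termination protection programmatically instead of through the EC2 console. The instance IDs are placeholders for your launched Recovery instances:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # your recovery Region

    recovery_instance_ids = ["i-0123456789abcdef0", "i-0fedcba9876543210"]

    for instance_id in recovery_instance_ids:
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            DisableApiTermination={"Value": True},  # prevent accidental termination
        )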

Failback best practices

  1. Mass failback: If you are failing back more than several servers, and your source environment is VMware vCenter, consider using the DRS Mass Failback Automation client.

  2. Return to normal operation: Make sure that the failed-back servers at the source are replicating back to AWS and appear as source servers in the DRS console. If they do appear in the DRS console but are not replicating, explore the reason (such as firewall settings). If they do not appear in the DRS console, you may need to install or reinstall the AWS Replication Agent on them. Make sure that you do not end up with two source server entities in the DRS console, one representing the original server and one the failed-back server.

  3. Cleanup after return to normal operation: Once you have completed failback, there may be multiple AWS resources left behind that you no longer need and that are costly to maintain:

    After performing a failback to an on-premises environment, perform the following steps:

    • Clean up Recovery instances: Terminate these instances from the Recovery instances page of the DRS console.

    • Source servers: These appear in the Source Servers page of the DRS console. Make sure that you only have one source server in the DRS console for each actual server at the source. Source servers are billed by DRS and consume replication resources (billed by other AWS services) until you perform the Disconnect from AWS action. If you do have duplicate source servers, do not disconnect or delete the original ones until the new ones have accumulated all the Point-In-Time recovery points (PITs) you need. Performing the Disconnect from AWS action will cause the PITs from the original source servers to be discarded. If your source is also in AWS, then you will have more resources that need to be cleaned up. Learn more about cleaning up these resources. A sketch of performing these cleanup steps programmatically appears after the note below.

    Note

    The cleanup process following a cross-region failback is different. Learn how to perform a cleanup following a cross-region failback.
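    As referenced above, the following is a hedged sketch of these cleanup steps using the boto3 drs client. The IDs are placeholders, and the operation parameters should be checked against the current DRS API reference:

        import boto3

        drs = boto3.client("drs", region_name="us-east-1")  # your recovery Region

        # 1. Terminate Recovery instances you no longer need.
        drs.terminate_recovery_instances(
            recoveryInstanceIDs=["i-0123456789abcdef0"]
        )

        # 2. Disconnect the duplicate source server entry; this stops DRS billing
        #    and terminates its replication resources (its PITs are discarded).
        drs.disconnect_source_server(sourceServerID="s-1234567890abcdef0")

        # 3. Optionally remove the entry from the DRS console entirely.
        drs.delete_source_server(sourceServerID="s-1234567890abcdef0")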

Security best practices

You can review security best practices in the Security chapter.