Write-ahead logs (WAL) for Amazon EMR
With Amazon EMR 6.15 and higher, you can write your Apache HBase write-ahead logs (WAL) to the Amazon EMR WAL. With lower Amazon EMR releases, when you create a cluster with the HBase on Amazon S3 option, the WAL is the only Apache HBase component that is stored on the cluster's local disk. You can store the other components, such as the root directory, store files (HFiles), table metadata, and data, on Amazon S3.
You can use Amazon EMR WAL to recover data that didn't flush to Amazon S3. To fully back up your HBase clusters, opt in to use the Amazon EMR WAL service. Behind the scenes, the RegionServer writes your HBase write-ahead logs (WAL) to the WAL for Amazon EMR.
If your cluster or its Availability Zone becomes unhealthy or unavailable, you can create a new cluster, point it to the same S3 root directory and Amazon EMR WAL workspace, and automatically recover the data in the WAL within a few minutes. For more information, see Restoring from Amazon EMR WAL.
With Amazon EMR releases 7.3.0 and higher, Amazon EMR creates multiple EMR WALs for each server and groups multiple HBase regions into one Amazon EMR WAL. Doing so enhances Apache HBase WAL to improve log utilization and optimize costs. To configure the number of Amazon EMR WAL instances per HBase RegionServer, use the parameter hbase.wal.regiongrouping.numgroups. By default, this parameter is set to 2.
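As a minimal sketch of how the group count might be supplied at cluster creation, the following Python snippet builds an EMR configurations list that sets hbase.wal.regiongrouping.numgroups. It assumes the property is applied through the hbase-site classification, which this page does not state; the function name wal_grouping_configuration is hypothetical and used only for illustration.

```python
import json


def wal_grouping_configuration(num_groups: int = 2) -> list:
    """Build an EMR configurations list that sets the WAL group count.

    The default of 2 mirrors the documented default for the parameter.
    """
    if num_groups < 1:
        raise ValueError("num_groups must be a positive integer")
    return [
        {
            # Assumption: the property belongs in the hbase-site classification.
            "Classification": "hbase-site",
            "Properties": {
                # EMR configuration property values are passed as strings.
                "hbase.wal.regiongrouping.numgroups": str(num_groups),
            },
        }
    ]


if __name__ == "__main__":
    # Print JSON that could be saved to a configurations file.
    print(json.dumps(wal_grouping_configuration(4), indent=2))
```

The printed JSON has the shape that Amazon EMR accepts for application configurations, so it could be saved to a file and supplied when you create the cluster.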
If you run a release lower than Amazon EMR 7.3.0 and want to move to a newer release, we recommend that you manually disable the tables in the old HBase cluster to make sure that all data in the Amazon EMR WAL flushes to Amazon S3. Then, delete the old Amazon EMR WAL, terminate the old cluster, and set up a new cluster that runs the latest release.
If you run into issues and can't disable the tables on the old cluster, you can directly terminate the old cluster and set emr.wal.multiplex.migrate to true on the new cluster. When this parameter is set to true, HBase attempts to replay the data from the old Amazon EMR WAL instances during HBase region initialization and deletes the old WALs after replay. This replay process incurs additional costs for reads. After migration, we recommend that you reconfigure the cluster and either set emr.wal.multiplex.migrate to false or remove the parameter, which speeds up HBase region initialization.
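The two phases of this migration can be sketched as a pair of configuration payloads: one with the flag enabled for the first launch of the new cluster, and one with the flag turned off once replay has finished. This is an illustration only. It assumes that emr.wal.multiplex.migrate is set through the hbase-site classification, which this page does not confirm, and the function name migration_configuration is hypothetical.

```python
import json


def migration_configuration(migrate: bool) -> list:
    """Build an EMR configurations list that toggles WAL migration replay."""
    return [
        {
            # Assumption: the classification for this property is hbase-site.
            "Classification": "hbase-site",
            "Properties": {
                # Property values are strings, so emit "true" or "false".
                "emr.wal.multiplex.migrate": "true" if migrate else "false",
            },
        }
    ]


if __name__ == "__main__":
    # Phase 1: launch the new cluster with replay of the old WALs enabled.
    print(json.dumps(migration_configuration(True), indent=2))
    # Phase 2: after migration, turn the flag off to avoid the replay check.
    print(json.dumps(migration_configuration(False), indent=2))
```

Because replay incurs read costs, keeping the second payload ready makes it easier to switch the flag off as soon as the migration completes.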
Note
Amazon EMR WAL deletes the data after HBase flushes it. If HBase doesn't flush the data, Amazon EMR WAL retains the data for a maximum of 30 days. After 30 days, Amazon EMR WAL automatically deletes the data. Amazon EMR keeps WAL instances for up to 30 days from when you terminate an EMR cluster. However, if you launch a new WAL-enabled cluster from the same S3 root directory within those 30 days, Amazon EMR won't delete any of the WAL instances from your previous cluster. For more information, see Restoring from Amazon EMR WAL.
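The 30-day retention window in the note above can be turned into a simple date calculation. The sketch below computes the latest point at which WAL instances from a terminated cluster could still exist. Because the note says "up to 30 days", treat the result as an upper bound and not a guarantee; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Maximum retention stated in the note above.
WAL_RETENTION = timedelta(days=30)


def latest_wal_expiry(terminated_at: datetime) -> datetime:
    """Return the latest time at which WAL instances could still be kept."""
    return terminated_at + WAL_RETENTION


def within_retention_window(terminated_at: datetime, now: datetime) -> bool:
    """Return True if 'now' is no later than the upper-bound expiry time."""
    return now <= latest_wal_expiry(terminated_at)


if __name__ == "__main__":
    terminated = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(latest_wal_expiry(terminated).isoformat())
    print(within_retention_window(terminated, datetime(2024, 1, 15, tzinfo=timezone.utc)))
```

A check like this could guard an automated recovery job so that it only attempts to launch a replacement cluster while the previous cluster's WAL instances may still be available.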
The following sections describe how to set up and use Amazon EMR WAL with your HBase-enabled EMR cluster.
Topics
- Amazon EMR WAL workspaces
- Required permissions for Amazon EMR WAL
- Enabling Amazon EMR WAL
- Restoring from Amazon EMR WAL
- Using security configurations with Amazon EMR WAL
- Access Amazon EMR WAL through AWS PrivateLink
- Understanding Amazon EMR WAL pricing and metrics
- Tagging WAL workspaces
- EMR WAL cross-cluster replication
- Considerations and Regions for Amazon EMR WAL
- Amazon EMR WAL (EMRWAL) CLI reference