HBase application specifics for earlier AMI versions of Amazon EMR
Supported HBase versions
HBase version | AMI version | AWS CLI configuration parameters | HBase version details |
---|---|---|---|
0.94.18 | 3.1.0 and later | | |
0.94.7 | 3.0-3.0.4 | | |
0.92 | 2.2 and later | | |
HBase cluster prerequisites
A cluster created using Amazon EMR AMI versions 2.x and 3.x should meet the following requirements for HBase.
-
The AWS CLI (optional)—To interact with HBase using the command line, download and install the latest version of the AWS CLI. For more information, see Installing the AWS Command Line Interface in the AWS Command Line Interface User Guide.
-
At least two instances (optional)—The cluster's master node runs the HBase master server and Zookeeper, and task nodes run the HBase region servers. For best performance, HBase clusters should run on at least two EC2 instances, but you can run HBase on a single node for evaluation purposes.
-
Long-running cluster—HBase only runs on long-running clusters. By default, the CLI and Amazon EMR console create long-running clusters.
-
An Amazon EC2 key pair set (recommended)—To use the Secure Shell (SSH) network protocol to connect with the master node and run HBase shell commands, you must use an Amazon EC2 key pair when you create the cluster.
-
The correct AMI and Hadoop versions—HBase clusters are currently supported only on Hadoop 0.20.205 or later.
-
Ganglia (optional)—To monitor HBase performance metrics, install Ganglia when you create the cluster.
-
An Amazon S3 bucket for logs (optional)—The logs for HBase are available on the master node. If you'd like these logs copied to Amazon S3, specify an S3 bucket to receive log files when you create the cluster.
Creating a cluster with HBase
The following table lists options that are available when using the console to create a cluster with HBase using an Amazon EMR AMI release version.
Field | Action |
---|---|
Restore from backup | Specify whether to pre-load the HBase cluster with data stored in Amazon S3. |
Backup location | Specify the URI where the backup from which to restore resides in Amazon S3. |
Backup version | Optionally, specify the version name of the backup at Backup Location to use. If you leave this field blank, Amazon EMR uses the latest backup at Backup Location to populate the new HBase cluster. |
Schedule Regular Backups | Specify whether to schedule automatic incremental backups. The first backup is a full backup to create a baseline for future incremental backups. |
Consistent backup | Specify whether the backups should be consistent. A consistent backup pauses write operations during the initial backup stage, which is the synchronization across nodes. Any write operations paused in this way are placed in a queue and resume when synchronization completes. |
Backup frequency | The number of days/hours/minutes between scheduled backups. |
Backup location | The Amazon S3 URI where backups are stored. The backup location for each HBase cluster should be different to ensure that differential backups stay correct. |
Backup start time | Specify when the first backup should occur. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. |
The following example AWS CLI command launches a cluster with HBase and other applications:
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
--applications Name=Hue Name=Hive Name=Pig Name=HBase \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type c1.xlarge --instance-count 3 --termination-protected
After the connection between the Hive and HBase clusters has been made (as shown in the previous procedure), you can access the data stored on the HBase cluster by creating an external table in Hive.
The following example, when run from the Hive prompt, creates an external table named inputTable that references data stored in an HBase table (t1 in this example). You can then reference inputTable in Hive statements to query and modify data stored in the HBase cluster.
Note
The following example uses protobuf-java-2.4.0a.jar in AMI 2.3.3, but you should modify the example to match your version. To check which version of the Protocol Buffers JAR you have, run the following command at the Hive command prompt: ! ls /home/hadoop/lib;
add jar lib/emr-metrics-1.0.jar;
add jar lib/protobuf-java-2.4.0a.jar;
set hbase.zookeeper.quorum=ec2-107-21-163-157.compute-1.amazonaws.com;
create external table inputTable (key string, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,f1:col1")
tblproperties ("hbase.table.name" = "t1");
select count(*) from inputTable;
Customizing HBase configuration
Although the default settings should work for most applications, you have the flexibility to modify your HBase configuration settings. To do this, run one of two bootstrap action scripts:
-
configure-hbase-daemons—Configures properties of the master, regionserver, and zookeeper daemons. These properties include heap size and options to pass to the Java Virtual Machine (JVM) when the HBase daemon starts. You set these properties as arguments in the bootstrap action. This bootstrap action modifies the /home/hadoop/conf/hbase-user-env.sh configuration file on the HBase cluster.
-
configure-hbase—Configures HBase site-specific settings such as the port the HBase master should bind to and the maximum number of times the CLI client should retry an action. You can set these one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3. This bootstrap action modifies the /home/hadoop/conf/hbase-site.xml configuration file on the HBase cluster.
Note
These scripts, like other bootstrap actions, can only be run when the cluster is created; you cannot use them to change the configuration of an HBase cluster that is currently running.
When you run the configure-hbase or configure-hbase-daemons bootstrap actions, the values you specify override the default values. Any values that you don't explicitly set receive the default values.
Configuring HBase with these bootstrap actions is analogous to using bootstrap actions in Amazon EMR to configure Hadoop settings and Hadoop daemon properties. The difference is that HBase does not have per-process memory options. Instead, memory options are set using the --daemon-opts argument, where daemon is replaced by the name of the daemon to configure.
Configure HBase daemons
Amazon EMR provides a bootstrap action, s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-daemons, that you can use to change the configuration of HBase daemons, where region is the region into which you're launching your HBase cluster.
To configure HBase daemons using the AWS CLI, add the configure-hbase-daemons bootstrap action when you launch the cluster to configure one or more HBase daemons. You can set the following properties.
Property | Description |
---|---|
hbase-master-opts | Options that control how the JVM runs the master daemon. If set, these override the default HBASE_MASTER_OPTS variables. |
regionserver-opts | Options that control how the JVM runs the region server daemon. If set, these override the default HBASE_REGIONSERVER_OPTS variables. |
zookeeper-opts | Options that control how the JVM runs the zookeeper daemon. If set, these override the default HBASE_ZOOKEEPER_OPTS variables. |
For more information about these options, see hbase-env.sh.
A bootstrap action to configure values for zookeeper-opts and hbase-master-opts is shown in the following example.
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
--applications Name=Hue Name=Hive Name=Pig Name=HBase \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type c1.xlarge --instance-count 3 --termination-protected \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons,\
Args=["--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19","--hbase-master-opts=-Xmx2048m","--hbase-regionserver-opts=-Xmx4096m"]
Configure HBase site settings
Amazon EMR provides a bootstrap action, s3://elasticmapreduce/bootstrap-actions/configure-hbase, that you can use to change the configuration of HBase. You can set configuration values one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3. Setting configuration values one-by-one is useful if you only need to set a few configuration settings. Setting them using an XML file is useful if you have many changes to make, or if you want to save your configuration settings for reuse.
Note
You can prefix the Amazon S3 bucket name with a region prefix, such as s3://region.elasticmapreduce/bootstrap-actions/configure-hbase, where region is the region into which you're launching your HBase cluster.
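The two bucket forms described in this section (with and without a region prefix) can be sketched as a small helper. The function name bootstrap_action_path and the us-west-2 region are illustrative assumptions, not part of Amazon EMR.

```shell
# Sketch: build the S3 path of an Amazon EMR bootstrap action.
# bootstrap_action_path is a hypothetical helper name.
bootstrap_action_path() {
  # $1 = bootstrap action name, $2 = optional region prefix
  if [ -n "${2:-}" ]; then
    printf 's3://%s.elasticmapreduce/bootstrap-actions/%s\n' "$2" "$1"
  else
    printf 's3://elasticmapreduce/bootstrap-actions/%s\n' "$1"
  fi
}

# Example: region-prefixed path for configure-hbase (region is a placeholder).
bootstrap_action_path configure-hbase us-west-2
```

The printed value can be used after Path= in the --bootstrap-actions option.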
This bootstrap action modifies the /home/hadoop/conf/hbase-site.xml configuration file on the HBase cluster. The bootstrap action can only be run when the HBase cluster is launched.
For more information about the HBase site settings that you can configure, see Default configuration.
Set the configure-hbase bootstrap action when you launch the HBase cluster and specify the values in hbase-site.xml to change.
To specify individual HBase site settings using the AWS CLI
-
To change the hbase.hregion.max.filesize setting, type the following command and replace myKey with the name of your Amazon EC2 key pair.
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
--applications Name=Hue Name=Hive Name=Pig Name=HBase \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type c1.xlarge --instance-count 3 --termination-protected \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.hregion.max.filesize=52428800"]
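When several site settings are passed individually, each key-value pair takes its own -s switch. The following sketch assembles such an Args list from plain key=value arguments. The helper name hbase_site_args is hypothetical, and the snippet only prints the text without calling the AWS CLI.

```shell
# Sketch: build the Args list for configure-hbase from key=value settings,
# placing a -s switch before each pair. hbase_site_args is a hypothetical name.
hbase_site_args() {
  local args="" setting
  for setting in "$@"; do
    # Separate entries with a comma after the first one.
    if [ -n "$args" ]; then
      args="$args,"
    fi
    args="$args\"-s\",\"$setting\""
  done
  printf 'Args=[%s]\n' "$args"
}

# Example with two settings that appear elsewhere in this topic.
hbase_site_args hbase.hregion.max.filesize=52428800 zookeeper.session.timeout=60000
```

The example call prints Args=["-s","hbase.hregion.max.filesize=52428800","-s","zookeeper.session.timeout=60000"], which can be appended after the Path= value of the bootstrap action.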
To specify HBase site settings with an XML file using the AWS CLI
-
Create a custom version of hbase-site.xml. Your custom file must be valid XML. To reduce the chance of introducing errors, start with the default copy of hbase-site.xml, located on the Amazon EMR HBase master node at /home/hadoop/conf/hbase-site.xml, and edit a copy of that file instead of creating a file from scratch. You can give your new file a new name, or leave it as hbase-site.xml.
-
Upload your custom hbase-site.xml file to an Amazon S3 bucket. It should have permissions set so the AWS account that launches the cluster can access the file. If the AWS account launching the cluster also owns the Amazon S3 bucket, it has access.
-
Set the configure-hbase bootstrap action when you launch the HBase cluster, and include the location of your custom hbase-site.xml file. The following example sets the HBase site configuration values to those specified in the file s3://amzn-s3-demo-bucket/my-hbase-site.xml. Type the following command, replace myKey with the name of your EC2 key pair, and replace amzn-s3-demo-bucket with the name of your Amazon S3 bucket.
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
--applications Name=Hue Name=Hive Name=Pig Name=HBase \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type c1.xlarge --instance-count 3 --termination-protected \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["--site-config-file","s3://amzn-s3-demo-bucket/my-hbase-site.xml"]
If you specify more than one option to customize HBase operation, you must prepend each key-value pair with a -s option switch, as shown in the following example:
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","zookeeper.session.timeout=60000","-s","hbase.hregion.max.filesize=52428800"]
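For the XML-file approach, a custom site file might look like the one written by the following sketch. It assumes the standard Hadoop-style configuration/property/name/value layout and reuses two settings from this topic. The file name and temporary directory are illustrative, and the upload command is left commented out because it needs AWS credentials.

```shell
# Sketch: write a minimal custom HBase site file for the configure-hbase bootstrap action.
# The file name and the two property values are illustrative only.
workdir=$(mktemp -d)
site_file="$workdir/my-hbase-site.xml"

cat > "$site_file" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>52428800</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
</configuration>
EOF

# Count the properties written, as a quick sanity check before uploading.
property_count=$(grep -c '<name>' "$site_file")
echo "wrote $site_file with $property_count properties"

# Upload step (not executed here; the bucket name is a placeholder):
# aws s3 cp "$site_file" s3://amzn-s3-demo-bucket/my-hbase-site.xml
```

In practice, starting from the copy of hbase-site.xml on the master node, as the procedure recommends, is safer than writing the file from scratch.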
With the proxy set and the SSH connection open, you can view the HBase UI by opening a browser window with http://master-public-dns-name:60010/master-status, where master-public-dns-name is the public DNS address of the master node in the HBase cluster.
You can view the current HBase logs by using SSH to connect to the master node and navigating to the /mnt/var/log/hbase directory. These logs are not available after the cluster is terminated unless you enable logging to Amazon S3 when the cluster is launched.
Back up and restore HBase
Amazon EMR provides the ability to back up your HBase data to Amazon S3, either manually or on an automated schedule. You can perform both full and incremental backups. After you have a backed-up version of HBase data, you can restore that version to an HBase cluster. You can restore to an HBase cluster that is currently running, or launch a new cluster pre-populated with backed-up data.
During the backup process, HBase continues to execute write commands. Although this ensures that your cluster remains available throughout the backup, there is the risk of inconsistency between the data being backed up and any write operations being executed in parallel. To understand the inconsistencies that might arise, you have to consider that HBase distributes write operations across the nodes in its cluster. If a write operation happens after a particular node is polled, that data is not included in the backup archive. You may even find that earlier writes to the HBase cluster (sent to a node that has already been polled) might not be in the backup archive, whereas later writes (sent to a node before it was polled) are included.
If a consistent backup is required, you must pause writes to HBase during the initial portion of the backup process, which is the synchronization across nodes. You can do this by specifying the --consistent parameter when requesting a backup. With this parameter, writes during this period are queued and executed as soon as the synchronization completes. You can also schedule recurring backups, which resolves any inconsistencies over time, because data that is missed on one backup pass is backed up on the following pass.
When you back up HBase data, you should specify a different backup directory for each cluster. An easy way to do this is to use the cluster identifier as part of the path specified for the backup directory, for example, s3://amzn-s3-demo-bucket/backups/j-3AEXXXXXX16F2. This ensures that any future incremental backups reference the correct HBase cluster.
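The per-cluster directory convention and the --consistent parameter can be combined as in the following sketch. It assumes the aws emr create-hbase-backup subcommand that older AWS CLI releases offered for AMI-based clusters, so check aws emr help to see whether your CLI version includes it. The helper names are hypothetical, and the command is printed rather than executed.

```shell
# Sketch: derive a per-cluster backup directory and print a manual backup command.
# backup_dir and backup_command are hypothetical helper names.
backup_dir() {
  # $1 = S3 bucket name, $2 = cluster identifier
  printf 's3://%s/backups/%s\n' "$1" "$2"
}

backup_command() {
  # $1 = bucket, $2 = cluster ID, $3 = "consistent" to request a consistent backup
  local cmd
  cmd="aws emr create-hbase-backup --cluster-id $2 --dir $(backup_dir "$1" "$2")"
  if [ "${3:-}" = "consistent" ]; then
    cmd="$cmd --consistent"
  fi
  printf '%s\n' "$cmd"
}

# Example: print the command for a consistent backup (bucket and ID are placeholders).
backup_command amzn-s3-demo-bucket j-3AEXXXXXX16F2 consistent
```

Printing the command first makes it easy to confirm that the directory contains the intended cluster identifier before any data is written.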
When you are ready to delete old backup files that are no longer needed, we recommend that you first do a full backup of your HBase data. This ensures that all data is preserved and provides a baseline for future incremental backups. After the full backup is done, you can navigate to the backup location and manually delete the old backup files.
The HBase backup process uses S3DistCp for the copy operation, which has certain limitations regarding temporary file storage space.
Back up and restore HBase using the console
The console provides the ability to launch a new cluster and populate it with data from a previous HBase backup. It also gives you the ability to schedule periodic incremental backups of HBase data. Additional backup and restore functionality, such as the ability to restore data to an already running cluster, do manual backups, and schedule automated full backups, is available using the CLI.
To populate a new cluster with archived HBase data using the console
-
Navigate to the new Amazon EMR console and select Switch to the old console from the side navigation. For more information on what to expect when you switch to the old console, see Using the old console.
-
Choose Create cluster.
-
In the Software Configuration section, for Additional Applications, choose HBase and Configure and add.
-
On the Add Application dialog box, check Restore From Backup.
-
For Backup Location, specify the location of the backup to load into the new HBase cluster. This should be an Amazon S3 URL of the form s3://amzn-s3-demo-bucket/backups/.
-
For Backup Version, you have the option to specify the name of a backup version to load by setting a value. If you do not set a value for Backup Version, Amazon EMR loads the latest backup in the specified location.
-
Choose Add and proceed to create the cluster with other options as desired.
To schedule automated backups of HBase data using the console
-
In the Software Configuration section, for Additional Applications, choose HBase and Configure and add.
-
Choose Schedule Regular Backups.
-
Specify whether the backups should be consistent. A consistent backup pauses write operations during the initial backup stage, which is the synchronization across nodes. Any write operations paused in this way are placed in a queue and resume when the synchronization completes.
-
Set how often backups should occur by entering a number for Backup Frequency and choosing Days, Hours, or Minutes. The first automated backup that runs is a full backup; after that, Amazon EMR saves incremental backups based on the schedule that you specify.
-
Specify the location in Amazon S3 where the backups should be stored. Each HBase cluster should be backed up to a separate location in Amazon S3 to ensure that incremental backups are calculated correctly.
-
Specify when the first backup should occur by setting a value for Backup Start Time. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2013-09-26T20:00Z sets the start time to September 26, 2013 at 8 PM UTC.
-
Choose Add.
-
Proceed with creating the cluster with other options as desired.
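A Backup Start Time value in the ISO form used above can be produced from a Unix timestamp, as in this sketch. It assumes GNU date (the -d @EPOCH form), because BSD and macOS date take different options. The function name iso_start_time is illustrative.

```shell
# Sketch: format a Unix epoch timestamp as an ISO-style UTC value for Backup Start Time.
# Assumes GNU date; iso_start_time is a hypothetical helper name.
iso_start_time() {
  # $1 = seconds since the Unix epoch
  date -u -d "@$1" +%Y-%m-%dT%H:%MZ
}

# 1380225600 corresponds to September 26, 2013 at 8 PM UTC.
iso_start_time 1380225600
```

The example call prints 2013-09-26T20:00Z, matching the sample value in the procedure.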
Monitor HBase with CloudWatch
Amazon EMR reports three metrics to CloudWatch that you can use to monitor your HBase backups. These metrics are pushed to CloudWatch at five-minute intervals, and are provided without charge.
Metric | Description |
---|---|
HBaseBackupFailed | Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase clusters. Use case: Monitor HBase backups. Units: Count |
HBaseMostRecentBackupDuration | The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes after the backup started. This metric is only reported for HBase clusters. Use case: Monitor HBase backups. Units: Minutes |
HBaseTimeSinceLastSuccessfulBackup | The number of elapsed minutes after the last successful HBase backup started on your cluster. This metric is only reported for HBase clusters. Use case: Monitor HBase backups. Units: Minutes |
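One way to read these metrics from the command line is sketched below. It assumes the AWS/ElasticMapReduce namespace and the JobFlowId dimension that Amazon EMR uses for cluster metrics, and a 300-second period to match the five-minute reporting interval. The helper name and the sample cluster ID and time window are placeholders, and the command is printed rather than executed.

```shell
# Sketch: print a CloudWatch query for the HBaseBackupFailed metric of one cluster.
# backup_failed_query is a hypothetical helper name.
backup_failed_query() {
  # $1 = cluster ID, $2 = window start (ISO 8601), $3 = window end (ISO 8601)
  local cmd
  cmd="aws cloudwatch get-metric-statistics --namespace AWS/ElasticMapReduce"
  cmd="$cmd --metric-name HBaseBackupFailed"
  cmd="$cmd --dimensions Name=JobFlowId,Value=$1"
  cmd="$cmd --start-time $2 --end-time $3"
  cmd="$cmd --period 300 --statistics Maximum"
  printf '%s\n' "$cmd"
}

# Example: query a one-hour window (all values are placeholders).
backup_failed_query j-3AEXXXXXX16F2 2013-09-26T20:00:00Z 2013-09-26T21:00:00Z
```

Using the Maximum statistic means that any data point of 1 in the window indicates a failed backup attempt.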
Configure Ganglia for HBase
You configure Ganglia for HBase using the configure-hbase-for-ganglia bootstrap action. This bootstrap action configures HBase to publish metrics to Ganglia.
You must configure HBase and Ganglia when you launch the cluster; Ganglia reporting cannot be added to a running cluster.
Ganglia also stores log files on the server at /mnt/var/log/ganglia/rrds. If you configured your cluster to persist log files to an Amazon S3 bucket, the Ganglia log files are persisted there as well.
To launch a cluster with Ganglia for HBase, use the configure-hbase-for-ganglia bootstrap action as shown in the following example.
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
--applications Name=Hue Name=Hive Name=Pig Name=HBase Name=Ganglia \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-type c1.xlarge --instance-count 3 --termination-protected \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia
After the cluster is launched with Ganglia configured, you can access the Ganglia graphs and reports using the graphical interface running on the master node.