Apache HBase - Amazon EMR

Apache HBase

HBase is an open source, non-relational, distributed database developed as part of the Apache Software Foundation's Hadoop project. HBase runs on top of Hadoop Distributed File System (HDFS) to provide non-relational database capabilities for the Hadoop ecosystem. HBase is included with Amazon EMR release version 4.6.0 and later.

HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to the MapReduce framework and execution engine. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC). For more information about HBase, see Apache HBase and HBase documentation on the Apache website. For an example of how to use HBase with Hive, see the AWS Big Data Blog post Combine NoSQL and massively parallel analytics using Apache HBase and Apache Hive on Amazon EMR.

With HBase on Amazon EMR, you can also back up your HBase data directly to Amazon Simple Storage Service (Amazon S3), and restore from a previously created backup when launching an HBase cluster. Amazon EMR offers additional options to integrate with Amazon S3 for data persistence and disaster recovery.

  • HBase on Amazon S3 - With Amazon EMR version 5.2.0 and later, you can use HBase on Amazon S3 to store a cluster's HBase root directory and metadata directly to Amazon S3. You can subsequently start a new cluster, pointing it to the root directory location in Amazon S3. Only one cluster at a time can use the HBase location in Amazon S3, with the exception of a read-replica cluster. For more information, see HBase on Amazon S3 (Amazon S3 storage mode).

  • HBase read-replicas - Amazon EMR version 5.7.0 and later with HBase on Amazon S3 supports read-replica clusters. A read-replica cluster provides read-only access to a primary cluster's store files and metadata. For more information, see Using a read-replica cluster.

  • HBase Snapshots - As an alternative to HBase on Amazon S3, with EMR version 4.0 and later you can create snapshots of your HBase data directly to Amazon S3 and then recover data using the snapshots. For more information, see Using HBase snapshots.
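The options above are typically selected through cluster configuration classifications and the HBase command line. The following Python sketch shows one way to build those settings. It assumes the `hbase-site` and `hbase` classifications with the `hbase.rootdir`, `hbase.emr.storageMode`, and `hbase.emr.readreplica.enabled` properties, and the `hbase snapshot export` command, as described in the linked topics; confirm each against those pages for your release. The function names and the bucket name are hypothetical placeholders.

```python
import json


def hbase_s3_configurations(root_dir, read_replica=False):
    """Build an EMR configuration list for HBase on Amazon S3.

    root_dir is the S3 location of the HBase root directory.
    Set read_replica=True only for a read-replica cluster that points
    at a primary cluster's root directory.
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")

    # Assumed property names; verify them in the HBase on Amazon S3 topic.
    hbase_props = {"hbase.emr.storageMode": "s3"}
    if read_replica:
        hbase_props["hbase.emr.readreplica.enabled"] = "true"

    return [
        {"Classification": "hbase-site", "Properties": {"hbase.rootdir": root_dir}},
        {"Classification": "hbase", "Properties": hbase_props},
    ]


def snapshot_export_command(snapshot_name, s3_destination, mappers=2):
    """Return the argument list for exporting an existing snapshot to Amazon S3."""
    return [
        "hbase", "snapshot", "export",
        "-snapshot", snapshot_name,
        "-copy-to", s3_destination,
        "-mappers", str(mappers),
    ]


if __name__ == "__main__":
    # Placeholder bucket; replace with a bucket that you own.
    primary = hbase_s3_configurations("s3://amzn-s3-demo-bucket/hbase-root")
    print(json.dumps(primary, indent=2))
```

The printed JSON can be supplied as the configurations for a new cluster, and the list returned by `snapshot_export_command` can be passed to `subprocess.run` on the primary node. Because only one non-replica cluster at a time can use a given root directory, keep `read_replica=False` for at most one running cluster per location.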

Important

We do not recommend using managed scaling or scaling with custom policies for Amazon EMR clusters that run HBase.

The following table lists the version of HBase included in the latest release of the Amazon EMR 7.x series, along with the components that Amazon EMR installs with HBase.

For the versions of the components installed with HBase in this release, see Release 7.4.0 Component Versions.

HBase version information for emr-7.4.0

Amazon EMR Release Label | HBase Version | Components Installed With HBase
emr-7.4.0 | HBase 2.5.5 | emrfs, emr-ddb, emr-goodies, emr-kinesis, emr-s3-dist-cp, emr-wal-cli, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-mapred, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hadoop-yarn-timeline-server, hbase-hmaster, hbase-client, hbase-region-server, hbase-rest-server, hbase-thrift-server, hbase-operator-tools, zookeeper-client, zookeeper-server

The following table lists the version of HBase included in the latest release of the Amazon EMR 6.x series, along with the components that Amazon EMR installs with HBase.

For the versions of the components installed with HBase in this release, see Release 6.15.0 Component Versions.

HBase version information for emr-6.15.0

Amazon EMR Release Label | HBase Version | Components Installed With HBase
emr-6.15.0 | HBase 2.4.17 | emrfs, emr-ddb, emr-goodies, emr-kinesis, emr-s3-dist-cp, emr-wal-cli, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-mapred, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hadoop-yarn-timeline-server, hbase-hmaster, hbase-client, hbase-region-server, hbase-rest-server, hbase-thrift-server, hbase-operator-tools, zookeeper-client, zookeeper-server

Note

Apache HBase HBCK2 is a separate operational tool for repairing HBase regions and system tables. In Amazon EMR version 6.1.0 and later, the hbase-hbck2.jar is provided in /usr/lib/hbase-operator-tools/ on the primary node. For more information about how to build and use the tool, see HBase HBCK2.
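As a rough illustration of how the bundled JAR is invoked, the following Python sketch assembles an HBCK2 command line. The JAR path comes from the note above; the `hbase hbck -j` invocation and the `--help` option follow the upstream HBCK2 documentation as understood here, so check them against the version of hbase-operator-tools on your cluster. The `hbck2_command` helper is a hypothetical name used only for this example.

```python
import shlex

# Location of the HBCK2 JAR on the primary node (Amazon EMR 6.1.0 and later).
HBCK2_JAR = "/usr/lib/hbase-operator-tools/hbase-hbck2.jar"


def hbck2_command(*args):
    """Return the argument list for running HBCK2 with the given arguments."""
    return ["hbase", "hbck", "-j", HBCK2_JAR, *args]


if __name__ == "__main__":
    # Print the command that lists the available HBCK2 commands and options.
    print(shlex.join(hbck2_command("--help")))
```

Running the printed command on the primary node shows the HBCK2 usage text. Repair commands change cluster state, so review the HBCK2 documentation before passing anything other than `--help`.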

The following table lists the version of HBase included in the latest release of the Amazon EMR 5.x series, along with the components that Amazon EMR installs with HBase.

For the versions of the components installed with HBase in this release, see Release 5.36.2 Component Versions.

HBase version information for emr-5.36.2

Amazon EMR Release Label | HBase Version | Components Installed With HBase
emr-5.36.2 | HBase 1.4.13 | emrfs, emr-ddb, emr-goodies, emr-kinesis, emr-s3-dist-cp, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-mapred, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hadoop-yarn-timeline-server, hbase-hmaster, hbase-client, hbase-region-server, hbase-rest-server, hbase-thrift-server, zookeeper-client, zookeeper-server