Preparing the test environment
Preparing your AWS account for performance testing
To identify the optimal configuration of your HBase on Amazon S3 cluster, you will need to iterate on several configuration values during a testing stage. Not only will you make changes to HBase configurations but also to the type and family of the cluster's Amazon EC2 instances.
To avoid any impact on existing workloads on the account used for testing or production, we recommend that you raise the limits identified in this section according to your testing or production account needs.
Increasing Amazon EC2 and Amazon EBS Limits
To avoid any delays during performance tests, raise the following limits in your AWS account, since you may need to deploy several clusters at the same time (replicas, clusters pointing to different HBase root directories, and so on). If your cluster size is small, the default values may be sufficient. For more details about the current limits applied to your account, refer to Trusted Advisor.
- R5/R5d family: increase the limit to 200% of your cluster's estimated size for xl, 2xl, and 4xl.
- Total volume storage of General Purpose SSD (gp2) volumes: increase the limit with additional capacity (4x the total dataset size).
- For example: if the dataset is 40 TB, the SSD capacity available (instance store or Amazon EBS volumes) must be at least 40 TB. Account for additional storage because you may need to deploy several clusters at the same time (replicas, clusters pointing to different Apache HBase root directories). Refer to the Sizing local storage section of this document for more details.
Increasing AWS KMS limits
Amazon S3 encryption works with EMRFS objects read from and written to Amazon S3. If you do not have a security requirement for data at rest, then you can skip this section. If your cluster is small, the default values may be sufficient. For additional details about AWS KMS limits, refer to Quotas.
Preparing Amazon S3 for your HBase workload
Amazon S3 can scale to the very high request rates that an HBase on Amazon S3 cluster can drive. It's valuable to understand the exact performance characteristics of your HBase workloads when migrating to a new storage layer, especially when moving to an object store such as Amazon S3.
Amazon S3 automatically scales to high request rates and currently
supports up to 3500 PUT/POST/DELETE
requests per second and 5500
GET requests per second per prefix in a bucket. If your request
rate grows steadily, Amazon S3 automatically scales beyond these
rates as needed.
If you expect the request rate per prefix to be higher than the preceding request rate, or if you expect the request rate to rapidly increase instead of gradually increase, the Amazon S3 bucket must be prepared to support the workloads of your HBase on Amazon S3 cluster. For more details on how to prepare the Amazon S3 bucket, refer to the Preparing Amazon S3 for production load section of this document.
This helps minimize throttling on Amazon S3. To understand how you can recognize that Amazon S3 is throttling the requests from your cluster, refer to the Troubleshooting section of this document.
Enabling Amazon S3 metrics for the HBase on Amazon S3 root directory
The Amazon CloudWatch request metrics for Amazon S3 enable the collection of Amazon S3 API metrics for a specific bucket. These metrics provide a good understanding of the TPS driven by your HBase cluster and can help identify operational issues.
Note
Amazon CloudWatch metrics incur a cost. For more information, refer to How Do I Configure Request Metrics for an S3 Bucket? and Monitoring Metrics with Amazon CloudWatch.
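As a hedged illustration, a request metrics configuration similar to the following could be supplied through the Amazon S3 PutBucketMetricsConfiguration API (for example, with the aws s3api put-bucket-metrics-configuration command). The Id value and the hbase-root/ prefix are placeholders; scope the filter to the prefix that holds your HBase root directory.
{
  "Id": "hbase-root-metrics",
  "Filter": {
    "Prefix": "hbase-root/"
  }
}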
Enabling Amazon S3 lifecycle rules to end and clean up incomplete multipart uploads
HBase on Amazon S3 via EMRFS uses the Amazon S3 multipart upload API, which enables EMRFS to upload large objects in parts. For more details on the multipart upload API, refer to Uploading and copying objects using multipart upload.
Note
After you initiate a multipart upload and upload one or more parts, you must either complete or abort the multipart upload to stop incurring storage charges for the uploaded parts. Only after you complete or abort the upload does Amazon S3 free the parts storage and stop charging you for it.
Amazon S3 provides a lifecycle rule that, when configured, automatically removes incomplete multipart uploads. For complete steps on how to create a Bucket Lifecycle Policy and apply it to the HBase root directory bucket, refer to Uploading and copying objects using multipart upload.
Alternatively, you can use the AWS Management Console to configure the lifecycle policy. For more details, refer to Amazon S3 Lifecycle Management Update – Support for Multipart Uploads and Delete Markers.
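As a minimal sketch (assuming a seven-day abort window, which you should adjust to your needs), a lifecycle configuration like the following can be applied to the HBase root directory bucket, for example with the aws s3api put-bucket-lifecycle-configuration command; an empty Prefix applies the rule to the entire bucket.
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": {
        "Prefix": ""
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}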
Amazon EMR cluster setup
Selecting an Amazon EMR release
We strongly recommend that you use the latest release of Amazon EMR when possible. Refer to Amazon EMR 5.x release versions to find the software versions available in the latest Amazon EMR release. For more details, refer to Migrating from previous HBase versions.
We also recommend that you deploy the cluster with only the required applications. This is especially important in production so you can properly use the full resources of the cluster.
Applying HBase and EMRFS configurations to the cluster
Amazon EMR allows the configuration of applications by supplying a JSON object with any changes to default values. For more information, refer to Configure applications.
Applying HBase configurations
This section includes guidelines on how to construct the JSON object that can be supplied to the cluster during cluster deployment. Most of these properties are configured in the hbase-site.xml file.
Other HBase settings, such as the RegionServer and Master heap sizes and the logging settings, have their own configuration files and their own classifications when setting up the JSON object.
For an example JSON object to configure the properties written to hbase-site.xml, refer to Configure HBase. In addition to the hbase-site classification, you may need to use the hbase-log4j classification to change values in HBase's hbase-log4j.properties file and the hbase-env classification to change values in HBase's environment.
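For illustration, the following is a minimal sketch of an hbase-site and hbase classification pair that points the HBase root directory at Amazon S3; the bucket name and prefix (s3://my-hbase-bucket/hbase-root) are placeholders for your own location.
[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.rootdir": "s3://my-hbase-bucket/hbase-root"
    }
  },
  {
    "Classification": "hbase",
    "Properties": {
      "hbase.emr.storageMode": "s3"
    }
  }
]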
Configuring HBase to expose metrics via JMX
An example JSON object to configure HBase to expose metrics via JMX follows.
[ { "Classification": "hbase-env", "Properties": { }, "Configurations": [ { "Classification": "export", "Properties": { "HBASE_REGIONSERVER_OPTS": " - Dcom.sun.management.jmxremote.ssl=false - Dcom.sun.management.jmxremote.authenticate=false - Dcom.sun.management.jmxremote.port=10102", "HBASE_MASTER_OPTS": “- Dcom.sun.management.jmxremote.ssl=false - Dcom.sun.management.jmxremote.authenticate=false - Dcom.sun.management.jmxremote.port=10101" }, "Configurations": [ ] } ] } ]
Configuring the log level for HBase
{ "Classification": "hbase-log4j", "Properties": { "log4j.logger.org.apache.hadoop.hbase": "DEBUG" } }
Applying EMRFS configurations
{ "Classification": "emrfs-site", "Properties": { "fs.s3.maxConnections": "10000" } }
Sizing the cluster compute and local storage resources
Sizing compute capacity, selecting an instance type
When sizing your cluster, you can consider having a large cluster with a smaller instance type or having a small cluster with a more powerful instance type. We recommend extensive testing to find the correct instance type for your application SLA. As a starting point, you can use the latest generation of memory-optimized instance types (R5/R5d) and the same region density per RegionServer as in your HDFS-backed cluster. R5d instances share the same specifications as R5 instances, and also include up to 3.6 TB of local NVMe storage. For more details on these instance types, refer to Now Available: R5, R5d, and z1d Instances.
Sizing local storage
The disk requirements of the cluster depend on your application SLA and access patterns. As a rule of thumb, read intensive applications benefit from caching data on the BucketCache. For this reason, the disk size should be large enough to cover all caching requirements, HDFS requirements (write path), and OS and Apache Hadoop requirements.
Storage options on Amazon EMR
On Amazon EMR, you have the option to choose an Amazon EBS volume or the instance store. The AMI used by your cluster dictates whether the root device volume uses the instance store or an Amazon EBS volume. Some AMIs use Amazon EC2 instance store, and some use Amazon EBS. When you configure instance types in Amazon EMR, you can add Amazon EBS volumes, which contribute to the total capacity together with instance store (if present) and the default Amazon EBS volume. Amazon EBS provides the following volume types: General Purpose (SSD), Provisioned IOPS (SSD), Throughput Optimized (HDD), Cold (HDD), and Magnetic. They differ in performance characteristics and price to support multiple analytic and business needs. For a detailed description of storage options on Amazon EMR, refer to Instance storage.
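For illustration only, the following is a sketch of an instance group definition (in the format accepted by aws emr create-cluster --instance-groups) that adds gp2 volumes to the core nodes; the instance type, instance count, volume size, and volume count are placeholder values to adapt to your own sizing.
[
  {
    "InstanceGroupType": "CORE",
    "InstanceType": "r5.4xlarge",
    "InstanceCount": 10,
    "EbsConfiguration": {
      "EbsBlockDeviceConfigs": [
        {
          "VolumeSpecification": {
            "VolumeType": "gp2",
            "SizeInGB": 500
          },
          "VolumesPerInstance": 4
        }
      ]
    }
  }
]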
Selecting and sizing local storage for the BucketCache
Most HBase workloads perform well with General Purpose SSD (gp2) volumes. The volume mix per Amazon EMR Core instance can be either two or more large volumes, or multiple small volumes, in addition to the root volume.
Note that when your instance has multiple volumes, the BucketCache is divided across n-1 volumes. The first volume stores logs and temporary data. Refer to the Tuning the size of the BucketCache section of this document for details on how to choose a starting value for the size of the BucketCache and the starting disk requirements for your Amazon EMR Core/Task nodes.
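As a hypothetical starting point only (derive the actual value from the Tuning the size of the BucketCache section), the cache size can be set through the hbase-site classification; the 98304 MB value shown below is a placeholder.
{
  "Classification": "hbase-site",
  "Properties": {
    "hbase.bucketcache.size": "98304"
  }
}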
Applying security configurations to Amazon EMR and EMRFS
You can use Security Configurations to apply the configurations that support at-rest data encryption, in-transit data encryption, and authentication. For more details, refer to Create a security configuration.
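As a hedged sketch, a security configuration similar to the following (supplied to aws emr create-security-configuration) enables SSE-KMS at-rest encryption for EMRFS data in Amazon S3; the AWS KMS key ARN is a placeholder, and in-transit encryption and local disk encryption can be enabled in the same document if your requirements call for them.
{
  "EncryptionConfiguration": {
    "EnableInTransitEncryption": false,
    "EnableAtRestEncryption": true,
    "AtRestEncryptionConfiguration": {
      "S3EncryptionConfiguration": {
        "EncryptionMode": "SSE-KMS",
        "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
      }
    }
  }
}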
Depending on the strategy you choose for authorizing access to HBase, HBase configurations can be applied via the same process included in the Applying HBase and EMRFS configurations to the cluster section of this document.
Due to performance issues reported when block data transfer encryption uses 3DES, Transparent Encryption is preferred over encrypting block data transfer. For more details on Transparent Encryption, refer to Appendix C: Transparent encryption reference.
Troubleshooting
Error message excerpt | Description/Solution
---|---
Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown…) | Amazon S3 is throttling requests from your cluster due to an excessive number of transactions per second to specific object prefixes. Find the request rate and prepare the Amazon S3 bucket for that request rate: use the request metrics for the Amazon S3 bucket location of the HBase root directory to review the number of requests for the previous hour (the request rate). See the Preparing Amazon S3 for your HBase workload and Preparing Amazon S3 for production load sections for details on how to prepare the Amazon S3 bucket location of the HBase root directory for your request rate.
Unable to execute HTTP request: Timeout waiting for connection from pool | Increase the value of the fs.s3.maxConnections property. See the Setting the total number of connections used by EMRFS to read/write data from/to Amazon S3 section for more details on how to tune this property.