Preparing the test environment
Preparing your AWS account for performance testing
To identify the optimal configuration of your HBase on Amazon S3 cluster, you will need to iterate on several configuration values during a testing stage. Not only will you make changes to HBase configurations but also to the type and family of the cluster's Amazon EC2 instances.
To avoid any impact on existing workloads on the account used for testing or production, we recommend that you raise the limits identified in this section according to your testing or production account needs.
Increasing Amazon EC2 and Amazon EBS Limits
To avoid any delays during performance tests, raise the following limits in your AWS account, since you may need to deploy several clusters at the same time (replicas, clusters pointing to different HBase root directories, and so on). If your cluster size is small, the default values may be sufficient. For more details about the current limits applied to your account, refer to Trusted Advisor.
- R5/R5d family: increase the limit to 200% of your cluster's estimated size for xl, 2xl, and 4xl.
- Total volume storage of General Purpose SSD (gp2) volumes: increase the limit with additional capacity (4x the total dataset size).
- For example: if the dataset is 40 TB, the SSD capacity available (instance store or Amazon EBS volumes) must be at least 40 TB. Account for additional storage because you may need to deploy several clusters at the same time (replicas, clusters pointing to different Apache HBase root directories). Refer to the Sizing local storage section of this document for more details.
Increasing AWS KMS limits
Amazon S3 encryption works with EMRFS objects read from and written to Amazon S3. If you do not have a security requirement for data at rest, then you can skip this section. If your cluster is small, the default values may be sufficient. For additional details about AWS KMS limits, refer to Quotas.
Preparing Amazon S3 for your HBase workload
Amazon S3 can scale to the very high request rates that an HBase on Amazon S3 cluster can drive. It's valuable to understand the exact performance characteristics of your HBase workloads when migrating to a new storage layer, especially when moving to an object store such as Amazon S3.
Amazon S3 automatically scales to high request rates and currently
supports up to 3500 PUT/POST/DELETE
requests per second and 5500
GET requests per second per prefix in a bucket. If your request
rate grows steadily, Amazon S3 automatically scales beyond these
rates as needed.
If you expect the request rate per prefix to be higher than the preceding request rate, or if you expect the request rate to rapidly increase instead of gradually increase, the Amazon S3 bucket must be prepared to support the workloads of your HBase on Amazon S3 cluster. For more details on how to prepare the Amazon S3 bucket, refer to the Preparing Amazon S3 for production load section of this document.
This helps minimize throttling on Amazon S3. To understand how you can recognize that Amazon S3 is throttling the requests from your cluster, refer to the Troubleshooting section of this document.
Enabling Amazon S3 metrics for the HBase on Amazon S3 root directory
The Amazon CloudWatch request metrics for Amazon S3 enable the collection of Amazon S3 API metrics for a specific bucket. These metrics provide a good understanding of the TPS driven by your HBase cluster and can help identify operational issues.
Note
Amazon CloudWatch metrics incur a cost. For more information, refer to How Do I Configure Request Metrics for an S3 Bucket? and Monitoring Metrics with Amazon CloudWatch.
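As a hedged illustration, a request metrics configuration similar to the following could be supplied through the Amazon S3 PutBucketMetricsConfiguration API (for example, with the aws s3api put-bucket-metrics-configuration command). The Id value and the hbase-root/ prefix are placeholders; scope the filter to the prefix that holds your HBase root directory.
{
  "Id": "hbase-root-metrics",
  "Filter": {
    "Prefix": "hbase-root/"
  }
}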
Enabling Amazon S3 lifecycle rules to end and clean up incomplete multipart uploads
HBase on Amazon S3 via EMRFS uses the Amazon S3 multipart upload API, which enables EMRFS to upload large objects in parts. For more details on the multipart upload API, refer to Uploading and copying objects using multipart upload.
Note
After you initiate a multipart upload and upload one or more parts, you must either complete or abort the multipart upload to stop incurring storage charges for the uploaded parts. Only after you complete or abort the upload does Amazon S3 free the parts storage and stop charging you for it.
Amazon S3 provides a lifecycle rule that, when configured, automatically removes incomplete multipart uploads. For complete steps on how to create a Bucket Lifecycle Policy and apply it to the HBase root directory bucket, refer to Uploading and copying objects using multipart upload.
Alternatively, you can use the AWS Management Console to configure the lifecycle policy. For more details, refer to Amazon S3 Lifecycle Management Update – Support for Multipart Uploads and Delete Markers.
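As a minimal sketch (assuming a seven-day abort window, which you should adjust to your needs), a lifecycle configuration like the following can be applied to the HBase root directory bucket, for example with the aws s3api put-bucket-lifecycle-configuration command; an empty Prefix applies the rule to the entire bucket.
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": {
        "Prefix": ""
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}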
Amazon EMR cluster setup
Selecting an Amazon EMR release
We strongly recommend that you use the latest release of Amazon EMR when possible. Refer to Amazon EMR 5.x release versions to find the software versions available in the latest Amazon EMR release. For more details, refer to Migrating from previous HBase versions.
We also recommend that you deploy the cluster with only the required applications. This is especially important in production so you can properly use the full resources of the cluster.
Applying HBase and EMRFS configurations to the cluster
Amazon EMR allows the configuration of applications by supplying a JSON object with any changes to default values. For more information, refer to Configure applications.
Applying HBase configurations
This section includes guidelines on how to construct the JSON object that can be supplied to the cluster during cluster deployment. Most of these properties are configured in the hbase-site.xml file.
Other HBase settings, such as the RegionServer and Master heap sizes and the logging settings, have their own configuration files and their own classifications when setting up the JSON object.
For an example JSON object to configure the properties written to hbase-site.xml, refer to Configure HBase. In addition to the hbase-site classification, you may need to use the hbase-log4j classification to change values in HBase's hbase-log4j.properties file and the hbase-env classification to change values in HBase's environment.
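For illustration, the following is a minimal sketch of an hbase-site and hbase classification pair that points the HBase root directory at Amazon S3; the bucket name and prefix (s3://my-hbase-bucket/hbase-root) are placeholders for your own location.
[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.rootdir": "s3://my-hbase-bucket/hbase-root"
    }
  },
  {
    "Classification": "hbase",
    "Properties": {
      "hbase.emr.storageMode": "s3"
    }
  }
]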
Configuring HBase to expose metrics via JMX
An example JSON object to configure HBase to expose metrics via JMX follows.
[ { "Classification": "hbase-env", "Properties": { }, "Configurations": [ { "Classification": "export", "Properties": { "HBASE_REGIONSERVER_OPTS": " - Dcom.sun.management.jmxremote.ssl=false - Dcom.sun.management.jmxremote.authenticate=false - Dcom.sun.management.jmxremote.port=10102", "HBASE_MASTER_OPTS": “- Dcom.sun.management.jmxremote.ssl=false - Dcom.sun.management.jmxremote.authenticate=false - Dcom.sun.management.jmxremote.port=10101" }, "Configurations": [ ] } ] } ]
Configuring the log level for HBase
{ "Classification": "hbase-log4j", "Properties": { "log4j.logger.org.apache.hadoop.hbase": "DEBUG" } }
Applying EMRFS configurations
{ "Classification": "emrfs-site", "Properties": { "fs.s3.maxConnections": "10000" } }
Sizing the cluster compute and local storage resources
Sizing compute capacity, selecting an instance type
When sizing your cluster, you can consider having a large cluster with a smaller instance type or having a small cluster with a more powerful instance type. We recommend extensive testing to find the correct instance type for your application SLA. As a starting point, you can use the latest generation of memory-optimized instance types (R5/R5d) and the same region density per RegionServer as in your HDFS-backed cluster. R5d instances share the same specifications as R5 instances, and also include up to 3.6 TB of local NVMe storage. For more details on these instance types, refer to Now Available: R5, R5d, and z1d Instances.
Sizing local storage
The disk requirements of the cluster depend on your application SLA and access patterns. As a rule of thumb, read intensive applications benefit from caching data on the BucketCache. For this reason, the disk size should be large enough to cover all caching requirements, HDFS requirements (write path), and OS and Apache Hadoop requirements.
Storage options on Amazon EMR
On Amazon EMR, you have the option to choose an Amazon EBS volume or the instance store. The AMI used by your cluster dictates whether the root device volume uses the instance store or an Amazon EBS volume. Some AMIs use Amazon EC2 instance store, and some use Amazon EBS. When you configure instance types in Amazon EMR, you can add Amazon EBS volumes, which contribute to the total capacity together with instance store (if present) and the default Amazon EBS volume. Amazon EBS provides the following volume types: General Purpose (SSD), Provisioned IOPS (SSD), Throughput Optimized (HDD), Cold (HDD), and Magnetic. They differ in performance characteristics and price to support multiple analytic and business needs. For a detailed description of storage options on Amazon EMR, refer to Instance storage.
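For illustration only, the following is a sketch of an instance group definition (in the format accepted by aws emr create-cluster --instance-groups) that adds gp2 volumes to the core nodes; the instance type, instance count, volume size, and volume count are placeholder values to adapt to your own sizing.
[
  {
    "InstanceGroupType": "CORE",
    "InstanceType": "r5.4xlarge",
    "InstanceCount": 10,
    "EbsConfiguration": {
      "EbsBlockDeviceConfigs": [
        {
          "VolumeSpecification": {
            "VolumeType": "gp2",
            "SizeInGB": 500
          },
          "VolumesPerInstance": 4
        }
      ]
    }
  }
]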
Selecting and sizing local storage for the BucketCache
Most HBase workloads perform well with General Purpose SSD (gp2) volumes. The volume mix per Amazon EMR Core instance can be either two or more large volumes, or multiple small volumes, in addition to the root volume.
Note that when your instance has multiple volumes, the BucketCache is divided across n-1 volumes. The first volume stores logs and temporary data. Refer to the Tuning the size of the BucketCache section of this document for details on how to choose a starting value for the size of the BucketCache and the starting disk requirements for your Amazon EMR Core/Task nodes.
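As a hypothetical starting point only (derive the actual value from the Tuning the size of the BucketCache section), the cache size can be set through the hbase-site classification; the 98304 MB value shown below is a placeholder.
{
  "Classification": "hbase-site",
  "Properties": {
    "hbase.bucketcache.size": "98304"
  }
}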
Applying security configurations to Amazon EMR and EMRFS
You can use Security Configurations to apply the configurations that support at-rest data encryption, in-transit data encryption, and authentication. For more details, refer to Create a security configuration.
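As a hedged sketch, a security configuration similar to the following (supplied to aws emr create-security-configuration) enables SSE-KMS at-rest encryption for EMRFS data in Amazon S3; the AWS KMS key ARN is a placeholder, and in-transit encryption and local disk encryption can be enabled in the same document if your requirements call for them.
{
  "EncryptionConfiguration": {
    "EnableInTransitEncryption": false,
    "EnableAtRestEncryption": true,
    "AtRestEncryptionConfiguration": {
      "S3EncryptionConfiguration": {
        "EncryptionMode": "SSE-KMS",
        "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
      }
    }
  }
}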
Depending on the strategy you choose for authorizing access to HBase, HBase configurations can be applied via the same process included in the Applying HBase and EMRFS configurations to the cluster section of this document.
Due to performance issues reported when block data transfer encryption uses 3DES, Transparent Encryption is preferred over encrypting block data transfer. For more details on Transparent Encryption, refer to Appendix C: Transparent encryption reference.
Troubleshooting
Error message excerpt | Description/Solution
---|---
Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown…) | Amazon S3 is throttling requests from your cluster due to an excessive number of transactions per second to specific object prefixes. Find the request rate and prepare the Amazon S3 bucket for that request rate: use the request metrics for the Amazon S3 bucket location of the HBase root directory to review the number of requests for the previous hour (the request rate). See the Preparing Amazon S3 for your HBase workload and Preparing Amazon S3 for production load sections for details on how to prepare the Amazon S3 bucket location of the HBase root directory for your request rate.
Unable to execute HTTP request: Timeout waiting for connection from pool | Increase the value of the fs.s3.maxConnections property. See the Setting the total number of connections used by EMRFS to read/write data from/to Amazon S3 section for more details on how to tune this property.