Managing the production environment

Operationalization tasks

Node decommissioning

When the YARN Resource Manager gracefully decommissions a node (during a user-initiated shrink operation or after a node failure such as a bad disk), the RegionServer first closes its regions and then shuts down. You can also gracefully decommission a RegionServer on any active node by stopping the daemon manually. This step may be required while troubleshooting a particular RegionServer in the cluster.


sudo stop hbase-regionserver

During shutdown, the RegionServer's znode expires. The HMaster notices this event and treats that RegionServer as a crashed server. The HMaster then reassigns the regions that the RegionServer used to serve to other online RegionServers. Depending on the prefetch settings, each RegionServer that is newly assigned a region warms its cache for that region.
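One way to confirm that the HMaster has registered the stopped RegionServer is to read the summary line printed by the HBase shell status command. The following is a minimal sketch, assuming the HBase 1.x summary format shown in the comment; the function name dead_server_count is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads HBase shell output on stdin and prints the
# number of dead RegionServers reported in the "status" summary line.
# Assumes a summary line such as:
#   1 active master, 0 backup masters, 3 servers, 1 dead, 2.0000 average load
dead_server_count() {
  sed -n 's/.*, \([0-9][0-9]*\) dead,.*/\1/p' | head -1
}

# Example usage on the master node (not run here):
#   echo "status" | hbase shell 2>/dev/null | dead_server_count
```

If the summary line is not present in the input, the helper prints nothing, so an empty result should be treated as "unknown" rather than as zero.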

Rolling restart

A rolling restart restarts the HMaster process on the master node and the HRegionServer process on all the core nodes.

Before you begin, check for any inconsistencies and make sure that the HBase balancer is turned off so that it does not interfere with region deployments.

Use the shell to disable the HBase balancer:


hbase(main):001:0> balance_switch false
true
0 row(s) in 0.2970 seconds
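The inconsistency check that accompanies disabling the balancer can be scripted by scanning the hbck report for its final status line. The following is a minimal sketch, assuming that hbase hbck ends its report with a line reading "Status: OK" when no inconsistencies are found, as in HBase 1.x; the function name hbck_reports_ok is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads an hbck report on stdin and succeeds only
# if the report contains the line "Status: OK" (assumed HBase 1.x format).
hbck_reports_ok() {
  grep -q '^Status: OK'
}

# Example usage on the master node (not run here):
#   if hbase hbck 2>/dev/null | hbck_reports_ok; then
#     echo "balance_switch false" | hbase shell
#   else
#     echo "hbck reported inconsistencies; fix them before restarting" >&2
#   fi
```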

The following is a sample script that performs a rolling restart on an Apache HBase cluster. Run this script on the Amazon EMR master node, which must have the Amazon EC2 key pair (.pem) file that is used to log in to the Amazon EMR core nodes.



#!/bin/bash
sudo stop hbase-master; sudo start hbase-master
for node in $(yarn node -list | grep -i ip- | cut -f2 -d: | cut -f2 -d'G' | xargs) ; do
        ssh -i ~/hadoop.pem -t -o "StrictHostKeyChecking no" hadoop@$node "sudo stop hbase-regionserver;sudo start hbase-regionserver"
done
#Restart HMaster again to clear out dead servers list and reenable the balancer
sudo stop hbase-master; sudo start hbase-master
#Run hbck utility to make sure HBase is consistent
hbase hbck
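The host extraction in the loop above relies on chained cut commands. As an alternative sketch, the hostnames can be pulled out with awk, assuming the usual yarn node -list layout in which the first column is host:port and the second column is the node state. The function name running_node_hosts is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads "yarn node -list" output on stdin and prints
# the hostname of every node whose state column is RUNNING.
# Assumes column 1 is "host:port" and column 2 is the node state.
running_node_hosts() {
  awk '$2 == "RUNNING" { split($1, parts, ":"); print parts[1] }'
}

# Example usage on the master node (not run here):
#   for node in $(yarn node -list 2>/dev/null | running_node_hosts); do
#     echo "would restart hbase-regionserver on $node"
#   done
```

Filtering on the state column skips the header lines and any node that YARN does not report as running.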

Cluster resize

Nodes can be added to or removed from HBase on Amazon S3 clusters by performing a resize operation on the cluster. If an automatic scaling policy is set based on a specific CloudWatch metric (such as IsIdle), the resize operation happens according to that policy. All of these operations are performed gracefully.
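A manual resize can be requested with the AWS CLI modify-instance-groups command. The following sketch only prints the command so that it can be reviewed before it is run; the function name and the instance group ID in the usage comment are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: prints (does not run) the AWS CLI command that
# resizes one instance group to the requested instance count.
print_resize_command() {
  local group_id="$1" count="$2"
  # Reject anything that is not a non-negative integer.
  case "$count" in
    ''|*[!0-9]*) echo "count must be a non-negative integer" >&2; return 1 ;;
  esac
  echo "aws emr modify-instance-groups --instance-groups InstanceGroupId=${group_id},InstanceCount=${count}"
}

# Example usage (placeholder instance group ID):
#   print_resize_command ig-XXXXXXXXXXXX 5
```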

Backup and restore

With HBase on Amazon S3, you can still consider taking snapshots of your tables every few hours (and deleting them after some days) so that you have a point-in-time recovery option available. Refer to the "Running the balancer for specific periods to minimize the impact of region movements on snapshots" section of this document.
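A periodic snapshot job could be built around the HBase shell snapshot and delete_snapshot commands. The following sketch only generates the shell statement for a timestamped snapshot; the function name, the naming scheme, and the table name in the usage comment are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: prints the HBase shell statement that snapshots a
# table under a name of the form <table>-<stamp>.
snapshot_statement() {
  local table="$1" stamp="$2"
  local safe_table
  # Replace the namespace separator ':' because it is assumed not to be
  # valid in a snapshot name.
  safe_table=$(printf '%s' "$table" | tr ':' '_')
  echo "snapshot '${table}', '${safe_table}-${stamp}'"
}

# Example usage from cron on the master node (not run here):
#   snapshot_statement my_table "$(date +%Y%m%d-%H%M)" | hbase shell
```

Embedding the timestamp in the snapshot name makes it straightforward for a later cleanup job to select old snapshots for delete_snapshot.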

Cluster ending without data loss

If you want to end the current cluster and build a new one on the same Amazon S3 root directory, we recommend that you disable all of the tables in the current cluster. This ensures that all data that has not yet been written to Amazon S3 is flushed from the MemStore cache to Amazon S3 in the form of new store files. To do so, the following script uses an existing script (/usr/lib/hbase/bin/disable_all_tables.sh) to disable the tables.


#!/bin/bash
clusterID=$(cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId")
#call disable_all_tables.sh
bash /usr/lib/hbase/bin/disable_all_tables.sh
#Store the output of "list" command in a temp file
echo "list" | hbase shell > tableListSummary.txt
#fetch only the list of tables and store it in another temp file
tail -1 tableListSummary.txt | tr ',' '\n' | tr -d '"' | tr -d '[' | tr -d ']' | tr -d ' ' > tableList.txt

#prepare for iteration
while true; do
  while read line; do
    flag="N"
    echo "is_enabled '$line'" | hbase shell > bool.txt
    bool=$(tail -3 bool.txt | head -1)
    if [ "$bool" = "true" ]; then
      flag="Y"
      break
    fi
  done < tableList.txt
  echo "flag: $flag"
  if [ "$flag" = "N" ]; then
    aws emr terminate-clusters --cluster-ids "$clusterID"
    break
  else
    echo "Tables aren't disabled yet. Sleeping for 5 seconds to try again"
  fi
  sleep 5
done

#cleanup temporary files
rm tableListSummary.txt tableList.txt bool.txt

The preceding script can be placed in a file named disable_and_terminate.sh. Note that the script does not exist on the instance. You can add an Amazon EMR step to first copy the script to the instance and then run a step to disable the tables and end the cluster. To run the script, you can use the following Amazon EMR step properties.


Name="Disable all tables",Jar="command-runner.jar",Args=["/bin/bash","/home/hadoop/disable_and_terminate.sh"]
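The copy step and the run step described above could be submitted together with the AWS CLI add-steps command. The following sketch generates a steps file for review; the Amazon S3 location of the script, the cluster ID in the usage comment, the step named "Copy script", the ActionOnFailure choices, and the function name are all assumptions.

```shell
#!/bin/bash
# Hypothetical helper: prints a JSON steps definition that first copies
# disable_and_terminate.sh from Amazon S3 to the instance and then runs it.
steps_json() {
  local script_s3_uri="$1"
  local target="/home/hadoop/disable_and_terminate.sh"
  cat <<EOF
[
  {"Type": "CUSTOM_JAR", "Name": "Copy script", "ActionOnFailure": "CANCEL_AND_WAIT",
   "Jar": "command-runner.jar", "Args": ["aws", "s3", "cp", "${script_s3_uri}", "${target}"]},
  {"Type": "CUSTOM_JAR", "Name": "Disable all tables", "ActionOnFailure": "CONTINUE",
   "Jar": "command-runner.jar", "Args": ["/bin/bash", "${target}"]}
]
EOF
}

# Example usage (placeholder bucket and cluster ID, not run here):
#   steps_json s3://my-bucket/disable_and_terminate.sh > steps.json
#   aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://steps.json
```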

OS and Apache HBase patching

Similar to AMI upgrades on Amazon EC2, the Amazon EMR service team plans for application upgrades with every new Amazon EMR release. This removes OS and Apache HBase patching activities from your team. The latest version of Amazon EMR as of this writing (5.17.0) runs Apache HBase version 1.4.6. Details of each release can be found in the Amazon EMR 5.x release versions documentation.
