Managing the production environment
Operationalization tasks
Node decommissioning
When the YARN ResourceManager gracefully decommissions a node (during a user-initiated shrink operation or after a node failure such as a bad disk), the RegionServer first closes its regions and then shuts down. You can also gracefully decommission a RegionServer on any active node by stopping the daemon manually, which may be necessary when troubleshooting a particular RegionServer in the cluster.
sudo stop hbase-regionserver
During shutdown, the RegionServer's znode expires. The HMaster notices this event and treats that RegionServer as a crashed server. The HMaster then reassigns the regions that the RegionServer used to serve to other online RegionServers. Depending on the prefetch settings, each RegionServer that takes over a region then warms its cache for that region.
Rolling restart
A rolling restart restarts the HMaster process on the master node and the HRegionServer process on all of the core nodes.
Before you begin, check for any inconsistencies and make sure that the HBase balancer is turned off so that it does not interfere with region deployments.
Use the HBase shell to disable the balancer:
hbase(main):001:0> balance_switch false
true
0 row(s) in 0.2970 seconds
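The balancer can also be switched without an interactive session by piping a command into the HBase shell, the same pattern the table-disabling script later in this section uses for "list". The following is a minimal sketch; the balancer_command helper is a hypothetical name introduced here for illustration. Note that balance_switch returns the previous balancer state, which is why the sample output above prints "true" when the balancer is being turned off.

```shell
#!/bin/bash
# Hypothetical helper: print the HBase shell command that turns the
# balancer on or off. It only prints text; nothing is sent to HBase here.
balancer_command() {
  case "$1" in
    on)  echo "balance_switch true" ;;
    off) echo "balance_switch false" ;;
    *)   echo "usage: balancer_command on|off" >&2; return 1 ;;
  esac
}

# Example (not run here; assumes the hbase CLI is on the PATH):
#   balancer_command off | hbase shell   # before the rolling restart
#   balancer_command on  | hbase shell   # afterwards, if the balancer is still off
```

Keeping the command text in one helper makes it easy to reuse in both the pre-restart and post-restart parts of a maintenance script.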
The following is a sample script that performs a rolling restart on an Apache HBase cluster. Run this script on the Amazon EMR master node, which must hold the Amazon EC2 key pair file (.pem extension) used to log in to the Amazon EMR core nodes.
#!/bin/bash
sudo stop hbase-master; sudo start hbase-master
for node in $(yarn node -list | grep -i ip- | cut -f2 -d: | cut -f2 -d'G' | xargs) ; do
  ssh -i ~/hadoop.pem -t -o "StrictHostKeyChecking no" hadoop@$node "sudo stop hbase-regionserver;sudo start hbase-regionserver"
done
#Restart HMaster again to clear out dead servers list and reenable the balancer
sudo stop hbase-master; sudo start hbase-master
#Run hbck utility to make sure HBase is consistent
hbase hbck
Cluster resize
You can add nodes to or remove nodes from an HBase on Amazon S3 cluster by performing a resize operation on the cluster. If an automatic scaling policy is set based on a specific CloudWatch metric (such as IsIdle), the resize operation happens according to that policy. All of these operations are performed gracefully.
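A manual resize can be requested from the AWS CLI with aws emr modify-instance-groups. The sketch below is illustrative only: the resize_instance_group function, the AWS_CMD override (used so the command can be previewed with echo), and the j-EXAMPLE and ig-EXAMPLE identifiers are assumptions introduced here, not values from this paper.

```shell
#!/bin/bash
# Hypothetical helper: request a new instance count for one instance group.
# Set AWS_CMD=echo to print the arguments instead of calling the AWS CLI.
resize_instance_group() {
  local cluster_id="$1" group_id="$2" count="$3"
  # Reject an empty or non-numeric instance count before calling the CLI.
  case "$count" in
    ''|*[!0-9]*) echo "instance count must be a non-negative integer" >&2; return 1 ;;
  esac
  ${AWS_CMD:-aws} emr modify-instance-groups \
    --cluster-id "$cluster_id" \
    --instance-groups "InstanceGroupId=${group_id},InstanceCount=${count}"
}

# Example (placeholder IDs; preview only):
#   AWS_CMD=echo resize_instance_group j-EXAMPLE ig-EXAMPLE 5
```

Previewing the command first is a cheap way to confirm the instance group ID and target count before the resize is actually submitted.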
Backup and restore
With HBase on Amazon S3, you can still take snapshots of your tables every few hours (and delete them after a few days) so that a point-in-time recovery option is available to you. Refer to the "Running the balancer for specific periods to minimize the impact of region movements on snapshots" section of this document.
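As a rough illustration of periodic snapshots, the sketch below generates HBase shell snapshot commands with a timestamp in each snapshot name. The snapshot_commands function, the naming scheme, and the example table names are assumptions made for this example. Colons in namespaced table names are replaced with underscores because snapshot names cannot contain a colon.

```shell
#!/bin/bash
# Hypothetical helper: print one HBase shell "snapshot" command per table.
# Usage: snapshot_commands <timestamp> <table> [<table> ...]
snapshot_commands() {
  local stamp="$1"
  shift
  local table safe
  for table in "$@"; do
    # Replace ':' (namespace separator) so the snapshot name stays valid.
    safe=$(printf '%s' "$table" | tr ':' '_')
    printf "snapshot '%s', '%s-%s'\n" "$table" "$safe" "$stamp"
  done
}

# Example (not run here; assumes the hbase CLI is on the PATH and that
# the tables "customers" and "orders" exist):
#   snapshot_commands "$(date +%Y%m%d%H%M)" customers orders | hbase shell
```

A scheduler such as cron could run this every few hours. Old snapshots can then be reviewed with list_snapshots and removed with delete_snapshot in the HBase shell.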
Cluster ending without data loss
If you want to end the current cluster and build a new one on the same Amazon S3 root directory, we recommend that you first disable all of the tables in the current cluster. Disabling the tables ensures that all data that has not yet been written to Amazon S3 is flushed from the MemStore cache to Amazon S3 in the form of new store files. The script below does this by calling an existing script (/usr/lib/hbase/bin/disable_all_tables.sh), waits until every table reports as disabled, and then terminates the cluster.
#!/bin/bash
clusterID=$(cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId")
#call disable_all_tables.sh
bash /usr/lib/hbase/bin/disable_all_tables.sh
#Store the output of "list" command in a temp file
echo "list" | hbase shell > tableListSummary.txt
#fetch only the list of tables and store it in another temp file
tail -1 tableListSummary.txt | tr ',' '\n' | tr -d '"' | tr -d '[' | tr -d ']' | tr -d ' ' > tableList.txt
#prepare for iteration
while true; do
  while read line; do
    flag="N"
    echo "is_enabled '$line'" | hbase shell > bool.txt
    bool=$(tail -3 bool.txt | head -1)
    if [ "$bool" = "true" ]; then
      flag="Y"
      break
    fi
  done < tableList.txt
  echo "flag: $flag"
  if [ "$flag" = "N" ]; then
    aws emr terminate-clusters --cluster-ids "$clusterID"
    break
  else
    echo "Tables aren't disabled yet. Sleeping for 5 seconds to try again"
  fi
  sleep 5
done
#cleanup temporary files
rm tableListSummary.txt tableList.txt bool.txt
You can save the preceding script in a file named disable_and_terminate.sh. Note that the script does not exist on the instance. You can add an Amazon EMR step that first copies the script to the instance, and then add a second step that runs it to disable the tables and end the cluster. To run the script, you can use the following Amazon EMR step properties.
Name="Disable all tables",Jar="command-runner.jar",Args=["/bin/bash","/home/hadoop/disable_and_terminate.sh"]
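The copy-then-run sequence described above could be submitted from the AWS CLI with aws emr add-steps. The sketch below only builds the two step definitions as strings. The copy_step and run_step function names, the "Copy script" step name, the S3 location, and the cluster ID are placeholders assumed for illustration, and it presumes the script was uploaded to Amazon S3 beforehand.

```shell
#!/bin/bash
# Hypothetical helper: step definition that copies the script from
# Amazon S3 to /home/hadoop/ on the instance.
copy_step() {
  local script_uri="$1"
  printf 'Type=CUSTOM_JAR,Name="Copy script",Jar="command-runner.jar",Args=["aws","s3","cp","%s","/home/hadoop/"]\n' "$script_uri"
}

# Hypothetical helper: step definition that runs the copied script,
# using the step properties shown in the text above.
run_step() {
  printf 'Type=CUSTOM_JAR,Name="Disable all tables",Jar="command-runner.jar",Args=["/bin/bash","/home/hadoop/disable_and_terminate.sh"]\n'
}

# Example (not run here; placeholder cluster ID and bucket):
#   aws emr add-steps --cluster-id j-EXAMPLE \
#     --steps "$(copy_step s3://amzn-s3-demo-bucket/disable_and_terminate.sh)" "$(run_step)"
```

Listing the copy step before the run step matters here, because the second step expects the script to already be present under /home/hadoop/.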
OS and Apache HBase patching
Similar to AMI upgrades on Amazon EC2, the Amazon EMR service team plans for application upgrades with every new Amazon EMR version release. This removes any OS and Apache HBase patching activities from your team. The latest version of Amazon EMR (5.17.0 as of this paper) runs Apache HBase version 1.4.6. Details of each Amazon EMR version release can be found on Amazon EMR 5.x release versions.