
Simulating a cluster network failure

Description — Simulate a network failure to test the cluster's behavior in a potential split-brain situation.

Run node — Can be run on any node. In this test case, it is run on node B (sechana).

Run steps:

  • Drop all traffic coming from and going to the subnet of the secondary node using the following command. This ensures that traffic is stopped on both the primary and the secondary ring. A verification sketch follows the example output.

    iptables -A INPUT -s <<Subnet_CIDR>> -j DROP; iptables -A OUTPUT -d <<Subnet_CIDR>> -j DROP

    sechana:~ # crm status
    Stack: corosync
    Current DC: prihana (version 1.1.18+20180430.b12c320f5-3.24.1-b12c320f5) - partition with quorum
    Last updated: Fri Jan 22 02:16:28 2021
    Last change: Fri Jan 22 02:16:27 2021 by root via crm_attribute on sechana

    2 nodes configured
    6 resources configured

    Online: [ prihana sechana ]

    Full list of resources:

     res_AWS_STONITH        (stonith:external/ec2): Started prihana
     res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started sechana
     Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
         Started: [ prihana sechana ]
     Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
         Masters: [ prihana ]
         Slaves: [ sechana ]

    sechana:~ # iptables -A INPUT -s 11.0.1.132 -j DROP; iptables -A OUTPUT -d 11.0.1.132 -j DROP
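    Before checking the cluster behavior, you can confirm that the isolation is in place by listing the DROP rules on the node where they were added. A minimal verification sketch (the grep filter is an illustrative choice, not part of the original test):

    # List the DROP rules appended to the INPUT and OUTPUT chains
    sechana:~ # iptables -L INPUT -n --line-numbers | grep DROP
    sechana:~ # iptables -L OUTPUT -n --line-numbers | grep DROP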

Expected result:

  • The cluster detects the network failure and fences node 1. It promotes the secondary SAP HANA database (on node 2) to take over as primary without entering a split-brain situation. A monitoring sketch follows the example output.

    sechana:~ # crm status
    Stack: corosync
    Current DC: prihana (version 1.1.18+20180430.b12c320f5-3.24.1-b12c320f5) - partition with quorum
    Last updated: Fri Jan 22 17:08:09 2021
    Last change: Fri Jan 22 17:07:46 2021 by root via crm_attribute on sechana

    2 nodes configured
    6 resources configured

    Online: [ prihana sechana ]

    Full list of resources:

     res_AWS_STONITH        (stonith:external/ec2): Started prihana
     res_AWS_IP     (ocf::suse:aws-vpc-move-ip):    Started sechana
     Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
         rsc_SAPHanaTopology_HDB_HDB00      (ocf::suse:SAPHanaTopology):    Started prihana (Monitoring)
         Started: [ sechana ]
     Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
         Masters: [ sechana ]
         Stopped: [ prihana ]

    Failed Actions:
    * rsc_SAPHanaTopology_HDB_HDB00_monitor_10000 on prihana 'unknown error' (1): call=317, status=Timed Out, exitreason='',
        last-rc-change='Fri Jan 22 16:58:19 2021', queued=0ms, exec=300001ms
    * rsc_SAPHana_HDB_HDB00_start_0 on prihana 'unknown error' (1): call=28, status=Timed Out, exitreason='',
        last-rc-change='Fri Jan 22 02:40:38 2021', queued=0ms, exec=3600001ms
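    While the test runs, the takeover can be followed from the surviving node. One possible way to watch it, assuming the standard SUSE tooling is available (crm_mon ships with Pacemaker; SAPHanaSR-showAttr is shipped with the SAPHanaSR package):

    # One-shot cluster status including inactive resources, refreshed every 5 seconds
    sechana:~ # watch -n 5 'crm_mon -1r'
    # System replication attributes maintained by the SAPHana resource agents
    sechana:~ # SAPHanaSR-showAttr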

Recovery procedure:

  • Clean up the cluster “failed actions”; a sketch of the recovery commands follows.
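
    A minimal sketch of the recovery, assuming the two DROP rules added for this test are the only custom iptables rules on the node (if other rules exist, remove the test rules individually with iptables -D instead of flushing):

    # Remove the iptables rules so cluster traffic can flow again (flushes ALL rules on the node)
    sechana:~ # iptables -F
    # Clear the recorded failures so the resources can run on prihana again
    sechana:~ # crm resource cleanup rsc_SAPHana_HDB_HDB00 prihana
    sechana:~ # crm resource cleanup rsc_SAPHanaTopology_HDB_HDB00 prihana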