Simulating a cluster network failure - SAP HANA on AWS

Description: Simulate a network failure to test how the cluster behaves in a split-brain scenario.

Run node: Any node. In this test case, the steps are run on node B.

Run steps:

  • Drop all the traffic coming from and going to node A with the following command:

    iptables -A INPUT -s <<Primary IP address of Node A>> -j DROP; iptables -A OUTPUT -d <<Primary IP address of Node A>> -j DROP

    [root@sechana ~]# pcs status
    Cluster name: rhelhanaha
    Stack: corosync
    Current DC: prihana (version 1.1.19-8.el7_6.5-c3c624ea3d) - partition with quorum
    Last updated: Fri Jan 22 14:45:24 2021
    Last change: Fri Jan 22 14:45:11 2021 by hacluster via crmd on sechana

    2 nodes configured
    6 resources configured

    Online: [ prihana sechana ]

    Full list of resources:

     clusterfence (stonith:fence_aws): Started prihana
     Clone Set: SAPHanaTopology_DRL_00-clone [SAPHanaTopology_DRL_00]
         Started: [ prihana sechana ]
     Master/Slave Set: SAPHana_DRL_00-master [SAPHana_DRL_00]
         Masters: [ prihana ]
         Slaves: [ sechana ]
     hana-oip (ocf::heartbeat:aws-vpc-move-ip): Started prihana

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
    [root@sechana ~]# iptables -A INPUT -s xxx.xxx.xxx.xxx -j DROP; iptables -A OUTPUT -d xxx.xxx.xxx.xxx -j DROP
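The two drop rules above could be wrapped in a small helper so that the same address is used for both chains and the commands can be reviewed before they are applied. This is only a sketch: the names `NODE_A_IP`, `APPLY`, and `drop_rules` are illustrative assumptions, and `192.0.2.10` is a documentation placeholder, not an address from the test environment.

```shell
#!/usr/bin/env bash
# Sketch: isolate this node (node B) from node A.
# NODE_A_IP, APPLY and drop_rules are hypothetical names, not part of the original procedure.

NODE_A_IP="${NODE_A_IP:-192.0.2.10}"   # placeholder; replace with the primary IP address of node A
APPLY="${APPLY:-0}"                    # 0 = only print the commands, 1 = run them (requires root)

# Print the two iptables commands that drop traffic from and to the given address.
drop_rules() {
    local ip="$1"
    printf 'iptables -A INPUT -s %s -j DROP\n' "$ip"
    printf 'iptables -A OUTPUT -d %s -j DROP\n' "$ip"
}

if [ "$APPLY" = "1" ]; then
    # Feed the generated commands to a shell so they are executed in order.
    drop_rules "$NODE_A_IP" | sh
else
    # Dry run: show what would be executed.
    drop_rules "$NODE_A_IP"
fi
```

Running the script without `APPLY=1` prints the commands only, which makes it easier to confirm the address before cutting the node off.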

Expected result:

  • The cluster detects the network failure and fences node A. It then promotes the secondary SAP HANA database (on node B) to primary, so the cluster does not enter a split-brain situation.

    [root@sechana ~]# pcs status
    Cluster name: rhelhanaha
    Stack: corosync
    Current DC: sechana (version 1.1.19-8.el7_6.5-c3c624ea3d) - partition with quorum
    Last updated: Fri Jan 22 15:11:43 2021
    Last change: Fri Jan 22 15:10:48 2021 by root via crm_attribute on sechana

    2 nodes configured
    6 resources configured

    Online: [ sechana ]
    OFFLINE: [ prihana ]

    Full list of resources:

     clusterfence (stonith:fence_aws): Started sechana
     Clone Set: SAPHanaTopology_DRL_00-clone [SAPHanaTopology_DRL_00]
         Started: [ sechana ]
         Stopped: [ prihana ]
     Master/Slave Set: SAPHana_DRL_00-master [SAPHana_DRL_00]
         Masters: [ sechana ]
         Stopped: [ prihana ]
     hana-oip (ocf::heartbeat:aws-vpc-move-ip): Started sechana

    Failed Actions:
    * clusterfence_monitor_60000 on sechana 'unknown error' (1): call=-1, status=Timed Out, exitreason='', last-rc-change='Fri Jan 22 14:59:14 2021', queued=0ms, exec=0ms

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
    [root@sechana ~]#
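The expected result can also be checked from a script rather than by reading the status output. The sketch below assumes the `pcs status` layout shown in this test case, where node lists appear as `LABEL: [ node ... ]`; other pcs versions may format these lines differently. The function name `status_lists_node` is a hypothetical helper, and the node names `prihana` and `sechana` are taken from the example output.

```shell
#!/usr/bin/env bash
# Sketch: confirm from "pcs status" text that the failover happened.
# status_lists_node is an illustrative helper, not a pcs feature.

# Succeeds if NODE appears on a line containing "LABEL:" in the text read from stdin.
status_lists_node() {
    local label="$1" node="$2"
    awk -v label="$label:" -v node="$node" '
        index($0, label) {
            gsub(/[][]/, " ")                       # replace the list brackets with spaces
            for (i = 1; i <= NF; i++) if ($i == node) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Only query the cluster when pcs is actually installed on this host.
if command -v pcs >/dev/null 2>&1; then
    status="$(pcs status 2>/dev/null)"
    if printf '%s\n' "$status" | status_lists_node OFFLINE prihana &&
       printf '%s\n' "$status" | status_lists_node Masters sechana; then
        echo "prihana is offline and sechana is the master"
    else
        echo "failover not (yet) visible in pcs status"
    fi
fi
```

Fencing and promotion take some time, so a check like this would typically be repeated until it succeeds or a timeout is reached.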

Recovery procedure:

  • Clean up the cluster “failed actions”.
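A possible recovery sequence on node B is sketched below. It rests on two assumptions that the original text does not spell out: that the drop rules added during the test are still present and should be deleted with the matching `iptables -D` commands, and that `pcs resource cleanup` is used to clear the failed actions. `NODE_A_IP`, `APPLY`, and `recovery_commands` are hypothetical names, and `192.0.2.10` is a placeholder address.

```shell
#!/usr/bin/env bash
# Sketch: undo the simulated network failure and clear failed actions.
# NODE_A_IP, APPLY and recovery_commands are hypothetical names.

NODE_A_IP="${NODE_A_IP:-192.0.2.10}"   # placeholder; must match the address used in the test
APPLY="${APPLY:-0}"                    # 0 = only print the commands, 1 = run them (requires root)

# Print the commands that delete the two drop rules and clean up the cluster.
recovery_commands() {
    local ip="$1"
    printf 'iptables -D INPUT -s %s -j DROP\n' "$ip"    # delete the rule added with -A INPUT
    printf 'iptables -D OUTPUT -d %s -j DROP\n' "$ip"   # delete the rule added with -A OUTPUT
    printf 'pcs resource cleanup\n'                      # clear the recorded failed actions
}

if [ "$APPLY" = "1" ]; then
    # Feed the generated commands to a shell so they are executed in order.
    recovery_commands "$NODE_A_IP" | sh
else
    # Dry run: show what would be executed.
    recovery_commands "$NODE_A_IP"
fi
```

After the cleanup, `pcs status` should no longer list the `clusterfence_monitor_60000` entry under Failed Actions. Whether node A rejoins as the new secondary without further steps depends on the cluster configuration, which is outside what this sketch covers.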