Simulating a cluster network failure
Description: Simulate a network failure to test how the cluster behaves in a split-brain scenario.
Run node: Can be run on any node. In this test case, this is done on node B.
Run steps:
- Drop all traffic coming from and going to node A with the following command:
iptables -A INPUT -s <<Primary IP address of Node A>> -j DROP; iptables -A OUTPUT -d <<Primary IP address of Node A>> -j DROP
[root@sechana ~]# pcs status
Cluster name: rhelhanaha
Stack: corosync
Current DC: prihana (version 1.1.19-8.el7_6.5-c3c624ea3d) - partition with quorum
Last updated: Fri Jan 22 14:45:24 2021
Last change: Fri Jan 22 14:45:11 2021 by hacluster via crmd on sechana

2 nodes configured
6 resources configured

Online: [ prihana sechana ]

Full list of resources:

 clusterfence (stonith:fence_aws): Started prihana
 Clone Set: SAPHanaTopology_DRL_00-clone [SAPHanaTopology_DRL_00]
     Started: [ prihana sechana ]
 Master/Slave Set: SAPHana_DRL_00-master [SAPHana_DRL_00]
     Masters: [ prihana ]
     Slaves: [ sechana ]
 hana-oip (ocf::heartbeat:aws-vpc-move-ip): Started prihana

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@sechana ~]# iptables -A INPUT -s xxx.xxx.xxx.xxx -j DROP; iptables -A OUTPUT -d xxx.xxx.xxx.xxx -j DROP
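The run step above can be wrapped in a small helper so that the two rules are always built from the same address. This is a minimal sketch, not part of the original procedure: the function names `peer_rules` and `apply_rules` and the `DRY_RUN` variable are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set. Passing `-D` instead of `-A` emits the matching delete commands, which is an assumption about how the rules would be removed afterwards.

```shell
#!/usr/bin/env bash
# Sketch: build the iptables DROP rules for one peer node.
# peer_rules, apply_rules and DRY_RUN are illustrative names (assumptions).

peer_rules() {
    # $1 = iptables action: -A (append rule) or -D (delete rule)
    # $2 = primary IP address of the peer node (node A in this test case)
    local action="$1" peer_ip="$2"
    echo "iptables ${action} INPUT -s ${peer_ip} -j DROP"
    echo "iptables ${action} OUTPUT -d ${peer_ip} -j DROP"
}

apply_rules() {
    # Prints the commands by default; runs them only when DRY_RUN=0.
    local action="$1" peer_ip="$2" cmd
    peer_rules "$action" "$peer_ip" | while read -r cmd; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: ${cmd}"
        else
            $cmd    # requires root privileges
        fi
    done
}
```

For example, `apply_rules -A 192.0.2.10` (a placeholder address) shows the two commands that would be executed on node B; running it with `DRY_RUN=0` as root applies them.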
Expected result:
- The cluster detects the network failure and fences node A. The cluster promotes the secondary SAP HANA database (on node B) to take over as primary without entering a split-brain situation.
[root@sechana ~]# pcs status
Cluster name: rhelhanaha
Stack: corosync
Current DC: sechana (version 1.1.19-8.el7_6.5-c3c624ea3d) - partition with quorum
Last updated: Fri Jan 22 15:11:43 2021
Last change: Fri Jan 22 15:10:48 2021 by root via crm_attribute on sechana

2 nodes configured
6 resources configured

Online: [ sechana ]
OFFLINE: [ prihana ]

Full list of resources:

 clusterfence (stonith:fence_aws): Started sechana
 Clone Set: SAPHanaTopology_DRL_00-clone [SAPHanaTopology_DRL_00]
     Started: [ sechana ]
     Stopped: [ prihana ]
 Master/Slave Set: SAPHana_DRL_00-master [SAPHana_DRL_00]
     Masters: [ sechana ]
     Stopped: [ prihana ]
 hana-oip (ocf::heartbeat:aws-vpc-move-ip): Started sechana

Failed Actions:
* clusterfence_monitor_60000 on sechana 'unknown error' (1): call=-1, status=Timed Out, exitreason='',
    last-rc-change='Fri Jan 22 14:59:14 2021', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@sechana ~]#
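The expected result can also be checked mechanically against the status text. The following is a rough sketch under stated assumptions: `check_takeover` is a hypothetical helper, and it matches only the `OFFLINE:` and `Masters:` labels exactly as they appear in the output above, so other pcs versions that print different labels would need adjusted patterns.

```shell
#!/usr/bin/env bash
# Sketch: confirm from `pcs status` text (read on stdin) that one node is
# offline and the other now holds the master role.
# check_takeover is an illustrative name (assumption).

check_takeover() {
    # $1 = node expected to be fenced, $2 = node expected to be promoted
    local fenced="$1" promoted="$2" status
    status="$(cat)"

    if ! printf '%s\n' "$status" | grep -Eq "OFFLINE: \[[^]]*${fenced}[^]]*\]"; then
        echo "node ${fenced} is not listed as OFFLINE" >&2
        return 1
    fi
    if ! printf '%s\n' "$status" | grep -Eq "Masters: \[[^]]*${promoted}[^]]*\]"; then
        echo "node ${promoted} is not listed under Masters" >&2
        return 1
    fi
    echo "takeover confirmed: ${fenced} offline, ${promoted} promoted"
}
```

With the host names from this test case, the check would be invoked as `pcs status | check_takeover prihana sechana`.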
Recovery procedure:
- Clean up the cluster "failed actions".
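One way to perform this cleanup is `pcs resource cleanup`, optionally followed by a resource ID. The original text does not name the command, so treat this as an assumption rather than the documented step. In the sketch below, `has_failed_actions`, `cleanup_failed_actions`, and `DRY_RUN` are hypothetical names, and the command is only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch: detect a "Failed Actions:" section in `pcs status` text and
# build the cleanup command. Names below are illustrative (assumptions).

has_failed_actions() {
    # Reads `pcs status` text on stdin; succeeds if a failed-actions
    # section is present.
    grep -q '^[[:space:]]*Failed Actions:'
}

cleanup_failed_actions() {
    # $1 = optional resource ID (for example clusterfence); when omitted,
    # the command is issued without a resource ID.
    local resource="${1:-}" cmd="pcs resource cleanup"
    if [ -n "$resource" ]; then
        cmd="${cmd} ${resource}"
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: ${cmd}"
    else
        $cmd    # requires root privileges on a cluster node
    fi
}
```

A typical sequence would be `pcs status | has_failed_actions && cleanup_failed_actions clusterfence`, then another `pcs status` to confirm that the failed action is no longer listed.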