(Optional) Test EFA
You can demonstrate EFA-enabled communication between two nodes in a compute node group by
running the fi_pingpong
program, which is included in the EFA software installation.
If this test is successful, it is likely that EFA is configured properly.
To start, you need two running instances in the compute node group. If your compute node
group uses static capacity, there should be already be instances available. For a compute node
group that uses dynamic capacity, you can launch two nodes using the salloc
command.
Here is an example from a cluster with a dynamic node group named hpc7g
associated
with a queue named all
.
% salloc --nodes 2 -p all salloc: Granted job allocation 6 salloc: Waiting for resource configuration ... a few minutes pass ... salloc: Nodes hpc7g-[1-2] are ready for job
Find out the IP address for the two allocated nodes using scontrol
. In the
example that follows, the addresses are 10.3.140.69
for hpc7g-1
and
10.3.132.211
for hpc7g-2
.
% scontrol show nodes hpc7g-[1-2] NodeName=hpc7g-1 Arch=aarch64 CoresPerSocket=1 CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.00 AvailableFeatures=hpc7g ActiveFeatures=hpc7g Gres=(null) NodeAddr=10.3.140.69 NodeHostName=ip-10-3-140-69 Version=23.11.8 OS=Linux 5.10.218-208.862.amzn2.aarch64 #1 SMP Tue Jun 4 16:52:10 UTC 2024 RealMemory=124518 AllocMem=0 FreeMem=110763 Sockets=64 Boards=1 State=IDLE+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=efa BootTime=2024-07-02T19:00:09 SlurmdStartTime=2024-07-08T19:33:25 LastBusyTime=2024-07-08T19:33:25 ResumeAfterTime=None CfgTRES=cpu=64,mem=124518M,billing=64 AllocTRES= CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a Reason=Maintain Minimum Number Of Instances [root@2024-07-02T18:59:00] InstanceId=i-04927897a9ce3c143 InstanceType=hpc7g.16xlarge NodeName=hpc7g-2 Arch=aarch64 CoresPerSocket=1 CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.00 AvailableFeatures=hpc7g ActiveFeatures=hpc7g Gres=(null) NodeAddr=10.3.132.211 NodeHostName=ip-10-3-132-211 Version=23.11.8 OS=Linux 5.10.218-208.862.amzn2.aarch64 #1 SMP Tue Jun 4 16:52:10 UTC 2024 RealMemory=124518 AllocMem=0 FreeMem=110759 Sockets=64 Boards=1 State=IDLE+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=efa BootTime=2024-07-02T19:00:09 SlurmdStartTime=2024-07-08T19:33:25 LastBusyTime=2024-07-08T19:33:25 ResumeAfterTime=None CfgTRES=cpu=64,mem=124518M,billing=64 AllocTRES= CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a Reason=Maintain Minimum Number Of Instances [root@2024-07-02T18:59:00] InstanceId=i-0a2c82623cb1393a7 InstanceType=hpc7g.16xlarge
Connect to one of the nodes (in this example case, hpc7g-1
) using SSH (or SSM).
Note that this is an internal IP address, so you may need to connect from one of your login nodes
if you use SSH. Also be aware that the instance needs to be configured with an SSH key by way of
the compute node group launch template.
% ssh ec2-user@10.3.140.69
Now, launch fi_pingpong
in server mode.
/opt/amazon/efa/bin/fi_pingpong -p efa
Connect to the second instance (hpc7g-2
).
% ssh ec2-user@10.3.132.211
Run fi_pingpong
in client mode, connecting to the server on
hpc7g-1
. You should see output that resembles the example below.
% /opt/amazon/efa/bin/fi_pingpong -p efa 10.3.140.69 bytes #sent #ack total time MB/sec usec/xfer Mxfers/sec 64 10 =10 1.2k 0.00s 3.08 20.75 0.05 256 10 =10 5k 0.00s 21.24 12.05 0.08 1k 10 =10 20k 0.00s 82.91 12.35 0.08 4k 10 =10 80k 0.00s 311.48 13.15 0.08 [error] util/pingpong.c:1876: fi_close (-22) fid 0