Trying to run a job
The following section provides possible troubleshooting solutions if you run into issues while trying to run a job.
srun interactive job fails with error srun: error: fwd_tree_thread: can't find address for <host>, check slurm.conf
Why did it fail?
You ran the srun command to submit a job, and then you increased the size of a queue by using the pcluster update-cluster command without restarting the Slurm daemons after the update completed. Slurm organizes its daemons in a tree hierarchy to optimize communication, and this hierarchy is only updated when the daemons start.
Suppose you use srun to launch a job and then run the pcluster update-cluster command to increase the size of the queue. New compute nodes launch as part of the update, and Slurm queues your job to one of them. In this case, neither the Slurm daemons nor srun detect the new compute nodes, so srun returns an error.
How to resolve?
Restart the Slurm daemons on all of the compute nodes, and then use srun to submit your job. You can schedule the restart of the Slurm daemons by running the scontrol reboot command, which restarts the compute nodes. For more information, see scontrol reboot in the Slurm documentation. You can also manually restart the Slurm daemons on the compute nodes by requesting a restart of the corresponding systemd services.
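As a rough sketch of both options, where the node list below is a placeholder rather than a value from your cluster:
# Schedule a reboot so that slurmd restarts with the updated tree hierarchy.
$ scontrol reboot queue1-dy-compute-resource-1-[1-10]
# Or, on a compute node, restart the Slurm compute daemon in place through its systemd service.
$ sudo systemctl restart slurmd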
Job is stuck in CF state with squeue command
This might be an issue with dynamic nodes powering up. For more information, see Seeing errors in compute node initializations.
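A minimal check, assuming a hypothetical job ID of 42, is to confirm which node the job is waiting on and whether that node is still powering up:
# Show the job state and the node (or reason) it is waiting on.
$ squeue -j 42 -o "%i %T %R"
# List node states; nodes that are still powering up carry a # suffix (for example, alloc#).
$ sinfo -N -l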
Running large-scale jobs and seeing nfsd: too many open connections, consider increasing the number of threads in /var/log/messages
With a networked file system, when network limits are reached, I/O wait time also increases. This can result in soft lockups, because the same network bandwidth carries both regular network traffic and file system I/O.
With 5th generation instances, we use the ENA driver to expose packet counters. These counters count the packets shaped by AWS when the network reaches instance bandwidth limits. You can check these counters to see if they are greater than 0. If they are, then you have exceeded your bandwidth limits. You can view these counters by running ethtool -S eth0 | grep exceeded.
Exceeding network limits is often a result of supporting too many NFS connections. This is one of the first things to check when you reach or exceed network limits.
For example, the following output shows dropped packets:
$ ethtool -S eth0 | grep exceeded
     bw_in_allowance_exceeded: 38750610
     bw_out_allowance_exceeded: 1165693
     pps_allowance_exceeded: 103
     conntrack_allowance_exceeded: 0
     linklocal_allowance_exceeded: 0
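To see whether NFS traffic from the compute fleet is contributing, you can count the established connections to the NFS port on the head node and check how many nfsd threads are configured. This is a sketch; adjust the port or interface to your setup.
# Count established NFS (port 2049) connections served by the head node.
$ sudo ss -Htn state established '( sport = :2049 )' | wc -l
# Show the number of nfsd threads currently configured.
$ cat /proc/fs/nfsd/threads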
To avoid getting this message, consider changing the head node instance type to a more performant instance type. Consider moving your data storage to shared storage file systems that aren't exported as an NFS share, such as Amazon EFS or Amazon FSx. For more information, see Shared storage and the Best Practices.
Running an MPI job
Enabling debug mode
To enable OpenMPI debug mode, see What controls does Open MPI have that aid in debugging.
To enable IntelMPI debug mode, see Other Environment Variables. A sketch of the corresponding settings follows below.
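As a sketch of what those references describe, assuming your job already uses the matching MPI module and a hypothetical binary named my_app; the verbosity levels are illustrative:
# Intel MPI: raise the library debug level through an environment variable.
$ export I_MPI_DEBUG=5
# Open MPI: increase verbosity for selected frameworks with MCA parameters.
$ mpirun --mca plm_base_verbose 10 --mca btl_base_verbose 30 ./my_app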
Seeing MPI_ERRORS_ARE_FATAL and OPAL ERROR in the job output
These error codes come from the MPI layer in your application. To learn how to get MPI debug logs from your application, see Enabling debug mode.
A possible cause for this error is that your application has been compiled for a specific MPI implementation, such as OpenMPI, and you are trying to run it with a different MPI implementation, such as IntelMPI. Make sure you are both compiling and running your application with the same MPI implementation.
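One quick way to spot a mismatch, assuming a dynamically linked binary named my_app (hypothetical), is to check which MPI library the binary links against and then load the matching module before running it. Module names vary by environment.
# The library path indicates which implementation the binary was built with
# (for example, an Open MPI or an Intel MPI installation directory).
$ ldd ./my_app | grep -i libmpi
# Load the same implementation before launching.
$ module load intelmpi
$ mpirun ./my_app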
Using mpirun with managed DNS disabled
For clusters created with SlurmSettings / Dns / DisableManagedDns and UseEc2Hostnames set to true, the Slurm node name isn't resolved by the DNS. Slurm can still bootstrap MPI processes when node names aren't resolvable, provided the MPI job is run in a Slurm context. We recommend following the guidance in the Slurm MPI User's Guide to run MPI jobs with Slurm.