Trying to run a job - AWS ParallelCluster

Trying to run a job

srun interactive job fails with error srun: error: fwd_tree_thread: can't find address for <host>, check slurm.conf

  • Why did it fail?

    You ran the srun command to submit a job, and then you increased the size of a queue by using the pcluster update-cluster command without restarting the Slurm daemons after the update completed.

    Slurm organizes Slurm daemons in a tree hierarchy to optimize communication. This hierarchy is only updated when the daemons start.

    Suppose you use srun to launch a job and then run the pcluster update-cluster command to increase the size of the queue. New compute nodes launch as part of the update. Then, Slurm queues your job to one of the new compute nodes. Because the hierarchy hasn't been refreshed, neither the Slurm daemons nor srun detects the new compute nodes, so srun returns an error.

  • How to resolve?

    Restart the Slurm daemons on all of the compute nodes, and then use srun to submit your job. You can schedule the Slurm daemons restart by running the scontrol reboot command that restarts the compute nodes. For more information, see scontrol reboot in the Slurm documentation. You can also manually restart the Slurm daemons on the compute nodes by requesting a restart of the corresponding systemd services.
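The two restart options above can be sketched as follows. This is a minimal illustration, not a definitive procedure: the node list is hypothetical, the `slurmd` service name is an assumption about how the compute nodes run the Slurm daemon, and the `run` helper is introduced here only so that the commands are printed for review by default.

```shell
# Dry-run helper (an assumption of this sketch): prints each command
# unless DRY_RUN=0 is set, in which case the command is executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical node list; replace it with the nodes in your resized queue.
NODES="queue1-st-c5xlarge-[1-4]"

# Option 1 (from the head node): ask Slurm to reboot the nodes when they
# become idle and return them to service afterwards.
run scontrol reboot nextstate=RESUME "$NODES"

# Option 2 (on each compute node): restart only the Slurm daemon through
# systemd. The service name slurmd is assumed.
run sudo systemctl restart slurmd
```

Set `DRY_RUN=0` to execute the commands instead of printing them, and resubmit the job with srun only after the daemons are back up.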

Job is stuck in CF state with squeue command

This might be an issue with dynamic nodes powering up. For more information, see Seeing errors in compute node initializations.

Running large scale jobs and seeing nfsd: too many open connections, consider increasing the number of threads in /var/log/messages

With a networked file system, when network limits are reached, I/O wait time also increases. This can result in soft lockups because the network is used to write data for both networking and I/O metrics.

With 5th generation instances, we use the ENA driver to expose packet counters. These counters count the packets shaped by AWS when the network reaches instance bandwidth limits. You can check these counters to see if they are greater than 0. If they are, then you have exceeded your bandwidth limits. You can view these counters by running ethtool -S eth0 | grep exceeded.

Exceeding network limits is often a result of supporting too many NFS connections. This is one of the first things to check when you reach or exceed network limits.

For example, the following output shows dropped packets:

$ ethtool -S eth0 | grep exceeded
     bw_in_allowance_exceeded: 38750610
     bw_out_allowance_exceeded: 1165693
     pps_allowance_exceeded: 103
     conntrack_allowance_exceeded: 0
     linklocal_allowance_exceeded: 0
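To make the "greater than 0" check explicit, the counters can be filtered so that only the exceeded allowances remain. The following is a small sketch: the function name `exceeded_counters` is hypothetical, and it assumes one `name: value` pair per line, as ethtool prints them.

```shell
# Print every "*_allowance_exceeded" counter whose value is greater than 0.
# Reads `ethtool -S <interface>` output on standard input.
exceeded_counters() {
  awk -F': *' '/_allowance_exceeded/ {
    gsub(/^[ \t]+/, "", $1)          # strip the leading indentation
    if ($2 + 0 > 0) print $1 "=" $2  # keep only counters above zero
  }'
}

# Usage on an instance (eth0 is the interface name used in this section):
#   ethtool -S eth0 | exceeded_counters
```

With the sample output shown above, the function would report the bw_in, bw_out, and pps counters and skip the two that are 0.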

To avoid getting this message, consider changing the head node instance type to a more performant instance type. Also consider moving your data storage to shared storage file systems that aren't exported as an NFS share, such as Amazon EFS or Amazon FSx. For more information, see Shared storage and the Best Practices at the AWS ParallelCluster Wiki on GitHub.

Running an MPI job

Enabling debug mode

To enable OpenMPI debug mode, see What controls does Open MPI have that aid in debugging.

To enable IntelMPI debug mode, see Other Environment Variables.
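As a starting point, debug output can be turned on through environment variables before the job is launched. The sketch below is only one possibility: the verbosity levels are arbitrary choices for illustration, and the linked pages describe the full set of controls.

```shell
# Open MPI: MCA parameters can be set through OMPI_MCA_<name> environment
# variables. This raises the verbosity of the byte transfer layer (BTL).
export OMPI_MCA_btl_base_verbose=100

# Intel MPI: I_MPI_DEBUG sets the debug output level.
export I_MPI_DEBUG=5

# Then launch the application as usual, for example:
#   mpirun -n 4 ./my_mpi_app    (my_mpi_app is a placeholder)
```

Only the variable that matches the MPI implementation in use has any effect, so setting both is harmless.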

Seeing MPI_ERRORS_ARE_FATAL and OPAL ERROR in the job output

These error codes come from the MPI layer in your application. To learn how to get MPI debug logs from your application, see Enabling debug mode.

A possible cause for this error is that your application has been compiled for a specific MPI implementation, such as OpenMPI, and you are trying to run it with a different MPI implementation, such as IntelMPI. Make sure you are both compiling and running your application with the same MPI implementation.
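One way to spot such a mismatch is to compare the launcher on your PATH with the MPI library that the binary links against. The helper below is a rough sketch: the name `mpi_flavor` is hypothetical, and the matched strings are assumptions based on the version banners that Open MPI and Intel MPI typically print.

```shell
# Classify the output of `mpirun --version` read from standard input.
mpi_flavor() {
  case "$(cat)" in
    *"Open MPI"*)     echo "openmpi" ;;
    *"Intel(R) MPI"*) echo "intelmpi" ;;
    *)                echo "unknown" ;;
  esac
}

# Usage (my_mpi_app is a placeholder for your binary):
#   mpirun --version | mpi_flavor        # launcher that will run the job
#   ldd ./my_mpi_app | grep -i libmpi    # MPI library the binary links to
```

If the launcher and the linked library belong to different implementations, rebuild the application or switch the launcher so that both match.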

Using mpirun with managed DNS disabled

For clusters created with SlurmSettings / Dns / DisableManagedDns and UseEc2Hostnames set to true, Slurm node names aren't resolved by DNS. Slurm can still bootstrap MPI processes when node names can't be resolved, provided that the MPI job runs in a Slurm context. We recommend following the guidance in the Slurm MPI User's Guide to run MPI jobs with Slurm.
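Running in a Slurm context usually means letting srun start the MPI ranks inside an allocation. The following sketch writes a minimal batch script. The file name, job name, task count, and `./my_mpi_app` are placeholders, and `--mpi=pmix` assumes that both Slurm and the MPI library were built with PMIx support.

```shell
# Write a minimal batch script (all names below are placeholders).
cat > mpi_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --ntasks=4

# srun starts the MPI ranks inside the Slurm allocation, so Slurm
# bootstraps the processes itself.
# `srun --mpi=list` shows the MPI plugin types that are available.
srun --mpi=pmix ./my_mpi_app
EOF

# Submit it from the head node with:
#   sbatch mpi_job.sh
```

The Slurm MPI User's Guide describes the launch options that apply to each MPI implementation, so adjust the srun line to match the one you use.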