Slurm cluster protected mode
When a cluster runs with protected mode enabled, AWS ParallelCluster monitors and tracks compute node bootstrap failures as the compute nodes are being launched. It does this to detect whether these failures are occurring continuously.
If the following is detected in a queue (partition), the cluster enters protected status:
-
Consecutive compute node bootstrap failures occur continuously with no successful compute node launches.
-
The failure count reaches a predefined threshold.
After the cluster enters protected status, AWS ParallelCluster disables queues with failures at or above the predefined threshold.
Slurm cluster protected mode was added in AWS ParallelCluster version 3.0.0.
You can use protected mode to reduce the time and resources spent on compute node bootstrap failure cycling.
Protected mode parameter
protected_failure_count
protected_failure_count
specifies the number of consecutive failures in a queue (partition) that activate cluster protected status.
The default protected_failure_count
is 10 and protected mode is enabled.
If protected_failure_count
is greater than zero, protected mode is enabled.
If protected_failure_count
is less than or equal to zero, protected mode is disabled.
You can change the protected_failure_count
value by adding the parameter in the clustermgtd
config file that's
located at /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
in the HeadNode
.
You can update this parameter anytime and you don't need to stop the compute fleet to do so. If a launch succeeds in a queue before the
failure count reaches protected_failure_count
, the failure count is reset to zero.
Check cluster status in protected status
When a cluster is in protected status, you can check the compute fleet status and node states.
Compute fleet status
The status of the compute fleet is PROTECTED
in a cluster running in protected status.
$
pcluster describe-compute-fleet --cluster-name <cluster-name>
--region <region-id>
{
"status": "PROTECTED",
"lastStatusUpdatedTime": "2022-04-22T00:31:24.000Z"
}
Node status
To learn which queues (partitions) have bootstrap failures that have activated protected status, log in to the cluster and run the
sinfo
command. Partitions with bootstrap failures at or above protected_failure_count
are in the INACTIVE
state. Partitions without bootstrap failures at or above protected_failure_count
are in the UP
state and work as
expected.
PROTECTED
status doesn't impact running jobs. If jobs are running on a partition
with bootstrap failures at or above protected_failure_count
, the partition is set to INACTIVE
after the running jobs complete.
Consider the node states shown in the following example.
$
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST queue1* inact infinite 10 down% queue1-dy-c5xlarge-[1-10] queue1* inact infinite 3490 idle~ queue1-dy-c5xlarge-[11-3500] queue2 up infinite 10 idle~ queue2-dy-c5xlarge-[1-10]
Partition queue1
is INACTIVE
because 10 consecutive compute node bootstrap failures were
detected.
Instances behind nodes queue1-dy-c5xlarge-[1-10]
launched but failed to join the cluster because of an unhealthy status.
The cluster is in protected status.
Partition queue2
isn't impacted by the bootstrap failures in queue1
.
It's in the UP
state and can still run jobs.
How to deactivate protected status
After the bootstrap error has been resolved, you can run the following command to take the cluster out of protected status.
$
pcluster update-compute-fleet --cluster-name<cluster-name>
\ --region<region-id>
\ --status START_REQUESTED
Bootstrap failures that activate protected status
Bootstrap errors that activate protected status are subdivided into the following three types. To identify the type and issue, you can check if AWS ParallelCluster generated logs. If logs were generated, you can check them for error details. For more information, see Retrieving and preserving logs.
-
Bootstrap error that causes an instance to self-terminate.
An instance fails early in the bootstrap process, such as an instance that self-terminates because of errors in the SlurmQueues \ CustomActions \ OnNodeStart | OnNodeConfigured script.
For dynamic nodes, look for errors similar to the following:
Node bootstrap error: Node ... is in power up state without valid backing instance
For static nodes, look in the
clustermgtd
log (/var/log/parallelcluster/clustermgtd
) for errors similar to the following:Node bootstrap error: Node ... is in power up state without valid backing instance
-
Nodes
resume_timeout
ornode_replacement_timeout
expires.An instance can't join the cluster within the
resume_timeout
(for dynamic nodes) ornode_replacement_timeout
(for static nodes). It doesn't self-terminate before the timeout. For example, networking isn't set up correctly for the cluster and the node is set to theDOWN
state by Slurm after the timeout expires.For dynamic nodes, look for errors similar to the following:
Node bootstrap error: Resume timeout expires for node
For static nodes, look in the
clustermgtd
log (/var/log/parallelcluster/clustermgtd
) for errors similar to the following:Node bootstrap error: Replacement timeout expires for node ... in replacement.
-
Nodes fail health check.
An instance behind the node fails an Amazon EC2 health check or scheduled event health check, and the nodes are treated as bootstrap failure nodes. In this case, the instance terminates for a reason outside the control of AWS ParallelCluster.
Look in the
clustermgtd
log (/var/log/parallelcluster/clustermgtd
) for errors similar to the following:Node bootstrap error: Node %s failed during bootstrap when performing health check.
-
Compute nodes fail Slurm registration.
The registration of the
slurmd
daemon with the Slurm control daemon (slurmctld
) fails and causes the compute node state to change to theINVALID_REG
state. Incorrectly configured Slurm compute nodes can cause this error, such as computed nodes configured with CustomSlurmSettings compute node specification errors.Look in the
slurmctld
log file (/var/log/slurmctld.log
) on the head node, or look in theslurmd
log file (/var/log/slurmd.log
) of the failed compute node for errors similar to the following:Setting node %s to INVAL with reason: ...
How to debug protected mode
If your cluster is in protected status, and if AWS ParallelCluster generated clustermgtd
logs from the HeadNode
and
the cloud-init-output
logs from problematic compute nodes, then you can check the logs for error details. For more information
about how to retrieve logs, see Retrieving and preserving logs.
clustermgtd
log(/var/log/parallelcluster/clustermgtd
) on the head node
Log messages show which partitions have bootstrap failures and the corresponding bootstrap failure count.
[slurm_plugin.clustermgtd:_handle_protected_mode_process] - INFO - Partitions bootstrap failure count: {'queue1': 2}, cluster will be set into protected mode if protected failure count reach threshold.
In the clustermgtd
log, search for Found the following bootstrap failure nodes
to find which node failed to
bootstrap.
[slurm_plugin.clustermgtd:_handle_protected_mode_process] - WARNING - Found the following bootstrap failure nodes: (x2) ['queue1-st-c5large-1(192.168.110.155)', 'broken-st-c5large-2(192.168.65.215)']
In the clustermgtd
log, search for Node bootstrap error
to find the reason for the failure.
[slurm_plugin.clustermgtd:_is_node_bootstrap_failure] - WARNING - Node bootstrap error: Node broken-st-c5large-2(192.168.65.215) is currently in replacement and no backing instance
cloud-init-output
log(/var/log/cloud-init-output.log
) on the compute nodes
After obtaining the bootstrap failure node private IP address in the clustermgtd
log, you can find the corresponding compute
node log by either logging into the compute node or by following the guidance in Retrieving and preserving logs to retrieve logs. In most cases, the /var/log/cloud-init-output
log from the
problematic node shows the step that caused the compute node bootstrap failure.