Troubleshooting scaling issues
This section is relevant to clusters that were installed using AWS ParallelCluster version 3.0.0 and later with the Slurm job scheduler. For more information about configuring multiple queues, see Configuration of multiple queues.
If one of your running clusters is experiencing issues, place the cluster in a STOPPED state by running the following command before you begin to troubleshoot. This prevents incurring any unexpected costs.

$ pcluster update-compute-fleet --cluster-name mycluster --status STOP_REQUESTED
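The stop request isn't instantaneous. To confirm that the compute fleet has actually reached the STOPPED state before you continue, you can check its status; this is a minimal sketch that assumes the same cluster name as above:

$ pcluster describe-compute-fleet --cluster-name mycluster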
You can list the log streams available from the cluster nodes by using the pcluster list-cluster-log-streams command and filtering by the private-dns-name of one of the failing nodes or the head node:

$ pcluster list-cluster-log-streams --cluster-name mycluster --region eu-west-1 \
    --filters 'Name=private-dns-name,Values=ip-10-0-0-101'
Then, you can retrieve the content of the log stream to analyze it by using the pcluster get-cluster-log-events command and passing the --log-stream-name corresponding to one of the key logs mentioned in the following section:

$ pcluster get-cluster-log-events --cluster-name mycluster --region eu-west-1 \
    --log-stream-name ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init
AWS ParallelCluster creates cluster CloudWatch log streams in log groups. You can view these logs in the CloudWatch console Custom Dashboards or Log groups. For more information, see Integration with Amazon CloudWatch Logs and Amazon CloudWatch dashboard.
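Because these log streams live in CloudWatch Logs, you can also inspect them directly with the AWS CLI. This is a hedged sketch: the log group name shown is only illustrative, since the actual group name is generated for your cluster. Look it up in the CloudWatch console before running the command:

$ aws logs describe-log-streams --log-group-name /aws/parallelcluster/mycluster-202301010000 --region eu-west-1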
Topics
- Key logs for debugging
- Seeing InsufficientInstanceCapacity error in slurm_resume.log when I fail to run a job, or in clustermgtd.log when I fail to create a cluster
- Troubleshooting node initialization issues
- Troubleshooting unexpected node replacements and terminations
- Replacing, terminating, or powering down problematic instances and nodes
- Queue (partition) Inactive status
- Troubleshooting other known node and job issues
Key logs for debugging
The following table provides an overview of the key logs for the head node:
| Log | Description |
| --- | --- |
| /var/log/cfn-init.log | This is the AWS CloudFormation init log. It contains all commands that were run when an instance was set up. Use it to troubleshoot initialization issues. |
| /var/log/chef-client.log | This is the Chef client log. It contains all commands that were run through Chef/CINC. Use it to troubleshoot initialization issues. |
| /var/log/parallelcluster/slurm_resume.log | This is the ResumeProgram log. ResumeProgram launches instances for dynamic nodes. Use it to troubleshoot dynamic node launch issues. |
| /var/log/parallelcluster/slurm_suspend.log | This is the SuspendProgram log. SuspendProgram is called when instances are terminated for dynamic nodes. Use it to troubleshoot dynamic node termination issues. When you check this log, you should also check the clustermgtd log. |
| /var/log/parallelcluster/clustermgtd | This is the clustermgtd log. clustermgtd runs as the centralized daemon that manages most cluster operation actions. Use it to troubleshoot any launch, termination, or cluster operation issues. |
| /var/log/slurmctld.log | This is the Slurm control daemon log. AWS ParallelCluster doesn't make scaling decisions; it only attempts to launch resources to satisfy the Slurm requirements. This log is useful for scaling and allocation issues, job-related issues, and any scheduler-related launch and termination issues. |
| /var/log/parallelcluster/compute_console_output | This log records the console output from a sample subset of static compute nodes that have unexpectedly terminated. Use it if static compute nodes terminate and the compute node logs aren't available in CloudWatch. The compute_console_output log content is the same as what you receive when you use the Amazon EC2 console or AWS CLI to retrieve the instance console output. |
These are the key logs for the compute nodes:
| Log | Description |
| --- | --- |
| /var/log/cloud-init-output.log | This is the cloud-init log. It contains all commands that were run when an instance was set up. Use it to troubleshoot initialization issues. |
| /var/log/parallelcluster/computemgtd | This is the computemgtd log. computemgtd runs on each compute node to monitor the node in the uncommon event that the clustermgtd daemon on the head node is offline. Use it to troubleshoot unexpected termination issues. |
| /var/log/slurmd.log | This is the Slurm compute daemon log. Use it to troubleshoot initialization and compute failure issues. |
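If you have SSH access to a node, a quick way to scan the relevant logs for problems is to search them for error-level messages. This is a minimal sketch; which log files to pass depends on the node you're on (head node logs shown here):

$ sudo grep -iE 'error|warning|fail' /var/log/parallelcluster/clustermgtd /var/log/parallelcluster/slurm_resume.log /var/log/slurmctld.log | tail -n 50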
Seeing InsufficientInstanceCapacity error in slurm_resume.log when I fail to run a job, or in clustermgtd.log when I fail to create a cluster
If your cluster uses the Slurm scheduler, you might be experiencing an insufficient capacity issue. If there aren't enough instances available when an instance launch request is made, an InsufficientInstanceCapacity error is returned.
For static instance capacity, you can find the error in the clustermgtd log at /var/log/parallelcluster/clustermgtd.
For dynamic instance capacity, you can find the error in the ResumeProgram log at /var/log/parallelcluster/slurm_resume.log.
The message looks similar to the following example:
An error occurred (InsufficientInstanceCapacity) when calling the RunInstances/CreateFleet operation...
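To confirm which of the two components hit the capacity error, you can search both head node logs for it; this is a minimal sketch using the paths listed above:

$ sudo grep -i InsufficientInstanceCapacity /var/log/parallelcluster/clustermgtd /var/log/parallelcluster/slurm_resume.log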
Based on your use case, consider using one of the following methods to avoid getting these types of error messages:
- Disable the placement group if it's enabled. For more information, see Placement groups and instance launch issues.
- Reserve capacity for the instances and launch them with ODCR (On-Demand Capacity Reservations). For more information, see Launch instances with On-Demand Capacity Reservations (ODCR).
- Configure multiple compute resources with different instance types. If your workload doesn't require a specific instance type, you can leverage fast insufficient capacity fail-over with multiple compute resources. For more information, see Slurm cluster fast insufficient capacity fail-over.
- Configure multiple instance types in the same compute resource, and leverage multiple instance type allocation. For more information about configuring multiple instances, see Multiple instance type allocation with Slurm and Scheduling / SlurmQueues / ComputeResources / Instances.
- Move the queue to a different Availability Zone by changing the subnet ID in the cluster configuration Scheduling / SlurmQueues / Networking / SubnetIds.
- If your workload isn't tightly coupled, span the queue across different Availability Zones. For more information about configuring multiple subnets, see Scheduling / SlurmQueues / Networking / SubnetIds. A configuration sketch that combines two of these options follows this list.
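The following cluster configuration excerpt is a minimal sketch of two of these options combined: one compute resource that lists several instance types, in a queue whose subnets span two Availability Zones. The queue name, compute resource name, instance types, and subnet IDs are placeholders, not values from your cluster.

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      Networking:
        SubnetIds:                 # subnets in different Availability Zones
          - subnet-aaaa1111
          - subnet-bbbb2222
      ComputeResources:
        - Name: flexible-cr
          Instances:               # multiple instance type allocation
            - InstanceType: c5.xlarge
            - InstanceType: c5a.xlarge
            - InstanceType: m5.xlarge
          MinCount: 0
          MaxCount: 10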
Troubleshooting node initialization issues
This section covers how you can troubleshoot node initialization issues. This includes issues where the node fails to launch, power up, or join a cluster.
Topics
- Head node
- Compute nodes
Head node
Applicable logs:
- /var/log/cfn-init.log
- /var/log/chef-client.log
- /var/log/parallelcluster/clustermgtd
- /var/log/parallelcluster/slurm_resume.log
- /var/log/slurmctld.log
Check the /var/log/cfn-init.log and /var/log/chef-client.log logs or the corresponding log streams. These logs contain all the actions that were run when the head node was set up. Most errors that occur during setup should have error messages located in the /var/log/chef-client.log log. If OnNodeStart or OnNodeConfigured scripts are specified in the cluster configuration, double check that the scripts ran successfully by looking at the related log messages.
When a cluster is created, the head node must wait for the compute nodes to join the cluster before it can complete cluster creation. Because of this, if the compute nodes fail to join the cluster, then the head node also fails. You can follow one of these sets of procedures, depending on the type of compute nodes you use, to troubleshoot this type of issue:
Compute nodes
Applicable logs:
- /var/log/cloud-init-output.log
- /var/log/slurmd.log

- If a compute node is launched, first check /var/log/cloud-init-output.log, which should contain setup logs similar to the /var/log/chef-client.log log on the head node. Most errors that occur during setup should have error messages located in the /var/log/cloud-init-output.log log. If pre-install or post-install scripts are specified in the cluster configuration, check that they ran successfully. If the node is no longer reachable, you might be able to retrieve this log from its CloudWatch log stream instead; see the example after this list.
- If you're using a custom AMI with modifications to the Slurm configuration, then there might be a Slurm-related error that prevents the compute node from joining the cluster. For scheduler-related errors, check the /var/log/slurmd.log log.
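The following is a minimal sketch of retrieving that log for a node that has already gone away, assuming the cloud-init-output log was shipped to CloudWatch. The stream name is a placeholder that follows the pattern from the earlier get-cluster-log-events example (private DNS name, instance ID, log name); use pcluster list-cluster-log-streams to find the exact name for your node:

$ pcluster get-cluster-log-events --cluster-name mycluster --region eu-west-1 \
    --log-stream-name ip-10-0-0-101.i-0123456789abcdef0.cloud-init-output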
Dynamic compute nodes:
- Search the ResumeProgram log (/var/log/parallelcluster/slurm_resume.log) for your compute node name to see if ResumeProgram was ever called with the node. (If ResumeProgram was never called, you can check the slurmctld log (/var/log/slurmctld.log) to determine if Slurm ever tried to call ResumeProgram with the node.) A search example follows this list.
- Note that incorrect permissions for ResumeProgram might cause it to fail silently. If you're using a custom AMI with modifications to the ResumeProgram setup, check that ResumeProgram is owned by the slurm user and has the 744 (rwxr--r--) permission.
- If ResumeProgram is called, check to see if an instance is launched for the node. If no instance was launched, you can see an error message that describes the launch failure.
- If the instance is launched, then there might have been a problem during the setup process. You should see the corresponding private IP address and instance ID in the ResumeProgram log. Moreover, you can look at the corresponding setup logs for the specific instance. For more information about troubleshooting a setup error with a compute node, see the next section.
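This is a minimal sketch of those first two checks, run on the head node. The node name is a placeholder, and the second command reads the ResumeProgram path from the Slurm configuration so the path doesn't have to be hard-coded:

$ sudo grep 'queue1-dy-compute-1' /var/log/parallelcluster/slurm_resume.log /var/log/slurmctld.log
$ ls -l $(scontrol show config | awk '/^ResumeProgram/ {print $3}')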
Static compute nodes:
- Check the clustermgtd log (/var/log/parallelcluster/clustermgtd) to see if instances were launched for the node. If they weren't launched, there should be a clear error message detailing the launch failure.
- If an instance is launched, there might have been an issue during the setup process. You should see the corresponding private IP address and instance ID in the clustermgtd log. Moreover, you can look at the corresponding setup logs for the specific instance.
Compute nodes backed by Spot Instances:
- If it's the first time you use Spot Instances and the job remains in a pending (PD) state, double check the /var/log/parallelcluster/slurm_resume.log file. You'll probably find an error like the following:

  2022-05-20 13:06:24,796 - [slurm_plugin.common:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['spot-dy-t2micro-2']: An error occurred (AuthFailure.ServiceLinkedRoleCreationNotPermitted) when calling the RunInstances operation: The provided credentials do not have permission to create the service-linked role for Amazon EC2 Spot Instances.

  When using Spot Instances, an AWSServiceRoleForEC2Spot service-linked role must exist in your account. To create this role in your account using the AWS CLI, run the following command:

  $ aws iam create-service-linked-role --aws-service-name spot.amazonaws.com

  For more information, see Working with Spot Instances in the AWS ParallelCluster User Guide and Service-linked role for Spot Instance requests in the Amazon EC2 User Guide. A quick check for whether the role already exists follows this list.
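To check whether the role already exists in your account before trying to create it, a minimal sketch:

$ aws iam get-role --role-name AWSServiceRoleForEC2Spot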
Troubleshooting unexpected node replacements and terminations
This section continues to explore how you can troubleshoot node-related issues, specifically when a node is replaced or terminated unexpectedly.
Applicable logs:
- /var/log/parallelcluster/clustermgtd (head node)
- /var/log/slurmctld.log (head node)
- /var/log/parallelcluster/computemgtd (compute node)
Nodes replaced or terminated unexpectedly
- Check the clustermgtd log (/var/log/parallelcluster/clustermgtd) to see if clustermgtd replaced or terminated a node. Note that clustermgtd handles all normal node maintenance actions.
- If clustermgtd replaced or terminated the node, there should be a message detailing why this action was taken on the node. If the reason is scheduler related (for example, because the node is in DOWN), check the slurmctld log for more information. If the reason is Amazon EC2 related, there should be an informative message detailing the Amazon EC2 related issue that required the replacement. The example after this list shows how to query Amazon EC2 for the termination reason of a specific instance.
- If clustermgtd didn't terminate the node, first check if this was an expected termination by Amazon EC2, more specifically a Spot Instance termination. computemgtd, which runs on each compute node, can also terminate a node if clustermgtd is determined to be unhealthy. Check the computemgtd log (/var/log/parallelcluster/computemgtd) to see if computemgtd terminated the node.
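If you have the instance ID of the affected node (from the clustermgtd log), you can also ask Amazon EC2 directly why the instance stopped or terminated. This is a minimal sketch with a placeholder instance ID:

$ aws ec2 describe-instances --instance-ids i-1234567890abcdef0 \
    --query 'Reservations[].Instances[].[InstanceId,State.Name,StateReason.Message]' --output text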
Nodes failed
- Check the slurmctld log (/var/log/slurmctld.log) to see why a job or a node failed. Note that jobs are automatically requeued if a node fails.
- If slurm_resume reports that the node is launched and clustermgtd reports after several minutes that there's no corresponding instance in Amazon EC2 for that node, the node might have failed during setup. To retrieve the log from a compute node (/var/log/cloud-init-output.log), do the following steps:
  - Submit a job to let Slurm spin up a new node.
  - Wait for the compute node to start.
  - Modify the instance-initiated shutdown behavior so that a failing compute node is stopped rather than terminated.
    $ aws ec2 modify-instance-attribute --instance-id i-1234567890abcdef0 \
        --instance-initiated-shutdown-behavior "{\"Value\": \"stop\"}"
  - Enable termination protection.
    $ aws ec2 modify-instance-attribute --instance-id i-1234567890abcdef0 \
        --disable-api-termination
  - Tag the node to make it easy to identify.
    $ aws ec2 create-tags --resources i-1234567890abcdef0 \
        --tags Key=Name,Value=QUARANTINED-Compute
  - Detach the node from the cluster by changing the parallelcluster:cluster-name tag.
    $ aws ec2 create-tags --resources i-1234567890abcdef0 \
        --tags Key=parallelcluster:cluster-name,Value=QUARANTINED-ClusterName
  - Retrieve the console output from the node with this command.
    $ aws ec2 get-console-output --instance-id i-1234567890abcdef0 --output text
Replacing, terminating, or powering down problematic instances and nodes
Applicable logs:
- /var/log/parallelcluster/clustermgtd (head node)
- /var/log/parallelcluster/slurm_suspend.log (head node)

- In most cases, clustermgtd handles all expected instance termination actions. Check the clustermgtd log to see why it failed to replace or terminate a node.
- For dynamic nodes failing SlurmSettings properties, check the SuspendProgram log to see if SuspendProgram was called by slurmctld with the specific node as an argument. Note that SuspendProgram doesn't actually perform any action. Rather, it only logs when it's called. All instance termination and NodeAddr resets are done by clustermgtd. Slurm puts nodes back into a POWER_SAVING state after SuspendTimeout automatically.
- If compute nodes are failing continuously due to bootstrap failures, verify whether they are being launched with Slurm cluster protected mode enabled. If protected mode isn't enabled, modify the protected mode settings to enable it. Troubleshoot and fix the bootstrap script.
Queue (partition) Inactive status
If you run sinfo and the output shows queues with AVAIL status of inact, your cluster might have Slurm cluster protected mode enabled and the queue has been set to the INACTIVE state for a pre-defined period of time.
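As a minimal sketch, the check and the usual recovery look like the following. The assumption here is that, after you have fixed the root cause of the bootstrap failures, re-activating the compute fleet returns the queue to service:

$ sinfo
$ pcluster update-compute-fleet --cluster-name mycluster --region eu-west-1 --status START_REQUESTED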
Troubleshooting other known node and job issues
Another type of known issue is that AWS ParallelCluster might fail to allocate jobs or make scaling decisions. With this type of issue, AWS ParallelCluster only launches, terminates, or maintains resources according to Slurm instructions. For these issues, check the slurmctld log to troubleshoot them.
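When a job appears stuck, the Slurm client commands on the head node show what the scheduler is waiting for, which usually narrows down where to look in the slurmctld log. A minimal sketch; the node name is a placeholder:

$ squeue
$ scontrol show node queue1-dy-compute-1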