Retrieving and preserving logs
AWS ParallelCluster creates Amazon EC2 metrics for the HeadNode and Compute instances and for storage. You can view the metrics in the CloudWatch console under Custom Dashboards. AWS ParallelCluster also creates cluster CloudWatch log streams in log groups. You can view these logs in the CloudWatch console under Custom Dashboards or Log groups. The Monitoring section of the cluster configuration describes how you can modify the cluster CloudWatch logs and dashboard. For more information, see Integration with Amazon CloudWatch Logs and Amazon CloudWatch dashboard.
Logs are a useful resource for troubleshooting issues. For example, before you delete a failing cluster, consider creating an archive of its logs. Follow the steps in Archive logs to create an archive.
Cluster logs unavailable in CloudWatch
If cluster logs aren't available in CloudWatch, check to make sure you haven't overwritten the AWS ParallelCluster CloudWatch log configuration when adding custom logs to the configuration.
To add custom logs to the CloudWatch configuration, make sure you append to the configuration rather than fetch and overwrite it. For more information on fetch-config and append-config, see Multiple CloudWatch agent configuration files in the CloudWatch User Guide.
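For example, you can register a custom configuration file with the agent's append-config action instead of fetch-config, so that the AWS ParallelCluster configuration stays in place. The path /opt/aws/custom_cw_config.json is a hypothetical location for your custom configuration file:
$ /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a append-config \
    -m ec2 \
    -c file:/opt/aws/custom_cw_config.json \
    -s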
To restore the AWS ParallelCluster CloudWatch log configuration, you can run the following commands inside an AWS ParallelCluster node:
$ PLATFORM="$(ohai platform | jq -r ".[]")"
$ LOG_GROUP_NAME="$(cat /etc/chef/dna.json | jq -r ".cluster.log_group_name")"
$ SCHEDULER="$(cat /etc/chef/dna.json | jq -r ".cluster.scheduler")"
$ NODE_ROLE="$(cat /etc/chef/dna.json | jq -r ".cluster.node_type")"
$ CONFIG_DATA_PATH="/usr/local/etc/cloudwatch_agent_config.json"
$ /opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/python /usr/local/bin/write_cloudwatch_agent_json.py \
    --platform $PLATFORM --config $CONFIG_DATA_PATH --log-group $LOG_GROUP_NAME \
    --scheduler $SCHEDULER --node-role $NODE_ROLE
$ /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 \
    -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s
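To confirm that the agent restarted with the restored configuration, you can check its run status with the same control script (a quick sanity check):
$ /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status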
Archive logs
You can archive the logs in Amazon S3 or in a local file (depending on the --output-file parameter).
Note
Add permissions to the Amazon S3 bucket policy to grant CloudWatch access. For more information, see Set permissions on an Amazon S3 bucket in the CloudWatch Logs User Guide.
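As a sketch, the bucket policy needs to grant the CloudWatch Logs service principal for the bucket's Region permission to check the bucket ACL and write the exported objects. The bucket name bucketname and Region eu-west-1 below are placeholders that match the examples that follow:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "logs.eu-west-1.amazonaws.com" },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::bucketname"
    },
    {
      "Effect": "Allow",
      "Principal": { "Service": "logs.eu-west-1.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::bucketname/*",
      "Condition": { "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" } }
    }
  ]
}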
$ pcluster export-cluster-logs --cluster-name mycluster --region eu-west-1 \
    --bucket bucketname --bucket-prefix logs
{ "url": "https://bucketname.s3.eu-west-1.amazonaws.com/export-log/mycluster-logs-202109071136.tar.gz?..." }
# use the --output-file parameter to save the logs locally
$ pcluster export-cluster-logs --cluster-name mycluster --region eu-west-1 \
    --bucket bucketname --bucket-prefix logs --output-file /tmp/archive.tar.gz
{ "path": "/tmp/archive.tar.gz" }
The archive contains the Amazon CloudWatch Logs streams and AWS CloudFormation stack events from the head node and compute nodes for the last 14 days, unless specified explicitly in the configuration or in the parameters for the export-cluster-logs command. The time it takes for the command to finish depends on the number of nodes in the cluster and the number of log streams available in CloudWatch Logs. For more information about the available log streams, see Integration with Amazon CloudWatch Logs.
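For example, to limit the export to a specific time window instead of the default 14 days, you can pass a start-time filter (and, analogously, end-time) to export-cluster-logs. This sketch assumes a pcluster 3.x CLI that supports these filters; the timestamp is an ISO 8601 value:
$ pcluster export-cluster-logs --cluster-name mycluster --region eu-west-1 \
    --bucket bucketname --bucket-prefix logs \
    --filters 'Name=start-time,Values=2021-09-01T00:00:00Z'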
Preserved logs
Starting from version 3.0.0, AWS ParallelCluster preserves CloudWatch Logs by default when a cluster is deleted. If you want to delete a cluster and preserve its logs, make sure that Monitoring / Logs / CloudWatch / DeletionPolicy isn't set to Delete in the cluster configuration. Otherwise, change the value for this field to Retain and run the pcluster update-cluster command. Then, run pcluster delete-cluster --cluster-name <cluster_name> to delete the cluster but retain the log group that's stored in Amazon CloudWatch.
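A minimal sketch of the relevant section of the cluster configuration (Enabled is shown for context; DeletionPolicy is the field to change):
Monitoring:
  Logs:
    CloudWatch:
      Enabled: true
      DeletionPolicy: Retain
Then apply the change (cluster-config.yaml is a placeholder for your configuration file):
$ pcluster update-cluster --cluster-name mycluster --region eu-west-1 \
    --cluster-configuration cluster-config.yaml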
Terminated node logs
If a static compute node unexpectedly terminates and CloudWatch has no logs for it, check if AWS ParallelCluster has recorded the console output for that compute node on the head node in the /var/log/parallelcluster/compute_console_output log. For more information, see Key logs for debugging.
If the /var/log/parallelcluster/compute_console_output log isn't available or doesn't contain the output for the node, use the AWS CLI to retrieve the console output from the failed node. Log in to the cluster head node and get the failed node instance-id from the /var/log/parallelcluster/slurm_resume.log file.
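To spot instance IDs in that log quickly, you can match the i- identifier pattern; this is a convenience sketch, and the exact log layout can vary across AWS ParallelCluster versions:
$ grep -oE 'i-[0-9a-f]+' /var/log/parallelcluster/slurm_resume.log | sort -u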
Retrieve the console output by using the following command with the instance-id:
$ aws ec2 get-console-output --instance-id i-abcdef01234567890
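To save the console text to a file for easier reading, you can extract the Output field with the standard --query and --output options (in recent AWS CLI versions this field contains the decoded console text; node-console.txt is a placeholder):
$ aws ec2 get-console-output --instance-id i-abcdef01234567890 \
    --query Output --output text > node-console.txt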
If a dynamic compute node self-terminates after launching and CloudWatch has no logs for it, submit a job that activates a cluster scaling action. Wait for the instance to fail, and then retrieve the instance console log.
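For example, with the Slurm scheduler you can trigger a scale-up with a minimal job; the partition name queue1 is a placeholder for one of your cluster's queues:
$ sbatch -N 1 -p queue1 --wrap "hostname"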
Log in to the cluster head node and get the compute node instance-id from the /var/log/parallelcluster/slurm_resume.log file.
Retrieve the instance console log by using the following command:
$ aws ec2 get-console-output --instance-id i-abcdef01234567890
The console output log can help you debug the root cause of a compute node failure when the compute node log isn't available.