Set up a Grafana monitoring dashboard for AWS ParallelCluster

Created by Dario La Porta (AWS) and William Lu (AWS)

Summary

AWS ParallelCluster helps you deploy and manage high performance computing (HPC) clusters. It supports AWS Batch and Slurm open source job schedulers. Although AWS ParallelCluster is integrated with Amazon CloudWatch for logging and metrics, it doesn't provide a monitoring dashboard for the workload.

The Grafana dashboard for AWS ParallelCluster (GitHub) is a monitoring dashboard for AWS ParallelCluster. It provides job scheduler insights and detailed monitoring metrics at the operating system (OS) level. For more information about the dashboards included in this solution, see Example Dashboards in the GitHub repository. These metrics help you better understand the HPC workload and its performance. However, the dashboard code has not been updated for the latest versions of AWS ParallelCluster or of the open source packages that are used in the solution. This pattern enhances the solution to provide the following benefits:

Supports AWS ParallelCluster v3
Uses the latest version of open source packages, including Prometheus, Grafana, Prometheus Slurm Exporter, and NVIDIA DCGM-Exporter
Increases the number of CPU cores and GPUs that the Slurm jobs use
Adds a job monitoring dashboard
Enhances the GPU node monitoring dashboard for nodes with 4 or 8 graphics processing units (GPUs)

This version of the enhanced solution has been implemented and verified in an AWS customer's HPC production environment.

Prerequisites and limitations

Prerequisites

AWS ParallelCluster CLI, installed and configured.
A supported network configuration for AWS ParallelCluster. This pattern uses the "AWS ParallelCluster using two subnets" configuration, which requires a public subnet, a private subnet, an internet gateway, and a NAT gateway.
All AWS ParallelCluster cluster nodes must have internet access. This is required so that the installation scripts can download the open source software and Docker images.
A key pair in Amazon Elastic Compute Cloud (Amazon EC2). Resources that have this key pair have Secure Shell (SSH) access to the head node.

Limitations

This pattern is designed to support Ubuntu 20.04 LTS. If you're using a different version of Ubuntu or if you use Amazon Linux or CentOS, then you need to modify the scripts provided with this solution. These modifications are not included in this pattern.

Product versions

Ubuntu 20.04 LTS
ParallelCluster 3.X

Billing and cost considerations

The solution deployed in this pattern is not covered by the free tier. Charges apply for Amazon EC2, Amazon FSx for Lustre, the NAT gateway in Amazon VPC, and Amazon Route 53.

Architecture

Target architecture

The following diagram shows how a user can access the monitoring dashboard for AWS ParallelCluster on the head node. The head node runs NICE DCV, Prometheus, Grafana, Prometheus Slurm Exporter, Prometheus Node Exporter, and NGINX Open Source. The compute nodes run Prometheus Node Exporter, and they also run NVIDIA DCGM-Exporter if the node contains GPUs. The head node retrieves information from the compute nodes and displays that data in the Grafana dashboard.

Figure: Accessing the monitoring dashboard for AWS ParallelCluster on the head node.

In most cases, the head node is not heavily loaded because the job scheduler doesn't require a significant amount of CPU or memory. Users access the dashboard on the head node over HTTPS (SSL/TLS) on port 443.

All authorized viewers can anonymously view the monitoring dashboards. Only the Grafana administrator can modify dashboards. You configure a password for the Grafana administrator in the aws-parallelcluster-monitoring/docker-compose/docker-compose.head.yml file.
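
The password change described above can also be scripted. The following is a minimal sketch, not part of the pattern's repository. It assumes that the compose file sets the `GF_SECURITY_ADMIN_PASSWORD` variable on a single line in the form `GF_SECURITY_ADMIN_PASSWORD=<value>` or `GF_SECURITY_ADMIN_PASSWORD: <value>`. The function name `set_grafana_admin_password` and the sample text are hypothetical.

```python
import re

# Default password that this pattern asks you to replace.
DEFAULT_PASSWORD = "Grafana4PC!"

# Assumption: the variable and its value sit on one line, separated by "=" or ":".
PASSWORD_LINE = re.compile(r"(GF_SECURITY_ADMIN_PASSWORD[ \t]*[=:][ \t]*)(\S+)")


def set_grafana_admin_password(compose_text: str, new_password: str) -> str:
    """Return compose_text with the Grafana admin password value replaced."""
    if not new_password or new_password == DEFAULT_PASSWORD:
        raise ValueError("choose a non-empty password that differs from the default")
    # A function replacement avoids backslash interpretation in the new password.
    updated, count = PASSWORD_LINE.subn(lambda m: m.group(1) + new_password, compose_text)
    if count == 0:
        raise ValueError("GF_SECURITY_ADMIN_PASSWORD was not found")
    return updated


# Illustrative fragment only; the real docker-compose.head.yml may be laid out differently.
example = "    environment:\n      - GF_SECURITY_ADMIN_PASSWORD=Grafana4PC!\n"
print(set_grafana_admin_password(example, "use-a-strong-secret-here"))
```

In practice you would read docker-compose.head.yml, pass its contents to the function, and write the result back before you copy the folder to the S3 bucket.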

Tools

AWS services

NICE DCV is a high-performance remote display protocol that helps you deliver remote desktops and application streaming from any cloud or data center to any device, over varying network conditions.
AWS ParallelCluster helps you deploy and manage high performance computing (HPC) clusters. It supports AWS Batch and Slurm open source job schedulers.
Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
Amazon Virtual Private Cloud (Amazon VPC) helps you launch AWS resources into a virtual network that you’ve defined.

Other tools

Docker is a set of platform as a service (PaaS) products that use virtualization at the operating-system level to deliver software in containers.
Grafana is open source software that helps you query, visualize, alert on, and explore metrics, logs, and traces.
NGINX Open Source is an open source web server and reverse proxy.
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center graphics processing units (GPUs) in cluster environments. In this pattern, you use DCGM-Exporter, which helps you export GPU metrics to Prometheus.
Prometheus is an open source system-monitoring toolkit that collects and stores its metrics as time-series data with associated key-value pairs, which are called labels. In this pattern, you also use Prometheus Slurm Exporter to collect and export metrics, and you use Prometheus Node Exporter to export metrics from the compute nodes.
Ubuntu is an open source, Linux-based operating system that is designed for enterprise servers, desktops, cloud environments, and IoT.
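
To make the Prometheus description above concrete, the following sketch parses one sample line of the kind that an exporter such as Prometheus Node Exporter exposes and separates the metric name, the labels, and the value. It is a simplified illustration that assumes a single-line sample without a timestamp and without escaped quotation marks in label values. The function name `parse_sample` is hypothetical and is not part of Prometheus.

```python
import re

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line: str):
    """Split one exposition-format sample into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported sample line: {line!r}")
    name, label_text, value = match.groups()
    # Labels are the key-value pairs that identify one time series of the metric.
    labels = dict(LABEL_RE.findall(label_text or ""))
    return name, labels, float(value)


name, labels, value = parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5')
print(name, labels, value)
```

Each distinct combination of label values, such as a different `cpu` or `mode`, is stored as its own time series, which is what lets the dashboards filter by node, CPU, or GPU.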

Code repository

The code for this pattern is available in the GitHub pcluster-monitoring-dashboard repository.

Epics

Task	Description	Skills required
Create an S3 bucket.	Create an Amazon S3 bucket. You use this bucket to store the configuration scripts. For instructions, see Creating a bucket in the Amazon S3 documentation.	General AWS
Clone the repository.	Clone the GitHub pcluster-monitoring-dashboard repo by running the following command. `git clone https://github.com/aws-samples/parallelcluster-monitoring-dashboard.git`	DevOps engineer
Create an admin password.	Choose the `aws-parallelcluster-monitoring` folder, choose the `docker-compose` folder, and then open the docker-compose.head.yml file. In the `GF_SECURITY_ADMIN_PASSWORD` variable, replace `Grafana4PC!` with a password of your choice. This is the administrative password that you use to manage the Grafana account. Save and close the docker-compose.head.yml file.	Linux Shell scripting
Copy the required files into the S3 bucket.	Copy the post_install.sh script and the aws-parallelcluster-monitoring folder into the S3 bucket you created. For instructions, see Uploading objects in the Amazon S3 documentation.	General AWS
Configure an additional security group for the head node.	Create a security group for the head node. This security group allows inbound traffic to the monitoring dashboards on the head node. For instructions, see Create a security group in the Amazon VPC documentation. Add an inbound rule to the security group. For instructions, see Add rules to a security group in the Amazon VPC documentation. Use the following parameters for the rule: Type – HTTPS; Protocol – TCP; Port range – 443; Source – Enter your IP address; Description – Allow users to access the monitoring dashboard.	AWS administrator
Configure an IAM policy for the head node.	Create an identity-based policy for the head node. This policy allows the node to retrieve metric data from Amazon CloudWatch. The GitHub repo contains an example policy. For instructions, see Creating IAM policies in the AWS Identity and Access Management (IAM) documentation.	AWS administrator
Configure an IAM policy for the compute nodes.	Create an identity-based policy for the compute nodes. This policy allows the node to create the tags that contain the job ID and job owner. The GitHub repo contains an example policy. For instructions, see Creating IAM policies in the IAM documentation. If you use the provided example file, replace the following values: <REGION> – The AWS Region where the cluster is hosted; <ACCOUNT_ID> – The AWS account ID.	AWS administrator
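
The inbound rule for the head node security group can be expressed as data. The following sketch builds a rule with the parameters listed in the table (HTTPS, TCP, port 443, your IP address as the source) in the shape that the Amazon EC2 `AuthorizeSecurityGroupIngress` API uses for an `IpPermissions` entry. The function name `https_ingress_rule` and the sample address are hypothetical, and the sketch only builds the structure; it doesn't call AWS.

```python
import ipaddress

DEFAULT_DESCRIPTION = "Allow users to access the monitoring dashboard"


def https_ingress_rule(source: str, description: str = DEFAULT_DESCRIPTION) -> dict:
    """Build one IpPermissions entry that allows HTTPS (TCP 443) from source."""
    # A bare address such as 203.0.113.10 becomes a single-host CIDR (/32 or /128).
    network = ipaddress.ip_network(source, strict=False)
    rule = {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443}
    if network.version == 4:
        rule["IpRanges"] = [{"CidrIp": str(network), "Description": description}]
    else:
        rule["Ipv6Ranges"] = [{"CidrIpv6": str(network), "Description": description}]
    return rule


# 203.0.113.10 is a documentation-only address; use your own IP address.
print(https_ingress_rule("203.0.113.10"))
```

If you script the setup with an AWS SDK, you can pass the returned dictionary as one element of the `IpPermissions` list. Restricting the source to a single address keeps the dashboards closed to everyone else.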

Task	Description	Skills required
Modify the provided cluster template file.	Create the AWS ParallelCluster cluster. Use the provided cluster.yaml AWS CloudFormation template file as a starting point to create the cluster. Replace the following values in the provided template: <REGION> – The AWS Region where the cluster is hosted. <HEADNODE_SUBNET> – The public subnet of the VPC. <ADDITIONAL_HEAD_NODE_SG> – The name of the security group that you created for the head node. <KEY_NAME> – Enter the name of an existing Amazon EC2 key pair. Resources that have this key pair have Secure Shell (SSH) access to the head node. <ALLOWED_IPS> – Enter the CIDR-formatted IP address range that is allowed to make SSH connections to the head node. <ADDITIONAL_HEAD_NODE_POLICY> – Enter the name of the IAM policy that you created for the head node. <BUCKET_NAME> – Enter the name of the S3 bucket you created. <COMPUTE_SUBNET> – Enter the name of the private subnet in the VPC. <ADDITIONAL_COMPUTE_NODE_POLICY> – Enter the name of the IAM policy that you created for the compute node.	AWS administrator
Create the cluster.	In the AWS ParallelCluster CLI, enter the following command. This deploys the CloudFormation template and creates the cluster. For more information about this command, see pcluster create-cluster in the AWS ParallelCluster documentation. `pcluster create-cluster -n <cluster_name> -c cluster.yaml`	AWS administrator
Monitor the cluster creation.	Enter the following command to monitor the cluster creation. For more information about this command, see pcluster describe-cluster in the AWS ParallelCluster documentation. `pcluster describe-cluster -n <cluster_name>`	AWS administrator
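
Replacing the placeholders by hand is error-prone because a single leftover `<...>` value can cause cluster creation to fail. The following sketch fills placeholders of the form `<NAME>` and raises an error if any placeholder has no value. The function name `fill_placeholders`, the template fragment, and the sample values are hypothetical; the fragment is not the contents of the provided cluster.yaml file.

```python
import re

# Placeholders in this pattern are uppercase names in angle brackets, such as <REGION>.
PLACEHOLDER_RE = re.compile(r"<([A-Z_]+)>")


def fill_placeholders(template: str, values: dict) -> str:
    """Replace every <NAME> placeholder in template with values[NAME]."""
    missing = sorted({name for name in PLACEHOLDER_RE.findall(template) if name not in values})
    if missing:
        raise KeyError("no value provided for: " + ", ".join(missing))
    return PLACEHOLDER_RE.sub(lambda m: values[m.group(1)], template)


# Illustrative fragment only; edit the real cluster.yaml from the repository.
fragment = "Region: <REGION>\nHeadNode:\n  Ssh:\n    KeyName: <KEY_NAME>\n"
print(fill_placeholders(fragment, {"REGION": "us-east-1", "KEY_NAME": "my-key-pair"}))
```

The same approach works for the `<REGION>` and `<ACCOUNT_ID>` values in the example IAM policy for the compute nodes.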

Task	Description	Skills required
Access the Grafana portal.	Enter the following command to retrieve the public IP address of the head node. `pcluster describe-cluster -n <cluster_name> --query headNode.publicIpAddress` In a web browser, navigate to the following URL to access the Grafana dashboard: https://<head_node_public_ip_address> On the Grafana front page, choose the 4-square Dashboard icon on the left menu, and then choose General. This shows a list of configured dashboards. The following dashboards are available in Grafana: Cluster Cost – Contains information about the cost of the cluster; Cluster Logs – Contains information about the logs of the cluster; Compute Node Details – Contains usage statistics for the compute nodes; Compute Node List – Contains the list of the compute nodes in the cluster; GPU Nodes – Contains usage statistics for the GPU nodes; Jobs Details – Contains information about the resource utilization of jobs; Head Node Details – Contains usage statistics for the head node; ParallelCluster Summary – Contains information about cluster usage.	AWS administrator
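
If you run `pcluster describe-cluster` without the `--query` option, the command returns a JSON document. The following sketch reads the `headNode.publicIpAddress` field from that output and builds the dashboard URL. The function name `grafana_url` is hypothetical, and the sample JSON is a trimmed, made-up example that contains only the field used in the table above.

```python
import json


def grafana_url(describe_cluster_output: str) -> str:
    """Build the Grafana URL from the JSON output of pcluster describe-cluster."""
    cluster = json.loads(describe_cluster_output)
    ip_address = (cluster.get("headNode") or {}).get("publicIpAddress")
    if not ip_address:
        # The field is absent if the head node isn't running yet or has no public IP.
        raise ValueError("the output contains no headNode.publicIpAddress")
    # The dashboard is served over HTTPS on the default port 443.
    return f"https://{ip_address}"


# Made-up sample output; 203.0.113.10 is a documentation-only address.
sample_output = '{"headNode": {"publicIpAddress": "203.0.113.10"}}'
print(grafana_url(sample_output))
```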

Task	Description	Skills required
Delete the cluster.	Enter the following command to delete the cluster. For more information about this command, see pcluster delete-cluster in the AWS ParallelCluster documentation. `pcluster delete-cluster -n <cluster_name>`	AWS administrator
Delete the IAM policies.	Delete the policies that you created for the head node and compute node. For more information about deleting policies, see Deleting IAM policies in the IAM documentation.	AWS administrator
Delete the security group and rule.	Delete the security group that you created for the head node. For more information, see Delete security group rules and Delete a security group in the Amazon VPC documentation.	AWS administrator
Delete the S3 bucket.	Delete the S3 bucket that you created to store the configuration scripts. For more information, see Deleting a bucket in the Amazon S3 documentation.	General AWS

Troubleshooting

Issue	Solution
The head node is not accessible in the browser.	Check the security group and confirm that the inbound port 443 is open.
Grafana doesn't open.	On the head node, check the Grafana container log by running `docker logs Grafana`.
Some metrics have no data.	On the head node, check the container logs of all containers.
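
For the first issue in the table, you can confirm from your workstation whether port 443 on the head node accepts connections before you review the security group. The following sketch attempts a TCP connection and reports the result. The function name `port_is_open` is hypothetical. A failed connection doesn't identify the cause: the security group, a network ACL, or a stopped NGINX container can all produce the same result.

```python
import socket


def port_is_open(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `port_is_open("<head_node_public_ip_address>")` returns `False` if your IP address isn't allowed by the inbound rule. If it returns `True` but the page still doesn't load, check the container logs on the head node.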

Related resources

AWS documentation

IAM policies for Amazon EC2

Other AWS resources

AWS ParallelCluster
Monitoring dashboard for AWS ParallelCluster (AWS blog post)
