

# High-performance computing
<a name="highperformancecomputing-pattern-list"></a>

**Topics**
+ [Deploy a Lustre file system for high-performance data processing by using Terraform and DRA](deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra.md)
+ [Set up a Grafana monitoring dashboard for AWS ParallelCluster](set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.md)
+ [More patterns](highperformancecomputing-more-patterns-pattern-list.md)

# Deploy a Lustre file system for high-performance data processing by using Terraform and DRA
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra"></a>

*Arun Bagal and Ishwar Chauthaiwale, Amazon Web Services*

## Summary
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-summary"></a>

This pattern automatically deploys a Lustre file system on AWS and integrates it with Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

This solution helps you quickly set up a high performance computing (HPC) environment with integrated storage, compute resources, and Amazon S3 data access. It combines Lustre's storage capabilities with the flexible compute options provided by Amazon EC2 and the scalable object storage in Amazon S3, so you can tackle data-intensive workloads in machine learning, HPC, and big data analytics.

The pattern uses a HashiCorp Terraform module and Amazon FSx for Lustre to streamline the following process:
+ Provisioning a Lustre file system
+ Establishing a data repository association (DRA) between FSx for Lustre and an S3 bucket to link the Lustre file system with Amazon S3 objects
+ Creating an EC2 instance
+ Mounting the Lustre file system with the Amazon S3-linked DRA on the EC2 instance

The benefits of this solution include:
+ Modular design. You can easily maintain and update the individual components of this solution.
+ Scalability. You can quickly deploy consistent environments across AWS accounts or Regions.
+ Flexibility. You can customize the deployment to fit your specific needs.
+ Best practices. This pattern uses preconfigured modules that follow AWS best practices.

For more information about Lustre file systems, see the [Lustre website](https://www.lustre.org/).

## Prerequisites and limitations
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ A least-privilege AWS Identity and Access Management (IAM) policy (see [instructions](https://aws.amazon.com/blogs/security/techniques-for-writing-least-privilege-iam-policies/))

**Limitations**

FSx for Lustre limits each Lustre file system to a single Availability Zone, which could be a concern if you have high availability requirements. If the Availability Zone that contains the file system fails, access to the file system is lost until it recovers. To mitigate this, you can use a DRA to link the Lustre file system with Amazon S3, so that your data is stored durably in Amazon S3 and can be loaded into a new file system in another Availability Zone.
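
If you manage the association outside of the Terraform module, you can also create a DRA with the AWS CLI. The following sketch uses hypothetical IDs, paths, and bucket names; the auto-import and auto-export policies shown keep the file system and the S3 bucket in sync in both directions.

```shell
# All IDs, paths, and bucket names below are hypothetical placeholders.
aws fsx create-data-repository-association \
  --file-system-id fs-0123456789abcdef0 \
  --file-system-path /data \
  --data-repository-path s3://my-lustre-demo-bucket \
  --s3 "AutoImportPolicy={Events=[NEW,CHANGED,DELETED]},AutoExportPolicy={Events=[NEW,CHANGED,DELETED]}"
```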

**Product versions**
+ [Terraform version 1.9.3 or later](https://developer.hashicorp.com/terraform/install?product_intent=terraform)
+ [HashiCorp AWS Provider version 4.0.0 or later](https://registry.terraform.io/providers/hashicorp/aws/latest)

## Architecture
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-architecture"></a>

The following diagram shows the architecture for FSx for Lustre and complementary AWS services in the AWS Cloud.

![FSx for Lustre deployment with AWS KMS, Amazon EC2, Amazon CloudWatch Logs, and Amazon S3.](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/51d38589-e752-42cd-9f46-59c3c8d0bfd3/images/c1c21952-fd6f-4b1d-9bf8-09b2f4f4459f.png)


The architecture includes the following:
+ An S3 bucket is used as a durable, scalable, and cost-effective storage location for data. The integration between FSx for Lustre and Amazon S3 provides a high-performance file system that is seamlessly linked with Amazon S3.
+ FSx for Lustre runs and manages the Lustre file system.
+ Amazon CloudWatch Logs collects and monitors log data from the file system. These logs provide insights into the performance, health, and activity of your Lustre file system.
+ Amazon EC2 is used to access the Lustre file system through the open source Lustre client. EC2 instances can access file systems from other Availability Zones within the same virtual private cloud (VPC), because the networking configuration allows access across subnets within the VPC. After the Lustre file system is mounted on the instance, you can work with its files and directories just as you would with a local file system.
+ AWS Key Management Service (AWS KMS) enhances the security of the file system by providing encryption for data at rest.
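
As a sketch of how an instance mounts the file system, the following commands assume an Amazon Linux 2 instance; the file system DNS name and mount name are hypothetical placeholders that you would read from the FSx console or your Terraform outputs.

```shell
# Install the open source Lustre client (Amazon Linux 2 shown; other
# distributions use different packages -- see the FSx for Lustre documentation).
sudo amazon-linux-extras install -y lustre

# Hypothetical DNS name and mount name -- replace with your file system's values.
sudo mkdir -p /fsx
sudo mount -t lustre -o relatime,flock \
  fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/abcdefgh /fsx
```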

**Automation and scale**

Terraform makes it easier to deploy, manage, and scale your Lustre file systems across multiple environments. In FSx for Lustre, a single file system has size limitations, so you might need to horizontally scale by creating multiple file systems. You can use Terraform to provision multiple Lustre file systems based on your workload needs.

## Tools
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-tools"></a>

**AWS services**
+ [Amazon CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) helps you centralize the logs from all your systems, applications, and AWS services so you can monitor them and archive them securely.
+ [Amazon Elastic Compute Cloud (Amazon EC2)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html) provides scalable computing capacity in the AWS Cloud. You can launch as many virtual servers as you need and quickly scale them up or down.
+ [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html) makes it easy and cost-effective to launch, run, and scale a high-performance Lustre file system.
+ [AWS Key Management Service (AWS KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) helps you create and control cryptographic keys to help protect your data.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

**Code repository**

The code for this pattern is available in the GitHub [Provision FSx for Lustre Filesystem using Terraform](https://github.com/aws-samples/provision-fsx-lustre-with-terraform) repository.

## Best practices
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-best-practices"></a>
+ The following variables define the Lustre file system. Make sure to configure these correctly based on your environment, as instructed in the [Epics](#deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-epics) section.
  + `storage_capacity` – The storage capacity of the Lustre file system, in GiB. The minimum and default setting is 1200 GiB.
  + `deployment_type` – The deployment type for the Lustre file system. For an explanation of the two options, `PERSISTENT_1` and `PERSISTENT_2` (default), see the [FSx for Lustre documentation](https://docs.aws.amazon.com/fsx/latest/LustreGuide/using-fsx-lustre.html#persistent-file-system).
  + `per_unit_storage_throughput` – The read and write throughput, in MB per second per TiB of storage.
  + `subnet_id` – The ID of the private subnet where you want to deploy FSx for Lustre.
  + `vpc_id` – The ID of the virtual private cloud (VPC) where you want to deploy FSx for Lustre.
  + `data_repository_path` – The path to the S3 bucket that will be linked to the Lustre file system.
  + `iam_instance_profile` – The IAM instance profile to use to launch the EC2 instance.
  + `kms_key_id` – The Amazon Resource Name (ARN) of the AWS KMS key that will be used for data encryption.
+ Ensure proper network access and placement within the VPC by using the `security_group` and `vpc_id` variables.
+ Run the `terraform plan` command as described in the [Epics](#deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-epics) section to preview and verify changes before applying them. This helps catch potential issues and ensures that you are aware of what will be deployed.
+ Use the `terraform validate` command as described in the [Epics](#deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-epics) section to check for syntax errors and to confirm that your configuration is correct.
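
Taken together, a minimal `terraform.tfvars` for this module might look like the following. Every value is a hypothetical placeholder, and the variable names are the ones listed above; substitute the IDs and ARNs from your own environment.

```shell
# Hypothetical values -- replace each one with IDs from your own environment.
cat > terraform.tfvars <<'EOF'
storage_capacity            = 1200
deployment_type             = "PERSISTENT_2"
per_unit_storage_throughput = 250
subnet_id                   = "subnet-0123456789abcdef0"
vpc_id                      = "vpc-0123456789abcdef0"
data_repository_path        = "s3://my-lustre-demo-bucket"
iam_instance_profile        = "my-fsx-demo-instance-profile"
kms_key_id                  = "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
EOF
```

With this file in place, `terraform plan -var-file terraform.tfvars` picks up all of these values.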

## Epics
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-epics"></a>

### Set up your environment
<a name="set-up-your-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Install Terraform. | To install Terraform on your local machine, follow the instructions in the [Terraform documentation](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli). | AWS DevOps, DevOps engineer | 
| Set up AWS credentials. | To set up the AWS Command Line Interface (AWS CLI) profile for the account, follow the instructions in the [AWS documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). | AWS DevOps, DevOps engineer | 
| Clone the GitHub repository. | To clone the GitHub repository, run the command:<pre>git clone https://github.com/aws-samples/provision-fsx-lustre-with-terraform.git</pre> | AWS DevOps, DevOps engineer | 
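
If you prefer a named profile over the default one, the credentials setup can be sketched as follows; the profile name and Region are hypothetical.

```shell
# Hypothetical profile name and Region -- adjust to your environment.
mkdir -p ~/.aws
cat >> ~/.aws/config <<'EOF'
[profile fsx-lustre-demo]
region = us-east-1
output = json
EOF

# Point subsequent AWS CLI and Terraform runs at the profile.
export AWS_PROFILE=fsx-lustre-demo
```

The access keys themselves belong in `~/.aws/credentials` (or come from IAM Identity Center), as described in the linked AWS documentation.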

### Configure and deploy FSx for Lustre
<a name="configure-and-deploy-fsxlustre"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Update the deployment configuration. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra.html) | AWS DevOps, DevOps engineer | 
| Initialize the Terraform environment. | To initialize your environment to run the Terraform `fsx_deployment` module, run:<pre>terraform init</pre> | AWS DevOps, DevOps engineer | 
| Validate the Terraform syntax. | To check for syntax errors and to confirm that your configuration is internally consistent, run:<pre>terraform validate</pre> | AWS DevOps, DevOps engineer | 
| Preview the Terraform plan. | To create a Terraform execution plan and preview the deployment, run:<pre>terraform plan -var-file terraform.tfvars</pre> | AWS DevOps, DevOps engineer | 
| Deploy the Terraform module. | To deploy the FSx for Lustre resources, run:<pre>terraform apply -var-file terraform.tfvars</pre> | AWS DevOps, DevOps engineer | 

### Clean up AWS resources
<a name="clean-up-aws-resources"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Remove AWS resources. | After you finish using your FSx for Lustre environment, you can remove the AWS resources deployed by Terraform to avoid incurring unnecessary charges. The Terraform module provided in the code repository automates this cleanup. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra.html) | AWS DevOps, DevOps engineer | 
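
Assuming the resources were deployed with the commands in the previous table, the cleanup is typically the matching Terraform command, run from the same working directory:

```shell
terraform destroy -var-file terraform.tfvars
```

Review the destruction plan that Terraform prints before confirming; any data that exists only on the Lustre file system (and has not been exported to Amazon S3 through the DRA) is lost when the file system is deleted.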

## Troubleshooting
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-troubleshooting"></a>


| Issue | Solution | 
| --- | --- | 
| FSx for Lustre returns errors. | For help with FSx for Lustre issues, see [Troubleshooting Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/troubleshooting.html) in the FSx for Lustre documentation. | 

## Related resources
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-resources"></a>
+ [Building Amazon FSx for Lustre by using Terraform](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/fsx_lustre_file_system) (AWS Provider reference in the Terraform documentation)
+ [Getting started with Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/getting-started.html) (FSx for Lustre documentation)
+ [AWS blog posts about Amazon FSx for Lustre](https://aws.amazon.com/blogs/storage/tag/amazon-fsx-for-lustre/)

# Set up a Grafana monitoring dashboard for AWS ParallelCluster
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster"></a>

*Dario La Porta and William Lu, Amazon Web Services*

## Summary
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-summary"></a>

AWS ParallelCluster helps you deploy and manage high performance computing (HPC) clusters. It supports AWS Batch and Slurm open source job schedulers. Although AWS ParallelCluster is integrated with Amazon CloudWatch for logging and metrics, it doesn't provide a monitoring dashboard for the workload.

The [Grafana dashboard for AWS ParallelCluster](https://github.com/aws-samples/aws-parallelcluster-monitoring) (GitHub) is a monitoring dashboard for AWS ParallelCluster. It provides job scheduler insights and detailed monitoring metrics at the operating system (OS) level. For more information about the dashboards included in this solution, see [Example Dashboards](https://github.com/aws-samples/aws-parallelcluster-monitoring#example-dashboards) in the GitHub repository. These metrics help you better understand the HPC workload and its performance. However, the dashboard code has not been updated for the latest versions of AWS ParallelCluster or the open source packages that are used in the solution. This pattern enhances the solution to provide the following benefits:
+ Supports AWS ParallelCluster v3
+ Uses the latest version of open source packages, including Prometheus, Grafana, Prometheus Slurm Exporter, and NVIDIA DCGM-Exporter
+ Increases the number of CPU cores and GPUs that the Slurm jobs use
+ Adds a job monitoring dashboard
+ Enhances the GPU node monitoring dashboard for nodes with 4 or 8 graphics processing units (GPUs)

This version of the enhanced solution has been implemented and verified in an AWS customer's HPC production environment.

## Prerequisites and limitations
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-prereqs"></a>

**Prerequisites**
+ [AWS ParallelCluster CLI](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster-v3.html), installed and configured.
+ A supported [network configuration](https://docs.aws.amazon.com/parallelcluster/latest/ug/network-configuration-v3.html) for AWS ParallelCluster. This pattern uses the [AWS ParallelCluster using two subnets](https://docs.aws.amazon.com/parallelcluster/latest/ug/network-configuration-v3.html#network-configuration-v3-two-subnets) configuration, which requires a public subnet, private subnet, internet gateway, and NAT gateway.
+ All AWS ParallelCluster cluster nodes must have internet access so that the installation scripts can download the open source software and Docker images.
+ A [key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) in Amazon Elastic Compute Cloud (Amazon EC2). Anyone who holds the private key of this key pair has Secure Shell (SSH) access to the head node.

**Limitations**
+ This pattern is designed to support Ubuntu 20.04 LTS. If you're using a different version of Ubuntu or if you use Amazon Linux or CentOS, then you need to modify the scripts provided with this solution. These modifications are not included in this pattern.

**Product versions**
+ Ubuntu 20.04 LTS
+ AWS ParallelCluster 3.x

**Billing and cost considerations**
+ The solution deployed in this pattern is not covered by the free tier. Charges apply for Amazon EC2, Amazon FSx for Lustre, the NAT gateway in Amazon VPC, and Amazon Route 53.

## Architecture
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-architecture"></a>

**Target architecture**

The following diagram shows how a user can access the monitoring dashboard for AWS ParallelCluster on the head node. The head node runs NICE DCV, Prometheus, Grafana, Prometheus Slurm Exporter, Prometheus Node Exporter, and NGINX Open Source. The compute nodes run Prometheus Node Exporter, and they also run NVIDIA DCGM-Exporter if the node contains GPUs. The head node retrieves information from the compute nodes and displays that data in the Grafana dashboard.

![Accessing the monitoring dashboard for AWS ParallelCluster on the head node.](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/a2132c94-98e0-4b90-8be0-99ebfa546442/images/d2255792-f66a-4ef2-8f04-cc3d5482db5f.png)


In most cases, the head node is not heavily loaded, because the job scheduler doesn't require a significant amount of CPU or memory. Users access the dashboard on the head node over HTTPS on port 443.

Viewers can access the monitoring dashboards anonymously, but only the Grafana administrator can modify dashboards. You configure a password for the Grafana administrator in the `aws-parallelcluster-monitoring/docker-compose/docker-compose.head.yml` file.
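
One way to generate a strong administrator password before placing it in that file is sketched below. How the value is wired into the compose file depends on the repository version (Grafana containers typically read it from the `GF_SECURITY_ADMIN_PASSWORD` environment variable), so treat this as an assumption to verify against the repository.

```shell
# Generate a random 24-character password for the Grafana administrator.
GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 18)
echo "$GRAFANA_ADMIN_PASSWORD"
```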

## Tools
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-tools"></a>

**AWS services**
+ [NICE DCV](https://docs.aws.amazon.com/dcv/) is a high-performance remote display protocol that helps you deliver remote desktops and application streaming from any cloud or data center to any device, over varying network conditions.
+ [AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html) helps you deploy and manage high performance computing (HPC) clusters. It supports AWS Batch and Slurm open source job schedulers.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [Amazon Virtual Private Cloud (Amazon VPC)](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) helps you launch AWS resources into a virtual network that you’ve defined.

**Other tools**
+ [Docker](https://www.docker.com/) is a set of platform as a service (PaaS) products that use virtualization at the operating-system level to deliver software in containers.
+ [Grafana](https://grafana.com/docs/grafana/latest/introduction/) is open source software that helps you query, visualize, alert on, and explore metrics, logs, and traces.
+ [NGINX Open Source](https://nginx.org/en/docs/) is an open source web server and reverse proxy.
+ [NVIDIA Data Center GPU Manager (DCGM)](https://docs.nvidia.com/data-center-gpu-manager-dcgm/index.html) is a suite of tools for managing and monitoring NVIDIA data center graphics processing units (GPUs) in cluster environments. In this pattern, you use [DCGM-Exporter](https://github.com/NVIDIA/dcgm-exporter), which helps you export GPU metrics from Prometheus.
+ [Prometheus](https://prometheus.io/docs/introduction/overview/) is an open source system-monitoring toolkit that collects and stores its metrics as time-series data with associated key-value pairs, which are called *labels*. In this pattern, you also use [Prometheus Slurm Exporter](https://github.com/vpenso/prometheus-slurm-exporter) to collect and export metrics, and you use [Prometheus Node Exporter](https://github.com/prometheus/node_exporter) to export metrics from the compute nodes.
+ [Ubuntu](https://help.ubuntu.com/) is an open source, Linux-based operating system that is designed for enterprise servers, desktops, cloud environments, and IoT.

**Code repository**

The code for this pattern is available in the GitHub [parallelcluster-monitoring-dashboard](https://github.com/aws-samples/parallelcluster-monitoring-dashboard) repository.

## Epics
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-epics"></a>

### Create the required resources
<a name="create-the-required-resources"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an S3 bucket. | Create an Amazon S3 bucket. You use this bucket to store the configuration scripts. For instructions, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in the Amazon S3 documentation. | General AWS | 
| Clone the repository. | Clone the GitHub [parallelcluster-monitoring-dashboard](https://github.com/aws-samples/parallelcluster-monitoring-dashboard) repo by running the following command.<pre>git clone https://github.com/aws-samples/parallelcluster-monitoring-dashboard.git</pre> | DevOps engineer | 
| Create an admin password. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | Linux Shell scripting | 
| Copy the required files into the S3 bucket. | Copy the [post\_install.sh](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/post_install.sh) script and the [aws-parallelcluster-monitoring](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/tree/main/aws-parallelcluster-monitoring) folder into the S3 bucket you created. For instructions, see [Uploading objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html) in the Amazon S3 documentation. | General AWS | 
| Configure an additional security group for the head node. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 
| Configure an IAM policy for the head node. | Create an identity-based policy for the head node. This policy allows the node to retrieve metric data from Amazon CloudWatch. The GitHub repo contains an example [policy](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/policies/head_node.json). For instructions, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) in the AWS Identity and Access Management (IAM) documentation. | AWS administrator | 
| Configure an IAM policy for the compute nodes. | Create an identity-based policy for the compute nodes. This policy allows the node to create the tags that contain the job ID and job owner. The GitHub repo contains an example [policy](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/policies/compute_node.json). For instructions, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) in the IAM documentation. If you use the provided example file, replace the following values:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 
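
The copy step in the table above can also be scripted with the AWS CLI; the bucket name below is a hypothetical placeholder for the bucket you created in the first task.

```shell
# Hypothetical bucket name -- use the S3 bucket you created earlier.
BUCKET=s3://my-pcluster-monitoring-bucket

# Run from the root of the cloned repository.
aws s3 cp post_install.sh "$BUCKET/post_install.sh"
aws s3 cp --recursive aws-parallelcluster-monitoring "$BUCKET/aws-parallelcluster-monitoring"
```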

### Create the cluster
<a name="create-the-cluster"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Modify the provided cluster template file. | Create the configuration for the AWS ParallelCluster cluster. Use the provided [cluster.yaml](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/cluster.yaml) cluster configuration file as a starting point, and replace the following values in the provided template:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 
| Create the cluster. | In the AWS ParallelCluster CLI, enter the following command. This generates and deploys an AWS CloudFormation stack that creates the cluster. For more information about this command, see [pcluster create-cluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.create-cluster-v3.html) in the AWS ParallelCluster documentation.<pre>pcluster create-cluster -n <cluster_name> -c cluster.yaml</pre> | AWS administrator | 
| Monitor the cluster creation. | Enter the following command to monitor the cluster creation. For more information about this command, see [pcluster describe-cluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.describe-cluster-v3.html) in the AWS ParallelCluster documentation.<pre>pcluster describe-cluster -n <cluster_name></pre> | AWS administrator | 

### Using the Grafana dashboards
<a name="using-the-grafana-dashboards"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Access the Grafana portal. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 

### Clean up the solution to stop incurring associated costs
<a name="clean-up-the-solution-to-stop-incurring-associated-costs"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Delete the cluster. | Enter the following command to delete the cluster. For more information about this command, see [pcluster delete-cluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.delete-cluster-v3.html) in the AWS ParallelCluster documentation.<pre>pcluster delete-cluster -n <cluster_name></pre> | AWS administrator | 
| Delete the IAM policies. | Delete the policies that you created for the head node and compute node. For more information about deleting policies, see [Deleting IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-delete.html) in the IAM documentation. | AWS administrator | 
| Delete the security group and rule. | Delete the security group that you created for the head node. For more information, see [Delete security group rules](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-security-groups.html#deleting-security-group-rules) and [Delete a security group](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-security-groups.html#deleting-security-groups) in the Amazon VPC documentation. | AWS administrator | 
| Delete the S3 bucket. | Delete the S3 bucket that you created to store the configuration scripts. For more information, see [Deleting a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) in the Amazon S3 documentation. | General AWS | 

## Troubleshooting
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-troubleshooting"></a>


| Issue | Solution | 
| --- | --- | 
| The head node is not accessible in the browser. | Check the security group, and confirm that inbound port 443 is open. | 
| Grafana doesn't open. | On the head node, run `docker logs grafana` to check the Grafana container log. | 
| Some metrics have no data. | On the head node, check the logs of all containers (run `docker ps` to list them, then `docker logs <container>` for each). | 

## Related resources
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-resources"></a>

**AWS documentation**
+ [IAM policies for Amazon EC2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-policies-for-amazon-ec2.html)

**Other AWS resources**
+ [AWS ParallelCluster](https://aws.amazon.com/hpc/parallelcluster/)
+ [Monitoring dashboard for AWS ParallelCluster](https://aws.amazon.com/blogs/compute/monitoring-dashboard-for-aws-parallelcluster/) (AWS blog post)

**Other resources**
+ [Prometheus monitoring system](https://prometheus.io/)
+ [Grafana](https://grafana.com/)

# More patterns
<a name="highperformancecomputing-more-patterns-pattern-list"></a>

**Topics**
+ [Implement AI-powered Kubernetes diagnostics and troubleshooting with K8sGPT and Amazon Bedrock integration](implement-ai-powered-kubernetes-diagnostics-and-troubleshooting-with-k8sgpt-and-amazon-bedrock-integration.md)