This document is a machine translation of the English version. If there is any ambiguity or inconsistency, the English version prevails.

# High performance computing
<a name="highperformancecomputing-pattern-list"></a>

**Topics**
+ [Deploy a Lustre file system for high-performance data processing by using Terraform and DRA](deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra.md)
+ [Set up a Grafana monitoring dashboard for AWS ParallelCluster](set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.md)
+ [More patterns](highperformancecomputing-more-patterns-pattern-list.md)

# Deploy a Lustre file system for high-performance data processing by using Terraform and DRA
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra"></a>

*Arun Bagal and Ishwar Chauthaiwale, Amazon Web Services*

## Summary
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-summary"></a>

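This pattern automatically deploys a Lustre file system on AWS and integrates it with Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

This solution helps you quickly set up a high performance computing (HPC) environment with integrated storage, compute resources, and Amazon S3 data access. It combines the storage capabilities of Lustre with the flexible compute options of Amazon EC2 and the scalable object storage of Amazon S3, so you can handle data-intensive workloads in machine learning, HPC, and big data analytics.

The pattern uses a HashiCorp Terraform module and Amazon FSx for Lustre to streamline the following process:
+ Provisioning a Lustre file system
+ Establishing a data repository association (DRA) between FSx for Lustre and an S3 bucket to link the Lustre file system with Amazon S3 objects
+ Creating an EC2 instance
+ Mounting the Lustre file system, with its Amazon S3-linked DRA, on the EC2 instance

The benefits of this solution include:
+ Modular design. You can easily maintain and update the individual components of this solution.
+ Scalability. You can quickly deploy consistent environments across AWS accounts or AWS Regions.
+ Flexibility. You can customize the deployment to fit your specific needs.
+ Best practices. This pattern uses preconfigured modules that follow AWS best practices.

For more information about Lustre file systems, see the [Lustre website](https://www.lustre.org/).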

## Prerequisites and limitations
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-prereqs"></a>

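**Prerequisites**
+ An active AWS account
+ A least-privilege AWS Identity and Access Management (IAM) policy (see [instructions](https://aws.amazon.com/blogs/security/techniques-for-writing-least-privilege-iam-policies/))

**Limitations**

FSx for Lustre limits a Lustre file system to a single Availability Zone, which can be a concern if you have high availability requirements. If the Availability Zone that contains the file system fails, access to the file system is lost until recovery. To achieve high availability, you can use a DRA to link the Lustre file system with Amazon S3, and transfer data between Availability Zones.

**Product versions**
+ [Terraform version 1.9.3 or later](https://developer.hashicorp.com/terraform/install?product_intent=terraform)
+ [HashiCorp AWS Provider version 4.0.0 or later](https://registry.terraform.io/providers/hashicorp/aws/latest)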

## Architecture
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-architecture"></a>

The following diagram shows the architecture for FSx for Lustre and the complementary AWS services in the AWS Cloud.

![\[FSx for Lustre deployment with AWS KMS, Amazon EC2, Amazon CloudWatch Logs, and Amazon S3.\]](http://docs.aws.amazon.com/zh_tw/prescriptive-guidance/latest/patterns/images/pattern-img/51d38589-e752-42cd-9f46-59c3c8d0bfd3/images/c1c21952-fd6f-4b1d-9bf8-09b2f4f4459f.png)


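The architecture includes the following:
+ An S3 bucket serves as a durable, scalable, and cost-effective storage location for data. The integration between FSx for Lustre and Amazon S3 provides a high-performance file system that is seamlessly linked with Amazon S3.
+ FSx for Lustre runs and manages the Lustre file system.
+ Amazon CloudWatch Logs collects and monitors log data from the file system. These logs provide insight into the performance, health, and activity of your Lustre file system.
+ Amazon EC2 is used to access the Lustre file system through the open source Lustre client. EC2 instances can access the file system from other Availability Zones within the same virtual private cloud (VPC). The networking configuration allows access across subnets within the VPC. After the Lustre file system is mounted on the instance, you can work with its files and directories just as you would with a local file system.
+ AWS Key Management Service (AWS KMS) enhances the security of the file system by providing encryption for data at rest.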
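
For orientation, mounting an FSx for Lustre file system manually on an EC2 instance that has the Lustre client installed looks roughly like the following. This is a minimal sketch based on the standard FSx for Lustre mount syntax. The helper name `lustre_mount_cmd`, the DNS name, the mount name, and the `/fsx` directory are illustrative placeholders, and the Terraform module in this pattern might perform the mount differently.

```shell
# Print (but do not run) the mount command for an FSx for Lustre file system.
# Arguments: <file system DNS name> <mount name> [mount directory, default /fsx]
# All three values are placeholders that you take from your own deployment.
lustre_mount_cmd() {
  local dns_name="$1"
  local mount_name="$2"
  local mount_dir="${3:-/fsx}"
  printf 'sudo mount -t lustre -o relatime,flock %s@tcp:/%s %s\n' \
    "$dns_name" "$mount_name" "$mount_dir"
}
```

Printing the command first lets you review it before running it on the instance, for example with `lustre_mount_cmd <dns_name> <mount_name> /fsx`.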

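**Automation and scale**

Terraform makes it easier to deploy, manage, and scale your Lustre file systems across multiple environments. In FSx for Lustre, a single file system has size limits, so you might need to scale horizontally by creating multiple file systems. You can use Terraform to provision multiple Lustre file systems based on your workload needs.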

## Tools
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-tools"></a>

**AWS services**
+ [Amazon CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) helps you centralize the logs from all your systems, applications, and AWS services so that you can monitor them and archive them securely.
+ [Amazon Elastic Compute Cloud (Amazon EC2)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html) provides scalable computing capacity in the AWS Cloud. You can launch as many virtual servers as you need and quickly scale them up or down.
+ [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html) makes it easy and cost-effective to launch, run, and scale a high-performance Lustre file system.
+ [AWS Key Management Service (AWS KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) helps you create and control cryptographic keys to help protect your data.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

**Code repository**

The code for this pattern is available in the GitHub [Provision FSx for Lustre Filesystem using Terraform](https://github.com/aws-samples/provision-fsx-lustre-with-terraform) repository.

## Best practices
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-best-practices"></a>
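+ The following variables define the Lustre file system. Make sure to configure them correctly for your environment, as described in the [Epics](#deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-epics) section.
  + `storage_capacity` – The storage capacity of the Lustre file system, in GiB. The minimum and default setting is 1200 GiB.
  + `deployment_type` – The deployment type for the Lustre file system. For an explanation of the two options, `PERSISTENT_1` and `PERSISTENT_2` (default), see the [FSx for Lustre documentation](https://docs.aws.amazon.com/fsx/latest/LustreGuide/using-fsx-lustre.html#persistent-file-system).
  + `per_unit_storage_throughput` – The read and write throughput, in MB/s per TiB.
  + `subnet_id` – The ID of the private subnet where you want to deploy FSx for Lustre.
  + `vpc_id` – The ID of the virtual private cloud on AWS where you want to deploy FSx for Lustre.
  + `data_repository_path` – The path to the S3 bucket that will be linked to the Lustre file system.
  + `iam_instance_profile` – The IAM instance profile to use to launch the EC2 instance.
  + `kms_key_id` – The Amazon Resource Name (ARN) of the AWS KMS key that will be used for data encryption.
+ Use the `security_group` and `vpc_id` variables to ensure proper network access and placement within the VPC.
+ Run the `terraform plan` command, as described in the [Epics](#deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-epics) section, to preview and verify changes before you apply them. This helps catch potential issues and ensures that you know what will be deployed.
+ Use the `terraform validate` command, as described in the [Epics](#deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-epics) section, to check for syntax errors and to confirm that your configuration is correct.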
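
As an illustration of how the variables listed above fit together, the following sketch writes a sample `terraform.tfvars` file. Every value is a placeholder (the subnet, VPC, bucket, instance profile, and key ARN are invented), and the value types are assumptions. The repository might define additional required variables, such as `security_group`, so check its variable definitions before you rely on this file.

```shell
# Write a sample terraform.tfvars. All values are illustrative placeholders;
# replace them with the IDs, paths, and ARNs from your own AWS account.
cat > terraform.tfvars <<'EOF'
storage_capacity            = 1200
deployment_type             = "PERSISTENT_2"
per_unit_storage_throughput = 125
subnet_id                   = "subnet-0123456789abcdef0"
vpc_id                      = "vpc-0123456789abcdef0"
data_repository_path        = "s3://amzn-s3-demo-bucket/lustre"
iam_instance_profile        = "example-ec2-instance-profile"
kms_key_id                  = "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
EOF
```

The throughput value of 125 MB/s per TiB is one of the tiers that FSx for Lustre offers for `PERSISTENT_2` file systems. If you switch to `PERSISTENT_1`, choose a tier that is valid for that deployment type.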

## Epics
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-epics"></a>

### Set up your environment
<a name="set-up-your-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Install Terraform. | To install Terraform on your local machine, follow the instructions in the [Terraform documentation](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli). | AWS DevOps, DevOps engineer | 
| Set up AWS credentials. | To set up the AWS Command Line Interface (AWS CLI) profile for the account, follow the instructions in the [AWS documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). | AWS DevOps, DevOps engineer | 
| Clone the GitHub repository. | To clone the GitHub repository, run the command:<pre>git clone https://github.com/aws-samples/provision-fsx-lustre-with-terraform.git</pre> | AWS DevOps, DevOps engineer | 

### Configure and deploy FSx for Lustre
<a name="configure-and-deploy-fsxlustre"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Update the deployment configuration. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/prescriptive-guidance/latest/patterns/deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra.html) | AWS DevOps, DevOps engineer | 
| Initialize the Terraform environment. | To initialize your environment to run the Terraform `fsx_deployment` module, run:<pre>terraform init</pre> | AWS DevOps, DevOps engineer | 
| Validate the Terraform syntax. | To check for syntax errors and to confirm that your configuration is correct, run:<pre>terraform validate</pre> | AWS DevOps, DevOps engineer | 
| Validate the Terraform configuration. | To create a Terraform execution plan and preview the deployment, run:<pre>terraform plan -var-file terraform.tfvars</pre> | AWS DevOps, DevOps engineer | 
| Deploy the Terraform module. | To deploy the FSx for Lustre resources, run:<pre>terraform apply -var-file terraform.tfvars</pre> | AWS DevOps, DevOps engineer | 
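
The four Terraform steps in the preceding table can also be chained so that a failure at any stage stops the deployment. The following is a sketch rather than part of the repository: the function name `deploy_fsx` and the plan file name `tfplan` are hypothetical. It differs from the table in one respect: it saves the plan with `-out` and applies that saved plan, so the changes that you reviewed are exactly the changes that are applied.

```shell
# Run init, validate, plan, and apply in order; stop at the first failure.
# Argument: path to the variables file (defaults to terraform.tfvars).
deploy_fsx() {
  local var_file="${1:-terraform.tfvars}"
  terraform init &&
    terraform validate &&
    terraform plan -var-file="$var_file" -out=tfplan &&
    terraform apply tfplan
}
```

Run `deploy_fsx` from the directory that contains the Terraform configuration. Applying a saved plan skips the interactive approval prompt, so review the plan output before you rely on this in automation.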

### Clean up AWS resources
<a name="clean-up-aws-resources"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Remove AWS resources. | After you finish using your FSx for Lustre environment, you can remove the AWS resources that Terraform deployed to avoid incurring unnecessary charges. The Terraform module provided in the code repository automates this cleanup.[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/prescriptive-guidance/latest/patterns/deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra.html) | AWS DevOps, DevOps engineer | 

## Troubleshooting
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-troubleshooting"></a>


| Issue | Solution | 
| --- | --- | 
| FSx for Lustre returns errors. | For help with FSx for Lustre issues, see [Troubleshooting Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/troubleshooting.html) in the FSx for Lustre documentation. | 

## Related resources
<a name="deploy-lustre-file-system-for-high-performance-data-processing-with-terraform-dra-resources"></a>
+ [Building Amazon FSx for Lustre by using Terraform](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/fsx_lustre_file_system) (AWS Provider reference in the Terraform documentation)
+ [Getting started with Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/getting-started.html) (FSx for Lustre documentation)
+ [AWS blog posts about Amazon FSx for Lustre](https://aws.amazon.com/blogs/storage/tag/amazon-fsx-for-lustre/)

# Set up a Grafana monitoring dashboard for AWS ParallelCluster
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster"></a>

*Dario La Porta and William Lu, Amazon Web Services*

## Summary
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-summary"></a>

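AWS ParallelCluster helps you deploy and manage high performance computing (HPC) clusters. It supports the AWS Batch and Slurm open source job schedulers. Although AWS ParallelCluster is integrated with Amazon CloudWatch for logging and metrics, it doesn't provide a monitoring dashboard for the workload.

The [Grafana dashboard for AWS ParallelCluster](https://github.com/aws-samples/aws-parallelcluster-monitoring) (GitHub) is a monitoring dashboard for AWS ParallelCluster. It provides job scheduler insights and detailed monitoring metrics at the operating system (OS) level. For more information about the dashboards included in this solution, see [Example Dashboards](https://github.com/aws-samples/aws-parallelcluster-monitoring#example-dashboards) in the GitHub repository. These metrics help you better understand the HPC workloads and their performance. However, the dashboard code has not been updated for the latest versions of AWS ParallelCluster or of the open source packages that the solution uses. This pattern enhances the solution to provide the following advantages:
+ Supports AWS ParallelCluster v3
+ Uses the latest versions of the open source packages, including Prometheus, Grafana, Prometheus Slurm Exporter, and NVIDIA DCGM-Exporter
+ Increases the number of CPU cores and GPUs that the Slurm jobs use
+ Adds a job monitoring dashboard
+ Enhances the GPU node monitoring dashboard for nodes with 4 or 8 graphics processing units (GPUs)

This version of the enhanced solution has been implemented and verified in an AWS customer's HPC production environment.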

## Prerequisites and limitations
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-prereqs"></a>

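**Prerequisites**
+ [AWS ParallelCluster CLI](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster-v3.html), installed and configured.
+ A supported [network configuration](https://docs.aws.amazon.com/parallelcluster/latest/ug/iam-roles-in-parallelcluster-v3.html) for AWS ParallelCluster. This pattern uses the [AWS ParallelCluster using two subnets](https://docs.aws.amazon.com/parallelcluster/latest/ug/network-configuration-v3.html#network-configuration-v3-two-subnets) configuration, which requires a public subnet, a private subnet, an internet gateway, and a NAT gateway.
+ All AWS ParallelCluster cluster nodes must have internet access. This is required so that the installation scripts can download the open source software and Docker images.
+ A [key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) in Amazon Elastic Compute Cloud (Amazon EC2). Resources that have this key pair have Secure Shell (SSH) access to the head node.

**Limitations**
+ This pattern is designed to support Ubuntu 20.04 LTS. If you use a different version of Ubuntu, or if you use Amazon Linux or CentOS, you need to modify the scripts provided with this solution. These modifications are not included in this pattern.

**Product versions**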
+ Ubuntu 20.04 LTS
+ ParallelCluster 3.X

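**Billing and cost considerations**
+ The solution deployed in this pattern is not covered by the free tier. Charges apply for Amazon EC2, Amazon FSx for Lustre, the NAT gateway in Amazon VPC, and Amazon Route 53.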

## Architecture
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-architecture"></a>

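**Target architecture**

The following diagram shows how a user can access the monitoring dashboard for AWS ParallelCluster on the head node. The head node runs NICE DCV, Prometheus, Grafana, Prometheus Slurm Exporter, Prometheus Node Exporter, and NGINX Open Source. The compute nodes run Prometheus Node Exporter, and they also run NVIDIA DCGM-Exporter if the node contains GPUs. The head node retrieves information from the compute nodes and displays that data in the Grafana dashboard.

![\[Accessing the monitoring dashboard for AWS ParallelCluster on the head node.\]](http://docs.aws.amazon.com/zh_tw/prescriptive-guidance/latest/patterns/images/pattern-img/a2132c94-98e0-4b90-8be0-99ebfa546442/images/d2255792-f66a-4ef2-8f04-cc3d5482db5f.png)


In most cases, the head node is not heavily loaded because the job scheduler doesn't require a significant amount of CPU or memory. Users access the dashboard on the head node by using SSL on port 443.

All authorized viewers can view the monitoring dashboards anonymously. Only the Grafana administrator can modify the dashboards. You configure the password for the Grafana administrator in the `aws-parallelcluster-monitoring/docker-compose/docker-compose.head.yml` file.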

## Tools
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-tools"></a>

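**AWS services**
+ [NICE DCV](https://docs.aws.amazon.com/dcv/#nice-dcv) is a high-performance remote display protocol that helps you deliver remote desktops and application streaming from any cloud or data center to any device, over varying network conditions.
+ [AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html) helps you deploy and manage high performance computing (HPC) clusters. It supports the AWS Batch and Slurm open source job schedulers.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [Amazon Virtual Private Cloud (Amazon VPC)](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) helps you launch AWS resources into a virtual network that you've defined.

**Other tools**
+ [Docker](https://www.docker.com/) is a set of platform as a service (PaaS) products that use virtualization at the operating-system level to deliver software in containers.
+ [Grafana](https://grafana.com/docs/grafana/latest/introduction/) is open source software that helps you query, visualize, alert on, and explore metrics, logs, and traces.
+ [NGINX Open Source](https://nginx.org/en/docs/?_ga=2.187509224.1322712425.1699399865-405102969.1699399865) is an open source web server and reverse proxy.
+ [NVIDIA Data Center GPU Manager (DCGM)](https://docs.nvidia.com/data-center-gpu-manager-dcgm/index.html) is a suite of tools for managing and monitoring NVIDIA data center graphics processing units (GPUs) in cluster environments. In this pattern, you use [DCGM-Exporter](https://github.com/NVIDIA/dcgm-exporter), which exposes GPU metrics for Prometheus to collect.
+ [Prometheus](https://prometheus.io/docs/introduction/overview/) is an open source system-monitoring toolkit that collects and stores its metrics as time-series data with associated key-value pairs, which are called *labels*. In this pattern, you also use [Prometheus Slurm Exporter](https://github.com/vpenso/prometheus-slurm-exporter) to collect and export metrics, and you use [Prometheus Node Exporter](https://github.com/prometheus/node_exporter) to export metrics from the compute nodes.
+ [Ubuntu](https://help.ubuntu.com/) is an open source, Linux-based operating system that is designed for enterprise servers, desktops, cloud environments, and IoT.

**Code repository**

The code for this pattern is available in the GitHub [pcluster-monitoring-dashboard](https://github.com/aws-samples/parallelcluster-monitoring-dashboard) repository.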

## Epics
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-epics"></a>

### Create the required resources
<a name="create-the-required-resources"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an S3 bucket. | Create an Amazon S3 bucket. You use this bucket to store the configuration scripts. For instructions, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in the Amazon S3 documentation. | General AWS | 
| Clone the repository. | Clone the GitHub [pcluster-monitoring-dashboard](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/tree/main/aws-parallelcluster-monitoring) repository by running the following command.<pre>git clone https://github.com/aws-samples/parallelcluster-monitoring-dashboard.git</pre> | DevOps engineer | 
| Create an admin password. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | Linux Shell scripting | 
| Copy the required files into the S3 bucket. | Copy the [post\_install.sh](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/post_install.sh) script and the [aws-parallelcluster-monitoring](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/tree/main/aws-parallelcluster-monitoring) folder into the S3 bucket that you created. For instructions, see [Uploading objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html) in the Amazon S3 documentation. | General AWS | 
| Configure an additional security group for the head node. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 
| Configure an IAM policy for the head node. | Create an identity-based policy for the head node. This policy allows the node to retrieve metric data from Amazon CloudWatch. The GitHub repository contains an example [policy](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/policies/head_node.json). For instructions, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) in the AWS Identity and Access Management (IAM) documentation. | AWS administrator | 
| Configure an IAM policy for the compute nodes. | Create an identity-based policy for the compute nodes. This policy allows the node to create the tags that contain the job ID and job owner. The GitHub repository contains an example [policy](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/policies/compute_node.json). For instructions, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) in the IAM documentation. If you use the provided example file, replace the following values:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 

### Create the cluster
<a name="create-the-cluster"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Modify the provided cluster template file. | Create the AWS ParallelCluster cluster. Use the provided [cluster.yaml](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/cluster.yaml) AWS CloudFormation template file as a starting point to create the cluster. Replace the following values in the provided template:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 
| Create the cluster. | In the AWS ParallelCluster CLI, enter the following command. This deploys the CloudFormation template and creates the cluster. For more information about this command, see [pcluster create-cluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.create-cluster-v3.html) in the AWS ParallelCluster documentation.<pre>pcluster create-cluster -n <cluster_name> -c cluster.yaml</pre> | AWS administrator | 
| Monitor the cluster creation. | Enter the following command to monitor the cluster creation. For more information about this command, see [pcluster describe-cluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.describe-cluster-v3.html) in the AWS ParallelCluster documentation.<pre>pcluster describe-cluster -n <cluster_name></pre> | AWS administrator | 
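
Instead of rerunning `pcluster describe-cluster` by hand, you can poll it until creation finishes. The following is a minimal sketch, not part of the pattern's repository. The function name `wait_for_cluster` and the 30-second default interval are hypothetical, and the sketch assumes that the `--query clusterStatus` option of the AWS ParallelCluster v3 CLI returns the status as a quoted JSON string such as `"CREATE_IN_PROGRESS"`.

```shell
# Poll the cluster status until creation completes or fails.
# Arguments: <cluster name> [seconds between polls, default 30]
wait_for_cluster() {
  local name="$1"
  local interval="${2:-30}"
  local status
  while true; do
    # Strip the JSON quotes so the status can be matched as plain text.
    status="$(pcluster describe-cluster -n "$name" --query clusterStatus | tr -d '"')"
    echo "clusterStatus: $status"
    case "$status" in
      CREATE_COMPLETE) return 0 ;;
      CREATE_IN_PROGRESS) sleep "$interval" ;;
      *) return 1 ;;  # CREATE_FAILED, an empty result, or any other state
    esac
  done
}
```

For example, `wait_for_cluster <cluster_name> && echo "cluster is ready"` returns only after the cluster reaches `CREATE_COMPLETE`, and exits with a nonzero status for any state other than in progress or complete.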

### Using the Grafana dashboards
<a name="using-the-grafana-dashboards"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Access the Grafana portal. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 

### Clean up the solution to stop incurring associated costs
<a name="clean-up-the-solution-to-stop-incurring-associated-costs"></a>


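| Task | Description | Skills required | 
| --- | --- | --- | 
| Delete the cluster. | Enter the following command to delete the cluster. For more information about this command, see [pcluster delete-cluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.delete-cluster-v3.html) in the AWS ParallelCluster documentation.<pre>pcluster delete-cluster -n <cluster_name></pre> | AWS administrator | 
| Delete the IAM policies. | Delete the policies that you created for the head node and the compute nodes. For more information about deleting policies, see [Deleting IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-delete.html) in the IAM documentation. | AWS administrator | 
| Delete the security group and rules. | Delete the security group that you created for the head node. For more information, see [Delete security group rules](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-security-groups.html#deleting-security-group-rules) and [Delete a security group](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-security-groups.html#deleting-security-groups) in the Amazon VPC documentation. | AWS administrator | 
| Delete the S3 bucket. | Delete the S3 bucket that you created to store the configuration scripts. For more information, see [Deleting a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) in the Amazon S3 documentation. | General AWS | 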

## Troubleshooting
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-troubleshooting"></a>


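| Issue | Solution | 
| --- | --- | 
| The head node is not accessible in the browser. | Check the security group and confirm that inbound port 443 is open. | 
| Grafana doesn't open. | On the head node, check the container log by running `docker logs Grafana`. | 
| Some metrics have no data. | On the head node, check the container logs of all containers. | 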
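
To check the logs of every container on the head node in one pass, a loop over the container names reported by Docker can help. This is a minimal sketch: the function name `dump_container_logs` and the default of 50 lines are hypothetical. Because it reads the names from `docker ps`, it does not depend on knowing the exact container names or their capitalization in advance.

```shell
# Print the last N log lines of every container, running or stopped.
# Argument: number of log lines per container (defaults to 50).
dump_container_logs() {
  local lines="${1:-50}"
  docker ps --all --format '{{.Names}}' | while read -r name; do
    echo "=== $name ==="
    # Container logs can arrive on stderr, so merge both streams.
    docker logs --tail "$lines" "$name" 2>&1
  done
}
```

Run `dump_container_logs 100` on the head node to see more history per container. The `--all` flag includes stopped containers, which is useful when a container exited during startup.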

## Related resources
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-resources"></a>

**AWS documentation**
+ [IAM policies for Amazon EC2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-policies-for-amazon-ec2.html)

**Other AWS resources**
+ [AWS ParallelCluster](https://aws.amazon.com/hpc/parallelcluster/)
+ [Monitoring dashboard for AWS ParallelCluster](https://aws.amazon.com/blogs/compute/monitoring-dashboard-for-aws-parallelcluster/) (AWS blog post)

**Other resources**
+ [Prometheus monitoring system](https://prometheus.io/)
+ [Grafana](https://grafana.com/)

# More patterns
<a name="highperformancecomputing-more-patterns-pattern-list"></a>

**Topics**
+ [Implement AI-powered Kubernetes diagnostics and troubleshooting with K8sGPT and Amazon Bedrock integration](implement-ai-powered-kubernetes-diagnostics-and-troubleshooting-with-k8sgpt-and-amazon-bedrock-integration.md)