Validate Prometheus setup on the head node of a HyperPod cluster

After you have successfully set up your HyperPod cluster installed with the exporter packages, check if Prometheus is properly set up on the head node of your HyperPod cluster.

Connect to the head node of your cluster. For instructions on accessing a node, see Access your SageMaker HyperPod cluster nodes.

Run the following command to verify the Prometheus config and service file created by the lifecycle script install_prometheus.sh is running on the controller node. The output should show the Active status as active (running).


$ sudo systemctl status prometheus
• prometheus service - Prometheus Exporter
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; preset:disabled)
Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC; Ss ago
Main PID: 12345 (prometheus)
Tasks: 7 (limit: 9281)
Memory: 35M
CPU: 234ms
CGroup: /system.slice/prometheus.service
        -12345 /usr/bin/prometheus--config.file=/etc/prometheus/prometheus.yml

Validate the Prometheus configuration file as follows. The output must be similar to the following, with three exporter configured with the right compute node IP addresses.


$ cat /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 15s

scrape_configs:
  - job_name: 'slurm_exporter'
    static_configs:
      - targets:
          - 'localhost:8080'
  - job_name: 'dcgm_exporter'
    static_configs:
      - targets:
          - '<ComputeNodeIP>:9400'
          - '<ComputeNodeIP>:9400'
  - job_name: 'efa_node_exporter'
    static_configs:
      - targets:
          - '<ComputeNodeIP>:9100'
          - '<ComputeNodeIP>:9100'

remote_write:
  - url: <AMPReoteWriteURL>
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500
    sigv4:
      region: <Region>

To test if Prometheus is exporting Slurm, DCGM, and EFA metrics properly, run the following curl command for Prometheus on port :9090 on the head node.
```
$ curl -s http://localhost:9090/metrics | grep -E 'slurm|dcgm|efa'
```
With the metrics exported to Amazon Managed Service for Prometheus Workspace through the Prometheus remote write configuration from the controller node, you can proceed to the next topic to set up Amazon Managed Grafana dashboards to display the metrics.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Install metrics exporter packages on your HyperPod cluster

Set up an Amazon Managed Grafana workspace