


# Monitor your cluster metrics with Prometheus
<a name="prometheus"></a>

[Prometheus](https://prometheus.io/) is a monitoring system and time-series database that collects metrics by scraping endpoints. It provides the ability to query, aggregate, and store collected data, and you can also use it for alerting and alert aggregation. This topic explains how to set up Prometheus as either a managed or open source option. Monitoring Amazon EKS control plane metrics is a common use case.

Amazon Managed Service for Prometheus is a Prometheus-compatible monitoring and alerting service that makes it easy to monitor containerized applications and infrastructure at scale. It is a fully managed service that automatically scales the ingestion, storage, querying, and alerting of your metrics. It also integrates with AWS security services to enable fast and secure access to your data. You can use the open-source PromQL query language to query your metrics and alert on them. You can also use the alert manager in Amazon Managed Service for Prometheus to set up alerting rules for critical alerts, and then send those alerts as notifications to an Amazon SNS topic.

There are several different options for using Prometheus with Amazon EKS:
+ You can turn on Prometheus metrics when first creating an Amazon EKS cluster or you can create your own Prometheus scraper for existing clusters. Both of these options are covered by this topic.
+ You can deploy Prometheus using Helm. For more information, see [Deploy Prometheus using Helm](deploy-prometheus.md).
+ You can view control plane raw metrics in Prometheus format. For more information, see [Fetch control plane raw metrics in Prometheus format](view-raw-metrics.md).

## Step 1: Turn on Prometheus metrics
<a name="turn-on-prometheus-metrics"></a>

**Important**  
Amazon Managed Service for Prometheus resources are outside of the cluster lifecycle and need to be maintained independently of the cluster. When you delete your cluster, make sure to also delete any applicable scrapers to stop incurring costs. For more information, see [Find and delete scrapers](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-collector-how-to.html#AMP-collector-list-delete) in the *Amazon Managed Service for Prometheus User Guide*.

Prometheus discovers and collects metrics from your cluster through a pull-based model called scraping. Scrapers are set up to gather data from your cluster infrastructure and containerized applications. When you turn on the option to send Prometheus metrics, Amazon Managed Service for Prometheus provides a fully managed agentless scraper.

If you haven’t created the cluster yet, you can turn on the option to send metrics to Prometheus when first creating the cluster. In the Amazon EKS console, this option is in the **Configure observability** step of creating a new cluster. For more information, see [Create an Amazon EKS cluster](create-cluster.md).

If you already have an existing cluster, you can create your own Prometheus scraper. To do this in the Amazon EKS console, navigate to your cluster’s **Observability** tab and choose the **Add scraper** button. If you would rather do so with the AWS API or AWS CLI, see [Create a scraper](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-collector-how-to.html#AMP-collector-create) in the *Amazon Managed Service for Prometheus User Guide*.

The following options are available when creating the scraper with the Amazon EKS console.

 **Scraper alias**   
(Optional) Enter a unique alias for the scraper.

 **Destination**   
Choose an Amazon Managed Service for Prometheus workspace. A workspace is a logical space dedicated to the storage and querying of Prometheus metrics. With this workspace, you will be able to view Prometheus metrics across the accounts that have access to it. The **Create new workspace** option tells Amazon EKS to create a workspace on your behalf using the **Workspace alias** you provide. With the **Select existing workspace** option, you can select an existing workspace from a dropdown list. For more information about workspaces, see [Managing workspaces](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-manage-ingest-query.html) in the *Amazon Managed Service for Prometheus User Guide*.

 **Service access**   
This section summarizes the permissions you grant when sending Prometheus metrics:  
+ Allow Amazon Managed Service for Prometheus to describe the scraped Amazon EKS cluster
+ Allow remote writing to the Amazon Managed Prometheus workspace
If the `AmazonManagedScraperRole` already exists, the scraper uses it. Choose the `AmazonManagedScraperRole` link to see the **Permission details**. If the `AmazonManagedScraperRole` doesn’t exist already, choose the **View permission details** link to see the specific permissions you are granting by sending Prometheus metrics.

 **Subnets**   
Modify the subnets that the scraper will inherit as needed. If you need to add a subnet option that is grayed out, go back to the **Specify networking** step of creating the cluster.

 **Scraper configuration**   
Modify the scraper configuration in YAML format as needed. To do so, use the form or upload a replacement YAML file. For more information, see [Scraper configuration](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-collector-how-to.html#AMP-collector-configuration) in the *Amazon Managed Service for Prometheus User Guide*.

Amazon Managed Service for Prometheus refers to the agentless scraper that is created alongside the cluster as an AWS managed collector. For more information about AWS managed collectors, see [Ingest metrics with AWS managed collectors](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-collector.html) in the *Amazon Managed Service for Prometheus User Guide*.

**Important**  
If you create a Prometheus scraper using the AWS CLI or AWS API, you need to adjust its configuration to give the scraper in-cluster permissions. For more information, see [Configuring your Amazon EKS cluster](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-collector-how-to.html#AMP-collector-eks-setup) in the *Amazon Managed Service for Prometheus User Guide*.
If you have a Prometheus scraper created before November 11, 2024 that uses the `aws-auth` `ConfigMap` instead of access entries, you need to update it to access additional metrics from the Amazon EKS cluster control plane. For the updated configuration, see [Manually configuring Amazon EKS for scraper access](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-collector-how-to.html#AMP-collector-eks-manual-setup) in the *Amazon Managed Service for Prometheus User Guide*.

## Step 2: Use the Prometheus metrics
<a name="use-prometheus-metrics"></a>

For more information about how to use the Prometheus metrics after you turn them on for your cluster, see the [Amazon Managed Service for Prometheus User Guide](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html).

## Step 3: Manage Prometheus scrapers
<a name="viewing-prometheus-scraper-details"></a>

To manage scrapers, choose the **Observability** tab in the Amazon EKS console. A table shows a list of scrapers for the cluster, including information such as the scraper ID, alias, status, and creation date. You can add more scrapers, edit scrapers, delete scrapers, or view more information about the current scrapers.

To see more details about a scraper, choose the scraper ID link. For example, you can view the ARN, environment, workspace ID, IAM role, configuration, and networking information. You can use the scraper ID as input to Amazon Managed Service for Prometheus API operations like [DescribeScraper](https://docs.aws.amazon.com/prometheus/latest/APIReference/API_DescribeScraper.html), [UpdateScraper](https://docs.aws.amazon.com/prometheus/latest/APIReference/API_UpdateScraper.html), and [DeleteScraper](https://docs.aws.amazon.com/prometheus/latest/APIReference/API_DeleteScraper.html). For more information on using the Prometheus API, see the [Amazon Managed Service for Prometheus API Reference](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-APIReference.html).

# Deploy Prometheus using Helm
<a name="deploy-prometheus"></a>

As an alternative to using Amazon Managed Service for Prometheus, you can deploy Prometheus into your cluster with Helm, a package manager for Kubernetes clusters. If you already have Helm installed, you can check your version with the `helm version` command. For more information about Helm and how to install it, see [Deploy applications with Helm on Amazon EKS](helm.md).

After you configure Helm for your Amazon EKS cluster, you can use it to deploy Prometheus with the following steps.

1. Create a Prometheus namespace.

   ```
   kubectl create namespace prometheus
   ```

1. Add the `prometheus-community` chart repository.

   ```
   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
   ```

1. Deploy Prometheus.

   ```
   helm upgrade -i prometheus prometheus-community/prometheus \
       --namespace prometheus \
       --set alertmanager.persistence.storageClass="gp2" \
       --set server.persistentVolume.storageClass="gp2"
   ```
   **Note**  
   If you get the error `Error: failed to download "stable/prometheus" (hint: running helm repo update may help)` when executing this command, run `helm repo update prometheus-community`, and then try running the Step 3 command again.

   If you get the error `Error: rendered manifests contain a resource that already exists`, run `helm uninstall your-release-name -n namespace`, and then try running the Step 3 command again.

1. Verify that all of the Pods in the `prometheus` namespace are in the `READY` state.

   ```
   kubectl get pods -n prometheus
   ```

   An example output is as follows.

   ```
   NAME                                             READY   STATUS    RESTARTS   AGE
   prometheus-alertmanager-59b4c8c744-r7bgp         1/2     Running   0          48s
   prometheus-kube-state-metrics-7cfd87cf99-jkz2f   1/1     Running   0          48s
   prometheus-node-exporter-jcjqz                   1/1     Running   0          48s
   prometheus-node-exporter-jxv2h                   1/1     Running   0          48s
   prometheus-node-exporter-vbdks                   1/1     Running   0          48s
   prometheus-pushgateway-76c444b68c-82tnw          1/1     Running   0          48s
   prometheus-server-775957f748-mmht9               1/2     Running   0          48s
   ```

1. Use `kubectl` to port forward the Prometheus console to your local machine.

   ```
   kubectl --namespace=prometheus port-forward deploy/prometheus-server 9090
   ```

1. Point a web browser to `http://localhost:9090` to view the Prometheus console.

1. Choose a metric from the **- insert metric at cursor** menu, then choose **Execute**. Choose the **Graph** tab to show the metric over time. The following image shows `container_memory_usage_bytes` over time.  
![Prometheus metrics](http://docs.aws.amazon.com/eks/latest/userguide/images/prometheus-metric.png)

1. From the top navigation bar, choose **Status**, then **Targets**.  
![Prometheus console](http://docs.aws.amazon.com/eks/latest/userguide/images/prometheus.png)

   All of the Kubernetes endpoints that are connected to Prometheus using service discovery are displayed.
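
The readiness check in step 4 can also be scripted. The following is a minimal sketch (not part of the EKS or Helm tooling) that flags Pods whose `READY` column shows fewer ready containers than total, given output shaped like that of `kubectl get pods -n prometheus`:

```python
# Sketch: flag Pods that are not fully ready, based on the READY column
# of `kubectl get pods` output (format: ready/total, e.g. "1/2").
def not_ready_pods(kubectl_output: str) -> list[str]:
    pods = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        name, ready = fields[0], fields[1]
        done, total = ready.split("/")
        if done != total:
            pods.append(name)
    return pods

# Two lines copied from the example output above.
sample = """\
NAME                                             READY   STATUS    RESTARTS   AGE
prometheus-server-775957f748-mmht9               1/2     Running   0          48s
prometheus-node-exporter-jcjqz                   1/1     Running   0          48s
"""
print(not_ready_pods(sample))  # ['prometheus-server-775957f748-mmht9']
```

Pods often show partial readiness briefly after deployment, so a script like this is most useful in a retry loop.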

# Fetch control plane raw metrics in Prometheus format
<a name="view-raw-metrics"></a>

The Kubernetes control plane exposes a number of metrics that are represented in a [Prometheus format](https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md). These metrics are useful for monitoring and analysis. They are exposed internally through metrics endpoints and can be accessed without fully deploying Prometheus. However, deploying Prometheus makes it easier to analyze metrics over time.

To view the raw metrics output, replace `endpoint` and run the following command.

```
kubectl get --raw endpoint
```

This command allows you to pass any endpoint path and returns the raw response. The output lists different metrics line-by-line, with each line including a metric name, tags, and a value.

```
metric_name{tag="value"[,...]} value
```
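
For ad-hoc analysis, each such line can be split into its metric name, label set, and value. The following Python sketch (not an official tool) handles the simple cases shown in this topic; it assumes label values contain no commas or escaped quotes:

```python
import re

# Matches: metric_name{label="value",...} value
# Simplification: label values must not contain commas or escaped quotes.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)

def parse_metric_line(line: str):
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized metric line: {line!r}")
    labels = {}
    if m.group("labels"):
        for pair in m.group("labels").split(","):
            key, value = pair.split("=", 1)
            labels[key] = value.strip('"')
    return m.group("name"), labels, float(m.group("value"))

print(parse_metric_line('rest_client_requests_total{code="200",method="GET"} 1.326086e+06'))
# ('rest_client_requests_total', {'code': '200', 'method': 'GET'}, 1326086.0)
```

For production use, prefer a maintained parser such as the one in the official Prometheus client libraries.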

## Fetch metrics from the API server
<a name="fetch-metrics"></a>

The general API server metrics endpoint is exposed on the Amazon EKS control plane. This endpoint is primarily useful when you are looking for a specific metric.

```
kubectl get --raw /metrics
```

An example output is as follows.

```
[...]
# HELP rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="127.0.0.1:21362",method="POST"} 4994
rest_client_requests_total{code="200",host="127.0.0.1:443",method="DELETE"} 1
rest_client_requests_total{code="200",host="127.0.0.1:443",method="GET"} 1.326086e+06
rest_client_requests_total{code="200",host="127.0.0.1:443",method="PUT"} 862173
rest_client_requests_total{code="404",host="127.0.0.1:443",method="GET"} 2
rest_client_requests_total{code="409",host="127.0.0.1:443",method="POST"} 3
rest_client_requests_total{code="409",host="127.0.0.1:443",method="PUT"} 8
# HELP ssh_tunnel_open_count Counter of ssh tunnel total open attempts
# TYPE ssh_tunnel_open_count counter
ssh_tunnel_open_count 0
# HELP ssh_tunnel_open_fail_count Counter of ssh tunnel failed open attempts
# TYPE ssh_tunnel_open_fail_count counter
ssh_tunnel_open_fail_count 0
```

This raw output returns verbatim what the API server exposes.
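
Counter samples such as `rest_client_requests_total` can then be aggregated offline. As an illustration (a sketch using a few of the sample values shown above), the samples can be summed by HTTP status code:

```python
import re
from collections import defaultdict

# A few rest_client_requests_total samples copied from the example output.
raw = """\
rest_client_requests_total{code="200",host="127.0.0.1:443",method="GET"} 1.326086e+06
rest_client_requests_total{code="200",host="127.0.0.1:443",method="PUT"} 862173
rest_client_requests_total{code="404",host="127.0.0.1:443",method="GET"} 2
rest_client_requests_total{code="409",host="127.0.0.1:443",method="POST"} 3
"""

# Sum the sample values, grouped by the code="..." label.
totals = defaultdict(float)
for m in re.finditer(r'rest_client_requests_total\{[^}]*code="(\d+)"[^}]*\}\s+(\S+)', raw):
    totals[m.group(1)] += float(m.group(2))

print(dict(totals))  # {'200': 2188259.0, '404': 2.0, '409': 3.0}
```

This is the same aggregation that the PromQL `sum by (code)` operator performs once the metrics are ingested into Prometheus.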

## Fetch control plane metrics with `metrics.eks.amazonaws.com`
<a name="fetch-metrics-prometheus"></a>

For clusters that are Kubernetes version `1.28` and above, Amazon EKS also exposes metrics under the API group `metrics.eks.amazonaws.com`. These metrics include control plane components such as `kube-scheduler` and `kube-controller-manager`.

**Note**  
If you have a webhook configuration that could block the creation of the new `APIService` resource `v1.metrics.eks.amazonaws.com` on your cluster, the metrics endpoint feature might not be available. You can verify that in the `kube-apiserver` audit log by searching for the `v1.metrics.eks.amazonaws.com` keyword.

### Fetch `kube-scheduler` metrics
<a name="fetch-metrics-scheduler"></a>

To retrieve `kube-scheduler` metrics, use the following command.

```
kubectl get --raw "/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics"
```

An example output is as follows.

```
# TYPE scheduler_pending_pods gauge
scheduler_pending_pods{queue="active"} 0
scheduler_pending_pods{queue="backoff"} 0
scheduler_pending_pods{queue="gated"} 0
scheduler_pending_pods{queue="unschedulable"} 18
# HELP scheduler_pod_scheduling_attempts [STABLE] Number of attempts to successfully schedule a pod.
# TYPE scheduler_pod_scheduling_attempts histogram
scheduler_pod_scheduling_attempts_bucket{le="1"} 79
scheduler_pod_scheduling_attempts_bucket{le="2"} 79
scheduler_pod_scheduling_attempts_bucket{le="4"} 79
scheduler_pod_scheduling_attempts_bucket{le="8"} 79
scheduler_pod_scheduling_attempts_bucket{le="16"} 79
scheduler_pod_scheduling_attempts_bucket{le="+Inf"} 81
[...]
```
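
Prometheus histogram buckets are cumulative, so the `scheduler_pod_scheduling_attempts` output above can be read directly. For example (a sketch using the bucket values shown), the number of Pods that needed more than one scheduling attempt is the `+Inf` count minus the `le="1"` count:

```python
# Bucket counts copied from the example output; each bucket is cumulative,
# so le="2" counts everything that also fell into le="1", and so on.
buckets = {1.0: 79, 2.0: 79, 4.0: 79, 8.0: 79, 16.0: 79, float("inf"): 81}

total_pods = buckets[float("inf")]   # the +Inf bucket holds every observation
first_attempt = buckets[1.0]         # scheduled with a single attempt
multi_attempt = total_pods - first_attempt
print(first_attempt, multi_attempt)  # 79 2
```

In PromQL, the same question is usually answered with `histogram_quantile` over the `_bucket` series.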

### Fetch `kube-controller-manager` metrics
<a name="fetch-metrics-controller"></a>

To retrieve `kube-controller-manager` metrics, use the following command.

```
kubectl get --raw "/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics"
```

An example output is as follows.

```
[...]
workqueue_work_duration_seconds_sum{name="pvprotection"} 0
workqueue_work_duration_seconds_count{name="pvprotection"} 0
workqueue_work_duration_seconds_bucket{name="replicaset",le="1e-08"} 0
workqueue_work_duration_seconds_bucket{name="replicaset",le="1e-07"} 0
workqueue_work_duration_seconds_bucket{name="replicaset",le="1e-06"} 0
workqueue_work_duration_seconds_bucket{name="replicaset",le="9.999999999999999e-06"} 0
workqueue_work_duration_seconds_bucket{name="replicaset",le="9.999999999999999e-05"} 19
workqueue_work_duration_seconds_bucket{name="replicaset",le="0.001"} 109
workqueue_work_duration_seconds_bucket{name="replicaset",le="0.01"} 139
workqueue_work_duration_seconds_bucket{name="replicaset",le="0.1"} 181
workqueue_work_duration_seconds_bucket{name="replicaset",le="1"} 191
workqueue_work_duration_seconds_bucket{name="replicaset",le="10"} 191
workqueue_work_duration_seconds_bucket{name="replicaset",le="+Inf"} 191
workqueue_work_duration_seconds_sum{name="replicaset"} 4.265655885000002
[...]
```
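
A histogram's `_sum` divided by its total count (the `+Inf` bucket) gives the mean observation. As an example, using the `replicaset` values from the output above:

```python
# Values copied from the example output above.
duration_sum = 4.265655885000002  # workqueue_work_duration_seconds_sum{name="replicaset"}
duration_count = 191              # the le="+Inf" bucket holds every observation

# Mean time spent processing one item from the replicaset workqueue.
mean_seconds = duration_sum / duration_count
print(f"{mean_seconds * 1000:.2f} ms per work item")  # 22.33 ms per work item
```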

### Understand the scheduler and controller manager metrics
<a name="scheduler-controller-metrics"></a>

The following table describes the scheduler and controller manager metrics that are made available for Prometheus style scraping. For more information about these metrics, see [Kubernetes Metrics Reference](https://kubernetes.io/docs/reference/instrumentation/metrics/) in the Kubernetes documentation.


| Metric | Control plane component | Description |
| --- | --- | --- |
| `scheduler_pending_pods` | scheduler | The number of Pods that are waiting to be scheduled onto a node for execution. |
| `scheduler_schedule_attempts_total` | scheduler | The number of attempts made to schedule Pods. |
| `scheduler_preemption_attempts_total` | scheduler | The number of attempts made by the scheduler to schedule higher priority Pods by evicting lower priority ones. |
| `scheduler_preemption_victims` | scheduler | The number of Pods that have been selected for eviction to make room for higher priority Pods. |
| `scheduler_pod_scheduling_attempts` | scheduler | The number of attempts to successfully schedule a Pod. |
| `scheduler_scheduling_attempt_duration_seconds` | scheduler | Indicates how quickly or slowly the scheduler is able to find a suitable place for a Pod to run based on various factors like resource availability and scheduling rules. |
| `scheduler_pod_scheduling_sli_duration_seconds` | scheduler | The end-to-end latency for a Pod being scheduled, from the time the Pod enters the scheduling queue. This might involve multiple scheduling attempts. |
| `cronjob_controller_job_creation_skew_duration_seconds` | controller manager | The time between when a cronjob is scheduled to be run, and when the corresponding job is created. |
| `workqueue_depth` | controller manager | The current depth of the queue. |
| `workqueue_adds_total` | controller manager | The total number of adds handled by the workqueue. |
| `workqueue_queue_duration_seconds` | controller manager | The time in seconds an item stays in the workqueue before being requested. |
| `workqueue_work_duration_seconds` | controller manager | The time in seconds that processing an item from the workqueue takes. |

## Deploy a Prometheus scraper to consistently scrape metrics
<a name="deploy-prometheus-scraper"></a>

To deploy a Prometheus scraper to consistently scrape the metrics, use the following configuration:

```
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-conf
data:
  prometheus.yml: |-
    global:
      scrape_interval: 30s
    scrape_configs:
    # apiserver metrics
    - job_name: apiserver-metrics
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels:
          [
            __meta_kubernetes_namespace,
            __meta_kubernetes_service_name,
            __meta_kubernetes_endpoint_port_name,
          ]
        action: keep
        regex: default;kubernetes;https
    # Scheduler metrics
    - job_name: 'ksh-metrics'
      kubernetes_sd_configs:
      - role: endpoints
      metrics_path: /apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels:
          [
            __meta_kubernetes_namespace,
            __meta_kubernetes_service_name,
            __meta_kubernetes_endpoint_port_name,
          ]
        action: keep
        regex: default;kubernetes;https
    # Controller Manager metrics
    - job_name: 'kcm-metrics'
      kubernetes_sd_configs:
      - role: endpoints
      metrics_path: /apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels:
          [
            __meta_kubernetes_namespace,
            __meta_kubernetes_service_name,
            __meta_kubernetes_endpoint_port_name,
          ]
        action: keep
        regex: default;kubernetes;https
---
apiVersion: v1
kind: Pod
metadata:
  name: prom-pod
spec:
  containers:
  - name: prom-container
    image: prom/prometheus
    ports:
    - containerPort: 9090
    volumeMounts:
    - name: config-volume
      mountPath: /etc/prometheus/
  volumes:
  - name: config-volume
    configMap:
      name: prometheus-conf
```

The following RBAC rule is required in the role used by the Pod for it to access the new metrics endpoint.

```
{
  "apiGroups": [
    "metrics.eks.amazonaws.com"
  ],
  "resources": [
    "kcm/metrics",
    "ksh/metrics"
  ],
  "verbs": [
    "get"
  ]
}
```

To patch the role being used, you can use the following command.

```
kubectl patch clusterrole <role-name> --type=json -p='[
  {
    "op": "add",
    "path": "/rules/-",
    "value": {
      "verbs": ["get"],
      "apiGroups": ["metrics.eks.amazonaws.com"],
      "resources": ["kcm/metrics", "ksh/metrics"]
    }
  }
]'
```

Then you can view the Prometheus dashboard by proxying the port of the Prometheus scraper to your local port.

```
kubectl port-forward pods/prom-pod 9090:9090
```

For your Amazon EKS cluster, the core Kubernetes control plane metrics are also ingested into Amazon CloudWatch under the `AWS/EKS` namespace. To view them, open the [CloudWatch console](https://console.aws.amazon.com/cloudwatch/) and choose **All metrics** from the left navigation pane. On the **Metrics** selection page, choose the `AWS/EKS` namespace and a metrics dimension for your cluster.