

# Monitor an Amazon MSK Provisioned cluster
<a name="monitoring"></a>

There are several ways that Amazon MSK helps you monitor the status of your Amazon MSK Provisioned cluster.
+ Amazon MSK gathers Apache Kafka metrics and sends them to Amazon CloudWatch where you can view them. For more information about Apache Kafka metrics, including the ones that Amazon MSK surfaces, see [Monitoring](http://kafka.apache.org/documentation/#monitoring) in the Apache Kafka documentation.
+ You can also monitor your MSK cluster with Prometheus, an open-source monitoring application. For information about Prometheus, see [Overview](https://prometheus.io/docs/introduction/overview/) in the Prometheus documentation. To learn how to monitor your MSK Provisioned cluster with Prometheus, see [Monitor an MSK Provisioned cluster with Prometheus](open-monitoring.md).
+ (Standard brokers only) Amazon MSK helps you monitor your disk storage capacity by automatically sending you storage capacity alerts when a Provisioned cluster is about to reach its storage capacity limit. The alerts also provide recommendations on the best steps to take to address detected issues. This helps you to identify and quickly resolve disk capacity issues before they become critical. Amazon MSK automatically sends these alerts to the [Amazon MSK console](https://console.aws.amazon.com/msk/home?region=us-east-1#/home/), Health Dashboard, Amazon EventBridge, and email contacts for your AWS account. For information about storage capacity alerts, see [Use Amazon MSK storage capacity alerts](cluster-alerts.md).

**Topics**
+ [View Amazon MSK metrics using CloudWatch](cloudwatch-metrics.md)
+ [Amazon MSK metrics for monitoring Standard brokers with CloudWatch](metrics-details.md)
+ [Amazon MSK metrics for monitoring Express brokers with CloudWatch](metrics-details-express.md)
+ [Monitor an MSK Provisioned cluster with Prometheus](open-monitoring.md)
+ [Monitor consumer lags](consumer-lag.md)
+ [Use Amazon MSK storage capacity alerts](cluster-alerts.md)

# View Amazon MSK metrics using CloudWatch
<a name="cloudwatch-metrics"></a>

You can monitor metrics for Amazon MSK using the CloudWatch console, the command line, or the CloudWatch API. The following procedures show you how to access metrics using these different methods. 

**To access metrics using the CloudWatch console**

1. Sign in to the AWS Management Console and open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**.

1. Choose the **All metrics** tab, and then choose **AWS/Kafka**.

1. To view topic-level metrics, choose **Topic, Broker ID, Cluster Name**; for broker-level metrics, choose **Broker ID, Cluster Name**; and for cluster-level metrics, choose **Cluster Name**.

1. (Optional) In the graph pane, select a statistic and a time period, and then create a CloudWatch alarm using these settings.

**To access metrics using the AWS CLI**  
Use the [list-metrics](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/list-metrics.html) and [get-metric-statistics](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/get-metric-statistics.html) commands.

**To access metrics using the CloudWatch CLI**  
Use the [mon-list-metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/cli-mon-list-metrics.html) and [mon-get-stats](https://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/cli-mon-get-stats.html) commands.

**To access metrics using the CloudWatch API**  
Use the [ListMetrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_ListMetrics.html) and [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) operations.
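As a rough illustration of the `GetMetricStatistics` route, the following Python sketch assembles a request for a metric in the `AWS/Kafka` namespace and, optionally, sends it with boto3. The dimension names follow the console groupings described above. The cluster name `demo-cluster`, the helper function names, the 5-minute period, and the `Average` statistic are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def msk_dimensions(cluster_name, broker_id=None, topic=None):
    """Build the dimension list for a cluster-, broker-, or topic-level metric."""
    dims = [{"Name": "Cluster Name", "Value": cluster_name}]
    if broker_id is not None:
        dims.append({"Name": "Broker ID", "Value": str(broker_id)})
    if topic is not None:
        dims.append({"Name": "Topic", "Value": topic})
    return dims


def build_request(metric_name, dimensions, minutes=60, period=300, now=None):
    """Assemble keyword arguments for CloudWatch GetMetricStatistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kafka",
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,           # seconds per datapoint (assumed value)
        "Statistics": ["Average"],  # assumed statistic
    }


def fetch_datapoints(request):
    """Send the request with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    response = boto3.client("cloudwatch").get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Broker-level request for a hypothetical cluster named "demo-cluster".
request = build_request("BytesInPerSec", msk_dimensions("demo-cluster", broker_id=1))
```

Passing `request` to `fetch_datapoints` would return the datapoints in time order. The request-building helpers are kept separate so they can be inspected without AWS access.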

# Amazon MSK metrics for monitoring Standard brokers with CloudWatch
<a name="metrics-details"></a>

Amazon MSK integrates with Amazon CloudWatch so that you can collect, view, and analyze CloudWatch metrics for your MSK Standard brokers. The metrics that you configure for your MSK Provisioned clusters are automatically collected and pushed to CloudWatch at 1-minute intervals. You can set the monitoring level for an MSK Provisioned cluster to one of the following: `DEFAULT`, `PER_BROKER`, `PER_TOPIC_PER_BROKER`, or `PER_TOPIC_PER_PARTITION`. The tables in the following sections show all the metrics that are available starting at each monitoring level.

**Note**  
The names of some Amazon MSK metrics for CloudWatch monitoring have changed in version 3.6.0 and higher. Use the new names for monitoring these metrics. For metrics with changed names, the table below shows the name used in version 3.6.0 and higher, followed by the name in version 2.8.2.tiered.

`DEFAULT`-level metrics are free. Pricing for other metrics is described in the [Amazon CloudWatch pricing](https://aws.amazon.com/cloudwatch/pricing/) page.
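The four monitoring levels are cumulative: each level adds metrics on top of those from the levels before it. The following Python sketch models that ordering. The enum, the helper names, and the `set_level` wrapper are illustrative assumptions. The wrapper uses the boto3 `kafka` client's `describe_cluster_v2` and `update_monitoring` calls and is not needed to run the rest of the example.

```python
from enum import IntEnum


class MonitoringLevel(IntEnum):
    """Monitoring levels in increasing order of detail."""

    DEFAULT = 0
    PER_BROKER = 1
    PER_TOPIC_PER_BROKER = 2
    PER_TOPIC_PER_PARTITION = 3


def levels_included(level):
    """Return the names of every level whose metrics are emitted at `level`."""
    return [member.name for member in MonitoringLevel if member <= level]


def is_free(level):
    """Only metrics introduced at the DEFAULT level are free."""
    return level == MonitoringLevel.DEFAULT


def set_level(cluster_arn, level):
    """Apply a monitoring level with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    kafka = boto3.client("kafka")
    info = kafka.describe_cluster_v2(ClusterArn=cluster_arn)["ClusterInfo"]
    return kafka.update_monitoring(
        ClusterArn=cluster_arn,
        CurrentVersion=info["CurrentVersion"],
        EnhancedMonitoring=level.name,
    )


print(levels_included(MonitoringLevel.PER_TOPIC_PER_BROKER))
```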

## `DEFAULT` Level monitoring
<a name="default-metrics"></a>

The metrics described in the following table are available at the `DEFAULT` monitoring level. They are free.


| Name | When visible | Dimensions | Description | 
| --- | --- | --- | --- | 
| ActiveControllerCount | After the cluster gets to the ACTIVE state. | Cluster Name | Only one controller per cluster should be active at any given time. | 
| BurstBalance | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The remaining balance of input-output burst credits for EBS volumes in the cluster. Use it to investigate latency or decreased throughput. `BurstBalance` is not reported for EBS volumes when the baseline performance of a volume is higher than the maximum burst performance. For more information, see [I/O Credits and burst performance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html#IOcredit). | 
| BytesInPerSec | After you create a topic. | Cluster Name, Broker ID, Topic | The number of bytes per second received from clients. This metric is available per broker and also per topic. | 
| BytesOutPerSec | After you create a topic. | Cluster Name, Broker ID, Topic | The number of bytes per second sent to clients. This metric is available per broker and also per topic. | 
| ClientConnectionCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID, Client Authentication | The number of active authenticated client connections. | 
| ConnectionCount | After the cluster gets to the ACTIVE state. |  Cluster Name, Broker ID  | The number of active authenticated, unauthenticated, and inter-broker connections.  | 
| CPUCreditBalance  |  After the cluster gets to the ACTIVE state.  |  Cluster Name, Broker ID  |  The number of earned CPU credits that a broker has accrued since it was launched. Credits are accrued in the credit balance after they are earned, and removed from the credit balance when they are spent. If you run out of the CPU credit balance, it can have a negative impact on your cluster's performance. You can take steps to reduce CPU load. For example, you can reduce the number of client requests or update the broker type to an M5 broker type.  | 
| CpuIdle | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU idle time. | 
| CpuIoWait | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU idle time during a pending disk operation. | 
| CpuSystem | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU in kernel space. | 
| CpuUser | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU in user space. | 
| GlobalPartitionCount | After the cluster gets to the ACTIVE state. | Cluster Name | The number of partitions across all topics in the cluster, excluding replicas. Because GlobalPartitionCount doesn't include replicas, the sum of the PartitionCount values can be higher than GlobalPartitionCount if the replication factor for a topic is greater than 1. | 
| GlobalTopicCount | After the cluster gets to the ACTIVE state. | Cluster Name | Total number of topics across all brokers in the cluster. | 
| EstimatedMaxTimeLag\* | After a consumer group consumes from a topic. | Cluster Name, Consumer Group, Topic | Time estimate (in seconds) to drain MaxOffsetLag. | 
| KafkaAppLogsDiskUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of disk space used for application logs. | 
| KafkaDataLogsDiskUsed (Cluster Name, Broker ID dimension) | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of disk space used for data logs. | 
| LeaderCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The total number of leaders of partitions per broker, not including replicas. | 
| MaxOffsetLag\* | After a consumer group consumes from a topic. | Cluster Name, Consumer Group, Topic | The maximum offset lag across all partitions in a topic. | 
| MemoryBuffered | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of buffered memory for the broker. | 
| MemoryCached | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of cached memory for the broker. | 
| MemoryFree | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of memory that is free and available for the broker. | 
| HeapMemoryAfterGC  |  After the cluster gets to the ACTIVE state.  |  Cluster Name, Broker ID  | The percentage of total heap memory in use after garbage collection. | 
| MemoryUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of memory that is in use for the broker. | 
| MessagesInPerSec | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of incoming messages per second for the broker. | 
| NetworkRxDropped | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of dropped receive packets. | 
| NetworkRxErrors | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of network receive errors for the broker. | 
| NetworkRxPackets | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of packets received by the broker. | 
| NetworkTxDropped | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of dropped transmit packets. | 
| NetworkTxErrors | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of network transmit errors for the broker. | 
| NetworkTxPackets | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of packets transmitted by the broker. | 
| OfflinePartitionsCount | After the cluster gets to the ACTIVE state. | Cluster Name | Total number of partitions that are offline in the cluster. | 
| PartitionCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The total number of topic partitions per broker, including replicas. | 
| ProduceTotalTimeMsMean | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The mean produce time in milliseconds. | 
| RequestBytesMean | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The mean number of request bytes for the broker. | 
| RequestTime | After request throttling is applied. | Cluster Name, Broker ID | The average time in milliseconds spent in broker network and I/O threads to process requests. | 
| RootDiskUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of the root disk used by the broker. | 
| RollingEstimatedTimeLagMax\* | After a consumer group consumes from a topic. | Cluster Name, Consumer Group, Topic | Rolling maximum time estimate (in seconds) to drain the partition offset lag across all partitions in a topic. | 
| SumOffsetLag\* | After a consumer group consumes from a topic. | Cluster Name, Consumer Group, Topic | The aggregated offset lag for all the partitions in a topic. | 
| SwapFree | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of swap memory that is available for the broker. | 
| SwapUsed  | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of swap memory that is in use for the broker. | 
| TrafficShaping | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | High-level metrics indicating the number of packets shaped (dropped or queued) due to exceeding network allocations. Finer detail is available with `PER_BROKER` metrics. | 
| UnderMinIsrPartitionCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of under minIsr partitions for the broker. | 
| UnderReplicatedPartitions | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of under-replicated partitions for the broker. | 
| UserPartitionExists | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | A Boolean metric that indicates the presence of a user-owned partition on a broker. A value of 1 indicates the presence of partitions on the broker. | 
| ZooKeeperRequestLatencyMsMean | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | For ZooKeeper-based clusters. The mean latency in milliseconds for Apache ZooKeeper requests from the broker. | 
| ZooKeeperSessionState | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | For ZooKeeper-based clusters. The connection status of the broker's ZooKeeper session, which can be one of the following: NOT\_CONNECTED: '0.0', ASSOCIATING: '0.1', CONNECTING: '0.5', CONNECTEDREADONLY: '0.8', CONNECTED: '1.0', CLOSED: '5.0', AUTH\_FAILED: '10.0'. | 

\* Consumer lag metrics require ASCII-only consumer group names and have specific emission requirements. For more information, see [Monitor consumer lags](consumer-lag.md).
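The `ZooKeeperSessionState` row in the table above encodes each session state as a number. The following Python sketch turns a datapoint back into its state name. The mapping is transcribed from that row, while the function name, the tolerance, and the `UNKNOWN` fallback are assumptions for illustration. An aggregated datapoint (for example, an average over a period in which the state changed) can fall between codes, in which case the sketch returns `UNKNOWN`.

```python
# Numeric codes transcribed from the ZooKeeperSessionState description.
ZOOKEEPER_SESSION_STATES = {
    0.0: "NOT_CONNECTED",
    0.1: "ASSOCIATING",
    0.5: "CONNECTING",
    0.8: "CONNECTEDREADONLY",
    1.0: "CONNECTED",
    5.0: "CLOSED",
    10.0: "AUTH_FAILED",
}


def decode_session_state(value, tolerance=1e-6):
    """Map a metric value to its state name, or "UNKNOWN" if no code matches."""
    for code, name in ZOOKEEPER_SESSION_STATES.items():
        if abs(value - code) <= tolerance:
            return name
    return "UNKNOWN"


print(decode_session_state(1.0))
```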

## `PER_BROKER` Level monitoring
<a name="broker-metrics"></a>

When you set the monitoring level to `PER_BROKER`, you get the metrics described in the following table in addition to all the `DEFAULT` level metrics. You pay for the metrics in the following table, whereas the `DEFAULT` level metrics continue to be free. The metrics in this table have the following dimensions: Cluster Name, Broker ID.


| Name | When visible | Description | 
| --- | --- | --- | 
| BwInAllowanceExceeded | After the cluster gets to the ACTIVE state. |  The number of packets shaped because the inbound aggregate bandwidth exceeded the maximum for the broker.  | 
| BwOutAllowanceExceeded | After the cluster gets to the ACTIVE state. |  The number of packets shaped because the outbound aggregate bandwidth exceeded the maximum for the broker.  | 
| ConntrackAllowanceExceeded  | After the cluster gets to the ACTIVE state. |  The number of packets shaped because the connection tracking exceeded the maximum for the broker. Connection tracking is related to security groups that track each connection established to ensure that return packets are delivered as expected.   | 
| ConnectionCloseRate | After the cluster gets to the ACTIVE state. |  The number of connections closed per second per listener. This number is aggregated per listener and filtered for the client listeners.  | 
| ConnectionCreationRate | After the cluster gets to the ACTIVE state. |  The number of new connections established per second per listener. This number is aggregated per listener and filtered for the client listeners.  | 
| CpuCreditUsage | After the cluster gets to the ACTIVE state. |  The number of CPU credits spent by the broker. If you run out of the CPU credit balance, it can have a negative impact on your cluster's performance. You can take steps to reduce CPU load. For example, you can reduce the number of client requests or update the broker type to an M5 broker type.  | 
| FetchConsumerLocalTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the consumer request is processed at the leader. | 
| FetchConsumerRequestQueueTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the consumer request waits in the request queue. | 
| FetchConsumerResponseQueueTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the consumer request waits in the response queue. | 
| FetchConsumerResponseSendTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds for the consumer to send a response. | 
| FetchConsumerTotalTimeMsMean | After there's a producer/consumer. | The mean total time in milliseconds that consumers spend on fetching data from the broker. | 
| FetchFollowerLocalTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the follower request is processed at the leader. | 
| FetchFollowerRequestQueueTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the follower request waits in the request queue. | 
| FetchFollowerResponseQueueTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the follower request waits in the response queue. | 
| FetchFollowerResponseSendTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds for the follower to send a response. | 
| FetchFollowerTotalTimeMsMean | After there's a producer/consumer. | The mean total time in milliseconds that followers spend on fetching data from the broker. | 
| FetchMessageConversionsPerSec | After you create a topic. | The number of fetch message conversions per second for the broker. | 
| FetchThrottleByteRate | After bandwidth throttling is applied. | The number of throttled bytes per second. | 
| FetchThrottleQueueSize | After bandwidth throttling is applied. | The number of messages in the throttle queue. | 
| FetchThrottleTime | After bandwidth throttling is applied. | The average fetch throttle time in milliseconds. | 
| IAMNumberOfConnectionRequests | After the cluster gets to the ACTIVE state. | The number of IAM authentication requests per second. | 
| IAMTooManyConnections | After the cluster gets to the ACTIVE state. | The number of connections attempted beyond 100. A value of 0 means that the number of connections is within the limit. If the value is greater than 0, the throttle limit is being exceeded and you need to reduce the number of connections. | 
| LinklocalAllowanceExceeded  | After the cluster gets to the ACTIVE state. |  The number of packets dropped because the PPS of the traffic to local proxy services exceeded the maximum for the network interface. This impacts traffic to the DNS service, the Instance Metadata Service, and the Amazon Time Sync Service.  | 
| NetworkProcessorAvgIdlePercent | After the cluster gets to the ACTIVE state. | The average percentage of the time the network processors are idle. | 
| PpsAllowanceExceeded | After the cluster gets to the ACTIVE state. |  The number of packets shaped because the bidirectional PPS exceeded the maximum for the broker.  | 
| ProduceLocalTimeMsMean | After the cluster gets to the ACTIVE state. | The mean time in milliseconds that the request is processed at the leader. | 
| ProduceMessageConversionsPerSec | After you create a topic. | The number of produce message conversions per second for the broker. | 
| ProduceMessageConversionsTimeMsMean | After the cluster gets to the ACTIVE state. | The mean time in milliseconds spent on message format conversions. | 
| ProduceRequestQueueTimeMsMean | After the cluster gets to the ACTIVE state. | The mean time in milliseconds that request messages spend in the queue. | 
| ProduceResponseQueueTimeMsMean | After the cluster gets to the ACTIVE state. | The mean time in milliseconds that response messages spend in the queue. | 
| ProduceResponseSendTimeMsMean | After the cluster gets to the ACTIVE state. | The mean time in milliseconds spent on sending response messages. | 
| ProduceThrottleByteRate | After bandwidth throttling is applied. | The number of throttled bytes per second. | 
| ProduceThrottleQueueSize | After bandwidth throttling is applied. | The number of messages in the throttle queue. | 
| ProduceThrottleTime | After bandwidth throttling is applied. | The average produce throttle time in milliseconds. | 
| ProduceTotalTimeMsMean | After the cluster gets to the ACTIVE state. | The mean produce time in milliseconds. | 
| RemoteFetchBytesPerSec (RemoteBytesInPerSec in v2.8.2.tiered) |  After there's a producer/consumer.  |  The total number of bytes transferred from tiered storage in response to consumer fetches. This metric includes all topic-partitions that contribute to downstream data transfer traffic. Category: Traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric.  | 
| RemoteCopyBytesPerSec (RemoteBytesOutPerSec in v2.8.2.tiered) |  After there’s a producer/consumer.  |  The total number of bytes transferred to tiered storage, including data from log segments, indexes, and other auxiliary files. This metric includes all topic-partitions that contribute to upstream data transfer traffic. Category: Traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric.  | 
| RemoteLogManagerTasksAvgIdlePercent |  After the cluster gets to the ACTIVE state.  | The average percentage of time the remote log manager spent idle. The remote log manager transfers data from the broker to tiered storage. Category: Internal activity. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric. | 
| RemoteLogReaderAvgIdlePercent |  After the cluster gets to the ACTIVE state.  | The average percentage of time the remote log reader spent idle. The remote log reader transfers data from the remote storage to the broker in response to consumer fetches. Category: Internal activity. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric. | 
| RemoteLogReaderTaskQueueSize |  After the cluster gets to the ACTIVE state.  | The number of tasks responsible for reads from tiered storage that are waiting to be scheduled. Category: Internal activity. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric. | 
| RemoteFetchErrorsPerSec (RemoteReadErrorPerSec in v2.8.2.tiered) |  After the cluster gets to the ACTIVE state.  | The total rate of errors in response to read requests that the specified broker sent to tiered storage to retrieve data in response to consumer fetches. This metric includes all topic partitions that contribute to downstream data transfer traffic. Category: traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric. | 
| RemoteFetchRequestsPerSec (RemoteReadRequestsPerSec in v2.8.2.tiered) |  After the cluster gets to the ACTIVE state.  | The total number of read requests that the specified broker sent to tiered storage to retrieve data in response to consumer fetches. This metric includes all topic partitions that contribute to downstream data transfer traffic. Category: traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric. | 
| RemoteCopyErrorsPerSec (RemoteWriteErrorPerSec in v2.8.2.tiered) |  After the cluster gets to the ACTIVE state.  | The total rate of errors in response to write requests that the specified broker sent to tiered storage to transfer data upstream. This metric includes all topic partitions that contribute to upstream data transfer traffic. Category: traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric. | 
| RemoteLogSizeBytes | After the cluster gets to the ACTIVE state. |  The number of bytes stored on the remote tier. This metric is available for tiered storage clusters from Apache Kafka version 3.7.x on Amazon MSK.  | 
| ReplicationBytesInPerSec | After you create a topic. | The number of bytes per second received from other brokers. | 
| ReplicationBytesOutPerSec | After you create a topic. | The number of bytes per second sent to other brokers. | 
| RequestExemptFromThrottleTime | After request throttling is applied. | The average time in milliseconds spent in broker network and I/O threads to process requests that are exempt from throttling. | 
| RequestHandlerAvgIdlePercent | After the cluster gets to the ACTIVE state. | The average percentage of the time the request handler threads are idle. | 
| RequestThrottleQueueSize | After request throttling is applied. | The number of messages in the throttle queue. | 
| RequestThrottleTime | After request throttling is applied. | The average request throttle time in milliseconds. | 
| TcpConnections | After the cluster gets to the ACTIVE state. |  The number of incoming and outgoing TCP segments with the SYN flag set.  | 
| RemoteCopyLagBytes (TotalTierBytesLag in v2.8.2.tiered) | After you create a topic. | The total number of bytes of data that is eligible for tiering on the broker but has not yet been transferred to tiered storage. This metric shows the efficiency of upstream data transfer. As the lag increases, the amount of data that doesn't persist in tiered storage increases. Category: Archive lag. This is not a KIP-405 metric. | 
| TrafficBytes | After the cluster gets to the ACTIVE state. |  Shows network traffic in overall bytes between clients (producers and consumers) and brokers. Traffic between brokers isn't reported.  | 
| VolumeQueueLength | After the cluster gets to the ACTIVE state. |  The number of read and write operation requests waiting to be completed in a specified time period.  | 
|  VolumeReadBytes  | After the cluster gets to the ACTIVE state. |  The number of bytes read in a specified time period.  | 
| VolumeReadOps  | After the cluster gets to the ACTIVE state. |  The number of read operations in a specified time period.  | 
| VolumeTotalReadTime  | After the cluster gets to the ACTIVE state. |  The total number of seconds spent by all read operations that completed in a specified time period.  | 
| VolumeTotalWriteTime  | After the cluster gets to the ACTIVE state. |  The total number of seconds spent by all write operations that completed in a specified time period.  | 
| VolumeWriteBytes  | After the cluster gets to the ACTIVE state. |  The number of bytes written in a specified time period.  | 
| VolumeWriteOps  | After the cluster gets to the ACTIVE state. |  The number of write operations in a specified time period.  | 

## `PER_TOPIC_PER_BROKER` Level monitoring
<a name="broker-topic-metrics"></a>

When you set the monitoring level to `PER_TOPIC_PER_BROKER`, you get the metrics described in the following table, in addition to all the metrics from the `PER_BROKER` and `DEFAULT` levels. Only the `DEFAULT` level metrics are free. The metrics in this table have the following dimensions: Cluster Name, Broker ID, Topic.

**Important**  
For an Amazon MSK cluster that uses Apache Kafka 2.4.1 or a newer version, the metrics in the following table appear only after their values become nonzero for the first time. For example, to see `BytesInPerSec`, one or more producers must first send data to the cluster. 


| Name | When visible | Description | 
| --- | --- | --- | 
| FetchMessageConversionsPerSec | After you create a topic. | The number of fetched messages converted per second. | 
| MessagesInPerSec | After you create a topic. | The number of messages received per second. | 
| ProduceMessageConversionsPerSec | After you create a topic. | The number of conversions per second for produced messages. | 
| RemoteFetchBytesPerSec (RemoteBytesInPerSec in v2.8.2.tiered) |  After you create a topic and the topic is producing/consuming.  |  The number of bytes transferred from tiered storage in response to consumer fetches for the specified topic and broker. This metric includes all partitions from the topic that contribute to downstream data transfer traffic on the specified broker. Category: traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric.  | 
| RemoteCopyBytesPerSec (RemoteBytesOutPerSec in v2.8.2.tiered) | After you create a topic and the topic is producing/consuming. |  The number of bytes transferred to tiered storage, for the specified topic and broker. This metric includes all partitions from the topic that contribute to upstream data transfer traffic on the specified broker. Category: traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric.  | 
| RemoteFetchErrorsPerSec (RemoteReadErrorPerSec in v2.8.2.tiered) | After you create a topic and the topic is producing/consuming. | The rate of errors in response to read requests that the specified broker sends to tiered storage to retrieve data in response to consumer fetches on the specified topic. This metric includes all partitions from the topic that contribute to downstream data transfer traffic on the specified broker. Category: traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric. | 
| RemoteFetchRequestsPerSec (RemoteReadRequestsPerSec in v2.8.2.tiered) | After you create a topic and the topic is producing/consuming. | The number of read requests that the specified broker sends to tiered storage to retrieve data in response to consumer fetches on the specified topic. This metric includes all partitions from the topic that contribute to downstream data transfer traffic on the specified broker. Category: traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric. | 
| RemoteCopyErrorsPerSec (RemoteWriteErrorPerSec in v2.8.2.tiered) | After you create a topic and the topic is producing/consuming. | The rate of errors in response to write requests that the specified broker sends to tiered storage to transfer data upstream. This metric includes all partitions from the topic that contribute to upstream data transfer traffic on the specified broker. Category: traffic and error rates. This is a [KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage) metric. | 
| RemoteLogSizeBytes | After you create a topic. |  The number of bytes stored on the remote tier. This metric is available for tiered storage clusters from Apache Kafka version 3.7.x on Amazon MSK.  | 

## `PER_TOPIC_PER_PARTITION` Level monitoring
<a name="topic-partition-metrics"></a>

When you set the monitoring level to `PER_TOPIC_PER_PARTITION`, you get the metrics described in the following table, in addition to all the metrics from the `PER_TOPIC_PER_BROKER`, `PER_BROKER`, and `DEFAULT` levels. Only the `DEFAULT` level metrics are free. The metrics in this table have the following dimensions: Consumer Group, Topic, Partition.


| Name | When visible | Description | 
| --- | --- | --- | 
| EstimatedTimeLag\* | After a consumer group consumes from a topic. | Time estimate (in seconds) to drain the partition offset lag. | 
| OffsetLag\* | After a consumer group consumes from a topic. | Partition-level consumer lag in number of offsets. | 
| RollingEstimatedTimeLag\* | After a consumer group consumes from a topic. | Rolling time estimate (in seconds) to drain the partition offset lag. | 

\* Consumer lag metrics require ASCII-only consumer group names and have specific emission requirements. For more information, see [Monitor consumer lags](consumer-lag.md).
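Because consumer lag metrics require ASCII-only consumer group names, it can help to screen group names before you rely on these metrics. The following Python sketch shows one way to do that. The function name and the sample group names are hypothetical, and the check covers only the ASCII requirement, not the other emission requirements.

```python
def non_ascii_group_names(group_names):
    """Return the consumer group names that contain non-ASCII characters."""
    return [name for name in group_names if not name.isascii()]


# Hypothetical group names used only for illustration.
groups = ["orders-service", "billing_v2", "commandes-café"]
print(non_ascii_group_names(groups))
```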

# Understand MSK Provisioned cluster states
<a name="msk-cluster-states"></a>

The following table shows the possible states of an MSK Provisioned cluster and describes what they mean. Unless otherwise specified, MSK Provisioned cluster states apply to both Standard and Express broker types. This table also describes what actions you can and cannot perform when an MSK Provisioned cluster is in one of these states. To find out the state of a cluster, you can visit the AWS Management Console. You can also use the [describe-cluster-v2](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/kafka/describe-cluster-v2.html) command or the [DescribeClusterV2](https://docs.aws.amazon.com/MSK/2.0/APIReference/v2-clusters-clusterarn.html#DescribeClusterV2) operation to describe the Provisioned cluster. The description of a cluster includes its state.



| MSK Provisioned cluster state | Meaning and possible actions | 
| --- | --- | 
| ACTIVE |  You can produce and consume data. You can also perform Amazon MSK API and AWS CLI operations on the cluster.  | 
| CREATING |  Amazon MSK is setting up the Provisioned cluster. You must wait for the cluster to reach the ACTIVE state before you can use it to produce or consume data or to perform Amazon MSK API or AWS CLI operations on it.  | 
| DELETING | The Provisioned cluster is being deleted. You cannot use it to produce or consume data. You also cannot perform Amazon MSK API or AWS CLI operations on it. | 
| FAILED | The Provisioned cluster creation or deletion process failed. You cannot use the cluster to produce or consume data. You can delete the cluster but cannot perform Amazon MSK API or AWS CLI update operations on it. | 
| HEALING |  Amazon MSK is running an internal operation, like replacing an unhealthy broker. For example, the broker might be unresponsive. You can still use the Provisioned cluster to produce and consume data. However, you cannot perform Amazon MSK API or AWS CLI update operations on the cluster until it returns to the ACTIVE state.  | 
| MAINTENANCE | (Standard brokers only) Amazon MSK is performing routine maintenance operations on the cluster. Such maintenance operations include security patching. You can still use the cluster to produce and consume data. However, you cannot perform Amazon MSK API or AWS CLI update operations on the cluster until it returns to the ACTIVE state. The cluster state remains ACTIVE during maintenance on Express brokers. See [Patching on MSK Provisioned clusters](patching-impact.md). | 
| REBOOTING_BROKER | Amazon MSK is rebooting a broker. You can still use the Provisioned cluster to produce and consume data. However, you cannot perform Amazon MSK API or AWS CLI update operations on the cluster until it returns to the ACTIVE state. | 
| UPDATING | A user-initiated Amazon MSK API or AWS CLI operation is updating the Provisioned cluster. You can still use the Provisioned cluster to produce and consume data. However, you cannot perform any additional Amazon MSK API or AWS CLI update operations on the cluster until it returns to the ACTIVE state. | 
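
The rules in the preceding table can be encoded as a small lookup, which is handy when a script must decide whether to wait before it calls an update operation. The following is a minimal sketch, not an official client: `STATE_RULES` and the helper names are illustrative, and `get_cluster_state` assumes that boto3 is installed, that AWS credentials are configured, and that the `DescribeClusterV2` response exposes the state under `ClusterInfo.State`.

```python
# Sketch: interpret an MSK Provisioned cluster state using the table above.
# STATE_RULES and the helper names are illustrative, not part of any AWS SDK.

# state -> (can produce and consume data, can run update operations)
STATE_RULES = {
    "ACTIVE": (True, True),
    "CREATING": (False, False),
    "DELETING": (False, False),
    "FAILED": (False, False),  # the cluster can still be deleted
    "HEALING": (True, False),
    "MAINTENANCE": (True, False),  # Standard brokers only
    "REBOOTING_BROKER": (True, False),
    "UPDATING": (True, False),
}


def can_produce_and_consume(state: str) -> bool:
    """Return True if clients can produce and consume data in this state."""
    return STATE_RULES[state][0]


def can_update(state: str) -> bool:
    """Return True if Amazon MSK API or AWS CLI update operations are allowed."""
    return STATE_RULES[state][1]


def get_cluster_state(cluster_arn: str) -> str:
    """Fetch the current state with DescribeClusterV2 (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above work without the SDK

    response = boto3.client("kafka").describe_cluster_v2(ClusterArn=cluster_arn)
    return response["ClusterInfo"]["State"]
```

For example, a deployment script might poll `get_cluster_state` and proceed only when `can_update` returns `True`.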

# Amazon MSK metrics for monitoring Express brokers with CloudWatch
<a name="metrics-details-express"></a>

Amazon MSK integrates with CloudWatch so that you can collect, view, and analyze CloudWatch metrics for your MSK Express brokers. The metrics that you configure for your MSK Provisioned clusters are automatically collected and pushed to CloudWatch at 1-minute intervals. You can set the monitoring level for an MSK Provisioned cluster to one of the following: `DEFAULT`, `PER_BROKER`, `PER_TOPIC_PER_BROKER`, or `PER_TOPIC_PER_PARTITION`. The tables in the following sections show the metrics that are available starting at each monitoring level.

`DEFAULT`-level metrics are free. Pricing for other metrics is described in the [Amazon CloudWatch pricing](https://aws.amazon.com/cloudwatch/pricing/) page.
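
Because each monitoring level adds metrics on top of the levels below it, the relationship can be sketched as an ordered list. The helper names here are hypothetical and exist only to illustrate which metric tables apply, and which are billed, at a given level.

```python
# Sketch: monitoring levels are cumulative, from least to most detailed.
# The function names are illustrative, not part of any AWS SDK.
MONITORING_LEVELS = [
    "DEFAULT",
    "PER_BROKER",
    "PER_TOPIC_PER_BROKER",
    "PER_TOPIC_PER_PARTITION",
]


def levels_included(level: str) -> list:
    """Return every level whose metrics you receive at the given level."""
    index = MONITORING_LEVELS.index(level)  # raises ValueError for unknown levels
    return MONITORING_LEVELS[: index + 1]


def billed_levels(level: str) -> list:
    """Return the included levels that incur CloudWatch charges (all but DEFAULT)."""
    return [name for name in levels_included(level) if name != "DEFAULT"]
```

For instance, `levels_included("PER_TOPIC_PER_BROKER")` lists the `DEFAULT`, `PER_BROKER`, and `PER_TOPIC_PER_BROKER` tables, and only the last two are billed.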

## `DEFAULT` Level monitoring for Express brokers
<a name="express-default-metrics"></a>

The metrics described in the following table are available free of cost at the `DEFAULT` monitoring level.


| Name | When visible | Dimensions | Description | 
| --- | --- | --- | --- | 
| ActiveControllerCount | After the cluster gets to the ACTIVE state. | Cluster Name | Only one controller per cluster should be active at any given time. | 
| BytesInPerSec | After you create a topic. | Cluster Name, Broker ID, Topic | The number of bytes per second received from clients. This metric is available per broker and also per topic. | 
| BytesOutPerSec | After you create a topic. | Cluster Name, Broker ID, Topic | The number of bytes per second sent to clients. This metric is available per broker and also per topic. | 
| ClientConnectionCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID, Client Authentication | The number of active authenticated client connections. | 
| ConnectionCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of active authenticated, unauthenticated, and inter-broker connections. | 
| CpuIdle | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU idle time. | 
| CpuSystem | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU in kernel space. | 
| CpuUser | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The percentage of CPU in user space. | 
| GlobalPartitionCount | After the cluster gets to the ACTIVE state. | Cluster Name | The number of partitions across all topics in the cluster, excluding replicas. Because `GlobalPartitionCount` doesn't include replicas, the sum of the `PartitionCount` values can be higher than `GlobalPartitionCount` if the replication factor for a topic is greater than `1`. | 
| GlobalTopicCount | After the cluster gets to the ACTIVE state. | Cluster Name | Total number of topics across all brokers in the cluster. | 
| EstimatedMaxTimeLag\* | After consumer group consumes from a topic. | Consumer Group, Topic | Time estimate (in seconds) to drain `MaxOffsetLag`. | 
| LeaderCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The total number of leaders of partitions per broker, not including replicas. | 
| MaxOffsetLag\* | After consumer group consumes from a topic. | Consumer Group, Topic | The maximum offset lag across all partitions in a topic. | 
| MemoryBuffered | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of buffered memory for the broker. | 
| MemoryCached | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of cached memory for the broker. | 
| MemoryFree | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of memory that is free and available for the broker. | 
| MemoryUsed | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The size in bytes of memory that is in use for the broker. | 
| MessagesInPerSec | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of incoming messages per second for the broker. | 
| NetworkRxDropped | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of dropped receive packets. | 
| NetworkRxErrors | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of network receive errors for the broker. | 
| NetworkRxPackets | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of packets received by the broker. | 
| NetworkTxDropped | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of dropped transmit packets. | 
| NetworkTxErrors | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of network transmit errors for the broker. | 
| NetworkTxPackets | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The number of packets transmitted by the broker. | 
| PartitionCount | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The total number of topic partitions per broker, including replicas. | 
| ProduceTotalTimeMsMean | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The mean produce time in milliseconds. | 
| RequestBytesMean | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | The mean number of request bytes for the broker. | 
| RequestTime | After request throttling is applied. | Cluster Name, Broker ID | The average time in milliseconds spent in broker network and I/O threads to process requests. | 
| RollingEstimatedTimeLagMax\* | After consumer group consumes from a topic. | Consumer Group, Topic | Rolling maximum time estimate (in seconds) to drain the partition offset lag across all partitions in a topic. | 
| StorageUsed | After the cluster gets to the ACTIVE state. | Cluster Name | The total storage used across all partitions in the cluster, excluding replicas. | 
| SumOffsetLag\* | After consumer group consumes from a topic. | Consumer Group, Topic | The aggregated offset lag for all the partitions in a topic. | 
| UserPartitionExists | After the cluster gets to the ACTIVE state. | Cluster Name, Broker ID | Boolean metric that indicates the presence of a user-owned partition on a broker. A value of 1 indicates the presence of partitions on the broker. | 

\* Consumer lag metrics require ASCII-only consumer group names and have specific emission requirements. For more information, see [Monitor consumer lags](consumer-lag.md).

## `PER_BROKER` Level monitoring for Express brokers
<a name="express-per-broker-metrics"></a>

When you set the monitoring level to `PER_BROKER`, you get the metrics described in the following table in addition to all the `DEFAULT` level metrics. You pay for the metrics in the following table, whereas the `DEFAULT` level metrics continue to be free of charge. The metrics in this table have the following dimensions: Cluster Name, Broker ID.


| Name | When visible | Description | 
| --- | --- | --- | 
| ConnectionCloseRate | After the cluster gets to the ACTIVE state. | The number of connections closed per second per listener. This number is aggregated per listener and filtered for the client listeners. | 
| ConnectionCreationRate | After the cluster gets to the ACTIVE state. | The number of new connections established per second per listener. This number is aggregated per listener and filtered for the client listeners. | 
| FetchConsumerLocalTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the consumer request is processed at the leader. | 
| FetchConsumerRequestQueueTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the consumer request waits in the request queue. | 
| FetchConsumerResponseQueueTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the consumer request waits in the response queue. | 
| FetchConsumerResponseSendTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds for the consumer to send a response. | 
| FetchConsumerTotalTimeMsMean | After there's a producer/consumer. | The mean total time in milliseconds that consumers spend on fetching data from the broker. | 
| FetchFollowerLocalTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the follower request is processed at the leader. | 
| FetchFollowerRequestQueueTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the follower request waits in the request queue. | 
| FetchFollowerResponseQueueTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds that the follower request waits in the response queue. | 
| FetchFollowerResponseSendTimeMsMean | After there's a producer/consumer. | The mean time in milliseconds for the follower to send a response. | 
| FetchFollowerTotalTimeMsMean | After there's a producer/consumer. | The mean total time in milliseconds that followers spend on fetching data from the broker. | 
| FetchThrottleByteRate | After bandwidth throttling is applied. | The number of throttled bytes per second. | 
| FetchThrottleQueueSize | After bandwidth throttling is applied. | The number of messages in the throttle queue. | 
| FetchThrottleTime | After bandwidth throttling is applied. | The average fetch throttle time in milliseconds. | 
| IAMNumberOfConnectionRequests | After the cluster gets to the ACTIVE state. | The number of IAM authentication requests per second. | 
| IAMTooManyConnections | After the cluster gets to the ACTIVE state. | The number of connections attempted beyond 100. `0` means the number of connections is within the limit. If `>0`, the throttle limit is being exceeded and you need to reduce the number of connections. | 
| NetworkProcessorAvgIdlePercent | After the cluster gets to the ACTIVE state. | The average percentage of the time the network processors are idle. | 
| ProduceLocalTimeMsMean | After the cluster gets to the ACTIVE state. | The mean time in milliseconds that the request is processed at the leader. | 
| ProduceRequestQueueTimeMsMean | After the cluster gets to the ACTIVE state. | The mean time in milliseconds that request messages spend in the queue. | 
| ProduceResponseQueueTimeMsMean | After the cluster gets to the ACTIVE state. | The mean time in milliseconds that response messages spend in the queue. | 
| ProduceResponseSendTimeMsMean | After the cluster gets to the ACTIVE state. | The mean time in milliseconds spent on sending response messages. | 
| ProduceThrottleByteRate | After bandwidth throttling is applied. | The number of throttled bytes per second. | 
| ProduceThrottleQueueSize | After bandwidth throttling is applied. | The number of messages in the throttle queue. | 
| ProduceThrottleTime | After bandwidth throttling is applied. | The average produce throttle time in milliseconds. | 
| ProduceTotalTimeMsMean | After the cluster gets to the ACTIVE state. | The mean produce time in milliseconds. | 
| ReplicationBytesInPerSec | After you create a topic. | The number of bytes per second received from other brokers. | 
| ReplicationBytesOutPerSec | After you create a topic. | The number of bytes per second sent to other brokers. | 
| RequestExemptFromThrottleTime | After request throttling is applied. | The average time in milliseconds spent in broker network and I/O threads to process requests that are exempt from throttling. | 
| RequestHandlerAvgIdlePercent | After the cluster gets to the ACTIVE state. | The average percentage of the time the request handler threads are idle. | 
| RequestThrottleQueueSize | After request throttling is applied. | The number of messages in the throttle queue. | 
| RequestThrottleTime | After request throttling is applied. | The average request throttle time in milliseconds. | 
| TcpConnections | After the cluster gets to the ACTIVE state. | Shows number of incoming and outgoing TCP segments with the SYN flag set. | 
| TrafficBytes | After the cluster gets to the ACTIVE state. | Shows network traffic in overall bytes between clients (producers and consumers) and brokers. Traffic between brokers isn't reported. | 

## `PER_TOPIC_PER_PARTITION` level monitoring for Express brokers
<a name="express-per-topic-per-partition-metrics"></a>

When you set the monitoring level to `PER_TOPIC_PER_PARTITION`, you get the metrics described in the following table, in addition to all the metrics from the `PER_TOPIC_PER_BROKER`, `PER_BROKER`, and `DEFAULT` levels. Only the `DEFAULT` level metrics are free of charge. The metrics in this table have the following dimensions: Consumer Group, Topic, Partition.


| Name | When visible | Description | 
| --- | --- | --- | 
| EstimatedTimeLag\* | After consumer group consumes from a topic. | Time estimate (in seconds) to drain the partition offset lag. | 
| OffsetLag\* | After consumer group consumes from a topic. | Partition-level consumer lag in number of offsets. | 
| RollingEstimatedTimeLag\* | After consumer group consumes from a topic. | Rolling time estimate (in seconds) to drain the partition offset lag. | 

\* Consumer lag metrics require ASCII-only consumer group names and have specific emission requirements. For more information, see [Monitor consumer lags](consumer-lag.md).

## `PER_TOPIC_PER_BROKER` level monitoring for Express brokers
<a name="express-per-topic-per-broker-metrics"></a>

When you set the monitoring level to `PER_TOPIC_PER_BROKER`, you get the metrics described in the following table, in addition to all the metrics from the `PER_BROKER` and `DEFAULT` levels. Only the `DEFAULT` level metrics are free of charge. The metrics in this table have the following dimensions: Cluster Name, Broker ID, Topic.

**Important**  
The metrics in the following table appear only after their values become nonzero for the first time. For example, to see BytesInPerSec, one or more producers must first send data to the cluster.


| Name | When visible | Description | 
| --- | --- | --- | 
| MessagesInPerSec | After you create a topic. | The number of messages received per second. | 

# Monitor an MSK Provisioned cluster with Prometheus
<a name="open-monitoring"></a>

You can monitor your MSK Provisioned cluster with Prometheus, an open-source monitoring system for time-series metric data. You can publish this data to Amazon Managed Service for Prometheus using Prometheus's remote write feature. You can also use tools that are compatible with Prometheus-formatted metrics or tools that integrate with Amazon MSK Open Monitoring, such as [Datadog](https://docs.datadoghq.com/integrations/amazon_msk/), [Lenses](https://docs.lenses.io/latest/deployment/configuration/agent/automation/kafka/aws-msk), [New Relic](https://docs.newrelic.com/docs/integrations/amazon-integrations/aws-integrations-list/aws-managed-kafka-msk-integration), and [Sumo Logic](https://help.sumologic.com/03Send-Data/Collect-from-Other-Data-Sources/Amazon_MSK_Prometheus_metrics_collection). Open monitoring is available for free, but charges apply for the transfer of data across Availability Zones.

For information about Prometheus, see the [Prometheus documentation](https://prometheus.io/docs).

For information about using Prometheus, see [Enhance operational insights for Amazon MSK using Amazon Managed Service for Prometheus and Amazon Managed Grafana](https://aws.amazon.com/blogs//big-data/enhance-operational-insights-for-amazon-msk-using-amazon-managed-service-for-prometheus-and-amazon-managed-grafana/).

**Note**  
Clusters that use KRaft metadata mode or MSK Express brokers can't have both open monitoring and public access enabled.

# Enable open monitoring on new MSK Provisioned clusters
<a name="enable-open-monitoring-at-creation"></a>

This procedure describes how to enable open monitoring on a new MSK cluster using the AWS Management Console, the AWS CLI, or the Amazon MSK API.

**Using the AWS Management Console**

1. Sign in to the AWS Management Console, and open the Amazon MSK console at [https://console.aws.amazon.com/msk/home?region=us-east-1#/home/](https://console.aws.amazon.com/msk/home?region=us-east-1#/home/).

1. In the **Monitoring** section, select the check box next to **Enable open monitoring with Prometheus**.

1. Provide the required information in all the sections of the page, and review all the available options.

1. Choose **Create cluster**.

**Using the AWS CLI**
+ Invoke the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/kafka/create-cluster.html) command and specify its `open-monitoring` option. Enable the `JmxExporter`, the `NodeExporter`, or both. If you specify `open-monitoring`, the two exporters can't be disabled at the same time.

**Using the API**
+ Invoke the [CreateCluster](https://docs.aws.amazon.com/msk/1.0/apireference/clusters.html#CreateCluster) operation and specify `OpenMonitoring`. Enable the `jmxExporter`, the `nodeExporter`, or both. If you specify `OpenMonitoring`, the two exporters can't be disabled at the same time.

# Enable open monitoring on an existing MSK Provisioned cluster
<a name="enable-open-monitoring-after-creation"></a>

To enable open monitoring, make sure that the MSK Provisioned cluster is in the `ACTIVE` state.

**Using the AWS Management Console**

1. Sign in to the AWS Management Console, and open the Amazon MSK console at [https://console.aws.amazon.com/msk/home?region=us-east-1#/home/](https://console.aws.amazon.com/msk/home?region=us-east-1#/home/).

1. Choose the name of the cluster that you want to update. This takes you to a page that contains details for the cluster.

1. On the **Properties** tab, scroll down to find the **Monitoring** section.

1. Choose **Edit**.

1. Select the check box next to **Enable open monitoring with Prometheus**.

1. Choose **Save changes**.

**Using the AWS CLI**
+ Invoke the [update-monitoring](https://docs.aws.amazon.com/cli/latest/reference/kafka/update-monitoring.html) command and specify its `open-monitoring` option. Enable the `JmxExporter`, the `NodeExporter`, or both. If you specify `open-monitoring`, the two exporters can't be disabled at the same time.

**Using the API**
+ Invoke the [UpdateMonitoring](https://docs.aws.amazon.com/msk/1.0/apireference/clusters-clusterarn-monitoring.html#UpdateMonitoring) operation and specify `OpenMonitoring`. Enable the `jmxExporter`, the `nodeExporter`, or both. If you specify `OpenMonitoring`, the two exporters can't be disabled at the same time.
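
As an illustration of the rule that the two exporters can't both be disabled, the following sketch builds the open monitoring settings and passes them to `UpdateMonitoring` through boto3. It is a minimal example under stated assumptions: the function names are hypothetical, the nested `Prometheus` / `JmxExporter` / `NodeExporter` / `EnabledInBroker` keys reflect the boto3 request shape as understood here, and the call itself requires boto3 plus AWS credentials.

```python
# Sketch: enable open monitoring on an existing cluster.
# build_open_monitoring and enable_open_monitoring are illustrative names.


def build_open_monitoring(jmx_exporter: bool, node_exporter: bool) -> dict:
    """Build the OpenMonitoring settings, rejecting the both-disabled case."""
    if not (jmx_exporter or node_exporter):
        raise ValueError("Enable the JMX Exporter, the Node Exporter, or both.")
    return {
        "Prometheus": {
            "JmxExporter": {"EnabledInBroker": jmx_exporter},
            "NodeExporter": {"EnabledInBroker": node_exporter},
        }
    }


def enable_open_monitoring(cluster_arn: str, jmx_exporter: bool = True,
                           node_exporter: bool = True) -> dict:
    """Call UpdateMonitoring; the cluster is expected to be in the ACTIVE state."""
    import boto3  # imported lazily so build_open_monitoring works without the SDK

    client = boto3.client("kafka")
    # UpdateMonitoring needs the cluster's current version string.
    info = client.describe_cluster_v2(ClusterArn=cluster_arn)["ClusterInfo"]
    return client.update_monitoring(
        ClusterArn=cluster_arn,
        CurrentVersion=info["CurrentVersion"],
        OpenMonitoring=build_open_monitoring(jmx_exporter, node_exporter),
    )
```

Validating the settings locally surfaces the both-disabled mistake before any request is sent.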

# Set up a Prometheus host on an Amazon EC2 instance
<a name="set-up-prometheus-host"></a>

This procedure describes how to set up a Prometheus host using a prometheus.yml file.

1. Download the Prometheus server from [https://prometheus.io/download/#prometheus](https://prometheus.io/download/#prometheus) to your Amazon EC2 instance.

1. Extract the downloaded file to a directory and go to that directory.

1. Create a file with the following contents and name it `prometheus.yml`.

   ```
   # file: prometheus.yml
   # my global config
   global:
     scrape_interval:     60s
   
   # A scrape configuration containing exactly one endpoint to scrape:
   # Here it's Prometheus itself.
   scrape_configs:
     # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
     - job_name: 'prometheus'
       static_configs:
       # 9090 is the prometheus server port
       - targets: ['localhost:9090']
     - job_name: 'broker'
       file_sd_configs:
       - files:
         - 'targets.json'
   ```

1. Use the [ListNodes](https://docs.aws.amazon.com//msk/1.0/apireference/clusters-clusterarn-nodes.html#ListNodes) operation to get a list of your cluster's brokers.

1. Create a file named `targets.json` with the following JSON. Replace *broker_dns_1*, *broker_dns_2*, and the rest of the broker DNS names with the DNS names you obtained for your brokers in the previous step. Include all of the brokers you obtained in the previous step. Amazon MSK uses port 11001 for the JMX Exporter and port 11002 for the Node Exporter.

------
#### [ ZooKeeper mode targets.json ]

   ```
   [
     {
       "labels": {
         "job": "jmx"
       },
       "targets": [
         "broker_dns_1:11001",
         "broker_dns_2:11001",
         .
         .
         .
         "broker_dns_N:11001"
       ]
     },
     {
       "labels": {
         "job": "node"
       },
       "targets": [
         "broker_dns_1:11002",
         "broker_dns_2:11002",
         .
         .
         .
         "broker_dns_N:11002"
       ]
     }
   ]
   ```

------
#### [ KRaft mode targets.json ]

   ```
   [
     {
       "labels": {
         "job": "jmx"
       },
       "targets": [
         "broker_dns_1:11001",
         "broker_dns_2:11001",
         .
         .
         .
         "broker_dns_N:11001",
         "controller_dns_1:11001",
         "controller_dns_2:11001",
         "controller_dns_3:11001"
       ]
     },
     {
       "labels": {
         "job": "node"
       },
       "targets": [
         "broker_dns_1:11002",
         "broker_dns_2:11002",
         .
         .
         .
         "broker_dns_N:11002"
       ]
     }
   ]
   ```

------
**Note**  
To scrape JMX metrics from KRaft controllers, add controller DNS names as targets in the JSON file. For example: `controller_dns_1:11001`, replacing `controller_dns_1` with the actual controller DNS name.

1. To start the Prometheus server on your Amazon EC2 instance, run the following command in the directory where you extracted the Prometheus files and saved `prometheus.yml` and `targets.json`.

   ```
   ./prometheus
   ```

1. Find the IPv4 public IP address of the Amazon EC2 instance where you ran Prometheus in the previous step. You need this public IP address in the following step.

1. To access the Prometheus web UI, open a browser that can access your Amazon EC2 instance, and go to `Prometheus-Instance-Public-IP:9090`, where *Prometheus-Instance-Public-IP* is the public IP address you got in the previous step.
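
For clusters with many brokers, the `targets.json` file described in the procedure above can be generated instead of written by hand. The following is a minimal sketch: the function names are hypothetical, and the DNS names are assumed to be ones you already collected from `ListNodes`. It mirrors the two layouts shown earlier, where KRaft controllers appear only in the `jmx` job.

```python
import json

# Ports that Amazon MSK uses for open monitoring, as stated in the procedure.
JMX_EXPORTER_PORT = 11001
NODE_EXPORTER_PORT = 11002


def build_targets(broker_dns_names, controller_dns_names=()):
    """Return the file_sd target groups for the 'jmx' and 'node' jobs.

    Pass controller_dns_names only for KRaft mode clusters; controllers are
    added to the JMX job and left out of the Node Exporter job.
    """
    brokers = list(broker_dns_names)
    jmx_hosts = brokers + list(controller_dns_names)
    return [
        {
            "labels": {"job": "jmx"},
            "targets": [f"{host}:{JMX_EXPORTER_PORT}" for host in jmx_hosts],
        },
        {
            "labels": {"job": "node"},
            "targets": [f"{host}:{NODE_EXPORTER_PORT}" for host in brokers],
        },
    ]


def write_targets_file(path, broker_dns_names, controller_dns_names=()):
    """Write the target groups to a JSON file that prometheus.yml references."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_targets(broker_dns_names, controller_dns_names), handle, indent=2)
```

Calling `write_targets_file("targets.json", brokers)` in the Prometheus directory produces the ZooKeeper mode layout; passing controller names as the third argument produces the KRaft mode layout.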

# Use Prometheus metrics
<a name="prometheus-metrics"></a>

All metrics emitted by Apache Kafka to JMX are accessible using open monitoring with Prometheus. For information about Apache Kafka metrics, see [Monitoring](https://kafka.apache.org/documentation/#monitoring) in the Apache Kafka documentation. Along with Apache Kafka metrics, consumer-lag metrics are also available at port 11001 under the JMX MBean name `kafka.consumer.group:type=ConsumerLagMetrics`. You can also use the Prometheus Node Exporter to get CPU and disk metrics for your brokers at port 11002.

# Store Prometheus metrics in Amazon Managed Service for Prometheus
<a name="managed-service-prometheus"></a>

Amazon Managed Service for Prometheus is a Prometheus-compatible monitoring and alerting service that you can use to monitor Amazon MSK clusters. It is a fully-managed service that automatically scales the ingestion, storage, querying, and alerting of your metrics. It also integrates with AWS security services to give you fast and secure access to your data. You can use the open-source PromQL query language to query your metrics and alert on them.

For more information, see [Getting started with Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-getting-started.html).

# Monitor consumer lags
<a name="consumer-lag"></a>

Monitoring consumer lag allows you to identify slow or stuck consumers that aren't keeping up with the latest data available in a topic. When necessary, you can then take remedial actions, such as scaling or rebooting those consumers. To monitor consumer lag, you can use Amazon CloudWatch or open monitoring with Prometheus.

Consumer lag metrics quantify the difference between the latest data written to your topics and the data read by your applications. Amazon MSK provides the following consumer-lag metrics, which you can get through Amazon CloudWatch or through open monitoring with Prometheus: `EstimatedMaxTimeLag`, `EstimatedTimeLag`, `MaxOffsetLag`, `OffsetLag`, and `SumOffsetLag`. For information about these metrics, see [Amazon MSK metrics for monitoring Standard brokers with CloudWatch](metrics-details.md).

Amazon MSK supports consumer lag metrics for clusters with Apache Kafka 2.2.1 or a later version. Consider the following points when you work with Kafka and CloudWatch metrics:
+ Consumer lag metrics are emitted only if a consumer group is in a STABLE or EMPTY state. A consumer group is STABLE after the successful completion of re-balancing, ensuring that partitions are evenly distributed among the consumers.
+ Consumer lag metrics are absent in the following scenarios:
  + The consumer group is unstable.
  + The name of the consumer group contains a colon (:).
  + You haven't set the consumer offset for the consumer group.
+ Consumer group names are used as dimensions for consumer lag metrics in CloudWatch. While Kafka supports UTF-8 characters in consumer group names, CloudWatch supports only ASCII characters for [dimension values](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_Dimension.html). If you use non-ASCII characters in consumer group names, CloudWatch drops the consumer lag metrics. To make sure that your consumer lag metrics are properly captured in CloudWatch, you must use only ASCII characters in your consumer group names.
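
The naming constraints in the preceding list can be checked before you deploy a consumer. This sketch uses a hypothetical helper name and covers only the name-based rules (ASCII characters only, no colon). It does not check the consumer group state or whether offsets have been set.

```python
def name_allows_lag_metrics(consumer_group: str) -> bool:
    """Return True if the name meets the naming rules for consumer lag metrics.

    The name must contain only ASCII characters (a CloudWatch dimension value
    requirement) and must not contain a colon.
    """
    return consumer_group.isascii() and ":" not in consumer_group
```

For example, a name such as `orders-service` passes, whereas `orders:eu` (colon) and `café-consumers` (non-ASCII character) do not.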

# Use Amazon MSK storage capacity alerts
<a name="cluster-alerts"></a>

On Amazon MSK provisioned clusters, you choose the cluster's primary storage capacity. If you exhaust the storage capacity on a broker in your provisioned cluster, it can affect its ability to produce and consume data, leading to costly downtime. Amazon MSK offers CloudWatch metrics to help you monitor your cluster's storage capacity. However, to make it easier for you to detect and resolve storage capacity issues, Amazon MSK automatically sends you dynamic cluster storage capacity alerts. The storage capacity alerts include recommendations for short-term and long-term steps to manage your cluster's storage capacity. From the [Amazon MSK console](https://console.aws.amazon.com/msk/home?region=us-east-1#/home/), you can use quick links within the alerts to take recommended actions immediately.

There are two types of MSK storage capacity alerts: proactive and remedial.
+ Proactive ("Action required") storage capacity alerts warn you about potential storage issues with your cluster. When a broker in an MSK cluster has used over 60% or 80% of its disk storage capacity, you'll receive proactive alerts for the affected broker. 
+ Remedial ("Critical action required") storage capacity alerts require you to take remedial action to fix a critical cluster issue when one of the brokers in your MSK cluster has run out of disk storage capacity.
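
If you also track disk usage yourself, you can mirror these thresholds in your own tooling. The sketch below is only an approximation of the behavior described above: the function name and return labels are hypothetical, and treating 100% usage as "run out of disk storage capacity" is an assumption, not a documented boundary.

```python
from typing import Optional


def classify_storage_usage(used_percent: float) -> Optional[str]:
    """Map a broker's disk usage percentage to the alert type described above.

    Returns None when usage is at or below 60%.
    """
    if used_percent >= 100:  # assumption: "run out" is treated as 100% used
        return "remedial"
    if used_percent > 80:
        return "proactive-80"
    if used_percent > 60:
        return "proactive-60"
    return None
```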

Amazon MSK automatically sends these alerts to the [Amazon MSK console](https://console.aws.amazon.com/msk/home?region=us-east-1#/home/), [AWS Health Dashboard](https://aws.amazon.com/premiumsupport/technology/aws-health/), [Amazon EventBridge](https://aws.amazon.com/pm/eventbridge/), and email contacts for your AWS account. You can also [configure Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-api-destination-partners.html) to deliver these alerts to Slack or to tools such as New Relic and Datadog.

Storage capacity alerts are enabled by default for all MSK provisioned clusters and can't be turned off. This feature is supported in all regions where MSK is available.

## Monitor storage capacity alerts
<a name="cluster-alerts-monitoring"></a>

You can check for storage capacity alerts in several ways:
+ Go to the [Amazon MSK console](https://console.aws.amazon.com/msk/home?region=us-east-1#/home/). Storage capacity alerts are displayed in the cluster alerts pane for 90 days. The alerts contain recommendations and single-click link actions to address disk storage capacity issues.
+ Use [ListClusters](https://docs.aws.amazon.com/msk/1.0/apireference/clusters.html#ListClusters), [ListClustersV2](https://docs.aws.amazon.com/MSK/2.0/APIReference/v2-clusters.html#ListClustersV2), [DescribeCluster](https://docs.aws.amazon.com/msk/1.0/apireference/clusters-clusterarn.html#DescribeCluster), or [DescribeClusterV2](https://docs.aws.amazon.com/MSK/2.0/APIReference/v2-clusters-clusterarn.html#DescribeClusterV2) APIs to view `CustomerActionStatus` and all the alerts for a cluster.
+ Go to the [AWS Health Dashboard](https://aws.amazon.com/premiumsupport/technology/aws-health/) to view alerts from MSK and other AWS services.
+ Set up [AWS Health API](https://docs.aws.amazon.com/health/latest/ug/health-api.html) and [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-api-destination-partners.html) to route alert notifications to third-party platforms such as Datadog, New Relic, and Slack.