
Amazon Managed Service for Apache Flink was previously known as Amazon Kinesis Data Analytics for Apache Flink.


Using CloudWatch alarms with Amazon Managed Service for Apache Flink

Using Amazon CloudWatch metric alarms, you watch a CloudWatch metric over a time period that you specify. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. An example of an action is sending a notification to an Amazon Simple Notification Service (Amazon SNS) topic.

For more information about CloudWatch alarms, see Using Amazon CloudWatch alarms.
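
As an illustration of how these pieces fit together, the following is a minimal sketch (using the boto3 CloudWatch client) that creates the downtime alarm recommended below and sends its notification to an SNS topic. The application name and SNS topic ARN are placeholders, and the AWS/KinesisAnalytics namespace and Application dimension are assumptions about how the service publishes its metrics; adjust them to match your environment.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the application reports any downtime (downtime > 0),
# using the Average statistic over 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="flink-app-downtime",                                        # any descriptive name
    Namespace="AWS/KinesisAnalytics",                                      # assumed metric namespace
    MetricName="downtime",
    Dimensions=[{"Name": "Application", "Value": "my-flink-application"}], # placeholder application name
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:flink-alerts"],      # placeholder SNS topic ARN
    TreatMissingData="notBreaching",
)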

This section describes the recommended alarms for monitoring Managed Service for Apache Flink applications.

The following entries describe the recommended alarms. Each entry has the following fields:

  • Metric expression: The metric or metric expression to test against the threshold.

  • Statistic: The statistic used to check the metric, such as Average.

  • Threshold: To use this alarm, you determine a threshold that defines the limit of expected application performance. You must determine this threshold by monitoring your application under normal conditions (one way to gather that baseline is shown in the sketch after this list).

  • Description: Causes that might trigger this alarm, and possible solutions for the condition.
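
One way to gather that baseline is to query a metric's recent history and inspect its typical range before choosing a threshold. The following is a minimal sketch using boto3; the application name is a placeholder, and the AWS/KinesisAnalytics namespace and Application dimension are the same assumptions as in the earlier example.

import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.now(datetime.timezone.utc)

# Pull one week of 5-minute maximums for heapMemoryUtilization to see what
# "normal" looks like before picking an alarm threshold.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/KinesisAnalytics",                                      # assumed metric namespace
    MetricName="heapMemoryUtilization",
    Dimensions=[{"Name": "Application", "Value": "my-flink-application"}], # placeholder application name
    StartTime=now - datetime.timedelta(days=7),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)

observed = [point["Maximum"] for point in response["Datapoints"]]
print("Peak heap utilization over the last week:", max(observed, default=None))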

Metric expression: downtime > 0
Statistic: Average
Threshold: 0. A downtime greater than zero indicates that the application has failed. If the value is larger than 0, the application is not processing any data.
Description: Recommended for all applications. The downtime metric measures the duration of an outage. A downtime greater than zero indicates that the application has failed. For troubleshooting, see Application is restarting.

Metric expression: RATE(numberOfFailedCheckpoints) > 0
Statistic: Average
Threshold: 0. This metric counts the number of failed checkpoints since the application started. Depending on the application, it can be tolerable if checkpoints fail occasionally. But if checkpoints fail regularly, the application is likely unhealthy and needs further attention. We recommend monitoring RATE(numberOfFailedCheckpoints) to alarm on the gradient and not on absolute values (see the metric math sketch after these entries).
Description: Recommended for all applications. Use this metric to monitor application health and checkpointing progress. The application saves state data to checkpoints when it's healthy. Checkpointing can fail due to timeouts if the application isn't making progress in processing the input data. For troubleshooting, see Checkpointing is timing out.

Metric expression: Operator.numRecordsOutPerSecond < threshold
Statistic: Average
Threshold: The minimum number of records emitted from the application during normal conditions.
Description: Recommended for all applications. Falling below this threshold can indicate that the application isn't making expected progress on the input data. For troubleshooting, see Throughput is too slow.

Metric expression: records_lag_max|millisbehindLatest > threshold
Statistic: Maximum
Threshold: The maximum expected latency during normal conditions. If the application is consuming from Kinesis or Kafka, these metrics indicate whether the application is falling behind and needs to be scaled to keep up with the current load. This is a good generic metric that is easy to track for all kinds of applications, but it can only be used for reactive scaling, that is, when the application has already fallen behind.
Description: Recommended for all applications. Use the records_lag_max metric for a Kafka source, or the millisbehindLatest metric for a Kinesis stream source. Rising above this threshold can indicate that the application isn't making expected progress on the input data. For troubleshooting, see Throughput is too slow.

Metric expression: lastCheckpointDuration > threshold
Statistic: Maximum
Threshold: The maximum expected checkpoint duration during normal conditions. This monitors how much data is stored in state and how long it takes to take a checkpoint. If checkpoints grow or take longer, the application continuously spends time on checkpointing and has fewer cycles for actual processing. At some point, checkpoints may grow too large or take so long that they fail. In addition to monitoring absolute values, consider monitoring the change rate with RATE(lastCheckpointSize) and RATE(lastCheckpointDuration).
Description: If lastCheckpointDuration continuously increases, rising above this threshold can indicate that the application isn't making expected progress on the input data, or that there are problems with application health such as backpressure. For troubleshooting, see Unbounded state growth.

Metric expression: lastCheckpointSize > threshold
Statistic: Maximum
Threshold: The maximum expected checkpoint size during normal conditions. This monitors how much data is stored in state and how long it takes to take a checkpoint. If checkpoints grow or take longer, the application continuously spends time on checkpointing and has fewer cycles for actual processing. At some point, checkpoints may grow too large or take so long that they fail. In addition to monitoring absolute values, consider monitoring the change rate with RATE(lastCheckpointSize) and RATE(lastCheckpointDuration).
Description: If lastCheckpointSize continuously increases, rising above this threshold can indicate that the application is accumulating state data. If the state data becomes too large, the application can run out of memory when recovering from a checkpoint, or recovering from a checkpoint might take too long. For troubleshooting, see Unbounded state growth.

Metric expression: heapMemoryUtilization > threshold
Statistic: Maximum
Threshold: The maximum expected heapMemoryUtilization during normal conditions, with a recommended value of 90 percent. This gives a good indication of the overall resource utilization of the application and can be used for proactive scaling unless the application is I/O bound.
Description: You can use this metric to monitor the maximum memory utilization of task managers across the application. If the application reaches this threshold, you need to provision more resources. You do this by enabling automatic scaling or increasing the application parallelism. For more information about increasing resources, see Implement application scaling.

Metric expression: cpuUtilization > threshold
Statistic: Maximum
Threshold: The maximum expected cpuUtilization during normal conditions, with a recommended value of 80 percent. This gives a good indication of the overall resource utilization of the application and can be used for proactive scaling unless the application is I/O bound.
Description: You can use this metric to monitor the maximum CPU utilization of task managers across the application. If the application reaches this threshold, you need to provision more resources. You do this by enabling automatic scaling or increasing the application parallelism. For more information about increasing resources, see Implement application scaling.

Metric expression: threadsCount > threshold
Statistic: Maximum
Threshold: The maximum expected threadsCount during normal conditions.
Description: You can use this metric to watch for thread leaks in task managers across the application. If this metric reaches the threshold, check your application code for threads that are created without being closed.

Metric expression: (oldGarbageCollectionTime * 100)/60_000 over 1 min > threshold
Statistic: Maximum
Threshold: The maximum expected oldGarbageCollectionTime duration. We recommend setting a threshold such that typical garbage collection time is 60 percent of the specified threshold, but the correct threshold for your application will vary.
Description: If this metric is continually increasing, it can indicate that there is a memory leak in task managers across the application.

Metric expression: RATE(oldGarbageCollectionCount) > threshold
Statistic: Maximum
Threshold: The maximum expected oldGarbageCollectionCount under normal conditions. The correct threshold for your application will vary.
Description: If this metric is continually increasing, it can indicate that there is a memory leak in task managers across the application.

Metric expression: Operator.currentOutputWatermark - Operator.currentInputWatermark > threshold
Statistic: Minimum
Threshold: The minimum expected watermark increment under normal conditions. The correct threshold for your application will vary.
Description: If this metric is continually increasing, it can indicate that either the application is processing increasingly older events, or that an upstream subtask has not sent a watermark in an increasingly long time.
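
Several of the entries above alarm on a metric math expression, such as RATE(numberOfFailedCheckpoints), RATE(oldGarbageCollectionCount), or (oldGarbageCollectionTime * 100)/60_000 over 1 min. The following is a minimal sketch of how such an alarm can be defined with boto3, using RATE(numberOfFailedCheckpoints) as the example. As before, the application name and SNS topic ARN are placeholders, and the namespace and dimension are assumptions about how the service publishes its metrics.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the rate of change of failed checkpoints rather than the absolute
# count, so the alarm reacts to the gradient as recommended above.
cloudwatch.put_metric_alarm(
    AlarmName="flink-app-failed-checkpoints-rate",
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:flink-alerts"],      # placeholder SNS topic ARN
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/KinesisAnalytics",                   # assumed metric namespace
                    "MetricName": "numberOfFailedCheckpoints",
                    "Dimensions": [
                        {"Name": "Application", "Value": "my-flink-application"},  # placeholder
                    ],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,   # the raw metric is only an input to the expression
        },
        {
            "Id": "e1",
            "Expression": "RATE(m1)",
            "Label": "RATE(numberOfFailedCheckpoints)",
            "ReturnData": True,    # the alarm evaluates this expression against the threshold
        },
    ],
)

The same Metrics structure can be reused for the other expressions by swapping in the corresponding metric name and expression string.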