Data skew

A Flink application is executed on a cluster in a distributed fashion. To scale out to multiple nodes, Flink uses the concept of keyed streams, which essentially means that the events of a stream are partitioned according to a specific key, e.g., customer id, and Flink can then process different partitions on different nodes. Many of the Flink operators are then evaluated based on these partitions, e.g., Keyed Windows, Process Functions and Async I/O.

Choosing a partition key often depends on the business logic. At the same time, many of the best practices for, e.g., DynamoDB and Spark, equally apply to Flink, including:

ensuring a high cardinality of partition keys
avoiding skew in the event volume between partitions

You can identify skew in the partitions by comparing the records received/sent of subtasks (i.e., instances of the same operator) in the Flink dashboard. In addition, Managed Service for Apache Flink monitoring can be configured to expose metrics for numRecordsIn/Out and numRecordsInPerSecond/OutPerSecond on a subtask level as well.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Backpressure

State skew