Select your cookie preferences

We use essential cookies and similar tools that are necessary to provide our site and services. We use performance cookies to collect anonymous statistics, so we can understand how customers use our site and make improvements. Essential cookies cannot be deactivated, but you can choose “Customize” or “Decline” to decline performance cookies.

If you agree, AWS and approved third parties will also use cookies to provide useful site features, remember your preferences, and display relevant content, including relevant advertising. To accept or decline all non-essential cookies, choose “Accept” or “Decline.” To make more detailed choices, choose “Customize.”

MSCK Optimization - Amazon EMR

MSCK Optimization

Hive stores a list of partitions for each table in its metastore. However, when partitions are directly added to or removed from the file system, the Hive metastore is unaware of these changes. The MSCK command updates the partition metadata in the Hive metastore for partitions that were directly added to or removed from the file system. The syntax for the command is:

MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

Hive implements this command as follows:

  1. Hive retrieves all the partitions for the table from the metastore. From the list of partition paths that do not exist in the file system then creates a list of partitions to drop from the metastore.

  2. Hive gathers the partition paths present in the file system, compares them with the list of partitions from the metastore, and generates a list of partitions that need to be added to the metastore.

  3. Hive updates the metastore using ADD, DROP, or SYNC mode.

Note

When there are many partitions in the metastore, the step to check if a partition does not exist in the file system takes a long time to run because the file system's exists API call must be made for each partition.

In Amazon EMR 6.5.0, Hive introduced a flag called hive.emr.optimize.msck.fs.check. When enabled, this flag causes Hive to check for the presence of a partition from the list of partition paths from the file system that is generated in step 2 above instead of making file system API calls. In Amazon EMR 6.8.0, Hive enabled this optimization by default, eliminating the need to set the flag hive.emr.optimize.msck.fs.check.

PrivacySite termsCookie preferences
© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.