Configuring an external metastore for Hive - Amazon EMR

By default, Hive records metastore information in a MySQL database on the primary node's file system. The metastore contains a description of each table and the underlying data on which it is built, including partition names and data types. When a cluster terminates, all cluster nodes shut down, including the primary node. Because node file systems use ephemeral storage, the local metastore data is lost at that point. If you need the metastore to persist, you must create an external metastore that exists outside the cluster.

You have two options for an external metastore:

AWS Glue Data Catalog (Amazon EMR release 5.8.0 or later only). For more information, see Using the AWS Glue Data Catalog as the metastore for Hive.

Amazon RDS or Amazon Aurora. For more information, see Using an external MySQL database or Amazon Aurora.
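
As a rough illustration of the two options, the following Python sketch builds the hive-site configuration objects that could be supplied when creating a cluster. The property names and the Glue client factory class follow the linked AWS topics as best understood here. The function names, host name, user name, and password placeholder are hypothetical and exist only for this example.

```python
import json

# Client factory class that directs Hive to the AWS Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_config():
    """Return a hive-site configuration that uses the Glue Data Catalog."""
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_FACTORY,
            },
        }
    ]


def jdbc_metastore_config(host, user, password, port=3306, database="hive"):
    """Return a hive-site configuration for an external MySQL-compatible
    database such as Amazon RDS or Amazon Aurora.

    host, user, and password are placeholders supplied by the caller.
    """
    url = f"jdbc:mysql://{host}:{port}/{database}?createDatabaseIfNotExist=true"
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": url,
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


if __name__ == "__main__":
    # Hypothetical values; replace them with your own database details.
    print(json.dumps(glue_metastore_config(), indent=2))
    print(json.dumps(
        jdbc_metastore_config("hypothetical-host.example.com", "hive_user", "REPLACE_ME"),
        indent=2,
    ))
```

The printed JSON can be saved to a file and passed as the cluster's configuration. Avoid storing a real database password in plain text in source control.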

Note

If you're using Hive 3 and encounter too many connections to the Hive metastore, either set the parameter datanucleus.connectionPool.maxPoolSize to a smaller value or increase the number of connections that the database server can handle. The increased number of connections is due to the way Hive computes the maximum number of JDBC connections. To calculate the optimal value for performance, see Hive Configuration Properties.
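
A minimal sketch of lowering the pool size, assuming the parameter is set through the hive-site classification like other Hive properties. The function name and the value 5 are hypothetical and only illustrate the shape of the configuration; they are not a recommended setting.

```python
import json


def pool_size_config(max_pool_size):
    """Return a hive-site configuration that caps the metastore
    connection pool at max_pool_size connections."""
    # Reject booleans explicitly, because bool is a subclass of int.
    if isinstance(max_pool_size, bool) or not isinstance(max_pool_size, int):
        raise ValueError("max_pool_size must be an integer")
    if max_pool_size < 1:
        raise ValueError("max_pool_size must be at least 1")
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                # Configuration property values are passed as strings.
                "datanucleus.connectionPool.maxPoolSize": str(max_pool_size),
            },
        }
    ]


if __name__ == "__main__":
    # 5 is an illustrative value only; derive the real value from
    # Hive Configuration Properties and your database's connection limit.
    print(json.dumps(pool_size_config(5), indent=2))
```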
