Metastore configuration for EMR Serverless - Amazon EMR

A Hive metastore is a centralized location that stores structural information about your tables, including schemas, partition names, and data types. With EMR Serverless, you can persist this table metadata in a metastore that your jobs can access.

You have two options for a Hive metastore:

  • The AWS Glue Data Catalog

  • An external Apache Hive metastore

Using the AWS Glue Data Catalog as a metastore

You can configure your Spark and Hive jobs to use the AWS Glue Data Catalog as their metastore. We recommend this configuration when you require a persistent metastore or a metastore shared by different applications, services, or AWS accounts. For more information about the Data Catalog, see Populating the AWS Glue Data Catalog. For information about AWS Glue pricing, see AWS Glue pricing.

You can configure your EMR Serverless job to use the AWS Glue Data Catalog either in the same AWS account as your application, or in a different AWS account.

Configure the AWS Glue Data Catalog

To configure the Data Catalog, choose which type of EMR Serverless application you want to use.

Spark

When you use EMR Studio to run your jobs with EMR Serverless Spark applications, the AWS Glue Data Catalog is the default metastore.

When you use SDKs or the AWS CLI, you can set the spark.hadoop.hive.metastore.client.factory.class configuration to com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory in the sparkSubmitParameters of your job run. The following example shows how to configure the Data Catalog with the AWS CLI.

aws emr-serverless start-job-run \
    --application-id application-id \
    --execution-role-arn job-role-arn \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://amzn-s3-demo-bucket/code/pyspark/extreme_weather.py",
            "sparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.driver.cores=1 --conf spark.driver.memory=3g --conf spark.executor.cores=4 --conf spark.executor.memory=3g"
        }
    }'

Alternatively, you can set this configuration when you create a new SparkSession in your Spark code.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("SparkSQL")
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# We can query tables with Spark SQL.
spark.sql("SHOW TABLES").show()

# We can also list them with the native Spark catalog API.
print(spark.catalog.listTables())
Hive

For EMR Serverless Hive applications, the Data Catalog is the default metastore. That is, when you run jobs on an EMR Serverless Hive application, Hive records metastore information in the Data Catalog in the same AWS account as your application. You don't need a virtual private cloud (VPC) to use the Data Catalog as your metastore.

To access the Hive metastore tables, add the required AWS Glue policies outlined in Setting up IAM Permissions for AWS Glue.
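As an illustrative sketch only (the complete policy set is in Setting up IAM Permissions for AWS Glue), a minimal job runtime role policy granting read access to Data Catalog tables might look like the following. The Region, account ID, and database name are placeholders.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions"
            ],
            "Resource": [
                "arn:aws:glue:region:account-id:catalog",
                "arn:aws:glue:region:account-id:database/default",
                "arn:aws:glue:region:account-id:table/default/*"
            ]
        }
    ]
}
```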

Configure cross-account access for EMR Serverless and AWS Glue Data Catalog

To set up cross-account access for EMR Serverless, you must first sign in to the following AWS accounts:

  • AccountA – An AWS account where you have created an EMR Serverless application.

  • AccountB – An AWS account that contains an AWS Glue Data Catalog that you want your EMR Serverless job runs to access.

  1. Make sure an administrator or other authorized identity in AccountB attaches a resource policy to the Data Catalog in AccountB. This policy grants AccountA specific cross-account permissions to perform operations on resources in the AccountB catalog.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": ["arn:aws:iam::AccountA:role/job-runtime-role-A"]
                },
                "Action": [
                    "glue:GetDatabase",
                    "glue:CreateDatabase",
                    "glue:GetDatabases",
                    "glue:CreateTable",
                    "glue:GetTable",
                    "glue:UpdateTable",
                    "glue:DeleteTable",
                    "glue:GetTables",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                    "glue:CreatePartition",
                    "glue:BatchCreatePartition",
                    "glue:GetUserDefinedFunctions"
                ],
                "Resource": ["arn:aws:glue:region:AccountB:catalog"]
            }
        ]
    }
  2. Add an IAM policy to the EMR Serverless job runtime role in AccountA so that the role can access Data Catalog resources in AccountB.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:CreateDatabase",
                    "glue:GetDatabases",
                    "glue:CreateTable",
                    "glue:GetTable",
                    "glue:UpdateTable",
                    "glue:DeleteTable",
                    "glue:GetTables",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                    "glue:CreatePartition",
                    "glue:BatchCreatePartition",
                    "glue:GetUserDefinedFunctions"
                ],
                "Resource": ["arn:aws:glue:region:AccountB:catalog"]
            }
        ]
    }
  3. Start your job run. This step is slightly different depending on AccountA's EMR Serverless application type.

    Spark

    Set the spark.hadoop.hive.metastore.glue.catalogid property in the hive-site classification as shown in the following example. Replace AccountB-catalog-id with the ID of the Data Catalog in AccountB.

    aws emr-serverless start-job-run \
        --application-id "application-id" \
        --execution-role-arn "job-role-arn" \
        --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://amzn-s3-demo-bucket/code/pyspark/extreme_weather.py"
            }
        }' \
        --configuration-overrides '{
            "applicationConfiguration": [{
                "classification": "hive-site",
                "properties": {
                    "spark.hadoop.hive.metastore.glue.catalogid": "AccountB-catalog-id"
                }
            }]
        }'
    Hive

    Set the hive.metastore.glue.catalogid property in the hive-site classification as shown in the following example. Replace AccountB-catalog-id with the ID of the Data Catalog in AccountB.

    aws emr-serverless start-job-run \
        --application-id "application-id" \
        --execution-role-arn "job-role-arn" \
        --job-driver '{
            "hive": {
                "query": "s3://amzn-s3-demo-bucket/hive/scripts/create_table.sql",
                "parameters": "--hiveconf hive.exec.scratchdir=s3://amzn-s3-demo-bucket/hive/scratch --hiveconf hive.metastore.warehouse.dir=s3://amzn-s3-demo-bucket/hive/warehouse"
            }
        }' \
        --configuration-overrides '{
            "applicationConfiguration": [{
                "classification": "hive-site",
                "properties": {
                    "hive.metastore.glue.catalogid": "AccountB-catalog-id"
                }
            }]
        }'
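You can also start the same cross-account Hive job run with an AWS SDK. The following Python sketch builds the equivalent start_job_run request for boto3; the application ID, role ARN, bucket name, and catalog ID are placeholders, and the commented-out call assumes configured AWS credentials.

```python
import json

# Sketch: the boto3 equivalent of the AWS CLI start-job-run command above.
# All identifiers below are placeholders, not real resources.
request = {
    "applicationId": "application-id",
    "executionRoleArn": "job-role-arn",
    "jobDriver": {
        "hive": {
            "query": "s3://amzn-s3-demo-bucket/hive/scripts/create_table.sql",
            "parameters": (
                "--hiveconf hive.exec.scratchdir=s3://amzn-s3-demo-bucket/hive/scratch "
                "--hiveconf hive.metastore.warehouse.dir=s3://amzn-s3-demo-bucket/hive/warehouse"
            ),
        }
    },
    "configurationOverrides": {
        "applicationConfiguration": [
            {
                "classification": "hive-site",
                "properties": {"hive.metastore.glue.catalogid": "AccountB-catalog-id"},
            }
        ]
    },
}

# With AWS credentials configured, the call would be:
# import boto3
# client = boto3.client("emr-serverless")
# response = client.start_job_run(**request)

print(json.dumps(request["configurationOverrides"], indent=2))
```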

Considerations when using the AWS Glue Data Catalog

You can add auxiliary JARs with ADD JAR in your Hive scripts. For additional considerations, see Considerations when using AWS Glue Data Catalog.
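For example, a Hive script might load an auxiliary JAR and register a UDF from it as follows; the S3 path, JAR name, function class, and table are hypothetical placeholders.

```sql
-- Add an auxiliary JAR from S3, then register and use a UDF it provides.
-- The bucket, JAR, class, and table names below are placeholders.
ADD JAR s3://amzn-s3-demo-bucket/jars/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUdf';
SELECT my_udf(col) FROM my_table;
```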