To connect the AWS Glue Data Catalog to a Hive metastore, you need to deploy an AWS SAM application called GlueDataCatalogFederation-HiveMetastore.
The AWS SAM application creates the connection for the Hive metastore behind Amazon API Gateway using a Lambda function. The AWS SAM application uses a uniform resource identifier (URI) as an input from the user and connects the external Hive metastore to the Data Catalog. When a user runs a query on Hive tables, the Data Catalog calls the API Gateway endpoint. The endpoint invokes the Lambda function to retrieve the metadata of the Hive tables.
To connect the Data Catalog to the Hive metastore and set up permissions
- Deploy the AWS SAM application.
Sign in to the AWS Management Console and open the AWS Serverless Application Repository.
In the navigation pane, choose Available applications.
Choose Public applications.
Select the option Show apps that create custom IAM roles or resource policies.
In the search box, enter the name GlueDataCatalogFederation-HiveMetastore.
Choose the GlueDataCatalogFederation-HiveMetastore application.
Under Application Settings, enter the following minimum required settings for your Lambda function:
Application name - A name for your AWS SAM application.
GlueConnectionName - A name for the connection.
HiveMetastoreURIs - The URI of your Hive metastore host.
LambdaMemory - The amount of Lambda memory in MB, from 128 to 10240. The default is 1024.
LambdaTimeout - The maximum Lambda invocation runtime in seconds. The default is 30.
VPCSecurityGroupIds and VPCSubnetIds - Information for the VPC where the Hive metastore exists.
Select I acknowledge that this app creates custom IAM roles and resource policies. For more information, choose the Info link.
At the bottom right of the Application settings section, choose Deploy. When the deployment is complete, the Lambda function appears in the Resources section in the Lambda console.
The application is deployed to Lambda. Its name is prefixed with serverlessrepo- to indicate that the application was deployed from the AWS Serverless Application Repository. Selecting the application takes you to the Resources page, which lists each deployed resource of the application. The resources include the Lambda function that allows communication between the Data Catalog and the Hive metastore, the AWS Glue connection, and other resources that are needed for the database federation.
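Before deploying, it can help to sanity-check the values you plan to enter. The following is a minimal sketch and is not part of the AWS SAM application: the function name validate_settings is hypothetical, the thrift:// scheme check is an assumption about a typical Hive metastore URI, and it assumes a single URI value. Only the LambdaMemory range and the two defaults come from the settings listed above.

```python
from urllib.parse import urlparse


def validate_settings(settings: dict) -> list:
    """Return a list of problems found in the application settings (empty if none)."""
    problems = []

    # These two settings have no default in the list above, so treat them as mandatory.
    for key in ("GlueConnectionName", "HiveMetastoreURIs"):
        if not settings.get(key):
            problems.append(f"{key} is required")

    # Assumption: the metastore URI uses the Thrift scheme, e.g. thrift://host:9083.
    uri = settings.get("HiveMetastoreURIs", "")
    if uri:
        parsed = urlparse(uri)
        if parsed.scheme != "thrift" or not parsed.hostname:
            problems.append("HiveMetastoreURIs should look like thrift://<host>:<port>")

    # LambdaMemory is in MB and must be between 128 and 10240; the default is 1024.
    memory = settings.get("LambdaMemory", 1024)
    if not 128 <= memory <= 10240:
        problems.append("LambdaMemory must be between 128 and 10240")

    # LambdaTimeout is in seconds; the default is 30.
    timeout = settings.get("LambdaTimeout", 30)
    if timeout <= 0:
        problems.append("LambdaTimeout must be a positive number of seconds")

    return problems
```

An empty list means the values passed these basic checks; it does not guarantee that the deployment succeeds.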
- Create a federated database in the Data Catalog.
After you've created a connection to the Hive metastore, you can create federated databases in the Data Catalog that point to the external Hive metastore databases. You need to create a corresponding database in the Data Catalog for every Hive metastore database that you're connecting to the Data Catalog.
On the Data sharing page, choose the Shared databases tab, and then choose Create database.
For Connection name, choose the name of your Hive metastore connection from the dropdown menu.
Enter a unique database name and the federation source identifier for the database. The database name is the name that you use in your SQL statements when you query tables. It can contain a maximum of 255 characters and must be unique within your account.
Choose Create database.
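The console steps above correspond roughly to a Glue CreateDatabase request that carries a FederatedDatabase block. The following is a minimal sketch, assuming the boto3 Glue client's create_database call would be used to send it. The helper name and the example values (fed_glue_db, hive_sales, hive-conn) are hypothetical. The helper only builds and validates the request, so it runs without AWS credentials.

```python
def build_federated_database_request(name: str, identifier: str, connection_name: str) -> dict:
    """Build keyword arguments for a Glue create_database call (nothing is sent here)."""
    # The database name is limited to 255 characters.
    if not name or len(name) > 255:
        raise ValueError("database name must be 1 to 255 characters long")
    return {
        "DatabaseInput": {
            "Name": name,
            "FederatedDatabase": {
                # Federation source identifier of the Hive metastore database.
                "Identifier": identifier,
                # Name of the AWS Glue connection created by the AWS SAM application.
                "ConnectionName": connection_name,
            },
        }
    }


# Hypothetical example values.
request = build_federated_database_request("fed_glue_db", "hive_sales", "hive-conn")

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("glue").create_database(**request)
```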
- View tables in the federated database.
After you've created the federated database, you can view the list of tables in your Hive metastore using the Lake Formation console or the AWS CLI.
Select the database name from the Shared databases tab.
On the Databases page, choose View tables.
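Tables can also be listed programmatically. The following is a minimal sketch, assuming a boto3 Glue client is passed in and that its get_tables call is what retrieves the federated tables. The function name list_table_names and the FakeGlueClient stand-in are hypothetical; the stand-in exists only so that the pagination loop can be tried without AWS credentials.

```python
def list_table_names(glue_client, database_name: str) -> list:
    """Collect table names from a database, following NextToken pagination."""
    names = []
    kwargs = {"DatabaseName": database_name}
    while True:
        page = glue_client.get_tables(**kwargs)
        names.extend(table["Name"] for table in page.get("TableList", []))
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token


class FakeGlueClient:
    """Stand-in for a Glue client that returns two pages of results (dry run only)."""

    def get_tables(self, **kwargs):
        if "NextToken" not in kwargs:
            return {"TableList": [{"Name": "customers"}], "NextToken": "page-2"}
        return {"TableList": [{"Name": "orders"}]}


table_names = list_table_names(FakeGlueClient(), "fed_glue_db")

# With credentials configured, a real client could be used instead:
#   import boto3
#   table_names = list_table_names(boto3.client("glue"), "fed_glue_db")
```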
- Grant permissions.
After you've created the database, you can grant permissions to other IAM users and roles in your account or to external AWS accounts and organizations. You can't grant write data permissions (insert, delete) or metadata permissions (alter, drop, create) on federated databases. For more information on granting permissions, see Managing Lake Formation permissions.
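The restriction on write and metadata permissions can be checked before a grant is attempted. The following is a minimal sketch, assuming the boto3 Lake Formation client's grant_permissions call would be used to send the request. The helper name, the example role ARN, and the mapping of "create" to the CREATE_TABLE permission are assumptions made for illustration.

```python
# Permissions that cannot be granted on federated databases, per the text above.
# Assumption: "create" corresponds to the CREATE_TABLE permission.
BLOCKED_PERMISSIONS = {"INSERT", "DELETE", "ALTER", "DROP", "CREATE_TABLE"}


def build_grant_request(principal_arn: str, database_name: str,
                        table_name: str, permissions: list) -> dict:
    """Build keyword arguments for a Lake Formation grant_permissions call."""
    blocked = sorted(set(permissions) & BLOCKED_PERMISSIONS)
    if blocked:
        raise ValueError(f"not grantable on a federated database: {blocked}")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database_name, "Name": table_name}},
        "Permissions": list(permissions),
    }


# Hypothetical example values.
grant_request = build_grant_request(
    "arn:aws:iam::111122223333:role/analyst", "fed_glue_db", "customers",
    ["SELECT", "DESCRIBE"],
)

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant_request)
```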
- Query the federated databases.
After you grant permissions, users can sign in and start querying the federated database using Athena and Amazon Redshift. Users can now use the local database name to reference the Hive database in SQL queries.
Example Amazon Athena query syntax
Replace fed_glue_db with the local database name that you created earlier.
    SELECT * FROM fed_glue_db.customers LIMIT 10;
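The same query can be submitted from code. The following is a minimal sketch, assuming the boto3 Athena client's start_query_execution call would be used to send it. The helper name and the S3 output location are hypothetical, and the identifier check is an added precaution rather than an Athena requirement.

```python
import re


def build_athena_query_request(database: str, table: str,
                               output_location: str, limit: int = 10) -> dict:
    """Build keyword arguments for an Athena start_query_execution call."""
    # Allow only simple identifiers so the names can be placed in the SQL text safely.
    for identifier in (database, table):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", identifier):
            raise ValueError(f"unexpected identifier: {identifier!r}")
    return {
        "QueryString": f"SELECT * FROM {database}.{table} LIMIT {int(limit)};",
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical bucket name; replace fed_glue_db with your local database name.
query_request = build_athena_query_request(
    "fed_glue_db", "customers", "s3://amzn-s3-demo-bucket/athena-results/"
)

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("athena").start_query_execution(**query_request)
```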