Accessing the Data Catalog

You can use the AWS Glue Data Catalog (Data Catalog) to discover and understand your data. The Data Catalog provides a consistent way to maintain schema definitions, data types, locations, and other metadata. You can access the Data Catalog using the following methods:

  • AWS Glue console – You can access and manage the Data Catalog through the AWS Glue console, a web-based user interface. The console allows you to browse and search for databases, tables, and their associated metadata, as well as create, update, and delete metadata definitions.

  • AWS Glue crawler – Crawlers are programs that automatically scan your data sources and populate the Data Catalog with metadata. You can create and run crawlers to discover and catalog data from various sources such as Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon CloudWatch, and JDBC-compliant relational databases such as MySQL and PostgreSQL, as well as several non-AWS sources such as Snowflake and Google BigQuery.

  • AWS Glue APIs – You can access the Data Catalog programmatically using the AWS Glue APIs, which enable automation and integration with other applications and services (see the sketch after this list).

  • AWS Command Line Interface (AWS CLI) – You can use the AWS CLI to access and manage the Data Catalog from the command line. The CLI provides commands for creating, updating, and deleting metadata definitions, as well as querying and retrieving metadata information.

  • Integration with other AWS services – The Data Catalog integrates with various other AWS services, allowing you to access and utilize the metadata stored in the catalog. For example, you can use Amazon Athena to query data sources using the metadata in the Data Catalog, and use AWS Lake Formation to manage data access and governance for the Data Catalog resources.
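
As a minimal illustration of programmatic access, the following sketch uses the AWS SDK for Python (Boto3) to list databases and tables in the Data Catalog. The Region and the database name used in the pagination call are placeholders.

import boto3

# Create an AWS Glue client; credentials and Region come from the standard SDK chain.
glue = boto3.client("glue", region_name="us-east-1")

# List the databases in the Data Catalog.
for database in glue.get_databases()["DatabaseList"]:
    print("Database:", database["Name"])

# List the tables in one database ("mydatabase" is a placeholder name).
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="mydatabase"):
    for table in page["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location")
        print("Table:", table["Name"], "Location:", location)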

Connecting to the Data Catalog using AWS Glue Iceberg REST endpoint

AWS Glue's Iceberg REST endpoint supports API operations specified in the Apache Iceberg REST specification. Using an Iceberg REST client, you can connect your application running on an analytics engine to the REST catalog hosted in the Data Catalog.

The endpoint supports both Apache Iceberg table specification versions, v1 and v2, and defaults to v2. When using the Iceberg table v1 specification, you must specify v1 in the API call. Using the API operations, you can access Iceberg tables stored in both Amazon S3 object storage and Amazon S3 table storage.

Endpoint configuration

You can access the AWS Glue Iceberg REST catalog using the service endpoint. Refer to the AWS Glue service endpoints reference guide for the Region-specific endpoint. For example, when connecting to AWS Glue in the us-east-1 Region, you need to configure the endpoint URI property as follows:

Endpoint : https://glue.us-east-1.amazonaws.com/iceberg

Additional configuration properties – When using an Iceberg client to connect an analytics engine such as Spark to the service endpoint, you must also specify the following application configuration properties:

from pyspark.sql import SparkSession

# Replace these values with your catalog name, AWS account ID, and Region.
catalog_name = "mydatacatalog"
aws_account_id = "123456789012"
aws_region = "us-east-1"

spark = SparkSession.builder \
    ... \
    .config("spark.sql.defaultCatalog", catalog_name) \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.type", "rest") \
    .config(f"spark.sql.catalog.{catalog_name}.uri", f"https://glue.{aws_region}.amazonaws.com/iceberg") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", aws_account_id) \
    .config(f"spark.sql.catalog.{catalog_name}.rest.sigv4-enabled", "true") \
    .config(f"spark.sql.catalog.{catalog_name}.rest.signing-name", "glue") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .getOrCreate()
     

Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint

The AWS Glue Iceberg REST extension endpoint provides additional APIs that are not present in the Apache Iceberg REST specification, as well as server-side scan planning capabilities. These additional APIs are used when you access tables stored in Amazon Redshift managed storage. The endpoint is accessible from an application using the Apache Iceberg AWS Glue Data Catalog extensions.

Endpoint configuration – A catalog with tables in Amazon Redshift managed storage is accessible using the service endpoint. Refer to the AWS Glue service endpoints reference guide for the Region-specific endpoint. For example, when connecting to AWS Glue in the us-east-1 Region, you need to configure the endpoint URI property as follows:

Endpoint : https://glue.us-east-1.amazonaws.com/extensions

from pyspark.sql import SparkSession

# Replace these values with your catalog name, AWS account ID, and Region.
catalog_name = "myredshiftcatalog"
aws_account_id = "123456789012"
aws_region = "us-east-1"

spark = SparkSession.builder \
    .config("spark.sql.defaultCatalog", catalog_name) \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.type", "glue") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.id", f"{aws_account_id}:redshiftnamespacecatalog/redshiftdb") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .getOrCreate()
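
After the session is created, you can run Spark SQL against the Redshift-backed catalog. The namespace and table names below are placeholders:

# List namespaces in the Redshift-backed catalog and read from a table
# ("salesdb" and "orders" are placeholder names).
spark.sql("SHOW NAMESPACES").show()
spark.sql("SELECT * FROM salesdb.orders LIMIT 10").show()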
    

Authenticating and authorizing access to AWS Glue service endpoints

API requests to the AWS Glue Data Catalog endpoints are authenticated using AWS Signature Version 4 (SigV4). See the AWS Signature Version 4 for API requests section to learn more about AWS SigV4.

When accessing the AWS Glue service endpoint and AWS Glue metadata, the application assumes an IAM role that requires the glue:GetCatalog IAM action.
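
The following sketch shows one way to issue a SigV4-signed request to the Iceberg REST endpoint directly, using botocore to sign a GET /config call. The account ID used as the warehouse value, the /iceberg/v1 path prefix, and the use of the requests library are assumptions here; engine integrations such as the Spark configuration above handle this signing for you.

import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

region = "us-east-1"
# The warehouse is the AWS account ID that owns the Data Catalog (placeholder below);
# the /v1 path prefix follows the Apache Iceberg REST specification.
url = f"https://glue.{region}.amazonaws.com/iceberg/v1/config?warehouse=123456789012"

# Sign the request with SigV4 using the "glue" signing name.
credentials = boto3.Session().get_credentials().get_frozen_credentials()
request = AWSRequest(method="GET", url=url)
SigV4Auth(credentials, "glue", region).add_auth(request)

response = requests.get(url, headers=dict(request.headers))
print(response.status_code, response.json())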

REST operation | REST path | AWS Glue IAM Action | CloudTrail EventName | Lake Formation permissions
GetConfig | GET /config | GetCatalog | GetConfig | Not required
ListNamespaces | GET /namespaces | GetDatabases | GetDatabases | ALL, DESCRIBE, SELECT
CreateNamespace | POST /namespaces | CreateDatabase | CreateDatabase | ALL, CREATE_DATABASE
LoadNamespaceMetadata | GET /namespaces/{ns} | GetDatabase | GetDatabase | ALL, DESCRIBE, SELECT
UpdateProperties | POST /namespaces/{ns}/properties | UpdateDatabase | UpdateDatabase | ALL, ALTER
DeleteNamespace | DELETE /namespaces/{ns} | DeleteDatabase | DeleteDatabase | ALL, DROP
ListTables | GET /namespaces/{ns}/tables | GetTables | GetTables | ALL, SELECT, DESCRIBE
CreateTable | POST /namespaces/{ns}/tables | CreateTable | CreateTable | ALL, CREATE_TABLE
LoadTable | GET /namespaces/{ns}/tables/{tbl} | GetTable | GetTable | ALL, SELECT, DESCRIBE
TableExists | HEAD /namespaces/{ns}/tables/{tbl} | GetTable | GetTable | ALL, SELECT, DESCRIBE
UpdateTable | POST /namespaces/{ns}/tables/{tbl} | UpdateTable | UpdateTable | ALL, ALTER
DeleteTable | DELETE /namespaces/{ns}/tables/{tbl} | DeleteTable | DeleteTable | ALL, DROP

You can use IAM, AWS Lake Formation, or Lake Formation hybrid mode permissions to manage access to the default Data Catalog and its objects.

Federated catalogs in the AWS Glue Data Catalog have Lake Formation registered data locations. Lake Formation integrates with the Data Catalog and provides database-style permissions to manage user access to catalog objects. In Lake Formation, permissions must be set up for the IAM user or role that is used to create, insert, or delete data. The permissions are the same as for existing AWS Glue tables:

  • CREATE_CATALOG – Required to create catalogs

  • CREATE_DATABASE – Required to create databases

  • CREATE_TABLE – Required to create tables

  • DELETE – Required to delete data from a table

  • DESCRIBE – Required to read metadata

  • DROP – Required to drop/delete a table or database

  • INSERT – Required to insert data into a table

  • SELECT – Required to select data from a table

For more information, see Lake Formation permissions reference in the AWS Lake Formation Developer Guide.
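
For example, the following sketch grants DESCRIBE and SELECT on a Data Catalog table to an IAM role using the AWS SDK for Python (Boto3); the role ARN, catalog ID, database, and table names are placeholders.

import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Grant DESCRIBE and SELECT on a Data Catalog table to an IAM role
# (the ARN, catalog ID, database, and table names below are placeholders).
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analytics-role"},
    Resource={
        "Table": {
            "CatalogId": "123456789012",
            "DatabaseName": "mydatabase",
            "Name": "mytable",
        }
    },
    Permissions=["DESCRIBE", "SELECT"],
)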

Connecting to Data Catalog from a standalone Spark application

You can connect to the Data Catalog from a standalone Spark application using an Apache Iceberg connector.

  1. Create an IAM role for your Spark application.

  2. Connect to the AWS Glue Iceberg REST endpoint using the Iceberg connector.

    # Configure your application. Refer to
    # https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html
    # for best practices on configuring environment variables.
    export AWS_ACCESS_KEY_ID=$(aws configure get appUser.aws_access_key_id)
    export AWS_SECRET_ACCESS_KEY=$(aws configure get appUser.aws_secret_access_key)
    export AWS_SESSION_TOKEN=$(aws configure get appUser.aws_session_token)
    export AWS_REGION=us-east-1
    export REGION=us-east-1
    export AWS_ACCOUNT_ID={specify your aws account id here}

    ~/spark-3.5.3-bin-hadoop3/bin/spark-shell \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.0 \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.defaultCatalog=spark_catalog" \
    --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.spark_catalog.type=rest" \
    --conf "spark.sql.catalog.spark_catalog.uri=https://glue.us-east-1.amazonaws.com/iceberg" \
    --conf "spark.sql.catalog.spark_catalog.warehouse=${AWS_ACCOUNT_ID}" \
    --conf "spark.sql.catalog.spark_catalog.rest.sigv4-enabled=true" \
    --conf "spark.sql.catalog.spark_catalog.rest.signing-name=glue" \
    --conf "spark.sql.catalog.spark_catalog.rest.signing-region=us-east-1" \
    --conf "spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
    --conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
  3. Query data in the Data Catalog.

    spark.sql("create database myicebergdb").show()
    spark.sql("""CREATE TABLE myicebergdb.mytbl (name string) USING iceberg location 's3://bucket_name/mytbl'""")
    spark.sql("insert into myicebergdb.mytbl values('demo') ").show()
           

Data mapping between Amazon Redshift and Apache Iceberg

Amazon Redshift and Apache Iceberg support various data types. The following compatibility matrix outlines the support and limitations when mapping data between these two systems. Refer to Amazon Redshift Data Types and the Apache Iceberg Table Specifications for more details on the data types supported in each system.

Redshift data type | Aliases | Iceberg data type
SMALLINT | INT2 | int
INTEGER | INT, INT4 | int
BIGINT | INT8 | long
DECIMAL | NUMERIC | decimal
REAL | FLOAT4 | float
DOUBLE PRECISION | FLOAT8, FLOAT | double
CHAR | CHARACTER, NCHAR | string
VARCHAR | CHARACTER VARYING, NVARCHAR | string
BPCHAR | | string
TEXT | | string
DATE | | date
TIME | TIME WITHOUT TIME ZONE | time
TIMETZ | TIME WITH TIME ZONE | Not supported
TIMESTAMP | TIMESTAMP WITHOUT TIME ZONE | timestamp
TIMESTAMPTZ | TIMESTAMP WITH TIME ZONE | timestamptz
INTERVAL YEAR TO MONTH | | Not supported
INTERVAL DAY TO SECOND | | Not supported
BOOLEAN | BOOL | boolean
HLLSKETCH | | Not supported
SUPER | | Not supported
VARBYTE | VARBINARY, BINARY VARYING | binary
GEOMETRY | | Not supported
GEOGRAPHY | | Not supported
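
As an illustration of the matrix above, the following Spark SQL sketch (run in a session configured against the catalog as shown earlier) creates an Iceberg table whose columns use only types with a direct Redshift mapping; the namespace and table names are placeholders.

# Each column type below has a direct Redshift equivalent per the matrix:
# BIGINT <-> BIGINT, INT <-> INTEGER, DECIMAL <-> DECIMAL, STRING <-> VARCHAR,
# DATE <-> DATE, TIMESTAMP <-> TIMESTAMP ("salesdb" and "orders" are placeholders).
spark.sql("""
    CREATE TABLE salesdb.orders (
        order_id     BIGINT,
        quantity     INT,
        unit_price   DECIMAL(10, 2),
        product_name STRING,
        order_date   DATE,
        updated_at   TIMESTAMP
    ) USING iceberg
""")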

Considerations and limitations when using AWS Glue Iceberg REST Catalog APIs

The following considerations and limitations apply to Data Definition Language (DDL) operations when you use the AWS Glue Iceberg REST Catalog APIs.

Considerations
  • DeleteTable API behavior – The DeleteTable API supports a purge option. When purge is set to true, the table data is deleted; otherwise, the data is not deleted. For tables in Amazon S3, the operation does not delete table data, and it fails when the table is stored in Amazon S3 and purge is set to TRUE.

    For tables stored in Amazon Redshift managed storage, the operation deletes table data, similar to DROP TABLE behavior in Amazon Redshift. The operation fails when the table is stored in Amazon Redshift and purge is set to FALSE.

  • CreateTable API behavior – The CreateTable API operation doesn't support the option state-create = TRUE.

  • RenameTable API behavior – The RenameTable operation is supported for tables in Amazon Redshift but not for tables in Amazon S3.

  • DDL operations for namespaces and tables in Amazon Redshift – Create, update, and delete operations for namespaces and tables in Amazon Redshift are asynchronous, because they depend on when the Amazon Redshift managed workgroup is available and on whether a conflicting DDL or DML transaction is in progress, in which case the operation must wait for the lock before attempting to commit its changes.

    During a create, update or delete operation, the endpoint returns a 202 response with the following payload.

    {
      "transaction-context": "operation/resource", 
      "transaction-id": "data-api-request-id:crypto-hash-signature(operation, resource, data-api-uuid)"
    }        

    For example, the endpoint will provide the following response for an UpdateTable operation:

    {
      "transaction-context": "UpdateTable/arn:aws:glue:us-east-1:123456789012:table/123456789012/cat1/db1/tbl1", 
      "transaction-id": "b0033764-20df-4679-905d-71f20a0cdbe7:ca8a95d54158793204f1f39b4971d2a7"
    }        

    To track the progress of this transaction, you can use the CheckTransactionStatus API, in the following shape (a polling sketch follows at the end of this list):

    POST /transactions/status
    
    Request:
    {
      "transaction-context": "UpdateTable/arn:aws:glue:us-east-1:123456789012:table/123456789012/cat1/db1/tbl1", 
      "transaction-id": "b0033764-20df-4679-905d-71f20a0cdbe7:ca8a95d54158793204f1f39b4971d2a7"
    }
    
    Response:
    {
      "status": "IN_PROGRESS|SUCCEEDED|FAILED|CANCELED",
      "error": "message" // if failed
    }
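
    The following Python sketch polls CheckTransactionStatus until the transaction leaves the IN_PROGRESS state. The path under the /extensions endpoint and the use of the botocore and requests libraries for SigV4 signing are assumptions here, mirroring the signed-request example shown earlier.

    import json
    import time

    import boto3
    import requests
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    REGION = "us-east-1"
    # The path below under the /extensions endpoint is assumed from the shape shown above.
    STATUS_URL = f"https://glue.{REGION}.amazonaws.com/extensions/transactions/status"

    def wait_for_transaction(transaction_context, transaction_id, timeout_s=300):
        """Poll CheckTransactionStatus until the asynchronous DDL transaction completes."""
        body = json.dumps({
            "transaction-context": transaction_context,
            "transaction-id": transaction_id,
        })
        credentials = boto3.Session().get_credentials().get_frozen_credentials()
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            # Re-sign on every attempt so the SigV4 signature stays fresh.
            request = AWSRequest(method="POST", url=STATUS_URL, data=body,
                                 headers={"Content-Type": "application/json"})
            SigV4Auth(credentials, "glue", REGION).add_auth(request)
            response = requests.post(STATUS_URL, data=body, headers=dict(request.headers))
            response.raise_for_status()
            status = response.json().get("status")
            if status != "IN_PROGRESS":
                return response.json()   # SUCCEEDED, FAILED, or CANCELED
            time.sleep(5)                # back off before polling again
        raise TimeoutError("Transaction did not complete before the timeout")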
            
Limitations
  • View APIs in the Apache Iceberg REST specification are not supported in AWS Glue Iceberg REST Catalog.