Accessing the Data Catalog
You can use the AWS Glue Data Catalog (Data Catalog) to discover and understand your data. Data Catalog provides a consistent way to maintain schema definitions, data types, locations, and other metadata. You can access the Data Catalog using the following methods:
AWS Glue console – You can access and manage the Data Catalog through the AWS Glue console, a web-based user interface. The console allows you to browse and search for databases, tables, and their associated metadata, as well as create, update, and delete metadata definitions.
AWS Glue crawler – Crawlers are programs that automatically scan your data sources and populate the Data Catalog with metadata. You can create and run crawlers to discover and catalog data from various sources like Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon CloudWatch, and JDBC-compliant relational databases such as MySQL and PostgreSQL, as well as several non-AWS sources such as Snowflake and Google BigQuery.
AWS Glue APIs – You can access the Data Catalog programmatically using the AWS Glue APIs, enabling automation and integration with other applications and services.
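For example, here is a minimal sketch using the AWS SDK for Python (Boto3) to list databases and tables in the Data Catalog; the database name is a placeholder:

```python
import boto3

# Create an AWS Glue client (uses your default credentials and Region)
glue = boto3.client("glue")

# List databases in the Data Catalog
for database in glue.get_databases()["DatabaseList"]:
    print(database["Name"])

# List tables in a specific database (placeholder name)
for table in glue.get_tables(DatabaseName="mydatabase")["TableList"]:
    print(table["Name"], table.get("StorageDescriptor", {}).get("Location"))
```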
AWS Command Line Interface (AWS CLI) – You can use the AWS CLI to access and manage the Data Catalog from the command line. The CLI provides commands for creating, updating, and deleting metadata definitions, as well as querying and retrieving metadata information.
Integration with other AWS services – The Data Catalog integrates with various other AWS services, allowing you to access and utilize the metadata stored in the catalog. For example, you can use Amazon Athena to query data sources using the metadata in the Data Catalog, and use AWS Lake Formation to manage data access and governance for the Data Catalog resources.
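As an illustration of the Athena integration, the following Boto3 sketch runs a query that resolves table metadata through the Data Catalog; the database, table, and results bucket names are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Start a query; Athena looks up the table definition in the Glue Data Catalog
response = athena.start_query_execution(
    QueryString="SELECT * FROM mytable LIMIT 10",  # placeholder table
    QueryExecutionContext={"Database": "mydatabase", "Catalog": "AwsDataCatalog"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder bucket
)
print(response["QueryExecutionId"])
```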
Connecting to the Data Catalog using AWS Glue Iceberg REST endpoint
AWS Glue's Iceberg REST endpoint supports API operations specified in the Apache Iceberg REST specification. Using an Iceberg REST client, you can connect your application running on an analytics engine to the REST catalog hosted in the Data Catalog.
The endpoint supports both Apache Iceberg table specifications, v1 and v2, defaulting to v2. When using the Iceberg table v1 specification, you must specify v1 in the API call. Using the API operations, you can access Iceberg tables stored in both Amazon S3 object storage and Amazon S3 Tables storage.
Endpoint configuration
You can access the AWS Glue Iceberg REST catalog using the service endpoint. Refer to the AWS Glue service endpoints reference guide for the region-specific endpoint. For example, when connecting to AWS Glue in the us-east-1 Region, you need to configure the endpoint URI property as follows:
Endpoint: https://glue.us-east-1.amazonaws.com/iceberg
Additional configuration properties – When using an Iceberg client to connect an analytics engine like Spark to the service endpoint, you must specify the following application configuration properties:
catalog_name ="mydatacatalog"
aws_account_id ="123456789012"
aws_region = "us-east-1" spark = SparkSession.builder \ ... \ .config("spark.sql.defaultCatalog", catalog_name) \ .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \ .config(f"spark.sql.catalog.{catalog_name}.type", "rest") \ .config(f"spark.sql.catalog.{catalog_name}.uri", "https://glue.{aws_region}.amazonaws.com/iceberg") \ .config(f"spark.sql.catalog.{catalog_name}.warehouse", "{aws_account_id}") \ .config(f"spark.sql.catalog.{catalog_name}.rest.sigv4-enabled", "true") \ .config(f"spark.sql.catalog.{catalog_name}.rest.signing-name", "glue") \ .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ .getOrCreate()
Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint
The AWS Glue Iceberg REST extension endpoint provides additional APIs that are not present in the Apache Iceberg REST specification, including server-side scan planning capabilities. These additional APIs are used when you access tables stored in Amazon Redshift managed storage. The endpoint is accessible from an application that uses the Apache Iceberg AWS Glue Data Catalog extensions.
Endpoint configuration – A catalog with tables in Amazon Redshift managed storage is accessible using the service endpoint. Refer to the AWS Glue service endpoints reference guide for the region-specific endpoint. For example, when connecting to AWS Glue in the us-east-1 Region, you need to configure the endpoint URI property as follows:
Endpoint: https://glue.us-east-1.amazonaws.com/extensions
catalog_name ="myredshiftcatalog"
aws_account_id ="123456789012"
aws_region = "us-east-1" spark = SparkSession.builder \ .config("spark.sql.defaultCatalog", catalog_name) \ .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \ .config(f"spark.sql.catalog.{catalog_name}.type", "glue") \ .config(f"spark.sql.catalog.{catalog_name}.glue.id", "{123456789012}:redshiftnamespacecatalog/redshiftdb") \ .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ .getOrCreate()
Authenticating and authorizing access to AWS Glue service endpoints
API requests to the AWS Glue Data Catalog endpoints are authenticated using AWS Signature Version 4 (SigV4). See the AWS Signature Version 4 for API requests section to learn more about AWS SigV4.
When accessing the AWS Glue service endpoint and AWS Glue metadata, the application assumes an IAM role that requires the glue:GetCatalog IAM action.
REST operation | REST path | AWS Glue IAM Action | CloudTrail EventName | Lake Formation permissions |
---|---|---|---|---|
GetConfig | GET /config | GetCatalog | GetConfig | Not required |
ListNamespaces | GET /namespaces | GetDatabases | GetDatabases | ALL, DESCRIBE, SELECT |
CreateNamespace | POST /namespaces | CreateDatabase | CreateDatabase | ALL, CREATE_DATABASE |
LoadNamespaceMetadata | GET /namespaces/{ns} | GetDatabase | GetDatabase | ALL, DESCRIBE, SELECT |
UpdateProperties | POST /namespaces/{ns}/properties | UpdateDatabase | UpdateDatabase | ALL, ALTER |
DeleteNamespace | DELETE /namespaces/{ns} | DeleteDatabase | DeleteDatabase | ALL, DROP |
ListTables | GET /namespaces/{ns}/tables | GetTables | GetTables | ALL, SELECT, DESCRIBE |
CreateTable | POST /namespaces/{ns}/tables | CreateTable | CreateTable | ALL, CREATE_TABLE |
LoadTable | GET /namespaces/{ns}/tables/{tbl} | GetTable | GetTable | ALL, SELECT, DESCRIBE |
TableExists | HEAD /namespaces/{ns}/tables/{tbl} | GetTable | GetTable | ALL, SELECT, DESCRIBE |
UpdateTable | POST /namespaces/{ns}/tables/{tbl} | UpdateTable | UpdateTable | ALL, ALTER |
DeleteTable | DELETE /namespaces/{ns}/tables/{tbl} | DeleteTable | DeleteTable | ALL, DROP |
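For example, a minimal identity-based policy that allows read-only access to the catalog, namespace, and table operations above might look like the following sketch; in practice, scope the Resource element to your own catalog, database, and table ARNs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetCatalog",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables"
      ],
      "Resource": "*"
    }
  ]
}
```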
You can use IAM, AWS Lake Formation, or Lake Formation hybrid mode permissions to manage access to the default Data Catalog and its objects.
Federated catalogs in the AWS Glue Data Catalog have Lake Formation registered data locations. Lake Formation integrates with the Data Catalog and provides database-style permissions to manage user access to catalog objects. In Lake Formation, permissions must be set up for the IAM user or role that is used to create, insert, or delete data. The permissions are the same as for existing AWS Glue tables:
- CREATE_CATALOG – Required to create catalogs
- CREATE_DATABASE – Required to create databases
- CREATE_TABLE – Required to create tables
- DELETE – Required to delete data from a table
- DESCRIBE – Required to read metadata
- DROP – Required to drop/delete a table or database
- INSERT – Needed when the principal needs to insert data into a table
- SELECT – Needed when the principal needs to select data from a table
For more information, see Lake Formation permissions reference in the AWS Lake Formation Developer Guide.
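As a sketch of granting such permissions programmatically with Boto3 (the role ARN, account ID, database, and table names below are placeholders):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT and DESCRIBE on a Data Catalog table to a principal (placeholder names)
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "Table": {
            "CatalogId": "123456789012",
            "DatabaseName": "mydatabase",
            "Name": "mytable",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```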
Connecting to Data Catalog from a standalone Spark application
You can connect to the Data Catalog from a standalone Spark application using an Apache Iceberg connector.
1. Create an IAM role for the Spark application.
2. Connect to the AWS Glue Iceberg REST endpoint using the Iceberg connector.
```bash
# Configure your application. Refer to
# https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html
# for best practices on configuring environment variables.
export AWS_ACCESS_KEY_ID=$(aws configure get appUser.aws_access_key_id)
export AWS_SECRET_ACCESS_KEY=$(aws configure get appUser.aws_secret_access_key)
export AWS_SESSION_TOKEN=$(aws configure get appUser.aws_session_token)
export AWS_REGION=us-east-1
export REGION=us-east-1
export AWS_ACCOUNT_ID="<specify your aws account id here>"

~/spark-3.5.3-bin-hadoop3/bin/spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.0 \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.defaultCatalog=spark_catalog" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.spark_catalog.type=rest" \
  --conf "spark.sql.catalog.spark_catalog.uri=https://glue.us-east-1.amazonaws.com/iceberg" \
  --conf "spark.sql.catalog.spark_catalog.warehouse=${AWS_ACCOUNT_ID}" \
  --conf "spark.sql.catalog.spark_catalog.rest.sigv4-enabled=true" \
  --conf "spark.sql.catalog.spark_catalog.rest.signing-name=glue" \
  --conf "spark.sql.catalog.spark_catalog.rest.signing-region=us-east-1" \
  --conf "spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
  --conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
```
3. Query data in the Data Catalog.
spark.sql("create database myicebergdb").show() spark.sql("""CREATE TABLE myicebergdb.mytbl (name string) USING iceberg location 's3://
bucket_name
/mytbl
'""") spark.sql("insert into myicebergdb.mytbl values('demo') ").show()
Data mapping between Amazon Redshift and Apache Iceberg
Amazon Redshift and Apache Iceberg support various data types. The following compatibility matrix outlines the support and limitations when mapping data between these two systems. For more details, refer to Amazon Redshift Data Types and the Apache Iceberg Table Specification.
Redshift data type | Aliases | Iceberg data type |
---|---|---|
SMALLINT | INT2 | int |
INTEGER | INT, INT4 | int |
BIGINT | INT8 | long |
DECIMAL | NUMERIC | decimal |
REAL | FLOAT4 | float |
DOUBLE PRECISION | FLOAT8, FLOAT | double |
CHAR | CHARACTER, NCHAR | string |
VARCHAR | CHARACTER VARYING, NVARCHAR | string |
BPCHAR | | string |
TEXT | | string |
DATE | | date |
TIME | TIME WITHOUT TIME ZONE | time |
TIMETZ | TIME WITH TIME ZONE | Not supported |
TIMESTAMP | TIMESTAMP WITHOUT TIME ZONE | timestamp |
TIMESTAMPTZ | TIMESTAMP WITH TIME ZONE | timestamptz |
INTERVAL YEAR TO MONTH | | Not supported |
INTERVAL DAY TO SECOND | | Not supported |
BOOLEAN | BOOL | boolean |
HLLSKETCH | | Not supported |
SUPER | | Not supported |
VARBYTE | VARBINARY, BINARY VARYING | binary |
GEOMETRY | | Not supported |
GEOGRAPHY | | Not supported |
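To illustrate the mapping, the following sketch creates an Iceberg table (placeholder catalog, namespace, and table names) whose column types correspond to common Redshift types in the matrix above:

```python
spark.sql("""
  CREATE TABLE mydatacatalog.mynamespace.orders (
    order_id   bigint,         -- Redshift BIGINT (INT8)
    amount     decimal(10, 2), -- Redshift DECIMAL (NUMERIC)
    label      string,         -- Redshift VARCHAR
    is_active  boolean,        -- Redshift BOOLEAN (BOOL)
    created_at timestamp       -- Redshift TIMESTAMP WITHOUT TIME ZONE
  ) USING iceberg
""")
```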
Considerations and limitations when using AWS Glue Iceberg REST Catalog APIs
The following considerations and limitations apply to Data Definition Language (DDL) operations when using the AWS Glue Iceberg REST Catalog APIs.
Considerations
- DeleteTable API behavior – The DeleteTable API supports a purge option. When purge is set to true, the table data is deleted; otherwise, the data is not deleted. For tables in Amazon S3, the operation will not delete table data, and the operation fails when the table is stored in Amazon S3 and purge = TRUE. For tables stored in Amazon Redshift managed storage, the operation will delete table data, similar to DROP TABLE behavior in Amazon Redshift, and the operation fails when the table is stored in Amazon Redshift and purge = FALSE.
- CreateTable API behavior – The CreateTable API operation doesn't support the option state-create = TRUE.
- RenameTable API behavior – The RenameTable operation is supported for tables in Amazon Redshift but not in Amazon S3.
- DDL operations for namespaces and tables in Amazon Redshift – Create, update, and delete operations for namespaces and tables in Amazon Redshift are asynchronous because they depend on when the Amazon Redshift managed workgroup is available and on whether a conflicting DDL or DML transaction is in progress; the operation has to wait for a lock and then attempt to commit the changes.

  During a create, update, or delete operation, the endpoint returns a 202 response with the following payload:

  ```
  {
    "transaction-context": "operation/resource",
    "transaction-id": "data-api-request-id:crypto-hash-signature(operation, resource, data-api-uuid)"
  }
  ```

  For example, the endpoint will provide the following response for an UpdateTable operation:

  ```
  {
    "transaction-context": "UpdateTable/arn:aws:glue:us-east-1:123456789012:table/123456789012/cat1/db1/tbl1",
    "transaction-id": "b0033764-20df-4679-905d-71f20a0cdbe7:ca8a95d54158793204f1f39b4971d2a7"
  }
  ```

  To track the progress of this transaction, you can use the CheckTransactionStatus API, in the following shape (see the Python sketch after this list):

  ```
  POST /transactions/status

  Request:
  {
    "transaction-context": "UpdateTable/arn:aws:glue:us-east-1:123456789012:table/123456789012/cat1/db1/tbl1",
    "transaction-id": "b0033764-20df-4679-905d-71f20a0cdbe7:ca8a95d54158793204f1f39b4971d2a7"
  }

  Response:
  {
    "status": "IN_PROGRESS|SUCCEEDED|FAILED|CANCELED",
    "error": "message" // if failed
  }
  ```
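The following is a minimal Python sketch of polling CheckTransactionStatus with a SigV4-signed request. The transaction values are taken from the example above; the base URL (the extensions endpoint) and the exact request path prefix are assumptions.

```python
import json
import boto3
import urllib3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

region = "us-east-1"
# Assumed base URL: the AWS Glue extensions endpoint for the Region
url = "https://glue.us-east-1.amazonaws.com/extensions/transactions/status"

payload = {
    "transaction-context": "UpdateTable/arn:aws:glue:us-east-1:123456789012:table/123456789012/cat1/db1/tbl1",
    "transaction-id": "b0033764-20df-4679-905d-71f20a0cdbe7:ca8a95d54158793204f1f39b4971d2a7",
}

# Sign the request with SigV4 using the "glue" service name
request = AWSRequest(
    method="POST",
    url=url,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
SigV4Auth(boto3.Session().get_credentials(), "glue", region).add_auth(request)

# Send the signed request and print the transaction status
response = urllib3.PoolManager().request(
    "POST", url, body=request.body, headers=dict(request.headers)
)
print(response.status, response.data.decode())
```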
Limitations
- View APIs in the Apache Iceberg REST specification are not supported in AWS Glue Iceberg REST Catalog.