

Amazon Redshift will no longer support the creation of new Python UDFs starting with Patch 198. Existing Python UDFs will continue to function until June 30, 2026. For more information, see the [blog post](https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/).

# Amazon Redshift Spectrum
<a name="c-using-spectrum"></a>

This section describes how to use Redshift Spectrum to efficiently read data from Amazon S3.

Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to run very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster.

**Topics**
+ [Amazon Redshift Spectrum overview](c-spectrum-overview.md)
+ [Getting started with Amazon Redshift Spectrum](c-getting-started-using-spectrum.md)
+ [Query Amazon S3 Tables from Amazon Redshift](querying-s3Tables.md)
+ [IAM policies for Amazon Redshift Spectrum](c-spectrum-iam-policies.md)
+ [Redshift Spectrum and AWS Lake Formation](spectrum-lake-formation.md)
+ [Data files for queries in Amazon Redshift Spectrum](c-spectrum-data-files.md)
+ [External schemas in Amazon Redshift Spectrum](c-spectrum-external-schemas.md)
+ [External tables for Redshift Spectrum](c-spectrum-external-tables.md)
+ [Using Apache Iceberg tables with Amazon Redshift](querying-iceberg.md)
+ [Amazon Redshift Spectrum query performance](c-spectrum-external-performance.md)
+ [Data handling options](t_setting-data-handling-options.md)
+ [Example: Performing correlated subqueries in Redshift Spectrum](c_performing-correlated-subqueries-spectrum.md)
+ [Metrics in Amazon Redshift Spectrum](c-spectrum-metrics.md)
+ [Query troubleshooting in Amazon Redshift Spectrum](c-spectrum-troubleshooting.md)
+ [Tutorial: Querying nested data with Amazon Redshift Spectrum](tutorial-query-nested-data.md)

# Amazon Redshift Spectrum overview
<a name="c-spectrum-overview"></a>

This topic describes how Redshift Spectrum efficiently reads data from Amazon S3.

Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. Amazon Redshift pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. Thus, Redshift Spectrum queries use much less of your cluster's processing capacity than other queries. Redshift Spectrum also scales intelligently. Based on the demands of your queries, Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing.

You create Redshift Spectrum tables by defining the structure for your files and registering them as tables in an external data catalog. The external data catalog can be AWS Glue, the data catalog that comes with Amazon Athena, or your own Apache Hive metastore. You can create and manage external tables either from Amazon Redshift using data definition language (DDL) commands or using any other tool that connects to the external data catalog. Changes to the external data catalog are immediately available to any of your Amazon Redshift clusters. 

Optionally, you can partition the external tables on one or more columns. Defining partitions as part of the external table can improve performance. The improvement occurs because the Amazon Redshift query optimizer eliminates partitions that don't contain data for the query. 
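
As an illustrative sketch of partition elimination (the table, column, and bucket names here are hypothetical, not part of the sample data), you might define a partitioned external table and register a partition like this:

```
-- Hypothetical example: partition the external table on saledate so that
-- the query optimizer can skip partitions a query doesn't reference.
create external table myspectrum_schema.sales_part(
    salesid integer,
    pricepaid decimal(8,2))
partitioned by (saledate date)
row format delimited fields terminated by '\t'
stored as textfile
location 's3://amzn-s3-demo-bucket/tickit/spectrum/sales_partition/';

-- Each partition is added explicitly and points at its own Amazon S3 prefix.
alter table myspectrum_schema.sales_part
add partition (saledate='2008-01-01')
location 's3://amzn-s3-demo-bucket/tickit/spectrum/sales_partition/saledate=2008-01-01/';
```

A query that filters on `saledate` then reads only the files under the matching partitions' locations.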

Materialized views on Spectrum tables can greatly improve cost and performance. For more information, see [Materialized views on external data lake tables in Amazon Redshift Spectrum](materialized-view-external-table.md). 

After your Redshift Spectrum tables have been defined, you can query and join the tables just as you do any other Amazon Redshift table. Redshift Spectrum doesn't support update operations on external tables. You can add Redshift Spectrum tables to multiple Amazon Redshift clusters and query the same data on Amazon S3 from any cluster in the same AWS Region. When you update Amazon S3 data files, the data is immediately available for query from any of your Amazon Redshift clusters. 

The AWS Glue Data Catalog that you access might be encrypted to increase security. If the AWS Glue Data Catalog is encrypted, you need the AWS Key Management Service (AWS KMS) key for AWS Glue to access it. AWS Glue Data Catalog encryption is not available in all AWS Regions. For a list of supported AWS Regions, see [Encryption and Secure Access for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/encryption-glue-resources.html) in the *AWS Glue Developer Guide*. For more information about AWS Glue Data Catalog encryption, see [Encrypting Your AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/encrypt-glue-data-catalog.html) in the *AWS Glue Developer Guide*.

**Note**  
You can't view details for Redshift Spectrum tables using the same resources that you use for standard Amazon Redshift tables, such as [PG\_TABLE\_DEF](r_PG_TABLE_DEF.md), [STV\_TBL\_PERM](r_STV_TBL_PERM.md), PG\_CLASS, or information\_schema. If your business intelligence or analytics tool doesn't recognize Redshift Spectrum external tables, configure your application to query [SVV\_EXTERNAL\_TABLES](r_SVV_EXTERNAL_TABLES.md) and [SVV\_EXTERNAL\_COLUMNS](r_SVV_EXTERNAL_COLUMNS.md).
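
For example, a minimal sketch (assuming an external schema named `myspectrum_schema`) that lists external tables and their columns through these views:

```
-- List external tables registered in the external schema.
select schemaname, tablename, location
from svv_external_tables
where schemaname = 'myspectrum_schema';

-- List each external table's columns and their data types.
select tablename, columnname, external_type
from svv_external_columns
where schemaname = 'myspectrum_schema';
```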

## Amazon Redshift Spectrum Regions
<a name="c-spectrum-regions"></a>

Redshift Spectrum is available in AWS Regions where Amazon Redshift is available, unless otherwise specified in Region-specific documentation. For AWS Region availability in commercial Regions, see [Service endpoints](https://docs.aws.amazon.com/general/latest/gr/redshift-service.html#redshift_region) for the **Redshift API** in the *Amazon Web Services General Reference*. 

# Amazon Redshift Spectrum limitations
<a name="c-spectrum-considerations"></a>

This topic describes limitations for using Redshift Spectrum.

Note the following considerations when you use Redshift Spectrum:
+ The Amazon Redshift cluster and the Amazon S3 bucket must be in the same AWS Region. 
+ Redshift Spectrum doesn't support enhanced VPC routing with provisioned clusters. To access your Amazon S3 data, you might need to perform additional configuration steps. For more information, see [Redshift Spectrum and enhanced VPC routing](https://docs.aws.amazon.com/redshift/latest/mgmt/spectrum-enhanced-vpc.html) in the *Amazon Redshift Management Guide*.
+ Redshift Spectrum supports Amazon S3 access point aliases. For more information, see [Using a bucket–style alias for your access point](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-alias.html) in the *Amazon Simple Storage Service User Guide*. However, Redshift Spectrum doesn't support VPC with Amazon S3 access point aliases. For more information, see [Redshift Spectrum and enhanced VPC routing](https://docs.aws.amazon.com/redshift/latest/mgmt/spectrum-enhanced-vpc.html) in the *Amazon Redshift Management Guide*.
+ You can't perform update or delete operations on external tables. To create a new external table in the specified schema, you can use CREATE EXTERNAL TABLE. For more information about CREATE EXTERNAL TABLE, see [CREATE EXTERNAL TABLE](r_CREATE_EXTERNAL_TABLE.md). To insert the results of a SELECT query into existing external tables on external catalogs, you can use INSERT (external table). For more information about INSERT (external table), see [INSERT (external table)](r_INSERT_external_table.md).
+ Unless you are using an AWS Glue Data Catalog that is enabled for AWS Lake Formation, you can't control user permissions on an external table. Instead, you can grant and revoke permissions on the external schema. For more information about working with AWS Lake Formation, see [Redshift Spectrum and AWS Lake Formation](spectrum-lake-formation.md).
+ To run Redshift Spectrum queries, the database user must have permission to create temporary tables in the database. The following example grants temporary permission on the database `spectrumdb` to the `spectrumusers` user group. 

  ```
  grant temp on database spectrumdb to group spectrumusers;
  ```

  For more information, see [GRANT](r_GRANT.md).
+ When using the Athena Data Catalog or AWS Glue Data Catalog as a metadata store, see [Quotas and Limits](https://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-limits.html) in the *Amazon Redshift Management Guide*. 
+ Redshift Spectrum doesn't support Amazon EMR with Kerberos.
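
Because permissions are managed on the external schema rather than on individual external tables, granting access looks like the following sketch (assuming the `myspectrum_schema` schema and `spectrumusers` group used elsewhere in this guide):

```
-- Allow the group to query all external tables in the schema.
grant usage on schema myspectrum_schema to group spectrumusers;

-- Withdraw that access again if it's no longer needed.
revoke usage on schema myspectrum_schema from group spectrumusers;
```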

# Getting started with Amazon Redshift Spectrum
<a name="c-getting-started-using-spectrum"></a>

In this tutorial, you learn how to use Amazon Redshift Spectrum to query data directly from files on Amazon S3. If you already have a cluster and a SQL client, you can complete this tutorial with minimal setup. 

**Note**  
Redshift Spectrum queries incur additional charges. The cost of running the sample queries in this tutorial is nominal. For more information about pricing, see [Amazon Redshift Spectrum pricing](https://aws.amazon.com/redshift/pricing/#redshift-spectrum-pricing).

## Prerequisites
<a name="c-getting-started-using-spectrum-prerequisites"></a>

To use Redshift Spectrum, you need an Amazon Redshift cluster and a SQL client that's connected to your cluster so that you can run SQL commands. The cluster and the data files in Amazon S3 must be in the same AWS Region. 

For information about how to create an Amazon Redshift cluster, see [Get started with Amazon Redshift provisioned data warehouses](https://docs.aws.amazon.com/redshift/latest/gsg/new-user.html) in the *Amazon Redshift Getting Started Guide*. For information about ways to connect to a cluster, see [Connecting to Amazon Redshift data warehouses](https://docs.aws.amazon.com/redshift/latest/gsg/database-tasks.html) in the *Amazon Redshift Getting Started Guide*.

In some of the examples that follow, the sample data is in the US East (N. Virginia) Region (`us-east-1`), so you need a cluster that is also in `us-east-1`. Or, you can use Amazon S3 to copy data objects from the following buckets and folders to your bucket in the AWS Region where your cluster is located: 
+ `s3://redshift-downloads/tickit/spectrum/customers/*`
+ `s3://redshift-downloads/tickit/spectrum/sales_partition/*`
+ `s3://redshift-downloads/tickit/spectrum/sales/*`
+ `s3://redshift-downloads/tickit/spectrum/salesevent/*`

Run an Amazon S3 command similar to the following to copy sample data from the US East (N. Virginia) Region to your AWS Region. Before running the command, create your bucket and the folders in it to match the copy command. The output of the command confirms that the files were copied to *bucket-name* in your desired AWS Region.

```
aws s3 cp s3://redshift-downloads/tickit/spectrum/ s3://bucket-name/tickit/spectrum/ --copy-props none --recursive
```

## Getting started with Redshift Spectrum using AWS CloudFormation
<a name="c-getting-started-using-spectrum-cfn"></a>

As an alternative to the following steps, you can access the Redshift Spectrum DataLake AWS CloudFormation template to create a stack with an Amazon S3 bucket that you can query. For more information, see [Launch your AWS CloudFormation stack and then query your data in Amazon S3](#c-getting-started-using-spectrum-query-s3-data-cfn).

## Getting started with Redshift Spectrum step by step
<a name="c-getting-started-using-spectrum-steps"></a>

To get started using Amazon Redshift Spectrum, follow these steps:
+ [Step 1: Create an IAM role for Amazon Redshift](#c-getting-started-using-spectrum-create-role) 
+ [Step 2: Associate the IAM role with your cluster](#c-getting-started-using-spectrum-add-role) 
+ [Step 3: Create an external schema and an external table](#c-getting-started-using-spectrum-create-external-table) 
+ [Step 4: Query your data in Amazon S3](#c-getting-started-using-spectrum-query-s3-data) 

## Step 1: Create an IAM role for Amazon Redshift
<a name="c-getting-started-using-spectrum-create-role"></a>

Your cluster needs authorization to access your external Data Catalog in AWS Glue or Amazon Athena and your data files in Amazon S3. To provide that authorization, you reference an AWS Identity and Access Management (IAM) role that is attached to your cluster. For more information about using roles with Amazon Redshift, see [Authorizing COPY and UNLOAD Operations Using IAM Roles](https://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html).

**Note**  
In certain cases, you can migrate your Athena Data Catalog to an AWS Glue Data Catalog. You can do this if your cluster is in an AWS Region where AWS Glue is supported and you have Redshift Spectrum external tables in the Athena Data Catalog. To use the AWS Glue Data Catalog with Redshift Spectrum, you might need to change your IAM policies. For more information, see [Upgrading to the AWS Glue Data Catalog](https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html#glue-upgrade) in the *Athena User Guide*.

When you create a role for Amazon Redshift, choose one of the following approaches:
+ If you are using Redshift Spectrum with either an Athena Data Catalog or AWS Glue Data Catalog, follow the steps outlined in [To create an IAM role for Amazon Redshift](#spectrum-get-started-create-role). 
+ If you are using Redshift Spectrum with an AWS Glue Data Catalog that is enabled for AWS Lake Formation, follow the steps outlined in these procedures:
  +  [To create an IAM role for Amazon Redshift using an AWS Glue Data Catalog enabled for AWS Lake Formation](#spectrum-get-started-create-role-lake-formation) 
  +  [To grant SELECT permissions on the table to query in the Lake Formation database](#spectrum-get-started-grant-lake-formation-table) <a name="spectrum-get-started-create-role"></a>

**To create an IAM role for Amazon Redshift**

1. Open the [IAM console](https://console.aws.amazon.com/iam/home?#home).

1. In the navigation pane, choose **Roles**.

1. Choose **Create role**.

1. Choose **AWS service** as the trusted entity, and then choose **Redshift** as the use case.

1. Under **Use case for other AWS services**, choose **Redshift - Customizable** and then choose **Next**.

1. On the **Add permissions policy** page, if you're using the AWS Glue Data Catalog, choose `AmazonS3ReadOnlyAccess` and `AWSGlueConsoleFullAccess`. If you're using the Athena Data Catalog, choose `AmazonAthenaFullAccess`. Then choose **Next**.
**Note**  
The `AmazonS3ReadOnlyAccess` policy gives your cluster read-only access to all Amazon S3 buckets. To grant access to only the AWS sample data bucket, create a new policy and add the following permissions.  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:Get*",
                   "s3:List*"
               ],
               "Resource": "arn:aws:s3:::redshift-downloads/*"
           }
       ]
   }
   ```

1. For **Role name**, enter a name for your role, for example **myspectrum\_role**.

1. Review the information, and then choose **Create role**.

1. In the navigation pane, choose **Roles**. Choose the name of your new role to view the summary, and then copy the **Role ARN** to your clipboard. This value is the Amazon Resource Name (ARN) for the role that you just created. You use that value when you create external tables to reference your data files on Amazon S3.<a name="spectrum-get-started-create-role-lake-formation"></a>

**To create an IAM role for Amazon Redshift using an AWS Glue Data Catalog enabled for AWS Lake Formation**

1. Open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the navigation pane, choose **Policies**.

   If this is your first time choosing **Policies**, the **Welcome to Managed Policies** page appears. Choose **Get Started**.

1. Choose **Create policy**. 

1. Choose to create the policy on the **JSON** tab. 

1. Paste in the following JSON policy document, which grants access to the Data Catalog but denies the administrator permissions for Lake Formation.


   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "RedshiftPolicyForLF",
               "Effect": "Allow",
               "Action": [
                   "glue:*",
                   "lakeformation:GetDataAccess"
               ],
               "Resource": "*"
           }
       ]
   }
   ```


1. When you are finished, choose **Review** to review the policy. The policy validator reports any syntax errors.

1. On the **Review policy** page, for **Name**, enter **myspectrum\_policy** to name the policy that you are creating. Enter a **Description** (optional). Review the policy **Summary** to see the permissions granted by your policy. Then choose **Create policy** to save your work.

   After you create a policy, you can provide access to your users.

To provide access, add permissions to your users, groups, or roles:
+ Users and groups in AWS IAM Identity Center:

  Create a permission set. Follow the instructions in [Create a permission set](https://docs.aws.amazon.com//singlesignon/latest/userguide/howtocreatepermissionset.html) in the *AWS IAM Identity Center User Guide*.
+ Users managed in IAM through an identity provider:

  Create a role for identity federation. Follow the instructions in [Create a role for a third-party identity provider (federation)](https://docs.aws.amazon.com//IAM/latest/UserGuide/id_roles_create_for-idp.html) in the *IAM User Guide*.
+ IAM users:
  + Create a role that your user can assume. Follow the instructions in [Create a role for an IAM user](https://docs.aws.amazon.com//IAM/latest/UserGuide/id_roles_create_for-user.html) in the *IAM User Guide*.
  + (Not recommended) Attach a policy directly to a user or add a user to a user group. Follow the instructions in [Adding permissions to a user (console)](https://docs.aws.amazon.com//IAM/latest/UserGuide/id_users_change-permissions.html#users_change_permissions-add-console) in the *IAM User Guide*.<a name="spectrum-get-started-grant-lake-formation-table"></a>

**To grant SELECT permissions on the table to query in the Lake Formation database**

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. In the navigation pane, choose **Data lake permissions**, and then choose **Grant**.

1. Follow the instructions in [Granting table permissions using the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html) in the *AWS Lake Formation Developer Guide*. Provide the following information:
   + For **IAM role**, choose the IAM role you created, `myspectrum_role`. When you run the Amazon Redshift Query Editor, it uses this IAM role for permission to the data. 
**Note**  
To grant SELECT permission on a table in a Lake Formation–enabled Data Catalog, do the following:  
1. Register the path for the data in Lake Formation.  
1. Grant users permission to that path in Lake Formation.  
Created tables can be found in the path registered in Lake Formation. 

1. Choose **Grant**.

**Important**  
As a best practice, allow access only to the underlying Amazon S3 objects through Lake Formation permissions. To prevent unapproved access, remove any permission granted to Amazon S3 objects outside of Lake Formation. If you previously accessed Amazon S3 objects before setting up Lake Formation, remove any IAM policies or bucket permissions that previously were set up. For more information, see [Upgrading AWS Glue Data Permissions to the AWS Lake Formation Model](https://docs.aws.amazon.com/lake-formation/latest/dg/upgrade-glue-lake-formation.html) and [Lake Formation Permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/lake-formation-permissions.html). 

## Step 2: Associate the IAM role with your cluster
<a name="c-getting-started-using-spectrum-add-role"></a>

Now you have an IAM role that authorizes Amazon Redshift to access the external Data Catalog and Amazon S3 for you. At this point, you must associate that role with your Amazon Redshift cluster.

**To associate an IAM role with a cluster**

1. Sign in to the AWS Management Console and open the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

1. On the navigation menu, choose **Clusters**, then choose the name of the cluster that you want to update. 

1. For **Actions**, choose **Manage IAM roles**. The **IAM roles** page appears. 

1. Either choose **Enter ARN** and then enter the ARN of an IAM role, or choose an IAM role from the list. Then choose **Add IAM role** to add it to the list of **Attached IAM roles**. 

1. Choose **Done** to associate the IAM role with the cluster. The cluster is modified to complete the change. 

## Step 3: Create an external schema and an external table
<a name="c-getting-started-using-spectrum-create-external-table"></a>

Create external tables in an external schema. The external schema references a database in the external data catalog and provides the IAM role ARN that authorizes your cluster to access Amazon S3 on your behalf. You can create an external database in an Amazon Athena Data Catalog, AWS Glue Data Catalog, or an Apache Hive metastore, such as Amazon EMR. For this example, you create the external database in an Amazon Athena Data Catalog when you create the external schema in Amazon Redshift. For more information, see [External schemas in Amazon Redshift Spectrum](c-spectrum-external-schemas.md). <a name="spectrum-get-started-create-external-table"></a>

**To create an external schema and an external table**

1. To create an external schema, replace the IAM role ARN in the following command with the role ARN you created in [step 1](#c-getting-started-using-spectrum-create-role). Then run the command in your SQL client.

   ```
   create external schema myspectrum_schema 
   from data catalog 
   database 'myspectrum_db' 
   iam_role 'arn:aws:iam::123456789012:role/myspectrum_role'
   create external database if not exists;
   ```
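
   Optionally, you can verify that the external schema and its database were created by querying the SVV\_EXTERNAL\_SCHEMAS system view:

   ```
   select schemaname, databasename
   from svv_external_schemas
   where schemaname = 'myspectrum_schema';
   ```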

1. To create an external table, run the following CREATE EXTERNAL TABLE command.
**Note**  
Your cluster and the Amazon S3 bucket must be in the same AWS Region. For this example CREATE EXTERNAL TABLE command, the Amazon S3 bucket with the sample data is located in the US East (N. Virginia) AWS Region. To see the source data, download the [`sales_ts.000` file](https://s3.amazonaws.com/redshift-downloads/tickit/spectrum/sales/sales_ts.000).   
You can modify this example to run in a different AWS Region. Create an Amazon S3 bucket in your desired AWS Region. Copy the sales data with an Amazon S3 copy command. Then update the location option in the example `CREATE EXTERNAL TABLE` command to your bucket.  

   ```
   aws s3 cp s3://redshift-downloads/tickit/spectrum/sales/ s3://bucket-name/tickit/spectrum/sales/ --copy-props none --recursive
   ```
The output of the Amazon S3 copy command confirms that the file was copied to the *bucket-name* in your desired AWS Region.  

   ```
   copy: s3://redshift-downloads/tickit/spectrum/sales/sales_ts.000 to s3://bucket-name/tickit/spectrum/sales/sales_ts.000
   ```

   ```
   create external table myspectrum_schema.sales(
   salesid integer,
   listid integer,
   sellerid integer,
   buyerid integer,
   eventid integer,
   dateid smallint,
   qtysold smallint,
   pricepaid decimal(8,2),
   commission decimal(8,2),
   saletime timestamp)
   row format delimited
   fields terminated by '\t'
   stored as textfile
   location 's3://redshift-downloads/tickit/spectrum/sales/'
   table properties ('numRows'='172000');
   ```

## Step 4: Query your data in Amazon S3
<a name="c-getting-started-using-spectrum-query-s3-data"></a>

After your external tables are created, you can query them using the same SELECT statements that you use to query other Amazon Redshift tables. These queries can join tables, aggregate data, and filter on predicates. <a name="spectrum-get-started-query-s3-data"></a>

**To query your data in Amazon S3**

1. Get the number of rows in the MYSPECTRUM\_SCHEMA.SALES table. 

   ```
   select count(*) from myspectrum_schema.sales;
   ```

   ```
   count 
   ------
   172462
   ```

1. Keep your larger fact tables in Amazon S3 and your smaller dimension tables in Amazon Redshift, as a best practice. If you loaded the sample data in [Load data](https://docs.aws.amazon.com/redshift/latest/gsg/cm-dev-t-load-sample-data.html), you have a table named EVENT in your database. If not, create the EVENT table by using the following command.

   ```
   create table event(
   eventid integer not null distkey,
   venueid smallint not null,
   catid smallint not null,
   dateid smallint not null sortkey,
   eventname varchar(200),
   starttime timestamp);
   ```

1. Load the EVENT table by replacing the IAM role ARN in the following COPY command with the role ARN you created in [Step 1: Create an IAM role for Amazon Redshift](#c-getting-started-using-spectrum-create-role). You can optionally download and view the [source data for `allevents_pipe.txt`](https://s3.amazonaws.com/redshift-downloads/tickit/allevents_pipe.txt) from an Amazon S3 bucket in AWS Region `us-east-1`.

   ```
   copy event from 's3://redshift-downloads/tickit/allevents_pipe.txt' 
   iam_role 'arn:aws:iam::123456789012:role/myspectrum_role'
   delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-east-1';
   ```

   The following example joins the external Amazon S3 table MYSPECTRUM\_SCHEMA.SALES with the local Amazon Redshift table EVENT to find the total sales for the top 10 events.

   ```
   select top 10 myspectrum_schema.sales.eventid, sum(myspectrum_schema.sales.pricepaid) from myspectrum_schema.sales, event
   where myspectrum_schema.sales.eventid = event.eventid
   and myspectrum_schema.sales.pricepaid > 30
   group by myspectrum_schema.sales.eventid
   order by 2 desc;
   ```

   ```
   eventid | sum     
   --------+---------
       289 | 51846.00
      7895 | 51049.00
      1602 | 50301.00
       851 | 49956.00
      7315 | 49823.00
      6471 | 47997.00
      2118 | 47863.00
       984 | 46780.00
      7851 | 46661.00
      5638 | 46280.00
   ```

1. View the query plan for the previous query. Notice the `S3 Seq Scan`, `S3 HashAggregate`, and `S3 Query Scan` steps that were run against the data on Amazon S3.

   ```
   explain
   select top 10 myspectrum_schema.sales.eventid, sum(myspectrum_schema.sales.pricepaid) 
   from myspectrum_schema.sales, event
   where myspectrum_schema.sales.eventid = event.eventid
   and myspectrum_schema.sales.pricepaid > 30
   group by myspectrum_schema.sales.eventid
   order by 2 desc;
   ```

   ```
   QUERY PLAN                                                                                                                                                                                
   -----------------------------------------------------------------------------
   XN Limit  (cost=1001055770628.63..1001055770628.65 rows=10 width=31)                                                                                                                      
     ->  XN Merge  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)                                                                                                               
           Merge Key: sum(sales.derived_col2)                                                                                                                                                
           ->  XN Network  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)                                                                                                       
                 Send to leader                                                                                                                                                              
                 ->  XN Sort  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)                                                                                                    
                       Sort Key: sum(sales.derived_col2)                                                                                                                                     
                       ->  XN HashAggregate  (cost=1055770620.49..1055770620.99 rows=200 width=31)                                                                                           
                             ->  XN Hash Join DS_BCAST_INNER  (cost=3119.97..1055769620.49 rows=200000 width=31)                                                                             
                                   Hash Cond: ("outer".derived_col1 = "inner".eventid)                                                                                                       
                                   ->  XN S3 Query Scan sales  (cost=3010.00..5010.50 rows=200000 width=31)                                                                                  
                                         ->  S3 HashAggregate  (cost=3010.00..3010.50 rows=200000 width=16)                                                                                  
                                               ->  S3 Seq Scan myspectrum_schema.sales location:"s3://redshift-downloads/tickit/spectrum/sales" format:TEXT  (cost=0.00..2150.00 rows=172000 width=16)
                                                     Filter: (pricepaid > 30.00)                                                                                                             
                                   ->  XN Hash  (cost=87.98..87.98 rows=8798 width=4)                                                                                                        
                                         ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=4)
   ```

## Launch your AWS CloudFormation stack and then query your data in Amazon S3
<a name="c-getting-started-using-spectrum-query-s3-data-cfn"></a>

After you create an Amazon Redshift cluster and connect to it, you can launch the Redshift Spectrum DataLake AWS CloudFormation template and then query your data.

CloudFormation installs the Redshift Spectrum Getting Started DataLake template and creates a stack that includes the following: 
+ A role named `myspectrum_role` associated with your Redshift cluster
+ An external schema named `myspectrum_schema`
+ An external table named `sales` in an Amazon S3 bucket
+ A Redshift table named `event` loaded with data

**To launch your Redshift Spectrum Getting Started DataLake CloudFormation stack**

1. Choose [https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=DataLake&templateURL=https://s3.amazonaws.com/redshift-downloads/docs-downloads/DataLake.yml](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=DataLake&templateURL=https://s3.amazonaws.com/redshift-downloads/docs-downloads/DataLake.yml). The CloudFormation console opens with the DataLake.yml template selected.

   You can also download and customize the Redshift Spectrum Getting Started DataLake CloudFormation [CFN template](https://s3.amazonaws.com/redshift-downloads/docs-downloads/DataLake.yml), then open the CloudFormation console ([https://console.aws.amazon.com/cloudformation](https://console.aws.amazon.com/cloudformation/)) and create a stack from the customized template. 

1. Choose **Next**.

1. Under **Parameters**, enter the Amazon Redshift cluster name, database name, and your database user name.

1. Choose **Next**.

   The stack options appear.

1. Choose **Next** to accept the default settings.

1. Review the information, and under **Capabilities**, choose **I acknowledge that AWS CloudFormation might create IAM resources**.

1. Choose **Create stack**.

If an error occurs while the stack is being created, try the following:
+ View the CloudFormation **Events** tab for information that can help you resolve the error.
+ Delete the DataLake CloudFormation stack before trying the operation again.
+ Make sure that you are connected to your Amazon Redshift database.
+ Make sure that you entered the correct information for the Amazon Redshift cluster name, database name, and database user name.

### Query your data in Amazon S3
<a name="c-getting-started-spectrum-cfn-query-s3-data"></a>

You query external tables using the same SELECT statements that you use to query other Amazon Redshift tables. These SELECT statements can include table joins, aggregations, and filter predicates. 

The following query returns the number of rows in the `myspectrum_schema.sales` external table. 

```
select count(*) from myspectrum_schema.sales;
```

```
count 
------
172462
```

### Join an external table with a local table
<a name="c-getting-started-spectrum-cfn-table-join"></a>

The following example joins the external table `myspectrum_schema.sales` with the local table `event` to find the total sales for the top 10 events.

```
select top 10 myspectrum_schema.sales.eventid, sum(myspectrum_schema.sales.pricepaid) from myspectrum_schema.sales, event
where myspectrum_schema.sales.eventid = event.eventid
and myspectrum_schema.sales.pricepaid > 30
group by myspectrum_schema.sales.eventid
order by 2 desc;
```

```
eventid | sum     
--------+---------
    289 | 51846.00
   7895 | 51049.00
   1602 | 50301.00
    851 | 49956.00
   7315 | 49823.00
   6471 | 47997.00
   2118 | 47863.00
    984 | 46780.00
   7851 | 46661.00
   5638 | 46280.00
```

### View the query plan
<a name="c-getting-started-spectrum-cfn-query-plan"></a>

View the query plan for the previous query. Note the `S3 Seq Scan`, `S3 HashAggregate`, and `S3 Query Scan` steps that ran against the data in Amazon S3.

```
explain
select top 10 myspectrum_schema.sales.eventid, sum(myspectrum_schema.sales.pricepaid) 
from myspectrum_schema.sales, event
where myspectrum_schema.sales.eventid = event.eventid
and myspectrum_schema.sales.pricepaid > 30
group by myspectrum_schema.sales.eventid
order by 2 desc;
```

```
QUERY PLAN                                                                                                                                                                                
-----------------------------------------------------------------------------
XN Limit  (cost=1001055770628.63..1001055770628.65 rows=10 width=31)                                                                                                                      
  ->  XN Merge  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)                                                                                                               
        Merge Key: sum(sales.derived_col2)                                                                                                                                                
        ->  XN Network  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)                                                                                                       
              Send to leader                                                                                                                                                              
              ->  XN Sort  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)                                                                                                    
                    Sort Key: sum(sales.derived_col2)                                                                                                                                     
                    ->  XN HashAggregate  (cost=1055770620.49..1055770620.99 rows=200 width=31)                                                                                           
                          ->  XN Hash Join DS_BCAST_INNER  (cost=3119.97..1055769620.49 rows=200000 width=31)                                                                             
                                Hash Cond: ("outer".derived_col1 = "inner".eventid)                                                                                                       
                                ->  XN S3 Query Scan sales  (cost=3010.00..5010.50 rows=200000 width=31)                                                                                  
                                      ->  S3 HashAggregate  (cost=3010.00..3010.50 rows=200000 width=16)                                                                                  
                                            ->  S3 Seq Scan myspectrum_schema.sales location:"s3://redshift-downloads/tickit/spectrum/sales" format:TEXT  (cost=0.00..2150.00 rows=172000 width=16)
                                                  Filter: (pricepaid > 30.00)                                                                                                             
                                ->  XN Hash  (cost=87.98..87.98 rows=8798 width=4)                                                                                                        
                                      ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=4)
```
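As a quick sanity check when reading such plans, you can extract the Spectrum-layer steps programmatically. The following is a minimal sketch in Python, using an abbreviated copy of the plan above; the helper name is hypothetical:

```python
# Minimal sketch: find which steps of an EXPLAIN plan ran in the
# Redshift Spectrum layer. Spectrum steps carry an "S3" operator name,
# as in the plan output above. The plan text here is abbreviated.
plan = """\
XN Limit  (cost=1001055770628.63..1001055770628.65 rows=10 width=31)
  ->  XN HashAggregate  (cost=1055770620.49..1055770620.99 rows=200 width=31)
        ->  XN S3 Query Scan sales  (cost=3010.00..5010.50 rows=200000 width=31)
              ->  S3 HashAggregate  (cost=3010.00..3010.50 rows=200000 width=16)
                    ->  S3 Seq Scan myspectrum_schema.sales  (cost=0.00..2150.00 rows=172000 width=16)
"""

def spectrum_steps(plan_text):
    """Return the operator names of plan steps that ran on Amazon S3."""
    steps = []
    for line in plan_text.splitlines():
        op = line.replace("->", "").strip()  # drop arrows and indentation
        name = op.split("  (")[0]            # operator name without cost text
        if "S3 " in name:
            steps.append(name)
    return steps

print(spectrum_steps(plan))
```

The more work that appears under `S3` operators (filters and aggregates in particular), the less data crosses from Amazon S3 into the cluster.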

# Query Amazon S3 Tables from Amazon Redshift
<a name="querying-s3Tables"></a>

Amazon Redshift integrates with Amazon S3 table buckets, allowing you to access S3 table resources using Amazon Redshift. Whether you are just getting started or managing thousands of tables in your Iceberg environment, table buckets simplify data lake management at any scale. For more information, see [Table buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets.html).

This topic describes how to get started with Amazon S3 Tables and how to access S3 Tables data using Amazon Redshift.

## Prerequisites
<a name="querying-s3Tables-prerequisites"></a>

Before querying S3 Tables from Amazon Redshift, you must integrate S3 Tables with AWS Glue Data Catalog. For instructions, see [Integrating Amazon S3 Tables with AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/glue-federation-s3tables.html).

After S3 Tables is integrated with AWS Glue Data Catalog, IAM principals with the required S3 Tables and AWS Glue IAM permissions can discover S3 Tables through the AWS Glue Data Catalog.

**Note**  
Method 3 (Auto-mounted awsdatacatalog) has additional prerequisites. See [Method 3: Auto-mounted awsdatacatalog](#querying-s3Tables-method3) for details.

## Query S3 Tables from Amazon Redshift
<a name="querying-s3Tables-steps"></a>

To get started with querying S3 Tables, follow these steps:
+ Step 1: Create an IAM role for Amazon Redshift
+ Step 2: Attach an IAM role to your Amazon Redshift cluster
+ Step 3: Query S3 Tables from Amazon Redshift

### Step 1: Create an IAM role for Amazon Redshift
<a name="querying-s3Tables-step1"></a>

Your cluster needs authorization to access the external S3 Tables catalog in AWS Glue. To provide that authorization, Amazon Redshift uses an IAM role that is attached to your cluster. Create an IAM role with the following policy permissions.

**Note**  
In the policy and examples below, replace `us-west-2` with your AWS Region and `111122223333` with your AWS account ID.

**To create a policy:**

1. Open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the navigation pane, choose **Policies**.

1. Choose **Create policy**.

1. Choose the **JSON** tab.

1. Paste in the following JSON policy document:

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "GlueDataCatalogPermissions",
         "Effect": "Allow",
         "Action": [
           "glue:GetCatalog",
           "glue:GetDatabase",
           "glue:GetTable",
           "glue:GetTables",
           "glue:UpdateTable",
           "glue:DeleteTable"
         ],
         "Resource": [
           "arn:aws:glue:us-west-2:111122223333:catalog",
           "arn:aws:glue:us-west-2:111122223333:catalog/s3tablescatalog",
           "arn:aws:glue:us-west-2:111122223333:catalog/s3tablescatalog/*",
           "arn:aws:glue:us-west-2:111122223333:database/s3tablescatalog/*/*",
           "arn:aws:glue:us-west-2:111122223333:table/s3tablescatalog/*/*/*",
           "arn:aws:glue:us-west-2:111122223333:database/*",
           "arn:aws:glue:us-west-2:111122223333:table/*/*"
         ]
       },
       {
         "Sid": "S3TablesDataAccessPermissions",
         "Effect": "Allow",
         "Action": [
           "s3tables:GetTableBucket",
           "s3tables:GetNamespace",
           "s3tables:GetTable",
           "s3tables:GetTableMetadataLocation",
           "s3tables:GetTableData",
           "s3tables:ListTableBuckets",
           "s3tables:CreateTable",
           "s3tables:PutTableData",
           "s3tables:UpdateTableMetadataLocation",
           "s3tables:ListNamespaces",
           "s3tables:ListTables",
           "s3tables:DeleteTable"
         ],
         "Resource": [
           "arn:aws:s3tables:us-west-2:111122223333:bucket/*",
           "arn:aws:s3tables:us-west-2:111122223333:bucket/*/table/*"
         ]
       }
     ]
   }
   ```

1. Choose **Review policy**.

1. On the **Review policy** page, enter `GlueCatalogS3Tables_Policy` for **Name**. Optionally, enter a description. Review the policy summary, then choose **Create policy**.

**To create an IAM role for Amazon Redshift:**

1. Open the [IAM console](https://console.aws.amazon.com/iam/).

1. In the navigation pane, choose **Roles**.

1. Choose **Create role**.

1. Choose **AWS service** as the trusted entity, then choose **Redshift** as the use case.

1. Under **Use case for other AWS services**, choose **Redshift - Customizable**, then choose **Next**.

1. On the **Add permissions** page, attach the `GlueCatalogS3Tables_Policy` policy you created above. Choose **Next**.

1. For **Role name**, enter a name for your role, for example `RedshiftS3TablesRole`.

1. Review the information, then choose **Create role**.

1. In the navigation pane, choose **Roles**. Choose the name of your new role to view the summary, then copy the **Role ARN** to your clipboard. You will use this ARN when creating external schemas for S3 Tables namespaces.

**Note**  
As a security best practice, scope the policy down to specific resources by replacing the wildcards with exact Amazon Resource Names (ARNs).
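To illustrate scoping down, here is a minimal sketch that rewrites the wildcard `s3tables` resource ARNs in a policy for one specific table bucket; the function name, bucket name, and the abbreviated policy are hypothetical:

```python
# Minimal sketch: narrow the wildcard s3tables resources in a policy
# down to a single table bucket. Names here are hypothetical examples.
import json

def scope_s3tables_resources(policy, region, account_id, bucket_name):
    """Replace wildcard s3tables bucket ARNs with ARNs for one bucket."""
    prefix = f"arn:aws:s3tables:{region}:{account_id}:bucket/"
    for statement in policy["Statement"]:
        if statement.get("Sid") == "S3TablesDataAccessPermissions":
            statement["Resource"] = [
                f"{prefix}{bucket_name}",
                f"{prefix}{bucket_name}/table/*",
            ]
    return policy

# Abbreviated version of the S3 Tables statement from the policy above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3TablesDataAccessPermissions",
            "Effect": "Allow",
            "Action": ["s3tables:GetTable", "s3tables:GetTableData"],
            "Resource": [
                "arn:aws:s3tables:us-west-2:111122223333:bucket/*",
                "arn:aws:s3tables:us-west-2:111122223333:bucket/*/table/*",
            ],
        }
    ],
}

scoped = scope_s3tables_resources(policy, "us-west-2", "111122223333", "analytics-bucket")
print(json.dumps(scoped["Statement"][0]["Resource"], indent=2))
```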

### Step 2: Attach an IAM role to your Amazon Redshift cluster
<a name="querying-s3Tables-step2"></a>

Associate the IAM role you configured in Step 1 with your Amazon Redshift cluster.

**Using the AWS Management Console:**

1. Sign in to the AWS Management Console and open the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

1. On the navigation menu, choose **Clusters**, then choose the cluster that you want to update.

1. For **Actions**, choose **Manage IAM roles** to display the current list of IAM roles associated with the cluster.

1. On the **Manage IAM roles** page, choose the IAM role to add, then choose **Add IAM role**.

1. Choose **Done** to save your changes.

**Using the AWS CLI:**

Run the following command to associate an IAM role with an existing cluster or namespace. Replace `my-redshift-cluster` or `my-redshift-namespace` with your cluster identifier or namespace and `111122223333` with your AWS account ID.

```
aws redshift modify-cluster-iam-roles \
    --cluster-identifier my-redshift-cluster \
    --add-iam-roles arn:aws:iam::111122223333:role/RedshiftS3TablesRole

# For Amazon Redshift Serverless
aws redshift-serverless update-namespace \
    --namespace-name my-redshift-namespace \
    --iam-roles "arn:aws:iam::111122223333:role/RedshiftS3TablesRole"
```

For more information, see [Associating IAM roles with clusters](https://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html) in the *Amazon Redshift Management Guide*.

### Step 3: Query S3 Tables from Amazon Redshift
<a name="querying-s3Tables-step3"></a>

When you integrate S3 Tables with AWS Glue Data Catalog, the service creates a federated catalog structure that maps S3 Tables resources to AWS Glue catalog objects:
+ An S3 table bucket becomes a **catalog** in the AWS Glue Data Catalog.
+ An S3 namespace becomes an **AWS Glue database**.
+ An S3 table becomes an **AWS Glue table object**.

The integration creates the following hierarchy:
+ **Federated catalog:** `s3tablescatalog` (automatically created)
+ **Child catalogs:** Each S3 table bucket becomes a child catalog under `s3tablescatalog`.
+ **Databases:** Each S3 namespace within a table bucket becomes a database.
+ **Tables:** Each S3 table within a namespace becomes a table.

For example, if you have an S3 table bucket named `analytics-bucket` with a namespace `sales` containing a table `transactions`, the full path in the AWS Glue Data Catalog would be: `s3tablescatalog/analytics-bucket/sales/transactions`.
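This mapping can be sketched as a simple path construction; the bucket, namespace, and table names are the example values above:

```python
# Minimal sketch: build the AWS Glue Data Catalog path for an S3 table,
# following the s3tablescatalog hierarchy described above:
# table bucket -> child catalog, namespace -> database, table -> table.
def glue_catalog_path(table_bucket, namespace, table):
    return f"s3tablescatalog/{table_bucket}/{namespace}/{table}"

print(glue_catalog_path("analytics-bucket", "sales", "transactions"))
# s3tablescatalog/analytics-bucket/sales/transactions
```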

**Create a resource link**

Before using any of the three query methods below, you must create a resource link in the AWS Glue Data Catalog. Resource links allow Amazon Redshift to reference S3 Tables databases through the standard catalog.

*Using the AWS Glue console:*

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Databases**.

1. Choose **Create**, then choose **Resource link**.

1. On the **Create resource link** page, provide the following information:
   + **Resource link name:** Enter a name for the resource link (for example, `sales_resource_link`).
   + **Shared database:** Enter the S3 Tables database path (for example, `s3tablescatalog/analytics-bucket/sales`).
   + **Shared database owner:** Enter your AWS account ID.
   + **Shared database's catalog ID:** Enter the catalog ID in the format `<account-id>:s3tablescatalog/<bucket-name>`.

1. Choose **Create**.

*Using the AWS CLI:*

```
aws glue create-database \
  --region us-west-2 \
  --cli-input-json '{
        "CatalogId": "111122223333",
        "DatabaseInput": {
            "Name": "sales_resource_link",
            "TargetDatabase": {
                "CatalogId": "111122223333:s3tablescatalog/analytics-bucket",
                "DatabaseName": "sales"
            }
        }
  }'
```

This command creates a resource link named `sales_resource_link` in your default AWS Glue Data Catalog that points to the `sales` database in the S3 table bucket `analytics-bucket`.
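If you script this step in Python, the same request can be sent with boto3's `glue.create_database`. The following is a minimal sketch of the request payload, mirroring the CLI input above; sending it requires AWS credentials, so the call itself appears only in a comment:

```python
# Minimal sketch: the boto3 equivalent of the CLI call above.
# With credentials configured, you would send the request with:
#   import boto3
#   boto3.client("glue", region_name="us-west-2").create_database(**request)
request = {
    "CatalogId": "111122223333",
    "DatabaseInput": {
        "Name": "sales_resource_link",
        "TargetDatabase": {
            "CatalogId": "111122223333:s3tablescatalog/analytics-bucket",
            "DatabaseName": "sales",
        },
    },
}

# The resource link name and its target can be read back from the payload.
print(request["DatabaseInput"]["Name"])
print(request["DatabaseInput"]["TargetDatabase"]["DatabaseName"])
```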

After you create resource links, Amazon Redshift provides three methods to query S3 Tables. Choose the method that best fits your use case.

**Note**  
To create a resource link at the database level, the Amazon Redshift administrator must have the `glue:CreateDatabase` permission on the default catalog and the database being created.

#### Method 1: CREATE EXTERNAL SCHEMA
<a name="querying-s3Tables-method1"></a>

Use `CREATE EXTERNAL SCHEMA` to create an external schema that references your S3 Tables database. This method provides explicit control over schema naming and configuration.

For complete syntax details, see [CREATE EXTERNAL SCHEMA](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html) in the *Amazon Redshift Database Developer Guide*.

**Example**

Use the database name and catalog ID from Step 3. Replace `111122223333` with your AWS account ID.

```
CREATE EXTERNAL SCHEMA s3tables_schema
FROM DATA CATALOG DATABASE 'sales_resource_link'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftS3TablesRole'
REGION 'us-west-2'
CATALOG_ID '111122223333';

SELECT * FROM s3tables_schema.transactions;
```

#### Method 2: CREATE DATABASE FROM ARN
<a name="querying-s3Tables-method2"></a>

Use `CREATE DATABASE` with the `FROM ARN` clause to create a federated database that directly references your AWS Glue resource link. This method automatically maps the AWS Glue database to a Redshift database.

For complete syntax details, see [CREATE DATABASE](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_DATABASE.html) in the *Amazon Redshift Database Developer Guide*.

**Example**

Replace `111122223333` with your AWS account ID.

```
CREATE DATABASE s3tables_db
FROM ARN 'arn:aws:glue:us-west-2:111122223333:database/sales_resource_link'
WITH DATA CATALOG SCHEMA analytics_schema
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftS3TablesRole';

SELECT * FROM s3tables_db.analytics_schema.transactions;
```

#### Method 3: Auto-mounted awsdatacatalog
<a name="querying-s3Tables-method3"></a>

Amazon Redshift can automatically mount AWS Glue Data Catalog databases, including S3 Tables resource links, through the `awsdatacatalog` database. This method requires federated access to Spectrum (FAS) to be enabled on your cluster.

**Prerequisites**

To use the auto-mounted `awsdatacatalog` database, you must enable federated access to Spectrum. This allows Amazon Redshift to use federated identity credentials to access AWS Glue Data Catalog and external data sources.

To enable federated access to Spectrum:

1. Connect to your Redshift cluster using an IAM identity with the following permissions:
   + `redshift:GetClusterCredentialsWithIAM` (for provisioned clusters) or `redshift-serverless:GetCredentials` (for Serverless)
   + `AmazonS3ReadOnlyAccess`
   + `AWSGlueConsoleFullAccess`
   + S3 Tables permissions (as defined in Step 1)

1. When you connect with an IAM identity, Amazon Redshift automatically creates a database user prefixed with `IAM:` (for users) or `IAMR:` (for roles).

1. As a cluster administrator, grant the federated user permissions to access the external schema. Replace `my_user` with your IAM role or user name:

   ```
   GRANT ALL ON SCHEMA awsdatacatalog TO "IAMR:my_user";
   ```

For detailed instructions on setting up federated access, see [Using a federated identity to manage Amazon Redshift access to local resources and Amazon Redshift Spectrum external tables](https://docs.aws.amazon.com/redshift/latest/mgmt/authorization-fas-spectrum.html) in the *Amazon Redshift Management Guide*.

**Query S3 Tables**

After federated access is configured, verify the mounted schemas and query your S3 Tables.

Verify mounted schemas:

```
SHOW SCHEMAS FROM DATABASE awsdatacatalog;
```

Query S3 Tables using the resource link name from Step 3:

```
SELECT * FROM awsdatacatalog.sales_resource_link.transactions;
```

# IAM policies for Amazon Redshift Spectrum
<a name="c-spectrum-iam-policies"></a>

This topic describes the required IAM permissions for using Redshift Spectrum.

By default, Amazon Redshift Spectrum uses the AWS Glue Data Catalog in AWS Regions that support AWS Glue. In other AWS Regions, Redshift Spectrum uses the Athena Data Catalog. Your cluster needs authorization to access your external data catalog in AWS Glue or Athena and your data files in Amazon S3. You provide that authorization by referencing an AWS Identity and Access Management (IAM) role that is attached to your cluster. If you use an Apache Hive metastore to manage your data catalog, you don't need to provide access to Athena.

You can chain roles so that your cluster can assume other roles not attached to the cluster. For more information, see [Chaining IAM roles in Amazon Redshift Spectrum](#c-spectrum-chaining-roles).

The AWS Glue catalog that you access might be encrypted to increase security. If the AWS Glue catalog is encrypted, you need permission to use the AWS KMS key for AWS Glue to access the AWS Glue Data Catalog. For more information, see [Encrypting Your AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/encrypt-glue-data-catalog.html) in the *[AWS Glue Developer Guide](https://docs.aws.amazon.com/glue/latest/dg/)*.

**Topics**
+ [

## Amazon S3 permissions
](#spectrum-iam-policies-s3)
+ [

## Cross-account Amazon S3 permissions
](#spectrum-iam-policies-cross-account)
+ [

## Policies to grant or restrict access using Redshift Spectrum
](#spectrum-iam-policies-spectrum-only)
+ [

## Policies to grant minimum permissions
](#spectrum-iam-policies-minimum-permissions)
+ [

## Chaining IAM roles in Amazon Redshift Spectrum
](#c-spectrum-chaining-roles)
+ [

## Controlling access to the AWS Glue Data Catalog
](#c-spectrum-glue-acess)

## Amazon S3 permissions
<a name="spectrum-iam-policies-s3"></a>

At a minimum, your cluster needs GET and LIST access to your Amazon S3 bucket. If your bucket is not in the same AWS account as your cluster, your bucket must also authorize your cluster to access the data. For more information, see [ Authorizing Amazon Redshift to Access Other AWS Services on Your Behalf](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html). 

**Note**  
The Amazon S3 bucket can't use a bucket policy that restricts access only from specific VPC endpoints.

The following policy grants GET and LIST access to any Amazon S3 bucket. The policy allows access to Amazon S3 buckets for Redshift Spectrum as well as COPY operations.

------
#### [ JSON ]

****  

```
{
	"Version":"2012-10-17",
	"Statement": [{
		"Effect": "Allow",
		"Action": ["s3:Get*", "s3:List*"],
		"Resource": "*"
	}]
}
```

------

The following policy grants GET and LIST access to your Amazon S3 bucket named `amzn-s3-demo-bucket`.

------
#### [ JSON ]

****  

```
{
	"Version":"2012-10-17",
	"Statement": [{
		"Effect": "Allow",
		"Action": ["s3:Get*", "s3:List*"],
		"Resource": [
			"arn:aws:s3:::amzn-s3-demo-bucket",
			"arn:aws:s3:::amzn-s3-demo-bucket/*"
		]
	}]
}
```

------

## Cross-account Amazon S3 permissions
<a name="spectrum-iam-policies-cross-account"></a>

To grant Redshift Spectrum permission to access data in an Amazon S3 bucket that belongs to another AWS account, add the following policy to the Amazon S3 bucket. For more information, see [Granting Cross-Account Bucket Permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html).

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Sid": "Example permissions",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/spectrumrole"
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": [
                "arn:aws:s3:::bucketname",
                "arn:aws:s3:::bucketname/*"
            ]
        }
    ]
}
```

------

## Policies to grant or restrict access using Redshift Spectrum
<a name="spectrum-iam-policies-spectrum-only"></a>

The following example policy allows access to Amazon S3 buckets for Amazon Redshift. When adding this policy, we recommend attaching it to an IAM role that is used only for Amazon Redshift. We also recommend that you don't reuse existing roles or add extra permissions to this role.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucketVersions",
                "s3:ListBucket"
            ],
            "Resource": [ 
                "arn:aws:s3:::amzn-s3-demo-bucket", 
                "arn:aws:s3:::amzn-s3-demo-bucket/*" 
            ]
        }
    ]
}
```

------

## Policies to grant minimum permissions
<a name="spectrum-iam-policies-minimum-permissions"></a>

The following policy grants the minimum permissions required to use Redshift Spectrum with Amazon S3, AWS Glue, and Athena. 

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucketVersions",
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::bucketname", 
                "arn:aws:s3:::bucketname/folder1/folder2/*" 
             ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": "*"
        }
    ]
}
```

------

If you use Athena for your data catalog instead of AWS Glue, the policy requires full Athena access. The following policy grants access to Athena resources. If your external database is in a Hive metastore, you don't need Athena access.

------
#### [ JSON ]

****  

```
{
	"Version":"2012-10-17",
	"Statement": [{
		"Effect": "Allow",
		"Action": ["athena:*"],
		"Resource": ["*"]
	}]
}
```

------

## Chaining IAM roles in Amazon Redshift Spectrum
<a name="c-spectrum-chaining-roles"></a>

When you attach a role to your cluster, your cluster can assume that role to access Amazon S3, Athena, and AWS Glue on your behalf. If a role attached to your cluster doesn't have access to the necessary resources, you can chain another role, possibly belonging to another account. Your cluster then temporarily assumes the chained role to access the data. You can also grant cross-account access by chaining roles. You can chain a maximum of 10 roles. Each role in the chain assumes the next role in the chain, until the cluster assumes the role at the end of the chain. 

To chain roles, you establish a trust relationship between the roles. A role that assumes another role must have a permissions policy that allows it to assume the specified role. In turn, the role that passes permissions must have a trust policy that allows it to pass its permissions to another role. For more information, see [Chaining IAM Roles in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html#authorizing-redshift-service-chaining-roles). 

When you run the CREATE EXTERNAL SCHEMA command, you can chain roles by including a comma-separated list of role ARNs. 

**Note**  
The list of chained roles must not include spaces.

In the following example, `MyRedshiftRole` is attached to the cluster. `MyRedshiftRole` assumes the role `AcmeData`, which belongs to account `111122223333`.

```
create external schema acme from data catalog 
database 'acmedb' region 'us-west-2' 
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole,arn:aws:iam::111122223333:role/AcmeData';
```
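Because the chained-role list is a single string and must not contain spaces, it can help to build and validate it programmatically. The following is a minimal sketch with a hypothetical helper, using the example role ARNs above:

```python
# Minimal sketch: build the comma-separated iam_role string for role
# chaining. The list must not contain spaces, so join and then verify.
def chain_iam_roles(role_arns):
    chained = ",".join(arn.strip() for arn in role_arns)
    if " " in chained:
        raise ValueError("chained role list must not contain spaces")
    return chained

roles = [
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
    "arn:aws:iam::111122223333:role/AcmeData",
]
print(chain_iam_roles(roles))
```

The first ARN in the list must be the role attached to the cluster; each subsequent role is assumed by the one before it.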

## Controlling access to the AWS Glue Data Catalog
<a name="c-spectrum-glue-acess"></a>

If you use AWS Glue for your data catalog, you can apply fine-grained access control to the AWS Glue Data Catalog with your IAM policy. For example, you might want to expose only a few databases and tables to a specific IAM role.

The following sections describe the IAM policies for various levels of access to data stored in the AWS Glue Data Catalog.

**Topics**
+ [

### Policy for database operations
](#c-spectrum-glue-acess-database)
+ [

### Policy for table operations
](#c-spectrum-glue-acess-tables)
+ [

### Policy for partition operations
](#c-spectrum-glue-acess-partitions)

### Policy for database operations
<a name="c-spectrum-glue-acess-database"></a>

If you want to give users permissions to view and create a database, they need access rights to both the database and the AWS Glue Data Catalog.

The following example query creates a database.

```
CREATE EXTERNAL SCHEMA example_db
FROM DATA CATALOG DATABASE 'example_db' region 'us-west-2' 
IAM_ROLE 'arn:aws:iam::redshift-account:role/spectrumrole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```

The following IAM policy gives the minimum permissions required for creating a database.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:database/example_db",
                "arn:aws:glue:us-west-2:111122223333:catalog"
            ]
        }
    ]
}
```

------

The following example query lists the current databases.

```
SELECT * FROM SVV_EXTERNAL_DATABASES WHERE
databasename = 'example_db1' or databasename = 'example_db2';
```

The following IAM policy gives the minimum permissions required to list the current databases.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabases"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:database/example_db1",
                "arn:aws:glue:us-west-2:111122223333:database/example_db2",
                "arn:aws:glue:us-west-2:111122223333:catalog"
            ]
        }
    ]
}
```

------

### Policy for table operations
<a name="c-spectrum-glue-acess-tables"></a>

If you want to give users permissions to view, create, drop, alter, or take other actions on tables, they need access to the tables themselves, to the databases that the tables belong to, and to the catalog.

The following example query creates an external table.

```
CREATE EXTERNAL TABLE example_db.example_tbl0(
    col0 INT,
    col1 VARCHAR(255)
) PARTITIONED BY (part INT) STORED AS TEXTFILE
LOCATION 's3://test/s3/location/';
```

The following IAM policy gives the minimum permissions required to create an external table.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:catalog",
                "arn:aws:glue:us-west-2:111122223333:database/example_db",
                "arn:aws:glue:us-west-2:111122223333:table/example_db/example_tbl0"
            ]
        }
    ]
}
```

------

Each of the following example queries lists the current external tables.

```
SELECT * FROM svv_external_tables
WHERE tablename = 'example_tbl0' OR
tablename = 'example_tbl1';
```

```
SELECT * FROM svv_external_columns
WHERE tablename = 'example_tbl0' OR
tablename = 'example_tbl1';
```

```
SELECT parameters FROM svv_external_tables
WHERE tablename = 'example_tbl0' OR
tablename = 'example_tbl1';
```

The following IAM policy gives the minimum permissions required to list the current external tables.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTables"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:catalog",
                "arn:aws:glue:us-west-2:111122223333:database/example_db",
                "arn:aws:glue:us-west-2:111122223333:table/example_db/example_tbl0",
                "arn:aws:glue:us-west-2:111122223333:table/example_db/example_tbl1"
            ]
        }
    ]
}
```

------

The following example query alters an existing table.

```
ALTER TABLE example_db.example_tbl0
SET TABLE PROPERTIES ('numRows' = '100');
```

The following IAM policy gives the minimum permissions required to alter an existing table.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:catalog",
                "arn:aws:glue:us-west-2:111122223333:database/example_db",
                "arn:aws:glue:us-west-2:111122223333:table/example_db/example_tbl0"
            ]
        }
    ]
}
```

------

The following example query drops an existing table.

```
DROP TABLE example_db.example_tbl0;                       
```

The following IAM policy gives the minimum permissions required to drop an existing table.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:catalog",
                "arn:aws:glue:us-west-2:111122223333:database/example_db",
                "arn:aws:glue:us-west-2:111122223333:table/example_db/example_tbl0"
            ]
        }
    ]
}
```

------

### Policy for partition operations
<a name="c-spectrum-glue-acess-partitions"></a>

If you want to give users permissions to perform partition-level operations (view, create, drop, alter, and so on), they need permissions to the tables that the partitions belong to. They also need permissions to the related databases and the AWS Glue Data Catalog.

The following example query creates a partition.

```
ALTER TABLE example_db.example_tbl0
ADD PARTITION (part=0) LOCATION 's3://test/s3/location/part=0/';
ALTER TABLE example_db.example_tbl0
ADD PARTITION (part=1) LOCATION 's3://test/s3/location/part=1/';
```

The following IAM policy gives the minimum permissions required to create a partition.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:BatchCreatePartition"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:catalog",
                "arn:aws:glue:us-west-2:111122223333:database/example_db",
                "arn:aws:glue:us-west-2:111122223333:table/example_db/example_tbl0"
            ]
        }
    ]
}
```

------

The following example query lists the current partitions.

```
SELECT * FROM svv_external_partitions
WHERE schemaname = 'example_db' AND
tablename = 'example_tbl0';
```

The following IAM policy gives the minimum permissions required to list the current partitions.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetPartitions",
                "glue:GetTables",
                "glue:GetTable"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:catalog",
                "arn:aws:glue:us-west-2:111122223333:database/example_db",
                "arn:aws:glue:us-west-2:111122223333:table/example_db/example_tbl0"
            ]
        }
    ]
}
```

------

The following example query alters an existing partition.

```
ALTER TABLE example_db.example_tbl0 PARTITION(part='0')
SET LOCATION 's3://test/s3/new/location/part=0/';
```

The following IAM policy gives the minimum permissions required to alter an existing partition.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetPartition",
                "glue:UpdatePartition"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:catalog",
                "arn:aws:glue:us-west-2:111122223333:database/example_db",
                "arn:aws:glue:us-west-2:111122223333:table/example_db/example_tbl0"
            ]
        }
    ]
}
```

------

The following example query drops an existing partition.

```
ALTER TABLE example_db.example_tbl0 DROP PARTITION(part='0');               
```

The following IAM policy gives the minimum permissions required to drop an existing partition.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:DeletePartition"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:111122223333:catalog",
                "arn:aws:glue:us-west-2:111122223333:database/example_db",
                "arn:aws:glue:us-west-2:111122223333:table/example_db/example_tbl0"
            ]
        }
    ]
}
```

------

# Redshift Spectrum and AWS Lake Formation
<a name="spectrum-lake-formation"></a>

This topic describes how to use Redshift Spectrum with Lake Formation. Lake Formation is a service that you can use to set up, secure, and centrally manage access to the data in your data lake.

You can use AWS Lake Formation to centrally define and enforce database, table, and column-level access policies to data stored in Amazon S3. After your data is registered with an AWS Glue Data Catalog enabled with Lake Formation, you can query it by using several services, including Redshift Spectrum. 

Lake Formation provides the security and governance of the Data Catalog. Within Lake Formation, you can grant and revoke permissions to the Data Catalog objects, such as databases, tables, columns, and underlying Amazon S3 storage.

**Important**  
You can only use Redshift Spectrum with a Lake Formation enabled Data Catalog in AWS Regions where Lake Formation is available. For a list of available Regions, see [AWS Lake Formation endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/lake-formation.html) in the *AWS General Reference*.

By using Redshift Spectrum with Lake Formation, you can do the following:
+ Use Lake Formation as a centralized place where you grant and revoke permissions and access control policies on all of your data in the data lake. Lake Formation provides a hierarchy of permissions to control access to databases and tables in a Data Catalog. For more information, see [Overview of Lake Formation permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/lake-formation-permissions.html) in the *AWS Lake Formation Developer Guide*.
+ Create external tables and run queries on data in the data lake. Before users in your account can run queries, a data lake account administrator registers your existing Amazon S3 paths containing source data with Lake Formation. The administrator also creates tables and grants permissions to your users. Access can be granted on databases, tables, or columns. The administrator can use data filters in Lake Formation to grant granular access control over your sensitive data stored in Amazon S3. For more information, see [Using data filters for row-level and cell-level security](#spectrum-lake-formation-data-filters).

  After the data is registered in the Data Catalog, each time users try to run queries, Lake Formation verifies access to the table for that specific principal. Lake Formation vends temporary credentials to Redshift Spectrum, and the query runs. 
+ Run Redshift Spectrum queries against an automounted AWS Glue Data Catalog using IAM credentials obtained with `GetCredentials` or `GetClusterCredentials`, and manage Lake Formation permissions by database user (IAMR:username or IAM:username).

When you use Redshift Spectrum with a Data Catalog enabled for Lake Formation, one of the following must be in place:
+ An IAM role associated with the cluster that has permission to the Data Catalog.
+ A federated IAM identity configured to manage access to external resources. For more information, see [Using a federated identity to manage Amazon Redshift access to local resources and Amazon Redshift external tables](https://docs.aws.amazon.com/redshift/latest/mgmt/authorization-fas-spectrum.html).

**Important**  
You can't chain IAM roles when using Redshift Spectrum with a Data Catalog enabled for Lake Formation.

To learn more about the steps required to set up AWS Lake Formation to use with Redshift Spectrum, see [Tutorial: Creating a data lake from a JDBC source in Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/getting-started-tutorial-jdbc.html) in the *AWS Lake Formation Developer Guide*. Specifically, see [Query the data in the data lake using Amazon Redshift Spectrum](https://docs.aws.amazon.com/lake-formation/latest/dg/tut-query-redshift.html) for details about integration with Redshift Spectrum. The data and AWS resources used in this topic depend on previous steps in the tutorial.

## Using data filters for row-level and cell-level security
<a name="spectrum-lake-formation-data-filters"></a>

You can define data filters in AWS Lake Formation to control your Redshift Spectrum queries' row-level and cell-level access to data defined in your Data Catalog. To set this up, you perform the following tasks:
+ Create a data filter in Lake Formation with the following information:
  + A column specification with a list of columns to include or exclude from query results.
  + A row filter expression that specifies the rows to include in the query results.

  For more information about how to create a data filter, see [Data filters in Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/data-filters-about.html) in the *AWS Lake Formation Developer Guide*.
+ Create an external table in Amazon Redshift that references a table in your Lake Formation enabled Data Catalog. For details on how to query a Lake Formation table using Redshift Spectrum, see [Query the data in the data lake using Amazon Redshift Spectrum](https://docs.aws.amazon.com/lake-formation/latest/dg/tut-query-redshift.html) in the *AWS Lake Formation Developer Guide*.

After the table is defined in Amazon Redshift, you can query the Lake Formation table and access only the rows and columns that are allowed by the data filter.

For a detailed guide on how to set up row-level and cell-level security in Lake Formation, and then query using Redshift Spectrum, see [Use Amazon Redshift Spectrum with row-level and cell-level security policies defined in AWS Lake Formation](https://aws.amazon.com/blogs/big-data/use-amazon-redshift-spectrum-with-row-level-and-cell-level-security-policies-defined-in-aws-lake-formation/).

# Data files for queries in Amazon Redshift Spectrum
<a name="c-spectrum-data-files"></a>

This section describes how to create data files in Amazon S3 in a format that Redshift Spectrum supports.

The data files that you use for queries in Amazon Redshift Spectrum are commonly the same types of files that you use for other applications. For example, the same types of files are used with Amazon Athena, Amazon EMR, and Amazon QuickSight. You can query the data in its original format directly from Amazon S3. To do this, the data files must be in a format that Redshift Spectrum supports and be located in an Amazon S3 bucket that your cluster can access. 

The Amazon S3 bucket with the data files and the Amazon Redshift cluster must be in the same AWS Region. For information about supported AWS Regions, see [Amazon Redshift Spectrum Regions](c-spectrum-overview.md#c-spectrum-regions).

## Data formats for Redshift Spectrum
<a name="c-spectrum-data-files-formats"></a>

Redshift Spectrum supports the following structured and semi-structured data formats.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html)

In the preceding table, the headings indicate the following:
+ **Columnar** – Whether the file format physically stores data in a column-oriented structure as opposed to a row-oriented one.
+ **Supports parallel reads** – Whether the file format supports reading individual blocks within the file. Reading individual blocks enables the distributed processing of a file across multiple independent Redshift Spectrum requests instead of having to read the full file in a single request.
+ **Split unit** – For file formats that can be read in parallel, the split unit is the smallest chunk of data that a single Redshift Spectrum request can process.

**Note**  
Timestamp values in text files must be in the format `yyyy-MM-dd HH:mm:ss.SSSSSS`, as the following timestamp value shows: `2017-05-01 11:30:59.000000`.

We recommend using a columnar storage file format, such as Apache Parquet. Amazon Redshift supports the Apache Parquet v1 data file format. With a columnar storage file format, you can minimize data transfer out of Amazon S3 by selecting only the columns that you need. 
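As an illustration, the following hypothetical query against an external table stored as Parquet (the table name `spectrum.sales` is assumed here; the table itself is defined later in this guide) names only the columns it needs, so only those columns are read from Amazon S3.

```
-- Only the eventid and pricepaid columns are transferred from Amazon S3
-- when the underlying files use a columnar format such as Parquet.
SELECT eventid, SUM(pricepaid) AS total_paid
FROM spectrum.sales
GROUP BY eventid;
```

A `SELECT *` against the same table would instead scan every column in the files.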

## Compression types for Redshift Spectrum
<a name="c-spectrum-data-files-compression"></a>

To reduce storage space, improve performance, and minimize costs, we strongly recommend that you compress your data files. Redshift Spectrum recognizes file compression types based on the file extension.

Redshift Spectrum supports the following compression types and extensions.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html)

Amazon Redshift also supports Zstandard (zstd) compression for Apache Parquet files and Apache Iceberg tables.

You can apply compression at different levels. Most commonly, you compress a whole file or compress individual blocks within a file. Compressing columnar formats at the file level doesn't yield performance benefits. 

For Redshift Spectrum to be able to read a file in parallel, the following must be true:
+ The file format supports parallel reads.
+ The file-level compression, if any, supports parallel reads.

It doesn't matter whether the individual split units within a file are compressed using a compression algorithm that can be read in parallel, because each split unit is processed by a single Redshift Spectrum request. An example of this is Snappy-compressed Parquet files. Individual row groups within the Parquet file are compressed using Snappy, but the top-level structure of the file remains uncompressed. In this case, the file can be read in parallel because each Redshift Spectrum request can read and process individual row groups from Amazon S3.

## Encryption for Redshift Spectrum
<a name="c-spectrum-data-files-encryption"></a>

Redshift Spectrum transparently decrypts data files that are encrypted using the following encryption options:
+ Server-side encryption (SSE-S3) using an AES-256 encryption key managed by Amazon S3.
+ Server-side encryption with keys managed by AWS Key Management Service (SSE-KMS). 

Redshift Spectrum doesn't support Amazon S3 client-side encryption. For more information on server-side encryption, see [Protecting Data Using Server-Side Encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html) in the *Amazon Simple Storage Service User Guide*.

Amazon Redshift uses massively parallel processing (MPP) to achieve fast execution of complex queries operating on large amounts of data. Redshift Spectrum extends the same principle to query external data, using multiple Redshift Spectrum instances as needed to scan files. Place the files in a separate folder for each table. 

You can optimize your data for parallel processing by doing the following:
+ If your file format or compression doesn't support reading in parallel, break large files into many smaller files. We recommend using file sizes between 64 MB and 1 GB.
+ Keep all the files about the same size. If some files are much larger than others, Redshift Spectrum can't distribute the workload evenly. 

# External schemas in Amazon Redshift Spectrum
<a name="c-spectrum-external-schemas"></a>

This topic describes how to create and use external schemas with Redshift Spectrum. External schemas are collections of tables that you use as references to access data outside your Amazon Redshift cluster. These tables contain metadata about the external data that Redshift Spectrum reads.

All external tables must be created in an external schema, which you create using a [CREATE EXTERNAL SCHEMA](r_CREATE_EXTERNAL_SCHEMA.md) statement. 

**Note**  
Some applications use the term *database* and *schema* interchangeably. In Amazon Redshift, we use the term *schema*. 

An Amazon Redshift external schema references an external database in an external data catalog. You can create the external database in Amazon Redshift, in [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/catalog.html), in [AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/components-overview.html#data-catalog-intro), or in an Apache Hive metastore, such as [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). If you create an external database in Amazon Redshift, the database resides in the Athena Data Catalog. To create a database in a Hive metastore, you need to create the database in your Hive application.

Amazon Redshift needs authorization to access the Data Catalog in Athena and the data files in Amazon S3 on your behalf. To provide that authorization, you first create an AWS Identity and Access Management (IAM) role. Then you attach the role to your cluster and provide the Amazon Resource Name (ARN) for the role in the Amazon Redshift `CREATE EXTERNAL SCHEMA` statement. For more information about authorization, see [IAM policies for Amazon Redshift Spectrum](c-spectrum-iam-policies.md).

To create an external database at the same time you create an external schema, specify `FROM DATA CATALOG` and include the `CREATE EXTERNAL DATABASE` clause in your `CREATE EXTERNAL SCHEMA` statement. 

The following example creates an external schema named `spectrum_schema` using the external database `spectrum_db`.

```
create external schema spectrum_schema from data catalog 
database 'spectrum_db' 
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;
```

If you manage your data catalog using Athena, specify the Athena database name and the AWS Region in which the Athena Data Catalog is located. 

The following example creates an external schema using the default `sampledb` database in the Athena Data Catalog.

```
create external schema athena_schema from data catalog 
database 'sampledb' 
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole' 
region 'us-east-2';
```

**Note**  
The `region` parameter references the AWS Region in which the Athena Data Catalog is located, not the location of the data files in Amazon S3.

If you manage your data catalog using a Hive metastore, such as Amazon EMR, your security groups must be configured to allow traffic between the clusters. 

In the CREATE EXTERNAL SCHEMA statement, specify `FROM HIVE METASTORE` and include the metastore's URI and port number. The following example creates an external schema using a Hive metastore database named `hive_db`.

```
create external schema hive_schema
from hive metastore
database 'hive_db'
uri '172.10.10.10' port 99
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole';
```

To view external schemas for your cluster, query the PG\_EXTERNAL\_SCHEMA catalog table or the SVV\_EXTERNAL\_SCHEMAS view. The following example queries SVV\_EXTERNAL\_SCHEMAS, which joins PG\_EXTERNAL\_SCHEMA and PG\_NAMESPACE.

```
select * from svv_external_schemas;
```

For the full command syntax and examples, see [CREATE EXTERNAL SCHEMA](r_CREATE_EXTERNAL_SCHEMA.md).

## Working with external catalogs in Amazon Redshift Spectrum
<a name="c-spectrum-external-catalogs"></a>

The metadata for Amazon Redshift Spectrum external databases and external tables is stored in an external data catalog. By default, Redshift Spectrum metadata is stored in an Athena Data Catalog. You can view and manage Redshift Spectrum databases and tables in your Athena console. 

You can also create and manage external databases and external tables using Hive data definition language (DDL) using Athena or a Hive metastore, such as Amazon EMR. 

**Note**  
We recommend using Amazon Redshift to create and manage external databases and external tables in Redshift Spectrum.

### Viewing Redshift Spectrum databases in Athena and AWS Glue
<a name="c-spectrum-athena-external-catalogs"></a>

You can create an external database by including the CREATE EXTERNAL DATABASE IF NOT EXISTS clause as part of your CREATE EXTERNAL SCHEMA statement. In such cases, the external database metadata is stored in your Data Catalog. The metadata for external tables that you create qualified by the external schema is also stored in your Data Catalog. 

Athena and AWS Glue maintain a Data Catalog for each supported AWS Region. To view table metadata, log on to the Athena or AWS Glue console. In Athena, choose **Data sources**, choose your AWS Glue Data Catalog, and then view the details of your database. In AWS Glue, choose **Databases**, choose your external database, and then view the details of your database. 

If you create and manage your external tables using Athena, register the database using CREATE EXTERNAL SCHEMA. For example, the following command registers the Athena database named `sampledb`.

```
create external schema athena_sample
from data catalog
database 'sampledb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
region 'us-east-1';
```

When you query the SVV\_EXTERNAL\_TABLES system view, you see tables in the Athena `sampledb` database and also tables that you created in Amazon Redshift.

```
select * from svv_external_tables;
```

```
schemaname    | tablename        | location                                               
--------------+------------------+--------------------------------------------------------
athena_sample | elb_logs         | s3://athena-examples/elb/plaintext           
athena_sample | lineitem_1t_csv  | s3://myspectrum/tpch/1000/lineitem_csv                
athena_sample | lineitem_1t_part | s3://myspectrum/tpch/1000/lineitem_partition          
spectrum      | sales            | s3://redshift-downloads/tickit/spectrum/sales          
spectrum      | sales_part       | s3://redshift-downloads/tickit/spectrum/sales_part
```

### Registering an Apache Hive metastore database
<a name="c-spectrum-hive-metastore"></a>

If you create external tables in an Apache Hive metastore, you can use CREATE EXTERNAL SCHEMA to register those tables in Redshift Spectrum. 

In the CREATE EXTERNAL SCHEMA statement, specify the FROM HIVE METASTORE clause and provide the Hive metastore URI and port number. The IAM role must include permission to access Amazon S3 but doesn't need any Athena permissions. The following example registers a Hive metastore. 

```
create external schema if not exists hive_schema
from hive metastore
database 'hive_database'
uri 'ip-10-0-111-111.us-west-2.compute.internal' port 9083 
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole';
```

### Enabling your Amazon Redshift cluster to access your Amazon EMR cluster
<a name="c-spectrum-enabling-emr-access"></a>

If your Hive metastore is in Amazon EMR, you must give your Amazon Redshift cluster access to your Amazon EMR cluster. To do so, you create an Amazon EC2 security group. You then allow all inbound traffic to the EC2 security group from your Amazon Redshift cluster's security group and your Amazon EMR cluster's security group. Then you add the EC2 security group to both your Amazon Redshift cluster and your Amazon EMR cluster.

**View your Amazon Redshift cluster's security group name**

To display the security group, do the following:

1. Sign in to the AWS Management Console and open the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

1. On the navigation menu, choose **Clusters**, then choose the cluster from the list to open its details.

1. Choose **Properties** and view the **Network and security settings** section. 

1. Find your security group in **VPC security group** and take note of it. 


**View the Amazon EMR master node security group name**

1. Open your Amazon EMR cluster. For more information, see [Use security configurations to set up cluster security](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-security-configurations.html) in the *Amazon EMR Management Guide*.

1. Under **Security and access**, make a note of the Amazon EMR master node security group name.  
![\[A screenshot highlighting the Amazon EMR master node security group name in the Amazon EMR console.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/spectrum-emr-security-groups.png)

**To create or modify an Amazon EC2 security group to allow connection between Amazon Redshift and Amazon EMR**

1. In the Amazon EC2 dashboard, choose **Security groups**. For more information, see [Security group rules](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules.html) in the *Amazon EC2 User Guide*. 

1. Choose **Create security group**. 

1. If you are using a VPC, choose the VPC that your Amazon Redshift and Amazon EMR clusters are in. 

1. Add an inbound rule. 

   1. For **Type**, choose **Custom TCP**. 

   1. For **Source**, choose **Custom**. 

   1. Enter the name of your Amazon Redshift security group. 

1. Add another inbound rule. 

   1. For **Type**, choose **TCP**. 

   1. For **Port Range**, enter **9083**.
**Note**  
The default port for an EMR HMS is 9083. If your HMS uses a different port, specify that port in the inbound rule and in the external schema definition. 

   1. For **Source**, choose **Custom**. 

1. Enter a security group name and description. 

1. Choose **Create security group**. 

**To add the Amazon EC2 security group you created in the previous procedure to your Amazon Redshift cluster**

1. In Amazon Redshift, choose your cluster. 

1. Choose **Properties**. 

1. View the **Network and security settings** and choose **Edit**. 

1. In **VPC security group**, choose the new security group name. 

1. Choose **Save changes**. 

**To add the Amazon EC2 security group to your Amazon EMR cluster**

1. In Amazon EMR, choose your cluster. For more information, see [Use security configurations to set up cluster security](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-security-configurations.html) in the *Amazon EMR Management Guide*.

1. Under **Hardware**, choose the link for the Master node. 

1. Choose the link in the **EC2 instance ID** column.   
![\[A screenshot highlighting an Amazon EC2 instance ID value in the Amazon EMR console.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/spectrum-emr-add-security-group.png)

1. For **Actions**, choose **Security**, **Change security groups**. 

1. In **Associated security groups**, choose the new security group, and choose **Add security group**. 

1. Choose **Save**. 

# External tables for Redshift Spectrum
<a name="c-spectrum-external-tables"></a>

This topic describes how to create and use external tables with Redshift Spectrum. External tables are tables that you use as references to access data outside your Amazon Redshift cluster. These tables contain metadata about the external data that Redshift Spectrum reads.



You create an external table in an external schema. To create external tables, you must be the owner of the external schema or a superuser. To transfer ownership of an external schema, use [ALTER SCHEMA](r_ALTER_SCHEMA.md) to change the owner. The following example changes the owner of the `spectrum_schema` schema to `newowner`.

```
alter schema spectrum_schema owner to newowner;
```

To run a Redshift Spectrum query, you need the following permissions:
+ Usage permission on the schema 
+ Permission to create temporary tables in the current database 

The following example grants usage permission on the schema `spectrum_schema` to the `spectrumusers` user group.

```
grant usage on schema spectrum_schema to group spectrumusers;
```

The following example grants temporary permission on the database `spectrumdb` to the `spectrumusers` user group. 

```
grant temp on database spectrumdb to group spectrumusers;
```

You can create an external table in Amazon Redshift, AWS Glue, Amazon Athena, or an Apache Hive metastore. For more information, see [Getting Started Using AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/getting-started.html) in the *AWS Glue Developer Guide*, [Getting Started](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html) in the *Amazon Athena User Guide*, or [Apache Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) in the *Amazon EMR Developer Guide*. 

If your external table is defined in AWS Glue, Athena, or a Hive metastore, you first create an external schema that references the external database. Then you can reference the external table in your SELECT statement by prefixing the table name with the schema name, without needing to create the table in Amazon Redshift. For more information, see [External schemas in Amazon Redshift Spectrum](c-spectrum-external-schemas.md). 

To allow Amazon Redshift to view tables in the AWS Glue Data Catalog, add `glue:GetTable` to the Amazon Redshift IAM role. Otherwise you might get an error similar to the following.

```
RedshiftIamRoleSession is not authorized to perform: glue:GetTable on resource: *;
```
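
As a sketch, the policy statement attached to your Amazon Redshift IAM role might look like the following. The wildcard resource is shown for brevity and is an assumption here; scope it to your specific catalog, database, and table ARNs in production.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:GetTable",
            "Resource": "*"
        }
    ]
}
```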

For example, suppose that you have an external table named `lineitem_athena` defined in an Athena external catalog. In this case, you can define an external schema named `athena_schema`, then query the table using the following SELECT statement.

```
select count(*) from athena_schema.lineitem_athena;
```

To define an external table in Amazon Redshift, use the [CREATE EXTERNAL TABLE](r_CREATE_EXTERNAL_TABLE.md) command. The external table statement defines the table columns, the format of your data files, and the location of your data in Amazon S3. Redshift Spectrum scans the files in the specified folder and any subfolders. Redshift Spectrum ignores hidden files and files that begin with a period, underscore, or hash mark ( . , _, or # ) or end with a tilde ( ~ ). 

The following example creates a table named SALES in the Amazon Redshift external schema named `spectrum`. The data is in tab-delimited text files.

```
create external table spectrum.sales(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://redshift-downloads/tickit/spectrum/sales/'
table properties ('numRows'='172000');
```

To view external tables, query the [SVV\_EXTERNAL\_TABLES](r_SVV_EXTERNAL_TABLES.md) system view. 
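
For example, the following query lists the external tables registered in the `spectrum` schema. The column list is abbreviated; the view contains additional columns.

```
select schemaname, tablename, location
from svv_external_tables
where schemaname = 'spectrum';
```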

## Pseudocolumns
<a name="c-spectrum-external-tables-pseudocolumns"></a>

By default, Amazon Redshift creates external tables with the pseudocolumns `$path`, `$size`, and `$spectrum_oid`. Select the `$path` column to view the path to the data files on Amazon S3, and select the `$size` column to view the size of the data files for each row returned by a query. The `$spectrum_oid` column provides the ability to perform correlated queries with Redshift Spectrum. For an example, see [Example: Performing correlated subqueries in Redshift Spectrum](c_performing-correlated-subqueries-spectrum.md). You must delimit the `$path`, `$size`, and `$spectrum_oid` column names with double quotation marks. A SELECT \* clause doesn't return the pseudocolumns. You must explicitly include the `$path`, `$size`, and `$spectrum_oid` column names in your query, as the following example shows.

```
select "$path", "$size", "$spectrum_oid"
from spectrum.sales_part where saledate = '2008-12-01';
```

You can disable the creation of pseudocolumns for a session by setting the `spectrum_enable_pseudo_columns` configuration parameter to `false`. For more information, see [spectrum\_enable\_pseudo\_columns](r_spectrum_enable_pseudo_columns.md). You can also disable only the `$spectrum_oid` pseudocolumn by setting the `enable_spectrum_oid` configuration parameter to `false`. For more information, see [enable\_spectrum\_oid](r_spectrum_enable_spectrum_oid.md). However, disabling the `$spectrum_oid` pseudocolumn also disables support for correlated queries with Redshift Spectrum.
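
For example, to turn off all pseudocolumns for the current session, you can run a command like the following.

```
set spectrum_enable_pseudo_columns to false;
```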

**Important**  
Selecting `$size`, `$path`, or `$spectrum_oid` incurs charges because Redshift Spectrum scans the data files on Amazon S3 to determine the size of the result set. For more information, see [Amazon Redshift Pricing](https://aws.amazon.com/redshift/pricing/).

### Pseudocolumns example
<a name="c-spectrum-external-tables-pseudocolumns-example"></a>

The following example returns the total size of related data files for an external table.

```
select distinct "$path", "$size"
from spectrum.sales_part;

 $path                                                                    | $size
--------------------------------------------------------------------------+-------
s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01/ |  1616
s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/ |  1444
s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03/ |  1644
```

## Partitioning Redshift Spectrum external tables
<a name="c-spectrum-external-tables-partitioning"></a>

When you partition your data, you can restrict the amount of data that Redshift Spectrum scans by filtering on the partition key. You can partition your data by any key. 

A common practice is to partition the data based on time. For example, you might choose to partition by year, month, date, and hour. If you have data coming from multiple sources, you might partition by a data source identifier and date. 

The following procedure describes how to partition your data.

**To partition your data**

1. Store your data in folders in Amazon S3 according to your partition key. 

   Create one folder for each partition value and name the folder with the partition key and value. For example, if you partition by date, you might have folders named `saledate=2017-04-01`, `saledate=2017-04-02`, and so on. Redshift Spectrum scans the files in the partition folder and any subfolders. Redshift Spectrum ignores hidden files and files that begin with a period, underscore, or hash mark ( . , _, or # ) or end with a tilde ( ~ ). 

1. Create an external table and specify the partition key in the PARTITIONED BY clause. 

   The partition key can't be the name of a table column. The data type can be SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION, BOOLEAN, CHAR, VARCHAR, DATE, or TIMESTAMP. 

1. Add the partitions. 

   Using [ALTER TABLE](r_ALTER_TABLE.md) … ADD PARTITION, add each partition, specifying the partition column and key value, and the location of the partition folder in Amazon S3. You can add multiple partitions in a single ALTER TABLE … ADD statement. The following example adds partitions for `'2008-01'` and `'2008-03'`.

   ```
   alter table spectrum.sales_part add
   partition(saledate='2008-01-01') 
   location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01/'
   partition(saledate='2008-03-01') 
   location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03/';
   ```
**Note**  
If you use the AWS Glue catalog, you can add up to 100 partitions using a single ALTER TABLE statement.

### Partitioning data examples
<a name="c-spectrum-external-tables-partitioning-example"></a>

In this example, you create an external table that is partitioned by a single partition key and an external table that is partitioned by two partition keys.

The sample data for this example is located in an Amazon S3 bucket that gives read access to all authenticated AWS users. Your cluster and your external data files must be in the same AWS Region. The sample data bucket is in the US East (N. Virginia) Region (us-east-1). To access the data using Redshift Spectrum, your cluster must also be in us-east-1. To list the folders in Amazon S3, run the following command.

```
aws s3 ls s3://redshift-downloads/tickit/spectrum/sales_partition/
```

```
PRE saledate=2008-01/
PRE saledate=2008-03/
PRE saledate=2008-04/
PRE saledate=2008-05/
PRE saledate=2008-06/
PRE saledate=2008-12/
```

If you don't already have an external schema, run the following command. Substitute the Amazon Resource Name (ARN) for your AWS Identity and Access Management (IAM) role.

```
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/myspectrumrole'
create external database if not exists;
```

#### Example 1: Partitioning with a single partition key
<a name="c-spectrum-external-tables-single-partition-example"></a>

In the following example, you create an external table that is partitioned by month. To create the table, run the following command.

```
create external table spectrum.sales_part(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
partitioned by (saledate char(10))
row format delimited
fields terminated by '|'
stored as textfile
location 's3://redshift-downloads/tickit/spectrum/sales_partition/'
table properties ('numRows'='172000');
```

To add the partitions, run the following ALTER TABLE command.

```
alter table spectrum.sales_part add
partition(saledate='2008-01') 
location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01/'

partition(saledate='2008-03') 
location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03/'

partition(saledate='2008-04') 
location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04/';
```

To select data from the partitioned table, run the following query.

```
select top 5 spectrum.sales_part.eventid, sum(spectrum.sales_part.pricepaid) 
from spectrum.sales_part, event
where spectrum.sales_part.eventid = event.eventid
  and spectrum.sales_part.pricepaid > 30
  and saledate = '2008-01'
group by spectrum.sales_part.eventid
order by 2 desc;
```

```
eventid | sum     
--------+---------
   4124 | 21179.00
   1924 | 20569.00
   2294 | 18830.00
   2260 | 17669.00
   6032 | 17265.00
```

To view external table partitions, query the [SVV\_EXTERNAL\_PARTITIONS](r_SVV_EXTERNAL_PARTITIONS.md) system view.

```
select schemaname, tablename, values, location from svv_external_partitions
where tablename = 'sales_part';
```

```
schemaname | tablename  | values      | location                                                                
-----------+------------+-------------+-------------------------------------------------------------------------
spectrum   | sales_part | ["2008-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01
spectrum   | sales_part | ["2008-03"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03
spectrum   | sales_part | ["2008-04"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04
```

#### Example 2: Partitioning with multiple partition keys
<a name="c-spectrum-external-tables-multi-partition-example"></a>

To create an external table partitioned by `date` and `eventid`, run the following command.

```
create external table spectrum.sales_event(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
partitioned by (salesmonth char(10), event integer)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://redshift-downloads/tickit/spectrum/salesevent/'
table properties ('numRows'='172000');
```

To add the partitions, run the following ALTER TABLE command.

```
alter table spectrum.sales_event add
partition(salesmonth='2008-01', event='101') 
location 's3://redshift-downloads/tickit/spectrum/salesevent/salesmonth=2008-01/event=101/'

partition(salesmonth='2008-01', event='102') 
location 's3://redshift-downloads/tickit/spectrum/salesevent/salesmonth=2008-01/event=102/'

partition(salesmonth='2008-01', event='103') 
location 's3://redshift-downloads/tickit/spectrum/salesevent/salesmonth=2008-01/event=103/'

partition(salesmonth='2008-02', event='101') 
location 's3://redshift-downloads/tickit/spectrum/salesevent/salesmonth=2008-02/event=101/'

partition(salesmonth='2008-02', event='102') 
location 's3://redshift-downloads/tickit/spectrum/salesevent/salesmonth=2008-02/event=102/'

partition(salesmonth='2008-02', event='103') 
location 's3://redshift-downloads/tickit/spectrum/salesevent/salesmonth=2008-02/event=103/'

partition(salesmonth='2008-03', event='101') 
location 's3://redshift-downloads/tickit/spectrum/salesevent/salesmonth=2008-03/event=101/'

partition(salesmonth='2008-03', event='102') 
location 's3://redshift-downloads/tickit/spectrum/salesevent/salesmonth=2008-03/event=102/'

partition(salesmonth='2008-03', event='103') 
location 's3://redshift-downloads/tickit/spectrum/salesevent/salesmonth=2008-03/event=103/';
```

Run the following query to select data from the partitioned table.

```
select spectrum.sales_event.salesmonth, event.eventname, sum(spectrum.sales_event.pricepaid) 
from spectrum.sales_event, event
where spectrum.sales_event.eventid = event.eventid
  and salesmonth = '2008-02'
  and (event = '101'
  or event = '102'
  or event = '103')
group by event.eventname, spectrum.sales_event.salesmonth
order by 3 desc;
```

```
salesmonth | eventname       | sum    
-----------+-----------------+--------
2008-02    | The Magic Flute | 5062.00
2008-02    | La Sonnambula   | 3498.00
2008-02    | Die Walkure     |  534.00
```

## Mapping external table columns to ORC columns
<a name="c-spectrum-column-mapping-orc"></a>

You use Amazon Redshift Spectrum external tables to query data from files in ORC format. Optimized row columnar (ORC) format is a columnar storage file format that supports nested data structures. For more information about querying nested data, see [Querying Nested Data with Amazon Redshift Spectrum](tutorial-query-nested-data.md#tutorial-nested-data-overview). 

When you create an external table that references data in an ORC file, you map each column in the external table to a column in the ORC data. To do so, you use one of the following methods:
+ [Mapping by position](#orc-mapping-by-position)
+ [Mapping by column name](#orc-mapping-by-name) 

Mapping by column name is the default. 

### Mapping by position
<a name="orc-mapping-by-position"></a>

With position mapping, the first column defined in the external table maps to the first column in the ORC data file, the second to the second, and so on. Mapping by position requires that the order of columns in the external table and in the ORC file match. If the order of the columns doesn't match, then you can map the columns by name. 

**Important**  
In earlier releases, Redshift Spectrum used position mapping by default. If you need to continue using position mapping for existing tables, set the table property `orc.schema.resolution` to `position`, as the following example shows.   

```
alter table spectrum.orc_example 
set table properties('orc.schema.resolution'='position');
```

For example, the table `SPECTRUM.ORC_EXAMPLE` is defined as follows. 

```
create external table spectrum.orc_example(
int_col int,
float_col float,
nested_col struct<
  "int_col" : int,
  "map_col" : map<int, array<float >>
   >
) stored as orc
location 's3://example/orc/files/';
```

The table structure can be abstracted as follows. 

```
• 'int_col' : int
• 'float_col' : float
• 'nested_col' : struct
   o 'int_col' : int
   o 'map_col' : map
      - key : int
      - value : array
         - value : float
```

The underlying ORC file has the following file structure.

```
• ORC file root(id = 0)
   o 'int_col' : int (id = 1)
   o 'float_col' : float (id = 2)
   o 'nested_col' : struct (id = 3)
      - 'int_col' : int (id = 4)
      - 'map_col' : map (id = 5)
         - key : int (id = 6)
         - value : array (id = 7)
            - value : float (id = 8)
```

In this example, you can map each column in the external table to a column in ORC file strictly by position. The following shows the mapping.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html)

### Mapping by column name
<a name="orc-mapping-by-name"></a>

With name mapping, columns in the external table map to columns in the ORC file that are at the same nesting level and have the same name. 

For example, suppose that you want to map the table from the previous example, `SPECTRUM.ORC_EXAMPLE`, with an ORC file that uses the following file structure.

```
• ORC file root(id = 0)
   o 'nested_col' : struct (id = 1)
      - 'map_col' : map (id = 2)
         - key : int (id = 3)
         - value : array (id = 4)
            - value : float (id = 5)
      - 'int_col' : int (id = 6)
   o 'int_col' : int (id = 7)
   o 'float_col' : float (id = 8)
```

Using position mapping, Redshift Spectrum attempts the following mapping. 

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html)

When you query a table with the preceding position mapping, the SELECT command fails on type validation because the structures are different. 

You can map the same external table to both file structures shown in the previous examples by using column name mapping. The table columns `int_col`, `float_col`, and `nested_col` map by column name to columns with the same names in the ORC file. The column named `nested_col` in the external table is a `struct` column with subcolumns named `map_col` and `int_col`. The subcolumns also map correctly to the corresponding columns in the ORC file by column name. 

## Creating external tables for data managed in Apache Hudi
<a name="c-spectrum-column-mapping-hudi"></a>

To query data in Apache Hudi Copy On Write (CoW) format, you can use Amazon Redshift Spectrum external tables. A Hudi Copy On Write table is a collection of Apache Parquet files stored in Amazon S3. You can read Copy On Write (CoW) tables in Apache Hudi versions 0.5.2, 0.6.0, 0.7.0, 0.8.0, 0.9.0, 0.10.0, 0.10.1, 0.11.0, and 0.11.1 that are created and modified with insert, delete, and upsert write operations. Hudi tables created using the bootstrap operation aren't supported. For more information, see [Copy On Write Table](https://hudi.apache.org/docs/next/table_types#copy-on-write-table) in the open source Apache Hudi documentation. 

When you create an external table that references data in Hudi CoW format, you map each column in the external table to a column in the Hudi data. Mapping is done by column. 

The data definition language (DDL) statements for partitioned and unpartitioned Hudi tables are similar to those for other Apache Parquet file formats. For Hudi tables, you define `INPUTFORMAT` as `org.apache.hudi.hadoop.HoodieParquetInputFormat`. The `LOCATION` parameter must point to the Hudi table base folder that contains the `.hoodie` folder, which is required to establish the Hudi commit timeline. In some cases, a SELECT operation on a Hudi table might fail with the message No valid Hudi commit timeline found. If so, check if the `.hoodie` folder is in the correct location and contains a valid Hudi commit timeline. 

**Note**  
Apache Hudi format is only supported when you use an AWS Glue Data Catalog. It's not supported when you use an Apache Hive metastore as the external catalog. 

The DDL to define an unpartitioned table has the following format. 

```
CREATE EXTERNAL TABLE tbl_name (columns)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://s3-bucket/prefix'
```

The DDL to define a partitioned table has the following format. 

```
CREATE EXTERNAL TABLE tbl_name (columns)
PARTITIONED BY(pcolumn1 pcolumn1-type[,...])
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://s3-bucket/prefix'
```

To add partitions to a partitioned Hudi table, run an ALTER TABLE ADD PARTITION command where the `LOCATION` parameter points to the Amazon S3 subfolder with the files that belong to the partition.

The DDL to add partitions has the following format.

```
ALTER TABLE tbl_name
ADD IF NOT EXISTS PARTITION(pcolumn1=pvalue1[,...])
LOCATION 's3://s3-bucket/prefix/partition-path'
```
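
As a sketch, for a hypothetical Hudi CoW table partitioned by `year`, the DDL and partition statements might look like the following. The bucket, table name, columns, and partition values are placeholders; adjust them for your own table.

```
CREATE EXTERNAL TABLE spectrum.hudi_sales (
salesid integer,
pricepaid decimal(8,2))
PARTITIONED BY (year varchar(4))
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/hudi/sales';

ALTER TABLE spectrum.hudi_sales
ADD IF NOT EXISTS PARTITION(year='2024')
LOCATION 's3://amzn-s3-demo-bucket/hudi/sales/year=2024/';
```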

## Creating external tables for data managed in Delta Lake
<a name="c-spectrum-column-mapping-delta"></a>

To query data in Delta Lake tables, you can use Amazon Redshift Spectrum external tables. 

To access a Delta Lake table from Redshift Spectrum, generate a manifest before the query. A Delta Lake *manifest* contains a listing of files that make up a consistent snapshot of the Delta Lake table. In a partitioned table, there is one manifest per partition. A Delta Lake table is a collection of Apache Parquet files stored in Amazon S3. For more information, see [Delta Lake](https://delta.io) in the open source Delta Lake documentation. 
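
For example, with open source Delta Lake on Apache Spark, you can generate the manifest by running a Spark SQL command like the following. The table path is a placeholder for your own Delta Lake table location.

```
GENERATE symlink_format_manifest FOR TABLE delta.`s3://amzn-s3-demo-bucket/prefix/`
```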

When you create an external table that references data in Delta Lake tables, you map each column in the external table to a column in the Delta Lake table. Mapping is done by column name. 

The DDL for partitioned and unpartitioned Delta Lake tables is similar to that for other Apache Parquet file formats. For Delta Lake tables, you define `INPUTFORMAT` as `org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat` and `OUTPUTFORMAT` as `org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat`. The `LOCATION` parameter must point to the manifest folder in the table base folder. If a SELECT operation on a Delta Lake table fails, for possible reasons see [Limitations and troubleshooting for Delta Lake tables](#c-spectrum-column-mapping-delta-limitations). 

The DDL to define an unpartitioned table has the following format. 

```
CREATE EXTERNAL TABLE tbl_name (columns)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://s3-bucket/prefix/_symlink_format_manifest'
```

The DDL to define a partitioned table has the following format. 

```
CREATE EXTERNAL TABLE tbl_name (columns)
PARTITIONED BY(pcolumn1 pcolumn1-type[,...])
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://s3-bucket>/prefix/_symlink_format_manifest'
```

To add partitions to a partitioned Delta Lake table, run an ALTER TABLE ADD PARTITION command where the `LOCATION` parameter points to the Amazon S3 subfolder that contains the manifest for the partition.

The DDL to add partitions has the following format.

```
ALTER TABLE tbl_name
ADD IF NOT EXISTS PARTITION(pcolumn1=pvalue1[,...])
LOCATION
's3://s3-bucket/prefix/_symlink_format_manifest/partition-path'
```

Or run DDL that points directly to the Delta Lake manifest file.

```
ALTER TABLE tbl_name
ADD IF NOT EXISTS PARTITION(pcolumn1=pvalue1[,...])
LOCATION
's3://s3-bucket/prefix/_symlink_format_manifest/partition-path/manifest'
```

### Limitations and troubleshooting for Delta Lake tables
<a name="c-spectrum-column-mapping-delta-limitations"></a>

Consider the following when querying Delta Lake tables from Redshift Spectrum:
+ If a manifest points to a snapshot or partition that no longer exists, queries fail until a new valid manifest has been generated. For example, this might result from a VACUUM operation on the underlying table.
+ Delta Lake manifests only provide partition-level consistency. 

The following table explains some potential reasons for certain errors when you query a Delta Lake table. 


| Error message | Possible reason | 
| --- | --- | 
| Delta Lake manifest in bucket *s3-bucket-1* cannot contain entries in bucket *s3-bucket-2*. | The manifest entries point to files in a different Amazon S3 bucket than the specified one.  | 
| Delta Lake files are expected to be in the same folder. | The manifest entries point to files that have a different Amazon S3 prefix than the specified one. | 
| File *filename* listed in Delta Lake manifest *manifest-path* was not found. | A file listed in the manifest wasn't found in Amazon S3.  | 
| Error fetching Delta Lake manifest. | The manifest wasn't found in Amazon S3.  | 
| Invalid S3 Path. | An entry in the manifest file isn't a valid Amazon S3 path, or the manifest file has been corrupted.  | 

# Using Apache Iceberg tables with Amazon Redshift
<a name="querying-iceberg"></a>

**Note**  
To achieve the best performance when using Apache Iceberg tables with Amazon Redshift, you must generate column statistics for the tables using AWS Glue. For more information, see [Generating column statistics for Iceberg tables](https://docs.aws.amazon.com/glue/latest/dg/iceberg-generate-column-stats.html) in the *AWS Glue Developer Guide*. 

This topic describes how to use tables in Apache Iceberg format with Redshift Spectrum or Redshift Serverless. Apache Iceberg is a high-performance format for huge analytic tables.

You can use Redshift Spectrum or Redshift Serverless to query Apache Iceberg tables cataloged in the AWS Glue Data Catalog. Apache Iceberg is an open-source table format for data lakes. For more information, see [Apache Iceberg](https://iceberg.apache.org/) in the Apache Iceberg documentation.

Amazon Redshift provides transactional consistency for querying Apache Iceberg tables. You can manipulate the data in your tables using ACID (atomicity, consistency, isolation, durability) compliant services such as Amazon Athena and Amazon EMR while running queries using Amazon Redshift. Amazon Redshift can use the table statistics stored in Apache Iceberg metadata to optimize query plans and reduce file scans during query processing. With Amazon Redshift SQL, you can join Redshift tables with data lake tables.

To get started using Iceberg tables with Amazon Redshift:

1. Create an Apache Iceberg table on an AWS Glue Data Catalog database using a compatible service such as Amazon Athena or Amazon EMR. To create an Iceberg table using Athena, see [Using Apache Iceberg tables](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html) in the *Amazon Athena User Guide*.

1. Create an Amazon Redshift cluster or Redshift Serverless workgroup with an associated IAM role that allows access to your data lake. For information on how to create clusters or workgroups, see [Get started with Amazon Redshift provisioned data warehouses](https://docs.aws.amazon.com/redshift/latest/gsg/new-user.html) and [Get started with Redshift Serverless data warehouses](https://docs.aws.amazon.com/redshift/latest/gsg/new-user-serverless.html) in the *Amazon Redshift Getting Started Guide*.

1. Connect to your cluster or workgroup using query editor v2 or a third-party SQL client. For information about how to connect using query editor v2, see [Connecting to an Amazon Redshift data warehouse using SQL client tools](https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-to-cluster.html) in the *Amazon Redshift Management Guide*.

1. Create an external schema in your Amazon Redshift database for a specific Data Catalog database that includes your Iceberg tables. For information about creating an external schema, see [External schemas in Amazon Redshift Spectrum](c-spectrum-external-schemas.md).

1. Run SQL queries to access the Iceberg tables in the external schema you created.
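
Taken together, the last two steps might look like the following sketch. The schema name, Data Catalog database, role ARN, and table name are placeholders for your own resources.

```
create external schema iceberg_schema
from data catalog
database 'iceberg_db'
iam_role 'arn:aws:iam::123456789012:role/myredshiftrole';

select count(*) from iceberg_schema.my_iceberg_table;
```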

## Considerations when using Apache Iceberg tables with Amazon Redshift
<a name="querying-iceberg-considerations"></a>

Consider the following when using Amazon Redshift with Iceberg tables:
+ **Iceberg version support** – Amazon Redshift supports running queries against the following versions of Iceberg tables:
  + Version 1 defines how large analytic tables are managed using immutable data files.
  + Version 2 adds the ability to support row-level updates and deletes while keeping the existing data files unchanged, and handling table data changes using delete files. 

   For the difference between version 1 and version 2 tables, see [Format version changes](https://iceberg.apache.org/spec/#appendix-e-format-version-changes) in the Apache Iceberg documentation.
+ **Adding partitions** – You don't need to manually add partitions for your Apache Iceberg tables. New partitions in Apache Iceberg tables are automatically detected by Amazon Redshift and no manual operation is needed to update partitions in the table definition. Any changes in partition specification are also automatically applied to your queries without any user intervention.
+ **Ingesting Iceberg data into Amazon Redshift** – You can use INSERT INTO or CREATE TABLE AS commands to import data from your Iceberg table into a local Amazon Redshift table. You currently cannot use the COPY command to ingest the contents of an Apache Iceberg table into a local Amazon Redshift table.
+ **Materialized views** – You can create materialized views on Apache Iceberg tables like any other external table in Amazon Redshift. The same considerations for other data lake table formats apply to Apache Iceberg tables. Automatic query rewriting and automatic materialized views on data lake tables are currently not supported.
+ **AWS Lake Formation fine-grained access control** – Amazon Redshift supports AWS Lake Formation fine-grained access control on Apache Iceberg tables.
+ **User-defined data handling parameters** – Amazon Redshift supports user-defined data handling parameters on Apache Iceberg tables. You can use these parameters on existing files to tailor the data being queried in external tables and avoid scan errors. They provide capabilities to handle mismatches between the table schema and the actual data in the files.
+ **Time travel queries** – Time travel queries are currently not supported with Apache Iceberg tables.
+ **Pricing** – When you access Iceberg tables from a cluster, you are charged Redshift Spectrum pricing. When you access Iceberg tables from a workgroup, you are charged Redshift Serverless pricing. For information about Redshift Spectrum and Redshift Serverless pricing, see [Amazon Redshift pricing](https://aws.amazon.com/redshift/pricing/).
+ **Metadata caching** – Metadata caching assumes metadata files are immutable based on the [Iceberg specification](https://iceberg.apache.org/spec/#file-system-operations). Metadata file immutability is a requirement for data integrity in Amazon Redshift.
+ **Federated identity** – Federated identity is not supported when writing to Apache Iceberg tables. This includes using the SESSION keyword for the IAM\_ROLE parameter when creating external schemas. For more information about IAM\_ROLE parameters, see [CREATE EXTERNAL SCHEMA](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html).
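
As noted in the ingestion consideration above, you can copy Iceberg data into a local table with INSERT INTO or CREATE TABLE AS. A sketch with placeholder schema and table names:

```
create table local_sales as
select * from iceberg_schema.sales_iceberg;

insert into local_sales
select * from iceberg_schema.sales_iceberg;
```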

# Supported data types with Apache Iceberg tables
<a name="querying-iceberg-supported-data-types"></a>

This topic describes the supported data types that Redshift Spectrum can read from tables in Apache Iceberg format.

Amazon Redshift can query Iceberg tables that contain the following data types:

```
binary
boolean
date
decimal
double
float
int
list
long
map
string
struct
timestamp without time zone
```

When you create and define an Iceberg table, use the Amazon Redshift data type names in the SQL statement. Amazon Redshift automatically maps them to the corresponding Iceberg types. For more information about Iceberg data types, see [Schemas for Iceberg](https://iceberg.apache.org/docs/latest/schemas/) in the Apache Iceberg documentation.

When reading from Iceberg tables, Iceberg data types are mapped to Amazon Redshift data types as shown in the following table. 


****  

| Iceberg type | Amazon Redshift type | Notes | 
| --- | --- | --- | 
| boolean | boolean | - | 
| - | tinyint | Not supported for Iceberg tables. | 
| - | smallint | Not supported for Iceberg tables. | 
| int | int | - | 
| long | bigint | - | 
| double | double precision | - | 
| float | real | - | 
| decimal(P, S) | decimal(P, S) | P is precision, S is scale. | 
| - | char | Not supported for Iceberg tables. | 
| string | varchar(16384) | Strings larger than 16384 are truncated to 16384. | 
| binary | varbyte(64000) | - | 
| date | date | - | 
| time | - | - | 
| timestamp | timestamp | - | 
| timestamptz | timestamptz | - | 
| list<E> | SUPER | - | 
| map<K,V> | SUPER | - | 
| struct<...> | SUPER | - | 
| fixed(L) | - | The fixed(L) type isn't currently supported in Redshift Spectrum. | 
| uuid | - | The uuid type isn't currently supported in Redshift Spectrum. | 
| variant | - | Amazon Redshift doesn't support Iceberg V3. | 
| geometry | - | Amazon Redshift doesn't support Iceberg V3. | 
| geography | - | Amazon Redshift doesn't support Iceberg V3. | 
| timestamp_ns | - | Amazon Redshift doesn't support Iceberg V3. | 
| timestamptz_ns | - | Amazon Redshift doesn't support Iceberg V3. | 
| unknown | - | Amazon Redshift doesn't support Iceberg V3. | 
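
Because `list`, `map`, and `struct` columns surface as SUPER values in Amazon Redshift, you can navigate them with dot and bracket notation in queries. The following sketch assumes a hypothetical Iceberg table `ext_schema.events` with a `struct` column named `details` and a `list` column named `tags`:

```
-- details is struct<city:string, zip:string>; tags is list<string>.
-- Both columns map to SUPER in Amazon Redshift.
SELECT event_id,
       details.city,          -- dot notation navigates into a struct
       tags[0] AS first_tag   -- bracket notation indexes into a list
FROM ext_schema.events;
```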

The following data types are supported when creating Iceberg tables from Amazon Redshift. Amazon Redshift data types are mapped to Iceberg data types as shown in the following table:

| Amazon Redshift type | Amazon Redshift alias | Iceberg type | Notes | 
| --- | --- | --- | --- | 
| integer | int, int4 | int | - | 
| bigint | int8 | long | - | 
| decimal | numeric | decimal(P, S) | - | 
| real | float4 | float | - | 
| double precision | float8, float | double | - | 
| varchar | character varying, nvarchar, text | string | The varchar(n) data type is not supported when creating an Iceberg table. | 
| date | - | date | - | 
| timestamp | - | timestamp | - | 
| timestamptz | - | timestamptz | - | 
| boolean | - | boolean | - | 

When writing to Iceberg tables, in addition to the data types in the previous table, some source data types are promoted to compatible Iceberg types as shown in the following table.


| Amazon Redshift type | Iceberg type | 
| --- | --- | 
|  `tinyint`  |  `int`  | 
|  `smallint`  |  `int`  | 
|  `varchar(n)`  |  `string`  | 

Attempting to use unsupported data types results in syntax errors. When you create an Iceberg table with the `CREATE TABLE AS SELECT` clause, you can add an explicit cast to work around the type difference.

For example, suppose you have a Redshift RMS table with the following schema:

```
CREATE TABLE rms_t (c1 int, c2 char(20));
```

If you want to create an Iceberg table using `rms_t` as the source, you need an explicit cast for the `c2` column, because the `varchar(n)` type is not supported:

```
CREATE TABLE ext_schema.iceberg_t AS SELECT c1, c2::varchar FROM rms_t;
```

For more information about data types in Amazon Redshift, see [Data types](c_Supported_data_types.md).

# Referencing Iceberg tables in Amazon Redshift
<a name="referencing-iceberg-tables"></a>

Amazon Redshift provides multiple ways to reference Apache Iceberg tables stored in your data lake. You can use external schemas to create references to Data Catalog databases containing Iceberg tables, or use three-part notation for direct access to auto-mounted catalogs.

## Using external schemas to reference Iceberg tables
<a name="referencing-iceberg-external-schemas"></a>

External schemas provide a way to reference tables in your Data Catalog from within Amazon Redshift. When you create an external schema, you establish a connection between your Amazon Redshift database and a specific Data Catalog database that contains your Iceberg tables.

To create an external schema for Iceberg tables:

```
CREATE EXTERNAL SCHEMA schema_name
FROM DATA CATALOG
DATABASE 'glue_database_name'
IAM_ROLE 'arn:aws:iam::account-id:role/role-name';
```

After creating the external schema, you can query Iceberg tables using two-part notation:

```
SELECT * FROM schema_name.iceberg_table_name;
```

You can also join Iceberg tables with local Amazon Redshift tables:

```
SELECT r.customer_id, i.order_date, r.customer_name
FROM local_customers r
JOIN schema_name.iceberg_orders i 
ON r.customer_id = i.customer_id;
```

## Using three-part notation with auto-mounted catalogs
<a name="referencing-iceberg-three-part-notation"></a>

Three-part notation allows you to directly reference tables in auto-mounted catalogs without creating external schemas. This method is particularly useful when working with Amazon S3 table buckets federated with AWS Lake Formation. For information about setting up automatic mounting of the Data Catalog, see [Simplify external object access in Amazon Redshift using automatic mounting of the AWS Glue Data Catalog](https://aws.amazon.com/blogs/big-data/simplify-external-object-access-in-amazon-redshift-using-automatic-mounting-of-the-aws-glue-data-catalog/).

The syntax for three-part notation is:

```
"catalog_name".database_name.table_name
```

For example, to query an Iceberg table in an auto-mounted Amazon S3 table catalog:

```
SELECT * FROM "my_table_bucket@s3tablescatalog".my_database.my_iceberg_table;
```

For more information about integrating Amazon S3 table buckets with Amazon Redshift, see [Integrating S3 Tables with Amazon Redshift](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-redshift.html) in the *Amazon S3 User Guide*.

You can also use the `USE` statement to set a default catalog and database:

```
USE "my_table_bucket@s3tablescatalog".my_database;
SELECT * FROM my_iceberg_table;
```

To set a search path for schema resolution:

```
USE "my_table_bucket@s3tablescatalog";
SET search_path TO my_database;
SELECT * FROM my_iceberg_table;
```

## Best practices for referencing Iceberg tables
<a name="referencing-iceberg-best-practices"></a>

Consider the following best practices when referencing Iceberg tables in Amazon Redshift:
+ **Use descriptive schema names** – When creating external schemas, use names that clearly indicate the source and purpose of the data, such as `sales_data_lake` or `customer_analytics`.
+ **Leverage table statistics** – Ensure that column statistics are generated for your Iceberg tables using AWS Glue to optimize query performance. Amazon Redshift uses these statistics for query planning and optimization.
+ **Consider data freshness** – Iceberg tables may be updated by other services while you're querying them. Amazon Redshift provides transactional consistency, ensuring you see a consistent snapshot of the data during your query execution.
+ **Use appropriate IAM permissions** – Ensure that your Amazon Redshift cluster or workgroup has the necessary IAM permissions to access the Amazon S3 locations where your Iceberg tables are stored, as well as the Data Catalog metadata.
+ **Monitor query performance** – Use Amazon Redshift query monitoring features to track the performance of queries against Iceberg tables and optimize as needed.

## Common referencing patterns
<a name="referencing-iceberg-examples"></a>

The following examples demonstrate common patterns for referencing Iceberg tables:

**Aggregating data across multiple Iceberg tables:**

```
SELECT 
    region,
    SUM(sales_amount) as total_sales,
    COUNT(*) as transaction_count
FROM data_lake.sales_transactions
WHERE transaction_date >= '2024-01-01'
GROUP BY region
ORDER BY total_sales DESC;
```

**Joining Iceberg tables with local Amazon Redshift tables:**

```
SELECT 
    c.customer_name,
    c.customer_tier,
    SUM(o.order_amount) as total_orders
FROM customers c
JOIN data_lake.order_history o ON c.customer_id = o.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.customer_name, c.customer_tier;
```

**Using three-part notation with complex queries:**

```
WITH recent_orders AS (
    SELECT customer_id, order_date, order_amount
    FROM "analytics_bucket@s3tablescatalog".ecommerce.orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
)
SELECT 
    customer_id,
    COUNT(*) as order_count,
    AVG(order_amount) as avg_order_value
FROM recent_orders
GROUP BY customer_id
HAVING COUNT(*) > 1;
```

# Writing to Apache Iceberg tables
<a name="iceberg-writes"></a>

With Amazon Redshift, you can create and write to Apache Iceberg tables stored in Amazon S3 and Amazon S3 table buckets. Writing Iceberg data directly from Amazon Redshift streamlines your data management by eliminating extra tools. Iceberg tables must be registered with AWS Glue Data Catalog. 

You can use lakehouse architecture with Apache Iceberg tables while simultaneously taking advantage of Amazon Redshift's powerful SQL analytics across both warehouses and lakes. You also gain immediate access to advanced Amazon Redshift features like materialized views on your Iceberg tables, significantly enhancing your analytical capabilities without adding complexity.

Iceberg writes are supported on both Amazon Redshift provisioned clusters and Amazon Redshift Serverless.

**Topics**
+ [

# SQL commands
](iceberg-writes-sql-syntax.md)
+ [

# Transaction semantics
](iceberg-writes-transaction-semantics.md)
+ [

# Best practices
](iceberg-writes-best-practices.md)

# SQL commands
<a name="iceberg-writes-sql-syntax"></a>

Apache Iceberg tables in Amazon Redshift provide a powerful way to manage large analytic datasets in your data lake. These tables support ACID transactions, schema evolution, and time travel capabilities while maintaining high performance for analytics workloads. Using Apache Iceberg tables, you can efficiently organize and partition your data, control file formats and compression, and seamlessly integrate with other AWS services. 

You can create partitioned and unpartitioned Iceberg tables using `CREATE TABLE ... USING ICEBERG` and `CREATE TABLE ... USING ICEBERG AS SELECT` commands. You can reference Iceberg tables using either external schema notation (`external_schema.table_name`) or three-part notation (`"catalog_name".database_name.table_name`). The examples in this section demonstrate both methods.

After you create a table, you can add data using standard `INSERT` commands. Keep in mind that while Amazon Redshift works with many Iceberg data types, you might need to convert some data formats when inserting information. 

You can view Iceberg tables using the `SHOW TABLES` command. To remove a table from the AWS Glue Data Catalog, use the `DROP TABLE` command. Note that this only removes the table registration; the actual data stays in storage until you delete it separately.

You can also modify existing data using `DELETE`, `UPDATE`, and `MERGE` commands. All other SQL statements, such as `ALTER TABLE`, are not yet supported on Iceberg tables.

You can write to an Iceberg table that wasn't created by Amazon Redshift. However, there are some limitations:
+ The table must be an Iceberg v2 table.
+ The table must use Parquet as its default data format.
+ The table must not have metadata compression set to True.
+ The table must not have Write-Audit-Publish (WAP) enabled.

The following sections demonstrate SQL syntax for creating, inserting, modifying, and managing Iceberg tables in Amazon Redshift.

**Contents**
+ [

## CREATE TABLE
](#iceberg-writes-create-table)
+ [

## CREATE TABLE AS SELECT
](#iceberg-writes-create-table-as-select)
+ [

## SHOW TABLE
](#iceberg-writes-show-table)
+ [

## INSERT INTO
](#iceberg-writes-insert-into)
+ [

## DELETE
](#iceberg-writes-delete)
+ [

## UPDATE
](#iceberg-writes-update)
+ [

## MERGE
](#iceberg-writes-merge)
+ [

## DROP TABLE
](#iceberg-writes-drop-table)

## CREATE TABLE
<a name="iceberg-writes-create-table"></a>

```
CREATE TABLE [IF NOT EXISTS] <external_schema>.<table_name> (
  column_name data_type [, ...]
)
USING ICEBERG
[LOCATION 's3://your-bucket-name/prefix/']
[PARTITIONED BY [[column_name | transform_function]], ...]
[TABLE PROPERTIES ('compression_type'='<compression_value>')]
```

You can also use three-part notation for S3 table buckets:

```
CREATE TABLE "<table_bucket_name>@s3tablescatalog".<database_name>.<table_name> (
  column_name data_type [, ...]
)
USING ICEBERG
[PARTITIONED BY [[column_name | transform_function]], ...]
[TABLE PROPERTIES ('compression_type'='<compression_value>')]
```

Note that `<external_schema>` must be an existing external schema name in which the external table will be created. For more information about how to create and manage external schemas, see [CREATE EXTERNAL SCHEMA](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html) in the Amazon Redshift documentation.

The `LOCATION` clause defines the table location for the newly created Iceberg table. For Amazon S3 tables, `LOCATION` can't be specified, because the table location is determined by the Amazon S3 tables catalog (`s3tablescatalog`). 

In all other cases, `LOCATION` is required, and it must be an empty location, meaning no existing Amazon S3 objects share the same bucket and prefix. Note that the Amazon S3 bucket must be in the same AWS Region as the Amazon Redshift cluster. 

However, AWS provides a method to replicate data from Iceberg tables stored in an AWS Glue Data Catalog in one AWS Region to a different AWS Region, which allows you to replicate the write to a different region. For more information, see [Replicate data across AWS Regions](https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/best-practices-workloads.html#workloads-replication).

`PARTITIONED BY` defines the Iceberg table partition specification. Amazon Redshift supports all Iceberg v2 partition transforms except `void`. The following transforms are supported:
+ **identity**
+ **bucket[N]**
+ **truncate[W]**
+ **year**
+ **month**
+ **day**
+ **hour**

For the full definitions of these transforms and the compatible data types, see [Partition Transforms](https://iceberg.apache.org/spec/#partition-transforms) in the Apache Iceberg documentation.

The `PARTITIONED BY` clause supports multi-level partitioning. For example, you can run the following command:

```
CREATE TABLE ...
USING ICEBERG
LOCATION ...
PARTITIONED BY (bucket(16, id), year(ship_date));
```

However, Amazon Redshift doesn't support using a single column in more than one transform. For example, the following syntax is not supported:

```
CREATE TABLE ...
USING ICEBERG
LOCATION ...
PARTITIONED BY (bucket(16, ship_date), year(ship_date));
```

The `TABLE PROPERTIES` clause defines extra table properties for the Iceberg table. The only supported table property is `compression_type`, which defines the default Parquet data file compression. If it isn't specified, `snappy` is used as the compression codec. The possible values for `compression_type` are `zstd`, `brotli`, `gzip`, `snappy`, and `uncompressed`.
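
For example, the following sketch combines the `LOCATION`, `PARTITIONED BY`, and `TABLE PROPERTIES` clauses in one statement. The schema, bucket, and column names are hypothetical:

```
-- Create a partitioned Iceberg table with zstd-compressed Parquet files
CREATE TABLE my_external_schema.clickstream (
  event_id bigint,
  user_id int,
  event_date date
)
USING ICEBERG
LOCATION 's3://my-data-lake/clickstream/'
PARTITIONED BY (day(event_date))
TABLE PROPERTIES ('compression_type'='zstd');
```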

**Note**  
`CREATE TABLE ... LIKE ...` is not supported for Iceberg tables. Iceberg tables also don't support column constraints and column attributes as RMS tables do.

Alternatively, you can create and populate an Iceberg table in a single operation using `CREATE TABLE AS SELECT`:

## CREATE TABLE AS SELECT
<a name="iceberg-writes-create-table-as-select"></a>

```
CREATE TABLE <external_schema>.<table_name> [(
  column_name[, ...]
)]
USING ICEBERG
[LOCATION 's3://your-bucket-name/prefix/']
[PARTITIONED BY [[column_name | transform_function]], ...]
[TABLE PROPERTIES ('compression_type'='<compression-value>')]
AS
SELECT query
```

You can also use three-part notation to create tables in auto-mounted catalogs:

```
CREATE TABLE "<catalog_name>".<database_name>.<table_name> [(
  column_name[, ...]
)]
USING ICEBERG
[LOCATION 's3://your-bucket-name/prefix/']
[PARTITIONED BY [[column_name | transform_function]], ...]
[TABLE PROPERTIES ('compression_type'='<compression-value>')]
AS
SELECT query
```

This is similar to the `CREATE TABLE` statement except that `CREATE` is followed by a `SELECT` statement to populate the table with `SELECT` query results.

The `CREATE TABLE` clause here doesn't allow you to specify column data types, because they are determined by the `SELECT` query.

If the `SELECT` query fails for any reason, the whole query fails and the Iceberg table is not created.
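
For example, the following sketch creates a partitioned Iceberg table from an existing table's query results. The schema, bucket, and table names are hypothetical:

```
-- Create and populate a partitioned Iceberg table in a single operation
CREATE TABLE my_external_schema.orders_2024
USING ICEBERG
LOCATION 's3://my-data-lake/orders-2024/'
PARTITIONED BY (month(order_date))
AS
SELECT order_id, customer_id, order_date, total_amount
FROM sales_schema.orders
WHERE order_date >= '2024-01-01';
```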

You can view the structure of your Iceberg tables using `SHOW TABLE`:

## SHOW TABLE
<a name="iceberg-writes-show-table"></a>

```
SHOW TABLE <external_schema>.<table_name>
```

You can also use three-part notation with auto-mounted catalogs:

```
SHOW TABLE "<catalog_name>".<database_name>.<table_name>
```

`SHOW TABLE` displays the `CREATE TABLE` statement for an Iceberg table. The command shows the appropriate results based on the table type. The following is an example of `SHOW TABLE` output for an Iceberg table:

```
CREATE TABLE my_schema.items (id int, price decimal(5, 2))
USING ICEBERG
LOCATION 's3://my_s3_bucket/items/'
PARTITIONED BY (bucket(16, id))
TABLE PROPERTIES ('compression_type'='snappy')
```

**Note**  
For Amazon S3 tables, because the table location is managed by the Amazon S3 tables catalog, the `LOCATION` clause is omitted from the `SHOW TABLE` results.

After creating tables, you can add data using `INSERT INTO`:

## INSERT INTO
<a name="iceberg-writes-insert-into"></a>

```
INSERT INTO <external_schema>.<table_name> [(column_name [, ...])] VALUES (...)
INSERT INTO <external_schema>.<table_name> [(column_name [, ...])] (SELECT query)

-- Using three-part notation for S3 table buckets:
INSERT INTO "<table_bucket_name>@s3tablescatalog".<database_name>.<table_name> [(column_name [, ...])] VALUES (...)
INSERT INTO "<table_bucket_name>@s3tablescatalog".<database_name>.<table_name> [(column_name [, ...])] (SELECT query)
```

You can insert into an existing Iceberg table using the preceding syntax. If the `VALUES` clause is used, you provide the values for the columns listed in `column_name`, or for all columns if the `column_name` list is omitted.

When data is inserted into a partitioned table, new rows are distributed according to the predefined partition specification. If the `SELECT` query fails for any reason, the query fails and no data is inserted into the Iceberg table.
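
The following sketch shows both forms. The table and column names are hypothetical:

```
-- Insert literal rows into specific columns
INSERT INTO my_external_schema.lineitem (l_item_id, l_ship_date)
VALUES (10001, '20250115'), (10002, '20250116');

-- Insert the results of a SELECT query into all columns
INSERT INTO my_external_schema.lineitem
(SELECT l_item_id, l_ship_date FROM staging_schema.lineitem_staging);
```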

## DELETE
<a name="iceberg-writes-delete"></a>

The `DELETE` query for an Iceberg table uses the same `DELETE` syntax as RMS tables:

```
[ WITH [RECURSIVE] common_table_expression [, common_table_expression , ...] ]
DELETE [ FROM ] iceberg_table
[ { USING } table_name, ... ] [ WHERE condition ]
```

You can also use three-part notation for S3 table buckets:

```
[ WITH [RECURSIVE] common_table_expression [, common_table_expression , ...] ]
DELETE [ FROM ] "<table_bucket_name>@s3tablescatalog".<database_name>.<table_name>
[ { USING } table_name, ... ] [ WHERE condition ]
```

The `iceberg_table` can be referenced using the `<external_schema>.<external_table_name>` form, or using three-part notation for an auto-mounted catalog. For more information, see [Referencing Iceberg tables in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/referencing-iceberg-tables.html).

The `table_name` in the `USING` clause is joined with the target table to delete rows that satisfy the `WHERE` condition. The `table_name` can be an Iceberg table or an Amazon Redshift RMS table.
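
For example, the following sketch deletes rows from an Iceberg table by joining against another table through the `USING` clause. The table and column names are hypothetical:

```
-- Remove line items whose order appears in a table of canceled orders;
-- canceled_orders can be an Iceberg table or an RMS table
DELETE FROM my_external_schema.lineitem
USING my_schema.canceled_orders
WHERE lineitem.l_order_id = canceled_orders.order_id;
```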

Because Iceberg uses a hidden partitioning scheme, you can use a `DELETE` query to remove partitions, achieving the same effect as `ALTER TABLE ... DROP PARTITION ...` for Hive tables.

For example, suppose you have a partitioned Iceberg table like the following:

```
CREATE TABLE my_external_schema.lineitem
(l_item_id int,
 l_ship_date varchar,
 ...
)
USING ICEBERG
LOCATION ...
PARTITIONED BY (l_ship_date);
```

You can then remove a partition with a query like the following:

```
DELETE FROM my_external_schema.lineitem WHERE l_ship_date = '20251231';
```

For a query like this, Amazon Redshift optimizes the execution to perform a metadata-only operation and short-circuits the rest of the work. As a result, unlike a normal `DELETE` query, a metadata-only delete query doesn't show execution steps in `EXPLAIN`:

```
explain DELETE FROM my_external_schema.lineitem WHERE l_ship_date = '20251231';

 QUERY PLAN
------------
"XN Seq Scan Metadata of my_external_schema.lineitem location: "s3://s3-path//table-location" format:ICEBERG (cost=0.00..0.01 rows=0 width=0)"
(0 rows)
```

## UPDATE
<a name="iceberg-writes-update"></a>

The `UPDATE` query syntax for an Iceberg table is very similar to the existing `UPDATE` syntax for RMS tables:

```
[ WITH [RECURSIVE] common_table_expression [, common_table_expression , ...] ]
UPDATE iceberg_table [ [ AS ] alias ] SET column = { expression } [,...]
[ FROM fromlist ]
[ WHERE condition ]
```

You can also use three-part notation for S3 table buckets:

```
[ WITH [RECURSIVE] common_table_expression [, common_table_expression , ...] ]
UPDATE "<table_bucket_name>@s3tablescatalog".<database_name>.<table_name> [ [ AS ] alias ] SET column = { expression } [,...]
[ FROM fromlist ]
[ WHERE condition ]
```

The `iceberg_table` can be referenced using the `<external_schema>.<external_table_name>` form, or using three-part notation for an auto-mounted catalog. For more information, see [Referencing Iceberg tables in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/referencing-iceberg-tables.html).

You can update a table by referencing information in other tables. List these other tables in the FROM clause or use a subquery as part of the WHERE condition. The source tables can be either Iceberg tables or Amazon Redshift RMS tables.

`UPDATE` can also run on a partitioned table. When `UPDATE` changes column values that belong to the current partition spec, the updated row is inserted into the new partition based on the updated value.

For example, suppose you have a partitioned Iceberg table like the following:

```
CREATE TABLE my_external_schema.lineitem
(l_item_id int,
 l_ship_date varchar,
 ...
)
USING ICEBERG
LOCATION ...
PARTITIONED BY (l_ship_date);

INSERT INTO my_external_schema.lineitem VALUES (10099, '20251231', ...);
```

When you run the following update query:

```
UPDATE my_external_schema.lineitem SET l_ship_date = '20260101'
WHERE l_item_id = 10099;
```

the row with `l_item_id` 10099 moves from partition `20251231` to the new partition `20260101`.

It's also important to note that `UPDATE` can have multiple candidate values. Consider the following queries:

```
CREATE TABLE my_ext_schema.t1(x1 int, y1 int) USING ICEBERG LOCATION ...;
CREATE TABLE my_ext_schema.t2(x2 int, y2 int) USING ICEBERG LOCATION ...;
INSERT INTO my_ext_schema.t1 VALUES (1,10), (2,20), (3,30);
INSERT INTO my_ext_schema.t2 VALUES (2,40), (2,50);
UPDATE my_ext_schema.t1 SET y1=y2 FROM my_ext_schema.t2 WHERE x1=x2;
```

In this case, `y1` can be 40 or 50. The result is nondeterministic. You can set the configuration parameter `error_on_nondeterministic_update` to true to force a query error when this happens. This is consistent with the existing RMS table `UPDATE` behavior. For more information, see [error_on_nondeterministic_update](https://docs.aws.amazon.com/redshift/latest/dg/r_error_on_nondeterministic_update.html).

## MERGE
<a name="iceberg-writes-merge"></a>

The `MERGE` query conditionally merges rows from a source table into a target table. It shares the same `MERGE` query syntax as the existing RMS table:

```
MERGE INTO target_iceberg_table USING source_table [ [ AS ] alias ]
ON match_condition
[ WHEN MATCHED THEN { UPDATE SET col_name = { expr } [,...] | DELETE }
  WHEN NOT MATCHED THEN INSERT [ ( col_name [,...] ) ]
  VALUES ( { expr } [, ...] )
| REMOVE DUPLICATES ]
```

You can also use three-part notation for S3 table buckets:

```
MERGE INTO "<table_bucket_name>@s3tablescatalog".<database_name>.<table_name> USING source_table [ [ AS ] alias ]
ON match_condition
[ WHEN MATCHED THEN { UPDATE SET col_name = { expr } [,...] | DELETE }
  WHEN NOT MATCHED THEN INSERT [ ( col_name [,...] ) ]
  VALUES ( { expr } [, ...] )
| REMOVE DUPLICATES ]
```

The `target_iceberg_table` can be referenced using the `<external_schema>.<external_table_name>` form, or using three-part notation for an auto-mounted catalog. For more information, see [Referencing Iceberg tables in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/referencing-iceberg-tables.html).

The `source_table` can be either an Iceberg table or an Amazon Redshift RMS table.

When `REMOVE DUPLICATES` is used, the `MERGE` command uses simplified mode. For more details about simplified mode, see the [MERGE](https://docs.aws.amazon.com/redshift/latest/dg/r_MERGE.html) command documentation.
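
For example, the following sketch upserts staged rows into an Iceberg target table. The table and column names are hypothetical:

```
-- Update matching orders from a staging table; insert the rest
MERGE INTO my_external_schema.orders USING staging_schema.orders_staging s
ON orders.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET total_amount = s.total_amount
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, total_amount)
VALUES (s.order_id, s.customer_id, s.total_amount);
```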

While executing a `MERGE` query, Amazon Redshift generates and stores intermediate data files in the target table location. These files are garbage collected at the end of the query. Because of this, a `MERGE` query requires `DELETE` permission on the Amazon S3 bucket to work properly. An insufficient-permission error is thrown if the garbage collection operation fails. For Amazon S3 tables, garbage collection is managed by the Amazon S3 tables service, so `DELETE` permission is not required to execute a `MERGE` query.

## DROP TABLE
<a name="iceberg-writes-drop-table"></a>

To remove an Iceberg table from the catalog, use the `DROP TABLE` command:

```
DROP TABLE <external_schema>.<table_name>
```

You can also use three-part notation with auto-mounted catalogs:

```
DROP TABLE "<catalog_name>".<database_name>.<table_name>
```

Dropping an Iceberg table is a metadata-only operation. It removes the table entry from the AWS Glue Data Catalog, and from the Amazon S3 table catalog if this is an Amazon S3 table. Amazon Redshift doesn't clean up or delete any existing data files or metadata files under the table location. You can use features in AWS Glue and Amazon S3 tables to remove orphaned files. For AWS Glue, see [Deleting orphan files](https://docs.aws.amazon.com/glue/latest/dg/orphan-file-deletion.html). For Amazon S3 tables, see [Table maintenance](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance.html).

# Transaction semantics
<a name="iceberg-writes-transaction-semantics"></a>

Redshift Iceberg write queries support ACID and snapshot isolation. Write transactions have guaranteed atomicity and do not produce partial updates when a query fails unexpectedly. 

Multiple Iceberg transactions can run concurrently, and if two transactions try to modify the same table or partition concurrently, the transaction commit fails. This ensures data integrity. When this happens, you must resolve the conflicts manually and rerun the failed queries. Amazon Redshift doesn't automatically retry and resolve the conflicts.

A single Iceberg write query is always treated as a single auto-commit transaction. When an Iceberg write query, such as a `CREATE` or `INSERT` query, is included in an explicit transaction block, no other queries can run within the same transaction block. If they do, the transaction fails.

Following are some examples. The first example demonstrates that a single statement query always auto-commits after the query finishes. In this scenario, you're creating a new sales orders table:

```
CREATE TABLE sales_schema.orders (
    order_id int, 
    customer_id int, 
    order_date date, 
    total_amount decimal(10,2)
) USING ICEBERG LOCATION 's3://my-data-lake/sales/orders/';
```

This example is an explicit transaction block for inserting a customer order using three-part notation for an S3 table bucket. The transaction doesn't auto-commit after the INSERT query, but instead commits and inserts the order data with the COMMIT command:

```
BEGIN;
INSERT INTO "analytics_bucket@s3tablescatalog".sales_db.orders VALUES (12345, 9876, '2024-10-30', 299.99);
COMMIT;
```

This example is an explicit transaction block rollback scenario where you're testing an order insertion but decide to cancel it. The transaction doesn't auto-commit after the INSERT query, but instead rolls back with the ROLLBACK command without inserting the test order.

```
BEGIN;
INSERT INTO sales_schema.orders VALUES (12346, 5432, '2024-10-30', 150.75);
ROLLBACK;
```

This final example demonstrates how, when you try to run another statement within the same transaction block as the INSERT query, the transaction fails without inserting the order data. In this scenario, you're attempting to insert an order and immediately query the table: 

```
BEGIN;
INSERT INTO sales_schema.orders VALUES (12347, 7890, '2024-10-30', 425.50);
SELECT * FROM sales_schema.orders WHERE order_id = 12347;
```

The only exception to this is the `DROP TABLE` statement, which always behaves as an auto-commit statement and can't run within an explicit transaction block. This is to maintain the same behavior as the `DROP TABLE` on an external table. For more information, see [DROP TABLE](https://docs.aws.amazon.com/redshift/latest/dg/r_DROP_TABLE.html).

**Note**  
Iceberg write SQL statements can't be executed from within a stored procedure. 

# Best practices
<a name="iceberg-writes-best-practices"></a>

Consider the following best practices when you write to an Apache Iceberg table:
+ For small, frequent writes or streaming workloads, consider using compaction features provided by AWS Glue Data Catalog or Amazon S3 tables to optimize file sizes for reads.
+ The `DROP TABLE` command deregisters the table from the AWS Glue Data Catalog or the Amazon S3 tables catalog, but your files still remain. You can use features in AWS Glue and Amazon S3 tables to remove orphaned files. For AWS Glue, see [Deleting orphan files](https://docs.aws.amazon.com/glue/latest/dg/orphan-file-deletion.html). For Amazon S3 tables, see [Table maintenance](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance.html).

# Amazon Redshift Spectrum query performance
<a name="c-spectrum-external-performance"></a>

This topic describes how to improve Redshift Spectrum query performance.

Look at the query plan to find what steps have been pushed to the Amazon Redshift Spectrum layer. 

The following steps are related to the Redshift Spectrum query:
+ S3 Seq Scan 
+ S3 HashAggregate 
+ S3 Query Scan
+ Seq Scan PartitionInfo
+ Partition Loop 

The following example shows the query plan for a query that joins an external table with a local table. Note the S3 Seq Scan and S3 HashAggregate steps that were run against the data on Amazon S3.

```
explain
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) 
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
```

```
QUERY PLAN                                                                                                                                                                                
-----------------------------------------------------------------------------
XN Limit  (cost=1001055770628.63..1001055770628.65 rows=10 width=31)                                                                                                                      
  ->  XN Merge  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)                                                                                                               
        Merge Key: sum(sales.derived_col2)                                                                                                                                                
        ->  XN Network  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)                                                                                                       
              Send to leader                                                                                                                                                              
              ->  XN Sort  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)                                                                                                    
                    Sort Key: sum(sales.derived_col2)                                                                                                                                     
                    ->  XN HashAggregate  (cost=1055770620.49..1055770620.99 rows=200 width=31)                                                                                           
                          ->  XN Hash Join DS_BCAST_INNER  (cost=3119.97..1055769620.49 rows=200000 width=31)                                                                             
                                Hash Cond: ("outer".derived_col1 = "inner".eventid)                                                                                                       
                                ->  XN S3 Query Scan sales  (cost=3010.00..5010.50 rows=200000 width=31)                                                                                  
                                      ->  S3 HashAggregate  (cost=3010.00..3010.50 rows=200000 width=16)                                                                                  
                                            ->  S3 Seq Scan spectrum.sales location:"s3://redshift-downloads/tickit/spectrum/sales" format:TEXT  (cost=0.00..2150.00 rows=172000 width=16)
                                                  Filter: (pricepaid > 30.00)                                                                                                             
                                ->  XN Hash  (cost=87.98..87.98 rows=8798 width=4)                                                                                                        
                                      ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=4)
```

Note the following elements in the query plan:
+ The `S3 Seq Scan` node shows the filter `pricepaid > 30.00` was processed in the Redshift Spectrum layer.

  A filter node under the `XN S3 Query Scan` node indicates predicate processing in Amazon Redshift on top of the data returned from the Redshift Spectrum layer.
+ The `S3 HashAggregate` node indicates aggregation in the Redshift Spectrum layer for the group by clause (`group by spectrum.sales.eventid`).

Following are ways to improve Redshift Spectrum performance:
+ Use Apache Parquet formatted data files. Parquet stores data in a columnar format, so Redshift Spectrum can eliminate unneeded columns from the scan. When data is in text-file format, Redshift Spectrum needs to scan the entire file.
+ Use multiple files to optimize for parallel processing. Keep your file sizes larger than 64 MB. Avoid data size skew by keeping files about the same size. For information about Apache Parquet files and configuration recommendations, see [File Format: Configurations](https://parquet.apache.org/docs/file-format/configurations/) in the *Apache Parquet Documentation*.
+ Use the fewest columns possible in your queries.
+ Put your large fact tables in Amazon S3 and keep your frequently used, smaller dimension tables in your local Amazon Redshift database.
+ Update external table statistics by setting the TABLE PROPERTIES numRows parameter. Use [CREATE EXTERNAL TABLE](r_CREATE_EXTERNAL_TABLE.md) or [ALTER TABLE](r_ALTER_TABLE.md) to set numRows to reflect the number of rows in the table. Amazon Redshift doesn't analyze external tables to generate the table statistics that the query optimizer uses to generate a query plan. If table statistics aren't set for an external table, Amazon Redshift generates a query execution plan based on the assumption that external tables are the larger tables and local tables are the smaller tables.
+ The Amazon Redshift query planner pushes predicates and aggregations to the Redshift Spectrum query layer whenever possible. When large amounts of data are returned from Amazon S3, the processing is limited by your cluster's resources. Redshift Spectrum scales automatically to process large requests. Thus, your overall performance improves whenever you can push processing to the Redshift Spectrum layer. 
+ Write your queries to use filters and aggregations that are eligible to be pushed to the Redshift Spectrum layer.

  The following are examples of some operations that can be pushed to the Redshift Spectrum layer:
  + GROUP BY clauses
  + Comparison conditions and pattern-matching conditions, such as LIKE
  + Aggregate functions, such as COUNT, SUM, AVG, MIN, and MAX
  + String functions

  Operations that can't be pushed to the Redshift Spectrum layer include DISTINCT and ORDER BY.
+ Use partitions to limit the data that is scanned. Partition your data based on your most common query predicates, then prune partitions by filtering on partition columns. For more information, see [Partitioning Redshift Spectrum external tables](c-spectrum-external-tables.md#c-spectrum-external-tables-partitioning).

  Query [SVL_S3PARTITION](r_SVL_S3PARTITION.md) to view total partitions and qualified partitions.
+  Use AWS Glue's statistics generator to compute column-level statistics for AWS Glue Data Catalog tables. Once AWS Glue generates statistics for tables in the Data Catalog, Amazon Redshift Spectrum automatically uses those statistics to optimize the query plan. For more information about computing column-level statistics using AWS Glue, see [Working with column statistics](https://docs.aws.amazon.com/glue/latest/dg/column-statistics.html) in the *AWS Glue Developer Guide*. 
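As a sketch of the numRows recommendation above, the following statement sets the row-count statistic on the sample `spectrum.sales` table; the value shown is an assumption for this dataset.

```
alter table spectrum.sales
set table properties ('numRows' = '172000');
```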

# Data handling options
<a name="t_setting-data-handling-options"></a>

This topic describes how to configure how Redshift Spectrum handles data in unexpected formats.

You can set table parameters when you create external tables to tailor the data being queried. Otherwise, scan errors can occur. For more information, see TABLE PROPERTIES in [CREATE EXTERNAL TABLE](r_CREATE_EXTERNAL_TABLE.md). For examples, see [Data handling examples](r_CREATE_EXTERNAL_TABLE_examples.md#r_CREATE_EXTERNAL_TABLE_examples-data-handling). For a list of errors, see [SVL_SPECTRUM_SCAN_ERROR](r_SVL_SPECTRUM_SCAN_ERROR.md).

You can set the following TABLE PROPERTIES when you create external tables to specify input handling for data being queried in external tables.
+ `column_count_mismatch_handling` to identify whether the file contains fewer or more values for a row than the number of columns specified in the external table definition.
+ `invalid_char_handling` to specify input handling for invalid characters in columns containing VARCHAR, CHAR, and string data. When you specify REPLACE for `invalid_char_handling`, you can specify the replacement character to use.
+ `numeric_overflow_handling` to specify cast overflow handling in columns containing integer and decimal data.
+ `surplus_bytes_handling` to specify input handling for surplus bytes in columns containing VARBYTE data. 
+ `surplus_char_handling` to specify input handling for surplus characters in columns containing VARCHAR, CHAR, and string data.
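As an illustration, the following hedged sketch creates an external table with several of these properties set. The table name, columns, and Amazon S3 location are hypothetical; the property values shown (REPLACE, TRUNCATE) are among the documented options for each property.

```
create external table myspectrum_schema.sample_sales (
  salesid   int,
  listid    int,
  saletime  varchar(20)
)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://amzn-s3-demo-bucket/sample-sales/'
table properties (
  'invalid_char_handling' = 'REPLACE',
  'replacement_char' = '?',
  'surplus_char_handling' = 'TRUNCATE'
);
```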

You can set a configuration option to cancel queries that exceed a maximum number of errors. For more information, see [spectrum_query_maxerror](r_spectrum_query_maxerror.md).

# Example: Performing correlated subqueries in Redshift Spectrum
<a name="c_performing-correlated-subqueries-spectrum"></a>

This topic describes how to perform correlated subqueries in Redshift Spectrum. A correlated subquery is a query that uses values from the outer query.

You can perform correlated subqueries in Redshift Spectrum by using the `$spectrum_oid` pseudocolumn. To perform a correlated subquery, the `$spectrum_oid` pseudocolumn must be enabled, but it doesn't need to appear in the SQL statement. For more information, see [Pseudocolumns](c-spectrum-external-tables.md#c-spectrum-external-tables-pseudocolumns).
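Pseudocolumns are enabled by default. If they have been turned off in your session, a sketch along the following lines re-enables them; the parameter names `spectrum_enable_pseudo_columns` and `enable_spectrum_oid` are assumed here to be the session parameters that control this behavior.

```
-- Assumed session parameters; both must be on for $spectrum_oid to be available.
set spectrum_enable_pseudo_columns to true;
set enable_spectrum_oid to true;
```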

To create the external schema and external tables for this example, see [Getting started with Amazon Redshift Spectrum](c-getting-started-using-spectrum.md).

Following is an example of a correlated subquery in Redshift Spectrum. 

```
select *
from myspectrum_schema.sales s
where exists
( select *
from myspectrum_schema.listing l
where l.listid = s.listid )
order by salesid
limit 5;
```

```
salesid  listid   sellerid   buyerid   eventid   dateid  qtysold  pricepaid   commission   saletime
1        1        36861      21191     7872      1875    4        728         109.2        2008-02-18 02:36:48
2        4        8117       11498     4337      1983    2        76          11.4         2008-06-06 05:00:16	
3        5        1616       17433     8647      1983    2        350         52.5         2008-06-06 08:26:17	
4        5        1616       19715     8647      1986    1        175         26.25        2008-06-09 08:38:52	
5        6        47402      14115     8240      2069    2        154         23.1         2008-08-31 09:17:02
```

# Metrics in Amazon Redshift Spectrum
<a name="c-spectrum-metrics"></a>

This topic describes system views that you can use to monitor Redshift Spectrum queries.

You can monitor Amazon Redshift Spectrum queries using the following system views:
+ [SVL_S3QUERY](r_SVL_S3QUERY.md)

  Use the SVL_S3QUERY view to get details about Redshift Spectrum queries (S3 queries) at the segment and node slice level.
+ [SVL_S3QUERY_SUMMARY](r_SVL_S3QUERY_SUMMARY.md)

  Use the SVL_S3QUERY_SUMMARY view to get a summary of all Amazon Redshift Spectrum queries (S3 queries) that have been run on the system.

The following are some things to look for in SVL_S3QUERY_SUMMARY:
+ The number of files that were processed by the Redshift Spectrum query. 
+ The number of bytes scanned from Amazon S3. The cost of a Redshift Spectrum query is reflected in the amount of data scanned from Amazon S3.
+ The number of bytes returned from the Redshift Spectrum layer to the cluster. A large amount of data returned might affect system performance. 
+ The maximum duration and average duration of Redshift Spectrum requests. Long-running requests might indicate a bottleneck. 
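A query along the following lines surfaces these metrics for the most recent query. The column names used (`files`, `s3_scanned_bytes`, `s3query_returned_bytes`, `avg_request_duration`, `max_request_duration`) are assumed to match the view's documented columns.

```
select segment,
       sum(files)                  as files_processed,
       sum(s3_scanned_bytes)       as bytes_scanned,
       sum(s3query_returned_bytes) as bytes_returned,
       avg(avg_request_duration)   as avg_duration,
       max(max_request_duration)   as max_duration
from svl_s3query_summary
where query = pg_last_query_id()
group by segment
order by segment;
```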

# Query troubleshooting in Amazon Redshift Spectrum
<a name="c-spectrum-troubleshooting"></a>



This topic is a reference for common issues you might encounter with Amazon Redshift Spectrum queries. 

To view errors generated by Redshift Spectrum queries, query the [SVL_S3LOG](r_SVL_S3LOG.md) system table.

**Topics**
+ [

## Retries exceeded
](#spectrum-troubleshooting-retries-exceeded)
+ [

## Access throttled
](#spectrum-troubleshooting-access-throttled)
+ [

## Resource limit exceeded
](#spectrum-troubleshooting-resource-limit-exceeded)
+ [

## No rows returned for a partitioned table
](#spectrum-troubleshooting-no-rows-partitioned-table)
+ [

## Not authorized error
](#spectrum-troubleshooting-not-authorized-error)
+ [

## Incompatible data formats
](#spectrum-troubleshooting-incompatible-data-format)
+ [

## Syntax error when using Hive DDL in Amazon Redshift
](#spectrum-troubleshooting-syntax-error-using-hive-ddl)
+ [

## Permission to create temporary tables
](#spectrum-troubleshooting-permission-to-create-temp-tables)
+ [

## Invalid range
](#spectrum-troubleshooting-invalid-range)
+ [

## Invalid Parquet version number
](#spectrum-troubleshooting-invalid-parquet-version)
+ [

## Required field missing from external partition information
](#spectrum-troubleshooting-required-field-missing)

## Retries exceeded
<a name="spectrum-troubleshooting-retries-exceeded"></a>

If an Amazon Redshift Spectrum request times out, the request is canceled and resubmitted. After five failed retries, the query fails with the following error.

```
error:  Spectrum Scan Error: Retries exceeded
```

Possible causes include the following: 
+ Large file sizes (greater than 1 GB). Check your file sizes in Amazon S3 and look for large files and file size skew. Break up large files into smaller files, between 100 MB and 1 GB. Try to make files about the same size. 
+ Slow network throughput. Try your query later. 

## Access throttled
<a name="spectrum-troubleshooting-access-throttled"></a>

Amazon Redshift Spectrum is subject to the service quotas of other AWS services. Under high usage, Redshift Spectrum requests might be throttled, resulting in the following error.

```
error:  Spectrum Scan Error: Access throttled
```

Two types of throttling can happen:
+ Access throttled by Amazon S3.
+ Access throttled by AWS KMS.

The error context provides more details about the type of throttling. Following, you can find causes and possible resolutions for this throttling.

### Access throttled by Amazon S3
<a name="spectrum-troubleshooting-access-throttled-s3"></a>

Amazon S3 might throttle a Redshift Spectrum request if the read request rate on a [prefix](https://docs.aws.amazon.com/glossary/latest/reference/glos-chap.html#keyprefix) is too high. For information about the GET/HEAD request rate that you can achieve in Amazon S3, see [Optimizing Amazon S3 Performance](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html) in the *Amazon Simple Storage Service User Guide*. The Amazon S3 GET/HEAD request rate takes into account all GET/HEAD requests on a prefix, so different applications accessing the same prefix share the total request rate.

If your Redshift Spectrum requests frequently get throttled by Amazon S3, reduce the number of Amazon S3 GET/HEAD requests that Redshift Spectrum makes to Amazon S3. To do this, try merging small files into larger files. We recommend using file sizes of 64 MB or larger.

Also consider partitioning your Redshift Spectrum tables to benefit from early filtering and to reduce the number of files accessed in Amazon S3. For more information, see [Partitioning Redshift Spectrum external tables](c-spectrum-external-tables.md#c-spectrum-external-tables-partitioning). 

### Access throttled by AWS KMS
<a name="spectrum-troubleshooting-access-throttled-kms"></a>

If you store your data in Amazon S3 using server-side encryption (SSE-S3 or SSE-KMS), Amazon S3 makes a call to AWS KMS for each file that Redshift Spectrum accesses. These requests count toward your cryptographic operations quota; for more information, see [AWS KMS Request Quotas](https://docs.aws.amazon.com/kms/latest/developerguide/requests-per-second.html). For more information on SSE-S3 and SSE-KMS, see [Protecting Data Using Server-Side Encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html) and [Protecting Data Using Server-Side Encryption with KMS keys Stored in AWS KMS](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html) in the *Amazon Simple Storage Service User Guide*.

A first step to reduce the number of requests that Redshift Spectrum makes to AWS KMS is to reduce the number of files accessed. To do this, try merging small files into larger files. We recommend using file sizes of 64 MB or larger.

If your Redshift Spectrum requests frequently get throttled by AWS KMS, consider requesting a quota increase for your AWS KMS request rate for cryptographic operations. To request a quota increase, see [AWS Service Limits](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) in the *Amazon Web Services General Reference*. 

## Resource limit exceeded
<a name="spectrum-troubleshooting-resource-limit-exceeded"></a>

Redshift Spectrum enforces an upper bound on the amount of memory a request can use. A Redshift Spectrum request that requires more memory fails, resulting in the following error.

```
error:  Spectrum Scan Error: Resource limit exceeded
```

A Redshift Spectrum request commonly overruns its memory allowance for one of two reasons:
+ Redshift Spectrum processes a large chunk of data that can't be split into smaller chunks.
+ Redshift Spectrum processes a large aggregation step.

We recommend using a file format that supports parallel reads with split sizes of 128 MB or less. See [Data files for queries in Amazon Redshift Spectrum](c-spectrum-data-files.md) for supported file formats and generic guidelines for data file creation. When using file formats or compression algorithms that don't support parallel reads, we recommend keeping file sizes between 64 MB and 128 MB.

## No rows returned for a partitioned table
<a name="spectrum-troubleshooting-no-rows-partitioned-table"></a>

If your query returns zero rows from a partitioned external table, check whether a partition has been added for this external table. Redshift Spectrum only scans files in an Amazon S3 location that has been explicitly added using `ALTER TABLE … ADD PARTITION`. Query the [SVV_EXTERNAL_PARTITIONS](r_SVV_EXTERNAL_PARTITIONS.md) view to find existing partitions. Run `ALTER TABLE … ADD PARTITION` for each missing partition. 
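As a sketch of the fix, the following statement registers a missing partition; the table, partition column, value, and Amazon S3 path are hypothetical.

```
alter table myspectrum_schema.sales_part
add if not exists
partition (saledate = '2008-01')
location 's3://amzn-s3-demo-bucket/tickit/sales_partitioned/saledate=2008-01/';
```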

## Not authorized error
<a name="spectrum-troubleshooting-not-authorized-error"></a>

Verify that the IAM role for the cluster allows access to the Amazon S3 file objects. If your external database is on Amazon Athena, verify that the IAM role allows access to Athena resources. For more information, see [IAM policies for Amazon Redshift Spectrum](c-spectrum-iam-policies.md).

## Incompatible data formats
<a name="spectrum-troubleshooting-incompatible-data-format"></a>

For a columnar file format, such as Apache Parquet, the column type is embedded with the data. The column type in the CREATE EXTERNAL TABLE definition must match the column type of the data file. If there is a mismatch, you receive an error similar to the following:

```
File 'https://s3bucket/location/file has an incompatible Parquet schema
for column ‘s3://s3bucket/location.col1'. Column type: VARCHAR, Par
```

The error message might be truncated due to the limit on message length. To retrieve the complete error message, including column name and column type, query the [SVL_S3LOG](r_SVL_S3LOG.md) system view.

The following example queries SVL_S3LOG for the last query completed.

```
select message 
from svl_s3log 
where query = pg_last_query_id() 
order by query,segment,slice;
```

The following is an example of a result that shows the full error message.

```
                            message
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––-
Spectrum Scan Error. File 'https://s3bucket/location/file has an incompatible
Parquet schema for column ' s3bucket/location.col1'. 
Column type: VARCHAR, Parquet schema:\noptional int64 l_orderkey [i:0 d:1 r:0]\n
```

To correct the error, alter the external table to match the column type of the Parquet file. 

## Syntax error when using Hive DDL in Amazon Redshift
<a name="spectrum-troubleshooting-syntax-error-using-hive-ddl"></a>

Amazon Redshift supports data definition language (DDL) for CREATE EXTERNAL TABLE that is similar to Hive DDL. However, the two types of DDL aren't always exactly the same. If you copy Hive DDL to create or alter Amazon Redshift external tables, you might encounter syntax errors. The following are examples of differences between Amazon Redshift and Hive DDL: 
+ Amazon Redshift requires single quotation marks (') where Hive DDL supports double quotation marks (").
+ Amazon Redshift doesn't support the STRING data type. Use VARCHAR instead.

## Permission to create temporary tables
<a name="spectrum-troubleshooting-permission-to-create-temp-tables"></a>

To run Redshift Spectrum queries, the database user must have permission to create temporary tables in the database. The following example grants temporary permission on the database `spectrumdb` to the `spectrumusers` user group. 

```
grant temp on database spectrumdb to group spectrumusers;
```

For more information, see [GRANT](r_GRANT.md).

## Invalid range
<a name="spectrum-troubleshooting-invalid-range"></a>

Redshift Spectrum expects that files in Amazon S3 that belong to an external table are not overwritten during a query. If this happens, it can result in the following error.

```
Error: HTTP response error code: 416 Message: InvalidRange The requested range is not satisfiable
```

To avoid the error, make sure Amazon S3 files are not overwritten while they are queried with Redshift Spectrum.

## Invalid Parquet version number
<a name="spectrum-troubleshooting-invalid-parquet-version"></a>

Redshift Spectrum checks the metadata of each Apache Parquet file it accesses. If the check fails, it can result in an error similar to the following:

```
File 'https://s3.region.amazonaws.com/s3bucket/location/file has an invalid version number
```

There are two common reasons that can cause the check to fail:
+ The Parquet file has been overwritten during the query (see [Invalid range](#spectrum-troubleshooting-invalid-range)).
+ The Parquet file is corrupt.

## Required field missing from external partition information
<a name="spectrum-troubleshooting-required-field-missing"></a>

When you try to add a partition to an external table in an external catalog, you might get the following error:

```
Error: The required field (<field_name>) is missing from the external partition information. Add missing field in partition and retry. Partition location: <partition_path>
```

This error means that one of the partitions in the external table that was used in your query has missing partition metadata information. This can happen in the following cases:
+  You added a partition to an external table in an external catalog, such as the AWS Glue Data Catalog, with partial information. 
+  You queried a partitioned table in Amazon Redshift while adding or updating partitions for the corresponding table in an external catalog, such as the AWS Glue Data Catalog. 

Following are the fields that must be filled when retrieving a partition from the AWS Glue Data Catalog:
+  StorageDescriptor 
  +  InputFormat 
  +  OutputFormat 
  +  SerDeInfo 
+  Values 

You can query [SVV_EXTERNAL_PARTITIONS](r_SVV_EXTERNAL_PARTITIONS.md) to find existing partitions and see details on their fields.

For a full list of AWS Glue Data Catalog partition fields, see [ Partition](https://docs.aws.amazon.com/glue/latest/webapi/API_Partition.html) in the *AWS Glue Web API Reference*.

# Tutorial: Querying nested data with Amazon Redshift Spectrum
<a name="tutorial-query-nested-data"></a>

This tutorial demonstrates how to query nested data with Redshift Spectrum. Nested data is data that contains nested fields. Nested fields are fields that are joined together as a single entity, such as arrays, structs, or objects. 

**Topics**
+ [

## Overview
](#tutorial-nested-data-overview)
+ [

## Step 1: Create an external table that contains nested data
](#tutorial-nested-data-create-table)
+ [

## Step 2: Query your nested data in Amazon S3 with SQL extensions
](#tutorial-query-nested-data-sqlextensions)
+ [

# Nested data use cases
](nested-data-use-cases.md)
+ [

# Nested data limitations (preview)
](nested-data-restrictions.md)
+ [

# Serializing complex nested JSON
](serializing-complex-JSON.md)

## Overview
<a name="tutorial-nested-data-overview"></a>

Amazon Redshift Spectrum supports querying nested data in Parquet, ORC, JSON, and Ion file formats. Redshift Spectrum accesses the data using external tables. You can create external tables that use the complex data types `struct`, `array`, and `map`.

For example, suppose that your data file contains the following data in Amazon S3 in a folder named `customers`. Although there isn't a single root element, each JSON object in this sample data represents a row in a table. 

```
{"id": 1,
 "name": {"given": "John", "family": "Smith"},
 "phones": ["123-457789"],
 "orders": [{"shipdate": "2018-03-01T11:59:59.000Z", "price": 100.50},
            {"shipdate": "2018-03-01T09:10:00.000Z", "price": 99.12}]
}
{"id": 2,
 "name": {"given": "Jenny", "family": "Doe"},
 "phones": ["858-8675309", "415-9876543"],
 "orders": []
}
{"id": 3,
 "name": {"given": "Andy", "family": "Jones"},
 "phones": [],
 "orders": [{"shipdate": "2018-03-02T08:02:15.000Z", "price": 13.50}]
}
```

You can use Amazon Redshift Spectrum to query nested data in files. The following tutorial shows you how to do so with Apache Parquet data.

### Prerequisites
<a name="tutorial-nested-data-prereq"></a>

If you are not yet using Redshift Spectrum, follow the steps in [Getting started with Amazon Redshift Spectrum](c-getting-started-using-spectrum.md) before continuing.

To create an external schema, replace the IAM role ARN in the following command with the role ARN you created in [Create an IAM role](c-getting-started-using-spectrum.md#c-getting-started-using-spectrum-create-role). Then run the command in your SQL client.

```
create external schema spectrum 
from data catalog 
database 'myspectrum_db' 
iam_role 'arn:aws:iam::123456789012:role/myspectrum_role'
create external database if not exists;
```

## Step 1: Create an external table that contains nested data
<a name="tutorial-nested-data-create-table"></a>

You can view the [source data](https://s3.amazonaws.com/redshift-downloads/tickit/spectrum/customers/customer_file1) by downloading it from Amazon S3. 

To create the external table for this tutorial, run the following command. 

```
CREATE EXTERNAL TABLE spectrum.customers (
  id     int,
  name   struct<given:varchar(20), family:varchar(20)>,
  phones array<varchar(20)>,
  orders array<struct<shipdate:timestamp, price:double precision>>
)
STORED AS PARQUET
LOCATION 's3://redshift-downloads/tickit/spectrum/customers/';
```

In the preceding example, the external table `spectrum.customers` uses the `struct` and `array` data types to define columns with nested data. Amazon Redshift Spectrum supports querying nested data in Parquet, ORC, JSON, and Ion file formats. The `STORED AS` parameter is `PARQUET` for Apache Parquet files. The `LOCATION` parameter must refer to the Amazon S3 folder that contains the nested data or files. For more information, see [CREATE EXTERNAL TABLE](r_CREATE_EXTERNAL_TABLE.md).

You can nest `array` and `struct` types at any level. For example, you can define a column named `toparray` as shown in the following example.

```
toparray array<struct<nestedarray:
         array<struct<morenestedarray: 
         array<string>>>>>
```

You can also nest `struct` types as shown for column `x` in the following example.

```
x struct<a: string,
         b: struct<c: integer,
                   d: struct<e: string>
                  >
        >
```

## Step 2: Query your nested data in Amazon S3 with SQL extensions
<a name="tutorial-query-nested-data-sqlextensions"></a>

Redshift Spectrum supports querying `array`, `map`, and `struct` complex types through extensions to the Amazon Redshift SQL syntax. 

### Extension 1: Access to columns of structs
<a name="nested-data-sqlextension1"></a>

You can extract data from `struct` columns using a dot notation that concatenates field names into paths. For example, the following query returns given and family names for customers. The given name is accessed by the long path `c.name.given`. The family name is accessed by the long path `c.name.family`. 

```
SELECT c.id, c.name.given, c.name.family
FROM   spectrum.customers c;
```

The preceding query returns the following data.

```
id | given | family
---|-------|-------
1  | John  | Smith
2  | Jenny | Doe
3  | Andy  | Jones
(3 rows)
```

A `struct` can be a column of another `struct`, which can be a column of another `struct`, at any level. The paths that access columns in such deeply nested `struct`s can be arbitrarily long. For example, see the definition for the column `x` in the following example.

```
x struct<a: string,
         b: struct<c: integer, 
                      d: struct<e: string>
                  >
        >
```

You can access the data in `e` as `x.b.d.e`.
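Assuming a hypothetical external table `spectrum.example` that contains the column `x` defined above, the nested field is selected as follows.

```
SELECT t.x.b.d.e
FROM   spectrum.example t;
```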

### Extension 2: Ranging over arrays in a FROM clause
<a name="nested-data-sqlextension2"></a>

You can extract data from `array` columns (and, by extension, `map` columns) by specifying the `array` columns in a `FROM` clause in place of table names. The extension applies to the `FROM` clause of the main query, and also the `FROM` clauses of subqueries.

You can reference `array` elements by position, such as `c.orders[0]`. (preview)

By combining ranging over `arrays` with joins, you can achieve various kinds of unnesting, as explained in the following use cases. 

#### Unnesting using inner joins
<a name="unnest-inner-joins"></a>

The following query selects customer IDs and order ship dates for customers that have orders. The SQL extension in the FROM clause `c.orders o` depends on the alias `c`.

```
SELECT c.id, o.shipdate
FROM   spectrum.customers c, c.orders o
```

For each customer `c` that has orders, the `FROM` clause returns one row for each order `o` of the customer `c`. That row combines the customer row `c` and the order row `o`. Then the `SELECT` clause keeps only the `c.id` and `o.shipdate`. The result is the following.

```
id|      shipdate
--|----------------------
1 |2018-03-01  11:59:59
1 |2018-03-01  09:10:00
3 |2018-03-02  08:02:15
(3 rows)
```

The alias `c` provides access to the customer fields, and the alias `o` provides access to the order fields. 

The semantics are similar to standard SQL. You can think of the `FROM` clause as running the following nested loop, which is followed by `SELECT` choosing the fields to output. 

```
for each customer c in spectrum.customers
  for each order o in c.orders
     output c.id and o.shipdate
```

Therefore, if a customer doesn't have an order, the customer doesn't appear in the result.

You can also think of this as the `FROM` clause performing a `JOIN` with the `customers` table and the `orders` array. In fact, you can also write the query as shown in the following example.

```
SELECT c.id, o.shipdate
FROM   spectrum.customers c INNER JOIN c.orders o ON true
```

**Note**  
If a schema named `c` exists with a table named `orders`, then `c.orders` refers to the table `orders`, and not the array column of `customers`.

#### Unnesting using left joins
<a name="unnest-left-joins"></a>

The following query outputs all customer names and their orders. If a customer hasn't placed an order, the customer's name is still returned. However, in this case, the order columns are NULL, as shown in the following example for Jenny Doe.

```
SELECT c.id, c.name.given, c.name.family, o.shipdate, o.price
FROM   spectrum.customers c LEFT JOIN c.orders o ON true
```

The preceding query returns the following data.

```
id  |  given  | family  |    shipdate          | price
----|---------|---------|----------------------|--------
 1  |  John   | Smith   | 2018-03-01  11:59:59 | 100.5
 1  |  John   | Smith   | 2018-03-01  09:10:00 |  99.12
 2  |  Jenny  | Doe     |                      |
 3  |  Andy   | Jones   | 2018-03-02  08:02:15 |  13.5
 (4 rows)
```

### Extension 3: Accessing an array of scalars directly using an alias
<a name="nested-data-sqlextension3"></a>

When an alias `p` in a `FROM` clause ranges over an array of scalars, the query refers to the values of `p` as `p`. For example, the following query produces pairs of customer names and phone numbers.

```
SELECT c.name.given, c.name.family, p AS phone
FROM   spectrum.customers c LEFT JOIN c.phones p ON true
```

The preceding query returns the following data.

```
given  |  family  |  phone
-------|----------|-----------
John   | Smith    | 123-4577891
Jenny  | Doe      | 858-8675309
Jenny  | Doe      | 415-9876543
Andy   | Jones    | 
(4 rows)
```

### Extension 4: Accessing elements of maps
<a name="nested-data-sqlextension4"></a>

Redshift Spectrum treats the `map` data type as an `array` type that contains `struct` types with a `key` column and a `value` column. The `key` must be a `scalar`; the value can be any data type. 

For example, the following code creates an external table with a `map` for storing phone numbers.

```
CREATE EXTERNAL TABLE spectrum.customers2 (
  id     int,
  name   struct<given:varchar(20), family:varchar(20)>,
  phones map<varchar(20), varchar(20)>,
  orders array<struct<shipdate:timestamp, price:double precision>>
)
STORED AS PARQUET
LOCATION 's3://redshift-downloads/tickit/spectrum/customers/';
```

Because a `map` type behaves like an `array` type with columns `key` and `value`, you can think of the preceding schema as if it were the following.

```
CREATE EXTERNAL TABLE spectrum.customers3 (
  id     int,
  name   struct<given:varchar(20), family:varchar(20)>,
  phones array<struct<key:varchar(20), value:varchar(20)>>,
  orders array<struct<shipdate:timestamp, price:double precision>>
)
STORED AS PARQUET
LOCATION 's3://redshift-downloads/tickit/spectrum/customers/';
```

The following query returns the names of customers who have a mobile phone number, along with the number for each name. The map query is treated as the equivalent of querying a nested `array` of `struct` types. The query returns data only if you have created the external table as described previously.

```
SELECT c.name.given, c.name.family, p.value 
FROM   spectrum.customers c, c.phones p 
WHERE  p.key = 'mobile';
```

**Note**  
The `key` for a `map` is a `string` for Ion and JSON file types.

# Nested data use cases
<a name="nested-data-use-cases"></a>

This topic describes use cases for nested data. Nested data is data that contains nested fields. Nested fields are fields that are grouped together as a single entity, such as arrays, structs, or objects.

You can combine the extensions described previously with the usual SQL features. The following use cases illustrate some common combinations. These examples help demonstrate how you can use nested data. They aren't part of the tutorial.

**Topics**
+ [Ingesting nested data](#ingesting-nested-data)
+ [Aggregating nested data with subqueries](#aggregating-with-subquery)
+ [Joining Amazon Redshift and nested data](#joining-redshift-data)

## Ingesting nested data
<a name="ingesting-nested-data"></a>

You can use a `CREATE TABLE AS` statement to ingest data from an external table that contains complex data types. The following query extracts all customers and their phone numbers from the external table, using `LEFT JOIN`, and stores them in the Amazon Redshift table `CustomerPhones`. 

```
CREATE TABLE CustomerPhones AS
SELECT  c.name.given, c.name.family, p AS phone
FROM    spectrum.customers c LEFT JOIN c.phones p ON true;
```
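After ingestion, `CustomerPhones` is an ordinary Amazon Redshift table with scalar columns, so you can query it without any of the nested-data extensions. For example, the following sketch counts phone numbers per family name.

```
SELECT   family, COUNT(phone) AS phonecount
FROM     CustomerPhones
GROUP BY family;
```

Because the `LEFT JOIN` in the `CREATE TABLE AS` statement preserves customers without phone numbers, `COUNT(phone)` returns 0 for those rows.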

## Aggregating nested data with subqueries
<a name="aggregating-with-subquery"></a>

You can use a subquery to aggregate nested data. The following example illustrates this approach. 

```
SELECT c.name.given, c.name.family, (SELECT COUNT(*) FROM c.orders o) AS ordercount 
FROM   spectrum.customers c;
```

The following data is returned.

```
given   |  family  |  ordercount
--------|----------|--------------
 Jenny  |  Doe     |       0
 John   |  Smith   |       2
 Andy   |  Jones   |       1
 (3 rows)
```

**Note**  
When you aggregate nested data by grouping by the parent row, the most efficient way is the one shown in the previous example. In that example, the nested rows of `c.orders` are grouped by their parent row `c`. Alternatively, if you know that `id` is unique for each `customer` and `o.shipdate` is never null, you can aggregate as shown in the following example. However, this approach generally isn't as efficient as the previous example. 

```
SELECT    c.name.given, c.name.family, COUNT(o.shipdate) AS ordercount 
FROM      spectrum.customers c LEFT JOIN c.orders o ON true 
GROUP BY  c.id, c.name.given, c.name.family;
```

You can also write the query by using a subquery in the `FROM` clause that refers to an alias (`c`) of the ancestor query and extracts array data. The following example demonstrates this approach.

```
SELECT c.name.given, c.name.family, s.count AS ordercount
FROM   spectrum.customers c, (SELECT count(*) AS count FROM c.orders o) s;
```

## Joining Amazon Redshift and nested data
<a name="joining-redshift-data"></a>

You can also join Amazon Redshift data with nested data in an external table. For example, suppose that you have the following nested data in Amazon S3. (The `LOCATION` shown is a placeholder; replace it with your own Amazon S3 path.)

```
CREATE EXTERNAL TABLE spectrum.customers2 (
  id      int,
  name    struct<given:varchar(20), family:varchar(20)>,
  phones  array<varchar(20)>,
  orders  array<struct<shipdate:timestamp, item:int>>
)
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/customers/';
```

Suppose also that you have the following table in Amazon Redshift.

```
CREATE TABLE prices (
  id int,
  price double precision
);
```

The following query finds the total number and total amount of each customer's purchases based on the preceding tables. The example is only an illustration; it returns data only if you have created the tables described previously.

```
SELECT   c.name.given, c.name.family, COUNT(o.shipdate) AS ordercount, SUM(p.price) AS ordersum
FROM     spectrum.customers2 c, c.orders o, prices p
WHERE    o.item = p.id
GROUP BY c.id, c.name.given, c.name.family;
```

# Nested data limitations (preview)
<a name="nested-data-restrictions"></a>

This topic describes limitations for reading nested data with Redshift Spectrum.

**Note**  
The limitations marked (preview) in the following list only apply to preview clusters created in the following Regions.  
+ US East (Ohio) (us-east-2)
+ US East (N. Virginia) (us-east-1)
+ US West (N. California) (us-west-1)
+ Asia Pacific (Tokyo) (ap-northeast-1)
+ Europe (Ireland) (eu-west-1)
+ Europe (Stockholm) (eu-north-1)
For information about setting up Preview clusters, see [Creating a preview cluster](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html#cluster-preview) in the *Amazon Redshift Management Guide*. 

The following limitations apply to nested data:
+ An `array` or `map` type can contain other `array` or `map` types as long as queries on the nested `arrays` or `maps` don't return `scalar` values. (preview) 
+ Amazon Redshift Spectrum supports complex data types only as external tables.
+  Subquery result columns must be top-level. (preview)
+ If an `OUTER JOIN` expression refers to a nested table, it can refer only to that table and its nested arrays (and maps). If an `OUTER JOIN` expression doesn't refer to a nested table, it can refer to any number of non-nested tables.
+ If a `FROM` clause in a subquery refers to a nested table, it can't refer to any other table.
+ If a subquery depends on a nested table that refers to a parent table, the subquery can only use the parent table in the `FROM` clause. You can't use the parent in any other clauses, such as a `SELECT` or `WHERE` clause. For example, the following query doesn't run because the subquery's `SELECT` clause refers to the parent table `c`. 

  ```
  SELECT c.name.given 
  FROM   spectrum.customers c 
  WHERE (SELECT COUNT(c.id) FROM c.phones p WHERE p LIKE '858%') > 1;
  ```

  The following query works because the parent `c` is used only in the `FROM` clause of the subquery.

  ```
  SELECT c.name.given 
  FROM   spectrum.customers c 
  WHERE (SELECT COUNT(*) FROM c.phones p WHERE p LIKE '858%') > 1;
  ```
+ A subquery that accesses nested data anywhere other than the `FROM` clause must return a single value. The only exceptions are `(NOT) EXISTS` operators in a `WHERE` clause.
+ `(NOT) IN` is not supported.
+ The maximum nesting depth for all nested types is 100. This restriction applies to all file formats (Parquet, ORC, Ion, and JSON).
+ Aggregation subqueries that access nested data can only refer to `arrays` and `maps` in their `FROM` clause, not to an external table. 
+ Querying the pseudocolumns of nested data in a Redshift Spectrum table is not supported. For more information, see [Pseudocolumns](c-spectrum-external-tables.md#c-spectrum-external-tables-pseudocolumns). 
+ When extracting data from array or map columns by specifying them in a `FROM` clause, you can only select values from those columns if the values are `scalar`. For example, the following queries select elements from inside an array. The query that selects `arr.a` works because `arr.a` is a `scalar` value. The last query doesn't work because it selects `array`, which is an array extracted from `s3.nested_table` in the `FROM` clause. (preview)

  ```
  SELECT array_column FROM s3.nested_table;
  
  array_column
  -----------------
  [{"a":1},{"b":2}]
                          
  SELECT arr.a FROM s3.nested_table t, t.array_column arr;
  
  arr.a
  -----
  1
  
  --This query fails to run.
  SELECT array FROM s3.nested_table tab, tab.array_column array;
  ```

  You can’t use an array or map in the `FROM` clause that itself comes from another array or map. To select arrays or other complex structures that are nested inside other arrays, consider using indexes in the `SELECT` statement.
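As an illustration of the `(NOT) EXISTS` exception noted previously, the following sketch finds customers who have placed at least one order, assuming the `spectrum.customers` table defined earlier in this tutorial. The subquery accesses the nested `c.orders` array from a `WHERE` clause, which is permitted only inside an `EXISTS` or `NOT EXISTS` operator.

```
SELECT c.name.given, c.name.family
FROM   spectrum.customers c
WHERE  EXISTS (SELECT o.shipdate FROM c.orders o);
```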

# Serializing complex nested JSON
<a name="serializing-complex-JSON"></a>

This topic demonstrates how to serialize nested data in JSON format.

An alternative to the methods demonstrated in this tutorial is to query top-level nested collection columns as serialized JSON. You can use the serialization to inspect, convert, and ingest nested data as JSON with Redshift Spectrum. This method is supported for ORC, JSON, Ion, and Parquet formats. Use the session configuration parameter `json_serialization_enable` to configure the serialization behavior. When this parameter is set, complex data types are serialized to VARCHAR(65535). The nested JSON can be accessed with [JSON functions](json-functions.md). For more information, see [json_serialization_enable](r_json_serialization_enable.md).

For example, without setting `json_serialization_enable`, the following queries that access nested columns directly fail. 

```
SELECT * FROM spectrum.customers LIMIT 1;

=> ERROR:  Nested tables do not support '*' in the SELECT clause.

SELECT name FROM spectrum.customers LIMIT 1;

=> ERROR:  column "name" does not exist in customers
```

Setting `json_serialization_enable` enables querying top-level collections directly. 

```
SET json_serialization_enable TO true;

SELECT * FROM spectrum.customers ORDER BY id LIMIT 1;

id | name                                 | phones         | orders
---+--------------------------------------+----------------+----------------------------------------------------------------------------------------------------------------------
1  | {"given": "John", "family": "Smith"} | ["123-457789"] | [{"shipdate": "2018-03-01T11:59:59.000Z", "price": 100.50}, {"shipdate": "2018-03-01T09:10:00.000Z", "price": 99.12}]          
 
SELECT name FROM spectrum.customers ORDER BY id LIMIT 1;

name
---------
{"given": "John", "family": "Smith"}
```

Consider the following items when serializing nested JSON.
+ When collection columns are serialized as VARCHAR(65535), their nested subfields can't be accessed directly as part of the query syntax (for example, in the filter clause). However, JSON functions can be used to access nested JSON. 
+ The following specialized representations are not supported: 
  + ORC unions
  + ORC maps with complex type keys
  + Ion datagrams
  + Ion SEXP
+ Timestamps are returned as ISO serialized strings.
+ Primitive map keys are promoted to string (for example, `1` to `"1"`).
+ Top-level null values are serialized as NULLs.
+ If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
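Because serialized subfields can't be referenced directly, you can use a JSON function such as `json_extract_path_text` to read them. The following is a sketch using the `spectrum.customers` table from earlier in this section.

```
SET json_serialization_enable TO true;

SELECT json_extract_path_text(name, 'given') AS given
FROM   spectrum.customers
ORDER BY id LIMIT 1;

given
-------
John
```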

## Serializing complex types containing JSON strings
<a name="serializing-complex-JSON-strings"></a>

By default, string values contained in nested collections are serialized as escaped JSON strings. Escaping might be undesirable when the strings are themselves valid JSON. Instead, you might want to write nested subelements or fields that are VARCHAR directly as JSON. Enable this behavior with the `json_serialization_parse_nested_strings` session-level configuration parameter. When both `json_serialization_enable` and `json_serialization_parse_nested_strings` are set, valid JSON values are serialized inline without escape characters. When a value is not valid JSON, it is escaped as if `json_serialization_parse_nested_strings` were not set. For more information, see [json_serialization_parse_nested_strings](r_json_serialization_parse_nested_strings.md).

For example, assume that the data from the previous example stores a JSON string in the `given` VARCHAR(20) field of the `name` struct:

```
name
---------
{"given": "{\"first\":\"John\",\"middle\":\"James\"}", "family": "Smith"}
```

When `json_serialization_parse_nested_strings` is set, the `name` column is serialized as follows: 

```
SET json_serialization_enable TO true;
SET json_serialization_parse_nested_strings TO true;
SELECT name FROM spectrum.customers ORDER BY id LIMIT 1;

name
---------
{"given": {"first":"John","middle":"James"}, "family": "Smith"}
```

Instead of being escaped like this:

```
SET json_serialization_enable TO true;
SELECT name FROM spectrum.customers ORDER BY id LIMIT 1;

name
---------
{"given": "{\"first\":\"John\",\"middle\":\"James\"}", "family": "Smith"}
```