Amazon Redshift will no longer support the creation of new Python UDFs starting Patch 198. Existing Python UDFs will continue to function until June 30, 2026. For more information, see the [ blog post ](https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/). 

# Tutorial: Querying nested data with Amazon Redshift Spectrum
<a name="tutorial-query-nested-data"></a>

This tutorial demonstrates how to query nested data with Redshift Spectrum. Nested data is data that contains nested fields. Nested fields are fields that are joined together as a single entity, such as arrays, structs, or objects. 

**Topics**
+ [Overview](#tutorial-nested-data-overview)
+ [Step 1: Create an external table that contains nested data](#tutorial-nested-data-create-table)
+ [Step 2: Query your nested data in Amazon S3 with SQL extensions](#tutorial-query-nested-data-sqlextensions)
+ [Nested data use cases](nested-data-use-cases.md)
+ [Nested data limitations (preview)](nested-data-restrictions.md)
+ [Serializing complex nested JSON](serializing-complex-JSON.md)

## Overview
<a name="tutorial-nested-data-overview"></a>

Amazon Redshift Spectrum supports querying nested data in Parquet, ORC, JSON, and Ion file formats. Redshift Spectrum accesses the data using external tables. You can create external tables that use the complex data types `struct`, `array`, and `map`.

For example, suppose that your data file contains the following data in Amazon S3 in a folder named `customers`. Although there isn't a single root element, each JSON object in this sample data represents a row in a table. 

```
{"id": 1,
 "name": {"given": "John", "family": "Smith"},
 "phones": ["123-457789"],
 "orders": [{"shipdate": "2018-03-01T11:59:59.000Z", "price": 100.50},
            {"shipdate": "2018-03-01T09:10:00.000Z", "price": 99.12}]
}
{"id": 2,
 "name": {"given": "Jenny", "family": "Doe"},
 "phones": ["858-8675309", "415-9876543"],
 "orders": []
}
{"id": 3,
 "name": {"given": "Andy", "family": "Jones"},
 "phones": [],
 "orders": [{"shipdate": "2018-03-02T08:02:15.000Z", "price": 13.50}]
}
```

You can use Amazon Redshift Spectrum to query nested data in files. The following tutorial shows you how to do so with Apache Parquet data.

### Prerequisites
<a name="tutorial-nested-data-prereq"></a>

If you are not using Redshift Spectrum yet, follow the steps in the [Getting started with Amazon Redshift Spectrum](c-getting-started-using-spectrum.md) before continuing.

To create an external schema, replace the IAM role ARN in the following command with the role ARN you created in [Create an IAM role](c-getting-started-using-spectrum.md#c-getting-started-using-spectrum-create-role). Then run the command in your SQL client.

```
create external schema spectrum 
from data catalog 
database 'myspectrum_db' 
iam_role 'arn:aws:iam::123456789012:role/myspectrum_role'
create external database if not exists;
```

## Step 1: Create an external table that contains nested data
<a name="tutorial-nested-data-create-table"></a>

You can view the [source data](https://s3.amazonaws.com/redshift-downloads/tickit/spectrum/customers/customer_file1) by downloading it from Amazon S3. 

To create the external table for this tutorial, run the following command. 

```
CREATE EXTERNAL TABLE spectrum.customers (
  id     int,
  name   struct<given:varchar(20), family:varchar(20)>,
  phones array<varchar(20)>,
  orders array<struct<shipdate:timestamp, price:double precision>>
)
STORED AS PARQUET
LOCATION 's3://redshift-downloads/tickit/spectrum/customers/';
```

In the example preceding, the external table `spectrum.customers` uses the `struct` and `array` data types to define columns with nested data. Amazon Redshift Spectrum supports querying nested data in Parquet, ORC, JSON, and Ion file formats. The `STORED AS` parameter is `PARQUET` for Apache Parquet files. The `LOCATION` parameter has to refer to the Amazon S3 folder that contains the nested data or files. For more information, see [CREATE EXTERNAL TABLE](r_CREATE_EXTERNAL_TABLE.md).

You can nest `array` and `struct` types at any level. For example, you can define a column named `toparray` as shown in the following example.

```
toparray array<struct<nestedarray:
         array<struct<morenestedarray: 
         array<string>>>>>
```

You can also nest `struct` types as shown for column `x` in the following example.

```
x struct<a: string,
         b: struct<c: integer,
                   d: struct<e: string>
                  >
        >
```

## Step 2: Query your nested data in Amazon S3 with SQL extensions
<a name="tutorial-query-nested-data-sqlextensions"></a>

Redshift Spectrum supports querying `array`, `map`, and `struct` complex types through extensions to the Amazon Redshift SQL syntax. 

### Extension 1: Access to columns of structs
<a name="nested-data-sqlextension1"></a>

You can extract data from `struct` columns using a dot notation that concatenates field names into paths. For example, the following query returns given and family names for customers. The given name is accessed by the long path `c.name.given`. The family name is accessed by the long path `c.name.family`. 

```
SELECT c.id, c.name.given, c.name.family
FROM   spectrum.customers c;
```

The preceding query returns the following data.

```
id | given | family
---|-------|-------
1  | John  | Smith
2  | Jenny | Doe
3  | Andy  | Jones
(3 rows)
```

A `struct` can be a column of another `struct`, which can be a column of another `struct`, at any level. The paths that access columns in such deeply nested `struct`s can be arbitrarily long. For example, see the definition for the column `x` in the following example.

```
x struct<a: string,
         b: struct<c: integer, 
                      d: struct<e: string>
                  >
        >
```

You can access the data in `e` as `x.b.d.e`.

### Extension 2: Ranging over arrays in a FROM clause
<a name="nested-data-sqlextension2"></a>

You can extract data from `array` columns (and, by extension, `map` columns) by specifying the `array` columns in a `FROM` clause in place of table names. The extension applies to the `FROM` clause of the main query, and also the `FROM` clauses of subqueries.

You can reference `array` elements by position, such as `c.orders[0]`. (preview)

By combining ranging over `arrays` with joins, you can achieve various kinds of unnesting, as explained in the following use cases. 

#### Unnesting using inner joins
<a name="unnest-inner-joins"></a>

The following query selects customer IDs and order ship dates for customers that have orders. The SQL extension in the FROM clause `c.orders o` depends on the alias `c`.

```
SELECT c.id, o.shipdate
FROM   spectrum.customers c, c.orders o
```

For each customer `c` that has orders, the `FROM` clause returns one row for each order `o` of the customer `c`. That row combines the customer row `c` and the order row `o`. Then the `SELECT` clause keeps only the `c.id` and `o.shipdate`. The result is the following.

```
id|      shipdate
--|----------------------
1 |2018-03-01  11:59:59
1 |2018-03-01  09:10:00
3 |2018-03-02  08:02:15
(3 rows)
```

The alias `c` provides access to the customer fields, and the alias `o` provides access to the order fields. 

The semantics are similar to standard SQL. You can think of the `FROM` clause as running the following nested loop, which is followed by `SELECT` choosing the fields to output. 

```
for each customer c in spectrum.customers
  for each order o in c.orders
     output c.id and o.shipdate
```

Therefore, if a customer doesn't have an order, the customer doesn't appear in the result.

You can also think of this as the `FROM` clause performing a `JOIN` with the `customers` table and the `orders` array. In fact, you can also write the query as shown in the following example.

```
SELECT c.id, o.shipdate
FROM   spectrum.customers c INNER JOIN c.orders o ON true
```

**Note**  
If a schema named `c` exists with a table named `orders`, then `c.orders` refers to the table `orders`, and not the array column of `customers`.

#### Unnesting using left joins
<a name="unnest-left-joins"></a>

The following query outputs all customer names and their orders. If a customer hasn't placed an order, the customer's name is still returned. However, in this case, the order columns are NULL, as shown in the following example for Jenny Doe.

```
SELECT c.id, c.name.given, c.name.family, o.shipdate, o.price
FROM   spectrum.customers c LEFT JOIN c.orders o ON true
```

The preceding query returns the following data.

```
id  |  given  | family  |    shipdate          | price
----|---------|---------|----------------------|--------
 1  |  John   | Smith   | 2018-03-01  11:59:59 | 100.5
 1  |  John   | Smith   | 2018-03-01  09:10:00 |  99.12
 2  |  Jenny  | Doe     |                      |
 3  |  Andy   | Jones   | 2018-03-02  08:02:15 |  13.5
 (4 rows)
```

### Extension 3: Accessing an array of scalars directly using an alias
<a name="nested-data-sqlextension3"></a>

When an alias `p` in a `FROM` clause ranges over an array of scalars, the query refers to the values of `p` as `p`. For example, the following query produces pairs of customer names and phone numbers.

```
SELECT c.name.given, c.name.family, p AS phone
FROM   spectrum.customers c LEFT JOIN c.phones p ON true
```

The preceding query returns the following data.

```
given  |  family  |  phone
-------|----------|-----------
John   | Smith    | 123-4577891
Jenny  | Doe      | 858-8675309
Jenny  | Doe      | 415-9876543
Andy   | Jones    | 
(4 rows)
```

### Extension 4: Accessing elements of maps
<a name="nested-data-sqlextension4"></a>

Redshift Spectrum treats the `map` data type as an `array` type that contains `struct` types with a `key` column and a `value` column. The `key` must be a `scalar`; the value can be any data type. 

For example, the following code creates an external table with a `map` for storing phone numbers.

```
CREATE EXTERNAL TABLE spectrum.customers2 (
  id     int,
  name   struct<given:varchar(20), family:varchar(20)>,
  phones map<varchar(20), varchar(20)>,
  orders array<struct<shipdate:timestamp, price:double precision>>
)
STORED AS PARQUET
LOCATION 's3://redshift-downloads/tickit/spectrum/customers/';
```

Because a `map` type behaves like an `array` type with columns `key` and `value`, you can think of the preceding schemas as if they were the following.

```
CREATE EXTERNAL TABLE spectrum.customers3 (
  id     int,
  name   struct<given:varchar(20), family:varchar(20)>,
  phones array<struct<key:varchar(20), value:varchar(20)>>,
  orders array<struct<shipdate:timestamp, price:double precision>>
)
STORED AS PARQUET
LOCATION 's3://redshift-downloads/tickit/spectrum/customers/';
```

The following query returns the names of customers with a mobile phone number and returns the number for each name. The map query is treated as the equivalent of querying a nested `array` of `struct` types. The following query only returns data if you have created the external table as described previously. 

```
SELECT c.name.given, c.name.family, p.value 
FROM   spectrum.customers c, c.phones p 
WHERE  p.key = 'mobile';
```

**Note**  
The `key` for a `map` is a `string` for Ion and JSON file types.

# Nested data use cases
<a name="nested-data-use-cases"></a>

This topic describes use cases for nested data. Nested data is data that contains nested fields. Nested fields are fields that are joined together as a single entity, such as arrays, structs, or objects. 

You can combine the extensions described previously with the usual SQL features. The following use cases illustrate some common combinations. These examples help demonstrate how you can use nested data. They aren't part of the tutorial.

**Topics**
+ [Ingesting nested data](#ingesting-nested-data)
+ [Aggregating nested data with subqueries](#aggregating-with-subquery)
+ [Joining Amazon Redshift and nested data](#joining-redshift-data)

## Ingesting nested data
<a name="ingesting-nested-data"></a>

You can use a `CREATE TABLE AS` statement to ingest data from an external table that contains complex data types. The following query extracts all customers and their phone numbers from the external table, using `LEFT JOIN`, and stores them in the Amazon Redshift table `CustomerPhones`. 

```
CREATE TABLE CustomerPhones AS
SELECT  c.name.given, c.name.family, p AS phone
FROM    spectrum.customers c LEFT JOIN c.phones p ON true;
```

## Aggregating nested data with subqueries
<a name="aggregating-with-subquery"></a>

You can use a subquery to aggregate nested data. The following example illustrates this approach. 

```
SELECT c.name.given, c.name.family, (SELECT COUNT(*) FROM c.orders o) AS ordercount 
FROM   spectrum.customers c;
```

The following data is returned.

```
given   |  family  |  ordercount
--------|----------|--------------
 Jenny  |  Doe     |       0
 John   |  Smith   |       2
 Andy   |  Jones   |       1
 (3 rows)
```

**Note**  
When you aggregate nested data by grouping by the parent row, the most efficient way is the one shown in the previous example. In that example, the nested rows of `c.orders` are grouped by their parent row `c`. Alternatively, if you know that `id` is unique for each `customer` and `o.shipdate` is never null, you can aggregate as shown in the following example. However, this approach generally isn't as efficient as the previous example. 

```
SELECT    c.name.given, c.name.family, COUNT(o.shipdate) AS ordercount 
FROM      spectrum.customers c LEFT JOIN c.orders o ON true 
GROUP BY  c.id, c.name.given, c.name.family;
```

You can also write the query by using a subquery in the `FROM` clause that refers to an alias (`c`) of the ancestor query and extracts array data. The following example demonstrates this approach.

```
SELECT c.name.given, c.name.family, s.count AS ordercount
FROM   spectrum.customers c, (SELECT count(*) AS count FROM c.orders o) s;
```

## Joining Amazon Redshift and nested data
<a name="joining-redshift-data"></a>

You can also join Amazon Redshift data with nested data in an external table. For example, suppose that you have the following nested data in Amazon S3. 

```
CREATE EXTERNAL TABLE spectrum.customers2 (
  id      int,
  name    struct<given:varchar(20), family:varchar(20)>,
  phones  array<varchar(20)>,
  orders  array<struct<shipdate:timestamp, item:int>>
);
```

Suppose also that you have the following table in Amazon Redshift.

```
CREATE TABLE prices (
  id int,
  price double precision
);
```

The following query finds the total number and amount of each customer's purchases based on the preceding. The following example is only an illustration. It only returns data if you have created the tables described previously. 

```
SELECT   c.name.given, c.name.family, COUNT(o.date) AS ordercount, SUM(p.price) AS ordersum 
FROM     spectrum.customers2 c, c.orders o, prices p ON o.item = p.id  
GROUP BY c.id, c.name.given, c.name.family;
```

# Nested data limitations (preview)
<a name="nested-data-restrictions"></a>

This topic describes limitations for reading nested data with Redshift Spectrum. Nested data is data that contains nested fields. Nested fields are fields that are joined together as a single entity, such as arrays, structs, or objects. 

**Note**  
The limitations marked (preview) in the following list only apply to preview clusters created in the following Regions.  
US East (Ohio) (us-east-2)
US East (N. Virginia) (us-east-1)
US West (N. California) (us-west-1)
Asia Pacific (Tokyo) (ap-northeast-1)
Europe (Ireland) (eu-west-1)
Europe (Stockholm) (eu-north-1)
For information about setting up Preview clusters, see [Creating a preview cluster](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html#cluster-preview) in the *Amazon Redshift Management Guide*. 

The following limitations apply to nested data:
+ An `array` or `map` type can contain other `array` or `map` types as long as queries on the nested `arrays` or `maps` don't return `scalar` values. (preview) 
+ Amazon Redshift Spectrum supports complex data types only as external tables.
+  Subquery result columns must be top-level. (preview)
+ If an `OUTER JOIN` expression refers to a nested table, it can refer only to that table and its nested arrays (and maps). If an `OUTER JOIN` expression doesn't refer to a nested table, it can refer to any number of non-nested tables.
+ If a `FROM` clause in a subquery refers to a nested table, it can't refer to any other table.
+ If a subquery depends on a nested table that refers to a parent table, the subquery can only use the parent table in the `FROM` clause. You can't use the parent in any other clauses, such as a `SELECT` or `WHERE` clause. For example, the following query doesn't run because the subquery's `SELECT` clause refers to the parent table `c`. 

  ```
  SELECT c.name.given 
  FROM   spectrum.customers c 
  WHERE (SELECT COUNT(c.id) FROM c.phones p WHERE p LIKE '858%') > 1;
  ```

  The following query works because the parent `c` is used only in the `FROM` clause of the subquery.

  ```
  SELECT c.name.given 
  FROM   spectrum.customers c 
  WHERE (SELECT COUNT(*) FROM c.phones p WHERE p LIKE '858%') > 1;
  ```
+ A subquery that accesses nested data anywhere other than the `FROM` clause must return a single value. The only exceptions are `(NOT) EXISTS` operators in a `WHERE` clause.
+ `(NOT) IN` is not supported.
+ The maximum nesting depth for all nested types is 100. This restriction applies to all file formats (Parquet, ORC, Ion, and JSON).
+ Aggregation subqueries that access nested data can only refer to `arrays` and `maps` in their `FROM` clause, not to an external table. 
+ Querying the pseudocolumns of nested data in a Redshift Spectrum table is not supported. For more information, see [Pseudocolumns](c-spectrum-external-tables.md#c-spectrum-external-tables-pseudocolumns). 
+ When extracting data from array or map columns by specifying them in a `FROM` clause, you can only select values from those columns if the values are `scalar`. For example, the following queries both try to `SELECT` elements from inside an array. The query that selects `arr.a` works because `arr.a` is a `scalar` value. The second query doesn't work because `array` is an array extracted from `s3.nested table` in the `FROM` clause. (preview)

  ```
  SELECT array_column FROM s3.nested_table;
  
  array_column
  -----------------
  [{"a":1},{"b":2}]
                          
  SELECT arr.a FROM s3.nested_table t, t.array_column arr;
  
  arr.a
  -----
  1
  
  --This query fails to run.
  SELECT array FROM s3.nested_table tab, tab.array_column array;
  ```

  You can’t use an array or map in the `FROM` clause that itself comes from another array or map. To select arrays or other complex structures that are nested inside other arrays, consider using indexes in the `SELECT` statement.

# Serializing complex nested JSON
<a name="serializing-complex-JSON"></a>

This topic demonstrates how to serialize nested data in JSON format. Nested data is data that contains nested fields. Nested fields are fields that are joined together as a single entity, such as arrays, structs, or objects. 

An alternate to methods demonstrated in this tutorial is to query top-level nested collection columns as serialized JSON. You can use the serialization to inspect, convert, and ingest nested data as JSON with Redshift Spectrum. This method is supported for ORC, JSON, Ion, and Parquet formats. Use the session configuration parameter `json_serialization_enable` to configure the serialization behavior. When set, complex JSON data types are serialized to VARCHAR(65535). The nested JSON can be accessed with [JSON functions](json-functions.md). For more information, see [json\$1serialization\$1enable](r_json_serialization_enable.md).

For example, without setting `json_serialization_enable`, the following queries that access nested columns directly fail. 

```
SELECT * FROM spectrum.customers LIMIT 1;

=> ERROR:  Nested tables do not support '*' in the SELECT clause.

SELECT name FROM spectrum.customers LIMIT 1;

=> ERROR:  column "name" does not exist in customers
```

Setting `json_serialization_enable` enables querying top-level collections directly. 

```
SET json_serialization_enable TO true;

SELECT * FROM spectrum.customers order by id LIMIT 1;

id | name                                 | phones         | orders
---+--------------------------------------+----------------+----------------------------------------------------------------------------------------------------------------------
1  | {"given": "John", "family": "Smith"} | ["123-457789"] | [{"shipdate": "2018-03-01T11:59:59.000Z", "price": 100.50}, {"shipdate": "2018-03-01T09:10:00.000Z", "price": 99.12}]          
 
SELECT name FROM spectrum.customers order by id LIMIT 1;

name
---------
{"given": "John", "family": "Smith"}
```

Consider the following items when serializing nested JSON.
+ When collection columns are serialized as VARCHAR(65535), their nested subfields can't be accessed directly as part of the query syntax (for example, in the filter clause). However, JSON functions can be used to access nested JSON. 
+ The following specialized representations are not supported: 
  + ORC unions
  + ORC maps with complex type keys
  + Ion datagrams
  + Ion SEXP
+ Timestamps are returned as ISO serialized strings.
+ Primitive map keys are promoted to string (for example, `1` to `"1"`).
+ Top-level null values are serialized as NULLs.
+ If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.

## Serializing complex types containing JSON strings
<a name="serializing-complex-JSON-strings"></a>

By default, string values contained in nested collections are serialized as escaped JSON strings. Escaping might be undesirable when the strings are valid JSON. Instead you might want to write nested subelements or fields that are VARCHAR directly as JSON. Enable this behavior with the `json_serialization_parse_nested_strings` session-level configuration. When both `json_serialization_enable` and `json_serialization_parse_nested_strings` are set, valid JSON values are serialized inline without escape characters. When the value is not valid JSON, it is escaped as if the `json_serialization_parse_nested_strings` configuration value was not set. For more information, see [json\$1serialization\$1parse\$1nested\$1strings](r_json_serialization_parse_nested_strings.md).

For example, assume the data from the previous example contained JSON as a `structs` complex type in the `name` VARCHAR(20) field: 

```
name
---------
{"given": "{\"first\":\"John\",\"middle\":\"James\"}", "family": "Smith"}
```

When `json_serialization_parse_nested_strings` is set, the `name` column is serialized as follows: 

```
SET json_serialization_enable TO true;
SET json_serialization_parse_nested_strings TO true;
SELECT name FROM spectrum.customers order by id LIMIT 1;

name
---------
{"given": {"first":"John","middle":"James"}, "family": "Smith"}
```

Instead of being escaped like this:

```
SET json_serialization_enable TO true;
SELECT name FROM spectrum.customers order by id LIMIT 1;

name
---------
{"given": "{\"first\":\"John\",\"middle\":\"James\"}", "family": "Smith"}
```