Amazon Ion Hive SerDe - Amazon Athena

Amazon Ion Hive SerDe

You can use the Amazon Ion Hive SerDe to query data stored in Amazon Ion format. Amazon Ion is a richly-typed, self-describing, open source data format. The Amazon Ion format is used by services such as Amazon Quantum Ledger Database (Amazon QLDB) and in the open source SQL query language PartiQL.

Amazon Ion has binary and text formats that are interchangeable. This feature combines the ease of use of text with the efficiency of binary encoding.

To query Amazon Ion data from Athena, you can use the Amazon Ion Hive SerDe, which serializes and deserializes Amazon Ion data. Deserialization allows you to run queries on the Amazon Ion data or read it for writing out into a different format like Parquet or ORC. Serialization lets you generate data in the Amazon Ion format by using CREATE TABLE AS SELECT (CTAS) or INSERT INTO queries to copy data from existing tables.

Note

Because Amazon Ion is a superset of JSON, you can use the Amazon Ion Hive SerDe to query non-Amazon Ion JSON datasets. Unlike other JSON SerDe libraries, the Amazon Ion SerDe does not expect each row of data to be on a single line. This feature is useful if you want to query JSON datasets that are in "pretty print" format or otherwise break up the fields in a row with newline characters.

For additional information and examples of querying Amazon Ion with Athena, see Analyze Amazon Ion datasets using Amazon Athena.

Serialization library name

The serialization library name for the Amazon Ion SerDe is com.amazon.ionhiveserde.IonHiveSerDe. For source code information, see Amazon Ion Hive SerDe on GitHub.com.

Considerations and limitations

  • Duplicated fields – Amazon Ion structs are ordered and support duplicated fields, while Hive's STRUCT<> and MAP<> do not. Thus, when you deserialize a duplicated field from an Amazon Ion struct, a single value is chosen non deterministically, and the others are ignored.

  • External symbol tables unsupported – Currently, Athena does not support external symbol tables or the following Amazon Ion Hive SerDe properties:

    • ion.catalog.class

    • ion.catalog.file

    • ion.catalog.url

    • ion.symbol_table_imports

  • File extensions – Amazon Ion uses file extensions to determine which compression codec to use for deserializing Amazon Ion files. As such, compressed files must have the file extension that corresponds to the compression algorithm used. For example, if ZSTD is used, corresponding files should have the extension .zst.

  • Homogeneous data – Amazon Ion has no restrictions on the data types that can be used for values in particular fields. For example, two different Amazon Ion documents might have a field with the same name that have different data types. However, because Hive uses a schema, all values that you extract to a single Hive column must have the same data type.

  • Map key type restrictions – When you serialize data from another format into Amazon Ion, ensure that the map key type is one of STRING, VARCHAR, or CHAR. Although Hive allows you to use any primitive data type as a map key, Amazon Ion symbols must be a string type.

  • Union type – Athena does not currently support the Hive union type.

  • Double data type – Amazon Ion does not currently support the double data type.