Data conversion when exporting to an Amazon S3 bucket for Amazon RDS
When you export a DB snapshot to an Amazon S3 bucket, Amazon RDS converts data to,
exports data in, and stores data in the Parquet format. For more information about
Parquet, see the Apache
Parquet
Parquet stores all data as one of the following primitive types:
-
BOOLEAN
-
INT32
-
INT64
-
INT96
-
FLOAT
-
DOUBLE
-
BYTE_ARRAY – A variable-length byte array, also known as binary
-
FIXED_LEN_BYTE_ARRAY – A fixed-length byte array used when the values have a constant size
The Parquet data types are few to reduce the complexity of reading and writing the format.
Parquet provides logical types for extending primitive types. A logical type is implemented as an annotation with the data in a
LogicalType
metadata field. The logical type annotation explains how to
interpret the primitive type.
When the STRING
logical type annotates a BYTE_ARRAY
type, it
indicates that the byte array should be interpreted as a UTF-8 encoded character string.
After an export task completes, Amazon RDS notifies you if any string conversion occurred.
The underlying data exported is always the same as the data from the source. However,
due to the encoding difference in UTF-8, some characters might appear different from the
source when read in tools such as Athena.
For more information, see Parquet
logical type definitions
MySQL and MariaDB data type mapping to Parquet
The following table shows the mapping from MySQL and MariaDB data types to Parquet data types when data is converted and exported to Amazon S3.
Source data type | Parquet primitive type | Logical type annotation | Conversion notes |
---|---|---|---|
Numeric data types | |||
BIGINT | INT64 | ||
BIGINT UNSIGNED | FIXED_LEN_BYTE_ARRAY(9) | DECIMAL(20,0) | Parquet supports only signed types, so the mapping requires an additional byte (8 plus 1) to store the BIGINT_UNSIGNED type. |
BIT | BYTE_ARRAY | ||
DECIMAL | INT32 | DECIMAL(p,s) | If the source value is less than 231, it's stored as INT32. |
INT64 | DECIMAL(p,s) | If the source value is 231 or greater, but less than 263, it's stored as INT64. | |
FIXED_LEN_BYTE_ARRAY(N) | DECIMAL(p,s) | If the source value is 263 or greater, it's stored as FIXED_LEN_BYTE_ARRAY(N). | |
BYTE_ARRAY | STRING | Parquet doesn't support Decimal precision greater than 38. The Decimal value is converted to a string in a BYTE_ARRAY type and encoded as UTF8. | |
DOUBLE | DOUBLE | ||
FLOAT | DOUBLE | ||
INT | INT32 | ||
INT UNSIGNED | INT64 | ||
MEDIUMINT | INT32 | ||
MEDIUMINT UNSIGNED | INT64 | ||
NUMERIC | INT32 | DECIMAL(p,s) |
If the source value is less than 231, it's stored as INT32. |
INT64 | DECIMAL(p,s) | If the source value is 231 or greater, but less than 263, it's stored as INT64. | |
FIXED_LEN_ARRAY(N) | DECIMAL(p,s) | If the source value is 263 or greater, it's stored as FIXED_LEN_BYTE_ARRAY(N). | |
BYTE_ARRAY | STRING | Parquet doesn't support Numeric precision greater than 38. This Numeric value is converted to a string in a BYTE_ARRAY type and encoded as UTF8. | |
SMALLINT | INT32 | ||
SMALLINT UNSIGNED | INT32 | ||
TINYINT | INT32 | ||
TINYINT UNSIGNED | INT32 | ||
String data types | |||
BINARY | BYTE_ARRAY | ||
BLOB | BYTE_ARRAY | ||
CHAR | BYTE_ARRAY | ||
ENUM | BYTE_ARRAY | STRING | |
LINESTRING | BYTE_ARRAY | ||
LONGBLOB | BYTE_ARRAY | ||
LONGTEXT | BYTE_ARRAY | STRING | |
MEDIUMBLOB | BYTE_ARRAY | ||
MEDIUMTEXT | BYTE_ARRAY | STRING | |
MULTILINESTRING | BYTE_ARRAY | ||
SET | BYTE_ARRAY | STRING | |
TEXT | BYTE_ARRAY | STRING | |
TINYBLOB | BYTE_ARRAY | ||
TINYTEXT | BYTE_ARRAY | STRING | |
VARBINARY | BYTE_ARRAY | ||
VARCHAR | BYTE_ARRAY | STRING | |
Date and time data types | |||
DATE | BYTE_ARRAY | STRING | A date is converted to a string in a BYTE_ARRAY type and encoded as UTF8. |
DATETIME | INT64 | TIMESTAMP_MICROS | |
TIME | BYTE_ARRAY | STRING | A TIME type is converted to a string in a BYTE_ARRAY and encoded as UTF8. |
TIMESTAMP | INT64 | TIMESTAMP_MICROS | |
YEAR | INT32 | ||
Geometric data types | |||
GEOMETRY | BYTE_ARRAY | ||
GEOMETRYCOLLECTION | BYTE_ARRAY | ||
MULTIPOINT | BYTE_ARRAY | ||
MULTIPOLYGON | BYTE_ARRAY | ||
POINT | BYTE_ARRAY | ||
POLYGON | BYTE_ARRAY | ||
JSON data type | |||
JSON | BYTE_ARRAY | STRING |
PostgreSQL data type mapping to Parquet
The following table shows the mapping from PostgreSQL data types to Parquet data types when data is converted and exported to Amazon S3.
PostgreSQL data type | Parquet primitive type | Logical type annotation | Mapping notes |
---|---|---|---|
Numeric data types | |||
BIGINT | INT64 | ||
BIGSERIAL | INT64 | ||
DECIMAL | BYTE_ARRAY | STRING | A DECIMAL type is converted to a string in a BYTE_ARRAY type and
encoded as UTF8. This conversion is to avoid complications due to data precision and data values that are not a number (NaN). |
DOUBLE PRECISION | DOUBLE | ||
INTEGER | INT32 | ||
MONEY | BYTE_ARRAY | STRING | |
REAL | FLOAT | ||
SERIAL | INT32 | ||
SMALLINT | INT32 | INT_16 | |
SMALLSERIAL | INT32 | INT_16 | |
String and related data types | |||
ARRAY | BYTE_ARRAY | STRING |
An array is converted to a string and encoded as BINARY (UTF8). This conversion is to avoid complications due to data precision, data values that are not a number (NaN), and time data values. |
BIT | BYTE_ARRAY | STRING | |
BIT VARYING | BYTE_ARRAY | STRING | |
BYTEA | BINARY | ||
CHAR | BYTE_ARRAY | STRING | |
CHAR(N) | BYTE_ARRAY | STRING | |
ENUM | BYTE_ARRAY | STRING | |
NAME | BYTE_ARRAY | STRING | |
TEXT | BYTE_ARRAY | STRING | |
TEXT SEARCH | BYTE_ARRAY | STRING | |
VARCHAR(N) | BYTE_ARRAY | STRING | |
XML | BYTE_ARRAY | STRING | |
Date and time data types | |||
DATE | BYTE_ARRAY | STRING | |
INTERVAL | BYTE_ARRAY | STRING | |
TIME | BYTE_ARRAY | STRING | |
TIME WITH TIME ZONE | BYTE_ARRAY | STRING | |
TIMESTAMP | BYTE_ARRAY | STRING | |
TIMESTAMP WITH TIME ZONE | BYTE_ARRAY | STRING | |
Geometric data types | |||
BOX | BYTE_ARRAY | STRING | |
CIRCLE | BYTE_ARRAY | STRING | |
LINE | BYTE_ARRAY | STRING | |
LINESEGMENT | BYTE_ARRAY | STRING | |
PATH | BYTE_ARRAY | STRING | |
POINT | BYTE_ARRAY | STRING | |
POLYGON | BYTE_ARRAY | STRING | |
JSON data types | |||
JSON | BYTE_ARRAY | STRING | |
JSONB | BYTE_ARRAY | STRING | |
Other data types | |||
BOOLEAN | BOOLEAN | ||
CIDR | BYTE_ARRAY | STRING | Network data type |
COMPOSITE | BYTE_ARRAY | STRING | |
DOMAIN | BYTE_ARRAY | STRING | |
INET | BYTE_ARRAY | STRING | Network data type |
MACADDR | BYTE_ARRAY | STRING | |
OBJECT IDENTIFIER | N/A | ||
PG_LSN | BYTE_ARRAY | STRING | |
RANGE | BYTE_ARRAY | STRING | |
UUID | BYTE_ARRAY | STRING |