Using Parquet data
Neptune Analytics supports importing data in the Parquet format. A Parquet file must contain a header row and the column values. The remainder of the file is interpreted based on the corresponding header column. The header should contain predefined system column names, user-defined column names, or both. In addition to the header row and column values, a Parquet file also carries metadata, which is stored in the Parquet file itself and is used when reading and decoding the data.
Note
Compression for Parquet format is not supported at this time.
System column headers
The required and allowed system column headers are different for vertex files and edge files. Each system column can appear only once in a header. All labels are case sensitive.
Note
The ~id (:ID) column in edge (relationship) files in Parquet format is not supported. It is ignored if provided in any of the edge (relationship) files.
Vertex headers
- ~id - Required. An id for the vertex.
- ~label - Optional. A label for the vertex. Multiple label values are supported, separated by semicolons (;).
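As an illustration of the semicolon-separated ~label value, the following minimal sketch builds and parses such a cell value in plain Python. The helper names join_labels and split_labels are hypothetical and are not part of any Neptune Analytics API. Rejecting labels that contain a semicolon is an assumption made here for safety, since no escaping rule is described.

```python
def join_labels(labels):
    """Combine several vertex labels into one ~label cell value."""
    for label in labels:
        # Assumption: a label containing ';' cannot be represented,
        # because ';' is the separator and no escape rule is described.
        if ";" in label:
            raise ValueError(f"label {label!r} contains the ';' separator")
    return ";".join(labels)


def split_labels(cell):
    """Split a ~label cell value back into its individual labels."""
    return [part for part in cell.split(";") if part]


# A vertex row with two labels, keyed by the system column headers.
row = {"~id": "v1", "~label": join_labels(["Person", "Employee"])}
print(row["~label"])
```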
Edge headers
- ~from - Required. The vertex id of the from vertex.
- ~to - Required. The vertex id of the to vertex.
- ~label - Optional. A label for the edge. Edges can only have a single label.
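The system column rules above can be checked before a file is written. The following is a minimal sketch that inspects a list of column names for a vertex or edge file. The function name check_system_headers and the wording of its messages are hypothetical. Matching column names exactly, so that ~ID is not accepted in place of ~id, is an assumption based on the case-sensitivity statement above.

```python
from collections import Counter

# System columns per file kind, as listed in this section.
REQUIRED = {"vertex": {"~id"}, "edge": {"~from", "~to"}}
OPTIONAL = {"vertex": {"~label"}, "edge": {"~label"}}


def check_system_headers(columns, kind):
    """Return a list of findings for the header of a 'vertex' or 'edge' file."""
    findings = []
    counts = Counter(columns)  # exact, case-sensitive match (assumption)
    for name in sorted(REQUIRED[kind]):
        if counts[name] == 0:
            findings.append(f"missing required column {name}")
    for name in sorted(REQUIRED[kind] | OPTIONAL[kind]):
        if counts[name] > 1:
            findings.append(f"system column {name} appears {counts[name]} times")
    # An ~id column in an edge file is not an error, but it is ignored.
    if kind == "edge" and counts["~id"] > 0:
        findings.append("~id is ignored in edge files")
    return findings


print(check_system_headers(["~id", "~label", "name"], "vertex"))
print(check_system_headers(["~id", "~from", "since"], "edge"))
```

An empty list means the header satisfies the rules that this sketch checks. It does not validate the column values themselves.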
Property column headers
Unlike the property column headers of the CSV format, the property column headers of the Parquet format only need to contain the property names. There is no need to include the type names or the cardinality.
There are, however, some special column types in the Parquet format that require annotation in the metadata, including the Any type, the Date type, and the dateTime type. For more details on the Any, Date, and dateTime types, please refer to using CSV data.
The following object is an example of the metadata that has an Any type column, a Date type column, and a dateTime type column annotated:
"metadata": { "anyTypeColumns": ["UserCol1"], "dateTypeColumns": ["UserCol2"], "dateTimeTypeColumns": ["UserCol3"] }
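A metadata object of this shape can be assembled programmatically. The sketch below only builds the object and checks that every annotated name is one of the property columns. It does not show how the object is embedded in the Parquet file, which depends on the Parquet writer being used. The function name build_type_metadata and the extra column "name" are hypothetical.

```python
import json


def build_type_metadata(property_columns, any_cols=(), date_cols=(), datetime_cols=()):
    """Build the type-annotation object for Any, Date, and dateTime columns."""
    annotated = {
        "anyTypeColumns": list(any_cols),
        "dateTypeColumns": list(date_cols),
        "dateTimeTypeColumns": list(datetime_cols),
    }
    known = set(property_columns)
    for key, names in annotated.items():
        # Guard against annotating a column that is not in the file.
        unknown = [n for n in names if n not in known]
        if unknown:
            raise ValueError(f"{key} refers to unknown columns: {unknown}")
    return {"metadata": annotated}


meta = build_type_metadata(
    ["UserCol1", "UserCol2", "UserCol3", "name"],
    any_cols=["UserCol1"],
    date_cols=["UserCol2"],
    datetime_cols=["UserCol3"],
)
print(json.dumps(meta, indent=2))
```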
Note
Space, comma, carriage return and newline characters are not allowed in the column headers, so property names cannot include these characters.
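The character restriction above is straightforward to check before writing a file. The following minimal sketch reports which of the four disallowed characters appear in a proposed column header. The function name forbidden_chars is hypothetical.

```python
# The four characters that are not allowed in column headers.
FORBIDDEN = {
    " ": "space",
    ",": "comma",
    "\r": "carriage return",
    "\n": "newline",
}


def forbidden_chars(header):
    """Return the names of disallowed characters found in a column header."""
    return [label for ch, label in FORBIDDEN.items() if ch in header]


for candidate in ["firstName", "first name", "city,state"]:
    found = forbidden_chars(candidate)
    status = "ok" if not found else "contains " + ", ".join(found)
    print(f"{candidate!r}: {status}")
```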