Creating a single schema for each Amazon S3 include path
By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects resemble one another.
To help illustrate this option, suppose that you define a crawler with an include path s3://bucket/table1/. When the crawler runs, it finds two JSON files with the following characteristics:
- File 1 – s3://bucket/table1/year=2017/data1.json
- File content – {"A": 1, "B": 2}
- Schema – A:int, B:int
- File 2 – s3://bucket/table1/year=2018/data2.json
- File content – {"C": 3, "D": 4}
- Schema – C:int, D:int
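The per-file characteristics listed above can be reproduced with a minimal Python sketch. It is illustrative only and does not represent the crawler's actual implementation: the helper names `infer_schema` and `partition_keys` and the `TYPE_NAMES` mapping are hypothetical, and the sketch assumes flat JSON objects and Hive-style `key=value` path segments.

```python
import json
from urllib.parse import urlparse

# Hypothetical mapping from Python types to the type names used in the example.
TYPE_NAMES = {int: "int", float: "double", str: "string", bool: "boolean"}


def infer_schema(content: str) -> dict:
    """Return {column: type_name} for a single flat JSON object."""
    record = json.loads(content)
    return {key: TYPE_NAMES.get(type(value), "string") for key, value in record.items()}


def partition_keys(s3_uri: str) -> dict:
    """Return the key=value segments found in the object path."""
    path = urlparse(s3_uri).path
    segments = [seg for seg in path.split("/") if "=" in seg]
    return dict(seg.split("=", 1) for seg in segments)


# The two files from the example, keyed by their S3 URI.
files = {
    "s3://bucket/table1/year=2017/data1.json": '{"A": 1, "B": 2}',
    "s3://bucket/table1/year=2018/data2.json": '{"C": 3, "D": 4}',
}

for uri, content in files.items():
    print(uri, infer_schema(content), partition_keys(uri))
```

The two files share a format and a path structure, but their inferred columns do not overlap.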
By default, the crawler creates two tables, named year_2017 and year_2018, because the schemas are not sufficiently similar.
However, if the option Create a single schema for each S3 path is selected, and if the data is compatible, the crawler creates one table. The table has the schema A:int,B:int,C:int,D:int and partitionKey year:string.
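The combined result can be sketched as a union of the per-file columns. This is an assumption-laden illustration rather than the crawler's actual algorithm: `merge_schemas` is a hypothetical helper, and raising an error on conflicting column types is a choice made for this sketch only, since the behavior for conflicts is not described here.

```python
def merge_schemas(schemas):
    """Union the columns of several {column: type} schemas, keeping first-seen order."""
    merged = {}
    for schema in schemas:
        for column, col_type in schema.items():
            # Assumption for this sketch: conflicting types count as incompatible data.
            if merged.setdefault(column, col_type) != col_type:
                raise ValueError(f"conflicting types for column {column!r}")
    return merged


file_schemas = [
    {"A": "int", "B": "int"},  # year=2017/data1.json
    {"C": "int", "D": "int"},  # year=2018/data2.json
]

table_schema = merge_schemas(file_schemas)
# The year=... path segment becomes a string-typed partition key.
partition_key = {"year": "string"}

print(table_schema)
print(partition_key)
```

When the crawler is configured through the AWS Glue API instead of the console, this option corresponds to a crawler configuration whose Grouping object sets TableGroupingPolicy to CombineCompatibleSchemas.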