Defining and managing classifiers
A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was.
AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers.
AWS Glue invokes custom classifiers first, in the order that you specify in your crawler
definition. Depending on the results that are returned from custom classifiers, AWS Glue might also
invoke built-in classifiers. If a classifier returns certainty=1.0
during
processing, it indicates that it's 100 percent certain that it can create the correct schema.
AWS Glue then uses the output of that classifier.
If no classifier returns certainty=1.0
, AWS Glue uses the output of the classifier
that has the highest certainty. If no classifier returns a certainty greater than
0.0
, AWS Glue returns the default classification string of
UNKNOWN
.
When do I use a classifier?
You use classifiers when you crawl a data store to define metadata tables in the AWS Glue Data Catalog. You can set up your crawler with an ordered set of classifiers. When the crawler invokes a classifier, the classifier determines whether the data is recognized. If the classifier can't recognize the data or is not 100 percent certain, the crawler invokes the next classifier in the list to determine whether it can recognize the data.
For more information about creating a classifier using the AWS Glue console, see Creating classifiers using the AWS Glue console.
Custom classifiers
The output of a classifier includes a string that indicates the file's classification or
format (for example, json
) and the schema of the file. For custom classifiers,
you define the logic for creating the schema based on the type of classifier. Classifier types
include defining schemas based on grok patterns, XML tags, and JSON paths.
If you change a classifier definition, any data that was previously crawled using the classifier is not reclassified. A crawler keeps track of previously crawled data. New data is classified with the updated classifier, which might result in an updated schema. If the schema of your data has evolved, update the classifier to account for any schema changes when your crawler runs. To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier.
For more information about creating custom classifiers in AWS Glue, see Writing custom classifiers for diverse data formats.
Note
If your data format is recognized by one of the built-in classifiers, you don't need to create a custom classifier.
Built-in classifiers
AWS Glue provides built-in classifiers for various formats, including JSON, CSV, web logs, and many database systems.
If AWS Glue doesn't find a custom classifier that fits the input data format with 100 percent
certainty, it invokes the built-in classifiers in the order shown in the following table. The
built-in classifiers return a result to indicate whether the format matches
(certainty=1.0
) or does not match (certainty=0.0
). The first
classifier that has certainty=1.0
provides the classification string and schema
for a metadata table in your Data Catalog.
Classifier type | Classification string | Notes |
---|---|---|
Apache Avro | avro |
Reads the schema at the beginning of the file to determine format. |
Apache ORC | orc |
Reads the file metadata to determine format. |
Apache Parquet | parquet |
Reads the schema at the end of the file to determine format. |
JSON | json |
Reads the beginning of the file to determine format. |
Binary JSON | bson |
Reads the beginning of the file to determine format. |
XML | xml |
Reads the beginning of the file to determine format. AWS Glue determines the table schema based on XML tags in the document.
For information about creating a custom XML classifier to specify rows in the document, see Writing XML custom classifiers. |
Amazon Ion | ion |
Reads the beginning of the file to determine format. |
Combined Apache log | combined_apache |
Determines log formats through a grok pattern. |
Apache log | apache |
Determines log formats through a grok pattern. |
Linux kernel log | linux_kernel |
Determines log formats through a grok pattern. |
Microsoft log | microsoft_log |
Determines log formats through a grok pattern. |
Ruby log | ruby_logger |
Reads the beginning of the file to determine format. |
Squid 3.x log | squid |
Reads the beginning of the file to determine format. |
Redis monitor log | redismonlog |
Reads the beginning of the file to determine format. |
Redis log | redislog |
Reads the beginning of the file to determine format. |
CSV | csv |
Checks for the following delimiters: comma (,), pipe (|), tab (\t), semicolon
(;), and Ctrl-A (\u0001). Ctrl-A is the Unicode control character for Start Of
Heading . |
Amazon Redshift | redshift |
Uses JDBC connection to import metadata. |
MySQL | mysql |
Uses JDBC connection to import metadata. |
PostgreSQL | postgresql |
Uses JDBC connection to import metadata. |
Oracle database | oracle |
Uses JDBC connection to import metadata. |
Microsoft SQL Server | sqlserver |
Uses JDBC connection to import metadata. |
Amazon DynamoDB | dynamodb |
Reads data from the DynamoDB table. |
Files in the following compressed formats can be classified:
-
ZIP (supported for archives containing only a single file). Note that Zip is not well-supported in other services (because of the archive).
-
BZIP
-
GZIP
-
LZ4
-
Snappy (supported for both standard and Hadoop native Snappy formats)
Built-in CSV classifier
The built-in CSV classifier parses CSV file contents to determine the schema for an AWS Glue table. This classifier checks for the following delimiters:
Comma (,)
Pipe (|)
Tab (\t)
Semicolon (;)
Ctrl-A (\u0001)
Ctrl-A is the Unicode control character for
Start Of Heading
.
To be classified as CSV, the table schema must have at least two columns and two rows
of data. The CSV classifier uses a number of heuristics to determine whether a header is
present in a given file. If the classifier can't determine a header from the first row of
data, column headers are displayed as col1
, col2
,
col3
, and so on. The built-in CSV classifier determines whether to infer a
header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue
regex
requirements for a column name.The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
Note
If the built-in CSV classifier does not create your AWS Glue table as you want, you might be able to use one of the following alternatives:
Change the column names in the Data Catalog, set the
SchemaChangePolicy
to LOG, and set the partition output configuration toInheritFromTable
for future crawler runs.Create a custom grok classifier to parse the data and assign the columns that you want.
The built-in CSV classifier creates tables referencing the
LazySimpleSerDe
as the serialization library, which is a good choice for type inference. However, if the CSV data contains quoted strings, edit the table definition and change the SerDe library toOpenCSVSerDe
. Adjust any inferred types to STRING, set theSchemaChangePolicy
to LOG, and set the partitions output configuration toInheritFromTable
for future crawler runs. For more information about SerDe libraries, see SerDe Reference in the Amazon Athena User Guide.