Regex SerDe
The Regex SerDe uses a regular expression (regex) to deserialize data by extracting regex groups into table columns.
If a row in the data does not match the regex, then all columns in the row are returned as
NULL
. If a row matches the regex but has fewer groups than expected, the
missing groups are NULL
. If a row in the data matches the regex but has more
columns than groups in the regex, the additional columns are ignored.
For more information, see Class RegexSerDe
Serialization library name
The serialization library name for the Regex SerDe is
org.apache.hadoop.hive.serde2.RegexSerDe
. For source code information,
see Class RegexSerDe
Example
The following example creates a table from CloudFront logs using the RegExSerDe. Replace
myregion
in
s3://athena-examples-
with the region identifier where you run Athena (for example,
myregion
/cloudfront/plaintext/s3://athena-examples-us-west-1/cloudfront/plaintext/
).
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs ( `Date` DATE, Time STRING, Location STRING, Bytes INT, RequestIP STRING, Method STRING, Host STRING, Uri STRING, Status INT, Referrer STRING, os STRING, Browser STRING, BrowserVersion STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$" ) LOCATION 's3://athena-examples-
myregion
/cloudfront/plaintext/';