

# Log Parsing Functions
<a name="sql-reference-pattern-matching-functions"></a>

Amazon Kinesis Data Analytics features the following functions for log parsing: 
+ [FAST\_REGEX\_LOG\_PARSER](sql-reference-fast-regex-log-parser.md) works similarly to the regex parser, but takes several "shortcuts" to ensure faster results. For example, the fast regex parser stops at the first match it finds (known as "lazy" semantics).
+ [FIXED\_COLUMN\_LOG\_PARSE](sql-reference-fixed-column-log-parse.md) parses fixed-width fields and automatically converts them to the given SQL types. 
+ [REGEX\_LOG\_PARSE](sql-reference-regex-log-parse.md) uses the default Java regular expression parser. For more information about this parser, see [Pattern](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) in the Java Platform documentation on the Oracle website.
+ [SYS\_LOG\_PARSE](sql-reference-sys-log-parse.md) processes entries commonly found in UNIX/Linux system logs.
+ [VARIABLE\_COLUMN\_LOG\_PARSE](sql-reference-variable-column-log-parse.md) splits an input string (its first argument, <character-expression>) into fields separated by a delimiter character or delimiter string.
+ [W3C\_LOG\_PARSE](sql-reference-w3c-log-parse.md) processes entries in W3C-predefined-format logs.

# FAST\_REGEX\_LOG\_PARSER
<a name="sql-reference-fast-regex-log-parser"></a>

```
FAST_REGEX_LOG_PARSE('input_string', 'fast_regex_pattern')
```

FAST\_REGEX\_LOG\_PARSE works by first decomposing the regular expression into a series of regular expressions, one for each expression inside a group and one for each expression outside a group. Any fixed-length portions at the start of any expression are moved to the end of the previous expression. If any expression is entirely fixed length, it is merged with the previous expression. The series of expressions is then evaluated using lazy semantics with no backtracking. (In regular expression parsing parlance, "lazy" means don't parse more than you need to at each step; "greedy" means parse as much as you can at each step.)

The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024). See the sample usage following at First FRLP Example and at Further FRLP Examples.

## FAST\_REGEX\_LOG\_PARSER (FRLP)
<a name="w2aac22c23b7b9"></a>

FAST\_REGEX\_LOG\_PARSER uses a lazy search; it stops at the first match. By contrast, [REGEX\_LOG\_PARSE](sql-reference-regex-log-parse.md) is greedy unless possessive quantifiers are used.
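The lazy-versus-greedy distinction is easy to see in a conventional backtracking engine. The following sketch uses Python's `re` purely for illustration (it is not Kinesis SQL) to contrast a greedy quantifier with a lazy one on the sample string used later in this topic:

```python
import re

s = "Mary_had_a_little_lamb"

# Greedy: (.*) consumes as much as possible before giving up an underscore.
greedy = re.fullmatch(r"(.*)_(.*)", s)
print(greedy.groups())   # ('Mary_had_a_little', 'lamb')

# Lazy: (.*?) stops at the first underscore that lets the rest match.
lazy = re.fullmatch(r"(.*?)_(.*)", s)
print(lazy.groups())     # ('Mary', 'had_a_little_lamb')
```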

FAST\_REGEX\_LOG\_PARSE scans the supplied input string for all the characters specified by the Fast Regex pattern.
+ All characters in that input string must be accounted for by the characters and scan groups defined in the Fast Regex pattern. Scan groups define the fields (columns) that result when a scan is successful.
+ If all characters in the input\_string are accounted for when the Fast Regex pattern is applied, then FRLP creates an output field (column) from each parenthetical expression in that Fast Regex pattern, in left-to-right order. The first (leftmost) parenthetical expression creates the first output field, the next (second) parenthetical expression creates the second output field, and so on through the last parenthetical expression creating the last output field.  
+ If the input\_string contains any characters not accounted for (matched) by applying the Fast Regex pattern, then FRLP returns no fields at all.
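This all-or-nothing behavior resembles a full match in conventional regex engines: if any input character is left unmatched, no groups are produced at all. A Python sketch (illustrative only, not Kinesis SQL):

```python
import re

pattern = r"(\d+)-(\d+)"

# Every character accounted for: one output field per group.
full = re.fullmatch(pattern, "12-34")
print(full.groups())   # ('12', '34')

# One unmatched trailing character: no match, so no fields at all.
leftover = re.fullmatch(pattern, "12-34x")
print(leftover)        # None
```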

## Character Class Symbols for Fast Regex
<a name="w2aac22c23b7c11"></a>

Fast Regex uses a different set of character class symbols from the regular regex parser:


| Symbol or Construct | Meaning | 
| --- | --- | 
|  -  |  Character range, including endpoints  | 
|  [ charclasses ]  |  Character class  | 
|  [^ charclasses ]  |  Negated character class  | 
|  \|  |  Union  | 
|  &  |  Intersection  | 
|  ?  |  Zero or one occurrence  | 
|  \*  |  Zero or more occurrences  | 
|  \+  |  One or more occurrences  | 
|  \{n\}  |  n occurrences  | 
|  \{n,\}  |  n or more occurrences  | 
|  \{n,m\}  |  n to m occurrences, including both  | 
|  .  |  Any single character  | 
|  \#  |  The empty language  | 
|  @  |  Any string  | 
|  "<Unicode string without double-quotes>"  |  A string  | 
|  ( )  |  The empty string  | 
|  ( unionexp )  |  Precedence override  | 
|  < <identifier> >  |  Named pattern  | 
|  <n-m>  |  Numerical interval  | 
|  charexp:=<Unicode character>  |  A single non-reserved character  | 
|  \\ <Unicode character>  |  A single character  | 



We support the following POSIX standard identifiers as named patterns:

        <Digit>    -    "[0-9]"
        <Upper>    -    "[A-Z]"
        <Lower>    -    "[a-z]"
        <ASCII>    -    "[\u0000-\u007F]"
        <Alpha>    -    "<Lower>|<Upper>"
        <Alnum>    -    "<Alpha>|<Digit>"
        <Punct>    -    "[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]"
        <Blank>    -    "[ \t]"
        <Space>    -    "[ \t\n\f\r\u000B]"
        <Cntrl>    -    "[\u0000-\u001F\u007F]"
        <XDigit>   -    "[0-9a-fA-F]"
        <Print>    -    "<Alnum>|<Punct>"
        <Graph>    -    "<Print>"

## First FRLP Example

This first example uses the Fast Regex pattern '(.\*)\_(.\_.\*)\_.\*'

```
select t.r."COLUMN1", t.r."COLUMN2" from
. . . . . . . . . . . . .> (values (FAST_REGEX_LOG_PARSE('Mary_had_a_little_lamb', '(.*)_(._.*)_.*'))) t(r);
+------------------------+-----------------------+
|         COLUMN1        |         COLUMN2       |
+------------------------+-----------------------+
| Mary_had               |     a_little_lamb     |
+------------------------+-----------------------+
1 row selected
```

1. The scan of input\_string ('Mary\_had\_a\_little\_lamb') begins with the 1st group defined in the Fast Regex pattern: (.\*), which means "find any character 0 or more times."  

    '**(.\*)**\_(.\_.\*)\_.\*'

1. This group specification, defining the first column to be parsed, asks the Fast Regex Log Parser to accept input string characters starting from the input string's first character until it finds the next group in the Fast Regex pattern or the next literal character or string that is not inside a group (not in parentheses). In this example, the next literal character after the first group is an underscore:  

    '(.\*)**\_**(.\_.\*)\_.\*'

1. The parser scans each character in the input string until it finds the next specification in the Fast Regex pattern, an underscore, and then begins group 2:

    '(.\*)\_**(.\_.\*)**\_.\*'

1. Group 2 thus begins with "a\_l". Next, the parser needs to determine the end of this group, using the remaining specification in the pattern:

    '(.\*)\_(.\_.\*)**\_.\***'

**Note**  
Character-strings or literals specified in the pattern but not inside a group must be found in the input string but will not be included in any output field.  
If the Fast Regex pattern had omitted the final asterisk, no results would be obtained.
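For comparison, running the same pattern through a greedy backtracking engine (Java's default semantics, sketched here with Python's `re`, which behaves the same way on this pattern) chooses different group boundaries than the FRLP output shown above: the trailing "lamb" is claimed by the final ungrouped `_.*` instead of by group 2.

```python
import re

# Greedy backtracking on the First FRLP Example pattern and input.
m = re.fullmatch(r"(.*)_(._.*)_.*", "Mary_had_a_little_lamb")
print(m.groups())   # ('Mary_had', 'a_little') -- the final '_.*' takes '_lamb'
```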

## Further FRLP Examples
<a name="furtherExamples"></a>

The next example uses a "\+", which means repeat the last expression 1 or more times ("\*" means 0 or more times).

### Example A
<a name="w2aac22c23b7c13b4"></a>

In this case, the longest prefix is the first underscore. The first field/column group will match on "Mary" and the second will not match.  

```
select t.r."COLUMN1", t.r."COLUMN2" from
      . . . . . . . . . . . . .> (values (FAST_REGEX_LOG_PARSE('Mary_had_a_little_lamb',
        '(.*)_+(._.*)'))) t(r); 
      +----------+----------+
      | COLUMN1  | COLUMN2  |
      +----------+----------+
      +----------+----------+
      No rows selected
```

The preceding example returns no fields because the "\+" required there to be at least one more underscore in a row, and the input\_string does not have one. 

### Example B
<a name="w2aac22c23b7c13b6"></a>

In the following case, the '\+' is superfluous because of the lazy semantics:

```
select t.r."COLUMN1", t.r."COLUMN2" from
      . . . . . . . . . . . . .> (values (FAST_REGEX_LOG_PARSE('Mary____had_a_little_lamb',
        '(.*)_+(.*)'))) t(r);
      +-------------------------+-------------------------+
      |         COLUMN1         |         COLUMN2         |
      +-------------------------+-------------------------+
      | Mary                    |    had_a_little_lamb    |
      +-------------------------+-------------------------+
      1 row selected
```

The preceding example succeeds in returning two fields because after finding the multiple underscores required by the "\_\+" specification, the group-2 specification (.\*) accepts all remaining characters in the input\_string. Underscores appear neither trailing "Mary" nor leading "had" because the "\_\+" specification is not enclosed in parentheses.

As mentioned in the introduction, "lazy" in regular expression parsing parlance means don't parse more than you need to at each step;  "Greedy" means parse as much as you can at each step.

The first case in this topic, A, fails because when it gets to the first underscore, the regex processor has no way of knowing, without backtracking, that it can't use the underscore to match "\_\+"; and FRLP doesn't backtrack, whereas [REGEX\_LOG\_PARSE](sql-reference-regex-log-parse.md) does.  

The search directly above, B, gets turned into three searches: 

```
(.*)_
_*(._
.*)
```

Notice that the second field group gets split between the second and third searches, and also that "\_\+" is considered the same as "\_\_\*" (that is, it considers "underscore repeat-underscore-1-or-more-times" the same as "underscore underscore repeat-underscore-0-or-more-times").

Case A demonstrates the main difference between REGEX\_LOG\_PARSE and FAST\_REGEX\_LOG\_PARSE: the search in A would work under REGEX\_LOG\_PARSE, because that function uses backtracking.
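This can be checked with a backtracking engine. The sketch below uses Python's `re` (whose backtracking matches Java's behavior on this pattern, though it is not the Kinesis runtime) to show that Example A's pattern does succeed once backtracking is allowed:

```python
import re

# FRLP returned no rows for this pattern; a backtracking engine succeeds.
m = re.fullmatch(r"(.*)_+(._.*)", "Mary_had_a_little_lamb")
print(m.groups())   # ('Mary_had', 'a_little_lamb')
```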

### Example C
<a name="w2aac22c23b7c13b8"></a>

In the following example, the plus is not superfluous, because the <Alpha> (any alphabetic character) is fixed length and thus will be used as a delimiter for the "\_\*" search.

```
select t.r."COLUMN1", t.r."COLUMN2" from
. . . . . . . . . . . . .> (values (FAST_REGEX_LOG_PARSE('Mary____had_a_little_lamb', '(.*)_+(<Alpha>.*)'))) t(r);
+----------------------------+----------------------------+
|          COLUMN1           |          COLUMN2           |
+----------------------------+----------------------------+
| Mary                       | had_a_little_lamb          |
+----------------------------+----------------------------+
1 row selected
```

'(.\*)\_\+(<Alpha>.\*)' gets converted into three regular expressions:

```
'.*_'
'_*<Alpha>'
'.*$'
```

Each is matched in turn using lazy semantics. 

The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024).
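The effect of the fixed-length `<Alpha>` delimiter can be approximated in an ordinary engine by hand-expanding the named pattern to a character class (here `[A-Za-z]`, an assumption for illustration) and using a lazy first group:

```python
import re

# <Alpha> hand-expanded to [A-Za-z]; (.*?) approximates FRLP's lazy semantics.
m = re.fullmatch(r"(.*?)_+([A-Za-z].*)", "Mary____had_a_little_lamb")
print(m.groups())   # ('Mary', 'had_a_little_lamb')
```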

# FIXED\_COLUMN\_LOG\_PARSE
<a name="sql-reference-fixed-column-log-parse"></a>

Parses fixed-width fields and automatically converts them to the given SQL types. 

```
FIXED_COLUMN_LOG_PARSE ( <string value expression>, <column description string expression> )
 <column description string expression> := '<column description> [,...]'
 <column description> :=
   <identifier> TYPE <data type> [ NOT NULL ]
   START <numeric value expression> [FOR <numeric constant expression>]
```

The starting position of a column is 0. Column specifications for the types DATE, TIME, and TIMESTAMP support a format parameter allowing the user to specify the exact time component layout. The parser uses the Java class [java.text.SimpleDateFormat](http://docs.oracle.com/javase/1.5.0/docs/api/java/text/SimpleDateFormat.html) to parse the strings for the types DATE, TIME, and TIMESTAMP. The [Date and Time Patterns](sql-reference-parse-timestamp-format.md) topic gives a full description and examples of timestamp format strings. The following is an example of a column definition with a format string:

```
"name" TYPE TIMESTAMP 'dd/MMM/yyyy:HH:mm:ss'
```
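A column description of this kind can be read as: slice the input at fixed offsets, then convert each slice to its SQL type. A rough Python analogue follows (the line layout, offsets, and column names are hypothetical; `%d/%b/%Y:%H:%M:%S` is the `strptime` equivalent of `'dd/MMM/yyyy:HH:mm:ss'`):

```python
from datetime import datetime

line = "GET  24/May/2004:22:01:02 200"

# Hypothetical spec: "method" START 0 FOR 4, "ts" START 5 FOR 20, "status" START 26
method = line[0:4].rstrip()                               # fixed-width VARCHAR
ts = datetime.strptime(line[5:25], "%d/%b/%Y:%H:%M:%S")   # TIMESTAMP with format
status = int(line[26:])                                   # INTEGER
print(method, ts.isoformat(), status)
```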

**Related Topics**  
[REGEX\_LOG\_PARSE](sql-reference-regex-log-parse.md)

# REGEX\_LOG\_PARSE
<a name="sql-reference-regex-log-parse"></a>

```
REGEX_LOG_PARSE ( <character-expression>, <regex-pattern>, <columns> )

<regex-pattern> := <character-expression> [OBJECT]

<columns> := <columnname> [ <datatype> ] {, <columnname> <datatype> }*
```

Parses a character string based on Java Regular Expression patterns as defined in [java.util.regex.pattern](http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html).

Columns are based on match groups defined in the regex-pattern. Each group defines a column, and the groups are processed from left to right. Failure to match produces a NULL value result: If the regular expression does not match the string passed as the first parameter, NULL is returned.

The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024).

## Examples
<a name="sql-reference-regex-log-parse-examples"></a>

### Example Dataset
<a name="w2aac22c23c11c11b3"></a>

The examples following are based on the sample stock dataset that is part of the [Getting Started Exercise](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. To run each example, you need an Amazon Kinesis Analytics application that has the sample stock ticker input stream. To learn how to create an Analytics application and configure the sample stock ticker input stream, see the [Getting Started Exercise](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. 

The sample stock dataset has the schema following.

```
(ticker_symbol  VARCHAR(4),
sector          VARCHAR(16),
change          REAL,
price           REAL)
```

### Example 1: Return results from two capture groups
<a name="w2aac22c23c11c11b5"></a>

The following code example searches the contents of the `sector` field for a letter `E`, returning it and the character that follows it, and then searches for a letter `R`, returning it and all characters following it:

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (match1 VARCHAR(1024), match2 VARCHAR(1024));
 
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM T.REC.COLUMN1, T.REC.COLUMN2
    FROM 
         (SELECT STREAM SECTOR, 
             REGEX_LOG_PARSE(SECTOR, '.*([E].).*([R].*)') AS REC 
             FROM SOURCE_SQL_STREAM_001) AS T;
```

The preceding code example produces results similar to the following:

![\[Table showing ROWTIME, MATCH1, and MATCH2 columns with sample data entries.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-regex-log-parse-example-1.png)
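The capture-group behavior can be reproduced with an ordinary Java-style regex engine (sketched here in Python; the sector values are illustrative):

```python
import re

pattern = r".*([E].).*([R].*)"

# For sector 'ENERGY': group 1 is 'EN', group 2 is 'RGY'.
rec = re.match(pattern, "ENERGY")
print(rec.groups())                 # ('EN', 'RGY')

# A sector with no qualifying match yields None,
# analogous to REGEX_LOG_PARSE returning NULL.
print(re.match(pattern, "RETAIL"))  # None
```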


### Example 2: Return a stream field and results from two capture groups
<a name="w2aac22c23c11c11b7"></a>

The following code example returns the `sector` field, and searches the contents of the `sector` field for a letter `E` and returns it and the character that follows it, and then searches for a letter R, and returns it and all characters following it:

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (sector VARCHAR(24), match1 VARCHAR(24), match2 VARCHAR(24));

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM T.SECTOR, T.REC.COLUMN1, T.REC.COLUMN2
    FROM 
         (SELECT STREAM SECTOR,
             REGEX_LOG_PARSE(SECTOR, '.*([E].).*([R].*)') AS REC
             FROM SOURCE_SQL_STREAM_001) AS T;
```

The preceding code example produces results similar to the following:

![\[Table showing data entries for ROWTIME, SECTOR, MATCH1, and MATCH2 columns with healthcare and energy sectors.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-regex-log-parse-example-2.png)


## More Information
<a name="sqlrf_regex_log_parse_formoreinfo"></a>

For more information, see [FAST\_REGEX\_LOG\_PARSER](sql-reference-fast-regex-log-parser.md).

## Quick Regex Reference
<a name="w2aac22c23c11c15"></a>

For full details on regular expressions, see [java.util.regex.Pattern](http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html).


| Construct | Meaning | 
| --- | --- | 
|  [xyz]  |  Find single character of: x, y, or z  | 
|  [^abc]  |  Find any single character except: a, b, or c  | 
|  [r-z]  |  Find any single character between r-z  | 
|  [r-zR-Z]  |  Find any single character between r-z or R-Z  | 
|  ^  |  Start of line  | 
|  $  |  End of line  | 
|  \A  |  Start of string  | 
|  \z  |  End of string  | 
|  .  |  Any single character  | 
|  \s  |  Find any whitespace character  | 
|  \S  |  Find any non-whitespace character  | 
|  \d  |  Find any digit  | 
|  \D  |  Find any non-digit  | 
|  \w  |  Find any word character (letter, number, underscore)  | 
|  \W  |  Find any non-word character  | 
|  \b  |  Find any word boundary  | 
|  (...)  |  Capture everything enclosed  | 
|  (x\|y)  |  Find x or y  | 
|  x?  |  Find zero or one of x  | 
|  x\*  |  Find zero or more of x  | 
|  x\+  |  Find one or more of x  | 
|  x\{3\}  |  Find exactly 3 of x  | 
|  x\{3,\}  |  Find 3 or more of x  | 
|  x\{3,6\}  |  Find between 3 and 6 of x  | 

The quantifier and alternation constructs (x?, x\*, x\+, and so on) also work with symbols such as \d or \s.

# SYS\_LOG\_PARSE
<a name="sql-reference-sys-log-parse"></a>

Parses the standard syslog format:

```
 Mon DD HH:MM:SS server message
```

SYS\_LOG\_PARSE processes entries commonly found in UNIX/Linux system logs. System log entries start with a timestamp and are followed by a free-form text field. SYS\_LOG\_PARSE output consists of two columns. The first column is named "COLUMN1" and is SQL data type TIMESTAMP. The second column is named "COLUMN2" and is SQL type VARCHAR().
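A rough Python analogue of the two-column output follows (the sample log line is hypothetical; note that `strptime` defaults the missing year to 1900 because syslog timestamps carry no year):

```python
import re
from datetime import datetime

line = "Oct 11 22:14:15 server1 sshd[1234]: Failed password for user admin"

# COLUMN1: the leading 'Mon DD HH:MM:SS' timestamp; COLUMN2: the free-form rest.
m = re.match(r"(\w{3} [ \d]\d \d\d:\d\d:\d\d) (.*)", line)
ts = datetime.strptime(m.group(1), "%b %d %H:%M:%S")   # year defaults to 1900
rest = m.group(2)
print(ts.month, ts.day, rest)
```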

**Note**  
For more information about SYSLOG, see [IETF RFC3164](https://tools.ietf.org/html/rfc3164). For more information about date-time patterns and matching, see [Date and Time Patterns](sql-reference-parse-timestamp-format.md).

# VARIABLE\_COLUMN\_LOG\_PARSE
<a name="sql-reference-variable-column-log-parse"></a>

```
 VARIABLE_COLUMN_LOG_PARSE(
  <character-expression>, <columns>, <delimiter-string>
  [ , <escape-string>, <quote-string> ] )
  <columns> := <number of columns> | <list of columns>
  <number of columns> := <numeric value expression>
  <list of columns> := '<column description>[, ...]'
  <column description> := <identifier> TYPE <data type> [ NOT NULL ]
  <delimiter string> := <character-expression>
  <escape-string> := <character-expression>
  <quote-string> := '<begin quote character> [ <end quote character> ]'
```

VARIABLE\_COLUMN\_LOG\_PARSE splits an input string (its first argument, <character-expression>) into fields separated by a delimiter character or delimiter string. Thus it handles comma-separated values or tab-separated values. It can be combined with [FIXED\_COLUMN\_LOG\_PARSE](sql-reference-fixed-column-log-parse.md) to handle something like maillog, where some fields are fixed-length and others are variable-length.

**Note**  
 Parsing of binary files is not supported.

The arguments <escape-string> and <quote-string> are optional. Specifying an <escape-string> allows the value of a field to contain an embedded delimiter. As a simple example, if the <delimiter-string> specified a comma and the <escape-string> specified a backslash, then an input of "a,b" would be split into two fields "a" and "b", but an input of "a\,b" would result in a single field "a,b".

Since Amazon Kinesis Data Analytics supports [Expressions and Literals](sql-reference-expressions.md), a tab can also be a delimiter, specified using a Unicode escape, for example, u&'\0009', which is a string consisting only of a tab character.

Specifying a <quote-string> is another way to hide an embedded delimiter. The <quote-string> should be a one or two character expression: the first is used as the <begin quote character> character; the second, if present, is used as the <end quote character> character. If only one character is supplied, it is used as both to begin and to end quoted strings. When the input includes a quoted string, that is, a string enclosed in the characters specified as <quote-string>, then that string appears in one field, even if it contains a delimiter.

Note that the <begin quote character> and <end quote character> are single characters and can be different. The <begin quote character> can be used to start and end the quoted string, or the <begin quote character> can start the quoted string and the <end quote character> used to end that quoted string.
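The delimiter/escape/quote interaction matches what general-purpose CSV readers do. A Python `csv` sketch (the input row is hypothetical) showing both an escaped delimiter and a quoted field:

```python
import csv
import io

# Delimiter ',', escape '\', quote '"': embedded delimiters stay in one field.
raw = 'a\\,b,"c,d",e\n'
row = next(csv.reader(io.StringIO(raw), delimiter=",",
                      quotechar='"', escapechar="\\"))
print(row)   # ['a,b', 'c,d', 'e']
```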

When a list of columns <list of columns> is supplied as the second parameter <columns>, the column specifications (<column description>) for the types DATE, TIME, and TIMESTAMP support a format parameter allowing the user to specify the exact time component layout. The parser uses the Java class [java.text.SimpleDateFormat](https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html) to parse the strings for those types. [Date and Time Patterns](sql-reference-parse-timestamp-format.md) gives a full description of timestamp format strings, with examples. The following is an example of a column definition with a format string:

```
    "name" TYPE TIMESTAMP 'dd/MMM/yyyy:HH:mm:ss'
```

By default, the output columns are named COLUMN1, COLUMN2, COLUMN3, etc., each of SQL data type VARCHAR(1024).

# W3C\_LOG\_PARSE
<a name="sql-reference-w3c-log-parse"></a>



```
 W3C_LOG_PARSE( <character-expression>, <format-string> )
 <format-string> := '<predefined-format> | <custom-format>'
 <predefined format> :=
         COMMON
       | COMMON WITH VHOST
       | NCSA EXTENDED
       | REFERER
       | AGENT
       | IIS
 <custom-format> := [an Apache log format specifier]
```

## W3C Predefined Formats
<a name="sql-reference-w3c-log-parse-predefined"></a>

Specifying one of the following W3C predefined format names is shorthand for the set of format specifiers indicated, as shown in the following statement:

```
 select stream W3C_LOG_PARSE(message, 'COMMON') r  from w3ccommon t;
```


| Format Name | W3C Name | Format Specifiers | 
| --- | --- | --- | 
|  COMMON  |  Common Log Format (CLF)  |  %h %l %u %t "%r" %>s %b  | 
|  COMMON WITH VHOST  |  Common Log Format with Virtual Host  |  %v %h %l %u %t "%r" %>s %b  | 
|  NCSA EXTENDED  |  NCSA extended/combined log format  |  %h %l %u %t "%r" %>s %b "%[Referer]i" "%[User-agent]i"  | 
|  REFERER  |  Referer log format  |  %[Referer]i ---> %U  | 
|  AGENT  |  Agent (Browser) log format  |  %[User-agent]i  | 

## W3C Format Specifiers
<a name="sql-reference-w3c-log-parse-specfics"></a>

The format specifiers are listed below. W3C\_LOG\_PARSE automatically detects these specifiers and outputs records with one column for each specifier. The column's type is automatically chosen based on the possible outputs of the specifier. For example, %B represents the number of bytes sent in processing an HTTP request, so the column type is numeric. For %b, however, zero bytes is represented by a dash (-), forcing the column type to be text. The note following the table explains what the "..." and "<" or ">" markings shown in the specifier table mean.

The following table lists W3C format specifiers alphabetically by command.


| Format Specifier | Explanation | 
| --- | --- | 
|  %  |  The percent sign (Apache 2.0.44 and later)  | 
|  %...a  |  Remote IP-address  | 
|  %...A  |  Local IP-address  | 
|  %...B  |  Size of response in bytes, excluding HTTP headers.  | 
|  %...b  |  Size of response in bytes, excluding HTTP headers, in CLF format, which means that when no bytes are sent, uses a '-' rather than a 0.  | 
|  %...[Customerdata]C  |  The contents of cookie Customerdata in the request sent to the server.  | 
|  %...D  |  The time taken to serve the request, in microseconds.  | 
|  %...[CUSTOMERDATA]e  |  The contents of the environment variable CUSTOMERDATA  | 
|  %...f  |  Filename  | 
|  %...h  |  Remote host  | 
|  %...H  |  The request protocol  | 
|  %...[Customerdata]i  |  The contents of Customerdata: header line(s) in the request sent to the server.  | 
|  %...l  |  Remote logname (from identd, if supplied)  | 
|  %...m  |  The request method  | 
|  %...[Customerdata]n  |  The contents of note Customerdata from another module.  | 
|  %...[Customerdata]o  |  The contents of Customerdata: header line(s) in the reply.  | 
|  %...p  |  The canonical port of the server serving the request  | 
|  %...P  |  The process ID of the child that serviced the request.  | 
|  %...[format]P  |  The process ID or thread id of the child that serviced the request. Valid formats are pid and tid. (Apache 2.0.46 and later)  | 
|  %...q  |  The query string (prepended with a ? if a query string exists, otherwise an empty string)  | 
|  %...r  |  First line of request  | 
|  %...s  |  Status. For requests that got internally redirected, this is the status of the *original* request --- %...>s for the last.  | 
|  %...t  |  Time, in common log format time format (standard English format)  | 
|  %...[format]t  |  The time, in the form given by format, which should be in strftime(3) format. (potentially localized)  | 
|  %...T  |  The time taken to serve the request, in seconds.  | 
|  %...u  |  Remote user (from auth; may be bogus if return status (%s) is 401)  | 
|  %...U  |  The URL path requested, not including any query string.  | 
|  %...v  |  The canonical ServerName of the server serving the request.  | 
|  %...V  |  The server name according to the UseCanonicalName setting.  | 
|  %...X  |  Connection status when response is completed. X = connection aborted before the response completed. \+ = connection may be kept alive after the response is sent. - = connection will be closed after the response is sent. (The %...X directive was %...c in late versions of Apache 1.3, but this conflicted with the historical ssl %...[var]c syntax.)  | 
|  :%...I:  |  Bytes received, including request and headers; cannot be zero. You need to enable [mod\_logio](http://httpd.apache.org/docs/2.0/mod/mod_logio.html) to use this.  | 
|  :%...O:  |  Bytes sent, including headers; cannot be zero. You need to enable [mod\_logio](http://httpd.apache.org/docs/2.0/mod/mod_logio.html) to use this.  | 

**Note**  <a name="noteA"></a>
Some W3C format specifiers are shown as containing a "..." indication or a "<" or ">", which are optional controls on suppressing or redirecting the output of that specifier. The "..." can either be empty (as in "%h %u %r %s %b") or it can indicate conditions for including the item. The conditions are a list of HTTP status codes, possibly preceded by "!", and if the specified condition is not met, then the column or field returned shows "-".   
For example, as described in the [Apache documentation](http://httpd.apache.org/docs/2.0/mod/mod_log_config.html), specifying "%400,501[User-agent]i" will log the User-agent only on 400 errors and 501 errors (Bad Request, Not Implemented). Similarly, "%!200,304,302[Referer]i" will log the Referer: on all requests that fail to return some sort of normal status.   
The modifiers "<" and ">" can be used to choose whether the original or final (respectively) request should be consulted when a request has been internally redirected. By default, the % directives %s, %U, %T, %D, and %r look at the original request while all others look at the final request. So for example, %>s can be used to record the final status of the request and %<u can be used to record the original authenticated user on a request that is internally redirected to an unauthenticated resource.   
For security reasons, starting with Apache 2.0.46, non-printable and other special characters are escaped mostly by using \xhh sequences, where hh stands for the hexadecimal representation of the raw byte. Exceptions from this rule are " and \, which are escaped by prepending a backslash, and all whitespace characters, which are written in their C-style notation (\n, \t, etc.). In httpd 2.0 versions prior to 2.0.46, no escaping was performed on the strings from %...r, %...i and %...o, so great care was needed when dealing with raw log files, since clients could have inserted control characters into the log.   
Also, in httpd 2.0, the B format strings represent simply the size in bytes of the HTTP response (which will differ, for instance, if the connection is aborted, or if SSL is used). For the actual number of bytes sent over the network to the client, use the %O format provided by [mod\_logio](http://httpd.apache.org/docs/2.0/mod/mod_logio.html).

## W3C Format Specifiers by Function or Category
<a name="w2aac22c23c17c11"></a>

The categories are bytes sent, connection status, content of environmental variable, filename, host, IP, notes, protocol, query string, replies, requests, and time. For the markings "..." or "<" or ">", see the previous note.


| Function or Category | W3C Format Specifiers | 
| --- | --- | 
|  **Bytes sent, excluding HTTP headers**  |  | 
|  with a "0" when no bytes are sent  |  %...B  | 
|  with a "-" (CLF format) when no bytes are sent  |  %...b  | 
|  Bytes received, including request and headers; cannot be zero. Must enable [mod\_logio](http://httpd.apache.org/docs/2.0/mod/mod_logio.html) to use this.  |  :%...I:  | 
|  Bytes sent, including headers; cannot be zero. Must enable [mod\_logio](http://httpd.apache.org/docs/2.0/mod/mod_logio.html) to use this.  |  :%...O:  | 
|  **Connection status when response is completed**  |  | 
|  Connection aborted before the response completed  |  X  | 
|  Connection may be kept alive after the response is sent  |  \+  | 
|  Connection will be closed after the response is sent  |  -  | 
|  The %...X directive was %...c in late versions of Apache 1.3, but this conflicted with the historical ssl %...[var]c syntax.  |  | 
|  **Environment variable CUSTOMERDATA**  |  | 
|  contents  |  %...[CUSTOMERDATA]e  | 
|  **Filename**  |  %...f  | 
|  **Host (remote)**  |  %...h  | 
|  **Protocol**  |  %...H  | 
|  **IP addresses**  |  | 
|  Remote  |  %...a  | 
|  Local  |  %...A  | 
|  **Notes**  |  | 
|  Contents of note Customerdata from another module  |  %...[Customerdata]n  | 
|  **Protocol (request)**  |  %...H  | 
|  **Query string** If a query exists, prepended with a ?; if not, the empty string.  |  %...q  | 
|  **Replies**  |  | 
|  Contents of Customerdata (header lines in the reply)  |  %...[Customerdata]o  | 

The W3C format specifiers for the requests and time categories are listed in the following table.


| Function or Category | W3C Format Specifiers | 
| --- | --- | 
|  **Requests**  | 
|  Canonical port of the server serving the request  |  %...p  | 
|  Contents of cookie Customerdata in the request sent to server  |  %... [Customerdata]C  | 
|  Contents of BAR:header line(s)  |  %... [BAR]i  | 
|  First line sent:  |  %...r  | 
|  Microseconds taken to serve a request  |  %...D  | 
|  Protocol  |  %...H  | 
|  Process ID of the child that serviced the request  |  %...P  | 
|  Process ID or thread id of the child that serviced the request. Valid formats are pid and tid. (Apache 2.0.46 and later)  |  %...[format]P  | 
|  Remote logname (from identd, if supplied)  |  %...l  | 
|  Remote user: (from auth; may be bogus if return status (%s) is 401)  |  %...u  | 
|  Server (canonical ServerName) serving the request  |  %...v  | 
|  Server name by the UseCanonicalName setting  |  %...V  | 
|  Request method  |  %...m  | 
|  Return status  |  %s  | 
|  Seconds taken to serve the request  |  %...T  | 
|  Status of the *original* request that was internally redirected  |  %...s  | 
|  Status of the last request  |  %...>s  | 
|  URL path requested, not including any query string  |  %...U  | 
|  **Time**  | 
|  Common log format time format (standard English format)  |  %...t  | 
|  Time in strftime(3) format, potentially localized  |  %...[format]t  | 
|  Seconds taken to serve the request  |  %...T  | 

## W3C Examples
<a name="w2aac22c23c17c13"></a>

W3C_LOG_PARSE supports access to logs generated by W3C-compliant applications like the Apache web server, producing output rows with one column for each specifier. The data types are derived from the log entry description specifiers listed in the [Apache mod_log_config](http://httpd.apache.org/docs/2.0/mod/mod_log_config.html?#formats) specification.

## Example 1
<a name="sql-reference-w3c-log-parse-info"></a>

The input in this example is taken from an Apache log file and is representative of the COMMON log format.

### Input
<a name="sql-reference-w3c-log-parse-info-input"></a>

```
(192.168.254.30 - John [24/May/2004:22:01:02 -0700]
                     "GET /icons/apache_pb.gif HTTP/1.1" 304 0),
(192.168.254.30 - Jane [24/May/2004:22:01:02 -0700]
                     "GET /icons/small/dir.gif HTTP/1.1" 304 0);
```

### DDL
<a name="sql-reference-w3c-log-parse-info-ddl"></a>

```
CREATE OR REPLACE VIEW weblog AS
        SELECT STREAM
            l.r.COLUMN1,
            l.r.COLUMN2,
            l.r.COLUMN3,
            l.r.COLUMN4,
            l.r.COLUMN5,
            l.r.COLUMN6,
            l.r.COLUMN7
        FROM (SELECT STREAM W3C_LOG_PARSE(message, 'COMMON')
              FROM "weblog_read") AS l(r);
```

### Output
<a name="sql-reference-w3c-log-parse-info-output"></a>

```
 192.168.254.30 -  John  [24/May/2004:22:01:02 -0700] GET /icons/apache_pb.gif HTTP/1.1  304  0
 192.168.254.30 -  Jane  [24/May/2004:22:01:02 -0700] GET /icons/small/dir.gif HTTP/1.1  304  0
```

### Details
<a name="sql-reference-w3c-log-parse-details"></a>

The specification of COMMON in the FROM clause means the Common Log Format (CLF), which uses the specifiers %h %l %u %t "%r" %>s %b.

The [W3C-predefined formats](https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sql-reference-w3c-log-parse.html#sql-reference-w3c-log-parse-predefined) table shows the COMMON and other predefined specifier sets.


The following table, Specifiers Used by the Common Log Format, describes each specifier in the COMMON set and the output column it produces.

## Specifiers Used by the Common Log Format
<a name="w2aac22c23c17c17"></a>


| Output Column | Format Specifier | Returns | 
| --- | --- | --- | 
|  COLUMN1  |  %h  |  The IP address of the remote host  | 
|  COLUMN2  |  %l  |  The remote logname  | 
|  COLUMN3  |  %u  |  The remote user  | 
|  COLUMN4  |  %t  |  The time  | 
|  COLUMN5  |  "%r"  |  The first line of the request  | 
|  COLUMN6  |  %>s  |  The status: for internally redirected requests, %...s is the status of the *original* request, and %...>s is the status of the last  | 
|  COLUMN7  |  %b  |  The number of bytes sent, excluding HTTP headers  | 
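The mapping in the table above can be illustrated outside of streaming SQL. The following is a minimal Python sketch, not the Kinesis Data Analytics implementation, of a regular expression that matches the same seven Common Log Format fields (%h %l %u %t "%r" %>s %b) against the sample input from Example 1:

```python
import re

# Illustrative only: one capture group per COMMON output column.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) '         # %h  - IP address of the remote host (COLUMN1)
    r'(?P<logname>\S+) '      # %l  - remote logname (COLUMN2)
    r'(?P<user>\S+) '         # %u  - remote user (COLUMN3)
    r'(?P<time>\[[^\]]+\]) '  # %t  - time, brackets included (COLUMN4)
    r'"(?P<request>[^"]*)" '  # "%r" - first line of the request (COLUMN5)
    r'(?P<status>\d{3}) '     # %>s - final status (COLUMN6)
    r'(?P<bytes>\S+)'         # %b  - bytes sent, excluding headers (COLUMN7)
)

def parse_clf(line):
    """Return the seven CLF fields as a dict, or None if the line doesn't match."""
    m = CLF_PATTERN.match(line)
    return m.groupdict() if m else None

row = parse_clf('192.168.254.30 - John [24/May/2004:22:01:02 -0700] '
                '"GET /icons/apache_pb.gif HTTP/1.1" 304 0')
```

Note that %t captures the surrounding brackets, which is why COLUMN4 in the Example 1 output contains `[24/May/2004:22:01:02 -0700]` rather than the bare timestamp.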

## Example 2
<a name="w2aac22c23c17c19"></a>

The DDL in this example shows how to rename output columns and filter out unneeded columns.

### DDL
<a name="sql-reference-w3c-log-parse-ddl"></a>

```
    CREATE OR REPLACE VIEW "Schema1".weblogreduced AS
        SELECT STREAM CAST(s.COLUMN3 AS VARCHAR(5)) AS LOG_USER,
        CAST(s.COLUMN1 AS VARCHAR(15)) AS ADDRESS,
        CAST(s.COLUMN4 AS VARCHAR(30)) as TIME_DATES
        FROM "Schema1".weblog s;
```

### Output
<a name="sql-reference-w3c-log-parse-output"></a>

```
    +----------+-----------------+--------------------------------+
    | LOG_USER |     ADDRESS     |           TIME_DATES           |
    |          |                 |                                |
    +----------+-----------------+--------------------------------+
    | Jane     | 192.168.254.30  | [24/May/2004:22:01:02 -0700]   |
    |          |                 |                                |
    | John     | 192.168.254.30  | [24/May/2004:22:01:02 -0700]   |
    +----------+-----------------+--------------------------------+
```

## W3C Customized Formats
<a name="w2aac22c23c17c21"></a>

The same results can be produced by naming the specifiers directly rather than using the "COMMON" name, as shown in the following example: 

```
    CREATE OR REPLACE FOREIGN STREAM schema1.weblog
        SERVER logfile_server
        OPTIONS (LOG_PATH '/path/to/logfile',
                 ENCODING 'UTF-8',
                 SLEEP_INTERVAL '10000',
                 MAX_UNCHANGED_STATS '10',
                 PARSER 'W3C',
                 PARSER_FORMAT '%h %l %u %t \"%r\" %>s %b');
```

or

```
     CREATE FOREIGN STREAM "Schema1".weblog_read
     SERVER "logfile_server"
     OPTIONS (log_path '/path/to/logfile',
     encoding 'UTF-8',
     sleep_interval '10000',
     max_unchanged_stats '10');
     CREATE OR REPLACE VIEW "Schema1".weblog AS
        SELECT STREAM
            l.r.COLUMN1,
            l.r.COLUMN2,
            l.r.COLUMN3,
            l.r.COLUMN4,
            l.r.COLUMN5,
            l.r.COLUMN6
        FROM (SELECT STREAM W3C_LOG_PARSE(message, '%h %l %u %t \"%r\" %>s %b')
              FROM "Schema1".weblog_read) AS l(r);
```

**Note**  
If you change %t to [%t], the date column contains the following:  

```
        24/May/2004:22:01:02 -0700
```
(instead of `[24/May/2004:22:01:02 -0700]`)
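The reason is that the brackets in `[%t]` are literal text in the format, so they are consumed by the match rather than captured as part of the field. A minimal Python sketch of the distinction (illustrative, not the engine's parser):

```python
import re

line = '[24/May/2004:22:01:02 -0700]'

# %t: the time field itself matches the brackets, so they appear in the column.
with_brackets = re.match(r'(\[[^\]]+\])', line).group(1)

# [%t]: the brackets are literals outside the capture, so the column omits them.
without_brackets = re.match(r'\[([^\]]+)\]', line).group(1)
```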