SELECT
Retrieves rows of data from zero or more tables.
Note
This topic provides summary information for reference. Comprehensive information about
using SELECT
and the SQL language is beyond the scope of this
documentation. For information about using SQL that is specific to Athena, see Considerations and limitations for SQL queries
in Amazon Athena and
Run SQL queries in Amazon Athena. For
an example of creating a database, creating a table, and running a SELECT
query on the table in Athena, see Get started.
Synopsis
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression
[, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ OFFSET count [ ROW | ROWS ] ]
[ LIMIT [ count | ALL ] ]
Note
Reserved words in SQL SELECT statements must be enclosed in double quotes. For more information, see Reserved keywords to escape in SQL SELECT statements.
Parameters
- [ WITH with_query [, ....] ]
-
You can use
WITH
to flatten nested queries, or to simplify subqueries.Using the
WITH
clause to create recursive queries is supported starting in Athena engine version 3. The maximum recursion depth is 10.The
WITH
clause precedes theSELECT
list in a query and defines one or more subqueries for use within theSELECT
query.Each subquery defines a temporary table, similar to a view definition, which you can reference in the
FROM
clause. The tables are used only when the query runs.with_query
syntax is:subquery_table_name [ ( column_name [, ...] ) ] AS (subquery)
Where:
-
subquery_table_name
is a unique name for a temporary table that defines the results of theWITH
clause subquery. Eachsubquery
must have a table name that can be referenced in theFROM
clause. -
column_name [, ...]
is an optional list of output column names. The number of column names must be equal to or less than the number of columns defined bysubquery
. -
subquery
is any query statement.
-
- [ ALL | DISTINCT ] select_expression
-
select_expression
determines the rows to be selected. Aselect_expression
can use one of the following formats:expression [ [ AS ] column_alias ] [, ...]
row_expression.* [ AS ( column_alias [, ...] ) ]
relation.*
*
-
The
expression [ [ AS ] column_alias ]
syntax specifies an output column. The optional[AS] column_alias
syntax specifies a custom heading name to be used for the column in the output. -
For
row_expression.* [ AS ( column_alias [, ...] ) ]
,row_expression
is an arbitrary expression of data typeROW
. The fields of the row define the output columns to be included in the result. -
For
relation.*
, the columns ofrelation
are included in the result. This syntax does not permit the use of column aliases. -
The asterisk
*
specifies that all columns be included in the result set. -
In the result set, the order of columns is the same as the order of their specification by the select expression. If a select expression returns multiple columns, the column order follows the order used in the source relation or row type expression.
-
When column aliases are specified, the aliases override preexisting column or row field names. If the select expression does not have column names, zero-indexed anonymous column names (
_col0
,_col1
,_col2, ...
) are displayed in the output. -
ALL
is the default. UsingALL
is treated the same as if it were omitted; all rows for all columns are selected and duplicates are kept. -
Use
DISTINCT
to return only distinct values when a column contains duplicate values.
-
- FROM from_item [, ...]
-
Indicates the input to the query, where
from_item
can be a view, a join construct, or a subquery as described below.The
from_item
can be either:-
table_name [ [ AS ] alias [ (column_alias [, ...]) ] ]
Where
table_name
is the name of the target table from which to select rows,alias
is the name to give the output of theSELECT
statement, andcolumn_alias
defines the columns for thealias
specified.
-OR-
-
join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]
Where
join_type
is one of:-
[ INNER ] JOIN
-
LEFT [ OUTER ] JOIN
-
RIGHT [ OUTER ] JOIN
-
FULL [ OUTER ] JOIN
-
CROSS JOIN
-
ON join_condition | USING (join_column [, ...])
Where usingjoin_condition
allows you to specify column names for join keys in multiple tables, and usingjoin_column
requiresjoin_column
to exist in both tables.
-
-
- [ WHERE condition ]
-
Filters results according to the
condition
you specify, wherecondition
generally has the following syntax.column_name
operator
value
[[[AND | OR]column_name
operator
value
] ...]The
operator
can be one of the comparators=
,>
,<
,>=
,<=
,<>
,!=
.The following subquery expressions can also be used in the
WHERE
clause.-
[NOT] BETWEEN
– Specifies a range between two integers, as in the following example. If the column data type isinteger_A
ANDinteger_B
varchar
, the column must be cast to integer first.SELECT DISTINCT processid FROM "webdata"."impressions" WHERE cast(processid as int) BETWEEN 1500 and 1800 ORDER BY processid
-
[NOT] LIKE
– Searches for the pattern specified. Use the percent sign (value
%
) as a wildcard character, as in the following example.SELECT * FROM "webdata"."impressions" WHERE referrer LIKE '%.org'
-
[NOT] IN (
– Specifies a list of possible values for a column, as in the following example.value
[,value
[, ...])SELECT * FROM "webdata"."impressions" WHERE referrer IN ('example.com','example.net','example.org')
-
- [ GROUP BY [ ALL | DISTINCT ] grouping_expressions [, ...] ]
-
Divides the output of the
SELECT
statement into rows with matching values.ALL
andDISTINCT
determine whether duplicate grouping sets each produce distinct output rows. If omitted,ALL
is assumed.grouping_expressions
allow you to perform complex grouping operations. You can use complex grouping operations to perform analysis that requires aggregation on multiple sets of columns in a single query.The
grouping_expressions
element can be any function, such asSUM
,AVG
, orCOUNT
, performed on input columns.GROUP BY
expressions can group output by input column names that don't appear in the output of theSELECT
statement.All output expressions must be either aggregate functions or columns present in the
GROUP BY
clause.You can use a single query to perform analysis that requires aggregating multiple column sets.
Athena supports complex aggregations using
GROUPING SETS
,CUBE
andROLLUP
.GROUP BY GROUPING SETS
specifies multiple lists of columns to group on.GROUP BY CUBE
generates all possible grouping sets for a given set of columns.GROUP BY ROLLUP
generates all possible subtotals for a given set of columns. Complex grouping operations do not support grouping on expressions composed of input columns. Only column names are allowed.You can often use
UNION ALL
to achieve the same results as theseGROUP BY
operations, but queries that useGROUP BY
have the advantage of reading the data one time, whereasUNION ALL
reads the underlying data three times and may produce inconsistent results when the data source is subject to change. - [ HAVING condition ]
-
Used with aggregate functions and the
GROUP BY
clause. Controls which groups are selected, eliminating groups that don't satisfycondition
. This filtering occurs after groups and aggregates are computed. - [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] union_query] ]
-
UNION
,INTERSECT
, andEXCEPT
combine the results of more than oneSELECT
statement into a single query.ALL
orDISTINCT
control the uniqueness of the rows included in the final result set.UNION
combines the rows resulting from the first query with the rows resulting from the second query. To eliminate duplicates,UNION
builds a hash table, which consumes memory. For better performance, consider usingUNION ALL
if your query does not require the elimination of duplicates. MultipleUNION
clauses are processed left to right unless you use parentheses to explicitly define the order of processing.INTERSECT
returns only the rows that are present in the results of both the first and the second queries.EXCEPT
returns the rows from the results of the first query, excluding the rows found by the second query.ALL
causes all rows to be included, even if the rows are identical.DISTINCT
causes only unique rows to be included in the combined result set. - [ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
-
Sorts a result set by one or more output
expression
.When the clause contains multiple expressions, the result set is sorted according to the first
expression
. Then the secondexpression
is applied to rows that have matching values from the first expression, and so on.Each
expression
may specify output columns fromSELECT
or an ordinal number for an output column by position, starting at one.ORDER BY
is evaluated as the last step after anyGROUP BY
orHAVING
clause.ASC
andDESC
determine whether results are sorted in ascending or descending order. The default sorting order is ascending (ASC
). The default null ordering isNULLS LAST
, regardless of ascending or descending sort order. - [ OFFSET count [ ROW | ROWS ] ]
-
Use the
OFFSET
clause to discard a number of leading rows from the result set. If theORDER BY
clause is present, theOFFSET
clause is evaluated over a sorted result set, and the set remains sorted after the skipped rows are discarded. If the query has noORDER BY
clause, it is arbitrary which rows are discarded. If the count specified byOFFSET
equals or exceeds the size of the result set, the final result is empty. - LIMIT [ count | ALL ]
-
Restricts the number of rows in the result set to
count
.LIMIT ALL
is the same as omitting theLIMIT
clause. If the query has noORDER BY
clause, the results are arbitrary. - TABLESAMPLE [ BERNOULLI | SYSTEM ] (percentage)
-
Optional operator to select rows from a table based on a sampling method.
BERNOULLI
selects each row to be in the table sample with a probability ofpercentage
. All physical blocks of the table are scanned, and certain rows are skipped based on a comparison between the samplepercentage
and a random value calculated at runtime.With
SYSTEM
, the table is divided into logical segments of data, and the table is sampled at this granularity.Either all rows from a particular segment are selected, or the segment is skipped based on a comparison between the sample
percentage
and a random value calculated at runtime.SYSTEM
sampling is dependent on the connector. This method does not guarantee independent sampling probabilities. - [ UNNEST (array_or_map) [WITH ORDINALITY] ]
-
Expands an array or map into a relation. Arrays are expanded into a single column. Maps are expanded into two columns (key, value).
You can use
UNNEST
with multiple arguments, which are expanded into multiple columns with as many rows as the highest cardinality argument.Other columns are padded with nulls.
The
WITH ORDINALITY
clause adds an ordinality column to the end.UNNEST
is usually used with aJOIN
and can reference columns from relations on the left side of theJOIN
.
Getting the file locations for source data in Amazon S3
To see the Amazon S3 file location for the data in a table row, you can use
"$path"
in a SELECT
query, as in the following
example:
SELECT "$path" FROM "my_database"."my_table" WHERE year=2019;
This returns a result like the following:
s3://amzn-s3-demo-bucket/datasets_mytable/year=2019/data_file1.json
To return a sorted, unique list of the S3 filename paths for the data in a table, you
can use SELECT DISTINCT
and ORDER BY
, as in the following
example.
SELECT DISTINCT "$path" AS data_source_file FROM sampledb.elb_logs ORDER By data_source_file ASC
To return only the filenames without the path, you can pass "$path"
as a
parameter to an regexp_extract
function, as in the following
example.
SELECT DISTINCT regexp_extract("$path", '[^/]+$') AS data_source_file FROM sampledb.elb_logs ORDER By data_source_file ASC
To return the data from a specific file, specify the file in the WHERE
clause, as in the following example.
SELECT *,"$path" FROM my_database.my_table WHERE "$path" = 's3://amzn-s3-demo-bucket/my_table/my_partition/file-01.csv'
For more information and examples, see the Knowledge Center article How can
I see the Amazon S3 source file for a row in an Athena table?
Note
In Athena, the Hive or Iceberg hidden metadata columns $bucket
,
$file_modified_time
, $file_size
, and
$partition
are not supported for views.
Escaping single quotes
To escape a single quote, precede it with another single quote, as in the following example. Do not confuse this with a double quote.
Select 'O''Reilly'
Results
O'Reilly
Additional resources
For more information about using SELECT
statements in Athena, see the
following resources.
For information about this | See this |
---|---|
Running queries in Athena | Run SQL queries in Amazon Athena |
Using SELECT to create a table |
Create a table from query results (CTAS) |
Inserting data from a SELECT query into another
table |
INSERT INTO |
Using built-in functions in SELECT statements |
Functions in Amazon Athena |
Using user defined functions in SELECT
statements |
Query with user defined functions |
Querying Data Catalog metadata | Query the AWS Glue Data Catalog |