Optimize queries
Use the query optimization techniques described in this section to make queries run faster or as workarounds for queries that exceed resource limits in Athena.
Optimize joins
There are many different strategies for executing joins in a distributed query engine. This section covers the most common strategy, the distributed hash join, and how to analyze queries with complex join conditions.
In a distributed hash join, place large tables on the left, small tables on the right
The most common type of join uses an equality comparison as the join condition. Athena runs this type of join as a distributed hash join.
In a distributed hash join, the engine builds a lookup table (hash table) from one of the sides of the join. This side is called the build side. The records of the build side are distributed across the nodes. Each node builds a lookup table for its subset. The other side of the join, called the probe side, is then streamed through the nodes. The records from the probe side are distributed over the nodes in the same way as the build side. This enables each node to perform the join by looking up the matching records in its own lookup table.
When the lookup tables created from the build side of the join don't fit into memory, queries can fail. Even if the total size of the build side is less than the available memory, queries can fail if the distribution of the records has significant skew. In an extreme case, all records could have the same value for the join condition and have to fit into memory on a single node. Even a query with less skew can fail if a set of values gets sent to the same node and the values add up to more than the available memory. Nodes do have the ability to spill records to disk, but spilling slows query execution and can be insufficient to prevent the query from failing.
Athena attempts to reorder joins to use the larger relation as the probe side, and the smaller relation as the build side. However, because Athena does not manage the data in tables, it has limited information and often must assume that the first table is the larger and the second table is the smaller.
When writing joins with equality-based join conditions, assume that the table to the left of the JOIN keyword is the probe side and the table to the right is the build side. Make sure that the right table, the build side, is the smaller of the two tables. If it is not possible to make the build side of the join small enough to fit into memory, consider running multiple queries that join subsets of the build table.
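For example, assuming a hypothetical large orders table and a small customers table, write the join with the large table on the left (probe side) and the small table on the right (build side):

SELECT o.order_id, c.customer_name
FROM orders o          -- large table: probe side
JOIN customers c       -- small table: build side
  ON o.customer_id = c.customer_id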
Use EXPLAIN to analyze queries with complex joins
Queries with complex join conditions (for example, queries that use LIKE, >, or other operators) are often computationally demanding. In the worst case, every record from one side of the join must be compared to every record on the other side of the join. Because the execution time grows with the square of the number of records, such queries run the risk of exceeding the maximum execution time.
To find out in advance how Athena will execute your query, you can use the EXPLAIN statement. For more information, see Using EXPLAIN and EXPLAIN ANALYZE in Athena and Understand Athena EXPLAIN statement results.
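For example, to inspect the plan for a join before running it (table names here are hypothetical), prefix the query with EXPLAIN:

EXPLAIN
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id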
Reduce the scope of window functions, or remove them
Because window functions are resource-intensive operations, they can make queries run slowly or even fail with the message Query exhausted resources at this scale factor. Window functions keep all the records that they operate on in memory in order to calculate their result. When the window is very large, the window function can run out of memory.
To make sure your queries run within the available memory limits, reduce the size of the windows that your window functions operate over. To do so, you can add a PARTITION BY clause or narrow the scope of existing partitioning clauses.
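For example, with a hypothetical sensor_readings table, adding a PARTITION BY clause limits each window to a single sensor's records instead of the whole table:

SELECT sensor_id, updated_at, battery_status,
       row_number() OVER (PARTITION BY sensor_id ORDER BY updated_at DESC) AS rank_in_sensor
FROM sensor_readings

Without the PARTITION BY clause, the window function would have to hold every record in a single window.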
Use non-window functions
Sometimes queries with window functions can be rewritten without window functions. For example, instead of using row_number to find the top N records, you can use ORDER BY and LIMIT. Instead of using row_number or rank to deduplicate records, you can use aggregate functions like max_by.
For example, suppose you have a dataset with updates from a sensor. The sensor periodically reports its battery status and includes some metadata like location. If you want to know the last battery status for each sensor and its location, you can use this query:
SELECT sensor_id,
       arbitrary(location) AS location,
       max_by(battery_status, updated_at) AS battery_status
FROM sensor_readings
GROUP BY sensor_id
Because metadata like location is the same for every record, you can use the arbitrary function to pick any value from the group. To get the last battery status, you can use the max_by function. The max_by function picks the value for a column from the record where the maximum value of another column was found. In this case, it returns the battery status for the record with the latest update time within the group. This query runs faster and uses less memory than an equivalent query with a window function.
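For comparison, an equivalent query written with a window function might look like the following sketch. It has to materialize each sensor's full window in memory before filtering:

SELECT sensor_id, location, battery_status
FROM (
  SELECT sensor_id, location, battery_status,
         row_number() OVER (PARTITION BY sensor_id ORDER BY updated_at DESC) AS rn
  FROM sensor_readings
) t
WHERE rn = 1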
Optimize aggregations
When Athena performs an aggregation, it distributes the records across worker nodes using the columns in the GROUP BY clause. To make the task of matching records to groups as efficient as possible, the nodes attempt to keep records in memory but spill them to disk if necessary.
It is a good idea to avoid including redundant columns in GROUP BY clauses. Because fewer columns require less memory, a query that describes a group using fewer columns is more efficient. Numeric columns also use less memory than strings. For example, when you aggregate a dataset that has both a numeric category ID and a category name, use only the category ID column in the GROUP BY clause.
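For example, assuming a hypothetical products table with both a numeric category_id and a category_name column, group on the ID alone:

SELECT category_id, COUNT(*) AS product_count
FROM products
GROUP BY category_id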
Sometimes queries include columns in the GROUP BY clause to work around the fact that a column must either be part of the GROUP BY clause or an aggregate expression. If this rule is not followed, you can receive an error message like the following:
EXPRESSION_NOT_AGGREGATE: line 1:8: 'category' must be an aggregate expression or appear in GROUP BY clause
To avoid having to add redundant columns to the GROUP BY clause, you can use the arbitrary function, as in the following example:
SELECT country_id,
       arbitrary(country_name) AS country_name,
       COUNT(*) AS city_count
FROM world_cities
GROUP BY country_id
The arbitrary function returns an arbitrary value from the group. The function is useful when you know that all records in the group have the same value for a column, but the value does not identify the group.
Optimize top N queries
The ORDER BY clause returns the results of a query in sorted order. Athena uses distributed sort to run the sort operation in parallel on multiple nodes.
If you don't strictly need your result to be sorted, avoid adding an ORDER BY clause. Also, avoid adding ORDER BY to inner queries if it is not strictly necessary. In many cases, the query planner can remove redundant sorting, but this is not guaranteed. An exception to this rule is when an inner query is doing a top N operation, such as finding the N most recent or N most common values.
When Athena sees ORDER BY together with LIMIT, it understands that you are running a top N query and uses dedicated operations accordingly.
Note
Although Athena can also often detect window functions like row_number that perform a top N operation, we recommend the simpler version that uses ORDER BY and LIMIT. For more information, see Reduce the scope of window functions, or remove them.
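For example, to find the ten most recent orders in a hypothetical orders table:

SELECT order_id, order_date, total
FROM orders
ORDER BY order_date DESC
LIMIT 10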
Include only required columns
If you don't strictly need a column, don't include it in your query. The less data a query has to process, the faster it runs. This reduces both the amount of memory required and the amount of data that has to be sent between nodes. If you are using a columnar file format, reducing the number of columns also reduces the amount of data that is read from Amazon S3.
Athena has no specific limit on the number of columns in a result, but how queries are executed limits the possible combined size of columns. The combined size of columns includes their names and types.
For example, the following error is caused by a relation that exceeds the size limit for a relation descriptor:
GENERIC_INTERNAL_ERROR:
io.airlift.bytecode.CompilationException
To work around this issue, reduce the number of columns in the query, or create subqueries and use a JOIN that retrieves a smaller amount of data. If you have queries that do SELECT * in the outermost query, change the * to a list of only the columns that you need.
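For example, instead of SELECT * FROM orders, select only the columns your query actually uses (names here are hypothetical):

SELECT order_id, customer_id, total
FROM orders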
Optimize queries by using approximations
Athena has support for approximation aggregate functions. Unlike COUNT(DISTINCT col), which computes an exact count, approx_distinct uses much less memory, at the cost of returning a result with a small margin of error. When an exact count is not required, using an approximation function can make a query run faster and use less memory.
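For example, to estimate the number of distinct users per day in a hypothetical page_views table:

SELECT event_date,
       approx_distinct(user_id) AS distinct_users
FROM page_views
GROUP BY event_date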
Optimize LIKE
You can use LIKE to find matching strings, but with long strings, this is compute intensive. The regexp_like function is a faster alternative in most cases, and it also lets you combine multiple search terms into a single regular expression instead of using multiple LIKE clauses.
Often you can optimize a search by anchoring the substring that you are looking for. For example, if you're looking for a prefix, it is much better to use 'substr%' instead of '%substr%'. Or, if you're using regexp_like, use '^substr'.
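For example, with a hypothetical logs table, both of the following use anchored patterns to search for a prefix, and regexp_like can also combine several terms into one pattern:

SELECT count(*) FROM logs WHERE message LIKE 'ERROR%'
SELECT count(*) FROM logs WHERE regexp_like(message, '^ERROR')
-- one regular expression instead of several LIKE clauses:
SELECT count(*) FROM logs WHERE regexp_like(message, 'timeout|connection refused')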
Use UNION ALL instead of UNION
UNION ALL and UNION are two ways to combine the results of two queries into one result. UNION ALL concatenates the records from the first query with the records from the second, while UNION does the same but also removes duplicates. UNION needs to process all the records and find the duplicates, which is memory and compute intensive, but UNION ALL is a relatively quick operation. Unless you need to deduplicate records, use UNION ALL for the best performance.
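For example, to combine two hypothetical yearly event tables without deduplication:

SELECT event_id, event_time FROM events_2023
UNION ALL
SELECT event_id, event_time FROM events_2024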
Use UNLOAD for large result sets
When the results of a query are expected to be large (for example, tens of thousands of rows or more), use UNLOAD to export the results. In most cases, this is faster than running a regular query, and using UNLOAD also gives you more control over the output.
When a query finishes executing, Athena stores the result as a single uncompressed CSV file on Amazon S3. This takes longer than UNLOAD, not only because the result is uncompressed, but also because the operation cannot be parallelized. In contrast, UNLOAD writes results directly from the worker nodes and makes full use of the parallelism of the compute cluster. In addition, you can configure UNLOAD to write the results in compressed form and in other file formats such as JSON and Parquet.
For more information, see UNLOAD.
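As a sketch, assuming a hypothetical orders table and S3 output location, the following exports a result set as Snappy-compressed Parquet:

UNLOAD (
  SELECT order_id, customer_id, total
  FROM orders
  WHERE order_date >= DATE '2024-01-01'
)
TO 's3://amzn-s3-demo-bucket/unload/orders/'
WITH (format = 'PARQUET', compression = 'SNAPPY')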
Use CTAS or Glue ETL to materialize frequently used aggregations
'Materializing' a query is a way of accelerating query performance by storing pre-computed complex query results (for example, aggregations and joins) for reuse in subsequent queries.
If many of your queries include the same joins and aggregations, you can
materialize the common subquery as a new table and then run queries against that
table. You can create the new table with Create a table from query results (CTAS), or with a dedicated ETL tool like AWS Glue ETL.
For example, suppose you have a dashboard with widgets that show different aspects of an orders dataset. Each widget has its own query, but the queries all share the same joins and filters. An order table is joined with a line items table, and there is a filter to show only the last three months. If you identify the common features of these queries, you can create a new table that the widgets can use. This reduces duplication and improves performance. The disadvantage is that you must keep the new table up to date.
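As a sketch, the shared join and three-month filter described above could be materialized with a CTAS statement like the following (table and column names are hypothetical):

CREATE TABLE recent_order_items
WITH (format = 'PARQUET') AS
SELECT o.order_id, o.customer_id, o.order_date,
       li.product_id, li.quantity, li.price
FROM orders o
JOIN line_items li ON o.order_id = li.order_id
WHERE o.order_date >= date_add('month', -3, current_date)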
Reuse query results
It's common for the same query to run multiple times within a short duration. For
example, this can occur when multiple people open the same data dashboard. When you
run a query, you can tell Athena to reuse previously calculated results. You specify
the maximum age of the results to be reused. If the same query was previously run
within that time frame, Athena returns those results instead of running the query
again. For more information, see Reuse query results in Athena here in the
Amazon Athena User Guide and Reduce cost and improve query performance with Amazon Athena Query Result
Reuse