Select your cookie preferences

We use essential cookies and similar tools that are necessary to provide our site and services. We use performance cookies to collect anonymous statistics, so we can understand how customers use our site and make improvements. Essential cookies cannot be deactivated, but you can choose “Customize” or “Decline” to decline performance cookies.

If you agree, AWS and approved third parties will also use cookies to provide useful site features, remember your preferences, and display relevant content, including relevant advertising. To accept or decline all non-essential cookies, choose “Accept” or “Decline.” To make more detailed choices, choose “Customize.”

Find keywords in arrays using regexp_like

Focus mode
Find keywords in arrays using regexp_like - Amazon Athena

The following examples illustrate how to search a dataset for a keyword within an element inside an array, using the regexp_like function. It takes as an input a regular expression pattern to evaluate, or a list of terms separated by a pipe (|), evaluates the pattern, and determines if the specified string contains it.

The regular expression pattern needs to be contained within the string, and does not have to match it. To match the entire string, enclose the pattern with ^ at the beginning of it, and $ at the end, such as '^pattern$'.

Consider an array of sites containing their host name, and a flaggedActivity element. This element includes an ARRAY, containing several MAP elements, each listing different popular keywords and their popularity count. Assume you want to find a particular keyword inside a MAP in this array.

To search this dataset for sites with a specific keyword, we use regexp_like instead of the similar SQL LIKE operator, because searching for a large number of keywords is more efficient with regexp_like.

Example 1: Using regexp_like

The query in this example uses the regexp_like function to search for terms 'politics|bigdata', found in values within arrays:

WITH dataset AS ( SELECT ARRAY[ CAST( ROW('aws.amazon.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']), MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']), MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']), MAP(ARRAY['term', 'count'], ARRAY['iot', '74']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ), CAST( ROW('news.cnn.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['politics', '241']), MAP(ARRAY['term', 'count'], ARRAY['technology', '211']), MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']), MAP(ARRAY['term', 'count'], ARRAY['iot', '170']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ), CAST( ROW('netflix.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']), MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']), MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']), MAP(ARRAY['term', 'count'], ARRAY['iot', '4']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ) ] AS items ), sites AS ( SELECT sites.hostname, sites.flaggedactivity FROM dataset, UNNEST(items) t(sites) ) SELECT hostname FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags) WHERE regexp_like(flags['term'], 'politics|bigdata') GROUP BY (hostname)

This query returns two sites:

+----------------+ | hostname | +----------------+ | aws.amazon.com | +----------------+ | news.cnn.com | +----------------+
Example 2: Using regexp_like

The query in the following example adds up the total popularity scores for the sites matching your search terms with the regexp_like function, and then orders them from highest to lowest.

WITH dataset AS ( SELECT ARRAY[ CAST( ROW('aws.amazon.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']), MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']), MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']), MAP(ARRAY['term', 'count'], ARRAY['iot', '74']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ), CAST( ROW('news.cnn.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['politics', '241']), MAP(ARRAY['term', 'count'], ARRAY['technology', '211']), MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']), MAP(ARRAY['term', 'count'], ARRAY['iot', '170']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ), CAST( ROW('netflix.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']), MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']), MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']), MAP(ARRAY['term', 'count'], ARRAY['iot', '4']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ) ] AS items ), sites AS ( SELECT sites.hostname, sites.flaggedactivity FROM dataset, UNNEST(items) t(sites) ) SELECT hostname, array_agg(flags['term']) AS terms, SUM(CAST(flags['count'] AS INTEGER)) AS total FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags) WHERE regexp_like(flags['term'], 'politics|bigdata') GROUP BY (hostname) ORDER BY total DESC

This query returns two sites:

+------------------------------------+ | hostname | terms | total | +----------------+-------------------+ | news.cnn.com | politics | 241 | +----------------+-------------------+ | aws.amazon.com | bigdata | 10 | +----------------+-------------------+
PrivacySite termsCookie preferences
© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.