The following examples illustrate how to search a dataset for a keyword within an
element inside an array, using the regexp_like
The regular expression pattern needs to be contained within the string, and does not
have to match it. To match the entire string, enclose the pattern with ^ at the
beginning of it, and $ at the end, such as '^pattern$'
.
Consider an array of sites containing their host name, and a
flaggedActivity
element. This element includes an ARRAY
,
containing several MAP
elements, each listing different popular keywords
and their popularity count. Assume you want to find a particular keyword inside a
MAP
in this array.
To search this dataset for sites with a specific keyword, we use
regexp_like
instead of the similar SQL LIKE
operator,
because searching for a large number of keywords is more efficient with
regexp_like
.
Example 1: Using regexp_like
The query in this example uses the regexp_like
function to search for
terms 'politics|bigdata'
, found in values within arrays:
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('news.cnn.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('netflix.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
)
] AS items
),
sites AS (
SELECT sites.hostname, sites.flaggedactivity
FROM dataset, UNNEST(items) t(sites)
)
SELECT hostname
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
This query returns two sites:
+----------------+
| hostname |
+----------------+
| aws.amazon.com |
+----------------+
| news.cnn.com |
+----------------+
Example 2: Using regexp_like
The query in the following example adds up the total popularity scores for the
sites matching your search terms with the regexp_like
function, and
then orders them from highest to lowest.
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('news.cnn.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('netflix.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
)
] AS items
),
sites AS (
SELECT sites.hostname, sites.flaggedactivity
FROM dataset, UNNEST(items) t(sites)
)
SELECT hostname, array_agg(flags['term']) AS terms, SUM(CAST(flags['count'] AS INTEGER)) AS total
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
ORDER BY total DESC
This query returns two sites:
+------------------------------------+
| hostname | terms | total |
+----------------+-------------------+
| news.cnn.com | politics | 241 |
+----------------+-------------------+
| aws.amazon.com | bigdata | 10 |
+----------------+-------------------+