Amazon Athena OpenSearch connector
OpenSearch Service
The Amazon Athena OpenSearch connector enables Amazon Athena to communicate with your OpenSearch instances so that you can use SQL to query your OpenSearch data.
Note
Due to a known issue, the OpenSearch connector cannot be used with a VPC.
If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.
Prerequisites
Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see Deploy a data source connector or Use the AWS Serverless Application Repository to deploy a data source connector.
Terms
The following terms relate to the OpenSearch connector.
-
Domain – A name that this connector associates with the endpoint of your OpenSearch instance. The domain is also used as the database name. For OpenSearch instances defined within the Amazon OpenSearch Service, the domain is auto-discoverable. For other instances, you must provide a mapping between the domain name and endpoint.
-
Index – A database table defined in your OpenSearch instance.
-
Mapping – If an index is a database table, then the mapping is its schema (that is, the definitions of its fields and attributes).
This connector supports both metadata retrieval from the OpenSearch instance and from the AWS Glue Data Catalog. If the connector finds a AWS Glue database and table that match your OpenSearch domain and index names, the connector attempts to use them for schema definition. We recommend that you create your AWS Glue table so that it is a superset of all fields in your OpenSearch index.
-
Document – A record within a database table.
-
Data stream – Time based data that is composed of multiple backing indices. For more information, see Data streams
in the OpenSearch documentation and Getting started with data streams in the Amazon OpenSearch Service Developer Guide. Note
Because data stream indices are internally created and managed by open search, the connector chooses the schema mapping from the first available index. For this reason, we strongly recommend setting up an AWS Glue table as a supplemental metadata source. For more information, see Setting up databases and tables in AWS Glue.
Parameters
Use the Lambda environment variables in this section to configure the OpenSearch connector.
-
spill_bucket – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
-
spill_prefix – (Optional) Defaults to a subfolder in the specified
spill_bucket
calledathena-federation-spill
. We recommend that you configure an Amazon S3 storage lifecycle on this location to delete spills older than a predetermined number of days or hours. -
spill_put_request_headers – (Optional) A JSON encoded map of request headers and values for the Amazon S3
putObject
request that is used for spilling (for example,{"x-amz-server-side-encryption" : "AES256"}
). For other possible headers, see PutObject in the Amazon Simple Storage Service API Reference. -
kms_key_id – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like
a7e63k4b-8loc-40db-a2a1-4d0en2cd8331
, you can specify a KMS key ID. -
disable_spill_encryption – (Optional) When set to
True
, disables spill encryption. Defaults toFalse
so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses server-side encryption. -
disable_glue – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
-
query_timeout_cluster – The timeout period, in seconds, for cluster health queries used in the generation of parallel scans.
-
query_timeout_search – The timeout period, in seconds, for search queries used in the retrieval of documents from an index.
-
auto_discover_endpoint – Boolean. The default is
true
. When you use the Amazon OpenSearch Service and set this parameter to true, the connector can auto-discover your domains and endpoints by calling the appropriate describe or list API operations on the OpenSearch Service. For any other type of OpenSearch instance (for example, self-hosted), you must specify the associated domain endpoints in thedomain_mapping
variable. Ifauto_discover_endpoint=true
, the connector uses AWS credentials to authenticate to the OpenSearch Service. Otherwise, the connector retrieves user name and password credentials from AWS Secrets Manager through thedomain_mapping
variable. -
domain_mapping – Used only when
auto_discover_endpoint
is set to false and defines the mapping between domain names and their associated endpoints. Thedomain_mapping
variable can accommodate multiple OpenSearch endpoints in the following format:domain1=endpoint1,domain2=endpoint2,domain3=endpoint3,...
For the purpose of authenticating to an OpenSearch endpoint, the connector supports substitution strings injected using the format
${SecretName}:
with user name and password retrieved from AWS Secrets Manager. The colon (:) at the end of the expression serves as a separator from the rest of the endpoint.Important
As a security best practice, do not use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see Move hardcoded secrets to AWS Secrets Manager in the AWS Secrets Manager User Guide.
The following example uses the
opensearch-creds
secret.movies=https://${opensearch-creds}:search-movies-ne...qu---us-east-1---es.amazonaws.com
At runtime,
${opensearch-creds}
is rendered as the user name and password, as in the following example.movies=https://myusername@mypassword:search-movies-ne...qu---us-east-1---es.amazonaws.com
In the
domain_mapping
parameter, each domain-endpoint pair can use a different secret. The secret itself must be specified in the formatuser_name
@password
. Although the password may contain embedded@
signs, the first@
serves as a separator fromuser_name
.It is also important to note that the comma (,) and equal sign (=) are used by this connector as separators for the domain-endpoint pairs. For this reason, you should not use them anywhere inside the stored secret.
Setting up databases and tables in AWS Glue
The connector obtains metadata information by using AWS Glue or OpenSearch. You can set up an AWS Glue table as a supplemental metadata definition source. To enable this feature, define a AWS Glue database and table that match the domain and index of the source that you are supplementing. The connector can also take advantage of metadata definitions stored in the OpenSearch instance by retrieving the mapping for the specified index.
Defining metadata for arrays in OpenSearch
OpenSearch does not have a dedicated array data type. Any field can contain zero
or more values so long as they are of the same data type. If you want to use
OpenSearch as your metadata definition source, you must define a
_meta
property for all indices used with Athena for the fields that
to be considered a list or array. If you fail to complete this step, queries return
only the first element in the list field. When you specify the _meta
property, field names should be fully qualified for nested JSON structures (for
example, address.street
, where street
is a nested field
inside an address
structure).
The following example defines actor
and genre
lists in
the movies
table.
PUT movies/_mapping { "_meta": { "actor": "list", "genre": "list" } }
Data types
The OpenSearch connector can extract metadata definitions from either AWS Glue or the OpenSearch instance. The connector uses the mapping in the following table to convert the definitions to Apache Arrow data types, including the points noted in the section that follows.
OpenSearch | Apache Arrow | AWS Glue |
---|---|---|
text, keyword, binary | VARCHAR | string |
long | BIGINT | bigint |
scaled_float | BIGINT | SCALED_FLOAT(...) |
integer | INT | int |
short | SMALLINT | smallint |
byte | TINYINT | tinyint |
double | FLOAT8 | double |
float, half_float | FLOAT4 | float |
boolean | BIT | boolean |
date, date_nanos | DATEMILLI | timestamp |
JSON structure | STRUCT | STRUCT |
_meta (for information, see the section Defining metadata for arrays in OpenSearch.) | LIST | ARRAY |
Notes on data types
-
Currently, the connector supports only the OpenSearch and AWS Glue data-types listed in the preceding table.
-
A
scaled_float
is a floating-point number scaled by a fixed double scaling factor and represented as aBIGINT
in Apache Arrow. For example, 0.756 with a scaling factor of 100 is rounded to 76. -
To define a
scaled_float
in AWS Glue, you must select thearray
column type and declare the field using the format SCALED_FLOAT(scaling_factor
).The following examples are valid:
SCALED_FLOAT(10.51) SCALED_FLOAT(100) SCALED_FLOAT(100.0)
The following examples are not valid:
SCALED_FLOAT(10.) SCALED_FLOAT(.5)
-
When converting from
date_nanos
toDATEMILLI
, nanoseconds are rounded to the nearest millisecond. Valid values fordate
anddate_nanos
include, but are not limited to, the following formats:"2020-05-18T10:15:30.123456789" "2020-05-15T06:50:01.123Z" "2020-05-15T06:49:30.123-05:00" 1589525370001 (epoch milliseconds)
-
An OpenSearch
binary
is a string representation of a binary value encoded usingBase64
and is converted to aVARCHAR
.
Running SQL queries
The following are examples of DDL queries that you can use with this connector. In the
examples, function_name
corresponds to the name of your Lambda
function, domain
is the name of the domain that you want to
query, and index
is the name of your index.
SHOW DATABASES in `lambda:
function_name
`
SHOW TABLES in `lambda:
function_name
`.domain
DESCRIBE `lambda:
function_name
`.domain
.index
Performance
The Athena OpenSearch connector supports shard-based parallel scans. The connector uses cluster health information retrieved from the OpenSearch instance to generate multiple requests for a document search query. The requests are split for each shard and run concurrently.
The connector also pushes down predicates as part of its document search queries. The following example query and predicate shows how the connector uses predicate push down.
Query
SELECT * FROM "lambda:elasticsearch".movies.movies WHERE year >= 1955 AND year <= 1962 OR year = 1996
Predicate
(_exists_:year) AND year:([1955 TO 1962] OR 1996)
Passthrough queries
The OpenSearch connector supports passthrough queries and uses the
Query DSL language. For more information about querying with Query DSL, see Query DSL
To use passthrough queries with the OpenSearch connector, use the following syntax:
SELECT * FROM TABLE( system.query( schema => '
schema_name
', index => 'index_name
', query => "{query_string
}" ))
The following OpenSearch example passthrough query filters for employees with active
employment status in the employee
index of the default
schema.
SELECT * FROM TABLE( system.query( schema => 'default', index => 'employee', query => "{ ''bool'':{''filter'':{''term'':{''status'': ''active''}}}}" ))
Additional resources
-
For an article on using the Amazon Athena OpenSearch connector to query data in Amazon OpenSearch Service and Amazon S3 in a single query, see Query data in Amazon OpenSearch Service using SQL from Amazon Athena
in the AWS Big Data Blog. For additional information about this connector, visit the corresponding site
on GitHub.com.