Working with Amazon OpenSearch Service direct queries with Amazon S3

You can use Amazon OpenSearch Service direct queries to query data in Amazon S3. The direct query integration with Amazon S3 lets you analyze operational logs and data lakes stored in Amazon S3 without switching between services. You can analyze data in cloud object stores while still using the operational analytics and visualizations of OpenSearch Service.

With direct queries with Amazon S3, you no longer need to build complex ETL pipelines or pay to duplicate data in both OpenSearch Service and Amazon S3 storage. You can also install integrations for popular log types. Each integration includes predefined dashboards and lets you configure data accelerations tailored to that log type. The templates include VPC Flow Logs, AWS CloudTrail logs, and Amazon S3 logs. The accelerations include skipping indexes, materialized views, and covering indexes.

Pricing

You pay for the existing OpenSearch Service and Amazon S3 resources that are used to create and process direct queries. Queries that are sent to Amazon S3 use billable compute, which is metered in OpenSearch Compute Units (OCUs) per hour.

Direct queries with Amazon S3 are of two types: interactive and acceleration. Interactive queries perform analytics on your data in Amazon S3. When you run a new query, OpenSearch Service starts a new session that lasts for a minimum of three minutes. OpenSearch Service keeps the session active so that subsequent queries run quickly. Acceleration queries use compute to maintain indexes in OpenSearch Service. These queries usually take longer because they ingest a varying amount of data into OpenSearch Service to make interactive queries run faster.

For more information, see Amazon OpenSearch Service Pricing.

Limitations

The following limitations apply to OpenSearch Service direct queries with Amazon S3.

  • Your OpenSearch domain must be version 2.13 or later to support OpenSearch Service direct queries.

  • Direct queries with Amazon S3 aren't available on OpenSearch Serverless.

  • Your OpenSearch domain and AWS Glue Data Catalog must be in the same AWS account. Your Amazon S3 bucket can be in a different account (this requires adding a condition to your IAM policy), but it must be in the same AWS Region as your domain.

  • Some data formats aren't supported. Supported data formats are limited to Parquet, CSV, and JSON.

  • OpenSearch Service direct queries with Amazon S3 only support Spark tables generated from Query Workbench. Tables generated within the AWS Glue Data Catalog or Athena are not supported by Spark streaming, which is needed to maintain accelerations and keep indexes up to date.

  • Data must be flattened ahead of querying or you must use SQL in OpenSearch Service to change your nested columns into dedicated columns.

  • Missing columns may require using the COALESCE SQL function to return results.

  • If the structure of your data changes, updates are required for the AWS Glue table as well as existing accelerations.

  • OpenSearch instance types have network payload limits of either 10 MiB or 100 MiB, depending on the instance type.

  • AWS CloudFormation templates aren't supported yet.
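
The flattening and missing-column limitations above can often be handled in the query text itself. The following is a minimal sketch, assuming a hypothetical table `mys3.default.app_logs` with a nested `request` struct column and a `user_agent` column that is empty in some records. The helper `build_flattened_query` is illustrative only and is not part of any OpenSearch Service API.

```python
def build_flattened_query(table: str, limit: int = 100) -> str:
    """Return a SQL query that flattens one nested field and defaults a missing value.

    Table and column names are hypothetical; adjust them to your schema.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT\n"
        # Promote a nested field to its own dedicated column.
        "  request.method AS request_method,\n"
        # Substitute a default when the value is absent from a record.
        "  COALESCE(user_agent, 'unknown') AS user_agent\n"
        f"FROM {table}\n"
        f"LIMIT {limit}"
    )


query = build_flattened_query("mys3.default.app_logs", limit=50)
print(query)
```

The generated statement can be pasted into Query Workbench. Whether `COALESCE` is needed depends on how your data represents absent fields, so treat the default value as a placeholder.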

Recommendations

We recommend you do the following when using direct query:

  • Ingest data into Amazon S3 using partition formats of year, month, day, hour to speed up queries.

  • Add a LIMIT clause to your queries so that they don't return more data than you need.

  • Use Index State Management (where applicable) to maintain storage for materialized views and covering indexes.

  • Drop acceleration jobs and indexes when they are no longer needed.

  • When building skipping indexes, use bloom filters for high-cardinality fields and min/max for fields with large value ranges. We recommend that you use value set on low-cardinality fields.

  • Use reference guides to export data to Amazon S3. You can use AWS logs such as CloudFront, CloudTrail, and Elastic Load Balancing.
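
Two of the recommendations above, partitioned ingestion and skipping indexes, can be sketched in a few lines. This is a rough illustration under stated assumptions: the Hive-style `year=/month=/day=/hour=` key naming, the table name `mys3.default.app_logs`, and the column names are all hypothetical, and both helper functions are invented for this example. The `CREATE SKIPPING INDEX` syntax and the index type keywords follow the open-source OpenSearch Spark extension, so verify them against the version you run.

```python
from datetime import datetime, timezone


def partition_prefix(base: str, ts: datetime) -> str:
    """Build a year/month/day/hour key prefix (Hive-style naming is an assumption)."""
    return f"{base.rstrip('/')}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


def skipping_index_ddl(table: str, columns: dict[str, str]) -> str:
    """Render a CREATE SKIPPING INDEX statement from a {column: index_type} mapping."""
    allowed = {"PARTITION", "VALUE_SET", "MIN_MAX", "BLOOM_FILTER"}
    for column, index_type in columns.items():
        if index_type not in allowed:
            raise ValueError(f"unsupported index type for {column}: {index_type}")
    body = ",\n".join(f"  {col} {kind}" for col, kind in columns.items())
    return f"CREATE SKIPPING INDEX ON {table} (\n{body}\n) WITH (auto_refresh = true)"


ts = datetime(2024, 5, 7, 13, 30, tzinfo=timezone.utc)
print(partition_prefix("logs/app", ts))
# logs/app/year=2024/month=05/day=07/hour=13/

print(skipping_index_ddl(
    "mys3.default.app_logs",
    {"request_id": "BLOOM_FILTER", "bytes_sent": "MIN_MAX"},
))
```

When an acceleration is no longer needed, a statement of the form `DROP SKIPPING INDEX ON mys3.default.app_logs` should remove it, again assuming the same SQL extension.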

Quotas

Your account has the following quotas related to OpenSearch Service direct queries with Amazon S3. Each time you initiate a query, OpenSearch Service opens a session and keeps it alive for at least three minutes. This reduces query latency by removing session startup time from subsequent queries.

Description                               Maximum   Can override
Connections per domain                    10        Yes
Data sources per domain                   20        Yes
Indexes per domain                        5         Yes
Concurrent sessions per data source       10        Yes
Maximum OCUs per query                    60        Yes
Maximum query execution time (minutes)    30        Yes
Maximum OCUs per acceleration             20        Yes
Maximum ephemeral storage                 20        Yes
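
If you plan deployments programmatically, you could compare a planned configuration against these defaults before creating resources. The following is a minimal sketch: the dictionary keys and the `quota_violations` helper are hypothetical names chosen for this example, the values are copied from the table above, and any quota that has been overridden for your account would differ.

```python
# Default quotas copied from the table above. Key names are illustrative.
DEFAULT_QUOTAS = {
    "connections_per_domain": 10,
    "data_sources_per_domain": 20,
    "indexes_per_domain": 5,
    "concurrent_sessions_per_data_source": 10,
    "max_ocus_per_query": 60,
    "max_query_execution_minutes": 30,
    "max_ocus_per_acceleration": 20,
}


def quota_violations(planned: dict[str, int],
                     quotas: dict[str, int] = DEFAULT_QUOTAS) -> list[str]:
    """Return the names of planned values that exceed their quota."""
    return [
        name
        for name, value in planned.items()
        if name in quotas and value > quotas[name]
    ]


plan = {"data_sources_per_domain": 3, "indexes_per_domain": 8}
print(quota_violations(plan))  # ['indexes_per_domain']
```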

Supported Regions

The following Regions are available for OpenSearch Service direct queries with Amazon S3: Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (Stockholm), US East (N. Virginia), US East (Ohio), and US West (Oregon).
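
For automation, the list above can be expressed as a lookup keyed by Region code. This is a small sketch: the Region names come from the list above, the codes are the standard AWS codes for those Regions, and `is_supported_region` is a hypothetical helper. The list may change over time, so treat it as a snapshot.

```python
# Region codes mapped to the Region names listed above.
SUPPORTED_REGIONS = {
    "ap-east-1": "Asia Pacific (Hong Kong)",
    "ap-south-1": "Asia Pacific (Mumbai)",
    "ap-northeast-2": "Asia Pacific (Seoul)",
    "ap-southeast-1": "Asia Pacific (Singapore)",
    "ap-southeast-2": "Asia Pacific (Sydney)",
    "ap-northeast-1": "Asia Pacific (Tokyo)",
    "ca-central-1": "Canada (Central)",
    "eu-central-1": "Europe (Frankfurt)",
    "eu-west-1": "Europe (Ireland)",
    "eu-north-1": "Europe (Stockholm)",
    "us-east-1": "US East (N. Virginia)",
    "us-east-2": "US East (Ohio)",
    "us-west-2": "US West (Oregon)",
}


def is_supported_region(region_code: str) -> bool:
    """Return True if the Region code appears in the list above."""
    return region_code.strip().lower() in SUPPORTED_REGIONS


print(is_supported_region("eu-west-1"))  # True
print(is_supported_region("us-west-1"))  # False
```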