Amazon Athena DocumentDB connector
The Amazon Athena DocumentDB connector enables Athena to communicate with your DocumentDB instances so that you can query your DocumentDB data with SQL. The connector also works with any endpoint that is compatible with MongoDB.
Unlike traditional relational data stores, Amazon DocumentDB collections do not have set schema. DocumentDB does not have a metadata store. Each entry in a DocumentDB collection can have different fields and data types.
The DocumentDB connector supports two mechanisms for generating table schema information: basic schema inference and AWS Glue Data Catalog metadata.
Schema inference is the default. This option scans a small number of documents in your collection, forms a union of all fields, and coerces fields that have non-overlapping data types. This option works well for collections that have mostly uniform entries.
For collections with a greater variety of data types, the connector supports retrieving metadata from the AWS Glue Data Catalog. If the connector sees a AWS Glue database and table that match your DocumentDB database and collection names, it gets its schema information from the corresponding AWS Glue table. When you create your AWS Glue table, we recommend that you make it a superset of all fields that you might want to access from your DocumentDB collection.
If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.
Prerequisites
Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see Deploy a data source connector or Use the AWS Serverless Application Repository to deploy a data source connector.
Parameters
Use the Lambda environment variables in this section to configure the DocumentDB connector.
-
spill_bucket – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
-
spill_prefix – (Optional) Defaults to a subfolder in the specified
spill_bucket
calledathena-federation-spill
. We recommend that you configure an Amazon S3 storage lifecycle on this location to delete spills older than a predetermined number of days or hours. -
spill_put_request_headers – (Optional) A JSON encoded map of request headers and values for the Amazon S3
putObject
request that is used for spilling (for example,{"x-amz-server-side-encryption" : "AES256"}
). For other possible headers, see PutObject in the Amazon Simple Storage Service API Reference. -
kms_key_id – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like
a7e63k4b-8loc-40db-a2a1-4d0en2cd8331
, you can specify a KMS key ID. -
disable_spill_encryption – (Optional) When set to
True
, disables spill encryption. Defaults toFalse
so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses server-side encryption. -
disable_glue – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
-
glue_catalog – (Optional) Use this option to specify a cross-account AWS Glue catalog. By default, the connector attempts to get metadata from its own AWS Glue account.
-
default_docdb – If present, specifies a DocumentDB connection string to use when no catalog-specific environment variable exists.
-
disable_projection_and_casing – (Optional) Disables projection and casing. Use if you want to query Amazon DocumentDB tables that use case sensitive column names. The
disable_projection_and_casing
parameter uses the following values to specify the behavior of casing and column mapping:-
false – This is the default setting. Projection is enabled, and the connector expects all column names to be in lower case.
-
true – Disables projection and casing. When using the
disable_projection_and_casing
parameter, keep in mind the following points:-
Use of the parameter can result in higher bandwidth usage. Additionally, if your Lambda function is not in the same AWS Region as your data source, you will incur higher standard AWS cross-region transfer costs as a result of the higher bandwidth usage. For more information about cross-region transfer costs, see AWS Data Transfer Charges for Server and Serverless Architectures
in the AWS Partner Network Blog. -
Because a larger number of bytes is transferred and because the larger number of bytes requires a higher deserialization time, overall latency can increase.
-
-
-
enable_case_insensitive_match – (Optional) When
true
, performs case insensitive searches against schema and table names in Amazon DocumentDB. The default isfalse
. Use if your query contains uppercase schema or table names.
Specifying connection strings
You can provide one or more properties that define the DocumentDB connection details for the DocumentDB instances that you use with the connector. To do this, set a Lambda environment variable that corresponds to the catalog name that you want to use in Athena. For example, suppose you want to use the following queries to query two different DocumentDB instances from Athena:
SELECT * FROM "docdb_instance_1".database.table
SELECT * FROM "docdb_instance_2".database.table
Before you can use these two SQL statements, you must add two environment
variables to your Lambda function: docdb_instance_1
and
docdb_instance_2
. The value for each should be a
DocumentDB connection string in the following format:
mongodb://:@:/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0
Using secrets
You can optionally use AWS Secrets Manager for part or all of the value for your
connection string details. To use the Athena Federated Query feature with Secrets Manager, the VPC
connected to your Lambda function should have internet access
If you use the syntax ${my_secret}
to put the name of a secret
from Secrets Manager in your connection string, the connector replaces
${my_secret}
with its plain text value from Secrets Manager exactly.
Secrets should be stored as a plain text secret with value
.
Secrets stored as
<username>
:<password>
{username:
will not be passed to the connection string properly.<username>
,password:<password>
}
Secrets can also be used for the entire connection string entirely, and the username and password can be defined within the secret.
For example, suppose you set the Lambda environment variable for
docdb_instance_1
to the following value:
mongodb://${docdb_instance_1_creds}@myhostname.com:123/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0
The Athena Query Federation SDK automatically attempts to retrieve a secret named docdb_instance_1_creds
from Secrets Manager and inject that
value in place of ${docdb_instance_1_creds}
.
Any part of the connection string that is enclosed by the ${
}
character combination is interpreted as a secret from Secrets Manager. If
you specify a secret name that the connector cannot find in Secrets Manager, the connector
does not replace the text.
Setting up databases and tables in AWS Glue
Because the connector's built-in schema inference capability scans a limited number of documents and supports only a subset of data types, you might want to use AWS Glue for metadata instead.
To enable an AWS Glue table for use with Amazon DocumentDB, you must have a AWS Glue database and table for the DocumentDB database and collection that you want to supply supplemental metadata for.
To use an AWS Glue table for supplemental metadata
-
Use the AWS Glue console to create an AWS Glue database that has the same name as your Amazon DocumentDB database name.
-
Set the URI property of the database to include docdb-metadata-flag.
-
(Optional) Add the sourceTable table property. This property defines the source table name in Amazon DocumentDB. Use this property if your AWS Glue table has a different name from the table name in Amazon DocumentDB. Differences in naming rules between AWS Glue and Amazon DocumentDB can make this necessary. For example, capital letters are not permitted in AWS Glue table names, but they are permitted in Amazon DocumentDB table names.
-
(Optional) Add the columnMapping table property. This property defines column name mappings. Use this property if AWS Glue column naming rules prevent you from creating an AWS Glue table that has the same column names as those in your Amazon DocumentDB table. This can be useful because capital letters are permitted in Amazon DocumentDB column names but are not permitted in AWS Glue column names.
The
columnMapping
property value is expected to be a set of mappings in the formatcol1=Col1,col2=Col2
.Note
Column mapping applies only to top level column names and not to nested fields.
After you add the AWS Glue
columnMapping
table property, you can remove thedisable_projection_and_casing
Lambda environment variable. -
Make sure that you use the data types appropriate for AWS Glue as listed in this document.
Data type support
This section lists the data types that the DocumentDB connector uses for schema inference, and the data types when AWS Glue metadata is used.
Schema inference data types
The schema inference feature of the DocumentDB connector attempts to infer values as belonging to one of the following data types. The table shows the corresponding data types for Amazon DocumentDB, Java, and Apache Arrow.
Apache Arrow | Java or DocDB |
---|---|
VARCHAR | String |
INT | Integer |
BIGINT | Long |
BIT | Boolean |
FLOAT4 | Float |
FLOAT8 | Double |
TIMESTAMPSEC | Date |
VARCHAR | ObjectId |
LIST | List |
STRUCT | Document |
AWS Glue data types
If you use AWS Glue for supplemental metadata, you can configure the following data types. The table shows the corresponding data types for AWS Glue and Apache Arrow.
AWS Glue | Apache Arrow |
---|---|
int | INT |
bigint | BIGINT |
double | FLOAT8 |
float | FLOAT4 |
boolean | BIT |
binary | VARBINARY |
string | VARCHAR |
List | LIST |
Struct | STRUCT |
Required Permissions
For full details on the IAM policies that this
connector requires, review the Policies
section of the athena-docdb.yaml
-
Amazon S3 write access – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
-
Athena GetQueryExecution – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
-
AWS Glue Data Catalog – The DocumentDB connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
-
CloudWatch Logs – The connector requires access to CloudWatch Logs for storing logs.
-
AWS Secrets Manager read access – If you choose to store DocumentDB endpoint details in Secrets Manager, you must grant the connector access to those secrets.
-
VPC access – The connector requires the ability to attach and detach interfaces to your VPC so that it can connect to it and communicate with your DocumentDB instances.
Performance
The Athena Amazon DocumentDB connector does not currently support parallel scans but attempts to push down predicates as part of its DocumentDB queries, and predicates against indexes on your DocumentDB collection result in significantly less data scanned.
The Lambda function performs projection pushdown to decrease the data scanned by the query. However, selecting a subset of columns sometimes results in a longer query execution runtime. LIMIT
clauses reduce the amount of data scanned, but if you do not provide a predicate, you should expect SELECT
queries with a LIMIT
clause to scan at least 16 MB of data.
Passthrough queries
The Athena Amazon DocumentDB connector supports passthrough queries and is NoSQL based. For information about querying Amazon DocumentDB, see Querying in the Amazon DocumentDB Developer Guide.
To use passthrough queries with Amazon DocumentDB, use the following syntax:
SELECT * FROM TABLE( system.query( database => '
database_name
', collection => 'collection_name
', filter => '{query_syntax
}' ))
The following example queries the example
database within the
TPCDS
collection, filtering on all books with the title Bill
of Rights.
SELECT * FROM TABLE( system.query( database => 'example', collection => 'tpcds', filter => '{title: "Bill of Rights"}' ))
Additional resources
-
For an article on using Amazon Athena Federated Query to connect a MongoDB database to Amazon QuickSight
to build dashboards and visualizations, see Visualize MongoDB data from Amazon QuickSight using Amazon Athena Federated Query in the AWS Big Data Blog. For additional information about this connector, visit the corresponding site
on GitHub.com.