

# Querying HealthOmics analytics data
<a name="analytics-query-data"></a>

**Important**  
AWS HealthOmics variant stores and annotation stores are no longer open to new customers. Existing customers can continue to use the service as normal. For more information, see [AWS HealthOmics variant store and annotation store availability change](variant-store-availability-change.md).

You can perform queries on your variant stores using AWS Lake Formation and Amazon Athena or Amazon EMR. Before you run any queries, complete the setup procedures (described in the following sections) for Lake Formation and Amazon Athena.

For information about Amazon EMR, see [Tutorial: Getting started with Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html).

For variant stores created after September 26, 2024, HealthOmics partitions the store by sample ID, which means that it uses the sample ID to organize how the variant information is stored. Queries that filter on sample information return results faster because they scan less data.
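
As a rough illustration of a query that benefits from this partitioning, the following Python sketch builds an Athena SQL statement that filters on sample ID. The `omicsvariants` table and `sampleid` column names come from the query examples later in this guide. The helper name `build_sample_query` and the sample IDs are hypothetical.

```python
def build_sample_query(sample_ids, table="omicsvariants", limit=100):
    """Build a SELECT statement that filters on the sampleid column.

    For stores partitioned by sample ID, a filter on sampleid lets the
    query engine skip data that belongs to other samples.
    """
    if not sample_ids:
        raise ValueError("Provide at least one sample ID")
    # Double any single quotes so each ID is a valid SQL string literal.
    quoted = ", ".join("'" + s.replace("'", "''") + "'" for s in sample_ids)
    return (
        f"SELECT * FROM {table} "
        f"WHERE sampleid IN ({quoted}) "
        f"LIMIT {int(limit)}"
    )


# Hypothetical sample IDs, for illustration only.
print(build_sample_query(["SAMPLE_A", "SAMPLE_B"]))
```

You can paste the resulting statement into the Athena query editor after you complete the setup procedures.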

HealthOmics uses sample IDs as partition file names. Before you ingest data, check whether the sample IDs contain any protected health information (PHI). If they do, change the sample IDs before you ingest the data. For more information about what content to include and not include in sample IDs, see the guidance on the AWS [HIPAA compliance](https://aws.amazon.com/compliance/hipaa-compliance) web page.

**Topics**
+ [Configuring Lake Formation to use HealthOmics](setting-up-lf.md)
+ [Configuring Athena for queries](analytics-setting-up-athena.md)
+ [Running queries on HealthOmics variant stores](analytics-run-queries.md)

# Configuring Lake Formation to use HealthOmics
<a name="setting-up-lf"></a>

Before you use Lake Formation to manage HealthOmics data stores, perform the following Lake Formation configuration procedures.

**Topics**
+ [Creating or verifying Lake Formation administrators](#create-lf-admins)
+ [Creating resource links using the Lake Formation console](#create-resource-links)
+ [Configuring permissions for AWS RAM resource shares](#configure-lf-permissions)

## Creating or verifying Lake Formation administrators
<a name="create-lf-admins"></a>

Before you can create a data lake in Lake Formation, you must define one or more administrators.

Administrators are users and roles with permissions to create resource links. You set up data lake administrators per account, per Region.

**Create an admin user in the Lake Formation console**

1. Open the AWS Lake Formation console: [Lake Formation console](https://console.aws.amazon.com/lakeformation/)

1. If the console displays the **Welcome to Lake Formation** panel, choose **Get started**.

   Lake Formation adds you to the **Data lake administrators** table.

1. Otherwise, from the left menu, choose **Administrative roles and tasks**.

1. Add any additional administrators as required.

## Creating resource links using the Lake Formation console
<a name="create-resource-links"></a>

Before you create a shared resource that users can query, disable the default access controls. To learn more about disabling default access controls, see [Changing the default security settings for your data lake](https://docs.aws.amazon.com/lake-formation/latest/dg/change-settings.html) in the Lake Formation documentation. You can create resource links individually or as a group, so that you can access data in Amazon Athena or other AWS services (such as Amazon EMR).

**Creating resource links in the AWS Lake Formation console and sharing them with HealthOmics Analytics users**

1. Open the AWS Lake Formation console: [Lake Formation console](https://console.aws.amazon.com/lakeformation/)

1. In the primary navigation bar, choose **Databases**.

1. In the **Databases** table, select the desired database.

1. From the **Create** menu, choose **Resource link**.

1. Enter a **Resource link name**. If you plan to access the database from Athena, enter a name using only lowercase letters (up to 256 characters).

1. Choose **Create**.

1. The new resource link is now listed under **Databases**.
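
The console steps above can also be approximated programmatically. The following sketch builds the `DatabaseInput` structure that the AWS Glue `CreateDatabase` API accepts for a resource link, and checks the name against the naming rule from the procedure, taken literally (lowercase letters only, up to 256 characters). The helper name and the account ID and database names in the usage note are hypothetical.

```python
import re

# Naming rule taken literally from the procedure above:
# lowercase letters only, 1 to 256 characters.
_ATHENA_SAFE_NAME = re.compile(r"[a-z]{1,256}")


def resource_link_input(link_name, target_catalog_id, target_database):
    """Return a DatabaseInput dict that describes a resource link."""
    if not _ATHENA_SAFE_NAME.fullmatch(link_name):
        raise ValueError("Use only lowercase letters, up to 256 characters")
    return {
        "Name": link_name,
        "TargetDatabase": {
            "CatalogId": target_catalog_id,   # account that owns the shared database
            "DatabaseName": target_database,  # name of the shared database
        },
    }
```

Assuming boto3 is installed and your credentials belong to a data lake administrator, you could pass the result to `boto3.client("glue").create_database(DatabaseInput=resource_link_input("omicslink", "111122223333", "shared_db"))`.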

### Grant access to the shared resource using the Lake Formation console
<a name="create-resource-links"></a>

A Lake Formation database administrator can grant access to the shared resource using the following procedure.

1. Open the AWS Lake Formation console: [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/)

1. In the primary navigation bar, choose **Databases**.

1. On the **Databases** page, select the resource link you previously created.

1. From the **Actions** menu, choose **Grant on target**.

1. On the **Grant data permissions** page under **Principals**, choose **IAM users or roles**.

1. From the **IAM users or roles** drop-down menu, choose the user or role to which you want to grant access.

1. In the **LF-Tags or catalog resources** card, select the **Named data catalog resources** option.

1. From the **Tables-optional** drop-down menu, select **All Tables** or the table that you previously created.

1. In the **Table permissions** card, under **Table permissions** choose **Describe** and **Select**.

1. Next, choose **Grant**.

To view the Lake Formation permissions, choose **Data lake permissions** from the primary navigation pane. The table shows the available databases and resource links.
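
For reference, the grant procedure above roughly corresponds to a Lake Formation `GrantPermissions` request. The following sketch only builds the request parameters. It assumes that **Grant on target** applies to the shared (target) database, so `catalog_id` and `database` identify that database rather than the resource link. The helper name and the ARN in the usage note are hypothetical.

```python
def grant_request(principal_arn, catalog_id, database, table=None):
    """Build GrantPermissions parameters for Describe and Select on tables.

    If `table` is None, the grant covers all tables in the database.
    """
    table_resource = {"CatalogId": catalog_id, "DatabaseName": database}
    if table is None:
        table_resource["TableWildcard"] = {}  # equivalent of "All Tables"
    else:
        table_resource["Name"] = table
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": table_resource},
        "Permissions": ["DESCRIBE", "SELECT"],
    }
```

Assuming boto3 is available, a call such as `boto3.client("lakeformation").grant_permissions(**grant_request("arn:aws:iam::111122223333:role/AnalystRole", "444455556666", "shared_db"))` would submit the request.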

## Configuring permissions for AWS RAM resource shares
<a name="configure-lf-permissions"></a>

In the AWS Lake Formation console, choose **Data lake permissions** in the primary navigation bar to view the permissions. The **Data permissions** page shows a table that lists the **Resource types**, **Databases**, and **ARN** values related to each shared resource under **RAM Resource Share**. If you need to accept an AWS Resource Access Manager (AWS RAM) resource share, AWS Lake Formation notifies you in the console.

HealthOmics can implicitly accept the AWS RAM resource shares during store creation. To accept the AWS RAM resource share, the IAM user or role that calls the `CreateVariantStore` or `CreateAnnotationStore` API operations must allow the following actions:
+ `ram:GetResourceShareInvitations` - This action allows HealthOmics to find the invitations.
+ `ram:AcceptResourceShareInvitation` - This action allows HealthOmics to accept the invitation by using a forward access sessions (FAS) token.

Without these permissions, you see an authorization error during store creation.

Here is a sample policy that includes these actions. Add this policy to the IAM user or role that accepts the AWS RAM resource share.

------
#### [ JSON ]

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "omics:*",
        "ram:AcceptResourceShareInvitation",
        "ram:GetResourceShareInvitations"
      ],
      "Resource": "*"
    }
  ]
}
```

------
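
Before you create a store, you might want a quick local check that a policy document includes the two AWS RAM actions. The following sketch is a deliberately simplified checker: it looks only at `Allow` statements and their `Action` patterns, and it ignores `Deny` statements, `NotAction`, `Resource`, and `Condition` elements, so it is not a substitute for IAM policy evaluation. The function names are hypothetical.

```python
import json
from fnmatch import fnmatchcase

REQUIRED_ACTIONS = (
    "ram:GetResourceShareInvitations",
    "ram:AcceptResourceShareInvitation",
)


def _as_list(value):
    # IAM allows a single value or a list for Statement and Action.
    return value if isinstance(value, list) else [value]


def missing_ram_actions(policy_json):
    """Return the required RAM actions that no Allow statement matches."""
    policy = json.loads(policy_json)
    allowed_patterns = [
        pattern.lower()
        for statement in _as_list(policy.get("Statement", []))
        if statement.get("Effect") == "Allow"
        for pattern in _as_list(statement.get("Action", []))
    ]
    # Action names are compared case-insensitively; "*" acts as a wildcard.
    return [
        action
        for action in REQUIRED_ACTIONS
        if not any(fnmatchcase(action.lower(), p) for p in allowed_patterns)
    ]
```

An empty list means that both actions appear in at least one `Allow` statement.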

# Configuring Athena for queries
<a name="analytics-setting-up-athena"></a>

You can use Athena to query variants and annotations. Before you run any queries, perform the following setup tasks:

**Topics**
+ [Configure a query results location using the Athena console](#configure-athena-query)
+ [Configure a workgroup with Athena engine v3](#configure-athena-workgroup)

## Configure a query results location using the Athena console
<a name="configure-athena-query"></a>

To configure a query results location, follow these steps.

1. Open the Athena console: [Athena console](https://console.aws.amazon.com/athena/)

1. In the primary navigation bar, choose **Query editor**.

1. In the query editor, choose the **Settings** tab, then choose **Manage**.

1. Enter an Amazon S3 prefix for the location where Athena saves the query results.

## Configure a workgroup with Athena engine v3
<a name="configure-athena-workgroup"></a>

To configure a workgroup, follow these steps.

1. Open the Athena console: [Athena console](https://console.aws.amazon.com/athena/)

1. In the primary navigation bar, choose **Workgroups**, then **Create workgroup**.

1. Enter a name for the workgroup.

1. Select **Athena SQL** as the type of engine.

1. Under **Upgrade query engine**, select **Manual**.

1. Under **Query engine version**, select **Athena version 3**.

1. Choose **Create workgroup**.
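
If you prefer to script this setup, the two Athena procedures above can be combined into one request. The following sketch builds parameters for the Athena `CreateWorkGroup` API with a query results location and a pinned engine version 3. The helper name and the workgroup and bucket names in the usage note are hypothetical.

```python
def workgroup_request(name, results_s3_uri):
    """Build CreateWorkGroup parameters for an engine version 3 workgroup."""
    if not results_s3_uri.startswith("s3://"):
        raise ValueError("The results location must be an s3:// URI")
    return {
        "Name": name,
        "Configuration": {
            # Where Athena writes the query results.
            "ResultConfiguration": {"OutputLocation": results_s3_uri},
            # Pin the engine version instead of letting Athena choose.
            "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"},
        },
    }
```

Assuming boto3 is available, `boto3.client("athena").create_work_group(**workgroup_request("omics-wg", "s3://amzn-s3-demo-bucket/athena-results/"))` would create the workgroup.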

# Running queries on HealthOmics variant stores
<a name="analytics-run-queries"></a>

You can perform queries on your variant store using Amazon Athena. Note that genomic coordinates in variant and annotation stores are represented as zero-based, half-open intervals: the start position is included and the end position is excluded.
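
To make the coordinate convention concrete, the following sketch converts a one-based VCF `POS` value and its reference allele into a zero-based, half-open `(start, end)` pair. It assumes that the interval spans the reference allele, which might not hold for every variant type (for example, symbolic alleles). The function name is hypothetical.

```python
def vcf_to_zero_based_half_open(pos, ref):
    """Convert a 1-based VCF POS and REF allele to a (start, end) pair.

    start is inclusive and end is exclusive, so end - start == len(ref).
    """
    if pos < 1 or not ref:
        raise ValueError("pos must be >= 1 and ref must be non-empty")
    start = pos - 1          # shift from 1-based to 0-based
    end = start + len(ref)   # exclusive end
    return start, end
```

Under this assumption, a single-base variant at VCF position 100 maps to `start` 99 and `end` 100, which are the values to use when you filter on the `start` and `"end"` columns.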

## Run a simple query using the Athena console
<a name="run-queries-athena-simple"></a>

The following example shows how to run a simple query.

1. Open the Athena Query editor: [Athena Query editor](https://console.aws.amazon.com/athena/)

1. Under **Workgroup**, select the workgroup that you created during setup.

1. Verify that **Data source** is **AwsDataCatalog**.

1. For **Database**, select the database resource link that you created during the Lake Formation setup.

1. Copy the following query into the **Query Editor** under the **Query 1** tab:

   ```
   SELECT * FROM omicsvariants LIMIT 10
   ```

1. Choose **Run** to run the query. The console populates the results table with the first 10 rows of the **omicsvariants** table.

## Run a complex query using the Athena console
<a name="run-queries-athena-complex"></a>

The following example shows how to run a complex query that joins variants with annotations. Before you run this query, import ClinVar data into the annotation store.

**Run a complex query**

1. Open the Athena Query editor: [Athena Query editor](https://console.aws.amazon.com/athena/)

1. Under **Workgroup**, select the workgroup that you created during setup.

1. Verify that **Data source** is **AwsDataCatalog**.

1. For **Database**, select the database resource link that you created during the Lake Formation setup.

1. Choose the **+** icon at the top right to create a new query tab named **Query 2**.

1. Copy the following query into the **Query Editor** under the **Query 2** tab:

   ```
   SELECT variants.sampleid,
     variants.contigname,
     variants.start,
     variants."end",
     variants.referenceallele,
     variants.alternatealleles,
     variants.attributes AS variant_attributes,
     clinvar.attributes AS clinvar_attributes
   FROM omicsvariants AS variants
   INNER JOIN omicsannotations AS clinvar ON
     variants.contigname = CONCAT('chr', clinvar.contigname)
     AND variants.start = clinvar.start
     AND variants."end" = clinvar."end"
     AND variants.referenceallele = clinvar.referenceallele
     AND variants.alternatealleles = clinvar.alternatealleles
   WHERE clinvar.attributes['CLNSIG']='Likely_pathogenic'
   ```

1. Choose **Run** to run the query.
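
The console procedures above can also be sketched in code. The following function starts a query, polls until it finishes, and returns the first page of results. It takes the Athena client as an argument so that it does not depend on how you create the client. The function name is hypothetical, and the database and workgroup names are the ones that you created during setup.

```python
import time


def run_query(athena, sql, database, workgroup, poll_seconds=1.0):
    """Run a query and return the first page of rows as lists of strings.

    `athena` is expected to behave like boto3.client("athena").
    """
    execution_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        response = athena.get_query_execution(QueryExecutionId=execution_id)
        status = response["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {status['State']}: {reason}")

    # Only the first page is read here; large results need pagination.
    page = athena.get_query_results(QueryExecutionId=execution_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in page["ResultSet"]["Rows"]
    ]
```

Assuming boto3 is installed, you could call `run_query(boto3.client("athena"), "SELECT * FROM omicsvariants LIMIT 10", database, workgroup)`. For `SELECT` statements, the first returned row typically contains the column names.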