

# HealthOmics storage
<a name="sequence-stores"></a>

Use HealthOmics storage to store, retrieve, organize, and share genomics data efficiently and at low cost. HealthOmics storage understands the relationships between different data objects, so that you can define which read sets originated from the same source data. This provides you with data provenance. 

Data that's stored in `ACTIVE` state is retrievable immediately. Data that hasn't been accessed for 30 days or more is stored in `ARCHIVE` state. To access archived data, you can reactivate it through the API operations or console. 

HealthOmics sequence stores are designed to preserve the content integrity of files. However, bitwise equivalence of imported and exported data files isn't preserved, because of the compression applied during active and archive tiering.

During ingestion, HealthOmics generates an entity tag, or *HealthOmics ETag*, to make it possible to validate the content integrity of your data files. An ETag is calculated for each source of a read set. The ETag calculation doesn't alter the actual file or genomic data. After a read set is created, the ETag shouldn't change throughout the lifecycle of the read set source, so reimporting the same file results in the same ETag value being calculated. 

**Topics**
+ [HealthOmics ETags and data provenance](etags-and-provenance.md)
+ [Creating a HealthOmics reference store](create-reference-store.md)
+ [Creating a HealthOmics sequence store](create-sequence-store.md)
+ [Deleting HealthOmics reference and sequence stores](deleting-reference-and-sequence-stores.md)
+ [Importing read sets into a HealthOmics sequence store](import-sequence-store.md)
+ [Direct upload to a HealthOmics sequence store](synchronous-uploads.md)
+ [Exporting HealthOmics read sets to an Amazon S3 bucket](read-set-exports.md)
+ [Accessing HealthOmics read sets with Amazon S3 URIs](s3-access.md)
+ [Activating read sets in HealthOmics](activating-read-sets.md)

# HealthOmics ETags and data provenance
<a name="etags-and-provenance"></a>

A HealthOmics ETag (entity tag) is a hash of the ingested content in a sequence store. This simplifies data retrieval and processing while maintaining the content integrity of the ingested data files. The ETag reflects changes to the semantic content of the object, not its metadata. The specified read set type and algorithm determine how the ETag is calculated. The ETag calculation doesn't alter the actual file or genomic data. When the file type schema of the read set permits it, the sequence store updates fields that are linked to data provenance. 

Files have a bitwise identity and a semantic identity. Bitwise identity means that the bits of two files are identical; semantic identity means that the contents of two files are identical. Semantic identity is resilient to metadata and compression changes because it captures the content integrity of the file. 

Read sets in HealthOmics sequence stores undergo compression and decompression cycles and data provenance tracking throughout an object's lifecycle. During this processing, the bitwise identity of an ingested file can change, and is expected to change each time a file is activated; however, the semantic identity of the file is maintained. The semantic identity is captured as a HealthOmics entity tag, or ETag, that's calculated during sequence store ingestion and available as read set metadata.
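The difference between the two identities can be demonstrated locally. The following Python sketch (illustrative only; it isn't the service's implementation) shows that recompressing the same content with different gzip header metadata changes the bytes on disk, while a hash of the decompressed content stays constant:

```python
import gzip
import hashlib

# A tiny FASTQ record (illustrative content only).
fastq = b"@read1\nACGT\n+\nFFFF\n"

# The same content compressed twice with different gzip header metadata
# (mtime) produces different bytes on disk: the bitwise identity changes.
blob_a = gzip.compress(fastq, mtime=0)
blob_b = gzip.compress(fastq, mtime=1)
assert blob_a != blob_b
assert hashlib.md5(blob_a).hexdigest() != hashlib.md5(blob_b).hexdigest()

# A hash of the decompressed content is identical for both blobs: the
# semantic identity is preserved, which is the property an ETag captures.
sem_a = hashlib.md5(gzip.decompress(blob_a)).hexdigest()
sem_b = hashlib.md5(gzip.decompress(blob_b)).hexdigest()
assert sem_a == sem_b == hashlib.md5(fastq).hexdigest()
```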

When the file type schema of the read set permits it, the sequence store updates fields that are linked to data provenance. For uBAM, BAM, and CRAM files, a new `@CO` (comment) tag is added to the header. The comment contains the sequence store ID and the ingestion timestamp. 

## Amazon S3 ETags
<a name="s3-etags"></a>

When you access a file using its Amazon S3 URI, Amazon S3 API operations may also return Amazon S3 ETag and checksum values. These values differ from the HealthOmics ETag because they represent the file's bitwise identity. To learn more about descriptive metadata and objects, see the Amazon S3 [Object API documentation](https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html). Amazon S3 ETag values can change with each activation cycle of a read set, and you can use them to validate a single read of a file. However, don't cache Amazon S3 ETag values for file identity validation across the file's lifecycle, because they don't remain consistent. In contrast, the HealthOmics ETag remains consistent throughout the read set's lifecycle. 
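For objects uploaded in a single part and not encrypted with SSE-KMS, the Amazon S3 ETag is typically the MD5 hash of the object's bytes, so any change to the stored bytes produces a new ETag. The following sketch (an illustration of that bitwise behavior, not an S3 API call) shows why such a value can't serve as a stable identity:

```python
import hashlib

def single_part_s3_style_etag(data: bytes) -> str:
    """MD5 hex digest of the object bytes, which is what Amazon S3 typically
    reports as the ETag for single-part, non-SSE-KMS uploads."""
    return hashlib.md5(data).hexdigest()

# The same logical content stored as different bytes (for example, after a
# recompression cycle) yields a different bitwise ETag.
assert single_part_s3_style_etag(b"ACGT\n") != single_part_s3_style_etag(b"ACGT\r\n")
```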

## How HealthOmics calculates ETags
<a name="how-etags-calculated"></a>

The ETag is generated from a hash of the ingested file contents. The ETag algorithm family is set to MD5up by default, but you can configure a different family during sequence store creation. When the ETag is calculated, the algorithm and the calculated hashes are added to the read set metadata. The supported algorithms in the MD5 family are as follows.
+ *FASTQ\$1MD5up* – Calculates the MD5 hash of an uncompressed, complete FASTQ read set source.
+ *BAM\$1MD5up* – Calculates the MD5 hash of the alignment section of an uncompressed BAM or uBAM read set source as represented in the SAM, based on the linked reference, if one is available.
+ *CRAM\$1MD5up* – Calculates the MD5 hash of the alignment section of the uncompressed CRAM read set source as represented in the SAM, based on the linked reference.

**Note**  
MD5 hashing is known to be vulnerable to collisions. Because of this, two different files can have the same ETag if they were crafted to exploit a known collision.

The following algorithms are supported for the SHA256 family. The algorithms are calculated as follows:
+ *FASTQ\$1SHA256up* – Calculates the SHA-256 hash of an uncompressed, complete FASTQ read set source. 
+ *BAM\$1SHA256up* – Calculates the SHA-256 hash of the alignment section of an uncompressed BAM or uBAM read set source as represented in the SAM, based on the linked reference, if one is available. 
+ *CRAM\$1SHA256up* – Calculates the SHA-256 hash of the alignment section of an uncompressed CRAM read set source as represented in the SAM, based on the linked reference. 

The following algorithms are supported for the SHA512 family. The algorithms are calculated as follows:
+ *FASTQ\$1SHA512up* – Calculates the SHA-512 hash of an uncompressed, complete FASTQ read set source. 
+ *BAM\$1SHA512up* – Calculates the SHA-512 hash of the alignment section of an uncompressed BAM or uBAM read set source as represented in the SAM, based on the linked reference, if one is available. 
+ *CRAM\$1SHA512up* – Calculates the SHA-512 hash of the alignment section of an uncompressed CRAM read set source as represented in the SAM, based on the linked reference. 
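As a local illustration of the three algorithm families, the following Python sketch computes FASTQ-style hashes over the complete uncompressed source. It covers only the simple FASTQ case; the BAM and CRAM algorithms hash the alignment section as represented in SAM, which this sketch doesn't reproduce, and the function name is our own:

```python
import gzip
import hashlib

# Hash constructors corresponding to the three ETag algorithm families,
# shown here only for the FASTQ (complete uncompressed source) case.
FAMILIES = {
    "FASTQ$1MD5up": hashlib.md5,
    "FASTQ$1SHA256up": hashlib.sha256,
    "FASTQ$1SHA512up": hashlib.sha512,
}

def fastq_etags(gzipped_fastq: bytes) -> dict:
    """Hash the *uncompressed* source once per algorithm family."""
    data = gzip.decompress(gzipped_fastq)
    return {name: algo(data).hexdigest() for name, algo in FAMILIES.items()}
```

Because the hash is computed over the decompressed bytes, recompressing the same FASTQ with different gzip settings yields the same set of values.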

# Creating a HealthOmics reference store
<a name="create-reference-store"></a>

A reference store in HealthOmics is a data store for the storage of reference genomes. You can have a single reference store in each AWS account and Region. You can create a reference store using the console or CLI.

**Topics**
+ [Creating a reference store using the console](#console-create-reference-store)
+ [Creating a reference store using the CLI](#api-create-reference-store)

## Creating a reference store using the console
<a name="console-create-reference-store"></a>

**To create a reference store**

1. Open the [HealthOmics console](https://console.aws.amazon.com/omics/).

1.  If required, open the left navigation pane (≡). Choose **Reference store**.

1. Choose **Reference genomes** from the Genomics data storage options.

1. You can either choose a previously imported reference genome or import a new one. If you haven't imported a reference genome, choose **Import reference genome** in the top right.

1. On the **Create reference genome import job** page, choose either the **Quick create** or **Manual create** option to create a reference store, and then provide the following information.
   + **Reference genome name** - A unique name for this store. 
   + **Description** (optional) - A description of this reference store.
   + **IAM Role** - Select a role with access to your reference genome. 
   + **Reference from Amazon S3** - Select your reference sequence file in an Amazon S3 bucket.
   + **Tags** (optional) - Provide up to 50 tags for this reference store.

## Creating a reference store using the CLI
<a name="api-create-reference-store"></a>

The following example shows you how to create a reference store by using the AWS CLI. You can have one reference store per Region in each AWS account. 

Reference stores support storage of FASTA files with the extensions `.fasta`, `.fa`, `.fas`, `.fsa`, `.faa`, `.fna`, `.ffn`, `.frn`, `.mpfa`, `.seq`, `.txt`. The `bgzip` version of these extensions is also supported. 
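A quick local check of whether a file name matches a supported extension might look like the following sketch. The assumption that bgzip-compressed files carry an extra `.gz` suffix is ours, not part of the service contract:

```python
from pathlib import Path

# File extensions accepted by a reference store, per the documentation.
FASTA_EXTENSIONS = {
    ".fasta", ".fa", ".fas", ".fsa", ".faa", ".fna",
    ".ffn", ".frn", ".mpfa", ".seq", ".txt",
}

def is_supported_reference(filename: str) -> bool:
    """Return True if the file name looks importable into a reference
    store. Assumes bgzip-compressed variants end in an extra .gz suffix."""
    suffixes = Path(filename).suffixes
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]  # strip the compression suffix
    return bool(suffixes) and suffixes[-1] in FASTA_EXTENSIONS
```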

In the following example, replace `reference store name` with the name you've chosen for your reference store.

```
aws omics create-reference-store --name "reference store name"  
```

You receive a JSON response with the reference store ID and name, the ARN, and the timestamp of when your reference store was created.

```
{
    "id": "3242349265",
    "arn": "arn:aws:omics:us-west-2:555555555555:referenceStore/3242349265",
    "name": "MyReferenceStore",
    "creationTime": "2022-07-01T20:58:42.878Z"
}
```

You can use the reference store ID in additional AWS CLI commands. You can retrieve the list of reference store IDs linked to your account by using the **list-reference-stores** command, as shown in the following example.

```
aws omics list-reference-stores 
```

In response, you receive the details of your newly created reference store.

```
{
    "referenceStores": [
        {
            "id": "3242349265",
            "arn": "arn:aws:omics:us-west-2:555555555555:referenceStore/3242349265",
            "name": "MyReferenceStore",
            "creationTime": "2022-07-01T20:58:42.878Z"
        }
    ]
}
```

After you create a reference store, you can create import jobs to load genomic reference files into it. To do so, you must use or create an IAM role to access the data. The following is an example policy. 

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation"
                
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket1",
                "arn:aws:s3:::amzn-s3-demo-bucket1/*"
            ]
        }
    ]
}
```

------

You must also have a trust policy similar to the following example.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                   "omics.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

------

You can now import a reference genome. This example uses Genome Reference Consortium Human Build 38 (hg38), which is open access and available from the [Registry of Open Data on AWS](https://registry.opendata.aws/). The bucket that hosts this data is based in US East (Ohio). To use buckets in other AWS Regions, you can copy the data to an Amazon S3 bucket hosted in your Region. Use the following AWS CLI command to copy the genome to your Amazon S3 bucket. 

```
aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta s3://amzn-s3-demo-bucket 
```

You can then begin your import job. Replace `reference store ID`, `role ARN`, and `source file path` with your own input. Each source specifies at least a `sourceFile` and a `name`.

```
aws omics start-reference-import-job --reference-store-id reference store ID --role-arn role ARN --sources sourceFile=source file path,name=MyReference
```

You receive the following response in JSON, confirming that the import job was created.

```
{
    "id": "7252016478",
    "referenceStoreId": "3242349265",
    "roleArn": "arn:aws:iam::111122223333:role/OmicsReferenceImport",
    "status": "CREATED",
    "creationTime": "2022-07-01T21:15:13.727Z"
```

You can monitor the status of a job by using the following command. In the following example, replace `reference store ID` and `job ID` with your reference store ID and the job ID that you want to learn more about.

```
aws omics get-reference-import-job --reference-store-id reference store ID --id job ID  
```

In response, you receive the details of the import job, including its status.

```
{
    "id": "7252016478",
    "referenceStoreId": "3242349265",
    "roleArn": "arn:aws:iam::555555555555:role/OmicsReferenceImport",
    "status": "RUNNING",
    "creationTime": "2022-07-01T21:15:13.727Z",
    "sources": [
        {
            "sourceFile": "s3://amzn-s3-demo-bucket/Homo_sapiens_assembly38.fasta",
            "status": "IN_PROGRESS",
            "name": "MyReference"
        }
    ]
}
```

You can also find the reference that was imported by listing your references and filtering them based on the reference name. Replace `reference store ID` with your reference store ID, and add an optional filter to narrow the list.

```
aws omics list-references --reference-store-id reference store ID --filter name=MyReference  
```

In response, you receive the following information.

```
{
    "references": [
        {
            "id": "1234567890",
            "arn": "arn:aws:omics:us-west-2:555555555555:referenceStore/1234567890/reference/1234567890",
            "referenceStoreId": "1234567890",
            "md5": "7ff134953dcca8c8997453bbb80b6b5e",
            "status": "ACTIVE",
            "name": "MyReference",
            "creationTime": "2022-07-02T00:15:19.787Z",
            "updateTime": "2022-07-02T00:15:19.787Z"
        }
    ]
}
```

To learn more about the reference metadata, use the **get-reference-metadata** command. In the following example, replace `reference store ID` with your reference store ID and `reference ID` with the reference ID that you want to learn more about.

```
aws omics get-reference-metadata --reference-store-id reference store ID --id reference ID   
```

You receive the following information in response.

```
{
    "id": "1234567890",
    "arn": "arn:aws:omics:us-west-2:555555555555:referenceStore/referencestoreID/reference/referenceID",
    "referenceStoreId": "1234567890",
    "md5": "7ff134953dcca8c8997453bbb80b6b5e",
    "status": "ACTIVE",
    "name": "MyReference",
    "creationTime": "2022-07-02T00:15:19.787Z",
    "updateTime": "2022-07-02T00:15:19.787Z",
    "files": {
        "source": {
            "totalParts": 31,
            "partSize": 104857600,
            "contentLength": 3249912778
        },
        "index": {
            "totalParts": 1,
            "partSize": 104857600,
            "contentLength": 160928
        }
    }
}
```

You can also download parts of the reference file by using **get-reference**. In the following example, replace `reference store ID` with your reference store ID and `reference ID` with the reference ID that you want to download from.

```
aws omics get-reference --reference-store-id reference store ID --id reference ID --part-number 1 outfile.fa   
```
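Because large references are served in parts (see `totalParts` in the metadata response above), you download one part per **get-reference** call and then concatenate the parts locally. The following minimal sketch (with hypothetical part file names; the helper is our own, not part of any SDK) shows the reassembly step:

```python
from pathlib import Path

def reassemble_parts(part_paths: list, out_path: Path) -> int:
    """Concatenate per-part files into a single output file and return the
    total number of bytes written. Part paths must be supplied in
    ascending part-number order."""
    total = 0
    with out_path.open("wb") as out:
        for part in part_paths:
            data = Path(part).read_bytes()
            out.write(data)
            total += len(data)
    return total
```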

# Creating a HealthOmics sequence store
<a name="create-sequence-store"></a>

HealthOmics sequence stores support storage of genomic files in the unaligned formats `FASTQ` (gzip-only) and `uBAM`, and in the aligned formats `BAM` and `CRAM`. 

Imported files are stored as read sets. You can add tags to read sets and use IAM policies to control access to them. Aligned read sets require a reference genome; for unaligned read sets, a reference genome is optional.

To store read sets, you first create a sequence store. When you create a sequence store, you can specify an optional Amazon S3 fallback location, as well as the location where S3 access logs are stored. HealthOmics uses the fallback location to store any files that fail to create a read set during a direct upload. Fallback locations are available for sequence stores created after May 15, 2023. 

You can specify up to five read set tag keys. When you create or update a read set with a tag key that matches one of these keys, the read set tags are propagated to the corresponding Amazon S3 object. System tags created by HealthOmics are propagated by default. 

**Topics**
+ [Creating a sequence store using the console](#console-create-sequence-store)
+ [Creating a sequence store using the CLI](#api-create-sequence-store)
+ [Updating a sequence store](#update-sequence-store)
+ [Updating read set tags for a sequence store](#sequence-store-manage-tags)
+ [Importing genomic files](#import-genomic-files)

## Creating a sequence store using the console
<a name="console-create-sequence-store"></a>

**To create a sequence store**

1. Open the [HealthOmics console](https://console.aws.amazon.com/omics/).

1.  If required, open the left navigation pane (≡). Choose **Sequence stores**.

1. On the **Create sequence store** page, provide the following information:
   + **Sequence store name** - A unique name for this store. 
   + **Description** (optional) - A description of this sequence store.

1. For **Fallback location in S3**, specify an Amazon S3 location. HealthOmics uses the fallback location for storing any files that fail to create a read set during a direct upload. You need to grant the HealthOmics service write access to the Amazon S3 fallback location. For an example policy, see [Configure a fallback location](synchronous-uploads.md#synchronous-uploads-fallback).

   Fallback locations aren't available for sequence stores created before May 16, 2023. 

1. (Optional) For **Read set tag keys for S3 propagation**, you can enter up to five read set tag keys to propagate from a read set to the underlying S3 objects. By propagating tags from a read set to the S3 object, you can grant S3 access permissions based on tags, and end users can view the propagated tags through the Amazon S3 GetObjectTagging API operation. 

   1. Enter one key value in the text box. The console creates a new text box to add the next key.

   1. (Optional) Choose **Remove** to remove all the keys.

1. Under **Data Encryption**, select whether you want data encryption to be owned and managed by AWS or to use a customer managed KMS key. 

1. (Optional) Under **S3 Data access**, select whether to create a new role and policy to access the sequence store through Amazon S3.

1. (Optional) For **S3 access logging**, select `Enabled` if you want Amazon S3 to collect access log records.

   For **Access logging location in S3**, specify an Amazon S3 location to store the logs. This field is visible only if you enabled S3 access logging.

1. **Tags** (optional) - Provide up to 50 tags for this sequence store. These tags are separate from the read set tags that are set during read set import or tag updates.

After you create the store, it's ready for [Importing genomic files](#import-genomic-files).

## Creating a sequence store using the CLI
<a name="api-create-sequence-store"></a>

In the following example, replace `sequence store name` with the name you chose for your sequence store.

```
aws omics create-sequence-store --name sequence store name --fallback-location "s3://amzn-s3-demo-bucket"  
```

You receive the following response in JSON, which includes the ID number for your newly created sequence store.

```
{
    "id": "3936421177",
    "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/3936421177",
    "name": "sequence_store_example_name",
    "creationTime": "2022-07-13T20:09:26.038Z",
    "fallbackLocation": "s3://amzn-s3-demo-bucket"
}
```

You can also view all sequence stores associated with your account by using the **list-sequence-stores** command, as shown in the following.

```
aws omics list-sequence-stores
```

You receive the following response.

```
{
    "sequenceStores": [
        {
            "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/3936421177",
            "id": "3936421177",
            "name": "MySequenceStore",
            "creationTime": "2022-07-13T20:09:26.038Z",
            "updatedTime": "2024-09-13T04:11:31.242Z",
            "fallbackLocation" : "s3://amzn-s3-demo-bucket",
            "status": "Active"
        }
    ]
}
```

You can use **get-sequence-store** to learn more about a sequence store by using its ID, as shown in the following example:

```
aws omics get-sequence-store --id sequence store ID                             
```

You receive the following response:

```
{
  "arn": "arn:aws:omics:us-west-2:123456789012:sequenceStore/sequencestoreID",
  "creationTime": "2024-01-12T04:45:29.857Z",
  "updatedTime": "2024-09-13T04:11:31.242Z",
  "description": null,
  "fallbackLocation": null,
  "id": "2015356892",
  "name": "MySequenceStore",
  "s3Access": {
      "s3AccessPointArn": "arn:aws:s3:us-west-2:123456789012:accesspoint/592761533288-2015356892",
      "s3Uri": "s3://592761533288-2015356892-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/",
      "accessLogLocation": "s3://iad-seq-store-log/2015356892/"
  },
  "sseConfig": {
      "keyArn": "arn:aws:kms:us-west-2:123456789012:key/eb2b30f5-635d-4b6d-b0f9-d3889fe0e648",
      "type": "KMS"
  },
  "status": "Active",
  "statusMessage": null,
  "setTagsToSync": ["withdrawn", "protocol"]
}
```

After creation, you can update several store parameters, either through the console or with the `UpdateSequenceStore` API operation.

## Updating a sequence store
<a name="update-sequence-store"></a>

To update a sequence store, follow these steps:

1. Open the [HealthOmics console](https://console.aws.amazon.com/omics/).

1.  If required, open the left navigation pane (≡). Choose **Sequence stores**.

1. Choose the sequence store to update.

1. In the **Details** panel, choose **Edit**.

1. On the **Edit details** page, you can update the following fields:
   + **Sequence store name** - A unique name for this store. 
   + **Description** - A description of this sequence store.
   + **Fallback location in S3** - An Amazon S3 location. HealthOmics uses the fallback location for storing any files that fail to create a read set during a direct upload. 
   + **Read set tag keys for S3 propagation** - Up to five read set tag keys to propagate to Amazon S3.
   + (Optional) For **S3 access logging**, select `Enabled` if you want Amazon S3 to collect access log records.

     For **Access logging location in S3**, specify an Amazon S3 location to store the logs. This field is visible only if you enabled S3 access logging.
   + **Tags** (optional) - Provide up to 50 tags for this sequence store.

## Updating read set tags for a sequence store
<a name="sequence-store-manage-tags"></a>

To update read set tags or other fields for a sequence store, follow these steps:

1. Open the [HealthOmics console](https://console.aws.amazon.com/omics/).

1.  If required, open the left navigation pane (≡). Choose **Sequence stores**.

1. Choose the sequence store that you want to update.

1. Choose the **Details** tab.

1. Choose **Edit**.

1. Add new read set tags or delete existing tags, as required.

1. Update the name, description, fallback location, or S3 data access, as required.

1. Choose **Save changes**.

## Importing genomic files
<a name="import-genomic-files"></a>

To import genomic files to a sequence store, follow these steps:

**To import a genomics file**

1. Open the [HealthOmics console](https://console.aws.amazon.com/omics/).

1.  If required, open the left navigation pane (≡). Choose **Sequence stores**.

1. On the **Sequence stores** page, choose the sequence store that you want to import your files into.

1. On the individual sequence store page, choose **Import genomic files**.

1. On the **Specify import details** page, provide the following information:
   + **IAM role** - The IAM role that can access the genomic files on Amazon S3.
   + **Reference genome** - The reference genome for this genomics data.

1. On the **Specify import manifest** page, provide the **Manifest file**. The manifest file is a JSON or YAML file that describes essential information about your genomics data. For information about the manifest file, see [Importing read sets into a HealthOmics sequence store](import-sequence-store.md).

1. Choose **Create import job**.

# Deleting HealthOmics reference and sequence stores
<a name="deleting-reference-and-sequence-stores"></a>

Both reference and sequence stores can be deleted. Sequence stores can only be deleted if they don't contain read sets, and reference stores can only be deleted if they don't contain references. Deleting a sequence or reference store also deletes any tags associated with that store.

The following example shows how to delete a reference store by using the AWS CLI. If the action is successful, you won't receive a response. In the following example, replace `reference store ID` with your reference store ID.

```
aws omics delete-reference-store --id reference store ID              
```

The following example shows you how to delete a sequence store. You don't receive a response if the action succeeds. In the following example, replace `sequence store ID` with your sequence store ID.

```
aws omics delete-sequence-store --id sequence store ID            
```

You can also delete a reference in a reference store as shown in the following example. References can only be deleted if they aren't being used in a read set, variant store, or annotation store. In the following example, replace `reference store ID` with your reference store ID, and replace `reference ID` with the ID for the reference you want to delete.

```
aws omics delete-reference  --id reference ID --reference-store-id reference store ID          
```

# Importing read sets into a HealthOmics sequence store
<a name="import-sequence-store"></a>

After you create your sequence store, create import jobs to upload read sets into the data store. You can upload your files from an Amazon S3 bucket, or you can upload directly by using the synchronous API operations. Your Amazon S3 bucket must be in the same Region as your sequence store.

You can upload any combination of aligned and unaligned read sets into your sequence store. However, if any of the read sets in your import are aligned, you must include a reference genome.
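Before starting an import job, you can run a local pre-flight check of your manifest sources against this rule. The following sketch is our own helper (not an AWS API) and uses the field names from the manifest examples later in this topic:

```python
# Aligned source file types, which require a linked reference genome.
ALIGNED_TYPES = {"BAM", "CRAM"}

def validate_sources(sources: list) -> list:
    """Return a list of problems found in the manifest's sources: each
    aligned source (BAM or CRAM) must include a referenceArn."""
    problems = []
    for src in sources:
        if src.get("sourceFileType") in ALIGNED_TYPES and not src.get("referenceArn"):
            name = src.get("name", "<unnamed>")
            problems.append(f"{name}: aligned source requires a referenceArn")
    return problems
```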

You can reuse the IAM access policy that you used to create the reference store. 

The following topics describe the major steps you follow to import a read set into your sequence store and then get information about the imported data. 

**Topics**
+ [Upload files to Amazon S3](#upload-files-to-s3)
+ [Creating a manifest file](#create-manifest-file)
+ [Starting the import job](#start-import-job)
+ [Monitor the import job](#monitor-import-job)
+ [Find the imported sequence files](#list-read-sets)
+ [Get details about a read set](#get-read-set-metadata)
+ [Download the read set data files](#get-read-set-data)

## Upload files to Amazon S3
<a name="upload-files-to-s3"></a>

The following example shows how to move files into your Amazon S3 bucket. 

```
aws s3 cp s3://1000genomes/phase1/data/HG00100/alignment/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam s3://amzn-s3-demo-bucket
aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_1.filt.fastq.gz s3://amzn-s3-demo-bucket
aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_2.filt.fastq.gz s3://amzn-s3-demo-bucket
aws s3 cp s3://1000genomes/data/HG00096/alignment/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram s3://amzn-s3-demo-bucket 
aws s3 cp s3://gatk-test-data/wgs_ubam/NA12878_20k/NA12878_A.bam s3://amzn-s3-demo-bucket
```

The sample `BAM` and `CRAM` files used in this example require different genome references, `Hg19` and `Hg38`. To learn more or to access these references, see [The Broad Genome References](https://registry.opendata.aws/broad-references/) in the Registry of Open Data on AWS.

## Creating a manifest file
<a name="create-manifest-file"></a>

You must also create a JSON manifest file, `import.json`, that models the import job (see the following examples). If you start the import job from the console, you don't specify the `sequenceStoreId` or `roleArn`, so your manifest file starts with the `sources` input.

------
#### [ API manifest ]

The following example imports three read sets by using the API: one `FASTQ`, one `BAM`, and one `CRAM`.

```
{
  "sequenceStoreId": "3936421177",
  "roleArn": "arn:aws:iam::555555555555:role/OmicsImport",
  "sources":
  [
      {
          "sourceFiles":
          {
              "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
          },
          "sourceFileType": "BAM",
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "referenceArn": "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001",
          "name": "HG00100",
          "description": "BAM for HG00100",
          "generatedFrom": "1000 Genomes"
      },
      {
          "sourceFiles":
          {
              "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz",
              "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz"
          },
          "sourceFileType": "FASTQ",
          "subjectId": "mySubject",
          "sampleId": "mySample",
          // NOTE: there is no reference arn required here
          "name": "HG00146",
          "description": "FASTQ for HG00146",
          "generatedFrom": "1000 Genomes"
      },
      {
          "sourceFiles":
          {
              "source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram"
          },
          "sourceFileType": "CRAM",
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "referenceArn": "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001",
          "name": "HG00096",
          "description": "CRAM for HG00096",
          "generatedFrom": "1000 Genomes"
      },
      {
          "sourceFiles":
          {
              "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam"
          },
          "sourceFileType": "UBAM",
          "subjectId": "mySubject",
          "sampleId": "mySample",
          // NOTE: there is no reference arn required here
          "name": "NA12878_A",
          "description": "uBAM for NA12878",
          "generatedFrom": "GATK Test Data"
      }
  ]
}
```

------
#### [ Console manifest ]

This example code is used to import a single read set by using the console.

```
[    
  {
      "sourceFiles":
      {
          "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
      },
      "sourceFileType": "BAM",
      "subjectId": "mySubject",
      "sampleId": "mySample",
      "name": "HG00100",
      "description": "BAM for HG00100",
      "generatedFrom": "1000 Genomes"
  },
  {
      "sourceFiles":
      {
          "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz",
          "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz"
      },
      "sourceFileType": "FASTQ",
      "subjectId": "mySubject",
      "sampleId": "mySample",
      "name": "HG00146",
      "description": "FASTQ for HG00146",
      "generatedFrom": "1000 Genomes"
  },
  {
      "sourceFiles":
      {
          "source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram"
      },
      "sourceFileType": "CRAM",
      "subjectId": "mySubject",
      "sampleId": "mySample",
      "name": "HG00096",
      "description": "CRAM for HG00096",
      "generatedFrom": "1000 Genomes"
  },
  {
      "sourceFiles":
      {
          "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam"
      },
      "sourceFileType": "UBAM",
      "subjectId": "mySubject",
      "sampleId": "mySample",
      "name": "NA12878_A",
      "description": "uBAM for NA12878",
      "generatedFrom": "GATK Test Data"
  }
]
```

------

Alternatively, you can upload the manifest file in YAML format.

## Starting the import job
<a name="start-import-job"></a>

To start the import job, use the following AWS CLI command.

```
aws omics start-read-set-import-job --cli-input-json file://import.json      
```

You receive the following response, which indicates successful job creation.

```
{
  "id": "3660451514",
  "sequenceStoreId": "3936421177",
  "roleArn": "arn:aws:iam::111122223333:role/OmicsImport",
  "status": "CREATED",
  "creationTime": "2022-07-13T22:14:59.309Z"
}
```

## Monitor the import job
<a name="monitor-import-job"></a>

After the import job starts, you can monitor its progress with the following command. In the following example, replace `sequence store id` with your sequence store ID, and replace `import job ID` with the import job ID.

```
aws omics get-read-set-import-job --sequence-store-id sequence store id --id import job ID
```

The following response shows the status of the import job, including the status of each source in the job.

```
{
  "id": "1234567890",
  "sequenceStoreId": "1234567890",
  "roleArn": "arn:aws:iam::111122223333:role/OmicsImport",
  "status": "RUNNING",
  "statusMessage": "The job is currently in progress.",
  "creationTime": "2022-07-13T22:14:59.309Z",
  "sources": [    
      {
          "sourceFiles":
          {
              "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
          },
          "sourceFileType": "BAM",
          "status": "IN_PROGRESS",
          "statusMessage": "The job is currently in progress."
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/8625408453",
          "name": "HG00100",
          "description": "BAM for HG00100",
          "generatedFrom": "1000 Genomes",
          "readSetID": "1234567890"
      },
      {
          "sourceFiles":
          {
              "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz",
              "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz"
          },
          "sourceFileType": "FASTQ",
          "status": "IN_PROGRESS",
          "statusMessage": "The job is currently in progress."
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "name": "HG00146",
          "description": "FASTQ for HG00146",
          "generatedFrom": "1000 Genomes",
          "readSetID": "1234567890"
      },
      {
          "sourceFiles":
          {
              "source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram"
          },
          "sourceFileType": "CRAM",
          "status": "IN_PROGRESS",
          "statusMessage": "The job is currently in progress."
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/1234568870",
          "name": "HG00096",
          "description": "CRAM for HG00096",
          "generatedFrom": "1000 Genomes",
          "readSetID": "1234567890"
      },
      {
          "sourceFiles":
          {
              "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam"
          },
          "sourceFileType": "UBAM",
          "status": "IN_PROGRESS",
          "statusMessage": "The job is currently in progress."
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "name": "NA12878_A",
          "description": "uBAM for NA12878",
          "generatedFrom": "GATK Test Data",
          "readSetID": "1234567890"
      }
  ]
}
```
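
For large imports, each entry in `sources` reports its own status, so a short script can summarize the job at a glance. The following is a minimal sketch using only the Python standard library; the trimmed `response` dict stands in for the full CLI output, which you would parse from the command above.

```
def summarize_sources(response):
    """Count import job sources by status (for example, IN_PROGRESS or FINISHED)."""
    counts = {}
    for source in response.get("sources", []):
        status = source.get("status", "UNKNOWN")
        counts[status] = counts.get(status, 0) + 1
    return counts

# Trimmed response; in practice, load the full JSON that the CLI prints.
response = {
    "status": "RUNNING",
    "sources": [
        {"name": "HG00100", "status": "IN_PROGRESS"},
        {"name": "HG00146", "status": "IN_PROGRESS"},
    ],
}
print(summarize_sources(response))  # {'IN_PROGRESS': 2}
```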

## Find the imported sequence files
<a name="list-read-sets"></a>

After the job completes, you can use the **list-read-sets** API operation to find the imported sequence files. In the following example, replace `sequence store id` with your sequence store ID.

```
aws omics list-read-sets --sequence-store-id sequence store id
```

You receive the following response.

```
{
  "readSets": [
      {
          "id": "0000000001",
          "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/01234567890/readSet/0000000001",
          "sequenceStoreId": "1234567890",
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "status": "ACTIVE",
          "name": "HG00100",
          "description": "BAM for HG00100",
          "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/01234567890/reference/0000000001",
          "fileType": "BAM",
          "sequenceInformation": {
              "totalReadCount": 9194,
              "totalBaseCount": 928594,
              "generatedFrom": "1000 Genomes",
              "alignment": "ALIGNED"
          },
          "creationTime": "2022-07-13T23:25:20Z"
          "creationType": "IMPORT", 
          "etag": {
              "algorithm": "BAM_MD5up",
              "source1": "d1d65429212d61d115bb19f510d4bd02"
          }
      },
      {
          "id": "0000000002",
          "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000002",
          "sequenceStoreId": "0123456789",
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "status": "ACTIVE",
          "name": "HG00146",
          "description": "FASTQ for HG00146",
          "fileType": "FASTQ",
          "sequenceInformation": {
              "totalReadCount": 8000000,
              "totalBaseCount": 1184000000,
              "generatedFrom": "1000 Genomes",
              "alignment": "UNALIGNED"
          },
          "creationTime": "2022-07-13T23:26:43Z"
          "creationType": "IMPORT",
          "etag": {
              "algorithm": "FASTQ_MD5up",
              "source1": "ca78f685c26e7cc2bf3e28e3ec4d49cd"
          }
      },
      {
          "id": "0000000003",
          "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000003",
          "sequenceStoreId": "0123456789",
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "status": "ACTIVE",
          "name": "HG00096",
          "description": "CRAM for HG00096",
          "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/0123456789/reference/0000000001",
          "fileType": "CRAM",
          "sequenceInformation": {
              "totalReadCount": 85466534,
              "totalBaseCount": 24000004881,
              "generatedFrom": "1000 Genomes",
              "alignment": "ALIGNED"
          },
          "creationTime": "2022-07-13T23:30:41Z"
          "creationType": "IMPORT",
          "etag": {
              "algorithm": "CRAM_MD5up",
              "source1": "66817940f3025a760e6da4652f3e927e"
          }
      },
      {
          "id": "0000000004",
          "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000004",
          "sequenceStoreId": "0123456789",
          "subjectId": "mySubject",
          "sampleId": "mySample",
          "status": "ACTIVE",
          "name": "NA12878_A",
          "description": "uBAM for NA12878",
          "fileType": "UBAM",
          "sequenceInformation": {
              "totalReadCount": 20000,
              "totalBaseCount": 5000000,
              "generatedFrom": "GATK Test Data",
              "alignment": "ALIGNED"
          },
          "creationTime": "2022-07-13T23:30:41Z"
          "creationType": "IMPORT",
          "etag": {
              "algorithm": "BAM_MD5up",
              "source1": "640eb686263e9f63bcda12c35b84f5c7"
          }
      }
  ]
}
```
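
Because reimporting the same file produces the same HealthOmics ETag, the `etag` field in this response is useful for provenance checks. The following sketch collects the `source1` ETags of `ACTIVE` read sets from a **list-read-sets** response; the `sample` dict is trimmed from the output above.

```
def etags_by_name(list_response):
    """Map read set name to its source1 HealthOmics ETag, for ACTIVE read sets only."""
    return {
        rs["name"]: rs["etag"]["source1"]
        for rs in list_response.get("readSets", [])
        if rs.get("status") == "ACTIVE" and "etag" in rs
    }

# Trimmed response; in practice, parse the output of list-read-sets.
sample = {
    "readSets": [
        {"name": "HG00100", "status": "ACTIVE",
         "etag": {"algorithm": "BAM_MD5up", "source1": "d1d65429212d61d115bb19f510d4bd02"}},
        {"name": "HG00146", "status": "ARCHIVED",
         "etag": {"algorithm": "FASTQ_MD5up", "source1": "ca78f685c26e7cc2bf3e28e3ec4d49cd"}},
    ]
}
print(etags_by_name(sample))  # {'HG00100': 'd1d65429212d61d115bb19f510d4bd02'}
```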

## Get details about a read set
<a name="get-read-set-metadata"></a>

To view more details about a read set, use the **GetReadSetMetadata** API operation. In the following example, replace `sequence store id` with your sequence store ID, and replace `read set id` with your read set ID.

```
aws omics get-read-set-metadata --sequence-store-id sequence store id --id read set id     
```

You receive the following response.

```
{
"arn": "arn:aws:omics:us-west-2:123456789012:sequenceStore/2015356892/readSet/9515444019",
"creationTime": "2024-01-12T04:50:33.548Z",
"creationType": "IMPORT",
"creationJobId": "33222111",
"description": null,
"etag": {
  "algorithm": "FASTQ_MD5up",
  "source1": "00d0885ba3eeb211c8c84520d3fa26ec",
  "source2": "00d0885ba3eeb211c8c84520d3fa26ec"
},
"fileType": "FASTQ",
"files": {
  "index": null,
  "source1": {
    "contentLength": 10818,
    "partSize": 104857600,
    "s3Access": {
      "s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz"
    },
    "totalParts": 1
  },
  "source2": {
    "contentLength": 10818,
    "partSize": 104857600,
    "s3Access": {        
      "s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz"
    },
    "totalParts": 1
  }
},
"id": "9515444019",
"name": "paired-fastq-import",
"sampleId": "sampleId-paired-fastq-import",
"sequenceInformation": {
  "alignment": "UNALIGNED",
  "generatedFrom": null,
  "totalBaseCount": 30000,
  "totalReadCount": 200
},
"sequenceStoreId": "2015356892",
"status": "ACTIVE",
"statusMessage": null,
"subjectId": "subjectId-paired-fastq-import"
}
```
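
In the `files` section, `totalParts` is consistent with `contentLength` and `partSize`. Assuming every part except the last is full size, you can sanity-check the relationship locally:

```
import math

def expected_parts(content_length, part_size):
    """Number of parts needed to cover content_length bytes at part_size bytes per part."""
    return math.ceil(content_length / part_size)

# Values from the example response above: a 10,818-byte source with a 100 MiB part size.
print(expected_parts(10818, 104857600))          # 1
print(expected_parts(250 * 2**20, 100 * 2**20))  # 3
```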

## Download the read set data files
<a name="get-read-set-data"></a>

You can access the objects for an active read set using the Amazon S3 **GetObject** API operation. The URI for the object is returned in the **GetReadSetMetadata** API response. For more information, see [Accessing HealthOmics read sets with Amazon S3 URIs](s3-access.md).

Alternatively, use the HealthOmics **GetReadSet** API operation. You can use **GetReadSet** to download in parallel by downloading individual parts. These parts are similar to Amazon S3 parts. The following is an example of how to download part 1 from a read set. In the following example, replace `sequence store id` with your sequence store ID, and replace `read set id` with your read set ID.

```
aws omics get-read-set --sequence-store-id sequence store id --id read set id  --part-number 1 outfile.bam  
```
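
If you download parts in parallel, you reassemble them in part-number order afterward. The following is a local sketch; the `outfile.partN` naming is illustrative (the CLI writes to whatever output file name you pass), and the throwaway demo files stand in for downloaded parts.

```
from pathlib import Path

def reassemble(parts, out_path):
    """Concatenate downloaded part files in ascending part-number order."""
    ordered = sorted(parts, key=lambda p: int(p.suffix.removeprefix(".part")))
    with open(out_path, "wb") as out:
        for part in ordered:
            out.write(part.read_bytes())

# Demo with throwaway files standing in for parts downloaded with get-read-set.
demo = Path("demo_readset")
demo.mkdir(exist_ok=True)
for n, payload in [(2, b"-part2"), (1, b"part1")]:
    (demo / f"outfile.part{n}").write_bytes(payload)
reassemble(list(demo.glob("outfile.part*")), demo / "outfile.bam")
print((demo / "outfile.bam").read_bytes())  # b'part1-part2'
```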

You can also use the HealthOmics Transfer Manager to download files for a HealthOmics reference or read set. The Transfer Manager is available [on PyPI](https://pypi.org/project/amazon-omics-tools/). For more information about setting up and using the Transfer Manager, see the [GitHub repository](https://github.com/awslabs/amazon-omics-tools/).

# Direct upload to a HealthOmics sequence store
<a name="synchronous-uploads"></a>

We recommend that you use the HealthOmics Transfer Manager to add files to your sequence store. For more information about using Transfer Manager, see this [GitHub Repository](https://github.com/awslabs/amazon-omics-tools/). You can also upload your read sets directly to a sequence store through the direct upload API operations. 

A read set created through direct upload starts in the `PROCESSING_UPLOAD` state. While the file parts are being uploaded, you can access the read set's metadata. After the parts are uploaded and the checksums are validated, the read set becomes `ACTIVE` and behaves the same as an imported read set. 

If the direct upload fails, the read set status is shown as `UPLOAD_FAILED`. You can configure an Amazon S3 bucket as a fallback location for files that fail to upload. Fallback locations are available for sequence stores created after May 15, 2023.

**Topics**
+ [Direct upload to a sequence store using the AWS CLI](#synchronous-uploads-api)
+ [Configure a fallback location](#synchronous-uploads-fallback)

## Direct upload to a sequence store using the AWS CLI
<a name="synchronous-uploads-api"></a>

To begin, start a multipart upload. You can do this by using the AWS CLI, as shown in the following example.

**To direct upload using AWS CLI commands**

1. Split your source file into parts, as shown in the following example.

   ```
    split -b 100MiB SRR233106_1.filt.fastq.gz source1_part_ 
   ```

1. After your source files are in parts, create a multipart read set upload, as shown in the following example. Replace `sequence store ID` and the other parameters with your sequence store ID and other values.

   ```
   aws omics create-multipart-read-set-upload \
   --sequence-store-id sequence store ID \
   --name upload name \
   --source-file-type FASTQ \
   --subject-id subject ID \
   --sample-id sample ID \
   --description "FASTQ for HG00146" "description of upload" \
   --generated-from "1000 Genomes""source of imported files"
   ```

   You get the `uploadID` and other metadata in the response. Use the `uploadID` for the next step of the upload process.

   ```
   {
   "sequenceStoreId": "1504776472",
   "uploadId": "7640892890",
   "sourceFileType": "FASTQ",
   "subjectId": "mySubject",
   "sampleId": "mySample",
   "generatedFrom": "1000 Genomes",
   "name": "HG00146",
   "description": "FASTQ for HG00146",
   "creationTime": "2023-11-20T23:40:47.437522+00:00"
   }
   ```

1. Upload the parts of your read set. If your file fits in a single part, you only have to perform this step once. For larger files, perform this step for each part of your file. If you upload a new part by using a previously used part number, it overwrites the previously uploaded part.

   In the following example, replace `sequence store ID`, `upload ID`, and the other parameters with your values.

   ```
   aws omics upload-read-set-part \
   --sequence-store-id sequence store ID \
   --upload-id upload ID \
   --part-source SOURCE1 \
   --part-number part number \
   --payload  source1/source1_part_aa.fastq.gz
   ```

   The response is a checksum that you can use to verify that the uploaded part matches the file part you intended to upload.

   ```
   {
   "checksum": "984979b9928ae8d8622286c4a9cd8e99d964a22d59ed0f5722e1733eb280e635"
   }
   ```

1. Continue uploading the parts of your file, as necessary. To verify that the parts have been uploaded, use the **list-read-set-upload-parts** API operation, as shown in the following example. Replace `sequence store ID`, `upload ID`, and `part source` with your own input.

   ```
   aws omics list-read-set-upload-parts \
    --sequence-store-id sequence store ID \
    --upload-id upload ID \
    --part-source SOURCE1
   ```

   The response lists each uploaded part, including its part number, size, checksum, and creation and last-updated timestamps.

   ```
   {
   "parts": [
       {
           "partNumber": 1,
           "partSize": 104857600,
           "partSource": "SOURCE1",
           "checksum": "MVMQk+vB9C3Ge8ADHkbKq752n3BCUzyl41qEkqlOD5M=",
           "creationTime": "2023-11-20T23:58:03.500823+00:00",
           "lastUpdatedTime": "2023-11-20T23:58:03.500831+00:00"
       },
       {
           "partNumber": 2,
           "partSize": 104857600,
           "partSource": "SOURCE1",
           "checksum": "keZzVzJNChAqgOdZMvOmjBwrOPM0enPj1UAfs0nvRto=",
           "creationTime": "2023-11-21T00:02:03.813013+00:00",
           "lastUpdatedTime": "2023-11-21T00:02:03.813025+00:00"
       },
       {
           "partNumber": 3,
           "partSize": 100339539,
           "partSource": "SOURCE1",
           "checksum": "TBkNfMsaeDpXzEf3ldlbi0ipFDPaohKHyZ+LF1J4CHk=",
           "creationTime": "2023-11-21T00:09:11.705198+00:00",
           "lastUpdatedTime": "2023-11-21T00:09:11.705208+00:00"
       }
   ]
   }
   ```

1. To view all active multipart read set uploads, use **list-multipart-read-set-uploads**, as shown in the following example. Replace `sequence store ID` with the ID of your own sequence store.

   ```
   aws omics list-multipart-read-set-uploads --sequence-store-id sequence store ID
   ```

   This API operation only returns multipart read set uploads that are in progress. After the ingested read sets become `ACTIVE`, or if the upload has failed, the upload isn't returned in the response. To view active read sets, use the **list-read-sets** API operation. The following is an example response for **list-multipart-read-set-uploads**. 

   ```
   {
   "uploads": [
       {
           "sequenceStoreId": "1234567890",
           "uploadId": "8749584421",
           "sourceFileType": "FASTQ",
           "subjectId": "mySubject",
           "sampleId": "mySample",
           "generatedFrom": "1000 Genomes",
           "name": "HG00146",
           "description": "FASTQ for HG00146",
           "creationTime": "2023-11-29T19:22:51.349298+00:00"
       },
       {
           "sequenceStoreId": "1234567890",
           "uploadId": "5290538638",
           "sourceFileType": "BAM",
           "subjectId": "mySubject",
           "sampleId": "mySample",
           "generatedFrom": "1000 Genomes",
           "referenceArn": "arn:aws:omics:us-west-2:123456789012:referenceStore/8168613728/reference/2190697383",
           "name": "HG00146",
           "description": "BAM for HG00146",
           "creationTime": "2023-11-29T19:23:33.116516+00:00"
       },
       {
           "sequenceStoreId": "1234567890",
           "uploadId": "4174220862",
           "sourceFileType": "BAM",
           "subjectId": "mySubject",
           "sampleId": "mySample",
           "generatedFrom": "1000 Genomes",
           "referenceArn": "arn:aws:omics:us-west-2:123456789012:referenceStore/8168613728/reference/2190697383",
           "name": "HG00147",
           "description": "BAM for HG00147",
           "creationTime": "2023-11-29T19:23:47.007866+00:00"
       }
   ]
   }
   ```

1. After you upload all parts of your file, use **complete-multipart-read-set-upload** to conclude the upload process, as shown in the following example. Replace `sequence store ID`, `upload ID`, and the parameter for parts with your own values.

   ```
   aws omics complete-multipart-read-set-upload \
   --sequence-store-id sequence store ID \
   --upload-id upload ID \
   --parts '[{"checksum":"gaCBQMe+rpCFZxLpoP6gydBoXaKKDA/Vobh5zBDb4W4=","partNumber":1,"partSource":"SOURCE1"}]'
   ```

   The response for **complete-multipart-read-set-upload** is the read set ID for your new read set. 

   ```
   {
   "readSetId": "0000000001"
   }
   ```

1. To stop an in-progress upload, use **abort-multipart-read-set-upload** with the upload ID. Replace `sequence store ID` and `upload ID` with your own parameter values.

   ```
   aws omics abort-multipart-read-set-upload \
   --sequence-store-id sequence store ID \
   --upload-id upload ID
   ```

1. After the upload is complete, retrieve your data from the read set by using **get-read-set**, as shown in the following. If the upload is still processing, **get-read-set** returns limited metadata, and the generated index files are unavailable. Replace `sequence store ID` and the other parameters with your own input.

   ```
   aws omics get-read-set \
    --sequence-store-id sequence store ID \
    --id read set ID \
    --file SOURCE1 \
    --part-number 1 myfile.fastq.gz
   ```

1. To check the metadata, including the status of your upload, use the **get-read-set-metadata** API operation.

   ```
   aws omics get-read-set-metadata --sequence-store-id sequence store ID --id read set ID    
   ```

   The response includes metadata details such as the file type, the reference ARN, the number of files, and the length of the sequences. It also includes the status. Possible statuses are `PROCESSING_UPLOAD`, `ACTIVE`, and `UPLOAD_FAILED`.

   ```
   {
   "id": "0000000001",
   "arn": "arn:aws:omics:us-west-2:555555555555:sequenceStore/0123456789/readSet/0000000001",
   "sequenceStoreId": "0123456789",
   "subjectId": "mySubject",
   "sampleId": "mySample",
   "status": "PROCESSING_UPLOAD",
   "name": "HG00146",
   "description": "FASTQ for HG00146",
   "fileType": "FASTQ",
   "creationTime": "2022-07-13T23:25:20Z",
   "files": {
       "source1": {
           "totalParts": 5,
           "partSize": 123456789012,
           "contentLength": 6836725,
   
       },
       "source2": {
           "totalParts": 5,
           "partSize": 123456789056,
           "contentLength": 6836726
       }
   },
    "creationType": "UPLOAD"
   }
   ```
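
Note that **upload-read-set-part** returns a hex-encoded checksum, while **list-read-set-upload-parts** reports a base64-encoded one; both appear to be SHA-256 digests of the part's bytes (treat this as an assumption and verify it against your own uploads). The following sketch computes both forms locally and assembles the JSON for the `--parts` option of **complete-multipart-read-set-upload**.

```
import base64
import hashlib
import json

def part_checksums(payloads, part_source="SOURCE1"):
    """SHA-256 each part and return both encodings seen in the API responses."""
    parts = []
    for number, payload in enumerate(payloads, start=1):
        digest = hashlib.sha256(payload).digest()
        parts.append({
            "partNumber": number,
            "partSource": part_source,
            "hex": digest.hex(),                           # form returned by upload-read-set-part
            "checksum": base64.b64encode(digest).decode(), # form returned by list-read-set-upload-parts
        })
    return parts

parts = part_checksums([b"first part bytes", b"second part bytes"])
# Both encodings wrap the same 32-byte digest:
print(base64.b64decode(parts[0]["checksum"]).hex() == parts[0]["hex"])  # True
# JSON literal for the --parts option of complete-multipart-read-set-upload:
print(json.dumps([{k: p[k] for k in ("checksum", "partNumber", "partSource")} for p in parts]))
```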

## Configure a fallback location
<a name="synchronous-uploads-fallback"></a>

When you create or update a sequence store, you can configure an Amazon S3 bucket as the fallback location for files that fail to upload. The file parts for those read sets are transferred to the fallback location. Fallback locations are available for sequence stores created after May 15, 2023. 

Create an Amazon S3 bucket policy to grant HealthOmics write access to the Amazon S3 fallback location, as shown in the following example:

```
{
    "Effect": "Allow",
    "Principal": {
        "Service": "omics.amazonaws.com"
    },
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
}
```
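
This statement belongs inside a complete bucket policy document. The following is a minimal full policy, with `amzn-s3-demo-bucket` as a placeholder for your fallback bucket:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "omics.amazonaws.com"
            },
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
        }
    ]
}
```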

If the Amazon S3 bucket for fallback or access logs uses a customer managed key, add the following permissions to the key policy:

```
 {
    "Sid": "Allow use of key",
    "Effect": "Allow",
    "Principal": {
        "Service": "omics.amazonaws.com"
    },
    "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey*"
    ],
    "Resource": "*"
}
```

# Exporting HealthOmics read sets to an Amazon S3 bucket
<a name="read-set-exports"></a>

You can export read sets as a batch export job to an Amazon S3 bucket. To do so, first create an IAM role that HealthOmics can assume, with a permissions policy that grants write access to the bucket and a trust policy that allows the HealthOmics service to assume the role, similar to the following examples. 

------
#### [ Permissions policy ]


```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket1",
        "arn:aws:s3:::amzn-s3-demo-bucket1/*"
      ]
    }
  ]
}
```

------

------
#### [ Trust policy ]


```
{
"Version":"2012-10-17",		 	 	 
"Statement": [
  {
      "Effect": "Allow",
      "Principal": {
          "Service": [
              "omics.amazonaws.com"
          ]
      },
      "Action": "sts:AssumeRole"
  }
]
}
```

------

After the IAM role and policies are in place, begin your read set export job. The following example shows you how to do this by using the **start-read-set-export-job** API operation. Replace all parameters, such as `sequence store ID`, `destination`, `role ARN`, and `sources`, with your input.

```
aws omics start-read-set-export-job \
--sequence-store-id sequence store id \
--destination valid s3 uri \
--role-arn role ARN \
--sources readSetId=read set id_1 readSetId=read set id_2
```

You receive the following response with information on the origin sequence store and the destination Amazon S3 bucket. 

```
{
"id": <job-id>,
"sequenceStoreId": <sequence-store-id>,
"destination": <destination-s3-uri>,
"status": "SUBMITTED",
"creationTime": "2022-10-22T01:33:38.079000+00:00"
}
```

After the job starts, you can determine its status by using the **get-read-set-export-job** API operation, as shown in the following example. Replace `sequence store ID` and `job ID` with your sequence store ID and job ID, respectively. 

```
aws omics get-read-set-export-job --id job-id --sequence-store-id sequence store ID
```

You can view all export jobs for a sequence store by using the **list-read-set-export-jobs** API operation, as shown in the following example. Replace `sequence store ID` with your sequence store ID.

```
aws omics list-read-set-export-jobs --sequence-store-id sequence store ID
```

```
{
"exportJobs": [
  {
      "id": <job-id>,
      "sequenceStoreId": <sequence-store-id>,
      "destination": <destination-s3-uri>,
      "status": "COMPLETED",
      "creationTime": "2022-10-22T01:33:38.079000+00:00",
      "completionTime": "2022-10-22T01:34:28.941000+00:00"
  }
]
}
```

In addition to exporting your read sets, you can also share them by using Amazon S3 access URIs. To learn more, see [Accessing HealthOmics read sets with Amazon S3 URIs](s3-access.md). 

# Accessing HealthOmics read sets with Amazon S3 URIs
<a name="s3-access"></a>

You can use Amazon S3 URI paths to access your active sequence store read sets. 

With the Amazon S3 URI path, you can use Amazon S3 API operations to list, share, and download your read sets. Access through the Amazon S3 API accelerates collaboration and tool integration, because many industry tools are already built to read from Amazon S3. In addition, you can share access with other accounts and provide cross-Region read access to data. 

HealthOmics doesn't support Amazon S3 URI access to archived read sets. When you activate a read set, it's restored to the same URI path each time. 

Because the Amazon S3 URIs are based on Amazon S3 access points, you can directly integrate the data in your HealthOmics stores with industry-standard tools that read Amazon S3 URIs, such as the following:
+ Visual analysis applications such as Integrative Genomics Viewer (IGV) or UCSC Genome Browser.
+ Workflow languages with Amazon S3 support, such as CWL, WDL, and Nextflow.
+ Any tool that can authenticate and read from access point Amazon S3 URIs or read presigned Amazon S3 URIs.
+ Utilities that work with Amazon S3, such as Mountpoint or CloudFront.

Mountpoint for Amazon S3 makes it possible for you to use an Amazon S3 bucket as a local file system. To learn more about Mountpoint and to install it, see [Mountpoint for Amazon S3](https://github.com/awslabs/mountpoint-s3).

Amazon CloudFront is a content delivery network (CDN) service built for high performance, security, and developer convenience. To learn more about using Amazon CloudFront, see [the Amazon CloudFront documentation](https://docs.aws.amazon.com/cloudfront/). To set up CloudFront with a sequence store, contact the AWS HealthOmics team. 

The data owner's AWS account is granted the `s3:GetObject`, `s3:GetObjectTagging`, and `s3:ListBucket` actions on the sequence store prefix. For a user or role in the account to access the data, create an IAM policy and attach it to that user or role. For an example policy, see [Permissions for data access using Amazon S3 URIs](s3-sharing.md).

You can use the following Amazon S3 API operations on active read sets to list and retrieve your data. You can access archived read sets through Amazon S3 URIs after you activate them.
+ [GetObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html) — Retrieves an object from Amazon S3.
+ [HeadObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadObject.html) — Retrieves metadata from an object without returning the object itself. This operation is useful if you only want an object's metadata.
+ [ListObjects and ListObjectsV2](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html) — Returns some or all (up to 1,000) of the objects in a bucket.
+ [CopyObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObject.html) — Creates a copy of an object that's already stored in Amazon S3. HealthOmics supports copying from an Amazon S3 access point, but not writing to one.

HealthOmics sequence stores maintain the semantic identity of files through ETags. Over the lifecycle of a file, the Amazon S3 ETag, which is based on bitwise identity, might change; the HealthOmics ETag remains the same. To learn more, see [HealthOmics ETags and data provenance](etags-and-provenance.md).

**Topics**
+ [Amazon S3 URI structure in HealthOmics storage](#s3-uri-structure)
+ [Using Hosted or Local IGV to access read sets](#s3-access-igv)
+ [Using Samtools or HTSlib in HealthOmics](#s3-access-Samtools)
+ [Using Mountpoint with HealthOmics](#s3-access-Mountpoint)
+ [Using CloudFront with HealthOmics](#s3-access-CloudFront)

## Amazon S3 URI structure in HealthOmics storage
<a name="s3-uri-structure"></a>

All files with Amazon S3 URIs have `omics:subjectId` and `omics:sampleId` resource tags. You can use these tags to share access by using IAM policies through a pattern such as `"s3:ExistingObjectTag/omics:subjectId": "pattern desired"`.

The file structure is as follows:

`.../account_id/sequenceStore/seq_store_id/readSet/read_set_id/files`

For files imported into sequence stores from Amazon S3, the sequence store attempts to keep the original source file name. When names conflict, the system appends read set information to make the file names unique. For example, if both FASTQ files of a read set have the same name, `sourceX` is inserted before `.fastq.gz` or `.fq.gz` to make the names unique. For a direct upload, the file names follow these patterns:
+ For FASTQ — *read_set_name*_*sourceX*.fastq.gz 
+ For uBAM/BAM/CRAM — *read_set_name*.*file extension*, with extensions of `.bam` or `.cram`. An example is `NA193948.bam`.

For BAM or CRAM read sets, index files are automatically generated during ingestion. Each index file name follows the pattern *source file name*.*index extension*, where the index extension is `.bai` for BAM or `.crai` for CRAM.
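
The index-naming rule can be captured in a small helper. The following sketch assumes the index extension is appended to the full source file name (the usual samtools convention, consistent with the pattern above):

```
# Index files generated for aligned read sets: <source file name>.<index extension>
INDEX_EXTENSIONS = {"BAM": ".bai", "CRAM": ".crai"}

def index_file_name(source_file_name, file_type):
    """Expected index file name for a BAM or CRAM source, per the pattern above."""
    if file_type not in INDEX_EXTENSIONS:
        raise ValueError(f"no index is generated for file type {file_type}")
    return source_file_name + INDEX_EXTENSIONS[file_type]

print(index_file_name("NA193948.bam", "BAM"))   # NA193948.bam.bai
print(index_file_name("HG00096.cram", "CRAM"))  # HG00096.cram.crai
```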

## Using Hosted or Local IGV to access read sets
<a name="s3-access-igv"></a>

IGV is a genome browser used to analyze BAM and CRAM files. Because it displays only a portion of the genome at a time, it requires both the file and its index. You can download and run IGV locally, and there are guides to creating an AWS-hosted IGV. The public web version of IGV isn't supported, because it requires CORS. 

Local IGV relies on your local AWS configuration to access files. Ensure that the role used in that configuration has an attached policy that grants `kms:Decrypt` and `s3:GetObject` permissions for the Amazon S3 URIs of the read sets being accessed. Then, in IGV, choose “File > Load from URL” and paste in the URIs for the source and index files. Alternatively, you can generate presigned URLs and use them in the same manner, which bypasses the AWS configuration. Note that CORS isn't supported with Amazon S3 URI access, so requests that rely on CORS aren't supported.

The example AWS-hosted IGV relies on Amazon Cognito to create the correct configurations and permissions inside the environment. Create a policy that grants `kms:Decrypt` and `s3:GetObject` permissions for the Amazon S3 URIs of the read sets being accessed, and add this policy to the role that's assigned to the Cognito user pool. Then, in IGV, choose “File > Load from URL” and enter the URIs for the source and index files. Alternatively, you can generate presigned URLs and use them in the same manner, which bypasses the AWS configuration. 

Note that the sequence store doesn't appear on the “Amazon” tab, because that tab displays only buckets that you own in the Region in which your AWS profile is configured. 

## Using Samtools or HTSlib in HealthOmics
<a name="s3-access-Samtools"></a>

HTSlib is the core library that's shared by tools such as Samtools, Rsamtools, pysam, and others. Use HTSlib version 1.20 or later for seamless support of Amazon S3 access points. For older versions of HTSlib, you can use the following workarounds:
+ Set the environment variable for the HTS Amazon S3 host with: `export HTS_S3_HOST="s3.region.amazonaws.com"`.
+ Generate a presigned URL for the files that you want to use. If a BAM or CRAM is being used, ensure that a presigned URL is generated for both the file and the index. After that, both files can be used with the libraries. 
+ Use Mountpoint to mount the sequence store or read set prefix in the same environment where you’re using HTSlib libraries. From here, the files can be accessed by using local file paths. 
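For example, the first workaround can be set up as follows. The Region and the commented samtools invocation are illustrative assumptions, not values from your account:

```
# Point older HTSlib builds at the regional S3 endpoint (us-west-2 is an assumed example).
export HTS_S3_HOST="s3.us-west-2.amazonaws.com"

# With the endpoint set, HTSlib-based tools can read the read set directly, for example:
# samtools view -H "s3://<access point alias>/<account ID>/sequenceStore/<store ID>/readSet/<read set ID>/source1.bam"
```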

## Using Mountpoint with HealthOmics
<a name="s3-access-Mountpoint"></a>

Mountpoint for Amazon S3 is a simple, high-throughput file client for [mounting an Amazon S3 bucket as a local file system](https://aws.amazon.com/blogs/storage/the-inside-story-on-mountpoint-for-amazon-s3-a-high-performance-open-source-file-client/). With Mountpoint for Amazon S3, your applications can access objects stored in Amazon S3 through file operations such as open and read. Mountpoint for Amazon S3 automatically translates these operations into Amazon S3 object API calls, giving your applications access to the elastic storage and throughput of Amazon S3 through a file interface.

Mountpoint can be installed by using [the Mountpoint installation instructions](https://github.com/awslabs/mountpoint-s3/blob/main/doc/INSTALL.md). Mountpoint uses the AWS profile that's local to the installation and works at an Amazon S3 prefix level. Ensure that the profile being used has a policy that allows the s3:GetObject, s3:ListBucket, and kms:Decrypt actions on the Amazon S3 URI prefix of the read sets or sequence store being accessed. After that, you can mount the bucket by using the following command: 

```
mount-s3 <access point ARN> <local path to mount> --prefix <prefix to sequence store or read set> --region <region>
```

## Using CloudFront with HealthOmics
<a name="s3-access-CloudFront"></a>

Amazon CloudFront is a content delivery network (CDN) service that's built for high performance, security, and developer convenience. To use CloudFront with a sequence store, the HealthOmics service team must turn on the CloudFront distribution. Work with your account team to engage the HealthOmics service team. 

# Activating read sets in HealthOmics
<a name="activating-read-sets"></a>

You can activate archived read sets with the **start-read-set-activation-job** API operation, or through the AWS CLI, as shown in the following example. Replace `sequence store ID` and the read set IDs with your own values. 

```
aws omics start-read-set-activation-job \
     --sequence-store-id sequence store ID \
     --sources readSetId=read set ID 1 readSetId=read set ID 2
```

You receive a response that contains the activation job information, as shown in the following.

```
{
    "id": "12345678",
    "sequenceStoreId": "1234567890",
    "status": "SUBMITTED",
    "creationTime": "2022-10-22T00:50:54.670000+00:00"
}
```

After the activation job starts, you can monitor its progress with the **get-read-set-activation-job** API operation. The following is an example of how to use the AWS CLI to check your activation job status. Replace `job ID` and `sequence store ID` with your job ID and sequence store ID, respectively. 

```
aws omics get-read-set-activation-job --id job ID --sequence-store-id sequence store ID                    
```

The response summarizes the activation job, as shown in the following.

```
{
    "id": "123567890",
    "sequenceStoreId": "123467890",
    "status": "SUBMITTED",
    "statusUpdateReason": "The job is submitted and will start soon.",
    "creationTime": "2022-10-22T00:50:54.670000+00:00",
    "sources": [
        {
            "readSetId": "read set ID 1",
            "status": "NOT_STARTED",
            "statusUpdateReason": "The source is queued for the job."
        },
        {
            "readSetId": "read set ID 2",
            "status": "NOT_STARTED",
            "statusUpdateReason": "The source is queued for the job."
        }
    ]
}
```

You can check the status of an individual read set with the **get-read-set-metadata** API operation. Possible statuses are `ACTIVE`, `ACTIVATING`, and `ARCHIVED`. In the following example, replace `sequence store ID` with your sequence store ID, and replace `read set ID` with your read set ID.

```
aws omics get-read-set-metadata --sequence-store-id sequence store ID --id read set ID
```

The following response shows you that the read set is active.

```
{
    "id": "12345678",
    "arn": "arn:aws:omics:us-west-2:555555555555:sequenceStore/1234567890/readSet/12345678",
    "sequenceStoreId": "0123456789",
    "subjectId": "mySubject",
    "sampleId": "mySample",
    "status": "ACTIVE",
    "name": "HG00100",
    "description": "HG00100 aligned to HG38 BAM",
    "fileType": "BAM",
    "creationTime": "2022-07-13T23:25:20Z",
    "sequenceInformation": {
        "totalReadCount": 1513467,
        "totalBaseCount": 163454436,
        "generatedFrom": "Pulled from SRA",
        "alignment": "ALIGNED"
    },
    "referenceArn": "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001",
    "files": {
        "source1": {
            "totalParts": 2,
            "partSize": 10485760,
            "contentLength": 17112283,
            "s3Access": {
                "s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz"
            }
        },
        "index": {
            "totalParts": 1,
            "partSize": 10485760,
            "contentLength": 53216,
            "s3Access": {
                "s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz"
            }
        }
    },
    "creationType": "IMPORT",
    "etag": {
        "algorithm": "BAM_MD5up",
        "source1": "d1d65429212d61d115bb19f510d4bd02"
    }
}
```
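In the `files` section of the metadata, `totalParts` follows from `contentLength` and `partSize`: the object is served in fixed-size parts, with a final shorter part covering the remainder. A quick sketch of that arithmetic, using the `source1` numbers from the example response:

```python
import math

def total_parts(content_length, part_size):
    """Number of fixed-size parts needed to cover content_length bytes."""
    return math.ceil(content_length / part_size)

# source1 from the example: 17,112,283 bytes in 10,485,760-byte parts -> 2 parts
print(total_parts(17112283, 10485760))  # prints 2
```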

You can view all read set activation jobs by using **list-read-set-activation-jobs**, as shown in the following example. Replace `sequence store ID` with your sequence store ID.

```
aws omics list-read-set-activation-jobs --sequence-store-id sequence store ID                
```

You receive the following response.

```
{
    "activationJobs": [
        {
            "id": "1234657890",
            "sequenceStoreId": "1234567890",
            "status": "COMPLETED",
            "creationTime": "2022-10-22T01:33:38.079000+00:00",
            "completionTime": "2022-10-22T01:34:28.941000+00:00"
        }
    ]
}
```