Accessing HealthOmics read sets with Amazon S3 URIs
You can use Amazon S3 URI paths to access your active sequence store read sets. Use Amazon S3 API operations to list, share, and download your read sets. This makes it easier to collaborate: you can share access with users in the data owner's account in the same Region, or you can create IAM roles that users outside of the data owner's account can assume. Archived read sets aren't accessible through Amazon S3 URIs until they have been activated. Each time a read set is activated, it's restored to the same URI path.
Because the Amazon S3 URI is based on Amazon S3 access points, you can directly integrate data loaded into HealthOmics stores with industry-standard tools that read Amazon S3 URIs, such as the following:
-
Visual analysis applications such as Integrative Genomics Viewer (IGV) or UCSC Genome Browser.
-
Common workflows with Amazon S3 extensions such as CWL, WDL, and Nextflow.
-
Any tool that can authenticate and read from access point Amazon S3 URIs or read presigned Amazon S3 URIs.
-
Amazon S3 utilities such as Mountpoint or CloudFront.
Mountpoint for Amazon S3 makes it possible for you to use an Amazon S3 bucket as a local file system. To learn more about Mountpoint and to install it, see Mountpoint for Amazon S3.
Amazon CloudFront is a content delivery network (CDN) service built for high performance, security, and developer convenience. To learn more about using Amazon CloudFront, see the Amazon CloudFront documentation. To set up CloudFront with a sequence store, contact the AWS HealthOmics team.
The data owner root account is enabled for the actions s3:GetObject, s3:GetObjectTagging, and s3:ListBucket on the sequence store prefix. For a user in the account to access the data, you create an IAM policy and attach it to the user or role. For an example policy, see Permissions for data access using Amazon S3 URIs.
You can use the following Amazon S3 API operations on the active read sets to list and retrieve your data. You can access archived read sets through Amazon S3 URIs after they have been activated.
-
GetObject: Retrieves an object from Amazon S3.
-
HeadObject: Retrieves metadata from an object without returning the object itself. This operation is useful if you only want an object's metadata.
-
ListObjects and ListObjectsV2: Return some or all (up to 1,000 per request) of the objects in a bucket.
-
CopyObject: Creates a copy of an object that's already stored in Amazon S3. HealthOmics supports copying to an Amazon S3 access point, but not writing to an access point.
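As an illustration of the list and metadata operations above, here is a minimal Python sketch. It assumes you pass in an S3 client such as one created with boto3 (`boto3.client("s3")`), that the sequence store's access point ARN is used as the `Bucket` argument, and that `prefix` is the read set prefix. The function names are hypothetical and not part of HealthOmics.

```python
def list_read_set_keys(s3_client, access_point_arn, prefix):
    """Return every object key under `prefix`.

    ListObjectsV2 returns at most 1,000 keys per call, so this follows
    the continuation token until the listing is no longer truncated.
    """
    keys = []
    kwargs = {"Bucket": access_point_arn, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def object_size(s3_client, access_point_arn, key):
    """Use HeadObject to read an object's size without downloading it."""
    response = s3_client.head_object(Bucket=access_point_arn, Key=key)
    return response["ContentLength"]


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   s3 = boto3.client("s3", region_name="<region>")
#   for key in list_read_set_keys(s3, "<access point arn>", "<read set prefix>"):
#       print(key, object_size(s3, "<access point arn>", key))
```

Passing the client in as a parameter keeps the functions independent of how credentials or roles are configured.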
HealthOmics sequence stores maintain the semantic identity of files through ETags. Throughout the lifecycle of a file, the Amazon S3 ETag, which is based on bitwise identity, may change, but the HealthOmics ETag remains the same. To learn more, see HealthOmics ETags and data provenance.
Amazon S3 URI structure in HealthOmics storage
All files with Amazon S3 URIs have omics:subjectId and omics:sampleId resource tags. You can use these tags to share access by using IAM policies through a pattern such as "s3:ExistingObjectTag/omics:subjectId": "pattern desired".
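To show where such a tag condition sits in a policy, the following Python sketch builds a policy document as a dictionary. It is only an illustration under assumptions: `resource_arn` is a placeholder you supply, the function name is hypothetical, and the statement covers only the object-level read actions. For the complete policy, including the resource ARN format, refer to Permissions for data access using Amazon S3 URIs.

```python
import json


def tag_scoped_read_policy(resource_arn, subject_id_pattern):
    """Build an IAM policy that allows reads only on objects whose
    omics:subjectId tag matches `subject_id_pattern`.

    `resource_arn` is a caller-supplied placeholder; this sketch does not
    assume a particular ARN format.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectTagging"],
                "Resource": resource_arn,
                "Condition": {
                    "StringLike": {
                        "s3:ExistingObjectTag/omics:subjectId": subject_id_pattern
                    }
                },
            }
        ],
    }


# Print the policy as JSON, ready to paste into an IAM policy editor.
print(json.dumps(tag_scoped_read_policy("<resource arn>", "<pattern desired>"), indent=2))
```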
The file structure is as follows:
.../account_id/sequenceStore/seq_store_id/readSet/read_set_id/files
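The path structure above can be built and parsed programmatically. The following Python sketch assumes that the final `files` component stands for an individual file name, and it leaves the elided leading part of the path (`...`) untouched. The function names are hypothetical.

```python
import re

# Matches the documented tail of a read set object key:
# account_id/sequenceStore/seq_store_id/readSet/read_set_id/<file name>
_READ_SET_KEY = re.compile(
    r"(?P<account_id>[^/]+)/sequenceStore/(?P<seq_store_id>[^/]+)"
    r"/readSet/(?P<read_set_id>[^/]+)/(?P<file_name>[^/]+)$"
)


def read_set_prefix(account_id, seq_store_id, read_set_id):
    """Build the prefix under which a read set's files are stored."""
    return f"{account_id}/sequenceStore/{seq_store_id}/readSet/{read_set_id}/"


def parse_read_set_key(key):
    """Split a key into its IDs and file name, or return None if the key
    doesn't follow the documented structure."""
    match = _READ_SET_KEY.search(key)
    return match.groupdict() if match else None
```

A prefix built this way can be passed as the `Prefix` argument of a list request to restrict the listing to a single read set.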
For files imported into sequence stores from Amazon S3, the sequence store attempts to maintain the original source name. When the names conflict, the system appends read set information to ensure that the file names are unique. For instance, for FASTQ read sets, if both file names are the same, sourceX is inserted before .fastq.gz or .fq.gz to make the names unique. For a direct upload, the file names follow these patterns:
-
For FASTQ: read_set_name_sourcex.fastq.gz
-
For uBAM/BAM/CRAM: read_set_name.file extension, with extensions of .bam or .cram. An example is NA193948.bam.
For read sets that are BAM or CRAM, index files are automatically generated during the ingestion process. The generated index files have the proper index extension applied at the end of the file name, following the pattern <name of the Source the index is on>.<file index extension>. The index extensions are .bai or .crai.
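The direct-upload naming patterns can be expressed as small helper functions. This Python sketch rests on two assumptions that the text does not state outright: that the `x` in `sourcex` is the source number (for example 1 or 2), and that the index name is formed from the full source file name including its extension. The function names are hypothetical.

```python
# Index extension for each alignment file extension named in the text.
INDEX_EXTENSION = {"bam": "bai", "cram": "crai"}


def fastq_upload_name(read_set_name, source_number):
    """FASTQ pattern: read_set_name_sourcex.fastq.gz
    (assumes x is the source number)."""
    return f"{read_set_name}_source{source_number}.fastq.gz"


def alignment_upload_name(read_set_name, extension):
    """uBAM/BAM/CRAM pattern: read_set_name.<file extension>."""
    if extension not in INDEX_EXTENSION:
        raise ValueError("extension must be 'bam' or 'cram'")
    return f"{read_set_name}.{extension}"


def index_name(source_file_name):
    """Index pattern: <source name>.<index extension>
    (assumes the source name keeps its own extension)."""
    extension = source_file_name.rsplit(".", 1)[-1]
    return f"{source_file_name}.{INDEX_EXTENSION[extension]}"
```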
Using Hosted or Local IGV to access read sets
IGV is a genome browser used to analyze BAM and CRAM files. It requires both the file and the index because it only displays a portion of the genome at a time. IGV can be downloaded and used locally, and there are guides to creating an AWS hosted IGV. The public web version isn't supported because it requires CORS.
Local IGV relies on the local AWS configuration to access files. Ensure that the role used in that configuration has a policy attached that enables kms:Decrypt and s3:GetObject permissions to the Amazon S3 URI of the read sets being accessed. After that, in IGV, you can use “File > load from URL” and paste in the URI for the source and index. Alternatively, presigned URLs can be generated and used in the same manner, which bypasses the AWS configuration. Note that CORS isn't supported with Amazon S3 URI access, so requests relying on CORS aren't supported.
The example AWS Hosted IGV relies on Amazon Cognito to create the correct configurations and permissions inside the environment. Ensure that a policy is created that enables kms:Decrypt and s3:GetObject permissions to the Amazon S3 URI of the read sets being accessed, and add this policy to the role that's assigned to the Cognito user pool. After that, in IGV, you can use “File > load from URL” and enter the URI for the source and index. Alternatively, presigned URLs can be generated and used in the same manner, which bypasses the AWS configuration.
Note that the sequence store will not appear under the “Amazon” tab because that only displays buckets owned by you in the Region in which the AWS profile is configured.
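Because IGV needs both the file and its index, presigned URLs are typically generated in pairs. The following Python sketch assumes an S3 client such as one from boto3 and uses its `generate_presigned_url` method, with the access point ARN passed as the `Bucket` parameter. The function name and the one-hour default expiry are illustrative choices, not HealthOmics requirements.

```python
def presign_file_and_index(s3_client, access_point_arn, file_key, index_key,
                           expires_in=3600):
    """Return (file_url, index_url), each valid for `expires_in` seconds."""

    def presign(key):
        return s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": access_point_arn, "Key": key},
            ExpiresIn=expires_in,
        )

    return presign(file_key), presign(index_key)


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   s3 = boto3.client("s3", region_name="<region>")
#   bam_url, bai_url = presign_file_and_index(
#       s3, "<access point arn>", "<key of BAM file>", "<key of BAI file>")
```

The two URLs can then be pasted into the IGV “File > load from URL” dialog as the source and index.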
Using Samtools or HTSlib in HealthOmics
HTSlib is the core library that's shared by several tools such as Samtools, rSamtools, PySam, and others. Use HTSlib version 1.20 or later to get seamless support for Amazon S3 Access Points. For older versions of the HTSlib library, you can use the following workarounds:
-
Set the environment variable for the HTS Amazon S3 host with:
export HTS_S3_HOST="s3.region.amazonaws.com"
-
Generate a presigned URL for the files that you want to use. If a BAM or CRAM is being used, ensure that a presigned URL is generated for both the file and the index. After that, both files can be used with the libraries.
-
Use Mountpoint to mount the sequence store or read set prefix in the same environment where you’re using HTSlib libraries. From here, the files can be accessed by using local file paths.
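When an HTSlib-based tool is launched from a script, the first workaround can be applied to the child process only. The following Python sketch builds an environment with `HTS_S3_HOST` set for a given Region and, as an example, runs `samtools view -H` to print a file's header. It assumes samtools is installed and on the PATH; the function names are hypothetical.

```python
import os
import subprocess


def hts_environment(region, base_env=None):
    """Return a copy of the environment with HTS_S3_HOST set for `region`."""
    env = dict(os.environ if base_env is None else base_env)
    env["HTS_S3_HOST"] = f"s3.{region}.amazonaws.com"
    return env


def read_header(uri, region):
    """Run `samtools view -H` on `uri` and return the header text.

    Assumes samtools is installed and AWS credentials are available to it.
    """
    result = subprocess.run(
        ["samtools", "view", "-H", uri],
        env=hts_environment(region),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```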
Using Mountpoint with HealthOmics
Mountpoint for Amazon S3 is a simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system. Mountpoint can be installed by using the Mountpoint installation instructions.
mount-s3 <access point arn> <local path to mount> --prefix <prefix to sequence store or read set> --region <region>
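If you mount from a script, the command can be assembled as an argument list. This Python sketch only builds the `mount-s3` invocation described above; it assumes Mountpoint is already installed, and it appends a trailing slash to the prefix because Mountpoint expects the `--prefix` value to end in `/`. The function name is hypothetical.

```python
import subprocess


def mount_command(access_point_arn, mount_dir, prefix, region):
    """Build the mount-s3 argument list for a sequence store or read set."""
    if not prefix.endswith("/"):
        prefix += "/"  # Mountpoint expects the prefix to end in "/"
    return [
        "mount-s3", access_point_arn, mount_dir,
        "--prefix", prefix,
        "--region", region,
    ]


# Usage (assumes mount-s3 is installed and the mount directory exists):
#   subprocess.run(
#       mount_command("<access point arn>", "<local path to mount>",
#                     "<prefix to sequence store or read set>", "<region>"),
#       check=True,
#   )
```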
Using CloudFront with HealthOmics
Amazon CloudFront is a content delivery network (CDN) service that's built for high performance, security, and developer convenience. To use CloudFront with a sequence store, the HealthOmics service team must turn on the CloudFront distribution. Work with your account team to engage the HealthOmics service team.