Direct upload to a HealthOmics sequence store
We recommend that you use the HealthOmics Transfer Manager to add files to your sequence store. For more information about using Transfer Manager, see its GitHub repository.
Read sets created by direct upload start in the PROCESSING_UPLOAD state. This means that the file parts are currently being uploaded, and you can access the read set metadata. After the parts are uploaded and the checksums are validated, the read set becomes ACTIVE and behaves the same as an imported read set.
If the direct upload fails, the read set status is shown as UPLOAD_FAILED. You can configure an Amazon S3 bucket as a fallback location for files that fail to upload. Fallback locations are available for sequence stores created after May 15, 2023.
Direct upload to a sequence store using the AWS CLI
To begin, start a multipart upload. You can do this by using the AWS CLI, as shown in the following example.
To direct upload using AWS CLI commands
Create the parts by separating your data, as shown in the following example.
split -b 100MiB SRR233106_1.filt.fastq.gz source1_part_
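To estimate how many parts a split like the one above produces, divide the file size by the part size and round up. The following is a minimal sketch: the function name part_count is hypothetical, and both sizes are given in bytes. The sample sizes are taken from the part listing shown later on this page (two parts of 104857600 bytes and one of 100339539 bytes).

```shell
# part_count FILE_SIZE_BYTES PART_SIZE_BYTES
# Prints the number of parts: the ceiling of file size / part size.
part_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 310054739 bytes split into 100 MiB (104857600-byte) parts.
part_count 310054739 104857600   # prints 3
```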
After your source files are in parts, create a multipart read set upload, as shown in the following example. Replace sequence store ID and the other parameters with your sequence store ID and other values. The --description and --generated-from values shown are samples; replace them with a description of the upload and the source of the imported files.
aws omics create-multipart-read-set-upload \
    --sequence-store-id sequence store ID \
    --name upload name \
    --source-file-type FASTQ \
    --subject-id subject ID \
    --sample-id sample ID \
    --description "FASTQ for HG00146" \
    --generated-from "1000 Genomes"
You get the uploadId and other metadata in the response. Use the uploadId for the next step of the upload process.
{
    "sequenceStoreId": "1504776472",
    "uploadId": "7640892890",
    "sourceFileType": "FASTQ",
    "subjectId": "mySubject",
    "sampleId": "mySample",
    "generatedFrom": "1000 Genomes",
    "name": "HG00146",
    "description": "FASTQ for HG00146",
    "creationTime": "2023-11-20T23:40:47.437522+00:00"
}
Add the parts of your file to the upload. If your file is small enough, you only have to perform this step once. For larger files, you perform this step for each part of your file. If you upload a new part by using a previously used part number, it overwrites the previously uploaded part.
In the following example, replace sequence store ID, upload ID, and the other parameters with your values.
aws omics upload-read-set-part \
    --sequence-store-id sequence store ID \
    --upload-id upload ID \
    --part-source SOURCE1 \
    --part-number part number \
    --payload source1/source1_part_aa.fastq.gz
The response is a checksum that you can use to verify that the uploaded part matches the part you intended.
{ "checksum": "984979b9928ae8d8622286c4a9cd8e99d964a22d59ed0f5722e1733eb280e635" }
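When a file has many parts, the upload-read-set-part call above can be repeated in a loop. The following is a minimal sketch, not a definitive implementation: the function name upload_parts and the IDs in the usage comment are hypothetical, and it assumes the parts were created by split with a common prefix, so that sorted file names correspond to part order.

```shell
# upload_parts STORE_ID UPLOAD_ID PART_SOURCE PREFIX
# Uploads every file matching PREFIX* as consecutive parts, starting at 1.
# The shell expands PREFIX* in sorted order, which matches the aa, ab, ...
# suffixes that split generates.
upload_parts() {
    store_id=$1; upload_id=$2; part_source=$3; prefix=$4
    part_number=1
    for part_file in "$prefix"*; do
        [ -f "$part_file" ] || continue   # skip an unmatched glob
        aws omics upload-read-set-part \
            --sequence-store-id "$store_id" \
            --upload-id "$upload_id" \
            --part-source "$part_source" \
            --part-number "$part_number" \
            --payload "$part_file" || return 1   # stop at the first failure
        part_number=$((part_number + 1))
    done
}

# Usage (hypothetical IDs):
#   upload_parts 1504776472 7640892890 SOURCE1 source1_part_
```

Because a repeated part number overwrites the earlier part, rerunning the loop after a failure re-uploads the same parts under the same numbers.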
Continue uploading the parts of your file, if necessary. To verify that your parts have been uploaded, use the list-read-set-upload-parts API operation, as shown in the following example. Replace sequence store ID, upload ID, and the part source with your own input.
aws omics list-read-set-upload-parts \
    --sequence-store-id sequence store ID \
    --upload-id upload ID \
    --part-source SOURCE1
The response lists each uploaded part with its part number, size, part source, checksum, and the timestamps for when it was created and most recently updated.
{
    "parts": [
        {
            "partNumber": 1,
            "partSize": 104857600,
            "partSource": "SOURCE1",
            "checksum": "MVMQk+vB9C3Ge8ADHkbKq752n3BCUzyl41qEkqlOD5M=",
            "creationTime": "2023-11-20T23:58:03.500823+00:00",
            "lastUpdatedTime": "2023-11-20T23:58:03.500831+00:00"
        },
        {
            "partNumber": 2,
            "partSize": 104857600,
            "partSource": "SOURCE1",
            "checksum": "keZzVzJNChAqgOdZMvOmjBwrOPM0enPj1UAfs0nvRto=",
            "creationTime": "2023-11-21T00:02:03.813013+00:00",
            "lastUpdatedTime": "2023-11-21T00:02:03.813025+00:00"
        },
        {
            "partNumber": 3,
            "partSize": 100339539,
            "partSource": "SOURCE1",
            "checksum": "TBkNfMsaeDpXzEf3ldlbi0ipFDPaohKHyZ+LF1J4CHk=",
            "creationTime": "2023-11-21T00:09:11.705198+00:00",
            "lastUpdatedTime": "2023-11-21T00:09:11.705208+00:00"
        }
    ]
}
To view all active multipart read set uploads, use list-multipart-read-set-uploads, as shown in the following example. Replace sequence store ID with the ID for your own sequence store.
aws omics list-multipart-read-set-uploads --sequence-store-id sequence store ID
This operation returns only multipart read set uploads that are in progress. After the ingested read sets are ACTIVE, or if the upload has failed, the upload is no longer returned in the response. To view active read sets, use the list-read-sets API operation. The following is an example response for list-multipart-read-set-uploads.
{
    "uploads": [
        {
            "sequenceStoreId": "1234567890",
            "uploadId": "8749584421",
            "sourceFileType": "FASTQ",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "generatedFrom": "1000 Genomes",
            "name": "HG00146",
            "description": "FASTQ for HG00146",
            "creationTime": "2023-11-29T19:22:51.349298+00:00"
        },
        {
            "sequenceStoreId": "1234567890",
            "uploadId": "5290538638",
            "sourceFileType": "BAM",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "generatedFrom": "1000 Genomes",
            "referenceArn": "arn:aws:omics:us-west-2:123456789012:referenceStore/8168613728/reference/2190697383",
            "name": "HG00146",
            "description": "BAM for HG00146",
            "creationTime": "2023-11-29T19:23:33.116516+00:00"
        },
        {
            "sequenceStoreId": "1234567890",
            "uploadId": "4174220862",
            "sourceFileType": "BAM",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "generatedFrom": "1000 Genomes",
            "referenceArn": "arn:aws:omics:us-west-2:123456789012:referenceStore/8168613728/reference/2190697383",
            "name": "HG00147",
            "description": "BAM for HG00147",
            "creationTime": "2023-11-29T19:23:47.007866+00:00"
        }
    ]
}
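Because only in-progress uploads are listed, the same operation can serve as a quick check on whether a given upload is still open. The following is a minimal sketch: the function names are hypothetical, and it relies on the AWS CLI global options --query and --output text to reduce the response to upload IDs. Note that the pipeline reports the exit status of tr, so a failed aws call shows up as an empty list.

```shell
# in_progress_upload_ids STORE_ID
# Prints the IDs of uploads that are still in progress, one per line.
in_progress_upload_ids() {
    aws omics list-multipart-read-set-uploads \
        --sequence-store-id "$1" \
        --query 'uploads[].uploadId' \
        --output text | tr '\t' '\n'
}

# is_in_progress STORE_ID UPLOAD_ID
# Succeeds if UPLOAD_ID appears among the in-progress uploads.
is_in_progress() {
    in_progress_upload_ids "$1" | grep -qx "$2"
}
```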
After you upload all parts of your file, use complete-multipart-read-set-upload to conclude the upload process, as shown in the following example. Replace sequence store ID, upload ID, and the parameter for parts with your own values.
aws omics complete-multipart-read-set-upload \
    --sequence-store-id sequence store ID \
    --upload-id upload ID \
    --parts '[{"checksum":"gaCBQMe+rpCFZxLpoP6gydBoXaKKDA/Vobh5zBDb4W4=","partNumber":1,"partSource":"SOURCE1"}]'
The response for complete-multipart-read-set-upload contains the read set ID of the new read set.
{ "readSetId": "0000000001" }
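Writing the --parts list by hand becomes error-prone for multi-part files. The following is a minimal sketch that builds it from the list-read-set-upload-parts response instead. The function name complete_upload is hypothetical. It assumes that the checksum values reported by list-read-set-upload-parts are the ones that complete-multipart-read-set-upload expects, as the sample values on this page suggest, and it handles a single part source only.

```shell
# complete_upload STORE_ID UPLOAD_ID PART_SOURCE
# Lists the uploaded parts for one source, keeps only the three fields that
# --parts takes, and passes that JSON to complete-multipart-read-set-upload.
complete_upload() {
    store_id=$1; upload_id=$2; part_source=$3
    parts=$(aws omics list-read-set-upload-parts \
        --sequence-store-id "$store_id" \
        --upload-id "$upload_id" \
        --part-source "$part_source" \
        --query 'parts[].{partNumber: partNumber, partSource: partSource, checksum: checksum}' \
        --output json) || return 1
    aws omics complete-multipart-read-set-upload \
        --sequence-store-id "$store_id" \
        --upload-id "$upload_id" \
        --parts "$parts"
}
```

For an upload with two sources (SOURCE1 and SOURCE2), the part lists for both sources would need to be combined before the final call; that is left out of this sketch.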
To stop the upload, use abort-multipart-read-set-upload with the upload ID to end the upload process. Replace sequence store ID and upload ID with your own parameter values.
aws omics abort-multipart-read-set-upload \
    --sequence-store-id sequence store ID \
    --upload-id upload ID
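In a script, the abort call is typically issued when an earlier step fails, so that the upload does not remain in progress. The following is a minimal sketch of that pattern: the function name run_or_abort is hypothetical, and the command it wraps is whatever upload step you choose to run.

```shell
# run_or_abort STORE_ID UPLOAD_ID COMMAND [ARGS...]
# Runs COMMAND. If it fails, aborts the multipart upload and returns the
# exit status of COMMAND.
run_or_abort() {
    abort_store_id=$1; abort_upload_id=$2
    shift 2
    "$@" && return 0
    failed_status=$?
    aws omics abort-multipart-read-set-upload \
        --sequence-store-id "$abort_store_id" \
        --upload-id "$abort_upload_id"
    return "$failed_status"
}
```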
After the upload is complete, retrieve your data from the read set by using get-read-set, as shown in the following example. If the upload is still processing, get-read-set returns limited metadata, and the generated index files are unavailable. Replace sequence store ID and the other parameters with your own input.
aws omics get-read-set \
    --sequence-store-id sequence store ID \
    --id read set ID \
    --file SOURCE1 \
    --part-number 1 \
    myfile.fastq.gz
To check the metadata, including the status of your upload, use the get-read-set-metadata API operation.
aws omics get-read-set-metadata \
    --sequence-store-id sequence store ID \
    --id read set ID
The response includes metadata details such as the file type, the reference ARN, the number of files, and the length of the sequences. It also includes the status. Possible statuses are PROCESSING_UPLOAD, ACTIVE, and UPLOAD_FAILED.
{
    "id": "0000000001",
    "arn": "arn:aws:omics:us-west-2:555555555555:sequenceStore/0123456789/readSet/0000000001",
    "sequenceStoreId": "0123456789",
    "subjectId": "mySubject",
    "sampleId": "mySample",
    "status": "PROCESSING_UPLOAD",
    "name": "HG00146",
    "description": "FASTQ for HG00146",
    "fileType": "FASTQ",
    "creationTime": "2022-07-13T23:25:20Z",
    "files": {
        "source1": {
            "totalParts": 5,
            "partSize": 123456789012,
            "contentLength": 6836725
        },
        "source2": {
            "totalParts": 5,
            "partSize": 123456789056,
            "contentLength": 6836726
        }
    },
    "creationType": "UPLOAD"
}
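Because a read set stays in PROCESSING_UPLOAD until its checksums are validated, a script usually has to poll the status before it reads the data. The following is a minimal sketch of such a loop: the function name wait_for_read_set, the default interval of 30 seconds, and the default limit of 60 attempts are all assumptions, not values from the service.

```shell
# wait_for_read_set STORE_ID READ_SET_ID [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# Polls the read set status. Returns 0 when it is ACTIVE, 1 when it is
# UPLOAD_FAILED, 2 if it is still processing after MAX_ATTEMPTS polls,
# and 3 if the AWS CLI call itself fails.
wait_for_read_set() {
    interval=${3:-30}
    attempts=${4:-60}
    while [ "$attempts" -gt 0 ]; do
        status=$(aws omics get-read-set-metadata \
            --sequence-store-id "$1" \
            --id "$2" \
            --query status \
            --output text) || return 3
        case "$status" in
            ACTIVE) return 0 ;;
            UPLOAD_FAILED) return 1 ;;
        esac
        attempts=$((attempts - 1))
        sleep "$interval"
    done
    return 2
}
```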
Configure a fallback location
When you create or update a sequence store, you can configure an Amazon S3 bucket as the fallback location for files that fail to upload. The file parts for those read sets are transferred to the fallback location. Fallback locations are available for sequence stores created after May 15, 2023.
Create an Amazon S3 bucket policy to grant HealthOmics write access to the Amazon S3 fallback location, as shown in the following example:
{
    "Effect": "Allow",
    "Principal": {
        "Service": "omics.amazonaws.com"
    },
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
}
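The example above is a single policy statement; a bucket policy document wraps statements in a Version and Statement envelope. The following is a minimal sketch that prints a complete document for a given bucket. The function name fallback_bucket_policy is hypothetical, and the resource is written as an object ARN (the bucket name followed by /*) because s3:PutObject acts on objects; treat that as an assumption of this sketch.

```shell
# fallback_bucket_policy BUCKET_NAME
# Prints a complete bucket policy document that grants HealthOmics
# s3:PutObject on the objects in BUCKET_NAME.
fallback_bucket_policy() {
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "omics.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::$1/*"
    }
  ]
}
EOF
}

fallback_bucket_policy amzn-s3-demo-bucket
```

The output can be saved to a file and applied with aws s3api put-bucket-policy. That command replaces any policy already attached to the bucket, so merge the statement into the existing policy if the bucket has one.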
If the Amazon S3 bucket for fallback or access logs uses a customer managed key, add the following permissions to the key policy:
{
    "Sid": "Allow use of key",
    "Effect": "Allow",
    "Principal": {
        "Service": "omics.amazonaws.com"
    },
    "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey*"
    ],
    "Resource": "*"
}