HealthOmics supports workflow definitions written in WDL, Nextflow, or CWL. To learn more about these workflow languages, see the specifications for WDL, Nextflow, and CWL.
HealthOmics supports version management for the three workflow definition languages. For more information, see Version support for HealthOmics workflow definition languages.
Writing workflows in WDL
The following tables show how inputs in WDL map to the matching primitive and complex JSON types. Type coercion is limited; whenever possible, declare types explicitly.
WDL type | JSON type | Example WDL | Example JSON key and value | Notes |
---|---|---|---|---|
Boolean | boolean | Boolean b | "b": true | The value must be lower case and unquoted. |
Int | integer | Int i | "i": 7 | Must be unquoted. |
Float | number | Float f | "f": 42.2 | Must be unquoted. |
String | string | String s | "s": "characters" | JSON strings that are a URI must be mapped to a WDL File to be imported. |
File | string | File f | "f": "s3://amzn-s3-demo-bucket1/path/to/file" | Amazon S3 and HealthOmics storage URIs are imported as long as the IAM role provided for the workflow has read access to these objects. No other URI schemes are supported (such as file://, https://, and ftp://). The URI must specify an object; it can't be a directory, meaning it can't end with a /. |
Directory | string | Directory d | "d": "s3://bucket/path/" | The Directory type isn't included in WDL 1.0 or 1.1, so you need to add version development to the header of the WDL file. The URI must be an Amazon S3 URI with a prefix that ends with a '/'. All contents of the directory are recursively copied to the workflow as a single download. The Directory should only contain files related to the workflow. |
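For example, the following minimal sketch (the workflow name, parameter name, and bucket prefix are illustrative) shows a workflow that accepts a Directory input and therefore uses the development version header:
version development

workflow stage_reference_dir {
    input {
        # Supplied in the input JSON as: "reference_files": "s3://amzn-s3-demo-bucket1/references/"
        Directory reference_files
    }
}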
Complex types in WDL are data structures composed of primitive types. Data structures such as lists are converted to arrays.
WDL type | JSON type | Example WDL | Example JSON key and value | Notes |
---|---|---|---|---|
Array | array | Array[Int] nums | "nums": [1, 2, 3] | The members of the array must follow the format of the WDL array type. |
Pair | object | Pair[String, Int] str_to_i | "str_to_i": {"left": "a", "right": 1} | Each value of the pair must use the JSON format of its matching WDL type. |
Map | object | Map[Int, String] int_to_string | "int_to_string": { "2": "hello", "1": "goodbye" } | Each entry in the map must use the JSON format of its matching WDL type. |
Struct | object | struct SampleBamAndIndex { String sample_name File bam File bam_index } SampleBamAndIndex b_and_i | "b_and_i": { "sample_name": "NA12878", "bam": "s3://amzn-s3-demo-bucket1/NA12878.bam", "bam_index": "s3://amzn-s3-demo-bucket1/NA12878.bam.bai" } | The names of the struct members must exactly match the names of the JSON object keys. Each value must use the JSON format of the matching WDL type. |
Object | N/A | N/A | N/A | The WDL Object type is outdated and should be replaced by Struct in all cases. |
The HealthOmics workflow engine doesn't support qualified or name-spaced input parameters. Handling of qualified parameters and their mapping to WDL parameters isn't specified by the WDL language and can be ambiguous. For these reasons, best practice is to declare all input parameters in the top level (main) workflow definition file and pass them down to subworkflow calls using standard WDL mechanisms.
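As an illustration of this practice, the following sketch (the file name, workflow names, and parameters are hypothetical) declares all inputs in the main workflow definition and passes them down to a subworkflow call:
version 1.1

# Hypothetical file that defines a workflow named process_sample.
import "subworkflow.wdl" as sub

workflow main {
    input {
        File input_file
        String sample_name
    }

    # Pass the top-level inputs down explicitly rather than relying on
    # qualified (name-spaced) parameters in the input JSON.
    call sub.process_sample {
        input:
            input_file = input_file,
            sample_name = sample_name
    }
}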
Writing workflows in Nextflow
HealthOmics supports Nextflow DSL1 and DSL2. For details, see Nextflow version support.
Nextflow DSL2 is based on the Groovy programming language, so parameters are dynamic and type coercion is possible using the same rules as Groovy. Parameters and values supplied by the input JSON are available in the parameters (params) map of the workflow.
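For example, a key such as "sample_name" in the input JSON is available as params.sample_name. The following sketch (the parameter name and default are illustrative) prints the value from the workflow body:
nextflow.enable.dsl=2

// Illustrative default; a value in the input JSON overrides this entry in the params map.
params.sample_name = null

workflow {
    println "Processing sample: ${params.sample_name}"
}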
HealthOmics supports the Nextflow nf-validation plugin. You cannot retrieve additional plugins during a workflow run.
When an Amazon S3 or HealthOmics URI is used to construct a Nextflow file or path object, it makes the matching object available to the workflow, as long as read access is granted. The use of prefixes or directories is allowed for Amazon S3 URIs. For examples, see Amazon S3 input parameter formats.
HealthOmics supports the use of glob patterns in Amazon S3 URIs or HealthOmics storage URIs. Use glob patterns in the workflow definition to create path or file channels.
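For example, a path channel built from a glob pattern over an Amazon S3 prefix might look like the following sketch (the bucket and prefix are placeholders):
workflow {
    // Match all gzipped FASTQ files under the prefix and create a path channel.
    reads_ch = Channel.fromPath("s3://amzn-s3-demo-bucket1/fastq/*.fastq.gz")
    reads_ch.view()
}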
For workflows written in Nextflow, define a publishDir directive to export task content to your output Amazon S3 bucket. As shown in the following example, set the publishDir value to /mnt/workflow/pubdir. To export files to Amazon S3, the files must be in this directory.
nextflow.enable.dsl=2
workflow {
CramToBamTask(params.ref_fasta, params.ref_fasta_index, params.ref_dict, params.input_cram, params.sample_name)
ValidateSamFile(CramToBamTask.out.outputBam)
}
process CramToBamTask {
container "<account>.dkr.ecr.us-west-2.amazonaws.com/genomes-in-the-cloud"
publishDir "/mnt/workflow/pubdir"
input:
path ref_fasta
path ref_fasta_index
path ref_dict
path input_cram
val sample_name
output:
path "${sample_name}.bam", emit: outputBam
path "${sample_name}.bai", emit: outputBai
script:
"""
set -eo pipefail
samtools view -h -T $ref_fasta $input_cram |
samtools view -b -o ${sample_name}.bam -
samtools index -b ${sample_name}.bam
mv ${sample_name}.bam.bai ${sample_name}.bai
"""
}
process ValidateSamFile {
container "<account>.dkr.ecr.us-west-2.amazonaws.com/genomes-in-the-cloud"
publishDir "/mnt/workflow/pubdir"
input:
file input_bam
output:
path "validation_report"
script:
"""
java -Xmx3G -jar /usr/gitc/picard.jar \
ValidateSamFile \
INPUT=${input_bam} \
OUTPUT=validation_report \
MODE=SUMMARY \
IS_BISULFITE_SEQUENCED=false
"""
}
Writing workflows in CWL
Workflows written in Common Workflow Language, or CWL, offer similar functionality to workflows written in WDL and Nextflow. You can use Amazon S3 or HealthOmics storage URIs as input parameters.
If you define an input with secondaryFiles in a subworkflow, add the same definition in the main workflow.
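For example (a sketch with an illustrative parameter name and index extension), if a subworkflow input declares a .fai index as a secondary file, repeat that declaration on the corresponding input in the main workflow:
# Subworkflow definition (illustrative)
inputs:
  in_file:
    type: File
    secondaryFiles: [.fai]

# Main workflow definition: repeat the same secondaryFiles declaration
inputs:
  in_file:
    type: File
    secondaryFiles: [.fai]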
HealthOmics workflows don't support operation processes. To learn more about operation processes in CWL workflows, see the CWL documentation.
To convert an existing CWL workflow definition file to use HealthOmics, make the following changes:
- Replace all Docker container URIs with Amazon ECR URIs.
- Make sure that all the workflow files are declared in the main workflow as input, and all variables are explicitly defined.
- Make sure that all JavaScript code is strict-mode compliant.
Define the Docker container to use for each tool in the CWL workflow. We don't recommend hardcoding the dockerPull entry with a fixed Amazon ECR URI; instead, pass the container image URI as a workflow input parameter, as shown in the following example.
The following is an example of a workflow written in CWL.
cwlVersion: v1.2
class: Workflow
inputs:
  in_file:
    type: File
    secondaryFiles: [.fai]
  out_filename: string
  docker_image: string
outputs:
  copied_file:
    type: File
    outputSource: copy_step/copied_file
steps:
  copy_step:
    in:
      in_file: in_file
      out_filename: out_filename
      docker_image: docker_image
    out: [copied_file]
    run: copy.cwl
The following file defines the copy.cwl task.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: cp
inputs:
  in_file:
    type: File
    secondaryFiles: [.fai]
    inputBinding:
      position: 1
  out_filename:
    type: string
    inputBinding:
      position: 2
  docker_image:
    type: string
outputs:
  copied_file:
    type: File
    outputBinding:
      glob: $(inputs.out_filename)
requirements:
  InlineJavascriptRequirement: {}
  DockerRequirement:
    dockerPull: "$(inputs.docker_image)"
The following is an example of a workflow written in CWL with a GPU requirement.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: ["/bin/bash", "docm_haplotypeCaller.sh"]
$namespaces:
  cwltool: http://commonwl.org/cwltool#
requirements:
  cwltool:CUDARequirement:
    cudaDeviceCountMin: 1
    cudaComputeCapability: "nvidia-tesla-t4"
    cudaVersionMin: "1.0"
  InlineJavascriptRequirement: {}
  InitialWorkDirRequirement:
    listing:
      - entryname: 'docm_haplotypeCaller.sh'
        entry: |
          nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
inputs: []
outputs: []
Example workflow definition
The following example shows a basic workflow definition written in WDL. You can write an equivalent workflow definition in Nextflow or CWL.
version 1.1
task my_task {
runtime { ... }
input {
File input_file
String name
Int threshold
}
command <<<
my_tool --name ~{name} --threshold ~{threshold} ~{input_file}
>>>
output {
File results = "results.txt"
}
}
workflow my_workflow {
input {
File input_file
String name
Int threshold = 50
}
call my_task {
input:
input_file = input_file,
name = name,
threshold = threshold
}
output {
File results = my_task.results
}
}
WDL workflow definition example
The following examples show private workflow definitions for converting from CRAM to BAM in WDL. The CRAM to BAM workflow defines two tasks and uses tools from the genomes-in-the-cloud container, which is shown in the example and is publicly available.
The following example shows how to include the Amazon ECR container as a parameter. This allows HealthOmics to verify the access permissions to your container before it starts the run.
{
...
"gotc_docker":"<account_id>.dkr.ecr.<region>.amazonaws.com/genomes-in-the-cloud:2.4.7-1603303710"
}
The following example shows how to specify which files to use in your run, when the files are in an Amazon S3 bucket.
{
"input_cram": "s3://amzn-s3-demo-bucket1/inputs/NA12878.cram",
"ref_dict": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.dict",
"ref_fasta": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta",
"ref_fasta_index": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta.fai",
"sample_name": "NA12878"
}
To specify files from a HealthOmics sequence store, use the sequence store URI, as shown in the following example.
{
"input_cram": "omics://429915189008.storage.us-west-2.amazonaws.com/111122223333/readSet/4500843795/source1",
"ref_dict": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.dict",
"ref_fasta": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta",
"ref_fasta_index": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta.fai",
"sample_name": "NA12878"
}
You can then define your workflow in WDL as shown in the following.
version 1.0

workflow CramToBamFlow {
    input {
        File ref_fasta
        File ref_fasta_index
        File ref_dict
        File input_cram
        String sample_name
        String gotc_docker = "<account>.dkr.ecr.us-west-2.amazonaws.com/genomes-in-the-cloud:latest"
    }

    #Converts CRAM to SAM to BAM and makes BAI.
    call CramToBamTask {
        input:
            ref_fasta = ref_fasta,
            ref_fasta_index = ref_fasta_index,
            ref_dict = ref_dict,
            input_cram = input_cram,
            sample_name = sample_name,
            docker_image = gotc_docker
    }

    #Validates Bam.
    call ValidateSamFile {
        input:
            input_bam = CramToBamTask.outputBam,
            docker_image = gotc_docker
    }

    #Outputs Bam, Bai, and validation report to the FireCloud data model.
    output {
        File outputBam = CramToBamTask.outputBam
        File outputBai = CramToBamTask.outputBai
        File validation_report = ValidateSamFile.report
    }
}

#Task definitions.
task CramToBamTask {
    input {
        # Command parameters
        File ref_fasta
        File ref_fasta_index
        File ref_dict
        File input_cram
        String sample_name
        # Runtime parameters
        String docker_image
    }

    #Calls samtools view to do the conversion.
    command {
        set -eo pipefail
        samtools view -h -T ~{ref_fasta} ~{input_cram} |
        samtools view -b -o ~{sample_name}.bam -
        samtools index -b ~{sample_name}.bam
        mv ~{sample_name}.bam.bai ~{sample_name}.bai
    }

    #Runtime attributes:
    runtime {
        docker: docker_image
    }

    #Outputs a BAM and BAI with the same sample name.
    output {
        File outputBam = "~{sample_name}.bam"
        File outputBai = "~{sample_name}.bai"
    }
}

#Validates BAM output to ensure it wasn't corrupted during the file conversion.
task ValidateSamFile {
    input {
        File input_bam
        Int machine_mem_size = 4
        String docker_image
    }
    String output_name = basename(input_bam, ".bam") + ".validation_report"
    Int command_mem_size = machine_mem_size - 1

    command {
        java -Xmx~{command_mem_size}G -jar /usr/gitc/picard.jar \
            ValidateSamFile \
            INPUT=~{input_bam} \
            OUTPUT=~{output_name} \
            MODE=SUMMARY \
            IS_BISULFITE_SEQUENCED=false
    }
    runtime {
        docker: docker_image
    }
    #A text file is generated that lists errors or warnings that apply.
    output {
        File report = "~{output_name}"
    }
}