How call caching works - AWS HealthOmics

How call caching works

To use call caching, you create a run cache and configure it to have an associated Amazon S3 location for the cached data. When you start a run, you specify the run cache. A run cache isn't dedicated to one workflow. Runs from multiple workflows can use the same cache.
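As a rough sketch of this setup, the following Python builds the request parameters for creating a run cache and for starting a run that uses it. The parameter names (`cacheS3Location`, `cacheBehavior`, `cacheId`, and so on) are assumed to mirror the HealthOmics CreateRunCache and StartRun API operations, so verify them against the API reference. The helper function names and all values in the usage note are hypothetical placeholders.

```python
VALID_BEHAVIORS = ("CACHE_ON_FAILURE", "CACHE_ALWAYS")


def build_create_run_cache_request(name, s3_location, behavior="CACHE_ON_FAILURE"):
    """Build keyword arguments for a CreateRunCache call (assumed parameter names)."""
    if not s3_location.startswith("s3://"):
        raise ValueError("The cache location must be an Amazon S3 URI")
    if behavior not in VALID_BEHAVIORS:
        raise ValueError(f"Unknown cache behavior: {behavior}")
    return {
        "name": name,
        "cacheS3Location": s3_location,  # where cached task outputs are exported
        "cacheBehavior": behavior,       # default behavior for runs that use this cache
    }


def build_start_run_request(workflow_id, role_arn, output_uri, cache_id):
    """Build keyword arguments for a StartRun call that uses a run cache."""
    return {
        "workflowId": workflow_id,
        "roleArn": role_arn,
        "outputUri": output_uri,
        "cacheId": cache_id,  # the same cache ID can be reused by runs of other workflows
    }
```

With an AWS SDK client for HealthOmics (for example, the boto3 `omics` client), these dictionaries could be passed as keyword arguments to the corresponding create-run-cache and start-run calls. Because a run cache isn't tied to one workflow, the same `cache_id` can appear in start requests for different workflow IDs.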

During the export phase of a run, the system exports the completed task outputs to the Amazon S3 location. To export intermediate task files, declare these files as task outputs in the workflow definition. Call caching also internally saves metadata and creates a unique hash for each cache entry to guarantee data provenance. When you turn on caching for a run, export times can be longer, because call caching increases the amount of export data.

For each task in a run, the workflow engine detects whether there is a matching cache entry for this task. If there is no matching cache entry, HealthOmics computes the task. If there is a matching cache entry, the engine retrieves the cached results. HealthOmics considers a task to match the cache entry if the following are all identical:

  • The task definition

  • The task inputs (determined based on S3 ETags)

  • The ECR container (determined by the ECR URI and the digest)

HealthOmics uses S3 ETags for unique file identification across all workflow engines, and includes the task name in the task hash calculation.
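The matching rule can be illustrated with a minimal sketch. The following Python derives a cache key from the task name, the task definition, the input ETags, and the container URI and digest. This is not the HealthOmics implementation, which is internal to the service. The function name, the payload layout, and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib
import json


def task_cache_key(task_name, task_definition, input_etags, container_uri, container_digest):
    """Return a hypothetical cache key for one task.

    input_etags maps each input file URI to its S3 ETag.
    """
    payload = {
        "name": task_name,
        "definition": task_definition,
        "inputs": dict(input_etags),
        "container": [container_uri, container_digest],
    }
    # sort_keys makes the serialized form independent of dictionary ordering
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

Under this sketch, two tasks match only when every component is identical. Changing an input file (and therefore its ETag), editing the task definition, or pushing a new container image (and therefore changing its digest) produces a different key, so the task is computed again instead of being read from the cache.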

HealthOmics supports call caching for these (or later) workflow language versions:

  • WDL versions 1.0, 1.1, and the development version

  • Nextflow version 23.10

  • All CWL versions

HealthOmics doesn't support call caching for Ready2Run workflows.

Note

Call caching saves task output data in the Amazon S3 location specified for the cache, which incurs charges to your AWS account.

Run cache behavior

You can set run cache behavior to save the task outputs for runs that fail (cache on failure) or for all runs (cache always). When you create a run cache, you set the default cache behavior for all runs that use this cache. You can override the default behavior when you start a run.
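The relationship between the cache default and a per-run override can be sketched as follows. The function names are hypothetical, and the behavior strings `CACHE_ON_FAILURE` and `CACHE_ALWAYS` are assumed to match the values that the API accepts.

```python
VALID_BEHAVIORS = ("CACHE_ON_FAILURE", "CACHE_ALWAYS")


def effective_cache_behavior(cache_default, run_override=None):
    """Return the behavior that applies to a run: the override if given, else the cache default."""
    behavior = run_override if run_override is not None else cache_default
    if behavior not in VALID_BEHAVIORS:
        raise ValueError(f"Unknown cache behavior: {behavior}")
    return behavior


def saves_task_outputs(behavior, run_succeeded):
    """Cache always saves outputs for every run; cache on failure saves them only for failed runs."""
    if behavior == "CACHE_ALWAYS":
        return True
    return not run_succeeded
```

For example, a cache created with a default of `CACHE_ON_FAILURE` still saves the outputs of a successful run when that run is started with an override of `CACHE_ALWAYS`.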

Cache on failure is useful if you're debugging a workflow that fails after several tasks completed successfully. The subsequent run resumes from the last successfully completed task if the task definition, inputs, and container in ECR are identical to the prior run.

Cache always is useful if you're updating a task in a workflow that completes successfully. We recommend that you follow these steps:

  1. Create a new run. Set the Cache behavior to Cache always, and start the run.

  2. After the run completes, update the task in the workflow and start a new run with the cache behavior set to Cache always. This run processes the updated task and any subsequent tasks that depend on the updated task. All other tasks use the cached results.

  3. Repeat step 2 as required, until the updated task is ready for production.

  4. Use the updated task in production. Remember to use Cache on failure for these runs.
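To see which tasks step 2 recomputes, the following sketch walks a task dependency graph and collects the updated task plus everything downstream of it. It is a simplified model, not the workflow engine's logic. The function name and the example task names in the usage note are hypothetical.

```python
from collections import deque


def tasks_to_rerun(dependencies, updated_task):
    """Return the updated task and all tasks that depend on it, directly or indirectly.

    dependencies maps each task name to the list of tasks it depends on.
    """
    # Invert the mapping so that each task points to its direct dependents
    dependents = {}
    for task, upstream in dependencies.items():
        dependents.setdefault(task, set())
        for parent in upstream:
            dependents.setdefault(parent, set()).add(task)

    # Breadth-first traversal from the updated task through its dependents
    rerun = {updated_task}
    queue = deque([updated_task])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in rerun:
                rerun.add(child)
                queue.append(child)
    return rerun
```

Consider a workflow where `align` and `qc` both depend on `trim`, `call_variants` depends on `align`, and `report` depends on `qc` and `call_variants`. Updating `align` means that `align`, `call_variants`, and `report` are processed again, while `trim` and `qc` can use the cached results.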

Note

We recommend Cache always mode for development purposes, but not for a production workflow. If you set this mode for a large production workflow, the system can export large amounts of data to S3, resulting in increased export times and storage costs.