Scenarios
HPC cases are typically complex computational problems that require parallel-processing techniques. A well-architected HPC infrastructure is capable of sustaining performance for the duration of these calculations. HPC workloads span traditional applications, such as genomics, computational chemistry, financial risk modeling, computer-aided engineering, weather prediction, and seismic imaging, as well as emerging applications, such as machine learning, deep learning, and autonomous driving. Despite this variety, the traditional grids or HPC clusters that support these calculations are remarkably similar in architecture, with select cluster attributes optimized for the specific workload. On AWS, the network, storage type, compute (instance) type, and even deployment method can be strategically chosen to optimize performance, cost, and usability for a particular workload.
HPC is divided into two categories based on the degree of interaction between the concurrently running parallel processes: loosely coupled and tightly coupled workloads. Loosely coupled HPC cases are those where the parallel processes do not strongly interact with each other over the course of the simulation. Tightly coupled HPC cases are those where the parallel processes run simultaneously and regularly exchange information at each iteration or step of the simulation.
With loosely coupled workloads, the completion of an entire calculation or simulation often requires hundreds to millions of parallel processes. These processes can occur in any order and at any speed over the course of the simulation, which offers flexibility in the computing infrastructure required for loosely coupled simulations.
Tightly coupled workloads have processes that regularly exchange information at each iteration of the simulation. Typically, these tightly coupled simulations run on a homogeneous cluster. The total core or processor count can range from tens to thousands, and occasionally to hundreds of thousands, if the infrastructure allows. The interactions of the processes during the simulation place extra demands on the infrastructure, such as the compute nodes and network infrastructure.
The infrastructure used to run the huge variety of loosely and tightly coupled applications is differentiated by its ability to support process interactions across nodes. There are fundamental aspects that apply to both scenarios and specific design considerations for each. Consider the following fundamentals for both scenarios when selecting an HPC infrastructure on AWS:
- Network: Network requirements range from loosely coupled applications with minimal communication traffic to tightly coupled, massively parallel applications that require a performant network with high bandwidth and low latency. A launch sketch for low-latency networking follows this list.
- Storage: HPC calculations use, create, and move data in unique ways, and the storage infrastructure must support these requirements during each step of the calculation. Input data is frequently stored on startup, more data is created and stored while running, and output data is moved to a reservoir location upon run completion. Factors to consider include data size, media type, transfer speeds, shared access, and storage properties (for example, durability and availability). It is often helpful to use a shared file system between nodes, such as a Network File System (NFS) share with Amazon Elastic File System (Amazon EFS) or a Lustre file system with Amazon FSx for Lustre; an Amazon EFS sketch follows this list.
- Compute: The Amazon EC2 instance type defines the hardware capabilities available for your HPC workload. Hardware capabilities include the processor type, core frequency, processor features (for example, vector extensions), memory-to-core ratio, and network performance. On AWS, an instance is considered to be the same as an HPC node, and these terms are used interchangeably in this whitepaper. An instance type query sketch follows this list.
  AWS also offers managed services that provide access to compute without the need to choose the underlying EC2 instance type. AWS Lambda and AWS Fargate are compute services that allow you to run workloads without having to provision and manage the underlying servers.
- Deployment: AWS provides many options for deploying HPC workloads. Instances can be launched manually from the AWS Management Console. For automated deployments, a variety of Software Development Kits (SDKs) is available for coding end-to-end solutions in different programming languages, and a popular HPC deployment option combines bash shell scripting with the AWS Command Line Interface (AWS CLI).
  AWS CloudFormation templates allow application-tailored HPC clusters to be described as code and launched in minutes. AWS ParallelCluster is open-source software that coordinates the launch of a cluster through CloudFormation, with software already installed (for example, compilers and schedulers) for a traditional cluster experience; a ParallelCluster sketch follows this list.
  AWS provides managed deployment services for container-based workloads, such as Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, and AWS Batch.
  Additional software options are available from third-party companies in the AWS Marketplace and the AWS Partner Network (APN).
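
For a tightly coupled workload, one way to request a low-latency fabric is to place instances in a cluster placement group and attach an Elastic Fabric Adapter (EFA). The following AWS CLI sketch illustrates the idea; the AMI, subnet, security group, and key name are placeholders, the instance type is only one example of an EFA-capable size, and the security group is assumed to allow traffic between cluster members.

```bash
# Sketch: request low-latency networking for a tightly coupled workload.
# All resource IDs below are placeholders.

# A cluster placement group packs instances close together on the network.
aws ec2 create-placement-group \
    --group-name hpc-cluster-pg \
    --strategy cluster

# Launch EFA-capable instances (for example, c5n.18xlarge) into the group.
# InterfaceType=efa attaches an Elastic Fabric Adapter to each instance.
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type c5n.18xlarge \
    --count 4 \
    --key-name my-key \
    --placement GroupName=hpc-cluster-pg \
    --network-interfaces "DeviceIndex=0,SubnetId=subnet-0123456789abcdef0,Groups=sg-0123456789abcdef0,InterfaceType=efa"
```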
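
A minimal sketch of provisioning a shared NFS file system with Amazon EFS is shown below. The file system, subnet, and security group IDs, the Region in the mount DNS name, and the /shared mount path are all placeholders.

```bash
# Sketch: provision a shared NFS file system with Amazon EFS.
# Resource IDs and the Region are placeholders.

# Create the file system and a mount target in the cluster's subnet.
aws efs create-file-system \
    --creation-token hpc-shared-fs \
    --performance-mode generalPurpose

aws efs create-mount-target \
    --file-system-id fs-0123456789abcdef0 \
    --subnet-id subnet-0123456789abcdef0 \
    --security-groups sg-0123456789abcdef0

# On each compute node, mount the share at a common path.
sudo mkdir -p /shared
sudo mount -t nfs4 -o nfsvers=4.1 \
    fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /shared
```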
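
When choosing an instance type, the AWS CLI can report the hardware characteristics that matter for HPC, such as vCPU count, memory, and network performance. This sketch filters for EFA-capable types; the filter and the selected fields are only examples.

```bash
# Sketch: survey EFA-capable instance types to compare memory-to-core ratio
# and network performance. Adjust the filter for your workload.
aws ec2 describe-instance-types \
    --filters "Name=network-info.efa-supported,Values=true" \
    --query "InstanceTypes[].{Type:InstanceType, vCPUs:VCpuInfo.DefaultVCpus, MemoryMiB:MemoryInfo.SizeInMiB, Network:NetworkInfo.NetworkPerformance}" \
    --output table
```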
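
As one deployment example, the sketch below drives AWS ParallelCluster from a shell. It assumes the ParallelCluster v3 command line is installed and that a hypothetical cluster-config.yaml file describes the head node, scheduler, and compute queues; the cluster name and key file are also placeholders.

```bash
# Sketch: launch and inspect a cluster with the AWS ParallelCluster CLI (v3).
# cluster-config.yaml is a placeholder configuration file defining the head
# node, scheduler (for example, Slurm), and compute queues.
pcluster create-cluster \
    --cluster-name demo-hpc \
    --cluster-configuration cluster-config.yaml

# Check the cluster status until it reports CREATE_COMPLETE.
pcluster describe-cluster --cluster-name demo-hpc

# Connect to the head node to submit jobs through the scheduler.
pcluster ssh --cluster-name demo-hpc -i ~/.ssh/my-key.pem
```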
Cloud computing makes it easy to experiment with infrastructure components and architecture design. AWS strongly encourages testing instance types, EBS volume types, deployment methods, and other configuration choices to find the best performance at the lowest cost.