Elastic Fabric Adapter for AI/ML and HPC workloads on Amazon EC2

An Elastic Fabric Adapter (EFA) is a network device that you can attach to your Amazon EC2 instance to accelerate Artificial Intelligence (AI), Machine Learning (ML), and High Performance Computing (HPC) applications. EFA enables you to achieve the application performance of an on-premises AI/ML or HPC cluster, with the scalability, flexibility, and elasticity provided by the AWS Cloud.

EFA provides lower and more consistent latency and higher throughput than the TCP transport traditionally used in cloud-based HPC systems. It enhances the performance of inter-instance communication that is critical for scaling AI/ML and HPC applications. It is optimized to work on the existing AWS network infrastructure and it can scale depending on application requirements.

EFA integrates with Libfabric 1.7.0 and later. It supports the NVIDIA Collective Communications Library (NCCL) for AI and ML applications, and Open MPI 4 and later and Intel MPI 2019 Update 5 and later for HPC applications.

EFA basics

An EFA device can be attached to an EC2 instance in two ways:

  1. Using a traditional EFA interface, also called EFA with ENA, which creates both an EFA device and an ENA device.

  2. Using an EFA-only interface, which creates just the EFA device.

The EFA device provides capabilities such as built-in OS-bypass and congestion control through the Scalable Reliable Datagram (SRD) protocol. These capabilities enable low-latency, reliable transport that allows an EFA interface to deliver better application performance for HPC and ML applications on Amazon EC2, while the ENA device provides traditional IP networking.
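For example, the two attachment styles map to the interface-type parameter of the create-network-interface AWS CLI command (the subnet and security group IDs below are placeholders for your own values):

```shell
# Create a traditional EFA (EFA with ENA) interface, which includes
# both the EFA device and the ENA device for IP networking.
aws ec2 create-network-interface \
    --subnet-id subnet-0123456789abcdef0 \
    --groups sg-0123456789abcdef0 \
    --interface-type efa

# Create an EFA-only interface, which includes just the EFA device
# and has no IP networking capability.
aws ec2 create-network-interface \
    --subnet-id subnet-0123456789abcdef0 \
    --groups sg-0123456789abcdef0 \
    --interface-type efa-only
```

Both commands require AWS credentials and a VPC subnet in a Region and Availability Zone where the target instance type is available.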

The following contrasts a traditional HPC software stack with one that uses an EFA.

Traditionally, AI/ML applications use NCCL and HPC applications use the Message Passing Interface (MPI) to interface with the system's network transport. In the AWS Cloud, this has meant that applications interface with NCCL or MPI, which then uses the operating system's TCP/IP stack and the ENA device driver to enable network communication between instances.

With a traditional EFA (EFA with ENA) or EFA-only interface, AI/ML applications use NCCL and HPC applications use MPI, to interface directly with the Libfabric API. The Libfabric API bypasses the operating system kernel and communicates directly with the EFA device to put packets on the network. This reduces overhead and enables AI/ML and HPC applications to run more efficiently.

Note

Libfabric is a core component of the OpenFabrics Interfaces (OFI) framework, which defines and exports the user-space API of OFI. For more information, see the Libfabric OpenFabrics website.
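On an instance where the EFA software is installed, one way to confirm that Libfabric can reach the EFA device is to list the EFA provider with the fi_info utility that ships with Libfabric:

```shell
# List the Libfabric endpoints exposed by the EFA provider. Output is
# non-empty only on an instance with an attached EFA and the EFA
# software installed.
fi_info -p efa -t FI_EP_RDM
```

On success, the output lists `provider: efa` entries; an error indicates that the EFA device or software stack is not available on the instance.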

Differences between ENA, EFA, and EFA-only network interfaces

Amazon EC2 provides the following types of network interfaces:

  • ENA interfaces provide all of the traditional IP networking and routing features that are required to support IP networking for a VPC. For more information, see Enable enhanced networking with ENA on your EC2 instances.

  • EFA (EFA with ENA) interfaces provide both the ENA device for IP networking and the EFA device for low-latency, high-throughput communication.

  • EFA-only interfaces support only the EFA device capabilities, without the ENA device for traditional IP networking.

The following table provides a comparison of ENA, EFA, and EFA-only network interfaces.

| Feature | ENA | EFA (EFA with ENA) | EFA-only |
| --- | --- | --- | --- |
| Supports IP networking functionality | Yes | Yes | No |
| Can be assigned IPv4 or IPv6 addresses | Yes | Yes | No |
| Can be used as the primary network interface for the instance | Yes | Yes | No |
| Counts towards the ENI attachment limit for the instance | Yes | Yes | Yes |
| Instance type support | All Nitro-based instance types | Supported instance types | Supported instance types |
| Parameter name in the EC2 APIs | interface | efa | efa-only |
| Field name in the EC2 console | No selection | EFA with ENA | EFA-only |

Supported interfaces and libraries

EFAs support the following interfaces and libraries:

  • Open MPI 4 and later

    Note

Open MPI 4.0 or newer is preferred for Graviton-based instances.

  • Intel MPI 2019 Update 5 and later

  • NVIDIA Collective Communications Library (NCCL) 2.4.2 and later

  • AWS Neuron SDK version 2.3 and later
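If these components are installed, a quick way to check the installed versions against the minimums above (assuming the tools are on your PATH):

```shell
# Print installed versions of key EFA stack components, if present.
fi_info --version             # Libfabric version
mpirun --version | head -n 1  # Open MPI (Intel MPI also provides mpirun)
```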

Supported instance types

The following instance types support EFAs:

Nitro v5
  • General Purpose: m8g.24xlarge | m8g.48xlarge | m8g.metal-24xl | m8g.metal-48xl

  • Compute Optimized: c7gn.16xlarge | c7gn.metal | c8g.24xlarge | c8g.48xlarge | c8g.metal-24xl | c8g.metal-48xl

  • Memory Optimized: r8g.24xlarge | r8g.48xlarge | r8g.metal-24xl | r8g.metal-48xl | x8g.24xlarge | x8g.48xlarge | x8g.metal-24xl | x8g.metal-48xl

  • Storage Optimized: i7ie.48xlarge

  • Accelerated Computing: p5en.48xlarge | trn2.3xlarge | trn2.48xlarge | trn2a.3xlarge | trn2a.48xlarge | trn2n.3xlarge | trn2n.48xlarge | trn2p.48xlarge | trn2u.48xlarge

  • High Performance Computing: hpc7g.4xlarge | hpc7g.8xlarge | hpc7g.16xlarge

Nitro v4
  • General Purpose: m6a.48xlarge | m6a.metal | m6i.32xlarge | m6i.metal | m6id.32xlarge | m6id.metal | m6idn.32xlarge | m6idn.metal | m6in.32xlarge | m6in.metal | m7a.48xlarge | m7a.metal-48xl | m7g.16xlarge | m7g.metal | m7gd.16xlarge | m7gd.metal | m7i.48xlarge | m7i.metal-48xl

  • Compute Optimized: c6a.48xlarge | c6a.metal | c6gn.16xlarge | c6i.32xlarge | c6i.metal | c6id.32xlarge | c6id.metal | c6in.32xlarge | c6in.metal | c7a.48xlarge | c7a.metal-48xl | c7g.16xlarge | c7g.metal | c7gd.16xlarge | c7gd.metal | c7i.48xlarge | c7i.metal-48xl

  • Memory Optimized: r6a.48xlarge | r6a.metal | r6i.32xlarge | r6i.metal | r6idn.32xlarge | r6idn.metal | r6in.32xlarge | r6in.metal | r6id.32xlarge | r6id.metal | r7a.48xlarge | r7a.metal-48xl | r7g.16xlarge | r7g.metal | r7gd.16xlarge | r7gd.metal | r7i.48xlarge | r7i.metal-48xl | r7iz.32xlarge | r7iz.metal-32xl | u7i-12tb.224xlarge | u7in-16tb.224xlarge | u7in-24tb.224xlarge | u7in-32tb.224xlarge | x2idn.32xlarge | x2idn.metal | x2iedn.32xlarge | x2iedn.metal

  • Storage Optimized: i4g.16xlarge | i4i.32xlarge | i4i.metal | im4gn.16xlarge

  • Accelerated Computing: g6.8xlarge | g6.12xlarge | g6.16xlarge | g6.24xlarge | g6.48xlarge | g6e.8xlarge | g6e.12xlarge | g6e.16xlarge | g6e.24xlarge | g6e.48xlarge | gr6.8xlarge | p5.48xlarge | p5e.48xlarge | trn1.32xlarge | trn1n.32xlarge

  • High Performance Computing: hpc6a.48xlarge | hpc6id.32xlarge | hpc7a.12xlarge | hpc7a.24xlarge | hpc7a.48xlarge | hpc7a.96xlarge

Nitro v3
  • General Purpose: m5dn.24xlarge | m5dn.metal | m5n.24xlarge | m5n.metal | m5zn.12xlarge | m5zn.metal

  • Compute Optimized: c5n.9xlarge | c5n.18xlarge | c5n.metal

  • Memory Optimized: r5dn.24xlarge | r5dn.metal | r5n.24xlarge | r5n.metal | x2iezn.12xlarge | x2iezn.metal

  • Storage Optimized: i3en.12xlarge | i3en.24xlarge | i3en.metal

  • Accelerated Computing: dl1.24xlarge | dl2q.24xlarge | g4dn.8xlarge | g4dn.12xlarge | g4dn.16xlarge | g4dn.metal | g5.8xlarge | g5.12xlarge | g5.16xlarge | g5.24xlarge | g5.48xlarge | inf1.24xlarge | p3dn.24xlarge | p4d.24xlarge | p4de.24xlarge | vt1.24xlarge

To see the available instance types that support EFAs in a specific Region

The available instance types vary by Region. To see the available instance types that support EFAs in a Region, use the describe-instance-types command with the --region parameter. Include the --filters parameter to scope the results to the instance types that support EFA and the --query parameter to scope the output to the value of InstanceType.

aws ec2 describe-instance-types --region us-east-1 --filters Name=network-info.efa-supported,Values=true --query "InstanceTypes[*].[InstanceType]" --output text | sort

Supported operating systems

Operating system support differs depending on the processor type (Intel/AMD x86_64 or AWS Graviton arm64). The following operating systems are supported:

  • Amazon Linux 2023

  • Amazon Linux 2

  • RHEL 8 and 9

  • Debian 10, 11, and 12

  • Rocky Linux 8 and 9

  • Ubuntu 20.04, 22.04, and 24.04

  • SUSE Linux Enterprise 15 SP2 and later

  • OpenSUSE Leap 15.5 and later

Note

Ubuntu 20.04 supports peer direct when used with dl1.24xlarge instances.

EFA limitations

EFAs have the following limitations:

Note

EFA traffic refers to the traffic transmitted through the EFA device of either an EFA (EFA with ENA) or EFA-only interface.

  • EFA traffic between P4d/P4de/DL1 instances and other instance types is currently not supported.

  • Instance types that support multiple network cards can be configured with one EFA per network card. All other supported instance types support only one EFA per instance.

  • For c7g.16xlarge, m7g.16xlarge, and r7g.16xlarge instances, Dedicated Instances and Dedicated Hosts are not supported when an EFA is attached.

  • EFA traffic can't cross Availability Zones or VPCs. This does not apply to normal IP traffic from the ENA device of an EFA interface.

  • EFA traffic is not routable. Normal IP traffic from the ENA device of an EFA interface remains routable.

  • EFA is not supported on AWS Outposts.

  • The EFA device of an EFA (EFA with ENA) interface is supported on Windows instances only for applications based on the AWS Cloud Digital Interface Software Development Kit (AWS CDI SDK). If you attach an EFA (EFA with ENA) interface to a Windows instance for a non-CDI SDK based application, it functions as an ENA interface, without the added EFA device capabilities. The EFA-only interface is not supported for AWS CDI SDK based applications on Windows or Linux. For more information, see the AWS Cloud Digital Interface Software Development Kit (AWS CDI SDK) User Guide.
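Related to the per-network-card limit above, you can query how many EFA interfaces and network cards a given instance type supports using the describe-instance-types command; for example:

```shell
# Show EFA support, the maximum number of EFA interfaces, and the
# number of network cards for a given instance type.
aws ec2 describe-instance-types \
    --instance-types p5.48xlarge \
    --query "InstanceTypes[].{Type:InstanceType,EfaSupported:NetworkInfo.EfaSupported,MaxEfaInterfaces:NetworkInfo.EfaInfo.MaximumEfaInterfaces,NetworkCards:NetworkInfo.MaximumNetworkCards}" \
    --output table
```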

EFA pricing

EFA is available as an optional Amazon EC2 networking feature that you can enable on any supported instance at no additional cost.