Elastic Fabric Adapter for AI/ML and HPC workloads on Amazon EC2

An Elastic Fabric Adapter (EFA) is a network device that you can attach to your Amazon EC2 instance to accelerate Artificial Intelligence (AI), Machine Learning (ML), and High Performance Computing (HPC) applications. EFA enables you to achieve the application performance of an on-premises AI/ML or HPC cluster, with the scalability, flexibility, and elasticity provided by the AWS Cloud.

EFA provides lower and more consistent latency and higher throughput than the TCP transport traditionally used in cloud-based HPC systems. It enhances the performance of inter-instance communication that is critical for scaling AI/ML and HPC applications. It is optimized to work on the existing AWS network infrastructure and it can scale depending on application requirements.

EFA integrates with Libfabric 1.7.0 and later. It supports the NVIDIA Collective Communications Library (NCCL) for AI and ML applications, as well as Open MPI 4 and later and Intel MPI 2019 Update 5 and later for HPC applications.

EFA supports RDMA (Remote Direct Memory Access) write on most supported instance types with Nitro version 4 and later. RDMA read is supported on all instance types with Nitro version 4 and later. For more information, see Supported instance types.

EFA basics

An EFA device can be attached to an EC2 instance in two ways:

  1. Using a traditional EFA interface, also called EFA with ENA, which creates both an EFA device and an ENA device.

  2. Using an EFA-only interface, which creates just the EFA device.

The EFA device provides capabilities such as built-in OS bypass and congestion control through the Scalable Reliable Datagram (SRD) protocol. These capabilities enable low-latency, reliable transport, which allows the EFA interface to deliver better application performance for HPC and ML applications on Amazon EC2. The ENA device, by contrast, provides traditional IP networking.
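For example, the following AWS CLI command sketches one way to launch an instance with a traditional EFA (EFA with ENA) interface; the AMI ID, security group ID, and subnet ID are placeholders, and the instance type is only one of the supported types. Changing InterfaceType=efa to InterfaceType=efa-only would request an EFA-only interface instead.

# Launch an instance with an EFA (EFA with ENA) interface as its primary network interface (placeholder IDs)
aws ec2 run-instances --image-id ami-0123456789example --instance-type p5.48xlarge --count 1 --network-interfaces "DeviceIndex=0,InterfaceType=efa,Groups=sg-0123456789example,SubnetId=subnet-0123456789example"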

The following paragraphs contrast a traditional HPC software stack with one that uses an EFA.

Traditionally, AI/ML applications use NCCL and HPC applications use the Message Passing Interface (MPI) to interface with the system's network transport. In the AWS cloud, this has meant that applications interface with NCCL or MPI, which then uses the operating system's TCP/IP stack and the ENA device driver to enable network communication between instances.

With a traditional EFA (EFA with ENA) or EFA-only interface, AI/ML applications use NCCL and HPC applications use MPI, to interface directly with the Libfabric API. The Libfabric API bypasses the operating system kernel and communicates directly with the EFA device to put packets on the network. This reduces overhead and enables AI/ML and HPC applications to run more efficiently.
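As an illustrative check (assuming the EFA software, which includes Libfabric and the fi_info utility, is installed on the instance), you can confirm that the Libfabric EFA provider is available:

# List the Libfabric EFA provider to confirm the kernel-bypass transport is visible to applications
fi_info -p efa

If the EFA device is attached and the software is installed, the command prints one or more entries for the efa provider; if nothing is returned, applications typically fall back to TCP over the ENA device.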

Note

Libfabric is a core component of the OpenFabrics Interfaces (OFI) framework, which defines and exports the user-space API of OFI. For more information, see the Libfabric OpenFabrics website.

Differences between ENA, EFA, and EFA-only network interfaces

Amazon EC2 provides the following types of network interfaces:

  • ENA interfaces provide all of the traditional IP networking and routing features that are required to support IP networking for a VPC. For more information, see Enable enhanced networking with ENA on your EC2 instances.

  • EFA (EFA with ENA) interfaces provide both the ENA device for IP networking and the EFA device for low-latency, high-throughput communication.

  • EFA-only interfaces support only the EFA device capabilities, without the ENA device for traditional IP networking.

The following table provides a comparison of ENA, EFA, and EFA-only network interfaces.

Capability | ENA | EFA (EFA with ENA) | EFA-only
Supports IP networking functionality | Yes | Yes | No
Can be assigned IPv4 or IPv6 addresses | Yes | Yes | No
Can be used as primary network interface for instance | Yes | Yes | No
Counts towards ENI attachment limit for instance | Yes | Yes | Yes
Instance type support | Supported on all Nitro-based instance types | Supported instance types | Supported instance types
Parameter naming in EC2 APIs | interface | efa | efa-only
Field naming in EC2 console | No selection | EFA with ENA | EFA-only
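For example, using the API parameter names from the table above, the following AWS CLI command sketches how you might create a standalone EFA-only network interface; the subnet and security group IDs are placeholders, and specifying --interface-type efa would create a traditional EFA (EFA with ENA) interface instead.

# Create a standalone EFA-only network interface (placeholder subnet and security group IDs)
aws ec2 create-network-interface --subnet-id subnet-0123456789example --groups sg-0123456789example --interface-type efa-only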

Supported interfaces and libraries

EFAs support the following interfaces and libraries:

  • Open MPI 4 and later

    Note

    Open MPI 4.0 or newer is preferred for Graviton-based instances.

  • Intel MPI 2019 Update 5 and later

  • NVIDIA Collective Communications Library (NCCL) 2.4.2 and later

  • AWS Neuron SDK version 2.3 and later
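As a rough sketch of how these libraries are used with EFA, the following Open MPI command runs a hypothetical application (my_hpc_app, with a hypothetical hosts file named my-hosts) with one process on each of two instances, exporting the Libfabric FI_PROVIDER variable so that the EFA provider is selected. In many setups Libfabric selects the EFA provider automatically and the variable is unnecessary.

# Run one process per instance on two instances, exporting FI_PROVIDER so Libfabric uses the EFA provider
mpirun -n 2 -N 1 --hostfile my-hosts -x FI_PROVIDER=efa ./my_hpc_app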

Supported instance types

The following instance types support EFAs:

Nitro v5
Instance type RDMA read support RDMA write support
General Purpose
m8g.24xlarge
m8g.48xlarge
m8g.metal-24xl
m8g.metal-48xl
Compute Optimized
c7gn.16xlarge
c7gn.metal
c8g.24xlarge
c8g.48xlarge
c8g.metal-24xl
c8g.metal-48xl
Memory Optimized
r8g.24xlarge
r8g.48xlarge
r8g.metal-24xl
r8g.metal-48xl
x8g.24xlarge
x8g.48xlarge
x8g.metal-24xl
x8g.metal-48xl
Storage Optimized
i7ie.48xlarge
Accelerated Computing
p5en.48xlarge
trn2.48xlarge
trn2u.48xlarge
High Performance Computing
hpc7g.4xlarge
hpc7g.8xlarge
hpc7g.16xlarge
Nitro v4
Instance type RDMA read support RDMA write support
General Purpose
m6a.48xlarge
m6a.metal
m6i.32xlarge
m6i.metal
m6id.32xlarge
m6id.metal
m6idn.32xlarge
m6idn.metal
m6in.32xlarge
m6in.metal
m7a.48xlarge
m7a.metal-48xl
m7g.16xlarge
m7g.metal
m7gd.16xlarge
m7gd.metal
m7i.48xlarge
m7i.metal-48xl
Compute Optimized
c6a.48xlarge
c6a.metal
c6gn.16xlarge
c6i.32xlarge
c6i.metal
c6id.32xlarge
c6id.metal
c6in.32xlarge
c6in.metal
c7a.48xlarge
c7a.metal-48xl
c7g.16xlarge
c7g.metal
c7gd.16xlarge
c7gd.metal
c7i.48xlarge
c7i.metal-48xl
Memory Optimized
r6a.48xlarge
r6a.metal
r6i.32xlarge
r6i.metal
r6idn.32xlarge
r6idn.metal
r6in.32xlarge
r6in.metal
r6id.32xlarge
r6id.metal
r7a.48xlarge
r7a.metal-48xl
r7g.16xlarge
r7g.metal
r7gd.16xlarge
r7gd.metal
r7i.48xlarge
r7i.metal-48xl
r7iz.32xlarge
r7iz.metal-32xl
u7i-12tb.224xlarge
u7in-16tb.224xlarge
u7in-24tb.224xlarge
u7in-32tb.224xlarge
x2idn.32xlarge
x2idn.metal
x2iedn.32xlarge
x2iedn.metal
Storage Optimized
i4g.16xlarge
i4i.32xlarge
i4i.metal
im4gn.16xlarge
Accelerated Computing
f2.48xlarge
g6.8xlarge
g6.12xlarge
g6.16xlarge
g6.24xlarge
g6.48xlarge
g6e.8xlarge
g6e.12xlarge
g6e.16xlarge
g6e.24xlarge
g6e.48xlarge
gr6.8xlarge
p5.48xlarge
p5e.48xlarge
trn1.32xlarge
trn1n.32xlarge
High Performance Computing
hpc6a.48xlarge
hpc6id.32xlarge
hpc7a.12xlarge
hpc7a.24xlarge
hpc7a.48xlarge
hpc7a.96xlarge
Nitro v3
Instance type RDMA read support RDMA write support
General Purpose
m5dn.24xlarge
m5dn.metal
m5n.24xlarge
m5n.metal
m5zn.12xlarge
m5zn.metal
Compute Optimized
c5n.9xlarge
c5n.18xlarge
c5n.metal
Memory Optimized
r5dn.24xlarge
r5dn.metal
r5n.24xlarge
r5n.metal
x2iezn.12xlarge
x2iezn.metal
Storage Optimized
i3en.12xlarge
i3en.24xlarge
i3en.metal
Accelerated Computing
dl1.24xlarge
dl2q.24xlarge
g4dn.8xlarge
g4dn.12xlarge
g4dn.16xlarge
g4dn.metal
g5.8xlarge
g5.12xlarge
g5.16xlarge
g5.24xlarge
g5.48xlarge
inf1.24xlarge
p3dn.24xlarge
p4d.24xlarge
p4de.24xlarge
vt1.24xlarge

To see the available instance types that support EFAs in a specific Region

The available instance types vary by Region. To see the available instance types that support EFAs in a Region, use the describe-instance-types command with the --region parameter. Include the --filters parameter to scope the results to the instance types that support EFA and the --query parameter to scope the output to the value of InstanceType.

aws ec2 describe-instance-types --region us-east-1 --filters Name=network-info.efa-supported,Values=true --query "InstanceTypes[*].[InstanceType]" --output text | sort
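Because instance types that support multiple network cards can be configured with one EFA per network card, you might also want to see the maximum number of EFA interfaces each instance type supports. The following variant of the previous command (using output field names returned by describe-instance-types) queries that value:

# List EFA-capable instance types along with the maximum number of EFA interfaces each supports
aws ec2 describe-instance-types --region us-east-1 --filters Name=network-info.efa-supported,Values=true --query "InstanceTypes[].{Type:InstanceType,MaxEfaInterfaces:NetworkInfo.EfaInfo.MaximumEfaInterfaces}" --output table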

Supported operating systems

Operating system support differs depending on the processor type. The following table shows the supported operating systems.

Operating system Intel/AMD (x86_64) instance types AWS Graviton (arm64) instance types
Amazon Linux 2023
Amazon Linux 2
RHEL 8 and 9
Debian 10, 11, and 12
Rocky Linux 8 and 9
Ubuntu 20.04, 22.04, and 24.04
SUSE Linux Enterprise 15 SP2 and later
OpenSUSE Leap 15.5 and later
Note

Ubuntu 20.04 supports peer direct when used with dl1.24xlarge instances.

EFA limitations

EFAs have the following limitations:

Note

EFA traffic refers to the traffic transmitted through the EFA device of either an EFA (EFA with ENA) or EFA-only interface.

  • RDMA write is not supported with all instance types. For more information, see Supported instance types.

  • EFA traffic between P4d/P4de/DL1 instances and other instance types is currently not supported.

  • Instance types that support multiple network cards can be configured with one EFA per network card. All other supported instance types support only one EFA per instance.

  • For c7g.16xlarge, m7g.16xlarge, and r7g.16xlarge instances, Dedicated Instances and Dedicated Hosts are not supported when an EFA is attached.

  • EFA traffic can't cross Availability Zones or VPCs. This does not apply to normal IP traffic from the ENA device of an EFA interface.

  • EFA traffic is not routable. Normal IP traffic from the ENA device of an EFA interface remains routable.

  • EFA is not supported on AWS Outposts.

  • The EFA device of an EFA (EFA with ENA) interface is supported on Windows instances only for AWS Cloud Digital Interface Software Development Kit (AWS CDI SDK) based applications. If you attach an EFA (EFA with ENA) interface to a Windows instance for applications that are not based on the AWS CDI SDK, it functions as an ENA interface, without the added EFA device capabilities. The EFA-only interface is not supported by AWS CDI SDK based applications on Windows or Linux. For more information, see the AWS Cloud Digital Interface Software Development Kit (AWS CDI SDK) User Guide.

EFA pricing

EFA is available as an optional Amazon EC2 networking feature that you can enable on any supported instance at no additional cost.