


# Manage compute resources for AI/ML workloads on Amazon EKS
<a name="ml-compute-management"></a>

This section is designed to help you manage compute resources for machine learning workloads in Amazon Elastic Kubernetes Service (EKS). You’ll find details on reserving GPUs using Capacity Blocks for managed node groups and self-managed nodes, including prerequisites, launch template setup, scaling configurations, workload preparation, and key considerations for handling reservation lifecycles and graceful node termination.

**Topics**
+ [Create a managed node group with Capacity Blocks for ML](capacity-blocks-mng.md)
+ [Create self-managed nodes with Capacity Blocks for ML](capacity-blocks.md)
+ [Use P6e-GB200 UltraServers with Amazon EKS](ml-eks-nvidia-ultraserver.md)

# Create a managed node group with Capacity Blocks for ML
<a name="capacity-blocks-mng"></a>

Capacity Blocks for machine learning (ML) allow you to reserve GPU instances on a future date to support your short duration ML workloads. For more information, see [Capacity Blocks for ML](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) in the *Amazon EC2 User Guide for Linux Instances*.

## Considerations
<a name="capacity-blocks-mng-considerations"></a>

**Important**  
+ Capacity Blocks are only available for certain Amazon EC2 instance types and AWS Regions. For compatibility information, see [Work with Capacity Blocks Prerequisites](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-blocks-using.html#capacity-blocks-prerequisites) in the *Amazon EC2 User Guide for Linux Instances*.
+ For more information, see [Use Capacity Blocks for machine learning workloads](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-template-capacity-blocks.html) in the *Amazon EC2 Auto Scaling User Guide*.
+ Managed node groups with Capacity Blocks can only be created with custom launch templates.
+ When upgrading managed node groups with Capacity Blocks, make sure that the desired size of the node group is set to `0`.

## Create a managed node group with Amazon EC2 Capacity Blocks
<a name="capacity-blocks-mng-procedure"></a>

You can use Capacity Blocks with Amazon EKS managed node groups for provisioning and scaling GPU-accelerated worker nodes. The AWS CloudFormation template examples that follow don’t cover every aspect needed in a production cluster. Typically, you’d also want a bootstrapping script to join the node to the cluster and specify an Amazon EKS accelerated AMI. For more information, see [Create a managed node group for your cluster](create-managed-node-group.md).

1. Create a launch template that’s appropriate for your workloads and works with Amazon EKS managed node groups. For more information, see [Customize managed nodes with launch templates](launch-templates.md).

   In addition to the requirements in the above procedures, make sure that the `LaunchTemplateData` includes the following:
   +  `InstanceMarketOptions` with `MarketType` set to `"capacity-block"` 
   +  `CapacityReservationSpecification: CapacityReservationTarget` with `CapacityReservationId` set to the Capacity Block (for example: `cr-02168da1478b509e0`)
   +  `InstanceType` set to an instance type that supports Capacity Blocks (for example: *p5.48xlarge*)

     The following is an excerpt of a CloudFormation template that creates a launch template targeting a Capacity Block. To create a custom AMI managed node group, you can also add `ImageId` and `UserData` parameters.

     ```
     NodeLaunchTemplate:
       Type: "AWS::EC2::LaunchTemplate"
       Properties:
         LaunchTemplateData:
           InstanceMarketOptions:
             MarketType: "capacity-block"
           CapacityReservationSpecification:
             CapacityReservationTarget:
               CapacityReservationId: "cr-02168da1478b509e0"
           InstanceType: p5.48xlarge
     ```

1. Use the launch template to create a managed node group.

   The following is an example create node group command for Capacity Blocks. Replace *example-values* with ones applicable to your cluster.

   When creating the Capacity Block managed node group, do the following:
   + Set the `capacity-type` to `"CAPACITY_BLOCK"`. If the capacity type isn’t set to `"CAPACITY_BLOCK"`, or if any of the required launch template values listed above are missing, the create request is rejected.
   + When specifying `subnets` in the create request, specify only the subnet in the same Availability Zone as the capacity reservation.
   + If you specify a non-zero `desiredSize` in the create request, Amazon EKS honors it when creating the Auto Scaling group (ASG). However, if the create request is made before the capacity reservation is active, the ASG can’t launch Amazon EC2 instances until the reservation becomes active, and its scaling activities will have launch errors. When the reservation becomes active, instance launches succeed and the ASG scales up to the `desiredSize` specified at create time.

     ```
     aws eks create-nodegroup \
         --cluster-name my-cluster \
         --nodegroup-name my-mng \
         --node-role node-role-arn \
         --region region-code \
         --subnets subnet-id \
         --scaling-config minSize=node-group-min-size,maxSize=node-group-max-size,desiredSize=node-group-desired-size \
         --ami-type "AL2023_x86_64_NVIDIA" \
         --capacity-type "CAPACITY_BLOCK" \
         --launch-template id="lt-id",version=1
     ```
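
   Because Capacity Blocks are zonal, you can confirm the Availability Zone and state of your reservation before choosing subnets. The following is a sketch that uses the example reservation ID from above; replace it with your own:

   ```
   aws ec2 describe-capacity-reservations \
       --capacity-reservation-ids cr-02168da1478b509e0 \
       --query "CapacityReservations[0].{AZ:AvailabilityZone,State:State,Start:StartDate,End:EndDate}"
   ```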

1. Make sure that the nodes join after scale up. Amazon EKS clusters using managed node groups with Capacity Blocks don’t perform any validations that instances launched actually join and register with the cluster.
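
   For example, you can list the nodes that registered from the node group with `kubectl` (the node group name is an example placeholder):

   ```
   kubectl get nodes -l eks.amazonaws.com/nodegroup=my-mng
   ```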

1. If you set `desiredSize` to `0` at create time, then you have different options to scale up the node group when the capacity reservation becomes active:
   + Create a scheduled scaling policy for the ASG that aligns to the Capacity Block reservation start time. For more information, see [Scheduled scaling for Amazon EC2 Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-scheduled-scaling.html) in the *Amazon EC2 Auto Scaling User Guide*.
   + Use the Amazon EKS console or `eks update-nodegroup-config` to update the scaling config and set the desired size of the node group.
   + Use the Kubernetes Cluster Autoscaler. For more information, see [Cluster Autoscaler on AWS](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md).
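
   For example, you can update the scaling config with the AWS CLI once the reservation is active (names and sizes are placeholders):

   ```
   aws eks update-nodegroup-config \
       --cluster-name my-cluster \
       --nodegroup-name my-mng \
       --scaling-config minSize=0,maxSize=2,desiredSize=2
   ```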

1. The node group is now ready for workloads and Pods to be scheduled.

1. In order for your Pods to be gracefully drained before the reservation ends, Amazon EKS uses a scheduled scaling policy to scale the node group down to `0`. This scheduled action is named `Amazon EKS Node Group Capacity Scaledown Before Reservation End`. We recommend that you don’t edit or delete this action.

   Amazon EC2 starts shutting down the instances 30 minutes before the reservation end time. As a result, Amazon EKS sets up a scheduled scale-down on the node group 40 minutes prior to the reservation end in order to safely and gracefully evict Pods.
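
   If you want to inspect this scheduled action, you can list the scheduled actions on the node group’s ASG (the ASG name is a placeholder):

   ```
   aws autoscaling describe-scheduled-actions \
       --auto-scaling-group-name my-capacity-block-asg
   ```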

# Create self-managed nodes with Capacity Blocks for ML
<a name="capacity-blocks"></a>

Capacity Blocks for machine learning (ML) allow you to reserve GPU instances on a future date to support your short duration ML workloads. For more information, see [Capacity Blocks for ML](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) in the *Amazon EC2 User Guide for Linux Instances*.

## Considerations
<a name="capacity-blocks-considerations"></a>

**Important**  
+ Capacity Blocks are only available for certain Amazon EC2 instance types and AWS Regions. For compatibility information, see [Work with Capacity Blocks Prerequisites](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-blocks-using.html#capacity-blocks-prerequisites) in the *Amazon EC2 User Guide for Linux Instances*.
+ If you create a self-managed node group prior to the capacity reservation becoming active, then set the desired capacity to `0`.
+ To allow sufficient time to gracefully drain the node(s), we suggest that you schedule scaling to scale to zero more than 30 minutes before the Capacity Block reservation end time.
+ In order for your Pods to be gracefully drained, we recommend that you set up AWS Node Termination Handler as explained in the example steps.

## Use Capacity Blocks with self-managed nodes
<a name="capacity-blocks-procedure"></a>

You can use Capacity Blocks with Amazon EKS for provisioning and scaling your self-managed nodes. The following steps give a general example overview. The AWS CloudFormation template examples don’t cover every aspect needed in a production workload. Typically, you’d also want a bootstrapping script to join the node to the cluster, an Amazon EKS accelerated AMI, and an appropriate instance profile for joining the cluster. For more information, see [Create self-managed Amazon Linux nodes](launch-workers.md).

1. Create a launch template that’s applicable to your workload. For more information, see [Use Capacity Blocks for machine learning workloads](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-template-capacity-blocks.html) in the *Amazon EC2 Auto Scaling User Guide*.

   Make sure the `LaunchTemplateData` includes the following:
   +  `InstanceMarketOptions` with `MarketType` set to `"capacity-block"` 
   +  `CapacityReservationSpecification: CapacityReservationTarget` with `CapacityReservationId` set to the Capacity Block (for example: `cr-02168da1478b509e0`)
   +  `IamInstanceProfile` with the `Arn` set to the applicable *iam-instance-profile-arn* 
   +  `ImageId` set to the applicable *image-id* 
   +  `InstanceType` set to an instance type that supports Capacity Blocks (for example: *p5.48xlarge*)
   +  `SecurityGroupIds` set to the applicable IDs (for example: *sg-05b1d815d1EXAMPLE*)
   +  `UserData` set to the applicable *user-data* for your self-managed node group

     The following is an excerpt of a CloudFormation template that creates a launch template targeting a Capacity Block.

     ```
     NodeLaunchTemplate:
       Type: "AWS::EC2::LaunchTemplate"
       Properties:
         LaunchTemplateData:
           InstanceMarketOptions:
             MarketType: "capacity-block"
           CapacityReservationSpecification:
             CapacityReservationTarget:
               CapacityReservationId: "cr-02168da1478b509e0"
           IamInstanceProfile:
             Arn: iam-instance-profile-arn
           ImageId: image-id
           InstanceType: p5.48xlarge
           KeyName: key-name
           SecurityGroupIds:
           - sg-05b1d815d1EXAMPLE
           UserData: user-data
     ```

     You must pass the subnet in the Availability Zone in which the reservation is made because Capacity Blocks are zonal.

1. Use the launch template to create a self-managed node group. If you’re doing this prior to the capacity reservation becoming active, then set the desired capacity to `0`. When creating the node group, make sure that you specify only the subnet in the Availability Zone in which the capacity is reserved.

   The following is a sample CloudFormation template that you can reference when creating one that is applicable to your workload. This example gets the `LaunchTemplateId` and `Version` of the `AWS::EC2::LaunchTemplate` resource shown in the previous step. It also gets the values for `DesiredCapacity`, `MaxSize`, `MinSize`, and `VPCZoneIdentifier` that are declared elsewhere in the same template.

   ```
   NodeGroup:
     Type: "AWS::AutoScaling::AutoScalingGroup"
     Properties:
       DesiredCapacity: !Ref NodeAutoScalingGroupDesiredCapacity
       LaunchTemplate:
         LaunchTemplateId: !Ref NodeLaunchTemplate
         Version: !GetAtt NodeLaunchTemplate.LatestVersionNumber
       MaxSize: !Ref NodeAutoScalingGroupMaxSize
       MinSize: !Ref NodeAutoScalingGroupMinSize
       VPCZoneIdentifier: !Ref Subnets
       Tags:
         - Key: Name
           PropagateAtLaunch: true
           Value: !Sub ${ClusterName}-${NodeGroupName}-Node
         - Key: !Sub kubernetes.io/cluster/${ClusterName}
           PropagateAtLaunch: true
           Value: owned
   ```

1. Once the node group is created successfully, make sure to record the `NodeInstanceRole` for the node group that was created. You need this to make sure that when the node group is scaled, the new nodes join the cluster and Kubernetes is able to recognize the nodes. For more information, see the AWS Management Console instructions in [Create self-managed Amazon Linux nodes](launch-workers.md).
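
   If your cluster uses `aws-auth` ConfigMap-based authentication, the `NodeInstanceRole` is mapped to the cluster with a `mapRoles` entry like the following sketch (the role ARN is a placeholder):

   ```
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: aws-auth
     namespace: kube-system
   data:
     mapRoles: |
       - rolearn: <node-instance-role-arn>
         username: system:node:{{EC2PrivateDNSName}}
         groups:
           - system:bootstrappers
           - system:nodes
   ```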

1. We recommend that you create a scheduled scaling policy for the Auto Scaling group that aligns to the Capacity Block reservation times. For more information, see [Scheduled scaling for Amazon EC2 Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-scheduled-scaling.html) in the *Amazon EC2 Auto Scaling User Guide*.

   You can use all of the instances you reserved until 30 minutes before the end time of the Capacity Block. Instances that are still running at that time will start terminating. To allow sufficient time to gracefully drain the node(s), we suggest that you schedule scaling to scale to zero more than 30 minutes before the Capacity Block reservation end time.

   If you want to instead scale up manually whenever the capacity reservation becomes `Active`, then you need to update the Auto Scaling group’s desired capacity at the start time of the Capacity Block reservation. Then you would need to also scale down manually more than 30 minutes before the Capacity Block reservation end time.
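
   The scheduled scaling described above can be sketched with the AWS CLI as follows (the ASG name, times, and capacity are placeholders; schedule the scale-down more than 30 minutes before the reservation end time):

   ```
   # Scale up when the Capacity Block reservation starts
   aws autoscaling put-scheduled-update-group-action \
       --auto-scaling-group-name my-capacity-block-asg \
       --scheduled-action-name capacity-block-scale-up \
       --start-time "2025-01-15T11:00:00Z" \
       --desired-capacity 2

   # Scale to zero well before the reservation ends
   aws autoscaling put-scheduled-update-group-action \
       --auto-scaling-group-name my-capacity-block-asg \
       --scheduled-action-name capacity-block-scale-down \
       --start-time "2025-01-16T10:15:00Z" \
       --desired-capacity 0
   ```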

1. The node group is now ready for workloads and Pods to be scheduled.

1. In order for your Pods to be gracefully drained, we recommend that you set up AWS Node Termination Handler. This handler will be able to watch for "ASG Scale-in" lifecycle events from Amazon EC2 Auto Scaling using EventBridge and allow the Kubernetes control plane to take required action before the instance becomes unavailable. Otherwise, your Pods and Kubernetes objects will get stuck in a pending state. For more information, see [AWS Node Termination Handler](https://github.com/aws/aws-node-termination-handler) on GitHub.

   If you don’t set up a Node Termination Handler, we recommend that you start draining your Pods manually more than 30 minutes before the reservation end time so that they have enough time to be gracefully drained.
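
   For example, you can drain a node manually with `kubectl` (the node name is a placeholder):

   ```
   kubectl drain my-node-name --ignore-daemonsets --delete-emptydir-data
   ```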

# Use P6e-GB200 UltraServers with Amazon EKS
<a name="ml-eks-nvidia-ultraserver"></a>

This topic describes how to configure and use Amazon EKS with P6e-GB200 UltraServers. The `p6e-gb200.36xlarge` instance type with 4 NVIDIA Blackwell GPUs is only available as P6e-GB200 UltraServers. There are two types of P6e-GB200 UltraServers. The `u-p6e-gb200x36` UltraServer has 9 `p6e-gb200.36xlarge` instances and the `u-p6e-gb200x72` UltraServer has 18 `p6e-gb200.36xlarge` instances.

To learn more, see the [Amazon EC2 P6e-GB200 UltraServers webpage](https://aws.amazon.com/ec2/instance-types/p6/).

## Considerations
<a name="nvidia-ultraserver-considerations"></a>
+ Amazon EKS supports P6e-GB200 UltraServers for Kubernetes versions 1.33 and above. This Kubernetes version release provides support for [Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) (DRA), enabled by default in EKS and in the [AL2023 EKS-optimized accelerated AMIs](https://docs.aws.amazon.com/eks/latest/userguide/ml-eks-optimized-ami.html). DRA is a requirement to use the P6e-GB200 UltraServers with EKS. DRA is not supported in Karpenter or EKS Auto Mode, and it is recommended to use EKS self-managed node groups or EKS managed node groups when using the P6e-GB200 UltraServers with EKS.
+ P6e-GB200 UltraServers are made available through [EC2 Capacity Blocks for ML](https://aws.amazon.com/ec2/capacityblocks/). See [Manage compute resources for AI/ML workloads on Amazon EKS](ml-compute-management.md) for information on how to launch EKS nodes with Capacity Blocks.
+ When using EKS managed node groups with Capacity Blocks, you must use custom launch templates. When upgrading EKS managed node groups with P6e-GB200 UltraServers, you must set the desired size of the node group to `0` before upgrading.
+ It is recommended to use the AL2023 ARM NVIDIA variant of the EKS-optimized accelerated AMIs. This AMI includes the required node components and configuration to work with P6e-GB200 UltraServers. If you decide to build your own AMI, you are responsible for installing and validating the compatibility of the node and system software, including drivers. For more information, see [Use EKS-optimized accelerated AMIs for GPU instances](ml-eks-optimized-ami.md).
+ It is recommended to use EKS-optimized AMI release `v20251103` or later, which includes NVIDIA driver version 580. This NVIDIA driver version enables Coherent Driver-Based Memory (CDMM) to address potential memory over-reporting. When CDMM is enabled, the following capabilities are not supported: NVIDIA Multi-Instance GPU (MIG) and vGPU. For more information on CDMM, see [NVIDIA Coherent Driver-based Memory Management (CDMM)](https://nvdam.widen.net/s/gpqp6wmz7s/cuda-whitepaper—​cdmm-pdf).
+ When using the [NVIDIA GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) with the EKS-optimized AL2023 NVIDIA AMI, you must disable the operator installation of the driver and toolkit, as these are already included in the AMI. The EKS-optimized AL2023 NVIDIA AMIs do not include the NVIDIA Kubernetes device plugin or the NVIDIA DRA driver, and these must be installed separately.
+ Each `p6e-gb200.36xlarge` instance can be configured with up to 17 network cards and can leverage EFA for communication between UltraServers. Workload network traffic can cross UltraServers, but for highest performance it is recommended to schedule workloads in the same UltraServer leveraging IMEX for intra-UltraServer GPU communication. For more information, see [EFA configuration for P6e-GB200 instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e).
+ Each `p6e-gb200.36xlarge` instance has 3x 7.5TB [instance store storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html). By default, the EKS-optimized AMI does not format and mount the instance stores. The node’s ephemeral storage can be shared among pods that request ephemeral storage and container images that are downloaded to the node. If using the AL2023 EKS-optimized AMI, this can be configured as part of the nodes bootstrap in the user data by setting the instance local storage policy in [NodeConfig](https://docs.aws.amazon.com/eks/latest/eksctl/node-bootstrapping.html#configuring-the-bootstrapping-process) to RAID0. Setting to RAID0 stripes the instance stores and configures the container runtime and kubelet to make use of this ephemeral storage.
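
The RAID0 instance local storage policy mentioned above can be set through `NodeConfig` in the node’s user data. The following is a minimal sketch (cluster connection details omitted; merge it with your existing bootstrap configuration):

```
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  instance:
    localStorage:
      strategy: RAID0
```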

## Components
<a name="nvidia-ultraserver-components"></a>

The following components are recommended for running workloads on EKS with the P6e-GB200 UltraServers. You can optionally use the [NVIDIA GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) to install the NVIDIA node components. When using the NVIDIA GPU operator with the EKS-optimized AL2023 NVIDIA AMI, you must disable the operator installation of the driver and toolkit, as these are already included in the AMI.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/eks/latest/userguide/ml-eks-nvidia-ultraserver.html)

The node components in the table above perform the following functions:
+  **VPC CNI**: Allocates VPC IPs as the primary network interface for pods running on EKS
+  **EFA device plugin**: Allocates EFA devices as secondary networks for pods running on EKS, and is responsible for network traffic across P6e-GB200 UltraServers. For multi-node workloads, GPU-to-GPU traffic within an UltraServer can flow over multi-node NVLink.
+  **NVIDIA Kubernetes device plugin**: Allocates GPUs as devices for pods running on EKS. It is recommended to use the NVIDIA Kubernetes device plugin until the NVIDIA DRA driver GPU allocation functionality graduates from experimental. See the [NVIDIA DRA driver releases](https://github.com/NVIDIA/k8s-dra-driver-gpu/releases) for updated information.
+  **NVIDIA DRA driver**: Enables ComputeDomain custom resources that facilitate creation of IMEX domains that follow workloads running on P6e-GB200 UltraServers.
  + The ComputeDomain resource describes an Internode Memory Exchange (IMEX) domain. When workloads with a ResourceClaim for a ComputeDomain are deployed to the cluster, the NVIDIA DRA driver automatically creates an IMEX DaemonSet that runs on matching nodes and establishes the IMEX channel(s) between the nodes before the workload is started. To learn more about IMEX, see [overview of NVIDIA IMEX for multi-node NVLink systems](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html).
  + The NVIDIA DRA driver uses a clique ID label (`nvidia.com/gpu.clique`) applied by NVIDIA GFD that relays the knowledge of the network topology and NVLink domain.
  + It is a best practice to create a ComputeDomain per workload job.
+  **NVIDIA Node Feature Discovery (NFD)**: Required dependency for GFD to apply node labels based on discovered node-level attributes.
+  **NVIDIA GPU Feature Discovery (GFD)**: Applies an NVIDIA standard topology label called `nvidia.com/gpu.clique` to the nodes. Nodes with the same `nvidia.com/gpu.clique` value have multi-node NVLink reachability, and you can use pod affinities in your application to schedule pods to the same NVLink domain.

## Procedure
<a name="nvidia-ultraserver-procedure"></a>

The following section assumes you have an EKS cluster running Kubernetes version 1.33 or above with one or more node groups with P6e-GB200 UltraServers running the AL2023 ARM NVIDIA EKS-optimized accelerated AMI. See the links in [Manage compute resources for AI/ML workloads on Amazon EKS](ml-compute-management.md) for the prerequisite steps for EKS self-managed nodes and managed node groups.

The following procedure uses the components below.


| Name | Version | Description | 
| --- | --- | --- | 
|  NVIDIA GPU Operator  |  25.3.4  |  For lifecycle management of required plugins such as the NVIDIA Kubernetes device plugin and NFD/GFD.  | 
|  NVIDIA DRA Drivers  |  25.8.0  |  For ComputeDomain CRDs and IMEX domain management.  | 
|  EFA Device Plugin  |  0.5.14  |  For cross-UltraServer communication.  | 

## Install NVIDIA GPU operator
<a name="nvidia-ultraserver-gpu-operator"></a>

The NVIDIA GPU operator simplifies the management of components required to use GPUs in Kubernetes clusters. As the NVIDIA GPU driver and container toolkit are installed as part of the EKS-optimized accelerated AMI, these must be set to `false` in the Helm values configuration.

1. Create a Helm values file named `gpu-operator-values.yaml` with the following configuration.

   ```
   devicePlugin:
     enabled: true
   nfd:
     enabled: true
   gfd:
     enabled: true
   driver:
     enabled: false
   toolkit:
     enabled: false
   migManager:
     enabled: false
   ```

1. Install the NVIDIA GPU operator for your cluster using the `gpu-operator-values.yaml` file you created in the previous step.

   ```
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   ```

   ```
   helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --version v25.3.4 \
    --values gpu-operator-values.yaml
   ```

## Install NVIDIA DRA driver
<a name="nvidia-ultraserver-dra-driver"></a>

As of NVIDIA GPU operator version `v25.3.4`, the NVIDIA DRA driver must be installed separately. It is recommended to track the NVIDIA GPU operator [release notes](https://github.com/NVIDIA/gpu-operator/releases) as this may change in a future release.

1. Create a Helm values file named `dra-values.yaml` with the following configuration. Note the `nodeAffinity` and `tolerations` that configure the DRA driver to deploy only on nodes with an NVIDIA GPU.

   ```
   resources:
     gpus:
       enabled: false # set to false to disable experimental gpu support
     computeDomains:
       enabled: true
   
   controller:
     nodeSelector: null
     affinity: null
     tolerations: []
   
   kubeletPlugin:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: "nvidia.com/gpu.present"
               operator: In
               values:
               - "true"
     tolerations:
       - key: "nvidia.com/gpu"
         operator: Exists
         effect: NoSchedule
   ```

1. Install the NVIDIA DRA driver for your cluster using the `dra-values.yaml` file you created in the previous step.

   ```
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   ```

   ```
   helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
     --version="25.8.0" \
     --namespace nvidia-dra-driver-gpu \
     --create-namespace \
     -f dra-values.yaml
   ```

1. After installation, the DRA driver creates `DeviceClass` resources that enable Kubernetes to understand and allocate `ComputeDomain` resources, making IMEX management possible for distributed GPU workloads on P6e-GB200 UltraServers.

   Confirm the DRA resources are available with the following commands.

   ```
   kubectl api-resources | grep resource.k8s.io
   ```

   ```
   deviceclasses           resource.k8s.io/v1  false        DeviceClass
   resourceclaims          resource.k8s.io/v1  true         ResourceClaim
   resourceclaimtemplates  resource.k8s.io/v1  true         ResourceClaimTemplate
   resourceslices          resource.k8s.io/v1  false        ResourceSlice
   ```

   ```
   kubectl get deviceclasses
   ```

   ```
   NAME
   compute-domain-daemon.nvidia.com
   compute-domain-default-channel.nvidia.com
   ```

## Install the EFA device plugin
<a name="nvidia-ultraserver-efa-plugin"></a>

To use EFA communication between UltraServers, you must install the Kubernetes device plugin for EFA. P6e-GB200 instances can be configured with up to [17 network cards](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e). The primary NCI (index 0) must be of type `interface` and supports up to 100 Gbps of ENA bandwidth. Configure your EFA and ENA interfaces as per your requirements during node provisioning. Review the [EFA configuration for P6e-GB200 instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e) AWS documentation for more details on EFA configuration.

1. Create a Helm values file named `efa-values.yaml` with the following configuration.

   ```
   tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   ```

1. Install the EFA device plugin for your cluster using the `efa-values.yaml` file you created in the previous step.

   ```
   helm repo add eks https://aws.github.io/eks-charts
   helm repo update
   ```

   ```
   helm install efa eks/aws-efa-k8s-device-plugin -n kube-system \
     --version="0.5.14" \
     -f efa-values.yaml
   ```

   As an example, if you configured your instances with one EFA-only interface in each [NCI group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e), you should see 4 allocatable EFA devices per node when describing a node.

   ```
   kubectl describe node/<gb200-node-name>
   ```

   ```
   Capacity:
     ...
     vpc.amazonaws.com/efa:  4
   Allocatable:
     ...
     vpc.amazonaws.com/efa:  4
   ```

## Validate IMEX over Multi-Node NVLink
<a name="nvidia-ultraserver-imex-nvlink"></a>

For a multi-node NVLink NCCL test and other micro-benchmarks, review the [awsome-distributed-training](https://github.com/aws-samples/awsome-distributed-training/tree/main/micro-benchmarks/nccl-tests) GitHub repository. The following steps show how to run a multi-node NVLink test with nvbandwidth.

1. To run a multi-node bandwidth test across two nodes in the NVL72 domain, first install the MPI operator:

   ```
   kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.7.0/mpi-operator.yaml
   ```

1. Create a Helm values file named `nvbandwidth-test-job.yaml` that defines the test manifest. Note the `nvidia.com/gpu.clique` pod affinity to schedule the workers in the same NVLink domain which has Multi-Node NVLink reachability. The sample below runs a multi-node device-to-device CE Read memcpy test using cuMemcpyAsync and prints the results in the logs.

   As of NVIDIA DRA Driver version `v25.8.0`, ComputeDomains are elastic and `.spec.numNodes` can be set to `0` in the ComputeDomain definition. Review the latest [NVIDIA DRA Driver release notes](https://github.com/NVIDIA/k8s-dra-driver-gpu) for updates.

   ```
   ---
   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: nvbandwidth-test-compute-domain
   spec:
     numNodes: 0 # This can be set to 0 from NVIDIA DRA Driver version v25.8.0+
     channel:
       resourceClaimTemplate:
         name: nvbandwidth-test-compute-domain-channel
   
   ---
   apiVersion: kubeflow.org/v2beta1
   kind: MPIJob
   metadata:
     name: nvbandwidth-test
   spec:
     slotsPerWorker: 4 # 4 GPUs per worker node
     launcherCreationPolicy: WaitForWorkersReady
     runPolicy:
       cleanPodPolicy: Running
     sshAuthMountPath: /home/mpiuser/.ssh
     mpiReplicaSpecs:
       Launcher:
         replicas: 1
         template:
           metadata:
             labels:
               nvbandwidth-test-replica: mpi-launcher
           spec:
             affinity:
               nodeAffinity:
                 requiredDuringSchedulingIgnoredDuringExecution:
                   nodeSelectorTerms:
                   - matchExpressions:
                     # Only schedule on NVIDIA GB200/GB300 nodes
                     - key: node.kubernetes.io/instance-type
                       operator: In
                       values:
                       - p6e-gb200.36xlarge
                       - p6e-gb300.36xlarge
             containers:
             - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
               name: mpi-launcher
               securityContext:
                 runAsUser: 1000
               command:
               - mpirun
               args:
               - --bind-to
               - core
               - --map-by
               - ppr:4:node
               - -np
               - "8"
               - --report-bindings
               - -q
               - nvbandwidth
               - -t
               - multinode_device_to_device_memcpy_read_ce
       Worker:
         replicas: 2 # 2 worker nodes
         template:
           metadata:
             labels:
               nvbandwidth-test-replica: mpi-worker
           spec:
             affinity:
               nodeAffinity:
                 requiredDuringSchedulingIgnoredDuringExecution:
                   nodeSelectorTerms:
                   - matchExpressions:
                     # Only schedule on NVIDIA GB200/GB300 nodes
                     - key: node.kubernetes.io/instance-type
                       operator: In
                       values:
                       - p6e-gb200.36xlarge
                       - p6e-gb300.36xlarge
               podAffinity:
                 requiredDuringSchedulingIgnoredDuringExecution:
                 - labelSelector:
                     matchExpressions:
                     - key: nvbandwidth-test-replica
                       operator: In
                       values:
                       - mpi-worker
                   topologyKey: nvidia.com/gpu.clique
             containers:
             - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
               name: mpi-worker
               securityContext:
                 runAsUser: 1000
                command:
               - /usr/sbin/sshd
               args:
               - -De
               - -f
               - /home/mpiuser/.sshd_config
               resources:
                 limits:
                   nvidia.com/gpu: 4  # Request 4 GPUs per worker
                 claims:
                 - name: compute-domain-channel # Link to IMEX channel
             resourceClaims:
             - name: compute-domain-channel
               resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
   ```
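   Note how the MPI launch geometry lines up with the spec above: `-np` must equal `Worker.replicas` multiplied by `slotsPerWorker`, and `--map-by ppr:4:node` places four ranks on each node to match the four GPUs that each worker requests. A quick sanity check of that arithmetic (plain shell, no cluster required; the variable names are illustrative):

   ```shell
   REPLICAS=2          # Worker.replicas in the MPIJob spec
   SLOTS_PER_WORKER=4  # slotsPerWorker (GPUs per worker node)
   # Total MPI ranks; this must match the -np argument in the launcher (8)
   echo $((REPLICAS * SLOTS_PER_WORKER))
   ```

   If you change the worker count or GPUs per worker, update `-np` (and `ppr:<n>:node`) to match, or the job won't launch the expected number of ranks.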

1. Create the ComputeDomain and start the job with the following command.

   ```
   kubectl apply -f nvbandwidth-test-job.yaml
   ```

1. After the ComputeDomain is created, confirm that the workload's ComputeDomain has two nodes:

   ```
   kubectl get computedomains.resource.nvidia.com -o yaml
   ```

   ```
   status:
     nodes:
     - cliqueID: <ClusterUUID>.<Clique ID>
       ipAddress: <node-ip>
       name: <node-hostname>
     - cliqueID: <ClusterUUID>.<Clique ID>
       ipAddress: <node-ip>
       name: <node-hostname>
     status: Ready
   ```
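   One `cliqueID` entry appears per node, so you can also count the entries instead of scanning the full YAML. A minimal sketch, assuming you have saved the status output to a file (the file name `cd-status.yaml` and the values are illustrative):

   ```shell
   # Stand-in for saving the live output of:
   #   kubectl get computedomains.resource.nvidia.com -o yaml > cd-status.yaml
   printf '%s\n' \
     'status:' \
     '  nodes:' \
     '  - cliqueID: example-cluster-uuid.0' \
     '    name: node-a' \
     '  - cliqueID: example-cluster-uuid.0' \
     '    name: node-b' \
     '  status: Ready' > cd-status.yaml

   # One cliqueID per node; expect 2 for a two-node ComputeDomain
   grep -c 'cliqueID:' cd-status.yaml
   ```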

1. Review the results of the job with the following command.

   ```
   kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
   ```

   A successful test shows bandwidth statistics in GB/s for the multi-node memcpy test. Example output from a successful test is shown below.

   ```
   ...
   nvbandwidth Version: ...
   Built from Git version: ...
   
   MPI version: ...
   CUDA Runtime Version: ...
   CUDA Driver Version: ...
   Driver Version: ...
   
   Process 0 (nvbandwidth-test-worker-0): device 0: NVIDIA GB200 (...)
   Process 1 (nvbandwidth-test-worker-0): device 1: NVIDIA GB200 (...)
   Process 2 (nvbandwidth-test-worker-0): device 2: NVIDIA GB200 (...)
   Process 3 (nvbandwidth-test-worker-0): device 3: NVIDIA GB200 (...)
   Process 4 (nvbandwidth-test-worker-1): device 0: NVIDIA GB200 (...)
   Process 5 (nvbandwidth-test-worker-1): device 1: NVIDIA GB200 (...)
   Process 6 (nvbandwidth-test-worker-1): device 2: NVIDIA GB200 (...)
   Process 7 (nvbandwidth-test-worker-1): device 3: NVIDIA GB200 (...)
   
   Running multinode_device_to_device_memcpy_read_ce.
   memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
              0         1         2         3         4         5         6         7
    0       N/A    821.45    822.18    821.73    822.05    821.38    822.61    821.89
    1    822.34       N/A    821.67    822.12    821.94    820.87    821.53    822.08
    2    821.76    822.29       N/A    821.58    822.43    821.15    821.82    822.31
    3    822.19    821.84    822.05       N/A    821.67    821.23    820.95    822.47
    4    821.63    822.38    821.49    822.17       N/A    821.06    821.78    822.22
    5    822.08    821.52    821.89    822.35    821.27       N/A    821.64    822.13
    6    821.94    822.15    821.68    822.04    821.39    820.92       N/A    822.56
    7    822.27    821.73    822.11    821.86    822.38    821.04    821.49       N/A
   
   SUM multinode_device_to_device_memcpy_read_ce ...
   
   NOTE: The reported results may not reflect the full capabilities of the platform.
   Performance can vary with software drivers, hardware clocks, and system topology.
   ```
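   To pull just the aggregate result out of a long log, you can filter for the `SUM` line. A sketch over a saved excerpt (the file name and bandwidth value are illustrative; on a live cluster you would pipe the `kubectl logs` command above into `grep` instead):

   ```shell
   # Stand-in for saving the live output of:
   #   kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher > nvbandwidth.log
   printf '%s\n' \
     'memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)' \
     'SUM multinode_device_to_device_memcpy_read_ce 46012.34' > nvbandwidth.log

   # Print only the aggregate bandwidth line
   grep '^SUM' nvbandwidth.log
   ```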

1. When the test is complete, clean up the resources that you created with the following command.

   ```
   kubectl delete -f nvbandwidth-test-job.yaml
   ```