
Manage NVIDIA GPU devices on Amazon EKS

NVIDIA GPUs are widely used for machine learning training, inference, and high-performance computing workloads. Amazon EKS supports two mechanisms for managing NVIDIA GPU devices in your EKS clusters: the NVIDIA DRA driver for GPUs and the NVIDIA Kubernetes device plugin.

We recommend the NVIDIA DRA driver for new deployments on clusters running Kubernetes version 1.34 or later with EKS managed node groups or self-managed node groups. The NVIDIA DRA driver enables flexible GPU allocation and GPU sharing between containers.

The NVIDIA DRA driver is not supported with Karpenter or EKS Auto Mode. Use the NVIDIA device plugin with Karpenter and EKS Auto Mode. The NVIDIA device plugin also remains supported for EKS managed node groups and self-managed nodes.

NVIDIA DRA driver vs. NVIDIA device plugin

| Capability | NVIDIA DRA driver | NVIDIA device plugin |
|---|---|---|
| Minimum Kubernetes version | 1.34 | All EKS-supported Kubernetes versions |
| EKS compute | Managed node groups, self-managed nodes | EKS Auto Mode, Karpenter, managed node groups, self-managed nodes |
| EKS-optimized AMIs | AL2023 (NVIDIA) | AL2023 (NVIDIA), Bottlerocket |
| Device advertisement | Rich attributes via ResourceSlice objects, including GPU model, memory, driver version, and topology | Integer count of nvidia.com/gpu extended resources |
| GPU sharing | Multiple containers in the same Pod can share a GPU through shared ResourceClaim references | Not supported; each GPU is exclusively allocated to one container |
| ComputeDomains | Manages Multi-Node NVLink (MNNVL) through ComputeDomain resources for secure multi-node GPU communication | Not supported |
| Attribute-based selection | Filter GPUs by model, memory, or other attributes using CEL expressions | Not supported |
| Topology-aware EFA allocation | DRA-native topology awareness | Automatic topology awareness (EKS-optimized AL2023 AMIs only) |
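
To illustrate attribute-based selection, the following sketch requests a GPU whose advertised memory capacity is at least 40Gi using a CEL selector. The capacity name (memory) and the claim name (large-memory-gpu) are assumptions for illustration; verify the attribute and capacity names the driver actually publishes against the ResourceSlice objects in your cluster.

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: large-memory-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          count: 1
          selectors:
          - cel:
              # Assumption: the driver publishes GPU memory as the
              # 'memory' capacity in the gpu.nvidia.com domain.
              expression: "device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('40Gi')) >= 0"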

Install the NVIDIA DRA driver

The NVIDIA DRA driver for GPUs manages two types of resources: GPUs and ComputeDomains. It runs two DRA kubelet plugins: gpu-kubelet-plugin and compute-domain-kubelet-plugin. Each can be enabled or disabled separately during installation. This guide focuses on GPU allocation. For using ComputeDomains, see Use P6e-GB200 UltraServers with Amazon EKS.

Prerequisites

  • An Amazon EKS cluster running Kubernetes version 1.34 or later with EKS managed node groups or self-managed node groups.

  • Nodes with NVIDIA GPU instance types (such as P or G instances).

  • Nodes with host-level components installed for NVIDIA GPUs. When using the EKS-optimized AL2023 or Bottlerocket NVIDIA AMIs, the host-level NVIDIA driver, CUDA user mode driver, and container toolkit are pre-installed.

  • Helm installed in your command-line environment. For more information, see the Setup Helm instructions.

  • kubectl configured to communicate with your cluster. For more information, see Install or update kubectl.

Procedure

Important

The NVIDIA DRA driver cannot be deployed alongside the NVIDIA device plugin on the same node. See upstream Kubernetes KEP-5004 for updates.

Important

Using the NVIDIA DRA driver with Bottlerocket is not currently supported.

  1. Add the NVIDIA DRA driver Helm chart repository.

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  2. Update your local Helm repository.

    helm repo update
  3. Get the latest version of the NVIDIA DRA driver.

    helm search repo nvidia/nvidia-dra
  4. Install the NVIDIA DRA driver.

    helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
      --create-namespace \
      --namespace nvidia-dra-driver-gpu \
      --set resources.computeDomains.enabled=false \
      --set 'gpuResourcesEnabledOverride=true'

    For advanced configuration options, see the NVIDIA DRA driver Helm chart.

  5. Verify that the DRA driver pods are running.

    kubectl get pods -n nvidia-dra-driver-gpu
  6. Verify that the DeviceClass objects were created.

    kubectl get deviceclass
    NAME             AGE
    gpu.nvidia.com   60s
  7. Verify that ResourceSlice objects are published for your GPU nodes.

    kubectl get resourceslice
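
     Each ResourceSlice advertises the GPUs on a node together with their attributes and capacities. As an abbreviated, illustrative sketch only (the attribute names, values, and node name below are hypothetical; inspect the actual objects with kubectl get resourceslice -o yaml):

    apiVersion: resource.k8s.io/v1
    kind: ResourceSlice
    spec:
      driver: gpu.nvidia.com
      nodeName: ip-192-168-11-225.us-west-2.compute.internal
      devices:
      - name: gpu-0
        attributes:
          productName:
            string: NVIDIA L4
        capacity:
          memory:
            value: 23034Mi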

    To request NVIDIA GPUs using the DRA driver, create a ResourceClaimTemplate that references the gpu.nvidia.com DeviceClass and reference it in your Pod specification. The following example requests a single GPU. See Topology-aware EFA and GPU/Neuron device allocation for steps to allocate NVIDIA GPUs with topology-aligned EFA interfaces.

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: single-gpu
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            exactly:
              deviceClassName: gpu.nvidia.com
              count: 1
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-workload
    spec:
      containers:
      - name: app
        ...
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
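
     Because claims listed in resourceClaims are Pod-scoped, two containers that reference the same claim name share the allocated GPU. A sketch of this pattern follows; the claim, container, and image names (shared-gpu, inference, monitor) are illustrative assumptions, not part of the driver.

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: shared-gpu
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            exactly:
              deviceClassName: gpu.nvidia.com
              count: 1
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: shared-gpu-workload
    spec:
      containers:
      - name: inference
        image: my-inference-image   # illustrative
        resources:
          claims:
          - name: gpu               # same claim as the monitor container
      - name: monitor
        image: my-monitor-image     # illustrative
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: shared-gpu
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"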

Install the NVIDIA Kubernetes device plugin

The NVIDIA Kubernetes device plugin advertises NVIDIA GPUs as nvidia.com/gpu extended resources. You request GPUs in container resource requests and limits.

Prerequisites

  • An Amazon EKS cluster.

  • Nodes with NVIDIA GPU instance types using the EKS-optimized AL2023 NVIDIA AMI. The EKS-optimized Bottlerocket AMIs include the NVIDIA device plugin, so no separate installation is required.

  • Helm installed in your command-line environment. For more information, see the Setup Helm instructions.

  • kubectl configured to communicate with your cluster. For more information, see Install or update kubectl.

Procedure

  1. Add the NVIDIA device plugin Helm chart repository.

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
  2. Update your local Helm repository.

    helm repo update
  3. Install the NVIDIA Kubernetes device plugin.

    helm install nvdp nvdp/nvidia-device-plugin \
      --namespace nvidia \
      --create-namespace \
      --set gfd.enabled=true

    Disable MOFED on nodes with EFA devices

    Starting with k8s-device-plugin v0.19.0, the --mofed-enabled flag defaults to true. This causes the NVIDIA device plugin to mount all /dev/infiniband/uverbs* devices into containers requesting GPUs (nvidia.com/gpu). This conflicts with the AWS EFA device plugin, which manages mounting of the uverbs devices. Workloads requesting fewer than all EFA devices on a node are impacted because the NVIDIA device plugin claims all uverbs devices by default.

    If your nodes have EFA devices, explicitly disable MOFED when installing the NVIDIA device plugin:

    helm upgrade --install nvdp nvdp/nvidia-device-plugin \
      --namespace nvidia \
      --create-namespace \
      --set gfd.enabled=true \
      --set mofedEnabled=false

    If you manage the NVIDIA device plugin through the NVIDIA GPU Operator, disable MOFED using the devicePlugin.env field:

    helm upgrade --install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator \
      --set 'devicePlugin.env[0].name=MOFED_ENABLED' \
      --set 'devicePlugin.env[0].value=false'

    For more information, see NVIDIA k8s-device-plugin issue #1692.

    Note

    You can also install and manage the NVIDIA Kubernetes device plugin using the NVIDIA GPU Operator, which automates the management of all NVIDIA software components needed to provision GPUs.

  4. Verify the NVIDIA device plugin DaemonSet is running.

    kubectl get ds -n nvidia nvdp-nvidia-device-plugin
    NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    nvdp-nvidia-device-plugin   2         2         2       2            2           <none>          60s
  5. Verify that your nodes have allocatable GPUs.

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

    An example output is as follows.

    NAME                                           GPU
    ip-192-168-11-225.us-west-2.compute.internal   1
    ip-192-168-24-96.us-west-2.compute.internal    1

Request NVIDIA GPUs in a Pod

To request NVIDIA GPUs using the device plugin, specify the nvidia.com/gpu resource in your container resource requests and limits.

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-demo
    image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
    command: ["/bin/sh", "-c"]
    args: ["nvidia-smi && tail -f /dev/null"]
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

To run this test, apply the manifest and view the logs:

kubectl apply -f nvidia-smi.yaml
kubectl logs nvidia-smi

An example output is as follows.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI XXX.XXX.XX             Driver Version: XXX.XXX.XX     CUDA Version: XX.X     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   27C    P8             11W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+