


# Overview of Artificial Intelligence (AI) and Machine Learning (ML) on Amazon EKS
<a name="machine-learning-on-eks"></a>

Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes platform that empowers organizations to deploy, manage, and scale AI and machine learning (ML) workloads with unparalleled flexibility and control. Built on the open source Kubernetes ecosystem, EKS lets you harness your existing Kubernetes expertise, while integrating seamlessly with open source tools and AWS services.

Whether you’re training large-scale models, running real-time online inference, or deploying generative AI applications, EKS delivers the performance, scalability, and cost efficiency your AI/ML projects demand.

## Why Choose EKS for AI/ML?
<a name="_why_choose_eks_for_aiml"></a>

EKS is a managed Kubernetes platform that helps you deploy and manage complex AI/ML workloads. Built on the open source Kubernetes ecosystem, it integrates with AWS services, providing the control and scalability needed for advanced projects. For teams new to AI/ML deployments, existing Kubernetes skills transfer directly, allowing efficient orchestration of multiple workloads.

EKS supports everything from operating system customizations to compute scaling, and its open source foundation promotes technological flexibility, preserving choice for future infrastructure decisions. The platform provides the performance and tuning options AI/ML workloads require, supporting features such as:
+ Full cluster control to fine-tune costs and configurations without hidden abstractions
+ Sub-second latency for real-time inference workloads in production
+ Advanced customizations like multi-instance GPUs, multi-cloud strategies, and OS-level tuning
+ Ability to centralize workloads using EKS as a unified orchestrator across AI/ML pipelines

## Key use cases
<a name="_key_use_cases"></a>

Amazon EKS provides a robust platform for a wide range of AI/ML workloads, supporting various technologies and deployment patterns:
+  **Real-time (online) inference:** EKS powers immediate predictions on incoming data, such as fraud detection, with sub-second latency using tools like [TorchServe](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-torchserve.html), [Triton Inference Server](https://aws.amazon.com/blogs/containers/quora-3x-faster-machine-learning-25-lower-costs-with-nvidia-triton-on-amazon-eks/), and [KServe](https://kserve.github.io/website/0.8/get_started/first_isvc/) on Amazon EC2 [Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) and [Inf2](https://aws.amazon.com/ec2/instance-types/inf2/) instances. These workloads benefit from dynamic scaling with [Karpenter](https://karpenter.sh/) and [KEDA](https://keda.sh/), while leveraging [Amazon EFS](https://aws.amazon.com/efs/) for model sharding across pods. [Amazon ECR Pull Through Cache (PTC)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-creating-rule.html) accelerates model updates, and [Bottlerocket](https://aws.amazon.com/bottlerocket/) data volumes with [Amazon EBS](https://docs.aws.amazon.com/ebs/latest/userguide/what-is-ebs.html)-optimized volumes ensure fast data access.
+  **General model training:** Organizations leverage EKS to train complex models on large datasets over extended periods using the [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/trainer/), [Ray Train](https://docs.ray.io/en/latest/train/train.html), and [Torch Distributed Elastic](https://pytorch.org/docs/stable/distributed.elastic.html) on [Amazon EC2 P4d](https://aws.amazon.com/ec2/instance-types/p4/) and [Amazon EC2 Trn1](https://aws.amazon.com/ec2/instance-types/trn1/) instances. These workloads are supported by batch scheduling with tools like [Volcano](https://volcano.sh/en/#home_slider), [YuniKorn](https://yunikorn.apache.org/), and [Kueue](https://kueue.sigs.k8s.io/). [Amazon EFS](https://aws.amazon.com/efs/) enables sharing of model checkpoints, and [Amazon S3](https://aws.amazon.com/s3/) handles model import/export with lifecycle policies for version management.
+  **Retrieval augmented generation (RAG) pipelines:** EKS manages customer support chatbots and similar applications by integrating retrieval and generation processes. These workloads often use tools like [Argo Workflows](https://argoproj.github.io/workflows/) and [Kubeflow](https://www.kubeflow.org/) for orchestration, vector databases like [Pinecone](https://www.pinecone.io/blog/serverless/), [Weaviate](https://weaviate.io/), or [Amazon OpenSearch](https://aws.amazon.com/opensearch-service/), and expose applications to users via the [AWS Load Balancer Controller (LBC)](aws-load-balancer-controller.md). [NVIDIA NIM](https://docs.nvidia.com/nim/index.html) optimizes GPU utilization, while [Prometheus](prometheus.md) and [Grafana](https://aws.amazon.com/grafana/) monitor resource usage.
+  **Generative AI model deployment:** Companies deploy real-time content creation services on EKS, such as text or image generation, using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), [vLLM](https://github.com/vllm-project/vllm), and [Triton Inference Server](https://aws.amazon.com/blogs/containers/quora-3x-faster-machine-learning-25-lower-costs-with-nvidia-triton-on-amazon-eks/) on Amazon [EC2 G5](https://aws.amazon.com/ec2/instance-types/g5/) and [Inferentia](https://aws.amazon.com/ai/machine-learning/inferentia/) accelerators. These deployments optimize performance and memory utilization for large-scale models. [JupyterHub](https://jupyter.org/hub) enables iterative development, [Gradio](https://www.gradio.app/) provides simple web interfaces, and the [S3 Mountpoint CSI Driver](s3-csi.md) allows mounting S3 buckets as file systems for accessing large model files.
+  **Batch (offline) inference:** Organizations process large datasets efficiently through scheduled jobs with [AWS Batch](https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html) or [Volcano](https://volcano.sh/en/docs/schduler_introduction/). These workloads often use [Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) and [Inf2](https://aws.amazon.com/ec2/instance-types/inf2/) EC2 instances for AWS [Inferentia](https://aws.amazon.com/ai/machine-learning/inferentia/) chips, Amazon EC2 [G4dn](https://aws.amazon.com/ec2/instance-types/g4/) instances for NVIDIA T4 GPUs, or [c5](https://aws.amazon.com/ec2/instance-types/c5/) and [c6i](https://aws.amazon.com/ec2/instance-types/c6i) CPU instances, maximizing resource utilization during off-peak hours for analytics tasks. The [AWS Neuron SDK](https://aws.amazon.com/ai/machine-learning/neuron/) and NVIDIA GPU drivers optimize performance, while NVIDIA Multi-Instance GPU (MIG) and time-slicing enable GPU sharing. Storage solutions include [Amazon S3](https://aws.amazon.com/s3/), Amazon [EFS](https://aws.amazon.com/efs/), and [FSx for Lustre](https://aws.amazon.com/fsx/lustre/), with CSI drivers for various storage classes. Model management leverages tools like [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/), [Argo Workflows](https://argoproj.github.io/workflows/), and [Ray Cluster](https://docs.ray.io/en/latest/cluster/getting-started.html), while monitoring is handled by [Prometheus](prometheus.md), [Grafana](https://aws.amazon.com/grafana/), and custom model monitoring tools.

## Case studies
<a name="_case_studies"></a>

Customers choose Amazon EKS for various reasons, such as optimizing GPU usage or running real-time inference workloads with sub-second latency, as demonstrated in the following case studies. For a list of all case studies for Amazon EKS, see [AWS Customer Success Stories](https://aws.amazon.com/solutions/case-studies/browse-customer-success-stories/?refid=cr_card&customer-references-cards.sort-by=item.additionalFields.sortDate&customer-references-cards.sort-order=desc&awsf.customer-references-location=*all&awsf.customer-references-industry=*all&awsf.customer-references-use-case=*all&awsf.language=language%23english&awsf.customer-references-segment=*all&awsf.content-type=*all&awsf.customer-references-product=product%23eks&awsm.page-customer-references-cards=1).
+  [Unitary](https://aws.amazon.com/solutions/case-studies/unitary-eks-case-study/?did=cr_card&trk=cr_card) processes 26 million videos daily using AI for content moderation, which requires high-throughput, low-latency inference. The company achieved an 80% reduction in container boot times, ensuring fast responses to scaling events as traffic fluctuates.
+  [Miro](https://aws.amazon.com/solutions/case-studies/miro-eks-case-study/), the visual collaboration platform supporting 70 million users worldwide, reported an 80% reduction in compute costs compared to their previous self-managed Kubernetes clusters.
+  [Synthesia](https://aws.amazon.com/solutions/case-studies/synthesia-case-study/?did=cr_card&trk=cr_card), which offers generative AI video creation as a service for customers to create realistic videos from text prompts, achieved a 30x improvement in ML model training throughput.
+  [Harri](https://aws.amazon.com/solutions/case-studies/harri-eks-case-study/?did=cr_card&trk=cr_card), providing HR technology for the hospitality industry, achieved 90% faster scaling in response to spikes in demand and reduced its compute costs by 30% by migrating to [AWS Graviton processors](https://aws.amazon.com/ec2/graviton/).
+  [Ada Support](https://aws.amazon.com/solutions/case-studies/ada-support-eks-case-study/), an AI-powered customer service automation company, achieved a 15% reduction in compute costs alongside a 30% increase in compute efficiency.
+  [Snorkel AI](https://aws.amazon.com/blogs/startups/how-snorkel-ai-achieved-over-40-cost-savings-by-scaling-machine-learning-workloads-using-amazon-eks/), which equips enterprises to build and adapt foundation models and large language models, achieved over 40% cost savings by implementing intelligent scaling mechanisms for their GPU resources.

## Start using Machine Learning on EKS
<a name="_start_using_machine_learning_on_eks"></a>

To begin planning for and using Machine Learning platforms and workloads on Amazon EKS in the AWS Cloud, proceed to the [Resources to get started with AI/ML on Amazon EKS](ml-resources.md) section.

# Running real-time online inference workloads on Amazon EKS
<a name="ml-realtime-inference"></a>

**Tip**  
 [Register](https://aws-experience.com/emea/smb/events/series/get-hands-on-with-amazon-eks?trk=4a9b4147-2490-4c63-bc9f-f8a84b122c8c&sc_channel=el) for upcoming Amazon EKS AI/ML workshops.

This section is designed to help you deploy and operate real-time online inference workloads on Amazon Elastic Kubernetes Service (EKS). You’ll find guidance on building optimized clusters with GPU-accelerated nodes, integrating AWS services for storage and autoscaling, deploying sample models for validation, and key architectural considerations such as decoupling CPU and GPU tasks, selecting appropriate AMIs and instance types, and ensuring low-latency exposure of inference endpoints.

**Topics**
+ [Best Practices Cluster Setup Guide for Real-Time Inference on Amazon EKS](ml-realtime-inference-cluster.md)
+ [Quickstart: High-throughput LLM inference with vLLM on Amazon EKS](ml-realtime-inference-llm-inference-vllm.md)

# Best Practices Cluster Setup Guide for Real-Time Inference on Amazon EKS
<a name="ml-realtime-inference-cluster"></a>

**Tip**  
 [Register](https://aws-experience.com/emea/smb/events/series/get-hands-on-with-amazon-eks?trk=4a9b4147-2490-4c63-bc9f-f8a84b122c8c&sc_channel=el) for upcoming Amazon EKS AI/ML workshops.

## Introduction
<a name="_introduction"></a>

This guide offers a hands-on walkthrough for setting up an Amazon Elastic Kubernetes Service (EKS) cluster optimized for real-time online inference workloads, incorporating best practices curated by AWS experts throughout. It uses an opinionated EKS Quickstart Architecture—a curated set of drivers, instance types, and configurations aligned with AWS best practices for models, accelerators, and scaling. This approach helps you bypass the task of selecting cluster settings, allowing you to get a functional, pre-configured cluster up and running quickly. Along the way, we’ll deploy sample workloads to validate your setup, explain key architectural concepts (such as decoupling CPU-bound tasks from GPU-intensive computations), address common questions (e.g., why choose Bottlerocket AMI over AL2023?), and outline next steps to extend your cluster’s capabilities.

Designed specifically for Machine Learning (ML) and Artificial Intelligence (AI) engineers, platform administrators, operators, and data/AI specialists who are new to the AWS and EKS ecosystem, this guide assumes familiarity with Kubernetes but no prior EKS experience. It is designed to help you understand the steps and processes needed to get real-time online inference workloads up and running. The guide shows you the essentials of creating a single-node inference cluster, including provisioning GPU resources, integrating storage for model artifacts, enabling secure AWS service access, and exposing inference endpoints. Throughout, it emphasizes low-latency, resilient design for user-facing applications like fraud detection, real-time chatbots, and sentiment analysis in customer feedback systems.

In this guide, we focus exclusively on setting up a foundational, prescriptive starting point using G5 EC2 instances. If you’re seeking AWS Inferentia-specific cluster configurations or end-to-end workflows, see [Use AWS Inferentia instances with Amazon EKS for Machine Learning](inferentia-support.md) or our workshops in [Resources to get started with AI/ML on Amazon EKS](ml-resources.md).

## Before you begin
<a name="_before_you_begin"></a>

Before you start, make sure you have performed the following tasks:
+  [Set up your environment for Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/setting-up.html)
+  [Install the latest version of eksctl](https://eksctl.io/installation/) 
+  [Install Helm](https://helm.sh/docs/intro/install/) 
+  [(Optional) Install Docker](https://docs.docker.com/get-started/) 
+  [(Optional) Install the NGC CLI](https://org.ngc.nvidia.com/setup/installers/cli) 

## Architecture
<a name="_architecture"></a>

Real-time online inference refers to the process of using a trained machine learning model to generate predictions or outputs on incoming data streams with minimal latency. For example, it enables real-time fraud detection, classification of images, or the generation of knowledge graphs in response to user inputs. The architecture of a real-time online inference system delivers low-latency machine learning predictions in user-facing applications by decoupling CPU-bound web traffic handling from GPU-intensive AI computations. This process typically lives within a larger application ecosystem, and often includes backend, frontend, vector, and model services, with a focus on specialized components to enable independent scaling, parallel development, and resilience against failures. Isolating inference tasks on dedicated GPU hardware and leveraging interfaces like APIs and WebSockets ensures high concurrency, fast processing of models like transformers, and responsive user interactions through the frontend. Note that although vector databases and Retrieval Augmented Generation (RAG) pipelines often play a big part in real-time inference systems, these components are not covered in this guide. At a minimum, a typical architecture often includes:
+  **Frontend Service**: Serves as the user-facing interface, handling client-side logic, rendering dynamic content, and facilitating real-time interactions. It initiates inference requests to the backend service and displays the results, typically using WebSockets for streaming updates or APIs for structured data exchange. This service usually does not require a dedicated load balancer, as it can be hosted on content delivery networks (CDNs) like Amazon CloudFront for static assets or served directly from web servers, with scaling handled via Auto Scaling groups if needed for dynamic content.
+  **Backend Service**: Acts as the application’s orchestrator, managing business logic such as user authentication, data validation, and service coordination (e.g., via APIs for RESTful endpoints or WebSockets for persistent connections). It communicates with the inference service, scales independently on multi-core CPUs and RAM to handle high web traffic without relying on GPUs, and often requires a load balancer (such as AWS Application Load Balancer or Network Load Balancer) to distribute incoming requests across multiple instances, especially in high-concurrency scenarios. An ingress controller can further manage external access and routing rules for enhanced security and traffic shaping.
+  **Inference Service**: Serves as the core for AI computations, running on GPUs with sufficient VRAM (e.g., 8-12 GB for models like DistilBERT) to perform vector embeddings, knowledge extraction, and model inference (e.g., exposed through APIs for batch requests or WebSockets for real-time streaming) using custom or open-source models. This isolation prevents dependency conflicts, allows model updates without downtime, and enables horizontal scaling with load balancing for multiple concurrent requests. To expose the model service effectively, it typically sits behind a load balancer to distribute GPU-bound workloads across replicated instances, while an ingress resource or controller (such as ALB Ingress Controller in AWS) handles external routing, SSL termination, and path-based forwarding to ensure secure and efficient access without overwhelming individual GPUs.

## Solution Overview
<a name="_solution_overview"></a>

Real-time online inference systems require a high-performance, resilient architecture that can deliver ultra-low latency while handling unpredictable, high-volume traffic bursts. This solution overview explains how the following AWS components work together in the Amazon EKS cluster we will create, ensuring the cluster can host and manage machine learning models that provide immediate predictions on live data with minimal delay for end users.
+  [Amazon EC2 G5 Instances](https://aws.amazon.com/ec2/instance-types/g5/) — For GPU-intensive inference tasks, we are using the g5.xlarge and g5.2xlarge instance types, each of which features a single NVIDIA A10G Tensor Core GPU with 24 GB of memory (enough to serve a model of roughly 8 billion parameters at FP16). Based on the NVIDIA Ampere architecture and paired with 2nd generation AMD EPYC processors, these instances provide 4-8 vCPUs, up to 10 Gbps of network bandwidth, and 250-450 GB of local NVMe SSD storage, ensuring fast data movement and compute power for complex models and making them well suited to low-latency, high-throughput inference tasks. Choosing an EC2 instance type is application-specific and depends on your model (e.g., image, video, or text model) and your latency and throughput requirements. For instance, if you are using an image or video model, you may want to use [P5 EC2 instances](https://aws.amazon.com/ec2/instance-types/p5/) for optimal real-time latency. We recommend starting out with G5 instances, as they provide a good starting point for getting up and running quickly, then evaluating whether they are the right fit for your workloads through performance benchmark testing. For more advanced cases, consider [G6 EC2 instances](https://aws.amazon.com/ec2/instance-types/g6/).
+  [Amazon EC2 M7g Instances](https://aws.amazon.com/ec2/instance-types/m7g/) — For CPU-intensive tasks like data preprocessing, API request handling, and hosting the Karpenter controller, add-ons, and other system components, we are using the m7g.xlarge instance type. M7g instances are ARM-based instances powered by AWS Graviton3 processors; the m7g.xlarge features 4 vCPUs, 16 GB of memory, and up to 12.5 Gbps of network bandwidth. Choosing an EC2 instance type is application-specific and depends on your workload’s compute, memory, and scalability requirements. For compute-optimized workloads, you might consider [C7g EC2 instances](https://aws.amazon.com/ec2/instance-types/c7g/), which also use Graviton3 processors but are optimized for higher compute performance than M7g instances for certain use cases. Alternatively, newer [C8g EC2 instances](https://aws.amazon.com/ec2/instance-types/c8g/) (where available) provide up to 30% better compute performance than C7g instances. We recommend starting out with M7g instances for their cost efficiency and compatibility with a wide range of workloads (e.g., application servers, microservices, gaming servers, mid-size data stores), then evaluating whether they are the right fit for your workloads through performance benchmark testing.
+  [Amazon S3 Mountpoint CSI Driver](https://docs.aws.amazon.com/eks/latest/userguide/s3-csi.html) — For workloads on single-GPU instances where multiple pods share a GPU (e.g., multiple pods scheduled on the same node to utilize its GPU resources), we are using the Mountpoint S3 CSI Driver to optimize memory usage—essential for tasks like large-model inference in cost-sensitive, low-complexity setups. It exposes Amazon S3 buckets as a POSIX-like file system available to the Kubernetes cluster, which allows inference pods to read model artifacts (e.g., model weights) directly into memory without having to download them first, and input datasets using standard file operations. Additionally, S3 has virtually unlimited storage capacity and accelerates data-intensive inference workloads. Choosing a storage CSI driver is application-specific, and depends on your workload’s throughput and latency requirements. Though the [FSx for OpenZFS CSI Driver](https://docs.aws.amazon.com/eks/latest/userguide/fsx-openzfs-csi.html) offers sub-millisecond latency for random I/O or fully POSIX-compliant shared persistent volumes across nodes, we recommend starting out with the Mountpoint S3 CSI Driver due to its scalability, lower costs for large datasets, and built-in integration with S3-managed object storage for read-heavy inference patterns (e.g., streaming model inputs), then evaluating whether it’s the right fit for your workloads through performance benchmark testing.
+  [EKS Pod Identity Agent](https://docs.aws.amazon.com/eks/latest/userguide/pod-identities.html) — To enable access to AWS services, we are using the EKS Pod Identity Agent, which uses a single service principal and facilitates pod-level IAM role associations within the Amazon EKS cluster. EKS Pod Identity offers a streamlined alternative to the traditional [IAM Roles for Service Accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) approach by utilizing a single service principal (pods.eks.amazonaws.com) instead of relying on individual OIDC providers for each cluster, which makes it easier to assign permissions. Additionally, it enables roles to be reused across multiple clusters and it supports advanced features like [IAM role session tags](https://docs.aws.amazon.com/eks/latest/userguide/pod-id-abac.html) and [Target IAM roles](https://docs.aws.amazon.com/eks/latest/userguide/pod-id-assign-target-role.html).
+  [EKS Node Monitoring Agent](https://docs.aws.amazon.com/eks/latest/userguide/node-health.html) — To ensure continuous availability and reliability of inference services, we are using the EKS Node Monitoring Agent with Auto Repair, which automatically detects and replaces unhealthy nodes, minimizing downtime. It continuously monitors nodes for hardware, kernel, networking, and storage issues using enhanced health checks (e.g., KernelReady, NetworkingReady). For GPU nodes, it detects accelerator-specific failures, initiating graceful remediation by cordoning unhealthy nodes, waiting 10 minutes for transient GPU issues to resolve, and replacing nodes after 30 minutes for persistent failures.
+  [Bottlerocket AMI](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami-bottlerocket.html) — To provide a security-hardened foundation for our EKS cluster, we are using the Bottlerocket AMI, which includes only the essential components required to run containers and offers minimal boot times for fast scaling. Choosing a node AMI is application-specific and depends on your workload’s customization, security, and scalability requirements. Though the [AL2023 AMI](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html) provides greater flexibility for host-level installations and customizations (e.g., specifying a dedicated cache directory in a PV/PVC without any additional node configurations), we recommend starting out with the Bottlerocket AMI for its smaller footprint and built-in optimization for containerized workloads (e.g., microservices, inference servers, scalable APIs), then evaluating whether it’s the right fit for your workloads through performance benchmark testing.
+  [AWS Load Balancer Controller (LBC)](https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html) — To expose real-time inference endpoints, we are using the AWS Load Balancer Controller, which automatically provisions and manages Application Load Balancers (ALBs) for HTTP/HTTPS traffic and Network Load Balancers (NLBs) for TCP/UDP traffic based on Kubernetes Ingress and Service resources, enabling the integration of inference models with external clients. Additionally, it supports features like path-based routing to distribute inference requests across multiple pods or nodes, ensuring scalability during traffic spikes and minimizing latency through AWS-native optimizations like connection multiplexing and health checks.

## 1. Create your EKS cluster
<a name="_1_create_your_eks_cluster"></a>

In this step, we create a cluster with CPU nodes and a managed node group using an AWS CloudFormation-powered eksctl [ClusterConfig](https://eksctl.io/usage/creating-and-managing-clusters/) template. Initializing the cluster with only CPU nodes allows us to use Karpenter exclusively to manage both CPU-intensive and GPU nodes for optimized resource allocation, using Karpenter NodePools that we create in later steps. To support our real-time inference workloads, we provision the cluster with the EKS Bottlerocket AMI, EKS Node Monitoring Agent, EKS Pod Identity Agent, Mountpoint S3 CSI Driver, AWS Load Balancer Controller (LBC), and the [kube-proxy](https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html), [vpc-cni](https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html), and [coredns](https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html) add-ons. The m7g.xlarge instances will be used for CPU system tasks, including hosting the Karpenter controller, add-ons, and other system components.

By default, `eksctl` creates a dedicated VPC for the cluster with a CIDR block of `192.168.0.0/16`. The VPC includes three public subnets and three private subnets, each set distributed across three different Availability Zones (or two AZs in the `us-east-1` region), which is the recommended layout for deploying Kubernetes workloads. The template also deploys an internet gateway, providing internet access to the public subnets through default routes in their route tables, and a single NAT gateway in one of the public subnets, with default routes in the private subnets' route tables directing outbound traffic through the NAT gateway for internet access. To learn more about this setup, see [Deploy Nodes to Private Subnets](https://docs.aws.amazon.com/eks/latest/best-practices/subnets.html#_deploy_nodes_to_private_subnets).

### Check your credentials
<a name="_check_your_credentials"></a>

Check whether your AWS CLI credentials are valid and can authenticate with AWS services:

```
aws sts get-caller-identity
```

If successful, the CLI will return details about your AWS identity (UserId, Account, and Arn).
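
For reference, the response is a small JSON document. The sketch below mimics its shape with placeholder values (no live call is made) to show the `Account` field that later steps extract with `--query Account --output text`:

```shell
# Sample response shape from `aws sts get-caller-identity` (placeholder values).
RESPONSE='{"UserId":"AIDAEXAMPLE","Account":"111122223333","Arn":"arn:aws:iam::111122223333:user/example"}'

# Later steps use `--query Account --output text`; the same extraction is
# done locally here with sed, for illustration only.
ACCOUNT_ID="$(printf '%s' "$RESPONSE" | sed -E 's/.*"Account":"([0-9]+)".*/\1/')"
echo "$ACCOUNT_ID"
```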

### Check instance availability
<a name="_check_instance_availability"></a>

G5 instance types are not available in all AWS Regions. Check availability in your nearest Region. For example:

```
aws ec2 describe-instance-types --instance-types g5.xlarge g5.2xlarge --region us-east-1
```

If the command returns details for both instance types, they are available in the Region you specified.

The Bottlerocket AMI is not available in all regions. Check by retrieving a Bottlerocket AMI ID for your nearest region. For example:

```
aws ssm get-parameter --name /aws/service/bottlerocket/aws-k8s-1.33/arm64/latest/image_id \
    --region us-east-1 --query "Parameter.Value" --output text
```

If successful, the Bottlerocket AMI is available in the region you specified.

### Prepare your environment
<a name="_prepare_your_environment"></a>

First, set the following environment variables in a new terminal window. **Note**: Be sure to substitute the sample placeholders with your unique values, including cluster name, your desired region, [Karpenter release version](https://github.com/kubernetes-sigs/karpenter/releases), and [Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html).

**Tip**  
Some variables (such as `${AWS_REGION}` and `${K8S_VERSION}`) are defined early in the block and then referenced in later commands for consistency and to avoid repetition. Make sure to run the commands in sequence so that these values are properly exported and available for use in subsequent definitions.

```
export TEMPOUT="$(mktemp)"
export K8S_VERSION=1.33
export KARPENTER_VERSION="1.5.0"
export AWS_REGION="us-east-1"
export EKS_CLUSTER_NAME="eks-rt-inference-${AWS_REGION}"
export S3_BUCKET_NAME="eks-rt-inference-models-${AWS_REGION}-$(date +%s)"
export NVIDIA_BOTTLEROCKET_AMI="$(aws ssm get-parameter --name /aws/service/bottlerocket/aws-k8s-${K8S_VERSION}-nvidia/x86_64/latest/image_id --query Parameter.Value --output text)"
export STANDARD_BOTTLEROCKET_AMI="$(aws ssm get-parameter --name /aws/service/bottlerocket/aws-k8s-${K8S_VERSION}/arm64/latest/image_id --query Parameter.Value --output text)"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export ALIAS_VERSION="$(aws ssm get-parameter --name "/aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2023/x86_64/standard/recommended/image_id" --query Parameter.Value | xargs aws ec2 describe-images --query 'Images[0].Name' --image-ids | sed -r 's/^.*(v[[:digit:]]+).*$/\1/')"
```
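
The two Bottlerocket AMI lookups above follow a single SSM parameter naming pattern: `/aws/service/bottlerocket/aws-k8s-<version>[-nvidia]/<arch>/latest/image_id`. The sketch below composes those paths locally (no AWS call is made; `bottlerocket_ssm_path` is an illustrative helper, not part of any AWS tooling) to show how the variant and architecture slot into the path:

```shell
# Compose Bottlerocket SSM parameter paths locally to show the naming pattern.
K8S_VERSION=1.33
bottlerocket_ssm_path() {
  # variant: "" (standard) or "-nvidia"; arch: x86_64 or arm64
  local variant="$1" arch="$2"
  printf '/aws/service/bottlerocket/aws-k8s-%s%s/%s/latest/image_id' "$K8S_VERSION" "$variant" "$arch"
}

# GPU nodes use the NVIDIA variant on x86_64; system nodes use the standard arm64 build.
NVIDIA_PATH="$(bottlerocket_ssm_path "-nvidia" "x86_64")"
STANDARD_PATH="$(bottlerocket_ssm_path "" "arm64")"
echo "$NVIDIA_PATH"
echo "$STANDARD_PATH"
```

These are the same paths passed to `aws ssm get-parameter` in the environment variable block above.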

### Create required roles and policies
<a name="_create_required_roles_and_policies"></a>

Karpenter needs specific IAM roles and policies (e.g., the Karpenter controller IAM role, instance profile, and policies) to manage EC2 instances as Kubernetes worker nodes. It uses these roles to perform actions like launching and terminating EC2 instances, tagging resources, and interacting with other AWS services. Create the Karpenter roles and policies using Karpenter’s [cloudformation.yaml](https://raw.githubusercontent.com/aws/karpenter-provider-aws/v1.5.0/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml):

```
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TEMPOUT}" \
&& aws cloudformation deploy \
  --stack-name "Karpenter-${EKS_CLUSTER_NAME}" \
  --template-file "${TEMPOUT}" \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides "ClusterName=${EKS_CLUSTER_NAME}"
```

The AWS Load Balancer Controller (LBC) needs permission to provision and manage AWS load balancers, such as creating ALBs for Ingress resources or NLBs for Services of type `LoadBalancer`. We’ll attach this policy to a service account that eksctl creates in the ClusterConfig during cluster creation. Create the LBC IAM policy:

```
aws iam create-policy \
  --policy-name AWSLoadBalancerControllerIAMPolicy \
  --policy-document "$(curl -fsSL https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.14.1/docs/install/iam_policy.json)"
```

When the Mountpoint for Amazon S3 CSI driver is installed, its DaemonSet pods run under a dedicated service account. The driver needs permission to interact with the Amazon S3 bucket you create later in this guide. We’ll attach this policy to the service account that eksctl creates in the ClusterConfig during cluster creation. Create the S3 IAM policy:

```
aws iam create-policy \
    --policy-name S3CSIDriverPolicy \
    --policy-document "{\"Version\": \"2012-10-17\", \"Statement\": [{\"Effect\": \"Allow\", \"Action\": [\"s3:GetObject\", \"s3:PutObject\", \"s3:AbortMultipartUpload\", \"s3:DeleteObject\", \"s3:ListBucket\"], \"Resource\": [\"arn:aws:s3:::${S3_BUCKET_NAME}\", \"arn:aws:s3:::${S3_BUCKET_NAME}/*\"]}]}"
```
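If you prefer to review the policy document before creating it, an alternative sketch is to write the same JSON to a file first and validate it locally (the `/tmp` path is arbitrary; no AWS call is involved until you run `create-policy`):

```
# Write the policy document to a file; ${S3_BUCKET_NAME} expands because the
# heredoc delimiter is unquoted
cat > /tmp/s3-csi-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:AbortMultipartUpload",
               "s3:DeleteObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::${S3_BUCKET_NAME}",
                 "arn:aws:s3:::${S3_BUCKET_NAME}/*"]
  }]
}
EOF
# Confirm the file is valid JSON before using it
python3 -m json.tool < /tmp/s3-csi-policy.json > /dev/null && echo "policy JSON is valid"
```

You can then pass `--policy-document file:///tmp/s3-csi-policy.json` to the same `aws iam create-policy` command instead of the inline string.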

 **Note**: If a policy already exists with this name, give the policy a different name. The policy we create in this step is specific to your cluster and your S3 bucket.

### Create the cluster
<a name="_create_the_cluster"></a>

In this template, eksctl automatically creates Kubernetes service accounts for EKS Pod Identity, the Node Monitoring Agent, CoreDNS, kube-proxy, and the VPC CNI plugin. At the time of writing, the Mountpoint S3 CSI Driver does not support EKS Pod Identity, so we create an IAM Roles for Service Accounts (IRSA) role and an [OIDC endpoint](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html). In addition, we create a service account for the AWS Load Balancer Controller (LBC). For access to Bottlerocket nodes, eksctl automatically attaches the AmazonSSMManagedInstanceCore policy to allow secure shell sessions via SSM.

In the same terminal where you set your environment variables, run the following command block to create the cluster:

```
eksctl create cluster -f - <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${EKS_CLUSTER_NAME}
  region: ${AWS_REGION}
  version: "${K8S_VERSION}"
  tags:
    karpenter.sh/discovery: ${EKS_CLUSTER_NAME}
    # Add more tags if needed for billing
iam:
  # Creates an OIDC endpoint and IRSA service account for the Mountpoint S3 CSI Driver
  # Uses the S3 CSI Driver policy for permissions
  withOIDC: true
  podIdentityAssociations:
  # Creates the pod identity association and service account
  # Uses the Karpenter controller IAM policy for permissions
  - namespace: "kube-system"
    serviceAccountName: karpenter
    roleName: ${EKS_CLUSTER_NAME}-karpenter
    permissionPolicyARNs:
    - arn:aws:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${EKS_CLUSTER_NAME}
  # Creates the pod identity association and service account
  # Uses the AWS LBC policy for permissions
  - namespace: kube-system
    serviceAccountName: aws-load-balancer-controller
    createServiceAccount: true
    roleName: AmazonEKSLoadBalancerControllerRole
    permissionPolicyARNs:
    - arn:aws:iam::${AWS_ACCOUNT_ID}:policy/AWSLoadBalancerControllerIAMPolicy
iamIdentityMappings:
- arn: "arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${EKS_CLUSTER_NAME}"
  username: system:node:{{EC2PrivateDNSName}}
  groups:
  - system:bootstrappers
  - system:nodes
managedNodeGroups:
  # Creates 2 CPU nodes for lightweight system tasks
  - name: ${EKS_CLUSTER_NAME}-m7-cpu
    instanceType: m7g.xlarge
    amiFamily: Bottlerocket
    desiredCapacity: 2
    minSize: 1
    maxSize: 10
    labels:
      role: cpu-worker
# Enable automatic Pod Identity associations for VPC CNI Driver, coreDNS, kube-proxy
addonsConfig:
  autoApplyPodIdentityAssociations: true
addons:
  # Installs the S3 CSI Driver addon and creates IAM role
  # Uses the S3 CSI Driver policy for IRSA permissions
  - name: aws-mountpoint-s3-csi-driver
    attachPolicyARNs:
      - "arn:aws:iam::${AWS_ACCOUNT_ID}:policy/S3CSIDriverPolicy"
  - name: eks-pod-identity-agent
  - name: eks-node-monitoring-agent
  - name: coredns
  - name: kube-proxy
  - name: vpc-cni
EOF
```

This process takes several minutes to complete. If you’d like to monitor the status, see the [AWS CloudFormation](https://console.aws.amazon.com/cloudformation) console.

## 2. Verify Cluster Node and Pod Health
<a name="_2_verify_cluster_node_and_pod_health"></a>

Let’s perform a few health checks to ensure the cluster is ready. When the previous command completes, view the instance types and verify that your CPU system nodes have reached the `Ready` state with the following command:

```
kubectl get nodes -L node.kubernetes.io/instance-type
```

The expected output should look like this:

```
NAME                             STATUS   ROLES    AGE     VERSION               INSTANCE-TYPE
ip-192-168-35-103.ec2.internal   Ready    <none>   12m     v1.33.0-eks-802817d   m7g.xlarge
ip-192-168-7-15.ec2.internal     Ready    <none>   12m     v1.33.0-eks-802817d   m7g.xlarge
```

Verify the Pod Identity associations, each of which maps an IAM role to a service account in a cluster namespace, with the following command:

```
eksctl get podidentityassociation --cluster ${EKS_CLUSTER_NAME} --region ${AWS_REGION}
```

The output should show the IAM roles for Karpenter ("karpenter") and the AWS LBC ("aws-load-balancer-controller").

Verify the DaemonSets are available:

```
kubectl get daemonsets -n kube-system
```

The expected output should look like this:

```
NAME                           DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR          AGE
aws-node                       3       3       3     3          3         <none>                 12m
dcgm-server                    0       0       0     0          0         kubernetes.io/os=linux 12m
eks-node-monitoring-agent      3       3       3     3          3         kubernetes.io/os=linux 12m
eks-pod-identity-agent         3       3       3     3          3         <none>                 12m
kube-proxy                     3       3       3     3          3         <none>                 12m
s3-csi-node                    2       2       2     2          2         kubernetes.io/os=linux 12m
```

Verify all addons are installed on the cluster:

```
eksctl get addons --cluster ${EKS_CLUSTER_NAME} --region ${AWS_REGION}
```

The expected output should look like this:

```
NAME                           VERSION              STATUS    ISSUES    IAMROLE                                           UPDATE AVAILABLE    CONFIGURATION VALUES    POD IDENTITY ASSOCIATION ROLES
aws-mountpoint-s3-csi-driver   v1.15.0-eksbuild.1   ACTIVE    0    arn:aws:iam::143095308808:role/eksctl-eks-rt-inference-us-east-1-addon-aws-m-Role1-RAUjk4sJnc0L
coredns                        v1.12.1-eksbuild.2   ACTIVE    0
eks-node-monitoring-agent      v1.3.0-eksbuild.2    ACTIVE    0
eks-pod-identity-agent         v1.3.7-eksbuild.2    ACTIVE    0
kube-proxy                     v1.33.0-eksbuild.2   ACTIVE    0
metrics-server                 v0.7.2-eksbuild.3    ACTIVE    0
vpc-cni                        v1.19.5-eksbuild.1   ACTIVE    0
```

## 3. Install Karpenter
<a name="_3_install_karpenter"></a>

Install the Karpenter controller on your CPU worker nodes (`cpu-worker`) to optimize costs and conserve GPU resources. We’ll install it in the `kube-system` namespace and specify the `karpenter` service account we defined during cluster creation. Additionally, this command configures the cluster name and a Spot Instance interruption queue. Karpenter assumes the controller IAM role through the service account association created during cluster creation.

```
# Log out of the Helm registry before pulling from public ECR
helm registry logout public.ecr.aws

# Install Karpenter
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "kube-system" --create-namespace  \
  --set "settings.clusterName=${EKS_CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${EKS_CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:aws:iam::${AWS_ACCOUNT_ID}:role/${EKS_CLUSTER_NAME}-karpenter" \
  --wait
```

The expected output should look like this:

```
Release "karpenter" does not exist. Installing it now.
Pulled: public.ecr.aws/karpenter/karpenter:1.5.0
Digest: sha256:9a155c7831fbff070669e58500f68d7ccdcf3f7c808dcb4c21d3885aa20c0a1c
NAME: karpenter
LAST DEPLOYED: Thu Jun 19 09:57:06 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
```

Verify that Karpenter is running:

```
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter
```

The expected output should look like this:

```
NAME                       READY   STATUS    RESTARTS   AGE
karpenter-555895dc-865bc   1/1     Running   0          5m58s
karpenter-555895dc-j7tk9   1/1     Running   0          5m58s
```

## 4. Set up Karpenter NodePools
<a name="_4_setup_karpenter_nodepools"></a>

In this step, we configure mutually exclusive CPU and GPU [Karpenter NodePools](https://karpenter.sh/docs/concepts/nodepools/). The `limits` field in the NodePool spec constrains the maximum total resources (e.g., CPU, memory, GPUs) that each NodePool can consume across all provisioned nodes, preventing additional node provisioning once these limits are reached. While NodePools support broad instance categories (e.g., `c`, `g`), specifying particular [instance types](https://karpenter.sh/docs/concepts/nodepools/#instance-types), [capacity types](https://karpenter.sh/docs/concepts/nodepools/#capacity-type), and resource [limits](https://karpenter.sh/docs/concepts/nodepools/#speclimits) helps you more easily estimate costs for your on-demand workloads. In these NodePools, we use a diverse set of instance types within the G5 instance family, which allows Karpenter to automatically select the most appropriate instance type based on pod resource requests, optimizing resource utilization while respecting the NodePool’s total limits. To learn more, see [Creating NodePools](https://docs.aws.amazon.com/eks/latest/best-practices/karpenter.html#_creating_nodepools).

### Set up the GPU NodePool
<a name="_setup_the_gpu_nodepool"></a>

In this NodePool, we set resource limits to manage the provisioning of GPU-capable nodes. The limits cap the total resources across all nodes in the pool, allowing up to 10 instances in total. Each instance can be either a g5.xlarge (4 vCPUs, 16 GiB memory, 1 GPU) or a g5.2xlarge (8 vCPUs, 32 GiB memory, 1 GPU), as long as the totals stay within 80 vCPUs, 320 GiB of memory, and 10 GPUs. For example, the pool can provision 10 g5.2xlarge instances (80 vCPUs, 320 GiB, 10 GPUs), 10 g5.xlarge instances (40 vCPUs, 160 GiB, 10 GPUs), or a mix such as 5 g5.xlarge and 5 g5.2xlarge (60 vCPUs, 240 GiB, 10 GPUs), giving Karpenter flexibility to match workload demands while respecting the resource constraints.
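The example mixes above can be cross-checked with quick shell arithmetic; `gpu_pool_usage` is an illustrative helper, not part of the cluster setup:

```
# Total resources for a mix of g5.xlarge (4 vCPUs, 16 GiB, 1 GPU) and
# g5.2xlarge (8 vCPUs, 32 GiB, 1 GPU) instances
gpu_pool_usage() {
  cpu=$(( $1 * 4 + $2 * 8 ))
  mem=$(( $1 * 16 + $2 * 32 ))
  gpu=$(( $1 + $2 ))
  echo "${cpu} vCPUs, ${mem} GiB, ${gpu} GPUs"
}
gpu_pool_usage 10 0   # 40 vCPUs, 160 GiB, 10 GPUs
gpu_pool_usage 0 10   # 80 vCPUs, 320 GiB, 10 GPUs
gpu_pool_usage 5 5    # 60 vCPUs, 240 GiB, 10 GPUs
```

Every combination stays within the pool limits of 80 vCPUs, 320 GiB of memory, and 10 GPUs.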

Additionally, we specify the ID of the NVIDIA variant of the Bottlerocket AMI. Finally, we set a [disruption policy](https://karpenter.sh/docs/concepts/disruption/#nodepool-disruption-budgets) to remove empty nodes after 30 minutes (`consolidateAfter: 30m`) and set a maximum node lifetime of 30 days (`expireAfter: 720h`) to optimize costs and maintain node health for GPU-intensive tasks. To learn more, see [Disable Karpenter Consolidation for interruption sensitive workloads](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-compute.html#_disable_karpenter_consolidation_for_interruption_sensitive_workloads), and [Use ttlSecondsAfterFinished to Auto Clean-Up Kubernetes Jobs](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-compute.html#_use_ttlsecondsafterfinished_to_auto_clean_up_kubernetes_jobs).

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-a10g-inference-g5
spec:
  template:
    metadata:
      labels:
        role: gpu-worker
        gpu-type: nvidia-a10g
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
      nodeClassRef:
        name: gpu-a10g-inference-ec2
        group: karpenter.k8s.aws
        kind: EC2NodeClass
      expireAfter: 720h
  limits:
    cpu: "80"
    memory: "320Gi"
    nvidia.com/gpu: "10"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-a10g-inference-ec2
spec:
  amiFamily: Bottlerocket
  amiSelectorTerms:
    - id: ${NVIDIA_BOTTLEROCKET_AMI}
  role: "KarpenterNodeRole-${EKS_CLUSTER_NAME}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${EKS_CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${EKS_CLUSTER_NAME}"
  tags:
    nvidia.com/gpu: "true"
EOF
```

The expected output should look like this:

```
nodepool.karpenter.sh/gpu-a10g-inference-g5 created
ec2nodeclass.karpenter.k8s.aws/gpu-a10g-inference-ec2 created
```

Verify the NodePool is created and healthy:

```
kubectl get nodepool gpu-a10g-inference-g5 -o yaml
```

Look for `status.conditions` like `ValidationSucceeded: True`, `NodeClassReady: True`, and `Ready: True` to confirm the NodePool is healthy.

### Set up the CPU NodePool
<a name="_setup_the_cpu_nodepool"></a>

In this NodePool, we set limits to support approximately 50 instances, aligning with a moderate CPU workload (e.g., 100-200 pods) and typical AWS vCPU quotas (e.g., 128-1152). The limits assume the NodePool can scale up to 50 m7g.xlarge instances (each with 4 vCPUs and 16 GiB memory): 4 vCPUs × 50 instances = 200 vCPUs, and 16 GiB × 50 instances = 800 GiB. These limits cap the total resources across all nodes in the pool, so the total vCPUs cannot exceed 200 and total memory cannot exceed 800 GiB.
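As a quick cross-check of this math (illustrative only), dividing each limit by the per-instance resources of an m7g.xlarge gives the maximum node count the pool can reach:

```
# NodePool limits: 200 vCPUs and 800 GiB; m7g.xlarge: 4 vCPUs and 16 GiB
echo "max by vCPU limit:   $(( 200 / 4 )) instances"
echo "max by memory limit: $(( 800 / 16 )) instances"
```

Both limits cap out at 50 instances, so neither dimension bottlenecks the other.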

Additionally, we specify the ID of the standard variant of the Bottlerocket AMI. Finally, we set a [disruption policy](https://karpenter.sh/docs/concepts/disruption/#nodepool-disruption-budgets) to remove empty nodes after 60 minutes (`consolidateAfter: 60m`) and set a maximum node lifetime of 30 days (`expireAfter: 720h`) to optimize costs and maintain node health for CPU workloads. To learn more, see [Disable Karpenter Consolidation for interruption sensitive workloads](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-compute.html#_disable_karpenter_consolidation_for_interruption_sensitive_workloads), and [Use ttlSecondsAfterFinished to Auto Clean-Up Kubernetes Jobs](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-compute.html#_use_ttlsecondsafterfinished_to_auto_clean_up_kubernetes_jobs).

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: cpu-inference-m7gxlarge
spec:
  template:
    metadata:
      labels:
        role: cpu-worker
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m7g.xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        - key: role
          value: cpu-intensive
          effect: NoSchedule
      nodeClassRef:
        name: cpu-inference-m7gxlarge-ec2
        group: karpenter.k8s.aws
        kind: EC2NodeClass
      expireAfter: 720h
  limits:
    cpu: "200"
    memory: "800Gi"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: cpu-inference-m7gxlarge-ec2
spec:
  amiFamily: Bottlerocket
  amiSelectorTerms:
    - id: ${STANDARD_BOTTLEROCKET_AMI}
  role: "KarpenterNodeRole-${EKS_CLUSTER_NAME}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${EKS_CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${EKS_CLUSTER_NAME}"
EOF
```

The expected output should look like this:

```
nodepool.karpenter.sh/cpu-inference-m7gxlarge created
ec2nodeclass.karpenter.k8s.aws/cpu-inference-m7gxlarge-ec2 created
```

Verify the NodePool is created and healthy:

```
kubectl get nodepool cpu-inference-m7gxlarge -o yaml
```

Look for `status.conditions` like `ValidationSucceeded: True`, `NodeClassReady: True`, and `Ready: True` to confirm the NodePool is healthy.

## 5. Deploy a GPU Pod to Expose a GPU
<a name="_5_deploy_a_gpu_pod_to_expose_a_gpu"></a>

Kubernetes needs the NVIDIA device plugin to expose GPU devices to the cluster. Typically, you would deploy the plugin as a DaemonSet; however, the Bottlerocket AMI ships with the plugin pre-installed, so there is no need to deploy the NVIDIA device plugin DaemonSet when using Bottlerocket AMIs. To learn more, see [Kubernetes Device Plugin to expose GPUs](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-compute.html#_use_kubernetes_device_plugin_for_exposing_gpus).

### Deploy a sample pod
<a name="_deploy_a_sample_pod"></a>

Karpenter acts dynamically: it provisions GPU nodes only when a workload (pod) requests GPU resources. To verify that pods can request and use GPUs, deploy a pod that requests the `nvidia.com/gpu` resource in its limits (e.g., `nvidia.com/gpu: 1`). To learn more about these labels, see [Schedule workloads with GPU requirements using Well-Known labels](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-compute.html#_schedule_workloads_with_gpu_requirements_using_well_known_labels).

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-nvidia-smi
spec:
  restartPolicy: OnFailure
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  nodeSelector:
    role: gpu-worker  # Matches GPU NodePool's label
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.9.1-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
EOF
```

The expected output should look like this:

```
pod/gpu-nvidia-smi created
```

Give it a minute, then watch the Pod move through the "Pending," "ContainerCreating," and "Running" statuses to "Completed":

```
kubectl get pod gpu-nvidia-smi -w
```

Verify the node for the pod belongs to the GPU NodePool:

```
kubectl get node $(kubectl get pod gpu-nvidia-smi -o jsonpath='{.spec.nodeName}') -o custom-columns="Name:.metadata.name,Nodepool:.metadata.labels.karpenter\.sh/nodepool"
```

The expected output should look like this:

```
Name                             Nodepool
ip-192-168-83-245.ec2.internal   gpu-a10g-inference-g5
```

Check the pod’s logs:

```
kubectl logs gpu-nvidia-smi
```

The expected output should look like this:

```
Thu Jul 17 04:31:33 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08                 Driver Version: 570.148.08         CUDA Version: 12.9 |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   30C    P8               9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+---------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU        GI     CI        PID   Type   Process name                  GPU Memory    |
|                     ID        ID                                         Usage        |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

## 6. (Optional) Prepare and Upload Model Artifacts for Deployment
<a name="_6_optional_prepare_and_upload_model_artifacts_for_deployment"></a>

In this step, you’ll deploy a model service for real-time image classification, starting by uploading model weights to an Amazon S3 bucket. For demonstration, we use the open source [GPUNet-0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/gpunet_0_pyt_ckpt) vision model, part of NVIDIA’s [GPUNet](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/resources/gpunet_pyt) family, which supports low-latency inference on images using NVIDIA GPUs and TensorRT. The model is pretrained on [ImageNet](https://www.image-net.org/), lets us classify objects in photos or video streams on the fly, and is considered small at 11.9 million parameters.

### Set up your environment
<a name="_set_up_your_environment"></a>

To download the GPUNet-0 model weights in this step, you need access to NVIDIA’s NGC catalog and [Docker](https://docs.docker.com/get-started/) installed on your local machine. Follow these steps to set up a free account and configure the NGC CLI:
+  [Sign up for a free NGC account](https://ngc.nvidia.com/signup) and generate an API key from the NGC dashboard (User Icon > Setup > Generate API Key > Generate Personal Key > NGC Catalog).
+  [Download and install the NGC CLI](https://org.ngc.nvidia.com/setup/installers/cli) (Linux/macOS/Windows) and configure the CLI with `ngc config set`. Enter your API key when prompted, set the org to `nvidia`, and press Enter to accept the defaults for the remaining prompts. If successful, you should see something like: `Successfully saved NGC configuration to /Users/your-username/.ngc/config`.

### Verify service account permissions
<a name="_verify_service_account_permissions"></a>

Before we start, check the Kubernetes service account permissions:

```
kubectl get serviceaccount s3-csi-driver-sa -n kube-system -o yaml
```

During cluster creation, we attached the S3CSIDriverPolicy to an IAM role and annotated the service account (`s3-csi-driver-sa`). The Mountpoint S3 CSI driver pods inherit the IAM role’s permissions when interacting with S3. The expected output should look like this:

```
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::143095308808:role/eksctl-eks-rt-inference-us-east-1-addon-aws-m-Role1-fpXXjRYdKN8r
  creationTimestamp: "2025-07-17T03:55:29Z"
  labels:
    app.kubernetes.io/component: csi-driver
    app.kubernetes.io/instance: aws-mountpoint-s3-csi-driver
    app.kubernetes.io/managed-by: EKS
    app.kubernetes.io/name: aws-mountpoint-s3-csi-driver
  name: s3-csi-driver-sa
  namespace: kube-system
  resourceVersion: "2278"
  uid: 50b36272-6716-4c68-bdc3-c4054df1177c
```

### Add a toleration
<a name="_add_a_toleration"></a>

The S3 CSI Driver runs as a DaemonSet on all nodes, and pods on those nodes use it to mount S3 volumes. To allow the DaemonSet to schedule on our tainted GPU nodes, add a toleration:

```
kubectl patch daemonset s3-csi-node -n kube-system --type='json' -p='[{"op": "add", "path": "/spec/template/spec/tolerations/-", "value": {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}}]'
```

The expected output should look like this:

```
daemonset.apps/s3-csi-node patched
```

### Upload model weights to S3
<a name="_upload_model_weights_to_s3"></a>

In this step, you’ll create an Amazon S3 bucket, download the GPUNet-0 model weights from NVIDIA GPU Cloud (NGC), and upload them to the bucket. These weights will be accessed by our application at runtime for inference.

Create your Amazon S3 bucket:

```
aws s3 mb s3://${S3_BUCKET_NAME} --region ${AWS_REGION}
```

Enable [S3 Versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html) for the bucket to prevent accidental deletions and overwrites from causing immediate, permanent data loss:

```
aws s3api put-bucket-versioning --bucket ${S3_BUCKET_NAME} --versioning-configuration Status=Enabled
```

Apply a lifecycle rule to the bucket to remove overwritten or deleted object versions 14 days after they become non-current, remove expired delete markers, and remove incomplete multi-part uploads after 7 days. To learn more, see [Examples of S3 Lifecycle configurations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html).

```
aws s3api put-bucket-lifecycle-configuration --bucket $S3_BUCKET_NAME --lifecycle-configuration '{"Rules":[{"ID":"LifecycleRule","Status":"Enabled","Filter":{},"Expiration":{"ExpiredObjectDeleteMarker":true},"NoncurrentVersionExpiration":{"NoncurrentDays":14},"AbortIncompleteMultipartUpload":{"DaysAfterInitiation":7}}]}'
```
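Because the lifecycle configuration is passed inline, it can be hard to read. Optionally pretty-print it locally (no AWS call involved) to review the rules before or after applying them:

```
# Pretty-print the same inline lifecycle configuration for review
echo '{"Rules":[{"ID":"LifecycleRule","Status":"Enabled","Filter":{},"Expiration":{"ExpiredObjectDeleteMarker":true},"NoncurrentVersionExpiration":{"NoncurrentDays":14},"AbortIncompleteMultipartUpload":{"DaysAfterInitiation":7}}]}' | python3 -m json.tool
```

The indented output makes it easy to confirm the 14-day non-current expiration and 7-day multipart-upload cleanup before the rule takes effect.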

Download the GPUNet-0 model weights from NGC. For example, on macOS:

```
ngc registry model download-version nvidia/dle/gpunet_0_pyt_ckpt:21.12.0_amp --dest ~/downloads
```

**Note**  
You may need to adjust this download command for your operating system. On Linux, for example, you likely need to create the destination directory first (e.g., `mkdir -p ~/downloads`).

The expected output should look like this:

```
{
  "download_end": "2025-07-18 08:22:39",
  "download_start": "2025-07-18 08:22:33",
  "download_time": "6s",
  "files_downloaded": 1,
  "local_path": "/Users/your-username/downloads/gpunet_0_pyt_ckpt_v21.12.0_amp",
  "size_downloaded": "181.85 MB",
  "status": "Completed",
  "transfer_id": "gpunet_0_pyt_ckpt[version=21.12.0_amp]"
}
```

Rename the checkpoint file to match the naming our application code expects in later steps (no extraction is needed, as it’s a standard PyTorch `.pth.tar` checkpoint containing the model state dictionary):

```
mv ~/downloads/gpunet_0_pyt_ckpt_v21.12.0_amp/0.65ms.pth.tar gpunet-0.pth
```

Enable the [AWS Common Runtime](https://aws.amazon.com/blogs/storage/improving-amazon-s3-throughput-for-the-aws-cli-and-boto3-with-the-aws-common-runtime/) in the AWS CLI to optimize S3 throughput:

```
aws configure set s3.preferred_transfer_client crt
```

Upload the model weights to your S3 bucket:

```
aws s3 cp gpunet-0.pth s3://${S3_BUCKET_NAME}/gpunet-0.pth
```

The expected output should look like this:

```
upload: ./gpunet-0.pth to s3://eks-rt-inference-models-us-east-1-1752722786/gpunet-0.pth
```

### Create the Model Service
<a name="_create_the_model_service"></a>

In this step, you’ll set up a FastAPI web application for GPU-accelerated image classification using the GPUNet-0 vision model. The application downloads model weights from Amazon S3 at runtime, fetches the model architecture from NVIDIA’s repository for caching, and downloads ImageNet class labels via HTTP. The application includes image preprocessing transforms and exposes two endpoints: a root GET endpoint for status checks and a POST `/predict` endpoint that accepts an image URL.

We serve the model using FastAPI with PyTorch, loading weights from Amazon S3 at runtime in a containerized setup for quick prototyping and Kubernetes deployment. For other methods like optimized batching or high-throughput engines, see [Serving ML Models](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-performance.html#_serving_ml_models).

#### Create the application
<a name="_create_the_application"></a>

Create a directory for your application files such as `model-testing`, then change directories into it and add the following code to a new file named `app.py`:

```
import os
import torch
import json
import requests
from fastapi import FastAPI, HTTPException
from PIL import Image
from io import BytesIO, StringIO
import torchvision.transforms as transforms
from torch.nn.functional import softmax
import warnings
from contextlib import redirect_stdout, redirect_stderr
import argparse
import boto3
app = FastAPI()

# Suppress specific warnings from the model code (quantization is optional and unused here)
warnings.simplefilter("ignore", UserWarning)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model code from the cache (if present)
# Use a dedicated cache directory (can be backed by a persistent volume)
torch.hub.set_dir('/cache/torch/hub')

# Allowlist for secure deserialization (handles potential issues in older checkpoints)
torch.serialization.add_safe_globals([argparse.Namespace])
# Load the model architecture only at container startup (pretrained=False; weights are loaded from S3 below)
# Precision: FP32 for full accuracy; could be 'fp16' for speed on Ampere+ GPUs
with redirect_stdout(StringIO()), redirect_stderr(StringIO()):
    gpunet = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_gpunet', pretrained=False, model_type='GPUNet-0', model_math='fp32')

# Download weights from S3 if not present, then load them
model_path = os.getenv('MODEL_PATH', '/cache/torch/hub/checkpoints/gpunet-0.pth')
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # Ensure checkpoints dir exists
if not os.path.exists(model_path):
    s3 = boto3.client('s3')
    s3.download_file(os.getenv('S3_BUCKET_NAME'), 'gpunet-0.pth', model_path)
checkpoint = torch.load(model_path, map_location=device, weights_only=True)
gpunet.load_state_dict(checkpoint['state_dict'])
# Move to GPU/CPU
gpunet.to(device)
gpunet.eval()

# Preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load ImageNet labels
labels_url = "https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json"
response = requests.get(labels_url)
json_data = json.loads(response.text)
labels = [json_data[str(i)][1].replace('_', ' ') for i in range(1000)]

# Required, FastAPI root
@app.get("/")
async def hello():
    return {"status": "hello"}

# Serve model requests
@app.post("/predict")
async def predict(image_url: str):
    try:
        response = requests.get(image_url)
        response.raise_for_status()
        img = Image.open(BytesIO(response.content)).convert("RGB")
        input_tensor = preprocess(img).unsqueeze(0).to(device)

        with torch.no_grad():
            output = gpunet(input_tensor)

        probs = softmax(output, dim=1)[0]
        top5_idx = probs.topk(5).indices.cpu().numpy()
        top5_probs = probs.topk(5).values.cpu().numpy()

        results = [{ "label": labels[idx], "probability": float(prob) } for idx, prob in zip(top5_idx, top5_probs)]

        return {"predictions": results}
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))
```

#### Create the Dockerfile
<a name="_create_the_dockerfile"></a>

The following Dockerfile creates a container image for our application, which uses the GPUNet model from the [NVIDIA Deep Learning Examples for Tensor Cores](https://github.com/NVIDIA/DeepLearningExamples) GitHub repository.

We reduce container image size by using a runtime-only PyTorch base, installing only essential packages with cache cleanup, pre-caching model code, and avoiding "baking" weights in the container image to enable faster pulls and updates. To learn more, see [Reducing Container Image Sizes](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-performance.html#_reducing_container_image_sizes).

In the same directory as `app.py`, create the `Dockerfile`:

```
FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime

# Install system packages required for git cloning
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

# Install application dependencies
RUN pip install --no-cache-dir fastapi uvicorn requests pillow boto3 timm==0.5.4

# Pre-cache the GPUNet code from Torch Hub (without weights)
# Clone the repository containing the GPUNet code
RUN mkdir -p /cache/torch/hub && \
    cd /cache/torch/hub && \
    git clone --branch torchhub --depth 1 https://github.com/NVIDIA/DeepLearningExamples NVIDIA_DeepLearningExamples_torchhub

COPY app.py /app/app.py

WORKDIR /app

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]
```

#### Test the application
<a name="_test_the_application"></a>

From the same directory as your `app.py` and `Dockerfile`, build the container image for the inference application, targeting AMD64 architecture:

```
docker build --platform linux/amd64 -t gpunet-inference-app .
```

Set environment variables for your AWS credentials and, optionally, an AWS session token. For example:

```
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID=ABCEXAMPLESCUJFEIELSMUHHAZ
export AWS_SECRET_ACCESS_KEY=123EXAMPLEMZREoQXr8XkiicsOgWDQ5TpUsq0/Z
```

Run the container locally, injecting AWS credentials as environment variables for S3 access. For example:

```
docker run --platform linux/amd64 -p 8080:80 \
  -e S3_BUCKET_NAME=${S3_BUCKET_NAME} \
  -e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
  -e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
  -e AWS_DEFAULT_REGION=${AWS_REGION} \
  gpunet-inference-app
```

The expected output should look like this:

```
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
```

In a new terminal window, test the inference endpoint by sending a sample POST request with a public image URL as a query parameter:

```
curl -X POST "http://localhost:8080/predict?image_url=http://images.cocodataset.org/test-stuff2017/000000024309.jpg"
```

The expected output should be a JSON response with top-5 predictions, similar to this (actual labels and probabilities may vary slightly based on the image and model precision):

```
{"predictions":[{"label":"desk","probability":0.28885871171951294},{"label":"laptop","probability":0.24679335951805115},{"label":"notebook","probability":0.08539070934057236},{"label":"library","probability":0.030645888298749924},{"label":"monitor","probability":0.02989606373012066}]}
```
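The top-5 selection behind this response can be illustrated without a GPU. The following is a minimal sketch of the ranking step performed in `app.py`, using plain Python in place of `torch.topk`; the labels and probabilities here are illustrative, not real model output:

```python
# Minimal sketch of the top-k selection performed in app.py, using plain
# Python instead of torch.topk (labels and probabilities are illustrative).
def top_k(labels, probs, k=5):
    # Pair each label with its probability, sort descending, keep the top k.
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [{"label": label, "probability": prob} for label, prob in ranked[:k]]

labels = ["desk", "laptop", "notebook", "library", "monitor", "chair"]
probs = [0.289, 0.247, 0.085, 0.031, 0.030, 0.010]
print(top_k(labels, probs, k=5))
```

In the application itself, `torch.topk` performs this on the GPU before the indices are moved back to the CPU for label lookup.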

Quit the application using "Ctrl + C".

### Push the container to Amazon ECR
<a name="_push_the_container_to_amazon_ecr"></a>

In this step, we upload the container image for the GPUNet-0 model service to [Amazon Elastic Container Registry (ECR)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html), making it available for deployment on Amazon EKS. This process involves creating a new ECR repository to store the image, authenticating with ECR, then tagging and pushing the container image to our registry.

First, navigate back to the directory where you set your environment variables at the beginning of this guide. For example:

```
cd ..
```

Create a repository in Amazon ECR:

```
aws ecr create-repository --repository-name gpunet-inference-app --region ${AWS_REGION}
```

Log into Amazon ECR:

```
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
```

The expected output should look like this:

```
Login Succeeded
```

Tag the image:

```
docker tag gpunet-inference-app:latest ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/gpunet-inference-app:latest
```

Push the image to your Amazon ECR repository:

```
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/gpunet-inference-app:latest
```

This last step takes several minutes to complete.

## 7. (Optional) Expose the Model Service
<a name="_7_optional_expose_the_model_service"></a>

In this step, you’ll expose your real-time inference model service externally on Amazon EKS using the AWS Load Balancer Controller (LBC). This involves setting up the LBC, mounting model weights from Amazon S3 as a persistent volume using the Mountpoint S3 CSI Driver, deploying a GPU-accelerated application pod, creating a service and ingress to provision an Application Load Balancer (ALB), and testing the endpoint.

First, verify the Pod Identity association for the AWS LBC, confirming that the service account is properly linked to the required IAM role:

```
eksctl get podidentityassociation --cluster ${EKS_CLUSTER_NAME} --namespace kube-system --service-account-name aws-load-balancer-controller
```

The expected output should look like this:

```
ASSOCIATION ARN                                                    NAMESPACE    SERVICE ACCOUNT NAME        IAM ROLE ARN    OWNER ARN
arn:aws:eks:us-east-1:143095308808:podidentityassociation/eks-rt-inference-us-east-1/a-buavluu2wp1jropya    kube-system     aws-load-balancer-controller    arn:aws:iam::143095308808:role/AmazonEKSLoadBalancerControllerRole
```

### Tag your cluster security group
<a name="_tag_your_cluster_security_group"></a>

Karpenter uses the tag key `karpenter.sh/discovery: "${EKS_CLUSTER_NAME}"` to select the security groups it attaches to the nodes it provisions. When you create a cluster with eksctl, the default cluster security group (which carries the `"kubernetes.io/cluster/<cluster-name>: owned"` tag) is not automatically tagged with `karpenter.sh/discovery`, so Karpenter cannot discover it. Adding the tag lets Karpenter discover and attach this security group to the nodes it provisions. Attaching this security group keeps those nodes compatible with the AWS Load Balancer Controller (LBC), allowing it to automatically manage inbound traffic rules for services exposed via Ingress, such as the model service in these steps.

Export the VPC ID for your cluster:

```
CLUSTER_VPC_ID="$(aws eks describe-cluster --name ${EKS_CLUSTER_NAME} --query cluster.resourcesVpcConfig.vpcId --output text)"
```

Export the default security group for your cluster:

```
CLUSTER_SG_ID="$(aws ec2 describe-security-groups --filters Name=vpc-id,Values=$CLUSTER_VPC_ID Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER_NAME} --query 'SecurityGroups[].[GroupId]' --output text)"
```

Add the `karpenter.sh/discovery` tag to the default cluster security group. This will allow our CPU and GPU EC2NodeClass selectors to use it:

```
aws ec2 create-tags --resources ${CLUSTER_SG_ID} --tags Key=karpenter.sh/discovery,Value=${EKS_CLUSTER_NAME}
```

Verify the tag was added:

```
aws ec2 describe-security-groups --group-ids ${CLUSTER_SG_ID} --query "SecurityGroups[].Tags"
```

Among the results, you should see the following with the tag and your cluster name. For example:

```
{
  "Key": "karpenter.sh/discovery",
  "Value": "eks-rt-inference-us-east-1"
}
```

### Set up the AWS Load Balancer Controller (LBC)
<a name="setup_the_shared_aws_load_balancer_controller_lbc"></a>

The AWS LBC is essential for managing ingress traffic to AI/ML workloads on Amazon EKS, ensuring access to inference endpoints or data processing pipelines. By integrating with AWS Application Load Balancers (ALB) and Network Load Balancers (NLB), the LBC dynamically routes traffic to containerized applications, such as those running large language models, computer vision models, or real-time inference services. Since we’ve already created the service account and the Pod Identity Association during cluster creation, we set the `serviceAccount.name` to match what’s defined in our cluster config (`aws-load-balancer-controller`).

Add the AWS-owned **eks-charts** Helm chart repository:

```
helm repo add eks https://aws.github.io/eks-charts
```

Refresh your local Helm repositories with the most recent charts:

```
helm repo update eks
```

Deploy the AWS LBC using Helm, specifying the EKS cluster name and referencing the pre-created service account:

```
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=${EKS_CLUSTER_NAME} \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller
```

The expected output should look like this:

```
NAME: aws-load-balancer-controller
LAST DEPLOYED: Wed Jul 9 15:03:31 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
AWS Load Balancer controller installed!
```

### Mount the model in a persistent volume
<a name="_mount_the_model_in_a_persistent_volume"></a>

In this step, you’ll mount model weights from your Amazon S3 bucket using a PersistentVolume (PV) backed by the Mountpoint for Amazon S3 CSI driver. This allows Kubernetes pods to access S3 objects as local files, eliminating resource-intensive downloads to ephemeral pod storage or init containers—ideal for large, multi-gigabyte model weights.

The PV mounts the entire bucket root (no path specified in `volumeAttributes`), supports concurrent read-only access by multiple pods, and exposes files like the model weights (`/models/gpunet-0.pth`) inside the container for inference. This ensures the fallback "download" in our application (`app.py`) does not trigger because the file exists via the mount. By decoupling the model from the container image, this enables shared access and independent model version updates without image rebuilds.

#### Create the PersistentVolume (PV)
<a name="_create_the_persistentvolume_pv"></a>

Create a PersistentVolume (PV) resource to mount the S3 bucket containing your model weights, enabling read-only access for multiple pods without downloading files at runtime:

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-model-pv
spec:
  capacity:
    storage: 5Gi  # Ignored by the driver; can be any value
  accessModes:
    - ReadOnlyMany  # Read only
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""  # Required for static provisioning
  claimRef:
    namespace: default  # Adjust if you prefer a different namespace
    name: s3-model-pvc
  mountOptions:
    - allow-other  # Enables multi-user access (useful for non-root pods)
    - region ${AWS_REGION} # Optional, include if your bucket is in a different region than the cluster
  csi:
    driver: s3.csi.aws.com
    volumeHandle: gpunet-model-volume  # Must be unique across all PVs
    volumeAttributes:
      bucketName: ${S3_BUCKET_NAME}
EOF
```

#### Create the PersistentVolumeClaim (PVC)
<a name="_create_the_persistentvolumeclaim_pvc"></a>

Create a PersistentVolumeClaim (PVC) to bind to the PV, requesting read-only access to the mounted S3 model data:

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-model-pvc
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""  # Required for static provisioning
  resources:
    requests:
      storage: 5Gi  # Ignored, match PV capacity
  volumeName: s3-model-pv  # Bind to the PV created above
EOF
```

#### Deploy the application
<a name="_deploy_the_application"></a>

Deploy the inference application as a Kubernetes Deployment, mounting the S3-backed persistent volume for model access, applying GPU node selectors and tolerations, and setting environment variables for the model path. This Deployment sets the model path (env var of `"/models/gpunet-0.pth"`), so our application (in `app.py`) will use this path by default. With the Deployment’s volume mount at `/models` (read-only), the model download won’t trigger if the file is already present via the PVC.

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpunet-inference-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpunet-inference-app
  template:
    metadata:
      labels:
        app: gpunet-inference-app
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        role: gpu-worker
      containers:
      - name: inference
        image: ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/gpunet-inference-app:latest
        ports:
        - containerPort: 80
        env:
        - name: MODEL_PATH
          value: "/models/gpunet-0.pth"
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
          readOnly: true
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: s3-model-pvc
EOF
```

It will take a few minutes for Karpenter to provision a GPU node if one isn’t already available. Verify that the inference pod is in a "Running" state:

```
kubectl get pods -l app=gpunet-inference-app
```

The expected output should look like this:

```
NAME                                   READY   STATUS    RESTARTS   AGE
gpunet-inference-app-5d4b6c7f8-abcde   1/1     Running   0          2m
```

### Expose the Service with Ingress and Load Balancer
<a name="_expose_the_service_with_ingress_and_load_balancer"></a>

Create a ClusterIP Service to expose the inference deployment internally within the EKS cluster, targeting the application’s port:

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: gpunet-model-service
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: gpunet-inference-app
EOF
```

Create an Ingress resource to provision an internet-facing Application Load Balancer (ALB) via the AWS LBC, routing external traffic to the inference service:

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gpunet-model-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: gpunet-model-service
            port:
              number: 80
EOF
```

Allow a few minutes for the Application Load Balancer (ALB) to finish provisioning. Monitor the Ingress resource status to confirm the ALB has been provisioned:

```
kubectl get ingress gpunet-model-ingress
```

The expected output should look like this (with the ADDRESS field populated):

```
NAME                   CLASS   HOSTS   ADDRESS                                         PORTS   AGE
gpunet-model-ingress   alb     *       k8s-default-gpunetmo-183de3f819-516310036.us-east-1.elb.amazonaws.com   80      6m58s
```

Extract and export the ALB hostname from the Ingress status for use in subsequent testing:

```
export ALB_HOSTNAME=$(kubectl get ingress gpunet-model-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
```

### Test the Model Service
<a name="_test_the_model_service"></a>

Validate the exposed inference endpoint by sending a POST request with a sample image URL (e.g., from the COCO dataset), simulating real-time prediction:

```
curl -X POST "http://${ALB_HOSTNAME}/predict?image_url=http://images.cocodataset.org/test-stuff2017/000000024309.jpg"
```

The expected output should be a JSON response with top-5 predictions, similar to this (actual labels and probabilities may vary slightly based on the image and model precision):

```
{"predictions":[{"label":"desk","probability":0.2888975441455841},{"label":"laptop","probability":0.2464350312948227},{"label":"notebook","probability":0.08554483205080032},{"label":"library","probability":0.030612602829933167},{"label":"monitor","probability":0.029896672815084457}]}
```

You can optionally continue testing other images in a new POST request. For example:

```
http://images.cocodataset.org/test-stuff2017/000000024309.jpg
http://images.cocodataset.org/test-stuff2017/000000028117.jpg
http://images.cocodataset.org/test-stuff2017/000000006149.jpg
http://images.cocodataset.org/test-stuff2017/000000004954.jpg
```
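To exercise several test images in one pass, a short script can issue the same POST request for each URL. This is a sketch using only the Python standard library; it assumes `ALB_HOSTNAME` is exported in your environment as in the previous step:

```python
import json
import os
import urllib.request

TEST_IMAGES = [
    "http://images.cocodataset.org/test-stuff2017/000000024309.jpg",
    "http://images.cocodataset.org/test-stuff2017/000000028117.jpg",
]

def build_predict_url(host, image_url):
    # The /predict route takes the image URL as a query parameter.
    return f"http://{host}/predict?image_url={image_url}"

if __name__ == "__main__" and "ALB_HOSTNAME" in os.environ:
    for image_url in TEST_IMAGES:
        request = urllib.request.Request(
            build_predict_url(os.environ["ALB_HOSTNAME"], image_url),
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            top = json.load(response)["predictions"][0]
            print(image_url, "->", top["label"], round(top["probability"], 3))
```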

## Conclusion
<a name="_conclusion"></a>

In this guide, you set up an Amazon EKS cluster optimized for GPU-accelerated real-time inference workloads. You provisioned a cluster with [G5 EC2 instances](https://aws.amazon.com/ec2/instance-types/g5/), installed the [Mountpoint S3 CSI Driver](https://docs.aws.amazon.com/eks/latest/userguide/s3-csi.html), [EKS Pod Identity Agent](https://docs.aws.amazon.com/eks/latest/userguide/pod-identities.html), [EKS Node Monitoring Agent](https://docs.aws.amazon.com/eks/latest/userguide/node-health.html), [Bottlerocket AMI](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami-bottlerocket.html), [AWS Load Balancer Controller (LBC)](https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html), and [Karpenter](https://karpenter.sh/) to manage CPU and GPU NodePools. You used the NVIDIA Device Plugin to enable GPU scheduling and configured S3 with a PersistentVolume and PersistentVolumeClaim for model access. You validated the setup by deploying a sample GPU pod, setting up model access for the NVIDIA [GPUNet-0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/gpunet_0_pyt_ckpt) model on [Amazon S3](https://aws.amazon.com/s3/), enabling pod initialization, and exposing the inference service via Application Load Balancer. To fully utilize your cluster, configure the [EKS Node Monitoring Agent](https://docs.aws.amazon.com/eks/latest/userguide/node-health.html) with auto-repair. Be sure to conduct benchmark tests, including GPU performance, latency, and throughput assessments to optimize response times. To learn more, see [Using Monitoring and Observability Tools for your AI/ML Workloads](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-observability.html#_using_monitoring_and_observability_tools_for_your_aiml_workloads).

## Clean up
<a name="_clean_up"></a>

To avoid incurring future charges, delete all resources created during this guide, including the VPC network, by deleting the cluster and its associated CloudFormation stacks.

Delete the CloudFormation stack using the `--wait` flag with eksctl:

```
eksctl delete cluster --region ${AWS_REGION} --name ${EKS_CLUSTER_NAME} --wait
```

Upon completion, you should see the following response output:

```
2025-07-29 13:03:55 [✔]  all cluster resources were deleted
```

Delete the Amazon S3 bucket created during this guide using the [Amazon S3 Console](https://console.aws.amazon.com/s3/home).

# Quickstart: High-throughput LLM inference with vLLM on Amazon EKS
<a name="ml-realtime-inference-llm-inference-vllm"></a>

## Introduction
<a name="_introduction"></a>

This quickstart guide provides a walkthrough for deploying Large Language Models (LLMs) on Amazon EKS using vLLM and GPUs for text-based real-time inference applications.

The solution leverages Amazon EKS for container orchestration and vLLM for efficient model serving, enabling you to build scalable AI applications with GPU acceleration and high-throughput inference serving. The Llama 3.1 8B Instruct model is used for demonstration, but you can deploy any other LLM supported by vLLM (check [vLLM documentation](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) for a list of supported models). To test LLM inference, we use a sample chatbot application based on the project [nextjs-vllm-ui](https://github.com/yoziru/nextjs-vllm-ui). Finally, we use GuideLLM to benchmark and tune vLLM configuration parameters to optimize inference performance.

 **vLLM Architecture on EKS** 

![\[vLLM Architecture diagram\]](http://docs.aws.amazon.com/eks/latest/userguide/images/llm-inference-vllm-architecture.png)


When you complete this procedure, you will have a vLLM inference endpoint optimized for throughput and low latency, and you will be able to interact with a Llama model through a chat frontend application, demonstrating a typical use case for chatbot assistants and other LLM-based applications.

For additional guidance and advanced deployment resources, check our [EKS Best Practices Guide for AI/ML workloads](https://docs.aws.amazon.com/eks/latest/best-practices/aiml.html) and production-ready [AI on EKS inference charts](https://github.com/awslabs/ai-on-eks/tree/main/blueprints/inference/inference-charts).

## Before you begin
<a name="_before_you_begin"></a>

Before getting started, ensure you have:
+ An Amazon EKS cluster with the following main components: Karpenter NodePools using the G5 or G6 EC2 instance family, the NVIDIA Device Plugin installed on your GPU-enabled worker nodes, and the S3 Mountpoint CSI Driver installed. To create this baseline setup, follow the steps in [Best Practices Cluster Setup Guide for Real-Time Inference on Amazon EKS](ml-realtime-inference-cluster.md), up to completing step 4.
+ A Hugging Face account. To sign up, see https://huggingface.co/login.

## Set Up Model Storage with Amazon S3
<a name="_set_up_model_storage_with_amazon_s3"></a>

Store large LLM files efficiently in Amazon S3 to separate storage from compute resources. This approach streamlines model updates, reduces costs, and simplifies management in production setups. S3 handles massive files reliably, while integration with Kubernetes via the Mountpoint CSI driver lets pods access models like local storage—no need for time-consuming downloads during startup. Follow these steps to create an S3 bucket, upload an LLM, and mount it as a volume in your inference serving container.

Other storage solutions are also available on EKS for model caching, such as EFS and FSx for Lustre. For more information, check [EKS Best Practices](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-storage.html).

### Set environment variables
<a name="_set_environment_variables"></a>

Choose a unique name for the Amazon S3 bucket that you will create later in this guide. Once created, use this same bucket name for all steps. For example:

```
MY_BUCKET_NAME=model-store-$(date +%s)
```

Define environment variables and store them in a file:

```
cat << EOF > .env-quickstart-vllm
export BUCKET_NAME=${MY_BUCKET_NAME}
export AWS_REGION=us-east-1
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
EOF
```

Load environment variables in your shell environment. If you close the current shell environment and open a new one, make sure to re-source environment variables using this same command:

```
source .env-quickstart-vllm
```

### Create an S3 bucket to store model files
<a name="_create_an_s3_bucket_to_store_model_files"></a>

Create an S3 bucket to store model files:

```
aws s3 mb s3://${BUCKET_NAME} --region ${AWS_REGION}
```

### Download model from Hugging Face
<a name="_download_model_from_hugging_face"></a>

Hugging Face is one of the main model hubs for accessing LLMs. To download the Llama model, you’ll need to accept the model license and set up token authentication:

1. Accept the Llama 3.1 8B Instruct model license at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.

1. Generate an access token (go to your Profile > Settings > Access Tokens, then create a new token using Read token type).

Set an environment variable with your Hugging Face token:

```
export HF_TOKEN=your_token_here
```

Install the pip3 package if it is not already installed in your environment. For example, on Amazon Linux 2023:

```
sudo dnf install -y python3-pip
```

Install the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli):

```
pip install huggingface-hub
```

Download the Llama-3.1-8B-Instruct model from Hugging Face (~15 GB). The `--exclude` flag skips the legacy PyTorch format and downloads only the optimized safetensors files, which reduces download size while maintaining full compatibility with popular inference engines:

```
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --exclude "original/*" \
  --local-dir ./llama-3.1-8b-instruct \
  --token $HF_TOKEN
```

Verify the downloaded files:

```
ls llama-3.1-8b-instruct
```

The expected output should look like this:

```
LICENSE        config.json                       model-00002-of-00004.safetensors  model.safetensors.index.json  tokenizer_config.json
README.md      generation_config.json            model-00003-of-00004.safetensors  special_tokens_map.json
USE_POLICY.md  model-00001-of-00004.safetensors  model-00004-of-00004.safetensors  tokenizer.json
```

### Upload model files
<a name="_upload_model_files"></a>

Enable AWS Common Runtime (CRT) for improved S3 transfer performance. The CRT-based transfer client provides enhanced throughput and reliability for large file operations:

```
aws configure set s3.preferred_transfer_client crt
```

Upload the model:

```
aws s3 cp ./llama-3.1-8b-instruct s3://$BUCKET_NAME/llama-3.1-8b-instruct \
  --recursive
```

The expected output should look like this:

```
...
upload: llama-3.1-8b-instruct/tokenizer.json to s3://model-store-1753EXAMPLE/llama-3.1-8b-instruct/tokenizer.json
upload: llama-3.1-8b-instruct/model-00004-of-00004.safetensors to s3://model-store-1753890326/llama-3.1-8b-instruct/model-00004-of-00004.safetensors
upload: llama-3.1-8b-instruct/model-00002-of-00004.safetensors to s3://model-store-1753890326/llama-3.1-8b-instruct/model-00002-of-00004.safetensors
upload: llama-3.1-8b-instruct/model-00003-of-00004.safetensors to s3://model-store-1753890326/llama-3.1-8b-instruct/model-00003-of-00004.safetensors
upload: llama-3.1-8b-instruct/model-00001-of-00004.safetensors to s3://model-store-1753890326/llama-3.1-8b-instruct/model-00001-of-00004.safetensors
```

### Set Up S3 Mountpoint CSI permissions
<a name="_set_up_s3_mountpoint_csi_permissions"></a>

The S3 Mountpoint CSI driver enables native integration between Kubernetes and S3, allowing pods to directly access model files as if they were local storage, eliminating the need for local copies during container startup.

Create an IAM policy to allow the S3 mount point to read from your S3 bucket:

```
aws iam create-policy \
  --policy-name S3BucketAccess-${BUCKET_NAME} \
  --policy-document "{\"Version\": \"2012-10-17\", \"Statement\": [{\"Effect\": \"Allow\", \"Action\": [\"s3:GetObject\", \"s3:GetObjectVersion\", \"s3:ListBucket\", \"s3:GetBucketLocation\"], \"Resource\": [\"arn:aws:s3:::${BUCKET_NAME}\", \"arn:aws:s3:::${BUCKET_NAME}/*\"]}]}"
```
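Escaped inline JSON like the `--policy-document` value above is easy to mistype. One way to double-check it is to generate the document with `json.dumps` and compare; the following sketch reproduces the same read-only statement:

```python
import json

def bucket_read_policy(bucket_name):
    # Same read-only policy as the inline --policy-document above.
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectVersion",
                       "s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": [f"arn:aws:s3:::{bucket_name}",
                         f"arn:aws:s3:::{bucket_name}/*"],
        }],
    }, indent=2)

print(bucket_read_policy("model-store-example"))
```

Note that the bucket ARN (for `s3:ListBucket`) and the object ARN with the `/*` suffix (for `s3:GetObject`) are both required.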

Find the IAM role name used by the S3 Mountpoint CSI Driver by checking S3 CSI Driver service account annotations:

```
ROLE_NAME=$(kubectl get serviceaccount s3-csi-driver-sa -n kube-system -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}' | cut -d'/' -f2)
```

Attach your IAM policy with the S3 Mountpoint CSI role:

```
aws iam attach-role-policy \
 --role-name ${ROLE_NAME} \
 --policy-arn arn:aws:iam::${AWS_ACCOUNT_ID}:policy/S3BucketAccess-${BUCKET_NAME}
```

If S3 Mountpoint CSI is not installed in the cluster, follow the deployment steps in [Best Practices Cluster Setup Guide for Real-Time Inference on Amazon EKS](ml-realtime-inference-cluster.md).

### Mount S3 bucket as a Kubernetes volume
<a name="_mount_s3_bucket_as_a_kubernetes_volume"></a>

Create a Persistent Volume (PV) and Persistent Volume Claim (PVC) to provide read-only access to the S3 bucket across multiple inference pods. The ReadOnlyMany access mode ensures concurrent access to model files, while the CSI driver handles the S3 bucket mounting:

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-store
spec:
  storageClassName: ""
  capacity:
    storage: 100Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - region ${AWS_REGION}
  csi:
    driver: s3.csi.aws.com
    volumeHandle: model-store
    volumeAttributes:
      bucketName: ${BUCKET_NAME}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store
spec:
  storageClassName: ""
  volumeName: model-store
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 100Gi
EOF
```

## GPU Infrastructure Setup
<a name="_gpu_infrastructure_setup"></a>

### Cluster nodes
<a name="_cluster_nodes"></a>

We are using the EKS cluster created in [Best Practices Cluster Setup Guide for Real-Time Inference on Amazon EKS](ml-realtime-inference-cluster.md). This cluster includes Karpenter NodePools that can provision GPU-enabled nodes with sufficient node storage to download the vLLM container image. If you are using your own EKS cluster, ensure that it can launch GPU-enabled nodes.

 **Instance Selection** 

Proper instance selection for LLM inference requires ensuring that available GPU memory is sufficient to load the model weights. The weights for Llama 3.1 8B Instruct are approximately 16 GB (the combined size of the `.safetensors` model files), so the vLLM process needs at least that much GPU memory to load the model.

 [Amazon G5 EC2 Instances](https://aws.amazon.com/ec2/instance-types/g5/) with A10G GPUs and [G6 EC2 instances](https://aws.amazon.com/ec2/instance-types/g6/) with L4 GPUs both provide 24GB VRAM per GPU, sufficient for loading Llama 3.1 8B Instruct weights. If you are deploying a model with larger weights, consider using a multi-GPU instance type or a multi-node setup.
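The 16 GB estimate follows directly from parameter count and numeric precision. A quick back-of-the-envelope check, counting weights only and ignoring KV cache, activations, and CUDA runtime overhead:

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    # bf16/fp16 checkpoints store 2 bytes per parameter; billions of
    # parameters times bytes per parameter gives gigabytes directly.
    # Counts weights only: no KV cache, activations, or CUDA overhead.
    return params_billions * bytes_per_param

print(weight_memory_gb(8))  # Llama 3.1 8B in bf16: prints 16
```

In practice, budget extra GPU memory beyond the weights for the KV cache, which grows with batch size and context length.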

 **NVIDIA device drivers** 

NVIDIA device drivers provide the runtime environment that containers need to access GPU resources efficiently. Together with the NVIDIA device plugin, they enable GPU resource allocation and management within Kubernetes, making GPUs available as schedulable resources.

Our cluster uses EKS Bottlerocket AMIs, which include all necessary NVIDIA device drivers and plugins on all GPU-enabled nodes, ensuring immediate GPU accessibility for containerized workloads without additional setup. If you are using other types of EKS nodes, you need to ensure all necessary drivers and plugins are installed.

### Test GPU Infrastructure
<a name="_test_gpu_infrastructure"></a>

Test your cluster’s GPU capabilities by executing the steps below to ensure pods can access NVIDIA GPU resources and schedule correctly on GPU-enabled nodes.

Deploy an NVIDIA SMI test pod:

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-nvidia-smi-test
spec:
  restartPolicy: OnFailure
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  nodeSelector:
    role: gpu-worker  # Matches GPU NodePool's label
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.9.1-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      requests:
        memory: "24Gi"
      limits:
        nvidia.com/gpu: 1
EOF
```

Review the pod logs to check that GPU details are listed, similar to the output below (not necessarily the same GPU model):

```
$ kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-nvidia-smi-test
$ kubectl logs gpu-nvidia-smi-test
```

```
Wed Jul 30 15:39:58 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   30C    P8              9W /  300W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

This output shows that pods can successfully access GPU resources.

 **IMPORTANT**: This pod uses a nodeSelector configuration that aligns with Karpenter node pools in [Best Practices Cluster Setup Guide for Real-Time Inference on Amazon EKS](ml-realtime-inference-cluster.md). If you are using different node pools, update the pod's `nodeSelector` and `tolerations` accordingly.

## Deploy Inference Container
<a name="_deploy_inference_container"></a>

The serving stack determines both performance and scalability capabilities of your inference infrastructure. vLLM has emerged as a leading solution for production deployments. vLLM’s architecture provides continuous batching for dynamic request processing, kernel optimizations for faster inference, and efficient GPU memory management through PagedAttention. These features, combined with a production-ready REST API and support for popular model formats, make it an optimal choice for high-performance inference deployments.

### Select AWS Deep Learning Container image
<a name="select_shared_aws_deep_learning_container_image"></a>

 [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/tree/master) (DLCs) provide pre-optimized environments with security updates, AWS infrastructure compatibility, and optimized driver configurations. This reduces deployment complexity and maintenance overhead while ensuring production readiness.

For this deployment, we’ll use the AWS DLC for vLLM 0.9, which includes NVIDIA libraries and optimized GPU performance configurations specifically tuned for transformer model inference on AWS GPU instances.

```
image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.9-gpu-py312-ec2
```

### Apply vLLM Kubernetes manifests
<a name="_apply_vllm_kubernetes_manifests"></a>

There are multiple ways to deploy vLLM in EKS. This guide demonstrates vLLM deployment using a Kubernetes deployment, which is a Kubernetes-native and easy way to get started. For advanced deployment options see [vLLM docs](https://docs.vllm.ai/en/latest/deployment/k8s.html) and [AI on EKS blueprints](https://awslabs.github.io/ai-on-eks/docs/blueprints).

Define deployment parameters through Kubernetes manifests to control resource allocation, node placement, health probes, and service exposure. Configure your deployment to run a GPU-enabled pod using the AWS Deep Learning Container image for vLLM. Set optimized parameters for LLM inference and expose the vLLM OpenAI-compatible endpoint through an AWS Load Balancer service:

```
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference-app
  template:
    metadata:
      labels:
        app: vllm-inference-app
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        role: gpu-worker
      containers:
      - name: vllm-inference
        image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.9-gpu-py312-ec2
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_PATH
          value: "/mnt/models/llama-3.1-8b-instruct"
        args:
        - "--model=/mnt/models/llama-3.1-8b-instruct"
        - "--host=0.0.0.0"
        - "--port=8000"
        - "--tensor-parallel-size=1"
        - "--gpu-memory-utilization=0.9"
        - "--max-model-len=8192"
        - "--max-num-seqs=1"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 5
          timeoutSeconds: 10
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "24Gi"
            cpu: "4"
            ephemeral-storage: "25Gi"  # Ensure enough node storage for vLLM container image
        volumeMounts:
        - name: models
          mountPath: /mnt/models
          readOnly: true
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-store
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference-svc
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
  selector:
    app: vllm-inference-app
EOF
```

Check that the vLLM pod is in the `Ready 1/1` state:

```
kubectl get pod -l app=vllm-inference-app -w
```

Expected output:

```
NAME                                  READY   STATUS    RESTARTS   AGE
vllm-inference-app-65df5fddc8-5kmjm   1/1     Running   0          5m
```

It may take several minutes for the container image to be pulled and for vLLM to load the model files into GPU memory. Proceed only when the pod is `Ready`.

### Expose the service
<a name="_expose_the_service"></a>

Expose the inference endpoint locally through Kubernetes port forwarding for local development and testing. Leave this command running in a separate terminal window:

```
export POD_NAME=$(kubectl get pod -l app=vllm-inference-app -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward pod/$POD_NAME 8000:8000
```

The AWS Load Balancer Controller automatically creates a Network Load Balancer that exposes the vLLM service endpoint externally. Fetch the NLB hostname by running:

```
NLB=$(kubectl get service vllm-inference-svc -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
```

Need to install AWS Load Balancer Controller? Follow the deployment steps in [Route internet traffic with AWS Load Balancer Controller](aws-load-balancer-controller.md).

## Run inference
<a name="_run_inference"></a>

### Validate inference pod
<a name="_validate_inference_pod"></a>

Validate the inference container functionality locally through the forwarded port. Send a request and ensure that the response returns HTTP status code 200:

```
$ curl -IX GET "http://localhost:8000/v1/models"
```

```
HTTP/1.1 200 OK
date: Mon, 13 Oct 2025 23:24:57 GMT
server: uvicorn
content-length: 516
content-type: application/json
```

Test inference capabilities and validate external connectivity by sending a completion request to the LLM via the NLB endpoint:

```
curl -X POST "http://$NLB:80/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/models/llama-3.1-8b-instruct",
    "prompt": "Explain artificial intelligence:",
    "max_tokens": 512,
    "temperature": 0.7
  }'
```

This endpoint follows the OpenAI API format, making it compatible with existing applications while providing configurable generation parameters like response length and temperature for controlling output diversity.
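Because the endpoint speaks the OpenAI API format, any OpenAI-style client can call it. As a minimal sketch using only the Python standard library (the hostname and model path below are the placeholder values used throughout this guide):

```python
import json
import urllib.request

NLB = "localhost:8000"  # replace with your NLB hostname, or keep for port-forwarding

# Same completion request as the curl example, expressed as a Python payload.
payload = {
    "model": "/mnt/models/llama-3.1-8b-instruct",
    "prompt": "Explain artificial intelligence:",
    "max_tokens": 512,
    "temperature": 0.7,
}
req = urllib.request.Request(
    f"http://{NLB}/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Send the request once the endpoint is reachable:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

The same payload shape works with OpenAI SDK clients by pointing their base URL at `http://<NLB>/v1`.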

### Run chatbot app
<a name="_run_chatbot_app"></a>

For demonstration, this guide runs a sample chatbot application built with the [nextjs-vllm-ui](https://github.com/yoziru/nextjs-vllm-ui) project to showcase user interactions with the model.

Run a chatbot UI as a Docker container that maps port 3000 to localhost and connects to the vLLM NLB endpoint:

```
docker run --rm \
  -p 3000:3000 \
  -e VLLM_URL="http://${NLB}:80" \
  --name nextjs-vllm-ui-demo \
  ghcr.io/yoziru/nextjs-vllm-ui:latest
```

Open your web browser and navigate to: http://localhost:3000/

You should see the chat interface where you can interact with the Llama model.

 **Chat UI Interface** 

![\[Chat UI Interface\]](http://docs.aws.amazon.com/eks/latest/userguide/images/llm-inference-vllm-chatui.png)


## Optimize inference performance
<a name="_optimize_inference_performance"></a>

Specialized inference engines like vLLM provide advanced features that significantly boost inference performance, including continuous batching, efficient KV caching, and optimized memory attention mechanisms. You can tune vLLM configuration parameters to improve inference performance while meeting your specific use case requirements and workload patterns. Proper configuration is essential for achieving GPU saturation, ensuring you extract maximum value from expensive GPU resources while delivering high throughput, low latency, and cost-effective operations. The following optimizations will help you maximize your vLLM deployment’s performance on EKS.

### Benchmark vLLM configurations
<a name="_benchmark_vllm_configurations"></a>

To tune vLLM configuration parameters for your use case, benchmark different settings using a comprehensive inference benchmarking tool such as [GuideLLM](https://github.com/vllm-project/guidellm). It collects key metrics such as requests per second (RPS), end-to-end latency (E2E), time to first token (TTFT), and time per output token (TPOT) to compare different configurations.
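For a single request, these latency metrics relate roughly as follows (an illustrative model, not GuideLLM's exact accounting): end-to-end latency is the time to first token plus one per-token interval for each remaining output token.

```python
# Illustrative relationship between per-request latency metrics:
#   E2E ≈ TTFT + TPOT * (output_tokens - 1)
# Example values are hypothetical, chosen only to show the arithmetic.
ttft_ms, tpot_ms, output_tokens = 150.0, 40.0, 128
e2e_ms = ttft_ms + tpot_ms * (output_tokens - 1)
print(f"E2E ~ {e2e_ms / 1000:.2f} s")  # ~5.23 s for these example values
```

This is why TTFT dominates perceived responsiveness for short replies, while TPOT dominates total latency for long generations.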

### Baseline vLLM configuration
<a name="_baseline_vllm_configuration"></a>

This is the baseline configuration that was used to run vLLM:


| vLLM Parameter | Description | 
| --- | --- | 
|  tensor-parallel-size: 1  |  Distribute the model across 1 GPU  | 
|  gpu-memory-utilization: 0.90  |  Reserve 10% of GPU memory for system overhead  | 
|  max-model-len: 8192  |  Maximum total sequence length (input + output tokens)  | 
|  max-num-seqs: 1  |  Maximum concurrent requests per GPU (batching)  | 

Run GuideLLM with this baseline setup to establish a performance baseline. For this test, GuideLLM is configured to generate 1 request per second, with 256-token requests and 128-token responses.

```
guidellm benchmark \
--target "http://${NLB}:80" \
--processor meta-llama/Llama-3.1-8B-Instruct \
--rate-type constant \
--rate 1 \
--max-seconds 30 \
--data "prompt_tokens=256,output_tokens=128"
```

Expected output:

 **Baseline Benchmark Results** 

![\[Baseline Benchmark Results\]](http://docs.aws.amazon.com/eks/latest/userguide/images/llm-inference-vllm-guidellm-baseline.png)


### Tuned vLLM configuration
<a name="_tuned_vllm_configuration"></a>

Adjust vLLM parameters to better utilize GPU resources and parallelization:


| vLLM Parameter | Description | 
| --- | --- | 
|  tensor-parallel-size: 1  |  Keep at 1 GPU. The tensor parallelism degree must match the number of GPUs available to vLLM  | 
|  gpu-memory-utilization: 0.92  |  Reduce reserved GPU memory overhead where possible, while ensuring that vLLM continues to run without errors  | 
|  max-model-len: 4096  |  Adjust the maximum sequence length to your use case requirements; a lower maximum frees resources that can be used for increased parallelism  | 
|  max-num-seqs: 8  |  Increasing concurrent sequences raises throughput but also latency. Increase this value to maximize throughput while keeping latency within your use case requirements  | 

Apply these changes to the running deployment using the `kubectl patch` command (the JSON paths refer to positions in the container's `args` array):

```
kubectl patch deployment vllm-inference-app --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/args/4", "value": "--gpu-memory-utilization=0.92"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/args/5", "value": "--max-model-len=4096"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/args/6", "value": "--max-num-seqs=8"}
]'
```

Check that the vLLM pod is in the `Ready 1/1` state:

```
kubectl get pod -l app=vllm-inference-app -w
```

Expected output:

```
NAME                                  READY   STATUS    RESTARTS   AGE
vllm-inference-app-65df5fddc8-5kmjm   1/1     Running   0          5m
```

Then run GuideLLM again using the same benchmarking values as before:

```
guidellm benchmark \
--target "http://${NLB}:80" \
--processor meta-llama/Llama-3.1-8B-Instruct \
--rate-type constant \
--rate 1 \
--max-seconds 30 \
--data "prompt_tokens=256,output_tokens=128"
```

Expected output:

 **Optimized Benchmark Results** 

![\[Optimized Benchmark Results\]](http://docs.aws.amazon.com/eks/latest/userguide/images/llm-inference-vllm-guidellm-optimized.png)


### Benchmarking results
<a name="_benchmarking_results"></a>

Compare the benchmarking results for the baseline and optimized vLLM configurations:


| Avg Values | Baseline config | Optimized config | 
| --- | --- | --- | 
|  RPS  |  0.23 req/sec  |  0.86 req/sec  | 
|  E2E  |  12.99 s  |  5.19 s  | 
|  TTFT  |  8637.2 ms  |  147.9 ms  | 
|  TPOT  |  34.0 ms  |  39.5 ms  | 

The optimized vLLM configuration significantly improved inference throughput (RPS) and reduced latency (E2E, TTFT), with only a minor increase in time per output token (TPOT). These results demonstrate how tuning vLLM significantly improves inference performance, allowing each container to process more requests in less time for cost-effective operation.
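The relative gains can be checked directly from the averaged values in the table above:

```python
# Improvement factors computed from the benchmark table above.
baseline = {"rps": 0.23, "e2e_s": 12.99, "ttft_ms": 8637.2, "tpot_ms": 34.0}
tuned    = {"rps": 0.86, "e2e_s": 5.19,  "ttft_ms": 147.9,  "tpot_ms": 39.5}

print(f"throughput: {tuned['rps'] / baseline['rps']:.1f}x higher")      # ~3.7x
print(f"E2E latency: {baseline['e2e_s'] / tuned['e2e_s']:.1f}x lower")  # ~2.5x
print(f"TTFT: {baseline['ttft_ms'] / tuned['ttft_ms']:.0f}x lower")     # ~58x
print(f"TPOT: +{tuned['tpot_ms'] - baseline['tpot_ms']:.1f} ms per token")
```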

# Amazon EKS cluster configuration for AI/ML workloads
<a name="ml-cluster-configuration"></a>

This section is designed to help you configure Amazon EKS clusters optimized for AI/ML workloads. You’ll find guidance on running GPU-accelerated containers using Linux and Windows optimized AMIs, setting up training clusters with Elastic Fabric Adapter (EFA) for high-performance networking, and creating inference clusters with AWS Inferentia instances, including prerequisites, step-by-step procedures, and deployment considerations.

**Topics**
+ [Use EKS-optimized accelerated AMIs for GPU instances](ml-eks-optimized-ami.md)
+ [Run GPU-accelerated containers (Windows on EC2 G-Series)](ml-eks-windows-optimized-ami.md)
+ [Run machine learning training on Amazon EKS with Elastic Fabric Adapter](node-efa.md)
+ [Use AWS Inferentia instances with Amazon EKS for Machine Learning](inferentia-support.md)

# Use EKS-optimized accelerated AMIs for GPU instances
<a name="ml-eks-optimized-ami"></a>

Amazon EKS supports EKS-optimized Amazon Linux and Bottlerocket AMIs for GPU instances. The EKS-optimized accelerated AMIs simplify running AI and ML workloads in EKS clusters by providing pre-built, validated operating system images for the accelerated Kubernetes stack. In addition to the core Kubernetes components that are included in the standard EKS-optimized AMIs, the EKS-optimized accelerated AMIs include the kernel modules and drivers required to run NVIDIA GPU-based `G` and `P` EC2 instances, and the AWS [Inferentia](https://aws.amazon.com/machine-learning/inferentia/) and [Trainium](https://aws.amazon.com/machine-learning/trainium/) EC2 instances in EKS clusters.

The table below shows the supported GPU instance types for each EKS-optimized accelerated AMI variant. See the EKS-optimized [AL2023 releases](https://github.com/awslabs/amazon-eks-ami/releases) and [Bottlerocket releases](https://github.com/bottlerocket-os/bottlerocket/blob/develop/CHANGELOG.md) on GitHub for the latest updates to the AMI variants.


| EKS AMI variant | EC2 instance types | 
| --- | --- | 
|  AL2023 x86_64 NVIDIA  |  p6-b300, p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, g7e, gr6, g6, g6e, g6f, gr6f, g5, g4dn  | 
|  AL2023 ARM NVIDIA  |  p6e-gb200, p6e-gb300, g5g  | 
|  AL2023 x86_64 Neuron  |  inf1, inf2, trn1, trn2  | 
|  Bottlerocket x86_64 aws-k8s-nvidia  |  p6-b300, p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, g7e, gr6, g6, g6e, g6f, gr6f, g5, g4dn  | 
|  Bottlerocket aarch64/arm64 aws-k8s-nvidia  |  g5g  | 
|  Bottlerocket x86_64 aws-k8s  |  inf1, inf2, trn1, trn2  | 

## EKS-optimized NVIDIA AMIs
<a name="eks-amis-nvidia"></a>

By using the EKS-optimized NVIDIA AMIs, you agree to [NVIDIA’s Cloud End User License Agreement (EULA)](https://s3.amazonaws.com/EULA/NVidiaEULAforAWS.pdf).

To find the latest EKS-optimized NVIDIA AMIs, see [Retrieve recommended Amazon Linux AMI IDs](retrieve-ami-id.md) and [Retrieve recommended Bottlerocket AMI IDs](retrieve-ami-id-bottlerocket.md).

When using Elastic Fabric Adapter (EFA) with the EKS-optimized AL2023 or Bottlerocket NVIDIA AMIs, you must install the EFA device plugin separately. For more information, see [Run machine learning training on Amazon EKS with Elastic Fabric Adapter](node-efa.md).

## EKS AL2023 NVIDIA AMIs
<a name="eks-amis-nvidia-al2023"></a>

When using the [NVIDIA GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) with the EKS-optimized AL2023 NVIDIA AMIs, you must disable the operator installation of the driver and toolkit, as these are already included in the EKS AMIs. The EKS-optimized AL2023 NVIDIA AMIs do not include the NVIDIA Kubernetes device plugin or the NVIDIA DRA driver, and these must be installed separately. For more information, see [Manage NVIDIA GPU devices on Amazon EKS](device-management-nvidia.md).

In addition to the standard EKS AMI components, the EKS-optimized AL2023 NVIDIA AMIs include the following components.
+ NVIDIA driver
+ NVIDIA CUDA user mode driver
+ NVIDIA container toolkit
+ NVIDIA fabric manager
+ NVIDIA persistenced
+ NVIDIA IMEX driver
+ NVIDIA NVLink Subnet Manager
+ EFA minimal (kernel module and rdma-core)

For details on the NVIDIA CUDA user mode driver and the CUDA runtime/libraries used within application containers, see the [NVIDIA documentation](https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html#why-cuda-compatibility). The CUDA version shown from `nvidia-smi` is the version of the NVIDIA CUDA user mode driver installed on the host, which must be compatible with the CUDA runtime/libraries used in application containers.

The EKS-optimized AL2023 NVIDIA AMIs support kernel 6.12 for Kubernetes versions 1.33 and above, and NVIDIA driver version 580 for all Kubernetes versions. The NVIDIA 580 driver is required to use CUDA 13.

See the EKS-optimized [AL2023 releases](https://github.com/awslabs/amazon-eks-ami/releases) on GitHub for details of the component versions included in the AMIs. See the EKS AL2023 NVIDIA AMI [installation script](https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh) and [kernel loading script](https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh) for details on how the EKS AMIs configure the NVIDIA dependencies. You can find the list of installed packages and their versions on a running EC2 instance with the `dnf list installed` command.

When building custom AMIs with the EKS-optimized AMIs as the base, it is not recommended or supported to run an operating system upgrade (for example, `dnf upgrade`) or to upgrade any of the Kubernetes or GPU packages included in the EKS-optimized AMIs, as this risks breaking component compatibility. If you do upgrade the operating system or packages included in the EKS-optimized AMIs, thoroughly test in a development or staging environment before deploying to production.

When building custom AMIs for GPU instances, it is recommended to build separate custom AMIs for each instance type generation and family that you will run. The EKS-optimized accelerated AMIs selectively install drivers and packages at runtime based on the underlying instance type generation and family. For more information, see the EKS AMI scripts for [installation](https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh) and [runtime](https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh).

## EKS Bottlerocket NVIDIA AMIs
<a name="eks-amis-nvidia-bottlerocket"></a>

When using the [NVIDIA GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) with the EKS-optimized Bottlerocket NVIDIA AMIs, you must disable the operator installation of the driver, toolkit, and device plugin as these are already included in the EKS AMIs.

In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket NVIDIA AMIs include the following components. The minimal dependencies for EFA (kernel module and rdma-core) are installed in all Bottlerocket variants.
+ NVIDIA Kubernetes device plugin
+ NVIDIA driver
+ NVIDIA CUDA user mode driver
+ NVIDIA container toolkit
+ NVIDIA fabric manager
+ NVIDIA persistenced
+ NVIDIA IMEX driver
+ NVIDIA NVLink Subnet Manager
+ NVIDIA MIG manager

For details on the NVIDIA CUDA user mode driver and the CUDA runtime/libraries used within application containers, see the [NVIDIA documentation](https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html#why-cuda-compatibility). The CUDA version shown from `nvidia-smi` is the version of the NVIDIA CUDA user mode driver installed on the host, which must be compatible with the CUDA runtime/libraries used in application containers.

See the Bottlerocket Version Information in the [Bottlerocket documentation](https://bottlerocket.dev/en/) for details on the installed packages and their versions. The EKS-optimized Bottlerocket NVIDIA AMIs support kernel 6.12 for Kubernetes versions 1.33 and above, and NVIDIA driver version 580 for Kubernetes versions 1.34 and above. The NVIDIA 580 driver is required to use CUDA 13.

## EKS-optimized Neuron AMIs
<a name="eks-amis-neuron"></a>

For details on how to run training and inference workloads using Neuron with Amazon EKS, see the following references:
+  [Containers - Kubernetes - Getting Started](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html) in the AWS Neuron Documentation
+  [Training example](https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/README.md#training) in AWS Neuron EKS Samples on GitHub
+  [Deploy ML inference workloads with Inferentia on Amazon EKS](inferentia-support.md) 

To find the latest EKS-optimized Neuron AMIs, see [Retrieve recommended Amazon Linux AMI IDs](retrieve-ami-id.md) and [Retrieve recommended Bottlerocket AMI IDs](retrieve-ami-id-bottlerocket.md).

When using Elastic Fabric Adapter (EFA) with the EKS-optimized AL2023 or Bottlerocket Neuron AMIs, you must install the EFA device plugin separately. For more information, see [Run machine learning training on Amazon EKS with Elastic Fabric Adapter](node-efa.md).

## EKS AL2023 Neuron AMIs
<a name="eks-amis-neuron-al2023"></a>

The EKS-optimized AL2023 Neuron AMIs do not include the Neuron DRA driver, Neuron Kubernetes device plugin, or the [Neuron Kubernetes scheduler extension](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html), and these must be installed separately. For more information, see [Manage Neuron devices on Amazon EKS](device-management-neuron.md).

In addition to the standard EKS AMI components, the EKS-optimized AL2023 Neuron AMIs include the following components.
+ Neuron driver (aws-neuronx-dkms)
+ Neuron tools (aws-neuronx-tools)
+ EFA minimal (kernel module and rdma-core)

See the EKS AL2023 Neuron AMI [installation script](https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-neuron-driver.sh) for details on how the EKS AMIs configure the Neuron dependencies. See the EKS-optimized [AL2023 releases](https://github.com/awslabs/amazon-eks-ami/releases) on GitHub to see the component versions included in the AMIs. You can find the list of installed packages and their versions on a running EC2 instance with the `dnf list installed` command.

## EKS Bottlerocket Neuron AMIs
<a name="eks-amis-neuron-bottlerocket"></a>

The standard Bottlerocket variants (aws-k8s) include the Neuron dependencies that are automatically detected and loaded when running on AWS Inferentia or Trainium EC2 instances.

The EKS-optimized Bottlerocket AMIs do not include the Neuron DRA driver, Neuron Kubernetes device plugin, or the [Neuron Kubernetes scheduler extension](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html), and these must be installed separately. For more information, see [Manage Neuron devices on Amazon EKS](device-management-neuron.md).

In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket Neuron AMIs include the following components.
+ Neuron driver (aws-neuronx-dkms)
+ EFA minimal (kernel module and rdma-core)

When using the EKS-optimized Bottlerocket AMIs with Neuron instances, the following must be configured in the Bottlerocket user-data. This setting allows the container to take ownership of the mounted Neuron device based on the `runAsUser` and `runAsGroup` values provided in the workload specification. For more information on Neuron support in Bottlerocket, see the [Quickstart on EKS readme](https://github.com/bottlerocket-os/bottlerocket/blob/develop/QUICKSTART-EKS.md#neuron-support) on GitHub.

```
[settings]
[settings.kubernetes]
device-ownership-from-security-context = true
```

See the [Bottlerocket kernel kit changelog](https://github.com/bottlerocket-os/bottlerocket-kernel-kit/blob/develop/CHANGELOG.md) for information on the Neuron driver version included in the EKS-optimized Bottlerocket AMIs.

# Run GPU-accelerated containers (Windows on EC2 G-Series)
<a name="ml-eks-windows-optimized-ami"></a>

**Important**  
The [Kubernetes Device Plugin for DirectX](https://github.com/TensorWorks/DirectX-Device-Plugins) by TensorWorks is a third-party tool that is not endorsed, supported, or maintained by AWS. AWS assumes no responsibility for the security, reliability, or performance of this plugin.

Learn how to run GPU-accelerated Windows container workloads on Amazon EKS (Elastic Kubernetes Service) using NVIDIA GPUs with the Kubernetes Device Plugin for DirectX by TensorWorks. For more information, see [Kubernetes Device Plugin for DirectX](https://github.com/TensorWorks/DirectX-Device-Plugins).

There are two main approaches to setting up GPU-acceleration for your Windows containers:
+  **Option 1**: [Build a custom EKS Windows Optimized AMI](eks-custom-ami-windows.md) with the required GPU drivers pre-installed.
  + Use this approach when you need a consistent, pre-configured environment ready to run GPU-accelerated Windows containers, and you’re able to invest the additional effort to build and maintain the custom AMI.
+  **Option 2**: Install the necessary GPU drivers on your EKS worker nodes after launching your instance.
  + Use this approach when you want a simpler setup process and don’t mind installing the GPU drivers on each new worker node. More suited to a development environment when you are evaluating or prototyping GPU-accelerated workloads.

You can implement either approach using the steps detailed in this guide.

## Considerations
<a name="_considerations"></a>

This guide provides steps to install and set up GPU-acceleration for your Windows containers using NVIDIA GPUs, NVIDIA GRID drivers, and the [Kubernetes Device Plugin for DirectX](https://github.com/TensorWorks/DirectX-Device-Plugins) by TensorWorks. The steps have been tested and verified to provide GPU-acceleration for your Windows container workloads on Amazon EKS. See [Known limitations](#ml-eks-windows-ami-known-limitations) for more information on compatible drivers and device plugins. Before proceeding, note the following:
+ Only G-family instance types with [NVIDIA GRID drivers](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-GRID-driver) have been tested and verified to work with this guide. While other instance types and driver combinations may also be capable of running GPU-accelerated Windows containers, they may require additional configuration steps not covered in this guide.
+ Only DirectX-based workloads have been tested and verified to work with this guide. While other GPU APIs like OpenGL, Vulkan, and OpenCL may potentially be compatible to run GPU-accelerated Windows containers, they may require additional configuration steps not covered in this guide.
+ There are some known limitations to be aware of before running GPU-accelerated Windows containers. Please see the [Known limitations](#ml-eks-windows-ami-known-limitations) section for more information.

## Prerequisites
<a name="ml-eks-windows-ami-prerequisites"></a>

To enable GPU acceleration for your Windows containers on Amazon EKS, you’ll need to prepare the following requirements before proceeding:
+ Launch an Amazon EKS cluster with Kubernetes v1.27 or newer.
+ Provision Windows nodes with Windows Server 2022 or newer.
+ Provision Windows nodes in the G-family of instance types, such as [G4](https://aws.amazon.com/ec2/instance-types/g4/) or [G5](https://aws.amazon.com/ec2/instance-types/g5/).
+ Provision Windows nodes with a container runtime with containerd `1.7.x` or `2.x.x`. (See [Retrieve Windows AMI version information](eks-ami-versions-windows.md) to verify the containerd version in your Amazon EKS Optimized AMI.)

## Install the GPU driver on each Windows node
<a name="ml-eks-windows-ami-install-gpu-driver"></a>

To install the NVIDIA GRID drivers on your EKS worker nodes, follow the steps outlined in [NVIDIA drivers for your Amazon EC2 instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html). Navigate to [Installation options - Option 3: GRID drivers](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver#nvidia-GRID-driver) and follow the installation steps.

 **Install for Windows Server Core** 

For Windows Server Core, which doesn’t have a desktop experience, install NVIDIA GRID drivers silently by using the following commands:

```
$nvidiaInstallerFilePath = "nvidia-driver-installer.exe" # Replace with the path to your installer
$installerArguments = "-s -clean -noreboot -noeula"
Start-Process -FilePath $nvidiaInstallerFilePath -ArgumentList $installerArguments -Wait -NoNewWindow -PassThru
```

 **Verify your installation** 

Run the following PowerShell command to show diagnostic information about the GPUs on the instance:

```
nvidia-smi
```

This command displays the NVIDIA driver version, as well as information about the GPU hardware. Ensure that the output of this command matches the NVIDIA GRID driver version you expected to be installed.

## Deploy the GPU device plugin on each node
<a name="ml-eks-windows-ami-deploy-gpu-driver"></a>

To enable discovery and exposure of GPU resources to containers on your Windows nodes, you need a device plugin. Deploy the [DirectX Device Plugin](https://github.com/TensorWorks/DirectX-Device-Plugins) by TensorWorks on each worker node by running it as a DaemonSet in your EKS cluster, following the installation guide in the [README.md](https://github.com/TensorWorks/DirectX-Device-Plugins/blob/main/README.md). It is recommended to:
+ Deploy the device plugin in the `kube-system` namespace.
+ Set appropriate resource limits for the DaemonSet to ensure it does not consume excessive resources on your nodes.

**Note**  
The device plugin DaemonSet will run on every node as a host process container with elevated privileges. It is recommended to implement RBAC controls to restrict access to this DaemonSet so only authorized users can execute privileged commands.

When running GPU-accelerated containers, the device plugin supports two modes:
+  **Single-tenancy mode**: This mode dedicates all GPU resources to a single container on the instance. Install the device plugins with single-tenancy support using the following command. See README.md for more information.

  ```
  kubectl apply -f "https://raw.githubusercontent.com/TensorWorks/directx-device-plugins/main/deployments/default-daemonsets.yml"
  ```
+  **Multi-tenancy mode**: This mode allows sharing GPU resources among multiple containers on the instance. Install the device plugins with multi-tenancy support using the following command. See README.md for more information.

  ```
  kubectl apply -f "https://raw.githubusercontent.com/TensorWorks/directx-device-plugins/main/deployments/multitenancy-inline.yml"
  ```

  Alternatively, use a ConfigMap to specify the multi-tenancy.

  ```
  kubectl apply -f "https://raw.githubusercontent.com/TensorWorks/directx-device-plugins/main/deployments/multitenancy-configmap.yml"
  ```

### Verifying the device plugin deployment
<a name="ml-eks-windows-ami-verify-device-plugin"></a>

After you have deployed the device plugin, replace `<namespace>` and run the following command to verify the DirectX Device Plugin is running correctly on all your Windows nodes.

```
kubectl get ds device-plugin-wddm -n <namespace>
```

### Verifying containers are ready for deployment
<a name="ml-eks-windows-ami-verify-container-deployment"></a>

Once the device plugin DaemonSet is running on the GPU-powered Windows worker nodes, use the following command to verify that each node has allocatable GPUs. The corresponding number should match the number of DirectX devices on each node.

```
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,DirectX:.status.allocatable.directx\.microsoft\.com/display"
```

## Running Windows containers with GPU-acceleration
<a name="ml-eks-windows-ami-run-with-gpu-acceleration"></a>

Before launching your pods, specify the resource name `directx.microsoft.com/display` in `.spec.containers[].resources`. This will indicate that your containers require GPU-enabled capabilities, and the `kube-scheduler` will attempt to place your pods on your pre-configured Windows node with available GPU resources.

As an example, see the sample command below which launches a `Job` to run Monte Carlo simulation to estimate the value of pi. This example is from the [Kubernetes Device Plugins for DirectX](https://github.com/TensorWorks/DirectX-Device-Plugins) GitHub repository, which has [multiple examples](https://github.com/TensorWorks/DirectX-Device-Plugins/tree/main/examples) to choose from that you can run to test your Windows node GPU capabilities.

```
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: example-cuda-montecarlo-wddm
spec:
  template:
    spec:
      containers:
      - name: example-cuda-montecarlo-wddm
        image: "index.docker.io/tensorworks/example-cuda-montecarlo:0.0.1"
        resources:
          limits:
            directx.microsoft.com/display: 1
      nodeSelector:
        "kubernetes.io/os": windows
      restartPolicy: Never
  backoffLimit: 0
EOF
```

## Known limitations
<a name="ml-eks-windows-ami-known-limitations"></a>

### All GPUs are usable
<a name="ml-eks-windows-ami-gpus-usable"></a>

All the GPUs on the instance will be usable by each running container on the host, even when you request a specific number of GPUs for a given container. Additionally, the default behavior is that all containers running on the host will use the GPU with index 0, even if there are multiple GPUs available on the node. Thus, for multi-GPU tasks to operate correctly, you must explicitly designate the specific GPU device to be utilized within your application’s code.

The exact implementation to allocate a device to use for the application will depend on the programming language or framework you are using. For example, if you’re using CUDA programming, to select a specific GPU, you can explicitly specify the device to use in your application code by using the function [cudaSetDevice()](https://docs.nvidia.com/cuda/cuda-runtime-api/group_%5FCUDART%5F_DEVICE.html).

The need to explicitly specify the device is due to a known issue affecting Windows containers. You can track progress on resolving this issue in [microsoft/Windows-Containers issue #333](https://github.com/microsoft/Windows-Containers/issues/333). The following table provides a practical example of this GPU allocation behavior.

Consider a scenario in which three pods are launched on a single Windows node of EC2 instance type `g4dn.12xlarge`, which comes with four GPUs. The table shows that regardless of the number of GPUs requested by each container, all three pods have access to all four GPUs on the instance, and by default will utilize the GPU with device index 0.


| Pod | Requested GPUs | Actual GPU Access | Default GPU Usage | Available GPU Indices | Total Instance GPUs | 
| --- | --- | --- | --- | --- | --- | 
|  Pod 1  |  1 GPU  |  All 4 GPUs  |  GPU with index 0  |  0, 1, 2, 3  |  4  | 
|  Pod 2  |  2 GPUs  |  All 4 GPUs  |  GPU with index 0  |  0, 1, 2, 3  |  4  | 
|  Pod 3  |  1 GPU  |  All 4 GPUs  |  GPU with index 0  |  0, 1, 2, 3  |  4  | 

### Kubernetes device plugin support
<a name="ml-eks-windows-ami-device-plugin-support"></a>

NVIDIA’s official implementation of the [Kubernetes device plugin](https://github.com/NVIDIA/k8s-device-plugin) does not support Windows. You can track the progress on adding official Windows support in [NVIDIA/k8s-device-plugin issue #419](https://github.com/NVIDIA/k8s-device-plugin/issues/419).

### GPU compute instance limitations
<a name="ml-eks-windows-ami-compute-instance-limitations"></a>

Depending on your AWS account configuration, you may have service limits on the number and types of Amazon EC2 GPU compute instances that you can launch. If you require additional capacity, you can [Request a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html).
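
One way to check your current limit is the Service Quotas CLI. As an assumption for illustration, quota code `L-DB2E81BA` ("Running On-Demand G and VT instances") is used here; confirm the quota code that applies to your account:

```shell
# Look up the current vCPU quota for On-Demand G and VT instances
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-DB2E81BA \
    --query "Quota.{Name:QuotaName,Value:Value}"
```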

### Must build a Windows GPU Optimized AMI
<a name="ml-eks-windows-ami-build-gpu-ami"></a>

There is no EKS Windows GPU Optimized AMI or EC2 Image Builder managed component provided by Amazon EKS. You will need to follow the steps in this guide to build a custom EKS Windows Optimized AMI with the required GPU drivers pre-installed, or install the necessary GPU drivers on your EKS worker nodes after launching your instances.

### Inferentia and Trainium not supported
<a name="ml-eks-windows-ami-inferentia-tranium-support"></a>

 AWS [Inferentia](https://aws.amazon.com/ai/machine-learning/inferentia/) and AWS [Trainium](https://aws.amazon.com/ai/machine-learning/trainium/) based workloads are not supported on Windows.

# Run machine learning training on Amazon EKS with Elastic Fabric Adapter
<a name="node-efa"></a>

This topic describes how to integrate Elastic Fabric Adapter (EFA) with Pods deployed in your Amazon EKS cluster. Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling these applications. With EFA, High Performance Computing (HPC) applications using the Message Passing Interface (MPI) and Machine Learning (ML) applications using NVIDIA Collective Communications Library (NCCL) can scale to thousands of CPUs or GPUs. As a result, you get the application performance of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS cloud. Integrating EFA with applications running on Amazon EKS clusters can reduce the time to complete large scale distributed training workloads without having to add additional instances to your cluster. For more information about EFA, see [Elastic Fabric Adapter](https://aws.amazon.com/hpc/efa/).

## Instance types with EFA
<a name="efa-instances"></a>

The *AWS EFA Kubernetes Device Plugin* supports all Amazon EC2 instance types that have EFA. To see a list of all instance types that have EFA, see [Supported instance types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types) in the *Amazon EC2 User Guide*. However, to run ML applications quickly, we recommend that an instance has hardware acceleration chips such as NVIDIA GPUs, [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) chips, or [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) chips, in addition to the EFA. To see a list of instance types that have hardware acceleration chips and EFA, see [Accelerated computing](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types) in the *Amazon EC2 User Guide*.

As you compare instance types to choose between them, consider the number of EFA network cards available for that instance type as well as the number of accelerator cards, amount of CPU, and amount of memory. You can assign up to one EFA per network card. An EFA counts as a network interface. To see how many EFAs are available for each instance type that has EFA, see the [Network cards](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#network-cards) list in the *Amazon EC2 User Guide*.
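
You can also query these properties directly with the AWS CLI. The sketch below assumes the `NetworkInfo.EfaInfo.MaximumEfaInterfaces` field reported by `describe-instance-types`:

```shell
# Show the maximum EFA interfaces and network card count for an instance type
aws ec2 describe-instance-types --instance-types p5.48xlarge \
    --query "InstanceTypes[].{Type:InstanceType,MaxEfaInterfaces:NetworkInfo.EfaInfo.MaximumEfaInterfaces,NetworkCards:length(NetworkInfo.NetworkCards)}" \
    --output table
```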

## EFA and EFA-only interfaces
<a name="efa-only-interfaces"></a>

An *Elastic Fabric Adapter (EFA)* is a network interface that combines the capabilities of an Elastic Network Adapter (ENA) and an OS-bypass interface, powered by the AWS Scalable Reliable Datagram (SRD) protocol. The EFA functionalities allow applications to communicate directly with the hardware for low-latency transport. You can choose to access only the EFA capabilities using *EFA-only* interfaces, limiting communication to interfaces within the same Availability Zone.

To create nodes that can have EFA-only interfaces, you must use a custom EC2 Launch Template and set the `InterfaceType` to `efa-only`. In your custom Launch Template, you can’t set network card `0` to an EFA-only interface, because that is the primary network card and network interface of the EC2 instance. You must have VPC CNI version `1.18.5` or later for EFA-only interfaces. If you are using Amazon Linux 2, the AMI version must be `v20240928` or later for EFA-only interfaces.

The following procedure guides you to create an EKS cluster with `eksctl` with nodes that have NVIDIA GPUs and EFA interfaces. You can’t use `eksctl` to create nodes and node groups that use EFA-only interfaces.

## Prerequisites
<a name="efa-prereqs"></a>
+ An existing Amazon EKS cluster. If you don’t have an existing cluster, create one using [Get started with Amazon EKS](getting-started.md). Your cluster must be deployed in a VPC that has at least one private subnet with enough available IP addresses to deploy nodes in. The private subnet must have outbound internet access provided by an external device, such as a NAT gateway.

  If you plan to use `eksctl` to create your node group, `eksctl` can also create a cluster for you.
+ Version `2.12.3` or later or version `1.27.160` or later of the AWS Command Line Interface (AWS CLI) installed and configured on your device or AWS CloudShell. To check your current version, use `aws --version | cut -d / -f2 | cut -d ' ' -f1`. Package managers such as `yum`, `apt-get`, or Homebrew for macOS are often several versions behind the latest version of the AWS CLI. To install the latest version, see [Installing](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) and [Quick configuration with aws configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) in the *AWS Command Line Interface User Guide*. The AWS CLI version that is installed in AWS CloudShell might also be several versions behind the latest version. To update it, see [Installing AWS CLI to your home directory](https://docs.aws.amazon.com/cloudshell/latest/userguide/vm-specs.html#install-cli-software) in the *AWS CloudShell User Guide*.
+ The `kubectl` command line tool is installed on your device or AWS CloudShell. The version can be the same as or up to one minor version earlier or later than the Kubernetes version of your cluster. For example, if your cluster version is `1.29`, you can use `kubectl` version `1.28`, `1.29`, or `1.30` with it. To install or upgrade `kubectl`, see [Set up `kubectl` and `eksctl`](install-kubectl.md).
+ You must have the Amazon VPC CNI plugin for Kubernetes version `1.7.10` or later installed before launching worker nodes that support multiple Elastic Fabric Adapters, such as the `p4d` or `p5`. For more information about updating your Amazon VPC CNI plugin for Kubernetes version, see [Assign IPs to Pods with the Amazon VPC CNI](managing-vpc-cni.md).
+ For p6-b200 instances, you must use EFA Device Plugin version v0.5.6 or later.
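
To check the VPC CNI plugin version requirement above, you can read the image tag of the `aws-node` DaemonSet, assuming the plugin is deployed under that standard name:

```shell
# Print the Amazon VPC CNI plugin version running in the cluster
kubectl describe daemonset aws-node -n kube-system | grep amazon-k8s-cni: | cut -d : -f 3
```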

**Important**  
An important consideration required for adopting EFA with Kubernetes is configuring and managing Huge Pages as a resource in the cluster. For more information, see [Manage Huge Pages](https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/) in the Kubernetes documentation. Amazon EC2 instances with the EFA driver installed pre-allocate 5128 2MiB Huge Pages, which you can request as resources to consume in your job specifications.
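
As a quick sanity check of the arithmetic in the note above, 5128 pre-allocated 2MiB pages work out to just over 10GiB of Huge Pages per node:

```shell
# 5128 pages x 2MiB per page, expressed in MiB
echo "$((5128 * 2))Mi"
```

This prints `10256Mi`, the total you can divide among the `hugepages-2Mi` requests in your job specifications.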

## Create node group
<a name="efa-create-nodegroup"></a>

The following procedure helps you create a `p5.48xlarge`-backed node group with EFA interfaces and GPUDirect RDMA, and run an example NVIDIA Collective Communications Library (NCCL) test for multi-node NCCL performance using EFAs. The example can be used as a template for distributed deep learning training on Amazon EKS using EFAs.

1. Determine which Amazon EC2 instance types that support EFA are available in the AWS Region that you want to deploy nodes in. Replace *region-code* with the AWS Region that you want to deploy your node group in.

   ```
   aws ec2 describe-instance-types --region region-code \
       --filters Name=network-info.efa-supported,Values=true \
       --query "InstanceTypes[*].[InstanceType]" --output text
   ```

   When you deploy nodes, the instance type that you want to deploy must be available in the AWS Region that your cluster is in.

1. Determine which Availability Zones the instance type that you want to deploy is available in. In this tutorial, the `p5.48xlarge` instance type is used and must be returned in the output for the AWS Region that you specified in the previous step. When you deploy nodes in a production cluster, replace *p5.48xlarge* with any instance type returned in the previous step.

   ```
   aws ec2 describe-instance-type-offerings --region region-code \
       --location-type availability-zone --filters Name=instance-type,Values=p4d.24xlarge,p5.48xlarge \
       --query 'InstanceTypeOfferings[*].Location' --output text
   ```

   An example output is as follows.

   ```
   us-west-2a    us-west-2c    us-west-2b
   ```

   Note the Availability Zones returned for use in later steps. When you deploy nodes to a cluster, your VPC must have subnets with available IP addresses in one of the Availability Zones returned in the output.

1. Create a node group using `eksctl`. You need version `0.215.0` or later of the `eksctl` command line tool installed on your device or AWS CloudShell. To install or update `eksctl`, see [Installation](https://eksctl.io/installation) in the `eksctl` documentation.

   1. Copy the following contents to a file named *efa-cluster.yaml*. Replace the example values with your own. You can replace `p5.48xlarge` with a different instance, but if you do, make sure that the values for `availabilityZones` are Availability Zones that were returned for the instance type in step 1.

      ```
      apiVersion: eksctl.io/v1alpha5
      kind: ClusterConfig
      
      metadata:
        name: my-efa-cluster
        region: region-code
        version: "1.XX"
      
      iam:
        withOIDC: true
      
      availabilityZones: ["us-west-2a", "us-west-2c"]
      
      managedNodeGroups:
        - name: my-efa-ng
          instanceType: p5.48xlarge
          minSize: 1
          desiredCapacity: 2
          maxSize: 3
          availabilityZones: ["us-west-2a"]
          volumeSize: 300
          privateNetworking: true
          efaEnabled: true
      ```

   1. Create a managed node group in an existing cluster.

      ```
      eksctl create nodegroup -f efa-cluster.yaml
      ```

      If you don’t have an existing cluster, you can run the following command to create a cluster and the node group.

      ```
      eksctl create cluster -f efa-cluster.yaml
      ```
**Note**  
Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance for you when using Amazon Linux 2. This is not necessary for Bottlerocket, as the NVIDIA device plugin is built into Bottlerocket’s EKS NVIDIA variant. When `efaEnabled` is set to `true` in the nodegroup configuration, `eksctl` will also automatically deploy the EFA device plugin on the nodes.

### Using Bottlerocket with EFA
<a name="efa-bottlerocket"></a>

Bottlerocket AMI version 1.28.0 and later include official support for EFA. To use Bottlerocket for EFA-enabled nodes, specify `amiFamily: Bottlerocket` in your configuration. If you need to use a custom AMI ID, you must use standard `nodeGroups` instead of `managedNodeGroups`.

Here’s an example configuration:

```
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-efa-bottlerocket-cluster
  region: region-code
  version: "1.XX"

iam:
  withOIDC: true

availabilityZones: ["us-west-2a", "us-west-2c"]

managedNodeGroups:
  - name: my-efa-bottlerocket-ng
    instanceType: p5.48xlarge
    minSize: 1
    desiredCapacity: 2
    maxSize: 3
    availabilityZones: ["us-west-2a"]
    volumeSize: 300
    privateNetworking: true
    efaEnabled: true
    amiFamily: Bottlerocket
    bottlerocket:
      enableAdminContainer: true
      settings:
        kernel:
          sysctl:
            "vm.nr_hugepages": "3000"  # Configures 3000 * 2Mi = 6000Mi hugepages
```

The `vm.nr_hugepages` sysctl setting above configures the number of 2Mi hugepages. In this example, 3000 means 3000 × 2Mi = 6000Mi of hugepages.
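
Once the Bottlerocket nodes have registered, you can confirm the kernel setting took effect by checking the huge page resources each node advertises:

```shell
# Show the 2MiB huge page capacity advertised by each node
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,HUGEPAGES-2MI:.status.allocatable.hugepages-2Mi"
```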

### Verify EFA device plugin installation
<a name="verify-efa-device-plugin"></a>

When you create a node group with `efaEnabled: true`, `eksctl` automatically deploys the EFA Kubernetes device plugin for you. You can verify that the device plugin is installed and functioning correctly:

1. Check the DaemonSet status:

   ```
   kubectl get daemonsets -n kube-system
   ```

   Sample output:

   ```
   NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
   aws-efa-k8s-device-plugin-daemonset   2         2         2       2            2           <none>          6m16s
   ...
   ```

   Here, the EFA device plugin DaemonSet is running on two nodes. Both are READY and AVAILABLE.

1. Next, verify the pods created by the DaemonSet:

   ```
   kubectl get pods -n kube-system -l name=aws-efa-k8s-device-plugin
   ```

   Sample output:

   ```
   NAME                                        READY   STATUS    RESTARTS   AGE
   aws-efa-k8s-device-plugin-daemonset-d68bs   1/1     Running   0          6m16s
   aws-efa-k8s-device-plugin-daemonset-w4l8t   1/1     Running   0          6m16s
   ```

   The EFA device plugin pods are in a Running state, confirming that the plugin is successfully deployed and operational.

1. Verify resource registration:

   You can confirm that the `vpc.amazonaws.com/efa` resource is registered with the kubelet by describing the nodes:

   ```
   kubectl describe nodes
   ```

   If the EFA resource is properly registered, you will see it listed under the node’s Capacity and Allocatable resources. For example:

   ```
   Capacity:
     ...
     vpc.amazonaws.com/efa:  4
   Allocatable:
     ...
     vpc.amazonaws.com/efa:  4
   ```

   This output confirms that the node recognizes the EFA resource, making it available for pods that request it.

## (Optional) Test the performance of the EFA
<a name="efa-application"></a>

We recommend that you test the EFA setup. You can use the [NCCL Tests](https://github.com/aws-samples/awsome-distributed-training/tree/main/micro-benchmarks/nccl-tests) in the `aws-samples/awsome-distributed-training` repository on GitHub. [NCCL Tests](https://github.com/NVIDIA/nccl-tests) evaluate the performance of the network using the NVIDIA Collective Communications Library. The following steps submit NCCL tests on Amazon EKS.

1. Deploy the Kubeflow MPI Operator:

   For the NCCL tests you can apply the Kubeflow MPI Operator. The MPI Operator makes it easy to run Allreduce-style distributed training on Kubernetes. For more information, see [MPI Operator](https://github.com/kubeflow/mpi-operator) on GitHub.

1. Run the multi-node NCCL Performance Test to verify GPUDirectRDMA/EFA:

   To verify NCCL performance with GPUDirectRDMA over EFA, run the standard NCCL Performance test. For more information, see the official [NCCL-Tests](https://github.com/NVIDIA/nccl-tests.git) repo on GitHub.

   Complete the following steps to run a two-node NCCL Performance Test. In the example NCCL test job, each worker requests eight GPUs, 5120Mi of `hugepages-2Mi`, 32 EFAs, and 32000Mi of memory, which effectively means each worker consumes all the resources of a `p5.48xlarge` instance.

   1. Create the MPIJob manifest:

      Copy the following to a file named `nccl-tests.yaml`:

      ```
      apiVersion: kubeflow.org/v2beta1
      kind: MPIJob
      metadata:
        name: nccl-tests
      spec:
        runPolicy:
          cleanPodPolicy: Running
          backoffLimit: 20
        slotsPerWorker: 8
        mpiReplicaSpecs:
          Launcher:
            replicas: 1
            template:
              spec:
                restartPolicy: OnFailure
                containers:
                - image: public.ecr.aws/hpc-cloud/nccl-tests:latest
                  imagePullPolicy: IfNotPresent
                  name: test-nccl-launcher
                  env:
                   - name: PATH
                     value: $PATH:/opt/amazon/efa/bin:/usr/bin
                  command:
                  - /opt/amazon/openmpi/bin/mpirun
                  - --allow-run-as-root
                  - --tag-output
                  - -np
                  - "16"
                  - -N
                  - "8"
                  - --bind-to
                  - none
                  - -x
                  - PATH
                  - -x
                  - LD_LIBRARY_PATH
                  - -x
                  - NCCL_DEBUG=INFO
                  - -x
                  - NCCL_BUFFSIZE=8388608
                  - -x
                  - NCCL_P2P_NET_CHUNKSIZE=524288
                  - -x
                  - NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.so
                  - --mca
                  - pml
                  - ^cm,ucx
                  - --mca
                  - btl
                  - tcp,self
                  - --mca
                  - btl_tcp_if_exclude
                  - lo,docker0,veth_def_agent
                  - /opt/nccl-tests/build/all_reduce_perf
                  - -b
                  - "8"
                  - -e
                  - "16G"
                  - -f
                  - "2"
                  - -g
                  - "1"
                  - -c
                  - "1"
                  - -n
                  - "100"
          Worker:
            replicas: 2
            template:
              spec:
                nodeSelector:
                  node.kubernetes.io/instance-type: "p5.48xlarge"
                containers:
                - image: public.ecr.aws/hpc-cloud/nccl-tests:latest
                  imagePullPolicy: IfNotPresent
                  name: nccl-tests-worker
                  volumeMounts:
                  - name: shmem
                    mountPath: /dev/shm
                  resources:
                    limits:
                      nvidia.com/gpu: 8
                      hugepages-2Mi: 5120Mi
                      vpc.amazonaws.com/efa: 32
                      memory: 32000Mi
                    requests:
                      nvidia.com/gpu: 8
                      hugepages-2Mi: 5120Mi
                      vpc.amazonaws.com/efa: 32
                      memory: 32000Mi
                volumes:
                - name: shmem
                  hostPath:
                    path: /dev/shm
      ```

   1. Apply the NCCL-tests MPIJob:

      Submit the `MPIJob` by applying the manifest. This will create two `p5.48xlarge` Amazon EC2 instances.

      ```
      kubectl apply -f nccl-tests.yaml
      ```

      An example output is as follows.

      ```
      mpijob.kubeflow.org/nccl-tests created
      ```

   1. Verify that the job started pods:

      View your running Pods.

      ```
      kubectl get pods
      ```

      An example output is as follows.

      ```
      NAME                        READY   STATUS     RESTARTS   AGE
      nccl-tests-launcher-nbql9   0/1     Init:0/1   0          2m49s
      nccl-tests-worker-0         1/1     Running    0          2m49s
      nccl-tests-worker-1         1/1     Running    0          2m49s
      ```

      The MPI Operator creates a launcher Pod and two worker Pods (one on each node).

   1. Verify that the job is running successfully with the logs:

      View the log for the `nccl-tests-launcher` Pod. Replace *nbql9* with the value from your output.

      ```
      kubectl logs -f nccl-tests-launcher-nbql9
      ```

If the test completed successfully, you can deploy your applications that use the NVIDIA Collective Communications Library.
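
To pull just the headline numbers from a successful run, you can filter the launcher log for the bandwidth summary that the NCCL tests print; replace the pod name with the one from your output:

```shell
# Extract the bus bandwidth summary from the NCCL test log
kubectl logs nccl-tests-launcher-nbql9 | grep -E "busbw|Avg bus bandwidth"
```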

# Use AWS Inferentia instances with Amazon EKS for Machine Learning
<a name="inferentia-support"></a>

This topic describes how to create an Amazon EKS cluster with nodes running [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) instances and (optionally) deploy a sample application. Amazon EC2 Inf1 instances are powered by [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) chips, which are custom built by AWS to provide high performance and lowest cost inference in the cloud. Machine learning models are deployed to containers using [AWS Neuron](https://aws.amazon.com/machine-learning/neuron/), a specialized software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the machine learning inference performance of Inferentia chips. AWS Neuron supports popular machine learning frameworks such as TensorFlow, PyTorch, and MXNet.

**Note**  
Neuron device logical IDs must be contiguous. If a Pod requesting multiple Neuron devices is scheduled on an `inf1.6xlarge` or `inf1.24xlarge` instance type (which have more than one Neuron device), that Pod will fail to start if the Kubernetes scheduler selects non-contiguous device IDs. For more information, see [Device logical IDs must be contiguous](https://github.com/aws/aws-neuron-sdk/issues/110) on GitHub.

## Prerequisites
<a name="inferentia-prerequisites"></a>
+ Have `eksctl` installed on your computer. If you don’t have it installed, see [Installation](https://eksctl.io/installation) in the `eksctl` documentation.
+ Have `kubectl` installed on your computer. For more information, see [Set up `kubectl` and `eksctl`](install-kubectl.md).
+ (Optional) Have `python3` installed on your computer. If you don’t have it installed, then see [Python downloads](https://www.python.org/downloads/) for installation instructions.

## Create a cluster
<a name="create-cluster-inferentia"></a>

1. Create a cluster with Inf1 Amazon EC2 instance nodes. You can replace *inf1.2xlarge* with any [Inf1 instance type](https://aws.amazon.com/ec2/instance-types/inf1/). The `eksctl` utility detects that you are launching a node group with an `Inf1` instance type and will start your nodes using one of the Amazon EKS optimized accelerated Amazon Linux AMIs.
**Note**  
You can’t use [IAM roles for service accounts](iam-roles-for-service-accounts.md) with TensorFlow Serving.

   ```
   eksctl create cluster \
       --name inferentia \
       --region region-code \
       --nodegroup-name ng-inf1 \
       --node-type inf1.2xlarge \
       --nodes 2 \
       --nodes-min 1 \
       --nodes-max 4 \
       --ssh-access \
       --ssh-public-key your-key \
       --with-oidc
   ```
**Note**  
Note the value of the following line of the output. It’s used in a later (optional) step.

   ```
   [9]  adding identity "arn:aws:iam::111122223333:role/eksctl-inferentia-nodegroup-ng-in-NodeInstanceRole-FI7HIYS3BS09" to auth ConfigMap
   ```

   When launching a node group with `Inf1` instances, `eksctl` automatically installs the AWS Neuron Kubernetes device plugin. This plugin advertises Neuron devices as a system resource to the Kubernetes scheduler, which can be requested by a container. In addition to the default Amazon EKS node IAM policies, the Amazon S3 read only access policy is added so that the sample application, covered in a later step, can load a trained model from Amazon S3.

1. Make sure that all Pods have started correctly.

   ```
   kubectl get pods -n kube-system
   ```

   Abbreviated output:

   ```
   NAME                                   READY   STATUS    RESTARTS   AGE
   [...]
   neuron-device-plugin-daemonset-6djhp   1/1     Running   0          5m
   neuron-device-plugin-daemonset-hwjsj   1/1     Running   0          5m
   ```
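If you automate this check (for example, in a CI pipeline), you can parse the `kubectl` output directly. The following is a minimal stdlib-only Python sketch; the `neuron_plugin_ready` helper and the column layout it assumes are illustrative, not part of any AWS tooling.

```python
# Illustrative helper: given the text output of
# `kubectl get pods -n kube-system`, confirm that at least one
# neuron-device-plugin Pod exists and that all of them are Running.
def neuron_plugin_ready(kubectl_output: str) -> bool:
    rows = [
        line.split()
        for line in kubectl_output.strip().splitlines()
        if line.startswith("neuron-device-plugin")
    ]
    # Each row: NAME, READY, STATUS, RESTARTS, AGE
    return bool(rows) and all(row[2] == "Running" for row in rows)

sample = """\
NAME                                   READY   STATUS    RESTARTS   AGE
neuron-device-plugin-daemonset-6djhp   1/1     Running   0          5m
neuron-device-plugin-daemonset-hwjsj   1/1     Running   0          5m
"""
print(neuron_plugin_ready(sample))  # True
```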

## (Optional) Deploy a TensorFlow Serving application image
<a name="deploy-tensorflow-serving-application"></a>

A trained model must be compiled to an Inferentia target before it can be deployed on Inferentia instances. To continue, you will need a [Neuron optimized TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/tensorflow-neuron/index.html) model saved in Amazon S3. If you don’t already have a SavedModel, follow the tutorial for [creating a Neuron compatible ResNet50 model](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-tf-neuron.html) and upload the resulting SavedModel to S3. ResNet-50 is a popular machine learning model used for image recognition tasks. For more information about compiling Neuron models, see [The AWS Inferentia Chip With DLAMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia.html) in the AWS Deep Learning AMIs Developer Guide.

The sample deployment manifest manages a pre-built inference serving container for TensorFlow provided by AWS Deep Learning Containers. Inside the container is the AWS Neuron Runtime and the TensorFlow Serving application. A complete list of pre-built Deep Learning Containers optimized for Neuron is maintained on GitHub under [Available Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-inference-containers). At start-up, the DLC will fetch your model from Amazon S3, launch Neuron TensorFlow Serving with the saved model, and wait for prediction requests.

The number of Neuron devices allocated to your serving application can be adjusted by changing the `aws.amazon.com/neuron` resource in the deployment YAML. Note that communication between TensorFlow Serving and the Neuron runtime happens over gRPC, which requires passing the `IPC_LOCK` capability to the container.

1. Add the `AmazonS3ReadOnlyAccess` IAM policy to the node instance role that was created in step 1 of [Create a cluster](#create-cluster-inferentia). This is necessary so that the sample application can load a trained model from Amazon S3.

   ```
   aws iam attach-role-policy \
       --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
       --role-name eksctl-inferentia-nodegroup-ng-in-NodeInstanceRole-FI7HIYS3BS09
   ```

1. Create a file named `rn50_deployment.yaml` with the following contents. Update the region-code and model path to match your desired settings. The model name is for identification purposes when a client makes a request to the TensorFlow server. This example uses a model name to match a sample ResNet50 client script that will be used in a later step for sending prediction requests.


   ```
   kind: Deployment
   apiVersion: apps/v1
   metadata:
     name: eks-neuron-test
     labels:
       app: eks-neuron-test
       role: master
   spec:
     replicas: 2
     selector:
       matchLabels:
         app: eks-neuron-test
         role: master
     template:
       metadata:
         labels:
           app: eks-neuron-test
           role: master
       spec:
         containers:
           - name: eks-neuron-test
              image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-neuron:1.15.4-neuron-py37-ubuntu18.04
             command:
               - /usr/local/bin/entrypoint.sh
             args:
               - --port=8500
               - --rest_api_port=9000
               - --model_name=resnet50_neuron
               - --model_base_path=s3://${your-bucket-of-models}/resnet50_neuron/
             ports:
               - containerPort: 8500
               - containerPort: 9000
             imagePullPolicy: IfNotPresent
             env:
               - name: AWS_REGION
                 value: "us-east-1"
               - name: S3_USE_HTTPS
                 value: "1"
               - name: S3_VERIFY_SSL
                 value: "0"
               - name: S3_ENDPOINT
                  value: s3.us-east-1.amazonaws.com
               - name: AWS_LOG_LEVEL
                 value: "3"
             resources:
               limits:
                 cpu: 4
                 memory: 4Gi
                  aws.amazon.com/neuron: 1
               requests:
                 cpu: "1"
                 memory: 1Gi
             securityContext:
               capabilities:
                 add:
                   - IPC_LOCK
   ```

1. Deploy the model.

   ```
   kubectl apply -f rn50_deployment.yaml
   ```

1. Create a file named `rn50_service.yaml` with the following contents. The HTTP and gRPC ports are opened for accepting prediction requests.

   ```
   kind: Service
   apiVersion: v1
   metadata:
     name: eks-neuron-test
     labels:
       app: eks-neuron-test
   spec:
     type: ClusterIP
     ports:
       - name: grpc-tf-serving
         port: 8500
         targetPort: 8500
       - name: http-tf-serving
         port: 9000
         targetPort: 9000
     selector:
       app: eks-neuron-test
       role: master
   ```

1. Create a Kubernetes service for your TensorFlow model Serving application.

   ```
   kubectl apply -f rn50_service.yaml
   ```

## (Optional) Make predictions against your TensorFlow Serving service
<a name="make-predictions-against-tensorflow-service"></a>

1. To test locally, forward the gRPC port to the `eks-neuron-test` service.

   ```
   kubectl port-forward service/eks-neuron-test 8500:8500 &
   ```

1. Create a Python script called `tensorflow-model-server-infer.py` with the following content. This script submits an inference request over gRPC, the remote procedure call framework that TensorFlow Serving uses.

   ```
   import numpy as np
   import grpc
   import tensorflow as tf
   from tensorflow.keras.preprocessing import image
   from tensorflow.keras.applications.resnet50 import preprocess_input
   from tensorflow_serving.apis import predict_pb2
   from tensorflow_serving.apis import prediction_service_pb2_grpc
   from tensorflow.keras.applications.resnet50 import decode_predictions
   
   if __name__ == '__main__':
      channel = grpc.insecure_channel('localhost:8500')
      stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
      img_file = tf.keras.utils.get_file(
          "./kitten_small.jpg",
          "https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg")
      img = image.load_img(img_file, target_size=(224, 224))
      img_array = preprocess_input(image.img_to_array(img)[None, ...])
      request = predict_pb2.PredictRequest()
      request.model_spec.name = 'resnet50_neuron'
      request.inputs['input'].CopyFrom(
          tf.make_tensor_proto(img_array, shape=img_array.shape))
      result = stub.Predict(request)
      prediction = tf.make_ndarray(result.outputs['output'])
      print(decode_predictions(prediction))
   ```

1. Run the script to submit predictions to your service.

   ```
   python3 tensorflow-model-server-infer.py
   ```

   An example output is as follows.

   ```
   [[(u'n02123045', u'tabby', 0.68817204), (u'n02127052', u'lynx', 0.12701613), (u'n02123159', u'tiger_cat', 0.08736559), (u'n02124075', u'Egyptian_cat', 0.063844085), (u'n02128757', u'snow_leopard', 0.009240591)]]
   ```
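Each inner list in the output holds `(class_id, label, score)` tuples for one image, sorted by descending score. If you want to extract just the top label programmatically, here is a small sketch (the helper name and sample data are illustrative):

```python
# Illustrative: pull the best (label, score) pair per image from a
# decode_predictions-style result, which is a list of per-image lists
# of (class_id, label, score) tuples ordered by descending score.
def top_labels(predictions):
    return [(entries[0][1], entries[0][2]) for entries in predictions]

result = [[('n02123045', 'tabby', 0.68817204),
           ('n02127052', 'lynx', 0.12701613),
           ('n02123159', 'tiger_cat', 0.08736559)]]
print(top_labels(result))  # [('tabby', 0.68817204)]
```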

# Manage compute resources for AI/ML workloads on Amazon EKS
<a name="ml-compute-management"></a>

This section is designed to help you manage compute resources for machine learning workloads in Amazon Elastic Kubernetes Service (EKS). You’ll find details on reserving GPUs using Capacity Blocks for managed node groups and self-managed nodes, including prerequisites, launch template setup, scaling configurations, workload preparation, and key considerations for handling reservation lifecycles and graceful node termination.

**Topics**
+ [Create a managed node group with Capacity Blocks for ML](capacity-blocks-mng.md)
+ [Create self-managed nodes with Capacity Blocks for ML](capacity-blocks.md)
+ [Use P6e-GB200 UltraServers with Amazon EKS](ml-eks-nvidia-ultraserver.md)

# Create a managed node group with Capacity Blocks for ML
<a name="capacity-blocks-mng"></a>

Capacity Blocks for machine learning (ML) allow you to reserve GPU instances on a future date to support your short duration ML workloads. For more information, see [Capacity Blocks for ML](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) in the *Amazon EC2 User Guide for Linux Instances*.

## Considerations
<a name="capacity-blocks-mng-considerations"></a>

**Important**  
Capacity Blocks are only available for certain Amazon EC2 instance types and AWS Regions. For compatibility information, see [Work with Capacity Blocks Prerequisites](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-blocks-using.html#capacity-blocks-prerequisites) in the *Amazon EC2 User Guide for Linux Instances*.
For more information, see [Use Capacity Blocks for machine learning workloads](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-template-capacity-blocks.html) in the *Amazon EC2 Auto Scaling User Guide*.
Managed node groups with Capacity Blocks can only be created with custom launch templates.
When upgrading managed node groups with Capacity Blocks, make sure that the desired size of the node group is set to `0`.

## Create a managed node group with Amazon EC2 Capacity Blocks
<a name="capacity-blocks-mng-procedure"></a>

You can use Capacity Blocks with Amazon EKS managed node groups for provisioning and scaling GPU-accelerated worker nodes. The AWS CloudFormation template examples that follow don’t cover every aspect needed in a production cluster. Typically, you’d also want a bootstrapping script to join the node to the cluster and specify an Amazon EKS accelerated AMI. For more information, see [Create a managed node group for your cluster](create-managed-node-group.md).

1. Create a launch template that’s appropriate for your workloads and works with Amazon EKS managed node groups. For more information, see [Customize managed nodes with launch templates](launch-templates.md).

   In addition to the requirements in the above procedures, make sure that the `LaunchTemplateData` includes the following:
   +  `InstanceMarketOptions` with `MarketType` set to `"capacity-block"` 
   + `CapacityReservationSpecification: CapacityReservationTarget` with `CapacityReservationId` set to the Capacity Block (for example, `cr-02168da1478b509e0`)
   +  `InstanceType` set to an instance type that supports Capacity Blocks (for example: *p5.48xlarge*)

     The following is an excerpt of a CloudFormation template that creates a launch template targeting a Capacity Block. To create a custom AMI managed node group, you can also add `ImageId` and `UserData` parameters.

     ```
     NodeLaunchTemplate:
       Type: "AWS::EC2::LaunchTemplate"
       Properties:
         LaunchTemplateData:
           InstanceMarketOptions:
             MarketType: "capacity-block"
           CapacityReservationSpecification:
             CapacityReservationTarget:
               CapacityReservationId: "cr-02168da1478b509e0"
           InstanceType: p5.48xlarge
     ```

1. Use the launch template to create a managed node group.

   The following is an example create node group command for Capacity Blocks. Replace *example-values* with ones applicable to your cluster.

   When creating the Capacity Block managed node group, do the following:
   + Set the `capacity-type` to `"CAPACITY_BLOCK"`. If the capacity type isn’t set to `"CAPACITY_BLOCK"`, or any of the other required launch template values listed above are missing, then the create request will be rejected.
   + When specifying `subnets` in the create request, make sure to only specify the subnet in the same Availability Zone as the capacity reservation.
   + If you specify a non-zero `desiredSize` in the create request, Amazon EKS honors it when creating the Auto Scaling group (ASG). However, if the create request is made before the capacity reservation is active, the ASG can’t launch Amazon EC2 instances until the reservation becomes active, and ASG scaling activities will show launch errors until then. When the reservation becomes active, instance launches succeed and the ASG scales up to the `desiredSize` specified at create time.

     ```
     aws eks create-nodegroup \
         --cluster-name my-cluster \
         --nodegroup-name my-mng \
         --node-role node-role-arn \
         --region region-code \
         --subnets subnet-id \
         --scaling-config minSize=node-group-min-size,maxSize=node-group-max-size,desiredSize=node-group-desired-size \
         --ami-type "AL2023_x86_64_NVIDIA" \
         --capacity-type "CAPACITY_BLOCK" \
         --launch-template id="lt-id",version=1
     ```

1. Make sure that the nodes join after scale up. Amazon EKS clusters using managed node groups with Capacity Blocks don’t perform any validations that instances launched actually join and register with the cluster.

1. If you set `desiredSize` to `0` at create time, then you have different options to scale up the node group when the capacity reservation becomes active:
   + Create a scheduled scaling policy for the ASG that aligns to the Capacity Block reservation start time. For more information, see [Scheduled scaling for Amazon EC2 Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-scheduled-scaling.html) in the *Amazon EC2 Auto Scaling User Guide*.
   + Use the Amazon EKS console or `eks update-nodegroup-config` to update the scaling config and set the desired size of the node group.
   + Use the Kubernetes Cluster Autoscaler. For more information, see [Cluster Autoscaler on AWS](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md).

1. The node group is now ready for workloads and Pods to be scheduled.

1. To gracefully drain your Pods before the reservation ends, Amazon EKS uses a scheduled scaling policy to scale the node group down to `0`. This scheduled scaling action is named `Amazon EKS Node Group Capacity Scaledown Before Reservation End`. We recommend that you don’t edit or delete this action.

   Amazon EC2 starts shutting down the instances 30 minutes before the reservation end time. As a result, Amazon EKS sets up a scheduled scale down on the node group 40 minutes prior to the reservation end in order to safely and gracefully evict Pods.
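To make the timing concrete, the arithmetic can be sketched as follows for a hypothetical reservation end time (substitute your Capacity Block’s actual end date):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical Capacity Block end time; use your reservation's EndDate.
reservation_end = datetime(2025, 6, 1, 18, 0, tzinfo=timezone.utc)

# Amazon EC2 begins shutting down instances 30 minutes before the end time.
ec2_shutdown_start = reservation_end - timedelta(minutes=30)

# Amazon EKS schedules the node group scale-down 40 minutes before the end
# time, leaving a 10-minute window to evict Pods gracefully.
eks_scaledown_time = reservation_end - timedelta(minutes=40)

print(eks_scaledown_time.isoformat())  # 2025-06-01T17:20:00+00:00
```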

# Create self-managed nodes with Capacity Blocks for ML
<a name="capacity-blocks"></a>

Capacity Blocks for machine learning (ML) allow you to reserve GPU instances on a future date to support your short duration ML workloads. For more information, see [Capacity Blocks for ML](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) in the *Amazon EC2 User Guide for Linux Instances*.

## Considerations
<a name="capacity-blocks-considerations"></a>

**Important**  
Capacity Blocks are only available for certain Amazon EC2 instance types and AWS Regions. For compatibility information, see [Work with Capacity Blocks Prerequisites](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-blocks-using.html#capacity-blocks-prerequisites) in the *Amazon EC2 User Guide for Linux Instances*.
If you create a self-managed node group prior to the capacity reservation becoming active, then set the desired capacity to `0`.
To allow sufficient time to gracefully drain the node(s), we suggest that you schedule scaling to scale to zero more than 30 minutes before the Capacity Block reservation end time.
In order for your Pods to be gracefully drained, we recommend that you set up AWS Node Termination Handler as explained in the example steps.

## Use Capacity Blocks with self-managed nodes
<a name="capacity-blocks-procedure"></a>

You can use Capacity Blocks with Amazon EKS for provisioning and scaling your self-managed nodes. The following steps give a general example overview. The AWS CloudFormation template examples don’t cover every aspect needed in a production workload. Typically, you’d also want a bootstrapping script to join the node to the cluster, an Amazon EKS accelerated AMI, and an appropriate instance profile for joining the cluster. For more information, see [Create self-managed Amazon Linux nodes](launch-workers.md).

1. Create a launch template that’s applicable to your workload. For more information, see [Use Capacity Blocks for machine learning workloads](https://docs.aws.amazon.com/autoscaling/ec2/userguide/launch-template-capacity-blocks.html) in the *Amazon EC2 Auto Scaling User Guide*.

   Make sure the `LaunchTemplateData` includes the following:
   +  `InstanceMarketOptions` with `MarketType` set to `"capacity-block"` 
   + `CapacityReservationSpecification: CapacityReservationTarget` with `CapacityReservationId` set to the Capacity Block (for example, `cr-02168da1478b509e0`)
   +  `IamInstanceProfile` with the `Arn` set to the applicable *iam-instance-profile-arn* 
   +  `ImageId` set to the applicable *image-id* 
   +  `InstanceType` set to an instance type that supports Capacity Blocks (for example: *p5.48xlarge*)
   +  `SecurityGroupIds` set to the applicable IDs (for example: *sg-05b1d815d1EXAMPLE*)
   +  `UserData` set to the applicable *user-data* for your self-managed node group

     The following is an excerpt of a CloudFormation template that creates a launch template targeting a Capacity Block.

     ```
     NodeLaunchTemplate:
        Type: "AWS::EC2::LaunchTemplate"
       Properties:
         LaunchTemplateData:
           InstanceMarketOptions:
             MarketType: "capacity-block"
           CapacityReservationSpecification:
             CapacityReservationTarget:
               CapacityReservationId: "cr-02168da1478b509e0"
           IamInstanceProfile:
             Arn: iam-instance-profile-arn
           ImageId: image-id
           InstanceType: p5.48xlarge
           KeyName: key-name
           SecurityGroupIds:
           - sg-05b1d815d1EXAMPLE
           UserData: user-data
     ```

     You must pass the subnet in the Availability Zone in which the reservation is made because Capacity Blocks are zonal.

1. Use the launch template to create a self-managed node group. If you’re doing this prior to the capacity reservation becoming active, then set the desired capacity to `0`. When creating the node group, make sure that you are only specifying the respective subnet for the Availability Zone in which the capacity is reserved.

   The following is a sample CloudFormation template that you can reference when creating one that is applicable to your workload. This example gets the `LaunchTemplateId` and `Version` of the `AWS::EC2::LaunchTemplate` resource shown in the previous step. It also gets the values for `DesiredCapacity`, `MaxSize`, `MinSize`, and `VPCZoneIdentifier` that are declared elsewhere in the same template.

   ```
   NodeGroup:
     Type: "AWS::AutoScaling::AutoScalingGroup"
     Properties:
       DesiredCapacity: !Ref NodeAutoScalingGroupDesiredCapacity
       LaunchTemplate:
         LaunchTemplateId: !Ref NodeLaunchTemplate
         Version: !GetAtt NodeLaunchTemplate.LatestVersionNumber
       MaxSize: !Ref NodeAutoScalingGroupMaxSize
       MinSize: !Ref NodeAutoScalingGroupMinSize
       VPCZoneIdentifier: !Ref Subnets
       Tags:
         - Key: Name
           PropagateAtLaunch: true
           Value: !Sub ${ClusterName}-${NodeGroupName}-Node
         - Key: !Sub kubernetes.io/cluster/${ClusterName}
           PropagateAtLaunch: true
           Value: owned
   ```

1. Once the node group is created successfully, record the `NodeInstanceRole` for the node group that was created. You need this to make sure that when the node group is scaled, the new nodes join the cluster and Kubernetes is able to recognize them. For more information, see the AWS Management Console instructions in [Create self-managed Amazon Linux nodes](launch-workers.md).

1. We recommend that you create a scheduled scaling policy for the Auto Scaling group that aligns to the Capacity Block reservation times. For more information, see [Scheduled scaling for Amazon EC2 Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-scheduled-scaling.html) in the *Amazon EC2 Auto Scaling User Guide*.

   You can use all of the instances you reserved until 30 minutes before the end time of the Capacity Block. Instances that are still running at that time will start terminating. To allow sufficient time to gracefully drain the node(s), we suggest that you schedule scaling to scale to zero more than 30 minutes before the Capacity Block reservation end time.

   If you want to instead scale up manually whenever the capacity reservation becomes `Active`, then you need to update the Auto Scaling group’s desired capacity at the start time of the Capacity Block reservation. Then you would need to also scale down manually more than 30 minutes before the Capacity Block reservation end time.

1. The node group is now ready for workloads and Pods to be scheduled.

1. In order for your Pods to be gracefully drained, we recommend that you set up AWS Node Termination Handler. This handler will be able to watch for "ASG Scale-in" lifecycle events from Amazon EC2 Auto Scaling using EventBridge and allow the Kubernetes control plane to take required action before the instance becomes unavailable. Otherwise, your Pods and Kubernetes objects will get stuck in a pending state. For more information, see [AWS Node Termination Handler](https://github.com/aws/aws-node-termination-handler) on GitHub.

   If you don’t set up a Node Termination Handler, we recommend that you start draining your Pods manually before hitting the 30-minute window so that they have enough time to be gracefully drained.

# Use P6e-GB200 UltraServers with Amazon EKS
<a name="ml-eks-nvidia-ultraserver"></a>

This topic describes how to configure and use Amazon EKS with P6e-GB200 UltraServers. The `p6e-gb200.36xlarge` instance type with 4 NVIDIA Blackwell GPUs is only available as P6e-GB200 UltraServers. There are two types of P6e-GB200 UltraServers. The `u-p6e-gb200x36` UltraServer has 9 `p6e-gb200.36xlarge` instances and the `u-p6e-gb200x72` UltraServer has 18 `p6e-gb200.36xlarge` instances.

To learn more, see the [Amazon EC2 P6e-GB200 UltraServers webpage](https://aws.amazon.com/ec2/instance-types/p6/).

## Considerations
<a name="nvidia-ultraserver-considerations"></a>
+ Amazon EKS supports P6e-GB200 UltraServers for Kubernetes versions 1.33 and above. This Kubernetes version release provides support for [Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) (DRA), enabled by default in EKS and in the [AL2023 EKS-optimized accelerated AMIs](https://docs.aws.amazon.com/eks/latest/userguide/ml-eks-optimized-ami.html). DRA is a requirement to use the P6e-GB200 UltraServers with EKS. DRA is not supported in Karpenter or EKS Auto Mode, so it is recommended to use EKS self-managed node groups or EKS managed node groups when using the P6e-GB200 UltraServers with EKS.
+ P6e-GB200 UltraServers are made available through [EC2 Capacity Blocks for ML](https://aws.amazon.com/ec2/capacityblocks/). See [Manage compute resources for AI/ML workloads on Amazon EKS](ml-compute-management.md) for information on how to launch EKS nodes with Capacity Blocks.
+ When using EKS managed node groups with Capacity Blocks, you must use custom launch templates. When upgrading EKS managed node groups with P6e-GB200 UltraServers, you must set the desired size of the node group to `0` before upgrading.
+ It is recommended to use the AL2023 ARM NVIDIA variant of the EKS-optimized accelerated AMIs. This AMI includes the required node components and configuration to work with P6e-GB200 UltraServers. If you decide to build your own AMI, you are responsible for installing and validating the compatibility of the node and system software, including drivers. For more information, see [Use EKS-optimized accelerated AMIs for GPU instances](ml-eks-optimized-ami.md).
+ It is recommended to use EKS-optimized AMI release `v20251103` or later, which includes NVIDIA driver version 580. This NVIDIA driver version enables Coherent Driver-Based Memory (CDMM) to address potential memory over-reporting. When CDMM is enabled, the following capabilities are not supported: NVIDIA Multi-Instance GPU (MIG) and vGPU. For more information on CDMM, see [NVIDIA Coherent Driver-based Memory Management (CDMM)](https://nvdam.widen.net/s/gpqp6wmz7s/cuda-whitepaper---cdmm-pdf).
+ When using the [NVIDIA GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) with the EKS-optimized AL2023 NVIDIA AMI, you must disable the operator installation of the driver and toolkit, as these are already included in the AMI. The EKS-optimized AL2023 NVIDIA AMIs do not include the NVIDIA Kubernetes device plugin or the NVIDIA DRA driver, and these must be installed separately.
+ Each `p6e-gb200.36xlarge` instance can be configured with up to 17 network cards and can leverage EFA for communication between UltraServers. Workload network traffic can cross UltraServers, but for highest performance it is recommended to schedule workloads in the same UltraServer, leveraging IMEX for intra-UltraServer GPU communication. For more information, see [EFA configuration for P6e-GB200 instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e).
+ Each `p6e-gb200.36xlarge` instance has 3x 7.5TB [instance store volumes](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html). By default, the EKS-optimized AMI does not format and mount the instance stores. The node’s ephemeral storage can be shared among Pods that request ephemeral storage and container images that are downloaded to the node. If using the AL2023 EKS-optimized AMI, this can be configured as part of the node’s bootstrap in the user data by setting the instance local storage policy in [NodeConfig](https://docs.aws.amazon.com/eks/latest/eksctl/node-bootstrapping.html#configuring-the-bootstrapping-process) to RAID0. Setting it to RAID0 stripes the instance stores and configures the container runtime and kubelet to make use of this ephemeral storage.
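For example, a `NodeConfig` in the node’s user data might enable the RAID0 policy like this (a sketch; confirm the fields against the NodeConfig reference linked above):

```
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  instance:
    localStorage:
      strategy: RAID0
```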

## Components
<a name="nvidia-ultraserver-components"></a>

The following components are recommended for running workloads on EKS with the P6e-GB200 UltraServers. You can optionally use the [NVIDIA GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) to install the NVIDIA node components. When using the NVIDIA GPU operator with the EKS-optimized AL2023 NVIDIA AMI, you must disable the operator installation of the driver and toolkit, as these are already included in the AMI.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/eks/latest/userguide/ml-eks-nvidia-ultraserver.html)

The node components in the table above perform the following functions:
+  **VPC CNI**: Allocates VPC IPs as the primary network interface for pods running on EKS
+  **EFA device plugin**: Allocates EFA devices as secondary network interfaces for Pods running on EKS. Responsible for network traffic across P6e-GB200 UltraServers. For multi-node workloads, GPU-to-GPU traffic within an UltraServer can flow over multi-node NVLink.
+  **NVIDIA Kubernetes device plugin**: Allocates GPUs as devices for pods running on EKS. It is recommended to use the NVIDIA Kubernetes device plugin until the NVIDIA DRA driver GPU allocation functionality graduates from experimental. See the [NVIDIA DRA driver releases](https://github.com/NVIDIA/k8s-dra-driver-gpu/releases) for updated information.
+  **NVIDIA DRA driver**: Enables ComputeDomain custom resources that facilitate creation of IMEX domains that follow workloads running on P6e-GB200 UltraServers.
  + The ComputeDomain resource describes an Internode Memory Exchange (IMEX) domain. When workloads with a ResourceClaim for a ComputeDomain are deployed to the cluster, the NVIDIA DRA driver automatically creates an IMEX DaemonSet that runs on matching nodes and establishes the IMEX channel(s) between the nodes before the workload is started. To learn more about IMEX, see [overview of NVIDIA IMEX for multi-node NVLink systems](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html).
  + The NVIDIA DRA driver uses a clique ID label (`nvidia.com/gpu.clique`) applied by NVIDIA GFD that relays the knowledge of the network topology and NVLink domain.
  + It is a best practice to create a ComputeDomain per workload job.
+  **NVIDIA Node Feature Discovery (NFD)**: Required dependency for GFD to apply node labels based on discovered node-level attributes.
+  **NVIDIA GPU Feature Discovery (GFD)**: Applies an NVIDIA standard topology label called `nvidia.com/gpu.clique` to the nodes. Nodes within the same `nvidia.com/gpu.clique` have multi-node NVLink-reachability, and you can use pod affinities in your application to schedule pods to the same NVlink domain.
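As a hypothetical illustration, a workload’s Pod template can require co-scheduling within a single NVLink domain by using `nvidia.com/gpu.clique` as the affinity topology key (the `app: imex-worker` label is a placeholder for your workload’s own label):

```
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: imex-worker
        topologyKey: nvidia.com/gpu.clique
```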

## Procedure
<a name="nvidia-ultraserver-procedure"></a>

The following section assumes you have an EKS cluster running Kubernetes version 1.33 or above with one or more node groups with P6e-GB200 UltraServers running the AL2023 ARM NVIDIA EKS-optimized accelerated AMI. See the links in [Manage compute resources for AI/ML workloads on Amazon EKS](ml-compute-management.md) for the prerequisite steps for EKS self-managed nodes and managed node groups.

The following procedure uses the components below.


| Name | Version | Description |
| --- | --- | --- |
| NVIDIA GPU Operator | 25.3.4 | For lifecycle management of required plugins such as the NVIDIA Kubernetes device plugin and NFD/GFD. |
| NVIDIA DRA Drivers | 25.8.0 | For ComputeDomain CRDs and IMEX domain management. |
| EFA Device Plugin | 0.5.14 | For cross-UltraServer communication. |

## Install NVIDIA GPU operator
<a name="nvidia-ultraserver-gpu-operator"></a>

The NVIDIA GPU operator simplifies the management of components required to use GPUs in Kubernetes clusters. As the NVIDIA GPU driver and container toolkit are installed as part of the EKS-optimized accelerated AMI, these must be set to `false` in the Helm values configuration.

1. Create a Helm values file named `gpu-operator-values.yaml` with the following configuration.

   ```
   devicePlugin:
     enabled: true
   nfd:
     enabled: true
   gfd:
     enabled: true
   driver:
     enabled: false
   toolkit:
     enabled: false
   migManager:
     enabled: false
   ```

1. Install the NVIDIA GPU operator for your cluster using the `gpu-operator-values.yaml` file you created in the previous step.

   ```
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   ```

   ```
   helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --version v25.3.4 \
    --values gpu-operator-values.yaml
   ```

## Install NVIDIA DRA driver
<a name="nvidia-ultraserver-dra-driver"></a>

As of NVIDIA GPU operator version `v25.3.4`, the NVIDIA DRA driver must be installed separately. It is recommended to track the NVIDIA GPU operator [release notes](https://github.com/NVIDIA/gpu-operator/releases) as this may change in a future release.

1. Create a Helm values file named `dra-values.yaml` with the following configuration. Note the `nodeAffinity` and `tolerations` settings that configure the DRA driver to deploy only on nodes with an NVIDIA GPU.

   ```
   resources:
     gpus:
       enabled: false # set to false to disable experimental gpu support
     computeDomains:
       enabled: true
   
   controller:
     nodeSelector: null
     affinity: null
     tolerations: []
   
   kubeletPlugin:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: "nvidia.com/gpu.present"
               operator: In
               values:
               - "true"
     tolerations:
       - key: "nvidia.com/gpu"
         operator: Exists
         effect: NoSchedule
   ```

1. Install the NVIDIA DRA driver for your cluster using the `dra-values.yaml` file you created in the previous step.

   ```
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   ```

   ```
   helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
     --version="25.8.0" \
     --namespace nvidia-dra-driver-gpu \
     --create-namespace \
     -f dra-values.yaml
   ```

1. After installation, the DRA driver creates `DeviceClass` resources that enable Kubernetes to understand and allocate `ComputeDomain` resources, making IMEX management possible for distributed GPU workloads on P6e-GB200 UltraServers.

   Confirm the DRA resources are available with the following commands.

   ```
   kubectl api-resources | grep resource.k8s.io
   ```

   ```
   deviceclasses           resource.k8s.io/v1  false        DeviceClass
   resourceclaims          resource.k8s.io/v1  true         ResourceClaim
   resourceclaimtemplates  resource.k8s.io/v1  true         ResourceClaimTemplate
   resourceslices          resource.k8s.io/v1  false        ResourceSlice
   ```

   ```
   kubectl get deviceclasses
   ```

   ```
   NAME
   compute-domain-daemon.nvidia.com
   compute-domain-default-channel.nvidia.com
   ```

## Install the EFA device plugin
<a name="nvidia-ultraserver-efa-plugin"></a>

To use EFA communication between UltraServers, you must install the Kubernetes device plugin for EFA. P6e-GB200 instances can be configured with up to [17 network cards](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e), and the primary NCI (index 0) must be of type `interface`, which supports up to 100 Gbps of ENA bandwidth. Configure your EFA and ENA interfaces according to your requirements during node provisioning. For more details on EFA configuration, see [EFA configuration for P6e-GB200 instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e) in the AWS documentation.

1. Create a Helm values file named `efa-values.yaml` with the following configuration.

   ```
   tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   ```

1. Install the EFA device plugin for your cluster using the `efa-values.yaml` file you created in the previous step.

   ```
   helm repo add eks https://aws.github.io/eks-charts
   helm repo update
   ```

   ```
   helm install efa eks/aws-efa-k8s-device-plugin -n kube-system \
     --version="0.5.14" \
     -f efa-values.yaml
   ```

   As an example, if you configured your instances with one EFA-only interface in each [NCI group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e), you should see 4 allocatable EFA devices when you describe a node.

   ```
   kubectl describe node/<gb200-node-name>
   ```

   ```
   Capacity:
     ...
      vpc.amazonaws.com/efa:  4
   Allocatable:
     ...
      vpc.amazonaws.com/efa:  4
   ```

## Validate IMEX over Multi-Node NVLink
<a name="nvidia-ultraserver-imex-nvlink"></a>

For a multi-node NVLink NCCL test and other micro-benchmarks, review the [awsome-distributed-training](https://github.com/aws-samples/awsome-distributed-training/tree/main/micro-benchmarks/nccl-tests) GitHub repository. The following steps show how to run a multi-node NVLink test with nvbandwidth.

1. To run a multi-node bandwidth test across two nodes in the NVL72 domain, first install the MPI operator:

   ```
   kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.7.0/mpi-operator.yaml
   ```

1. Create a file named `nvbandwidth-test-job.yaml` that defines the test manifest. Note the `nvidia.com/gpu.clique` pod affinity, which schedules the workers in the same NVLink domain so that they have multi-node NVLink reachability. The sample below runs a multi-node device-to-device CE read memcpy test using `cuMemcpyAsync` and prints the results in the logs.

   As of NVIDIA DRA Driver version `v25.8.0`, ComputeDomains are elastic and `.spec.numNodes` can be set to `0` in the ComputeDomain definition. Review the latest [NVIDIA DRA Driver release notes](https://github.com/NVIDIA/k8s-dra-driver-gpu) for updates.

   ```
   ---
   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: nvbandwidth-test-compute-domain
   spec:
     numNodes: 0 # This can be set to 0 from NVIDIA DRA Driver version v25.8.0+
     channel:
       resourceClaimTemplate:
         name: nvbandwidth-test-compute-domain-channel
   
   ---
   apiVersion: kubeflow.org/v2beta1
   kind: MPIJob
   metadata:
     name: nvbandwidth-test
   spec:
     slotsPerWorker: 4 # 4 GPUs per worker node
     launcherCreationPolicy: WaitForWorkersReady
     runPolicy:
       cleanPodPolicy: Running
     sshAuthMountPath: /home/mpiuser/.ssh
     mpiReplicaSpecs:
       Launcher:
         replicas: 1
         template:
           metadata:
             labels:
               nvbandwidth-test-replica: mpi-launcher
           spec:
             affinity:
               nodeAffinity:
                 requiredDuringSchedulingIgnoredDuringExecution:
                   nodeSelectorTerms:
                   - matchExpressions:
                     # Only schedule on NVIDIA GB200/GB300 nodes
                     - key: node.kubernetes.io/instance-type
                       operator: In
                       values:
                       - p6e-gb200.36xlarge
                       - p6e-gb300.36xlarge
             containers:
             - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
               name: mpi-launcher
               securityContext:
                 runAsUser: 1000
               command:
               - mpirun
               args:
               - --bind-to
               - core
               - --map-by
               - ppr:4:node
               - -np
               - "8"
               - --report-bindings
               - -q
               - nvbandwidth
               - -t
               - multinode_device_to_device_memcpy_read_ce
       Worker:
         replicas: 2 # 2 worker nodes
         template:
           metadata:
             labels:
               nvbandwidth-test-replica: mpi-worker
           spec:
             affinity:
               nodeAffinity:
                 requiredDuringSchedulingIgnoredDuringExecution:
                   nodeSelectorTerms:
                   - matchExpressions:
                     # Only schedule on NVIDIA GB200/GB300 nodes
                     - key: node.kubernetes.io/instance-type
                       operator: In
                       values:
                       - p6e-gb200.36xlarge
                       - p6e-gb300.36xlarge
               podAffinity:
                 requiredDuringSchedulingIgnoredDuringExecution:
                 - labelSelector:
                     matchExpressions:
                     - key: nvbandwidth-test-replica
                       operator: In
                       values:
                       - mpi-worker
                   topologyKey: nvidia.com/gpu.clique
             containers:
             - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
               name: mpi-worker
               securityContext:
                 runAsUser: 1000
               command:
               - /usr/sbin/sshd
               args:
               - -De
               - -f
               - /home/mpiuser/.sshd_config
               resources:
                 limits:
                   nvidia.com/gpu: 4  # Request 4 GPUs per worker
                 claims:
                 - name: compute-domain-channel # Link to IMEX channel
             resourceClaims:
             - name: compute-domain-channel
               resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
   ```

1. Create the ComputeDomain and start the job with the following command.

   ```
   kubectl apply -f nvbandwidth-test-job.yaml
   ```

1. After the ComputeDomain is created, you can see that the workload's ComputeDomain has two nodes:

   ```
   kubectl get computedomains.resource.nvidia.com -o yaml
   ```

   ```
   status:
     nodes:
     - cliqueID: <ClusterUUID>.<Clique ID>
       ipAddress: <node-ip>
       name: <node-hostname>
     - cliqueID: <ClusterUUID>.<Clique ID>
       ipAddress: <node-ip>
       name: <node-hostname>
     status: Ready
   ```

1. Review the results of the job with the following command.

   ```
   kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
   ```

   A successful test shows bandwidth statistics in GB/s for the multi-node memcpy test. An example of a successful test output is shown below.

   ```
   ...
   nvbandwidth Version: ...
   Built from Git version: ...
   
   MPI version: ...
   CUDA Runtime Version: ...
   CUDA Driver Version: ...
   Driver Version: ...
   
   Process 0 (nvbandwidth-test-worker-0): device 0: NVIDIA GB200 (...)
   Process 1 (nvbandwidth-test-worker-0): device 1: NVIDIA GB200 (...)
   Process 2 (nvbandwidth-test-worker-0): device 2: NVIDIA GB200 (...)
   Process 3 (nvbandwidth-test-worker-0): device 3: NVIDIA GB200 (...)
   Process 4 (nvbandwidth-test-worker-1): device 0: NVIDIA GB200 (...)
   Process 5 (nvbandwidth-test-worker-1): device 1: NVIDIA GB200 (...)
   Process 6 (nvbandwidth-test-worker-1): device 2: NVIDIA GB200 (...)
   Process 7 (nvbandwidth-test-worker-1): device 3: NVIDIA GB200 (...)
   
   Running multinode_device_to_device_memcpy_read_ce.
   memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
              0         1         2         3         4         5         6         7
    0       N/A    821.45    822.18    821.73    822.05    821.38    822.61    821.89
    1    822.34       N/A    821.67    822.12    821.94    820.87    821.53    822.08
    2    821.76    822.29       N/A    821.58    822.43    821.15    821.82    822.31
    3    822.19    821.84    822.05       N/A    821.67    821.23    820.95    822.47
    4    821.63    822.38    821.49    822.17       N/A    821.06    821.78    822.22
    5    822.08    821.52    821.89    822.35    821.27       N/A    821.64    822.13
    6    821.94    822.15    821.68    822.04    821.39    820.92       N/A    822.56
    7    822.27    821.73    822.11    821.86    822.38    821.04    821.49       N/A
   
   SUM multinode_device_to_device_memcpy_read_ce ...
   
   NOTE: The reported results may not reflect the full capabilities of the platform.
   Performance can vary with software drivers, hardware clocks, and system topology.
   ```

1. When the test is complete, delete it with the following command.

   ```
   kubectl delete -f nvbandwidth-test-job.yaml
   ```

# Manage hardware devices on Amazon EKS
<a name="device-management"></a>

Amazon EKS supports two Kubernetes mechanisms for managing specialized hardware devices in EKS clusters: *Dynamic Resource Allocation (DRA)* and *device plugins*. Both mechanisms enable workloads to access hardware accelerators such as NVIDIA GPUs and AWS Trainium chips, and high-performance network devices such as Elastic Fabric Adapter (EFA). It’s recommended to use DRA drivers for new deployments with Kubernetes versions 1.34 and later, as DRA provides richer device selection, topology-aware scheduling, and device sharing capabilities that are not possible with device plugins.

Reference the Kubernetes documentation for [Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) and [device plugins](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/) for general information about these two Kubernetes features.

## Dynamic Resource Allocation vs device plugins
<a name="_dynamic_resource_allocation_vs_device_plugins"></a>

Kubernetes device plugins have been the primary mechanism for exposing specialized hardware to Kubernetes workloads. Device plugins advertise devices as extended resources (for example, `nvidia.com/gpu` or `aws.amazon.com/neuroncore`) that you request in container resource requests and limits. While device plugins are widely supported and used, they have limitations:
+ Devices are requested as opaque integer counts with no attribute-based filtering.
+ No support for device sharing between containers or Pods.
+ No expressive topology-aware allocation across device types.
+ Custom scheduler extensions are often required for intelligent placement.

Dynamic Resource Allocation (DRA) is a Kubernetes feature made generally available in Kubernetes version 1.34 that addresses these limitations. With DRA, device drivers publish rich device attributes to the Kubernetes scheduler through `ResourceSlice` objects. You request devices using `ResourceClaim` and `ResourceClaimTemplate` objects that reference `DeviceClass` categories.

DRA enables:
+ Attribute-based device selection using [Common Expression Language (CEL)](https://kubernetes.io/docs/reference/using-api/cel/) expressions.
+ Topology-aware allocation that ensures devices are co-located on the same PCIe switch or NUMA domain.
+ Device sharing between multiple containers or Pods through shared `ResourceClaim` references.
+ Constraint-based scheduling that aligns different device types.
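
As a sketch of how these pieces fit together, the following `ResourceClaimTemplate` requests two GPUs through an assumed `gpu.nvidia.com` `DeviceClass` (the class published by the NVIDIA DRA driver when its GPU support is enabled); the CEL attribute in the selector is illustrative, not a guaranteed attribute name — check the driver's `ResourceSlice` objects for the attributes it actually publishes.

```
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: two-gpus
spec:
  spec:
    devices:
      requests:
      - name: gpus
        exactly:
          deviceClassName: gpu.nvidia.com  # assumed DeviceClass name
          allocationMode: ExactCount
          count: 2
          selectors:
          - cel:
              # illustrative attribute; verify against the driver's ResourceSlices
              expression: "device.attributes['gpu.nvidia.com'].productName.contains('GB200')"
```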

## DRA drivers for Amazon EKS
<a name="_dra_drivers_for_amazon_eks"></a>

The following DRA drivers are commonly used for managing specialized hardware devices in Amazon EKS clusters.

Neuron DRA driver  
The Neuron DRA driver manages AWS Trainium and AWS Inferentia2 device allocation with topology-aware scheduling, connected device subset allocation, and Logical NeuronCore (LNC) configuration, without requiring custom scheduler extensions.

NVIDIA DRA driver  
The [NVIDIA DRA driver for GPUs](https://github.com/NVIDIA/k8s-dra-driver-gpu) enables flexible allocation and dynamic reconfiguration of NVIDIA GPUs, including support for `ComputeDomain` resources for Multi-Node NVLink (MNNVL) workloads on EC2 Grace-Blackwell instances. For more information on using `ComputeDomains` with EC2 Grace-Blackwell instances, see [Use P6e-GB200 UltraServers with Amazon EKS](ml-eks-nvidia-ultraserver.md).

## Device plugins for Amazon EKS
<a name="_device_plugins_for_amazon_eks"></a>

The following device plugins are commonly used for managing specialized hardware devices in Amazon EKS clusters.

EFA device plugin  
The EFA device plugin discovers all available EFA devices on each node and advertises them as `vpc.amazonaws.com/efa` extended resources.

Neuron device plugin  
The [Neuron device plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-setup.html) exposes Neuron hardware as `aws.amazon.com/neuroncore` and `aws.amazon.com/neuron` extended resources. It discovers available Neuron devices on each node, advertises them as allocatable resources, and manages their lifecycle.

NVIDIA device plugin  
The [NVIDIA device plugin](https://github.com/NVIDIA/k8s-device-plugin) advertises NVIDIA GPUs as `nvidia.com/gpu` extended resources and tracks the health of GPUs.

## Considerations
<a name="_considerations"></a>

Before using DRA drivers on Amazon EKS, review the following considerations:
+ DRA is available on Amazon EKS with Kubernetes version 1.33 and above, but it is recommended for Kubernetes versions 1.34 and later due to an upstream [Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/133920). Your cluster control plane and nodes must be running a Kubernetes version that supports DRA.
+ DRA is not currently compatible with Karpenter or EKS Auto Mode provisioned compute. You must use EKS managed node groups or self-managed nodes with DRA drivers.
+ DRA drivers and device plugins for the same device type **must** not run simultaneously on the same node. Uninstall the device plugin before installing the corresponding DRA driver, or deploy them on separate nodes. See upstream Kubernetes [KEP-5004](https://github.com/kubernetes/enhancements/issues/5004) for updates on DRA driver and device plugin compatibility.
+ DRA uses different Kubernetes API resources (`ResourceClaim`, `ResourceClaimTemplate`, `DeviceClass`) than device plugins (`resources.limits`, `resources.requests`). Migrating from device plugins to DRA requires updating your workload specifications.
+ Device plugins remain fully supported for all Kubernetes versions. If your cluster runs a Kubernetes version earlier than 1.34, if you use Karpenter or EKS Auto Mode, or if you use Bottlerocket, continue using device plugins.

## Topics
<a name="_topics"></a>
+  [Manage EFA devices on Amazon EKS](device-management-efa.md) 
+  [Manage Neuron devices on Amazon EKS](device-management-neuron.md) 
+  [Manage NVIDIA GPU devices on Amazon EKS](device-management-nvidia.md) 

# Manage EFA devices on Amazon EKS
<a name="device-management-efa"></a>

[Elastic Fabric Adapter](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) (EFA) is a network device for Amazon EC2 instances that enables high-performance inter-node communication for machine learning training and High Performance Computing (HPC) workloads. Amazon EKS supports the *EFA device plugin* for managing EFA devices in EKS clusters.

## Creating EKS nodes with EFA interfaces
<a name="eks-efa-nodes"></a>

When you create EKS nodes with EFA interfaces, the EFA interfaces are attached during instance bootstrap. If you need to customize the per-device EFA configuration or use [placement groups](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html) for the EFA-enabled EC2 instances, it’s recommended to use EKS managed node groups or EKS self-managed node groups. You can pass configuration for each network interface with [launch templates](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-launch-templates.html).

When using EKS Auto Mode or Karpenter with dynamic provisioning, instances created for Pods requesting `vpc.amazonaws.com/efa` have all interfaces configured with interface type `EFA`. Static capacity provisioning in EKS Auto Mode and Karpenter does not currently support per-device EFA configuration. EKS Auto Mode and Karpenter do not currently support placement groups.

When using [`eksctl`](install-kubectl.md#eksctl-install-update) for provisioning EKS nodes with the `efaEnabled` setting, all interfaces are configured with interface type `EFA`, an EFA-specific security group is created, and the EFA device plugin is installed on the cluster. If you need to customize the per-device EFA configuration when using `eksctl`, it is recommended to use `eksctl`'s support for [launch templates](https://docs.aws.amazon.com/eks/latest/eksctl/launch-template-support.html).
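
As an illustration, the `eksctl` cluster config fragment below enables EFA on a managed node group; the node group name, instance type, and Availability Zone are placeholders, and EFA-enabled node groups are typically pinned to a single Availability Zone.

```
# Fragment of an eksctl ClusterConfig; name, instance type, and AZ are placeholders
managedNodeGroups:
- name: efa-ng
  instanceType: p5.48xlarge
  availabilityZones: ["us-west-2a"]
  efaEnabled: true
  minSize: 2
  maxSize: 2
```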

## Using EKS-optimized AMIs with EFA
<a name="eks-amis-efa"></a>

The EKS-optimized AL2023 accelerated AMIs (NVIDIA and Neuron) and all Bottlerocket AMIs include the host-level components required to use EFA. The EKS AL2023 and Bottlerocket AMIs do not include the EFA device plugin, and the device plugin must be installed separately on your cluster before deploying workloads that use EFA.

## Install the EFA Kubernetes device plugin
<a name="efa-device-plugin"></a>

The EFA device plugin advertises EFA devices as `vpc.amazonaws.com/efa` extended resources. You request EFA devices in container resource requests and limits. For a complete walkthrough of setting up EFA with training workloads, see [Run machine learning training on Amazon EKS with Elastic Fabric Adapter](node-efa.md).

The EFA device plugin automatically allocates EFA devices that are topologically close to Neuron accelerators and NVIDIA GPUs on the physical EC2 instance.

### Prerequisites
<a name="_prerequisites"></a>
+ An Amazon EKS cluster.
+ Nodes with EFA-enabled Amazon EC2 instance types. For a list of supported instance types, see [Supported instance types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types) in the *Amazon EC2 User Guide*.
+ Nodes with host-level components installed for EFA. These are included if using the EKS AL2023 accelerated AMIs or the EKS Bottlerocket AMIs.
+ Helm installed in your command-line environment, see the [Setup Helm instructions](helm.md) for more information.
+  `kubectl` configured to communicate with your cluster, see [Install or update `kubectl`](install-kubectl.md#kubectl-install-update) for more information.

### Procedure
<a name="_procedure"></a>

1. Add the EKS Helm chart repository.

   ```
   helm repo add eks https://aws.github.io/eks-charts
   ```

1. Update your local Helm repository.

   ```
   helm repo update
   ```

1. Install the EFA device plugin.

   ```
   helm install efa eks/aws-efa-k8s-device-plugin -n kube-system
   ```

1. Verify the EFA device plugin DaemonSet is running.

   ```
   kubectl get daemonset -n kube-system aws-efa-k8s-device-plugin-daemonset
   ```

   ```
   NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
   aws-efa-k8s-device-plugin-daemonset   2         2         2       2            2           <none>          60s
   ```

1. Verify that your nodes have allocatable EFA resources.

   ```
   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"
   ```

   ```
   NAME                                           EFA
   ip-192-168-11-225.us-west-2.compute.internal   4
   ip-192-168-24-96.us-west-2.compute.internal    4
   ```

### Request EFA devices in a Pod
<a name="_request_efa_devices_in_a_pod"></a>

To request EFA devices using the device plugin, specify the `vpc.amazonaws.com/efa` resource in your container resource requests or limits.

```
apiVersion: v1
kind: Pod
metadata:
  name: efa-workload
spec:
  containers:
  - name: app
    ...
    resources:
      limits:
        vpc.amazonaws.com/efa: 4
        hugepages-2Mi: ...
      requests:
        vpc.amazonaws.com/efa: 4
        hugepages-2Mi: ...
```

# Manage Neuron devices on Amazon EKS
<a name="device-management-neuron"></a>

 AWS Trainium and AWS Inferentia are purpose-built machine learning chips designed by AWS. Amazon EKS supports two mechanisms for managing Neuron devices in EKS clusters: the *Neuron DRA driver* and the *Neuron Kubernetes device plugin*.

It’s recommended to use the Neuron DRA driver for new deployments on EKS clusters running Kubernetes version 1.34 or later. The Neuron DRA driver provides topology-aware allocation, connected device subset scheduling, Logical NeuronCore (LNC) configuration, and UltraServer multi-node allocation without requiring custom scheduler extensions. The Neuron device plugin remains supported.

## Neuron DRA driver vs. Neuron device plugin
<a name="_neuron_dra_driver_vs_neuron_device_plugin"></a>


| Feature | Neuron DRA driver | Neuron device plugin | 
| --- | --- | --- | 
|  Minimum Kubernetes version  |  1.34  |  All EKS-supported Kubernetes versions  | 
|  Karpenter and EKS Auto Mode  |  Not supported  |  Supported  | 
|  EKS-optimized AMI support  |  AL2023  |  AL2023, Bottlerocket  | 
|  Device advertisement  |  Rich attributes via `ResourceSlice` objects including device ID, instance type, topology, driver version, and EFA locality  |  Integer count of `aws.amazon.com/neuron` and `aws.amazon.com/neuroncore` extended resources  | 
|  Connected device subsets  |  Allocate subsets of 1, 4, 8, or 16 connected Neuron devices using topology constraints  |  Requires the [Neuron scheduler extension](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html) for contiguous device allocation  | 
|  LNC configuration  |  Per-workload Logical NeuronCore configuration (LNC=1 or LNC=2) through `ResourceClaimTemplate` parameters  |  Requires pre-configuration in EC2 launch templates  | 
|  Attribute-based selection  |  Filter devices by instance type, driver version, and other attributes using CEL expressions  |  Not supported  | 

## Install the Neuron DRA driver
<a name="neuron-dra-driver"></a>

The Neuron DRA driver advertises Neuron devices as `ResourceSlice` objects with the `DeviceClass` name `neuron.aws.com`. The driver runs as a DaemonSet and automatically discovers Neuron devices and their topology attributes.

Detailed information about the Neuron DRA driver is available in the [Neuron DRA documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/neuron-dra.html#neuron-dra-driver-attributes-reference).

Using the Neuron DRA driver with Bottlerocket is not currently supported.

### Prerequisites
<a name="_prerequisites"></a>
+ An Amazon EKS cluster running Kubernetes version 1.34 or later.
+ Nodes with AWS Trainium or Inferentia2 instance types.
+ Helm installed in your command-line environment, see the [Setup Helm instructions](helm.md) for more information.
+  `kubectl` configured to communicate with your cluster, see [Install or update `kubectl`](install-kubectl.md#kubectl-install-update) for more information.

### Procedure
<a name="_procedure"></a>

**Important**  
Do not install the Neuron DRA driver on nodes where the Neuron device plugin is running. The two mechanisms cannot coexist on the same node. See upstream Kubernetes [KEP-5004](https://github.com/kubernetes/enhancements/issues/5004) for updates.

1. Install the Neuron DRA driver using Helm.

   ```
   helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
       --namespace neuron-dra-driver \
       --create-namespace \
       --set "devicePlugin.enabled=false" \
       --set "npd.enabled=false" \
       --set "draDriver.enabled=true"
   ```

   The driver is deployed as a DaemonSet in the `neuron-dra-driver` namespace by default with the `DeviceClass` `neuron.aws.com`.

1. Verify that the DRA driver DaemonSet is running.

   ```
   kubectl get ds -n neuron-dra-driver neuron-dra-driver-kubelet-plugin
   ```

   ```
   NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
   neuron-dra-driver-kubelet-plugin  1         1         1       1            1           <none>          60s
   ```

1. Verify that the `DeviceClass` was created.

   ```
   kubectl get deviceclass neuron.aws.com
   ```

   ```
   NAME            AGE
   neuron.aws.com  60s
   ```

1. Verify that `ResourceSlice` objects are advertised for your nodes.

   ```
   kubectl get resourceslice
   ```

See the [Neuron DRA documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/neuron-dra.html#neuron-dra-driver-attributes-reference) for information on the available `ResourceSlice` object attributes.

### Request Neuron devices in a Pod
<a name="_request_neuron_devices_in_a_pod"></a>

To request Neuron devices using the DRA driver, create a `ResourceClaimTemplate` that references the `neuron.aws.com` `DeviceClass` and reference it in your Pod specification.

The following example requests all Neuron devices on a `trn2.48xlarge` instance:

```
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-neurons
spec:
  spec:
    devices:
      requests:
      - name: neurons
        exactly:
          deviceClassName: neuron.aws.com
          selectors:
          - cel:
              expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
          allocationMode: All
---
apiVersion: v1
kind: Pod
metadata:
  name: neuron-workload
spec:
  containers:
  - name: app
    ...
    resources:
      claims:
      - name: neurons
  resourceClaims:
  - name: neurons
    resourceClaimTemplateName: all-neurons
```

### Allocate connected device subsets
<a name="_allocate_connected_device_subsets"></a>

The Neuron DRA driver can allocate subsets of connected Neuron devices without requiring the [Neuron scheduler extension](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html). Supported subset sizes are 1, 4, 8, or 16 devices. Use the `matchAttribute` constraint with a topology group ID to ensure devices are connected.

The following example requests 4 connected Neuron devices:

```
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: 1x4-connected-neurons
spec:
  spec:
    devices:
      requests:
      - name: neurons
        exactly:
          deviceClassName: neuron.aws.com
          allocationMode: ExactCount
          count: 4
          selectors:
          - cel:
              expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
      constraints:
      - requests: ["neurons"]
        matchAttribute: "resource.aws.com/devicegroup4_id"
```

The supported `matchAttribute` values for connected subsets are `resource.aws.com/devicegroup1_id`, `resource.aws.com/devicegroup4_id`, `resource.aws.com/devicegroup8_id`, and `resource.aws.com/devicegroup16_id`.

### Configure Logical NeuronCores (LNC)
<a name="_configure_logical_neuroncores_lnc"></a>

The Neuron DRA driver allows per-workload Logical NeuronCore configuration through `ResourceClaimTemplate` parameters. This eliminates the need to pre-configure LNC in EC2 Launch Templates.

The following example requests all Neuron devices with LNC set to 1:

```
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-neurons-lnc-1
spec:
  spec:
    devices:
      requests:
      - name: neurons
        exactly:
          deviceClassName: neuron.aws.com
          selectors:
          - cel:
              expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
          allocationMode: All
      config:
      - requests: ["neurons"]
        opaque:
          driver: neuron.aws.com
          parameters:
            apiVersion: neuron.aws.com/v1
            kind: NeuronConfig
            logicalNeuronCore: 1
```

## Install the Neuron Kubernetes device plugin
<a name="neuron-device-plugin"></a>

The Neuron Kubernetes device plugin advertises Neuron devices as `aws.amazon.com/neuron` and NeuronCores as `aws.amazon.com/neuroncore` extended resources. You request Neuron devices in container resource requests and limits.
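
For example, a Pod that needs two NeuronCores requests the extended resource directly in its container spec; the container name and image below are placeholders.

```
apiVersion: v1
kind: Pod
metadata:
  name: neuron-core-workload
spec:
  containers:
  - name: app                          # placeholder container name
    image: my-registry/my-image:latest # placeholder image
    resources:
      limits:
        aws.amazon.com/neuroncore: 2
      requests:
        aws.amazon.com/neuroncore: 2
```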

### Prerequisites
<a name="_prerequisites_2"></a>
+ An Amazon EKS cluster.
+ Nodes with host-level components installed for AWS Trainium or AWS Inferentia instances. These are included if using the EKS AL2023 accelerated AMIs or the EKS Bottlerocket AMIs.
+ Helm installed in your command-line environment. For more information, see the [Setup Helm instructions](helm.md).
+ `kubectl` configured to communicate with your cluster. For more information, see [Install or update `kubectl`](install-kubectl.md#kubectl-install-update).

### Procedure
<a name="_procedure_2"></a>

1. Install the Neuron Kubernetes device plugin using Helm.

   ```
   helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
       --set "npd.enabled=false"
   ```

1. Verify the Neuron device plugin DaemonSet is running.

   ```
   kubectl get ds -n kube-system neuron-device-plugin
   ```

   ```
   NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
   neuron-device-plugin   1         1         1       1            1           <none>          60s
   ```

1. Verify that your nodes have allocatable Neuron devices.

   ```
   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neuron,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"
   ```

   ```
   NAME                                           NeuronDevice   NeuronCore
   ip-192-168-47-173.us-west-2.compute.internal   1              2
   ```

### Verify Neuron devices with a test Pod
<a name="_verify_neuron_devices_with_a_test_pod"></a>

You can verify that Neuron devices are accessible by running the `neuron-ls` tool in a test Pod.

1. Create a file named `neuron-ls.yaml` with the following contents. This manifest launches a [Neuron Monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) container that has the `neuron-ls` tool installed.

   ```
   apiVersion: v1
   kind: Pod
   metadata:
     name: neuron-ls
   spec:
     restartPolicy: Never
     containers:
     - name: neuron-container
       image: public.ecr.aws/g4h4h0b5/neuron-monitor:1.0.0
       command: ["/bin/sh"]
       args: ["-c", "neuron-ls"]
       resources:
         limits:
           aws.amazon.com/neuron: 1
     tolerations:
     - key: "aws.amazon.com/neuron"
       operator: "Exists"
       effect: "NoSchedule"
   ```

1. Apply the manifest.

   ```
   kubectl apply -f neuron-ls.yaml
   ```

1. After the Pod has finished running, view its logs.

   ```
   kubectl logs neuron-ls
   ```

   An example output is as follows.

   ```
   instance-type: inf2.xlarge
   instance-id: ...
   +--------+--------+--------+---------+
   | NEURON | NEURON | NEURON |   PCI   |
   | DEVICE | CORES  | MEMORY |   BDF   |
   +--------+--------+--------+---------+
   | 0      | 2      | 32 GB  | 00:1f.0 |
   +--------+--------+--------+---------+
   ```

**Note**  
When using the Neuron device plugin, contiguous device allocation on instances with multiple Neuron devices (such as `trn2.48xlarge`) requires the [Neuron Kubernetes scheduler extension](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html). The Neuron DRA driver handles this automatically through topology constraints.

For more information about using Neuron devices with Amazon EKS, see the [Neuron documentation for running on EKS](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html).

# Manage NVIDIA GPU devices on Amazon EKS
<a name="device-management-nvidia"></a>

NVIDIA GPUs are widely used for machine learning training, inference, and high-performance computing workloads. Amazon EKS supports two mechanisms for managing NVIDIA GPU devices in your EKS clusters: the *NVIDIA DRA driver for GPUs* and the *NVIDIA Kubernetes device plugin*.

We recommend the NVIDIA DRA driver for new deployments on clusters running Kubernetes version 1.34 or later. The NVIDIA DRA driver enables flexible GPU allocation and GPU sharing between containers. The NVIDIA device plugin remains supported.

## NVIDIA DRA driver vs. NVIDIA device plugin
<a name="_nvidia_dra_driver_vs_nvidia_device_plugin"></a>


| Capability | NVIDIA DRA driver | NVIDIA device plugin | 
| --- | --- | --- | 
|  Minimum Kubernetes version  |  1.34  |  All EKS-supported Kubernetes versions  | 
|  Karpenter and EKS Auto Mode  |  Not supported  |  Supported  | 
|  EKS-optimized AMIs  |  AL2023  |  AL2023, Bottlerocket  | 
|  Device advertisement  |  Rich attributes via `ResourceSlice` objects including GPU model, memory, driver version, and topology  |  Integer count of `nvidia.com/gpu` extended resources  | 
|  GPU sharing  |  Multiple containers in the same Pod can share a GPU through shared `ResourceClaim` references  |  Not supported. Each GPU is exclusively allocated to one container.  | 
|  ComputeDomains  |  Manages Multi-Node NVLink (MNNVL) through `ComputeDomain` resources for secure multi-node GPU communication  |  Not supported  | 
|  Attribute-based selection  |  Filter GPUs by model, memory, or other attributes using CEL expressions  |  Not supported  | 

## Install the NVIDIA DRA driver
<a name="nvidia-dra-driver"></a>

The NVIDIA DRA driver for GPUs manages two types of resources: GPUs and ComputeDomains. It runs two DRA kubelet plugins: `gpu-kubelet-plugin` and `compute-domain-kubelet-plugin`. Each can be enabled or disabled separately during installation. This guide focuses on GPU allocation. For using ComputeDomains, see [Use P6e-GB200 UltraServers with Amazon EKS](ml-eks-nvidia-ultraserver.md).

Using the NVIDIA DRA driver with Bottlerocket is not currently supported.

### Prerequisites
<a name="_prerequisites"></a>
+ An Amazon EKS cluster running Kubernetes version 1.34 or later.
+ Nodes with NVIDIA GPU instance types (such as `P` or `G` instances).
+ Nodes with host-level components installed for NVIDIA GPUs. When using the EKS-optimized AL2023 or Bottlerocket NVIDIA AMIs, the host-level NVIDIA driver, CUDA user mode driver, and container toolkit are pre-installed.
+ Helm installed in your command-line environment. For more information, see the [Setup Helm instructions](helm.md).
+ `kubectl` configured to communicate with your cluster. For more information, see [Install or update `kubectl`](install-kubectl.md#kubectl-install-update).

### Procedure
<a name="_procedure"></a>

**Important**  
When you use the NVIDIA DRA driver for GPU device management, you can't deploy it alongside the NVIDIA device plugin on the same node. See upstream Kubernetes [KEP-5004](https://github.com/kubernetes/enhancements/issues/5004) for updates.

1. Add the NVIDIA DRA driver Helm chart repository.

   ```
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   ```

1. Update your local Helm repository.

   ```
   helm repo update
   ```

1. Get the latest version of the NVIDIA DRA driver.

   ```
   helm search repo nvidia/nvidia-dra
   ```

1. Install the NVIDIA DRA driver.

   ```
   helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
       --create-namespace \
       --namespace nvidia-dra-driver-gpu \
       --set resources.computeDomains.enabled=false \
       --set 'gpuResourcesEnabledOverride=true'
   ```

   For advanced configuration options, see the [NVIDIA DRA driver Helm chart](https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/deployments/helm/nvidia-dra-driver-gpu/values.yaml).

1. Verify that the DRA driver pods are running.

   ```
   kubectl get pods -n nvidia-dra-driver-gpu
   ```

1. Verify that the `DeviceClass` objects were created.

   ```
   kubectl get deviceclass
   ```

   ```
   NAME            AGE
   gpu.nvidia.com  60s
   ```

1. Verify that `ResourceSlice` objects are published for your GPU nodes.

   ```
   kubectl get resourceslice
   ```

### Request NVIDIA GPUs in a Pod
<a name="_request_nvidia_gpus_in_a_pod"></a>

To request NVIDIA GPUs using the DRA driver, create a `ResourceClaimTemplate` that references the `gpu.nvidia.com` `DeviceClass` and reference it in your Pod specification.

The following example requests a single GPU:

```
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: app
    ...
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
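As noted in the comparison table, multiple containers in the same Pod can share a GPU by referencing the same claim. A minimal sketch using a standalone `ResourceClaim` instead of a template (object and container names are illustrative):

```
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
        count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing
spec:
  containers:
  - name: app-a
    ...
    resources:
      claims:
      - name: shared-gpu
  - name: app-b
    ...
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```

Because both containers list the same entry under `resources.claims`, they are allocated the same GPU rather than one GPU each.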

## Install the NVIDIA Kubernetes device plugin
<a name="nvidia-device-plugin"></a>

The NVIDIA Kubernetes device plugin advertises NVIDIA GPUs as `nvidia.com/gpu` extended resources. You request GPUs in container resource requests and limits.

### Prerequisites
<a name="_prerequisites_2"></a>
+ An Amazon EKS cluster.
+ Nodes with NVIDIA GPU instance types using the EKS-optimized AL2023 NVIDIA AMI. The EKS-optimized Bottlerocket AMIs include the NVIDIA device plugin and no separate installation is required.
+ Helm installed in your command-line environment. For more information, see the [Setup Helm instructions](helm.md).
+ `kubectl` configured to communicate with your cluster. For more information, see [Install or update `kubectl`](install-kubectl.md#kubectl-install-update).

### Procedure
<a name="_procedure_2"></a>

1. Add the NVIDIA device plugin Helm chart repository.

   ```
   helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
   ```

1. Update your local Helm repository.

   ```
   helm repo update
   ```

1. Install the NVIDIA Kubernetes device plugin.

   ```
   helm install nvdp nvdp/nvidia-device-plugin \
       --namespace nvidia \
       --create-namespace \
       --set gfd.enabled=true
   ```
   **Note**  
   You can also install and manage the NVIDIA Kubernetes device plugin using the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator), which automates the management of all NVIDIA software components needed to provision GPUs.

1. Verify the NVIDIA device plugin DaemonSet is running.

   ```
   kubectl get ds -n nvidia nvdp-nvidia-device-plugin
   ```

   ```
   NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
   nvdp-nvidia-device-plugin   2         2         2       2            2           <none>          60s
   ```

1. Verify that your nodes have allocatable GPUs.

   ```
   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
   ```

   An example output is as follows.

   ```
   NAME                                           GPU
   ip-192-168-11-225.us-west-2.compute.internal   1
   ip-192-168-24-96.us-west-2.compute.internal    1
   ```

### Request NVIDIA GPUs in a Pod
<a name="_request_nvidia_gpus_in_a_pod_2"></a>

To request NVIDIA GPUs using the device plugin, specify the `nvidia.com/gpu` resource in your container resource limits.

```
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-demo
    image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
    command: ["/bin/sh", "-c"]
    args: ["nvidia-smi && tail -f /dev/null"]
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

To run this test, apply the manifest and view the logs:

```
kubectl apply -f nvidia-smi.yaml
kubectl logs nvidia-smi
```

An example output is as follows.

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI XXX.XXX.XX            Driver Version: XXX.XXX.XX     CUDA Version: XX.X      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   27C    P8             11W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

# Recipes to optimize your Amazon EKS cluster for AI/ML workloads
<a name="ml-recipes"></a>

**Tip**  
 [Register](https://aws-experience.com/emea/smb/events/series/get-hands-on-with-amazon-eks?trk=4a9b4147-2490-4c63-bc9f-f8a84b122c8c&sc_channel=el) for upcoming Amazon EKS AI/ML workshops.

This section provides bite-sized recipes for optimizing your Amazon EKS cluster for AI/ML workloads that involve specialized hardware. You’ll find guidance on preventing pods from being scheduled on specific nodes by adding taints to managed node groups, including prerequisites, step-by-step procedures, and deployment considerations.

**Topics**
+ [Recipe: Prevent pods from being scheduled on specific nodes](node-taints-managed-node-groups.md)

# Recipe: Prevent pods from being scheduled on specific nodes
<a name="node-taints-managed-node-groups"></a>

## Overview
<a name="_overview"></a>

Nodes with specialized processors, such as GPUs, can be more expensive to run than nodes on standard machines. To protect these nodes from workloads that don’t require special hardware, you can use Kubernetes taints. Taints mark nodes to repel pods that don’t have matching tolerations, ensuring only compatible workloads are scheduled. For more information, see [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) in the Kubernetes documentation.

Kubernetes node taints can be applied to new and existing managed node groups using the AWS Management Console or through the Amazon EKS API. This recipe shows how to apply taints to Amazon EKS managed node groups using the AWS CLI. For information on creating a node group with a taint using the AWS Management Console, see [Create a managed node group for your cluster](create-managed-node-group.md).

## Prerequisites
<a name="_prerequisites"></a>
+ An [existing Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html).
+  [AWS CLI installed and configured](https://docs.aws.amazon.com/eks/latest/userguide/setting-up.html) with appropriate permissions.

## Steps
<a name="_steps"></a>

### Step 1: Create a node group with taints
<a name="_step_1_create_a_node_group_with_taints"></a>

Use the `aws eks create-nodegroup` command to create a new managed node group with taints. This example applies a taint with key `dedicated`, value `gpuGroup`, and effect `NO_SCHEDULE`.

```
aws eks create-nodegroup \
 --cli-input-json '
{
  "clusterName": "my-cluster",
  "nodegroupName": "node-taints-example",
  "subnets": [
     "subnet-1234567890abcdef0",
     "subnet-abcdef01234567890",
     "subnet-021345abcdef67890"
   ],
  "nodeRole": "arn:aws:iam::111122223333:role/AmazonEKSNodeRole",
  "taints": [
     {
         "key": "dedicated",
         "value": "gpuGroup",
         "effect": "NO_SCHEDULE"
     }
   ]
}'
```
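Only pods whose spec includes a matching toleration can be scheduled onto the tainted nodes. A minimal sketch (the pod name, container, and resource limit are illustrative):

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-app
spec:
  containers:
  - name: app
    ...
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpuGroup"
    effect: "NoSchedule"
```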

For more information and examples, see [taint](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#taint) in the Kubernetes reference documentation.

### Step 2: Update taints on an existing node group
<a name="_step_2_update_taints_on_an_existing_node_group"></a>

Use the [aws eks update-nodegroup-config](https://docs.aws.amazon.com/cli/latest/reference/eks/update-nodegroup-config.html) AWS CLI command to add, remove, or replace taints for managed node groups.

```
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name node-taints-example \
  --taints 'removeTaints=[{key=dedicated,value=gpuGroup,effect=NO_SCHEDULE}]'
```
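The same command accepts an `addOrUpdateTaints` list to add new taints or replace existing ones, for example (reusing the cluster and node group names from above):

```
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name node-taints-example \
  --taints 'addOrUpdateTaints=[{key=dedicated,value=gpuGroup,effect=NO_SCHEDULE}]'
```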

## Notes
<a name="_notes"></a>
+ Taints can be updated after you create the node group using the `UpdateNodegroupConfig` API.
+ The taint key must begin with a letter or number. It can contain letters, numbers, hyphens (`-`), periods (`.`), and underscores (`_`). It can be up to 63 characters long.
+ Optionally, the taint key can begin with a DNS subdomain prefix and a single `/`. If it begins with a DNS subdomain prefix, it can be 253 characters long.
+ The value is optional and must begin with a letter or number. It can contain letters, numbers, hyphens (`-`), periods (`.`), and underscores (`_`). It can be up to 63 characters long.
+ When using Kubernetes directly or the AWS Management Console, the taint effect must be `NoSchedule`, `PreferNoSchedule`, or `NoExecute`. However, when using the AWS CLI or API, the taint effect must be `NO_SCHEDULE`, `PREFER_NO_SCHEDULE`, or `NO_EXECUTE`.
+ A maximum of 50 taints are allowed per node group.
+ If taints that were created using a managed node group are removed manually from a node, then Amazon EKS doesn’t add the taints back to the node. This is true even if the taints are specified in the managed node group configuration.
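The key rules above can be sketched as a quick shell check before calling the API. This is a hypothetical helper, not part of the AWS CLI, and it enforces only the constraints stated in this list:

```shell
# valid_taint_key: returns 0 if the candidate taint key satisfies the
# rules above; it checks only the documented constraints.
valid_taint_key() {
  key="$1"
  name="${key##*/}"                  # strip an optional DNS subdomain prefix
  [ "${#key}" -le 253 ] || return 1  # prefixed keys: up to 253 characters
  # name part: begins with a letter or number; may contain letters,
  # numbers, hyphens, periods, and underscores; up to 63 characters
  printf '%s' "$name" | grep -Eq '^[A-Za-z0-9][A-Za-z0-9._-]{0,62}$'
}

valid_taint_key "dedicated"       # plain key
valid_taint_key "nvidia.com/gpu"  # key with DNS subdomain prefix
```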

# Resources to get started with AI/ML on Amazon EKS
<a name="ml-resources"></a>

To get started with machine learning on EKS, choose from these prescriptive patterns to prepare an EKS cluster, along with ML software and hardware, for running ML workloads.

## Workshops
<a name="aiml-workshops"></a>

### [Generative AI on Amazon EKS Workshop](https://genai.eksworkshop.com/)
<a name="_generative_ai_on_amazon_eks_workshop"></a>

Learn how to get started with Large Language Model (LLM) applications and inference on Amazon EKS. Discover how to deploy and manage production-grade LLM workloads. Through hands-on labs, you’ll explore how to leverage Amazon EKS along with AWS services and open-source tools to create robust LLM solutions. The workshop environment provides all the necessary infrastructure and tools, allowing you to focus on learning and implementation.

### [Generative AI on Amazon EKS using Neuron](https://catalog.us-east-1.prod.workshops.aws/workshops/e21aadbd-23cb-4207-bd09-625e6de08a6c/en-US)
<a name="_generative_ai_on_amazon_eks_using_neuron"></a>

Learn how to get started with Large Language Model (LLM) applications and inference on Amazon EKS. Discover how to deploy and manage production-grade LLM workloads, implement advanced RAG patterns with vector databases, and build data-backed LLM applications using open-source frameworks. Through hands-on labs, you’ll explore how to leverage Amazon EKS along with AWS services and open-source tools to create robust LLM solutions. The workshop environment provides all the necessary infrastructure and tools, allowing you to focus on learning and implementation.

## [Best Practices](https://docs.aws.amazon.com/eks/latest/best-practices/aiml.html)
<a name="aiml-best-practices"></a>

The AI/ML-focused topics in the Amazon EKS Best Practices guide provide detailed recommendations across the following areas to optimize your AI/ML workloads on Amazon EKS.

### [AI/ML Compute and Autoscaling](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-compute.html)
<a name="_aiml_compute_and_autoscaling"></a>

This section outlines best practices for optimizing AI/ML compute and autoscaling in Amazon EKS, focusing on GPU resource management, node resiliency, and application scaling. It provides strategies such as scheduling workloads with well-known labels and node affinity, using ML Capacity Blocks or On-Demand Capacity Reservations, and implementing node health checks with tools like EKS Node Monitoring Agent.

### [AI/ML Networking](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-networking.html)
<a name="_aiml_networking"></a>

This section outlines best practices for optimizing AI/ML networking in Amazon EKS to enhance performance and scalability, including strategies like selecting instances with higher network bandwidth or Elastic Fabric Adapter (EFA) for distributed training, installing tools like MPI and NCCL, and enabling prefix delegation to increase IP addresses and improve pod launch times.

### [AI/ML Security](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-security.html)
<a name="_aiml_security"></a>

This section focuses on securing data storage and ensuring compliance for AI/ML workloads on Amazon EKS, including practices such as using Amazon S3 with AWS Key Management Service (KMS) for server-side encryption (SSE-KMS), configuring buckets with regional KMS keys and S3 Bucket Keys to reduce costs, granting IAM permissions for KMS actions like decryption to EKS pods, and auditing with AWS CloudTrail logs.

### [AI/ML Storage](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-storage.html)
<a name="_aiml_storage"></a>

This section provides best practices for optimizing storage in AI/ML workloads on Amazon EKS, including practices like deploying models using CSI drivers to mount services like S3, FSx for Lustre, or EFS as Persistent Volumes, selecting storage based on workload needs (e.g., FSx for Lustre for distributed training with options like Scratch-SSD or Persistent-SSD), and enabling features like data compression and striping.

### [AI/ML Observability](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-observability.html)
<a name="_aiml_observability"></a>

This section focuses on monitoring and optimizing GPU utilization for AI/ML workloads on Amazon EKS to improve efficiency and reduce costs, including strategies such as targeting high GPU usage with tools like CloudWatch Container Insights and NVIDIA’s DCGM-Exporter integrated with Prometheus and Grafana, and metrics we recommend analyzing for your AI/ML workloads.

### [AI/ML Performance](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-performance.html)
<a name="_aiml_performance"></a>

This section focuses on enhancing application scaling and performance for AI/ML workloads on Amazon EKS through container image management and startup optimization, including practices such as using small lightweight base images or AWS Deep Learning Containers with multi-stage builds, preloading images via EBS snapshots or pre-pulling into runtime cache using DaemonSets or Deployments.

## Reference Architectures
<a name="aiml-reference-architectures"></a>

Explore these GitHub repositories for reference architectures, sample code, and utilities to implement distributed training and inference for AI/ML workloads on Amazon EKS and other AWS services.

### [AWSome Distributed Training](https://github.com/aws-samples/awsome-distributed-training)
<a name="_awsome_distributed_training"></a>

This repository offers a collection of best practices, reference architectures, model training examples, and utilities for training large models on AWS. It supports distributed training with Amazon EKS, including CloudFormation templates for EKS clusters, custom AMI and container builds, test cases for frameworks like PyTorch (DDP/FSDP, MegatronLM, NeMo) and JAX, and tools for validation, observability, and performance monitoring such as EFA Prometheus exporter and Nvidia Nsight Systems.

### [AWSome Inference](https://github.com/aws-samples/awsome-inference)
<a name="_awsome_inference"></a>

This repository provides reference architectures and test cases for optimizing inference solutions on AWS, with a focus on Amazon EKS and accelerated EC2 instances. It includes infrastructure setups for VPC and EKS clusters, projects for frameworks like NVIDIA NIMs, TensorRT-LLM, Triton Inference Server, and RayService, with examples for models such as Llama3-8B and Llama 3.1 405B. It features multi-node deployments using K8s LeaderWorkerSet, EKS autoscaling, Multi-Instance GPUs (MIG), and real-life use cases like an audio bot for ASR, inference, and TTS.

## Tutorials
<a name="aiml-tutorials"></a>

If you are interested in setting up Machine Learning platforms and frameworks in EKS, explore the tutorials described in this section. These tutorials cover everything from patterns for making the best use of GPU processors to choosing modeling tools to building frameworks for specialized industries.

### Build generative AI platforms on EKS
<a name="_build_generative_ai_platforms_on_eks"></a>
+  [Deploy Generative AI Models on Amazon EKS](https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/) 
+  [Building multi-tenant JupyterHub Platforms on Amazon EKS](https://aws.amazon.com/blogs/containers/building-multi-tenant-jupyterhub-platforms-on-amazon-eks/) 

### Run specialized generative AI frameworks on EKS
<a name="_run_specialized_generative_ai_frameworks_on_eks"></a>
+  [Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS](https://aws.amazon.com/blogs/machine-learning/accelerate-your-generative-ai-distributed-training-workloads-with-the-nvidia-nemo-framework-on-amazon-eks/) 
+  [Running TorchServe on Amazon Elastic Kubernetes Service](https://aws.amazon.com/blogs/opensource/running-torchserve-on-amazon-elastic-kubernetes-service/) 

### Maximize NVIDIA GPU performance for ML on EKS
<a name="_maximize_nvidia_gpu_performance_for_ml_on_eks"></a>
+ Implement GPU sharing to efficiently use NVIDIA GPUs for your EKS clusters:

   [GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances](https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances/) 
+ Use Multi-Instance GPUs (MIGs) and NIM microservices to run more pods per GPU on your EKS clusters:

   [Maximizing GPU utilization with NVIDIA’s Multi-Instance GPU (MIG) on Amazon EKS: Running more pods per GPU for enhanced performance](https://aws.amazon.com/blogs/containers/maximizing-gpu-utilization-with-nvidias-multi-instance-gpu-mig-on-amazon-eks-running-more-pods-per-gpu-for-enhanced-performance/) 
+  [Build and deploy a scalable machine learning system on Kubernetes with Kubeflow on AWS](https://aws.amazon.com/blogs/machine-learning/build-and-deploy-a-scalable-machine-learning-system-on-kubernetes-with-kubeflow-on-aws/) 

### Run video encoding workloads on EKS
<a name="_run_video_encoding_workloads_on_eks"></a>
+  [Delivering video content with fractional GPUs in containers on Amazon EKS](https://aws.amazon.com/blogs/containers/delivering-video-content-with-fractional-gpus-in-containers-on-amazon-eks/) 

### Accelerate image loading for inference workloads
<a name="_accelerate_image_loading_for_inference_workloads"></a>
+  [How H2O.ai optimized and secured their AI/ML infrastructure with Karpenter and Bottlerocket](https://aws.amazon.com/blogs/containers/how-h2o-ai-optimized-and-secured-their-ai-ml-infrastructure-with-karpenter-and-bottlerocket/) 

### Monitoring ML workloads
<a name="_monitoring_ml_workloads"></a>
+  [Monitoring GPU workloads on Amazon EKS using AWS managed open-source services](https://aws.amazon.com/blogs/mt/monitoring-gpu-workloads-on-amazon-eks-using-aws-managed-open-source-services/) 
+  [Enable pod-based GPU metrics in Amazon CloudWatch](https://aws.amazon.com/blogs/machine-learning/enable-pod-based-gpu-metrics-in-amazon-cloudwatch/) 