Deploy models from local NVMe storage using kubectl

This topic shows you how to deploy inference endpoints on Amazon SageMaker HyperPod that load model weights from a node's local NVMe storage instead of pulling them over the network from Amazon S3 or Amazon FSx. Reading weights locally eliminates the network hop during pod startup, which reduces inference pod cold-start time and is useful for autoscaling events, scale-from-zero workloads, and latency-sensitive failovers. For workloads where cold-start latency is not a concern, use modelSourceType: s3 or fsx and skip this topic.

Local NVMe is node-local and ephemeral: data on NVMe is lost when a node is replaced, for example during a spot interruption, hardware failure, or AMI refresh. The approaches in this topic handle this differently — some require you to pre-populate every node, others fall back to Amazon S3 automatically when the model is not cached locally. Local NVMe instance storage is typically found in P, G, and Trn instance families. See Amazon EC2 instance store specifications to validate availability for your instance type.

You can choose from the following approaches based on your storage requirements:

NVMe deployment approaches
| # | Approach | Description |
|---|----------|-------------|
| 1 | Kubernetes volume (no fallback) | Use when model weights exist on NVMe on every node. Simplest setup with no Amazon S3, Amazon FSx, PV/PVC, or initContainers required. |
| 2 | Kubernetes volume with fallback | Use when the model might not exist on NVMe on every node. You provide a custom initContainer that checks NVMe first and downloads from Amazon S3 using IRSA credentials if the model is missing. |
| 3 | Amazon S3 with prefetch and fallback | Use when you want to stage model weights to RAM for pod startup. You provide a custom initContainer that checks NVMe first and falls back to copying from the operator-provisioned Amazon S3 mount if the model is not cached locally. |

Prerequisites

Before you begin, verify that you've:

Choose your deployment approach

Use the following decision flow to determine which approach is right for your use case.

Want to use a volume of your choice, e.g. NVMe?
├─ NO:  Use S3/FSx/HF as-is (no volume override needed).
└─ YES: Are you sure EVERY node has the model on NVMe?
   ├─ YES: Approach 1. Use k8sVolume field in CRD to read from NVMe directly.
   └─ NO:  Do you need the operator to create S3/FSx PVCs as a fallback when the model is missing on a node?
      ├─ YES: Approach 3. Use S3 with prefetch enabled. Custom initContainer checks NVMe first, falls back to S3, and copies to RAM.
      └─ NO:  Approach 2. Use k8sVolume with a custom initContainer you create that checks NVMe first and downloads from S3 via IRSA if the model is missing.

Deploy using a Kubernetes volume (no fallback)

Use this approach when you have model weights on NVMe on every node and want the simplest setup — no Amazon S3 or Amazon FSx configuration, no PV/PVC, and no initContainers.

When you set modelSourceType: kubernetesVolume, the operator skips PV/PVC creation entirely. No CSI driver, Amazon S3 fuse mount, or Amazon FSx mount is used. The customer-provided model-weights volume is used directly in the pod spec, and the worker reads model data from NVMe at /opt/ml/model.

Important

When using modelSourceType: kubernetesVolume, the operator derives the expected volume name from modelVolumeMount.name in your worker configuration. kubernetes.volumes must contain a volume with that same name. The operator validates this and rejects the deployment with a KubernetesVolumeValidationFailed condition if no matching volume is found. In the following example, both modelVolumeMount.name and the kubernetes.volumes entry use the name model-weights.

  1. Create the InferenceEndpointConfig YAML file. Replace the placeholder values with your actual resource identifiers.

    cat <<EOF > deploy_nvme_k8s_volume.yaml
    apiVersion: inference.sagemaker.aws.amazon.com/v1
    kind: InferenceEndpointConfig
    metadata:
      name: nvme-k8s-volume
      namespace: default
    spec:
      endpointName: nvme-k8s-volume
      modelName: Qwen2.5-VL-7B-Instruct
      invocationEndpoint: v1/chat/completions
      replicas: 1
      modelSourceConfig:
        modelSourceType: kubernetesVolume
      kubernetes:
        volumes:
          - name: model-weights
            hostPath:
              path: /opt/dlami/nvme/<YOUR_MODEL>
              type: Directory
      loadBalancer:
        healthCheckPath: /health
      worker:
        image: lmcache/vllm-openai:latest
        args:
          - /opt/ml/model
          - --max-model-len
          - "15000"
          - --tensor-parallel-size
          - "1"
        modelInvocationPort:
          containerPort: 8000
          name: http
        modelVolumeMount:
          name: model-weights
          mountPath: /opt/ml/model
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 30Gi
            nvidia.com/gpu: "1"
        environmentVariables:
          - name: PYTHONHASHSEED
            value: "123"
          - name: VLLM_REQUEST_TIMEOUT
            value: "600"
    EOF
  2. Deploy the InferenceEndpointConfig.

    kubectl apply -f deploy_nvme_k8s_volume.yaml
  3. Verify the deployment status.

    kubectl describe InferenceEndpointConfig nvme-k8s-volume -n default

Deploy using a Kubernetes volume with fallback

Use this approach when the model might or might not be on NVMe on a given node. A hostPath volume only works on nodes where the data exists — pods scheduled on other nodes would mount an empty or nonexistent path, causing the model server to fail.

In this approach, you set modelSourceType: kubernetesVolume and provide a custom initContainer that checks NVMe first and downloads from Amazon S3 using IRSA credentials if the model is missing.

Set up IRSA

Before deploying, configure IAM Roles for Service Accounts (IRSA) to give your pods credentials for downloading from Amazon S3.

  1. Get the OIDC provider ID for your cluster.

    aws eks describe-cluster --name <CLUSTER_NAME> --region <REGION> \
      --query "cluster.identity.oidc.issuer" --output text
  2. Create an IAM trust policy. Save the following as trust-policy.json, replacing the placeholder values.

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {
          "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
          "StringEquals": {
            "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:sub": "system:serviceaccount:<NAMESPACE>:<SA_NAME>",
            "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:aud": "sts.amazonaws.com"
          }
        }
      }]
    }

    Warning

    Always scope the trust policy to a specific namespace and ServiceAccount name. Never use wildcards in the subject condition (for example, system:serviceaccount:*:*), as this would allow any ServiceAccount in any namespace to assume the role.

  3. Create the IAM role and attach a scoped Amazon S3 read policy for your model bucket.

    aws iam create-role --role-name <ROLE_NAME> \
      --assume-role-policy-document file://trust-policy.json

    aws iam put-role-policy --role-name <ROLE_NAME> \
      --policy-name S3ModelReadAccess \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:ListBucket"],
          "Resource": [
            "arn:aws:s3:::<YOUR_BUCKET>",
            "arn:aws:s3:::<YOUR_BUCKET>/<YOUR_MODEL_PREFIX>/*"
          ]
        }]
      }'
  4. Create the Kubernetes service account with the IRSA annotation.

    kubectl create sa <SA_NAME> -n <NAMESPACE>
    kubectl annotate sa <SA_NAME> -n <NAMESPACE> \
      eks.amazonaws.com/role-arn=arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>
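The issuer value printed in step 1 is a full URL, while the trust policy in step 2 needs only the trailing ID. A minimal sketch of extracting it in the shell follows. The helper name oidc_id_from_issuer and the sample issuer URL are made up for illustration, and the URL format (https://oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>) is assumed from the placeholders in the trust policy above.

```shell
#!/usr/bin/env bash
# Sketch: derive the <OIDC_ID> placeholder from an EKS OIDC issuer URL.
# oidc_id_from_issuer is a hypothetical helper, not part of the AWS CLI.
oidc_id_from_issuer() {
  local issuer="${1%/}"            # drop a trailing slash, if present
  printf '%s\n' "${issuer##*/}"    # keep everything after the last "/"
}

# Made-up issuer URL, used only to show the expected shape of the input.
oidc_id_from_issuer "https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE0123456789ABCDEF"
```

In practice the argument would be the output of the describe-cluster command from step 1, for example OIDC_ID=$(oidc_id_from_issuer "$(aws eks describe-cluster ... --output text)").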

Deploy the model

  1. Create the InferenceEndpointConfig YAML file. Replace the placeholder values with your actual resource identifiers.

    cat <<'EOF' > deploy_nvme_k8s_volume_fallback.yaml
    apiVersion: inference.sagemaker.aws.amazon.com/v1
    kind: InferenceEndpointConfig
    metadata:
      name: nvme-k8s-volume-fallback
      namespace: default
    spec:
      endpointName: nvme-k8s-volume-fallback
      modelName: Qwen2.5-VL-7B-Instruct
      invocationEndpoint: v1/chat/completions
      replicas: 1
      modelSourceConfig:
        modelSourceType: kubernetesVolume
      kubernetes:
        serviceAccountName: <YOUR_SERVICE_ACCOUNT>
        initContainers:
          - name: smart-loader
            image: public.ecr.aws/aws-cli/aws-cli:latest
            command: ["/bin/bash", "-c"]
            args:
              - |
                # A non-empty /model directory counts as a cache hit
                if [ "$(ls -A /model)" ]; then
                  echo "NVMe hit: model already present ($(du -sh /model | cut -f1))"
                else
                  echo "NVMe miss: downloading from S3"
                  aws s3 sync s3://<YOUR_BUCKET>/<YOUR_MODEL>/ /model/
                fi
            volumeMounts:
              - name: model-weights
                mountPath: /model
        volumes:
          - name: model-weights
            hostPath:
              path: /opt/dlami/nvme/<YOUR_MODEL>
              type: DirectoryOrCreate
      loadBalancer:
        healthCheckPath: /health
      worker:
        image: lmcache/vllm-openai:latest
        args:
          - /opt/ml/model
          - --max-model-len
          - "15000"
          - --tensor-parallel-size
          - "1"
        modelInvocationPort:
          containerPort: 8000
          name: http
        modelVolumeMount:
          name: model-weights
          mountPath: /opt/ml/model
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 30Gi
            nvidia.com/gpu: "1"
        environmentVariables:
          - name: PYTHONHASHSEED
            value: "123"
    EOF
  2. Deploy the InferenceEndpointConfig.

    kubectl apply -f deploy_nvme_k8s_volume_fallback.yaml
  3. Verify the deployment status.

    kubectl describe InferenceEndpointConfig nvme-k8s-volume-fallback -n default
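The initContainer in step 1 treats any non-empty directory as a cache hit, so a download that was interrupted partway could be mistaken for a complete model on the next pod start. One possible hardening, sketched below, is to write a marker file only after the sync succeeds and to test for that marker instead. The marker name .download-complete and both helper functions are hypothetical and are not something the operator provides. Whether your model server tolerates an extra dotfile in the model directory is an assumption you should verify.

```shell
#!/usr/bin/env bash
# Sketch: treat the NVMe directory as a cache hit only if a completion marker exists.
# MARKER_NAME, model_cache_ready, and mark_cache_complete are hypothetical names.
MARKER_NAME=".download-complete"

# Succeeds only when the directory exists and contains the marker file.
model_cache_ready() {
  local dir="$1"
  [ -d "$dir" ] && [ -f "$dir/$MARKER_NAME" ]
}

# Records that the download into the directory finished.
mark_cache_complete() {
  local dir="$1"
  touch "$dir/$MARKER_NAME"
}

# Inside the initContainer the flow could look like this (not executed here):
#   if model_cache_ready /model; then
#     echo "NVMe hit"
#   else
#     aws s3 sync s3://<YOUR_BUCKET>/<YOUR_MODEL>/ /model/ && mark_cache_complete /model
#   fi
```

Because the marker is written only when aws s3 sync exits successfully, a pod that is killed mid-download leaves no marker, and the next pod on that node repeats the sync.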

Deploy using Amazon S3 with prefetch and NVMe fallback

Use this approach when you want the worker to read model weights from RAM for faster pod startup, with automatic fallback to Amazon S3 if the model isn't cached locally on NVMe.

When you set modelSourceType: s3 with prefetchEnabled: true, the operator creates two volumes automatically:

  • A volume named after your modelVolumeMount.name (typically model-weights) — an Amazon S3 CSI fuse mount containing your model

  • model-weights-copy — a RAM-backed emptyDir where the worker reads from

You add a custom nvme-cache volume pointing to the node's local NVMe storage, and a custom initContainer that:

  • If the model exists on NVMe — copies from NVMe to RAM (model-weights-copy), skipping the network entirely.

  • If the model is missing — falls back to copying from the Amazon S3 mount (model-weights) to RAM (model-weights-copy). Optionally copies to NVMe so subsequent pod startups on the same node use the fast local path.

Important

Do not override model-weights in kubernetes.volumes when using this approach. The operator creates model-weights pointing to the Amazon S3 CSI volume. Overriding it removes the operator-provisioned volume that your initContainer needs for fallback. Use a separate volume name (for example, nvme-cache) for your NVMe hostPath.

Important

Do not include model-weights-copy in kubernetes.volumes. It is a reserved name created automatically by the operator. Your initContainer can reference it in volumeMounts but must not declare it as a volume.

  1. Create the InferenceEndpointConfig YAML file. Replace the placeholder values with your actual resource identifiers.

    cat <<'EOF' > deploy_nvme_s3_prefetch_fallback.yaml
    apiVersion: inference.sagemaker.aws.amazon.com/v1
    kind: InferenceEndpointConfig
    metadata:
      name: nvme-s3-prefetch-fallback
      namespace: default
    spec:
      endpointName: nvme-s3-prefetch-fallback
      modelName: Qwen2.5-VL-7B-Instruct
      invocationEndpoint: v1/chat/completions
      replicas: 1
      modelSourceConfig:
        modelSourceType: s3
        s3Storage:
          bucketName: <YOUR_BUCKET>
          region: <YOUR_REGION>
        prefetchEnabled: true
      kubernetes:
        serviceAccountName: <YOUR_SERVICE_ACCOUNT>
        initContainers:
          - name: smart-loader
            image: public.ecr.aws/aws-cli/aws-cli:latest
            command: ["/bin/bash", "-c"]
            args:
              - |
                # Check NVMe first, fall back to S3 mount, then copy to RAM
                if [ "$(ls -A /nvme)" ]; then
                  echo "NVMe hit ($(du -sh /nvme | cut -f1))"
                  echo "Copying model from NVMe to RAM..."
                  cp -r /nvme/* /model/
                else
                  echo "NVMe miss: copying from S3 mount to NVMe, then NVMe to RAM"
                  cp -r /s3-model/* /nvme/
                  cp -r /nvme/* /model/
                fi
                echo "Done. $(du -sh /model | cut -f1) in RAM."
            volumeMounts:
              - name: model-weights
                mountPath: /s3-model
              - name: nvme-cache
                mountPath: /nvme
              - name: model-weights-copy
                mountPath: /model
        volumes:
          - name: nvme-cache
            hostPath:
              path: /opt/dlami/nvme/<YOUR_MODEL>
              type: DirectoryOrCreate
      loadBalancer:
        healthCheckPath: /health
      worker:
        image: lmcache/vllm-openai:latest
        args:
          - /opt/ml/model
          - --max-model-len
          - "15000"
          - --tensor-parallel-size
          - "1"
        modelInvocationPort:
          containerPort: 8000
          name: http
        modelVolumeMount:
          name: model-weights
          mountPath: /opt/ml/model
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 30Gi
            nvidia.com/gpu: "1"
        environmentVariables:
          - name: PYTHONHASHSEED
            value: "123"
          - name: VLLM_REQUEST_TIMEOUT
            value: "600"
    EOF
  2. Deploy the InferenceEndpointConfig.

    kubectl apply -f deploy_nvme_s3_prefetch_fallback.yaml
  3. Verify the deployment status.

    kubectl describe InferenceEndpointConfig nvme-s3-prefetch-fallback -n default
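The two-branch loader logic used by the initContainer can be tried locally before it is embedded in a manifest. The sketch below wraps it in a function whose three arguments stand in for the /nvme, /s3-model, and /model mounts. The function name stage_model is hypothetical. Unlike the manifest, it copies with the "dir/." form, which also picks up dotfiles that a "dir/*" glob would skip. That is a design choice of this sketch, not behavior of the operator.

```shell
#!/usr/bin/env bash
# Sketch of the NVMe-first loader with fallback, parameterized for local testing.
# stage_model is a hypothetical helper; arguments mirror the three mounts:
#   $1 = NVMe cache dir, $2 = S3 mount dir, $3 = RAM-backed target dir
stage_model() {
  local nvme_dir="$1" s3_dir="$2" ram_dir="$3"
  if [ -n "$(ls -A "$nvme_dir" 2>/dev/null)" ]; then
    echo "NVMe hit"
  else
    echo "NVMe miss"
    mkdir -p "$nvme_dir"
    # Populate the NVMe cache from the S3 mount so later pods on this node hit it
    cp -r "$s3_dir"/. "$nvme_dir"/ || return 1
  fi
  mkdir -p "$ram_dir"
  # In both branches the worker-facing copy is made from NVMe
  cp -r "$nvme_dir"/. "$ram_dir"/
}
```

Calling stage_model twice against the same cache directory shows the intended behavior: the first call reports a miss and fills the cache, and the second reports a hit without reading the S3 directory.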

Understanding model-weights and model-weights-copy with prefetch

When using prefetch, the operator creates two model-related volumes:

  • A volume named after your modelVolumeMount.name (typically model-weights) — an Amazon S3 CSI fuse mount containing your model

  • model-weights-copy — a RAM-backed emptyDir where the worker actually reads from

In your InferenceEndpointConfig, you define:

modelVolumeMount:
  name: model-weights
  mountPath: /opt/ml/model

While you reference model-weights, when prefetchEnabled: true, it is actually model-weights-copy that gets mounted at /opt/ml/model in the worker container. When using a custom initContainer, ensure that you copy the data into the volume called model-weights-copy — that is where the worker expects to find it.

When prefetchEnabled: false, there is only one volume (named after your modelVolumeMount.name) and it is mounted directly at /opt/ml/model.

Configure a custom service account

You can assign a custom Kubernetes ServiceAccount to your inference endpoint pods using the spec.kubernetes.serviceAccountName field in the InferenceEndpointConfig. This is useful for providing AWS credentials via IRSA (IAM Roles for Service Accounts) to your worker containers or init containers — for example, to download model weights from Amazon S3 in a fallback scenario.

Important

Custom service account support is disabled by default and must be explicitly enabled by a cluster administrator before use. See Enable custom service accounts for instructions.

If you do not specify a ServiceAccount, the namespace's default ServiceAccount is used.

Enable custom service accounts

Custom service account support is disabled by default. A cluster administrator must enable it in the operator's Helm configuration before users can reference custom ServiceAccounts in their InferenceEndpointConfig.

  • Update the operator Helm values to enable the feature. If you deployed the operator via Helm, upgrade with the flag:

    helm upgrade hyperpod-inference-operator <CHART_PATH> \
      --set enableCustomServiceAccounts=true \
      --reuse-values
  • If you deployed the operator as an Amazon EKS add-on, update the add-on configuration to include enableCustomServiceAccounts: true in the advanced configuration settings.

  • Verify the operator pod has the environment variable set:

    kubectl get deployment hyperpod-inference-operator-controller-manager \
      -n hyperpod-inference-system \
      -o jsonpath='{.spec.template.spec.containers[0].env}' \
      | jq '.[] | select(.name=="ENABLE_CUSTOM_SERVICE_ACCOUNTS")'

    You should see:

    {
      "name": "ENABLE_CUSTOM_SERVICE_ACCOUNTS",
      "value": "true"
    }

Important

If this feature is not enabled, any InferenceEndpointConfig that specifies kubernetes.serviceAccountName is rejected with a DeploymentFailed status and the message: kubernetes.serviceAccountName is not enabled. Requires addon configuration (enableCustomServiceAccounts: true).

Label the service account

Before you can reference a custom ServiceAccount, a cluster administrator must label it as user-assignable:

kubectl label serviceaccount <your-service-account> \
  sagemaker.amazonaws.com/user-assignable=true \
  -n <namespace>

Only ServiceAccounts with this label can be referenced by inference endpoints. This is a security control to prevent unauthorized privilege escalation.

Specify the service account in your configuration

Add the serviceAccountName field under spec.kubernetes in your InferenceEndpointConfig:

apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: my-inference-endpoint
  namespace: my-namespace
spec:
  kubernetes:
    serviceAccountName: my-inference-sa
  # ... rest of your config

Validation rules

The operator validates the serviceAccountName field on both create and update operations. Your deployment will be rejected with a DeploymentFailed status if any of the following conditions are met:

  • The ServiceAccount does not exist in the namespace — serviceAccountName "X" not found in namespace "Y"

  • The ServiceAccount is missing the required label — serviceAccountName "X" is not labeled as user-assignable (requires label sagemaker.amazonaws.com/user-assignable=true)

  • The ServiceAccount is the operator's system ServiceAccount — serviceAccountName must not reference the operator's service account

Note

All containers in the inference pod (worker, init containers, and sidecars) inherit the permissions of the specified ServiceAccount. If the ServiceAccount is annotated with eks.amazonaws.com/role-arn, the pod receives temporary AWS credentials for that IAM role. Cluster administrators should only label ServiceAccounts as user-assignable after reviewing the associated RBAC roles and IAM permissions.

Note

If a ServiceAccount is deleted while an InferenceEndpointConfig is already running, existing pods continue to run with their current credentials until they are restarted. However, new pod creation (for example, during scaling or rescheduling) fails because the ServiceAccount no longer exists. The operator validates the ServiceAccount only when the deployment is first created and when the InferenceEndpointConfig spec is updated; it does not continuously monitor the ServiceAccount. Updating the InferenceEndpointConfig spec after the ServiceAccount is deleted results in a DeploymentFailed status.

Security best practices for custom service accounts

When you use a custom ServiceAccount with inference endpoints, the HyperPod inference operator creates Deployments on your behalf. All containers in the inference pod — including the worker, init containers, and sidecars — inherit the permissions of the specified ServiceAccount. Follow these best practices to secure your cluster.

Lock down RBAC permissions

  • Create a dedicated ServiceAccount for each inference workload. Do not reuse ServiceAccounts across unrelated workloads.

  • Bind only the minimum RBAC permissions required. For example, if your init container only needs to read from Amazon S3, the ServiceAccount should not have permissions to list or modify Kubernetes resources.

    # Example: minimal ServiceAccount for an inference workload that only needs S3 access via IRSA
    # No Kubernetes API permissions needed; IRSA provides AWS credentials directly
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: my-inference-sa
      namespace: my-namespace
      labels:
        sagemaker.amazonaws.com/user-assignable: "true"
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<SCOPED_ROLE_NAME>

  • Avoid granting cluster-wide permissions (ClusterRoleBindings) to ServiceAccounts used by inference pods.

Scope IRSA IAM roles

  • When annotating a ServiceAccount with eks.amazonaws.com/role-arn, ensure the IAM role follows least-privilege principles.

  • Scope Amazon S3 permissions to the specific bucket and prefix containing your model weights.

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
          "arn:aws:s3:::<YOUR_BUCKET>",
          "arn:aws:s3:::<YOUR_BUCKET>/<YOUR_MODEL_PREFIX>/*"
        ]
      }]
    }

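  • Do not use broad managed policies such as AmazonS3FullAccess in production. Prefer a custom policy scoped to your model bucket. If you must use a managed policy, use AmazonS3ReadOnlyAccess, keeping in mind that it still grants read access to every bucket in the account.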

Protect the user-assignable label

  • Only cluster administrators should add or remove the sagemaker.amazonaws.com/user-assignable=true label. Use Kubernetes RBAC to restrict who can modify ServiceAccount labels in your namespace.

  • Review the RBAC roles and IAM permissions associated with a ServiceAccount before labeling it as user-assignable.

  • Periodically audit which ServiceAccounts carry the user-assignable label.

    kubectl get serviceaccounts -n <NAMESPACE> -l sagemaker.amazonaws.com/user-assignable=true
  • Ensure non-admin roles do not include patch, update, or create verbs on ServiceAccount resources. The operator validates the user-assignable label at deployment time, but does not prevent unauthorized users from adding the label to a ServiceAccount. Restricting who can modify ServiceAccounts via RBAC is the primary control for protecting this label. Non-admin users should only have get and list access:

    # Example: RBAC Role for non-admin users (read-only access to ServiceAccounts)
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: sa-read-only
      namespace: <NAMESPACE>
    rules:
      - apiGroups: [""]
        resources: ["serviceaccounts"]
        verbs: ["get", "list"]

Important

The HyperPod inference operator acts as an intermediary that creates Deployments on behalf of users. Unlike standard Kubernetes workloads where the caller directly creates pods, the operator assigns the specified ServiceAccount to pods it creates. This means that any permissions granted to a user-assignable ServiceAccount are effectively available to anyone who can create an InferenceEndpointConfig in that namespace. Ensure that namespace-level RBAC controls who can create and update InferenceEndpointConfig resources.

Preload model weights to NVMe

If you need to pre-populate NVMe on specific nodes before deploying, you can use a one-off pod to sync from Amazon S3.

Note

This approach targets a specific node via nodeName and does not work with autoscaling. For autoscaling scenarios, use the Kubernetes volume with fallback or Amazon S3 with prefetch approaches, which handle missing models automatically via initContainer fallback logic.

  1. Create the preload pod YAML file. Replace the placeholder values with your actual resource identifiers.

    cat <<EOF > nvme-s3-copy.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: nvme-s3-copy
      namespace: default
    spec:
      nodeName: <TARGET_NODE>
      restartPolicy: Never
      containers:
        - name: s3-copy
          image: public.ecr.aws/aws-cli/aws-cli:latest
          command: ["/bin/bash", "-c"]
          args:
            - |
              echo "=== Starting S3 sync to NVMe ==="
              aws s3 sync s3://<YOUR_BUCKET>/<YOUR_MODEL>/ /nvme/<YOUR_MODEL>/ --region <YOUR_REGION>
              echo "=== Sync complete ==="
              ls -la /nvme/<YOUR_MODEL>/
              du -sh /nvme/<YOUR_MODEL>/
              echo "=== Done ==="
          volumeMounts:
            - name: nvme-storage
              mountPath: /nvme
      serviceAccountName: default
      volumes:
        - name: nvme-storage
          hostPath:
            path: /opt/dlami/nvme
            type: Directory
    EOF
  2. Apply the pod and monitor the sync progress.

    kubectl apply -f nvme-s3-copy.yaml
    kubectl get pod nvme-s3-copy -w
    kubectl logs nvme-s3-copy -f
  3. Clean up the pod after the sync completes.

    kubectl delete pod nvme-s3-copy
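Because the preload pod is pinned to a single node, preloading several nodes means creating one pod per node. A minimal sketch of generating those manifests follows. The helper name render_preload_pod is hypothetical, and the pod name is suffixed with the node name on the assumption that your node names are valid in a pod name. The manifest body mirrors the one in step 1, reduced to the sync command.

```shell
#!/usr/bin/env bash
# Sketch: print one preload Pod manifest for a given node.
# render_preload_pod is a hypothetical helper; arguments are
#   $1 = node name, $2 = bucket, $3 = model prefix, $4 = region
render_preload_pod() {
  local node="$1" bucket="$2" model="$3" region="$4"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvme-s3-copy-${node}
  namespace: default
spec:
  nodeName: ${node}
  restartPolicy: Never
  serviceAccountName: default
  containers:
    - name: s3-copy
      image: public.ecr.aws/aws-cli/aws-cli:latest
      command: ["/bin/bash", "-c"]
      args:
        - aws s3 sync s3://${bucket}/${model}/ /nvme/${model}/ --region ${region}
      volumeMounts:
        - name: nvme-storage
          mountPath: /nvme
  volumes:
    - name: nvme-storage
      hostPath:
        path: /opt/dlami/nvme
        type: Directory
EOF
}
```

The output can be piped to kubectl in a loop, for example: for n in <NODE_A> <NODE_B>; do render_preload_pod "$n" <YOUR_BUCKET> <YOUR_MODEL> <YOUR_REGION> | kubectl apply -f -; done. Each pod still needs to be deleted after its sync completes.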

Reserved volume names

The operator manages several internal volumes that cannot be overridden via kubernetes.volumes. Using any of these names results in a KubernetesVolumeValidationFailed condition.

Reserved volume names
| # | Name | Purpose |
|---|------|---------|
| 1 | shm | Shared memory (/dev/shm) for inter-process communication |
| 2 | model-weights-copy | RAM-backed emptyDir used when prefetchEnabled: true |
| 3 | parallel-copy-configmap | ConfigMap for parallel copy script (prefetch) |
| 4 | lmcache-config | LMCache configuration volume |
| 5 | gated-model-downloader-configmap | ConfigMap for gated model download script |
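To catch a reserved name before the operator rejects the deployment, the names in the table above can be checked locally. The sketch below is one way to do that. The helper is_reserved_volume_name is hypothetical and only compares strings; it does not call the operator, and the list must be kept in sync with the table by hand.

```shell
#!/usr/bin/env bash
# Sketch: local pre-check of volume names against the reserved list above.
# is_reserved_volume_name is a hypothetical helper, not an operator API.
RESERVED_VOLUME_NAMES="shm model-weights-copy parallel-copy-configmap lmcache-config gated-model-downloader-configmap"

# Returns 0 if the given name is reserved, 1 otherwise.
is_reserved_volume_name() {
  local candidate="$1" name
  for name in $RESERVED_VOLUME_NAMES; do
    if [ "$candidate" = "$name" ]; then
      return 0
    fi
  done
  return 1
}

# Check the volume names you plan to declare in kubernetes.volumes.
for v in model-weights nvme-cache model-weights-copy; do
  if is_reserved_volume_name "$v"; then
    echo "$v: reserved"
  else
    echo "$v: ok"
  fi
done
```

Of the three names checked here, only model-weights-copy is reported as reserved, which matches the earlier guidance that an initContainer may mount it but must not declare it under kubernetes.volumes.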

Things to remember

  • Do not use reserved volume names. The operator manages several internal volumes (see Reserved volume names). Using any of these names in kubernetes.volumes results in a KubernetesVolumeValidationFailed condition.

  • Volume names must match. The operator derives the volume name from modelVolumeMount.name. When using modelSourceType: kubernetesVolume, kubernetes.volumes must contain a volume with that same name.

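  • Mount volumes to the correct location in your initContainer. Every volume your initContainer reads from or writes to must be listed in its volumeMounts, and the paths used in the script must match those mountPath values (for example, /s3-model, /nvme, and /model in the prefetch example).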

  • No custom service account is needed for S3/FSx. If you are unable to create custom service accounts or prefer not to, you can use modelSourceType: s3 or fsx. The operator provisions S3/FSx volumes automatically. You can still add custom initContainers and override volumes on top of the operator-managed storage.

  • IRSA credentials are injected into all containers. When you set kubernetes.serviceAccountName to a service account with an IRSA annotation, Amazon EKS injects AWS credentials (aws-iam-token volume, AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE) into all containers, including your custom initContainers.

  • Do not set modelLocation when using kubernetesVolume. The volume path is controlled by kubernetes.volumes. Setting modelLocation when modelSourceType is kubernetesVolume results in a validation error.

  • Understand how model-weights vs model-weights-copy works with prefetch. When prefetchEnabled: true, the operator creates two model-related volumes:

    • model-weights — the source volume (from Amazon S3/Amazon FSx PVC or your override)

    • model-weights-copy — a RAM-backed emptyDir where the worker actually reads from

  • While you reference model-weights in your config, when prefetchEnabled: true, it is actually model-weights-copy that gets mounted at /opt/ml/model in the worker container. When using a custom initContainer, ensure that you copy the data into the volume called model-weights-copy — that is where the worker expects to find it. When prefetchEnabled: false, there is only one volume (named after your modelVolumeMount.name) and it is mounted directly at /opt/ml/model.

Troubleshooting

Use these debugging commands if your deployment isn't working as expected.

  • Check the InferenceEndpointConfig status to see the high-level deployment state and any configuration issues.

    kubectl describe InferenceEndpointConfig <ENDPOINT_NAME> -n <NAMESPACE>
  • Check the Kubernetes deployment status.

    kubectl describe deployment <ENDPOINT_NAME> -n <NAMESPACE>
  • Check the status of all Kubernetes objects in your namespace.

    kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n <NAMESPACE>
  • Check initContainer logs if the model loading step fails.

    kubectl logs <POD_NAME> -c smart-loader -n <NAMESPACE>
  • If the deployment fails with "not found in namespace", verify the ServiceAccount exists:

    kubectl get serviceaccount <name> -n <namespace>
  • If the deployment fails with "not labeled as user-assignable", ask your cluster administrator to add the required label:

    kubectl label serviceaccount <name> sagemaker.amazonaws.com/user-assignable=true -n <namespace>
  • If the deployment fails with "must not reference the operator's service account", create a separate ServiceAccount for your workload. You cannot use the HyperPod inference operator's own ServiceAccount.