Custom certificates and Route 53 DNS management for HyperPod Inference
The following steps show you how to use your own ACM certificates for HyperPod inference endpoints and optionally configure the operator to manage Route 53 DNS records for your custom domain.
With custom certificates, you provide an ACM certificate ARN and the operator attaches it to the Application Load Balancer (ALB), monitors its health, and supports automatic renewal detection. The operator supports publicly trusted ACM certificates, AWS Private CA certificates, and certificates imported from external CAs.
Custom certificates can be used on their own or combined with Route 53 DNS management. Route 53 DNS management requires a custom certificate and uses the domain name from your certificate configuration to create and manage DNS records.
Prerequisites
Before you begin, verify that you've:
- Set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see Setting up your HyperPod clusters for model deployment.
- Installed kubectl in your terminal.
- Provisioned or imported a TLS certificate in ACM in the same AWS Region as your HyperPod cluster. The certificate must be in the Issued state and must include a certificate chain (intermediate and root CA). Self-signed certificates imported to ACM are not supported as custom certificates because they lack a certificate chain.
- (For Route 53 DNS management) Created a Route 53 hosted zone for your domain. Public hosted zones are recommended. Private hosted zones work for record creation, but the operator verifies DNS resolution using the pod's DNS resolver, which relies on the VPC DNS resolver. For private hosted zones, the VPC must have DNS resolution and DNS hostnames enabled, and the private hosted zone must be associated with the cluster's VPC.
Configure IAM permissions
The inference operator's execution role requires additional permissions for custom certificates and Route 53 DNS management. Add the following policies to your HyperPod Inference execution role.
ACM and Amazon S3 permissions for custom certificates
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ACMCustomCertificateAccess",
      "Effect": "Allow",
      "Action": [
        "acm:DescribeCertificate",
        "acm:GetCertificate"
      ],
      "Resource": "arn:aws:acm:<region>:<account-id>:certificate/*"
    },
    {
      "Sid": "S3CertificateUpload",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:PutObjectTagging"
      ],
      "Resource": "arn:aws:s3:::<tls-certificate-bucket>/*",
      "Condition": {
        "StringEquals": {
          "s3:RequestObjectTag/CreatedBy": "HyperPodInference"
        }
      }
    }
  ]
}
Note
If your Amazon S3 bucket name starts with hyperpod-tls, the Amazon S3
permissions are already included in the
AmazonSageMakerHyperPodInferenceAccess managed policy and
you only need to add the ACM statement.
Route 53 permissions for DNS management
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Route53DNSManagement",
      "Effect": "Allow",
      "Action": [
        "route53:GetHostedZone",
        "route53:ListResourceRecordSets",
        "route53:ChangeResourceRecordSets"
      ],
      "Resource": "arn:aws:route53:::hostedzone/<hosted-zone-id>"
    }
  ]
}
Replace <region>, <account-id>,
<tls-certificate-bucket>, and
<hosted-zone-id> with your actual values. You can scope
the ACM resource ARN to specific certificates for tighter security.
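As a minimal sketch of scoping the ACM statement to specific certificates, the following Python builds the statement from an explicit list of certificate ARNs instead of the certificate/* wildcard. The function names acm_statement and policy_document are hypothetical helpers for illustration and are not part of any AWS SDK. The ARN check is a simple shape check, not a full validation.

```python
import json


def acm_statement(certificate_arns):
    """Build the ACM statement scoped to explicit certificate ARNs (illustrative helper)."""
    if not certificate_arns:
        raise ValueError("at least one certificate ARN is required")
    for arn in certificate_arns:
        # Expected shape: arn:<partition>:acm:<region>:<account-id>:certificate/<id>
        parts = arn.split(":", 5)
        if len(parts) != 6 or parts[2] != "acm" or not parts[5].startswith("certificate/"):
            raise ValueError(f"not an ACM certificate ARN: {arn}")
    return {
        "Sid": "ACMCustomCertificateAccess",
        "Effect": "Allow",
        "Action": ["acm:DescribeCertificate", "acm:GetCertificate"],
        "Resource": list(certificate_arns),
    }


def policy_document(statements):
    """Wrap statements in a policy document and render it as JSON text."""
    return json.dumps({"Version": "2012-10-17", "Statement": statements}, indent=2)


if __name__ == "__main__":
    arn = "arn:aws:acm:us-west-2:123456789012:certificate/abc12345-1234-1234-1234-abc123456789"
    print(policy_document([acm_statement([arn])]))
```

The printed JSON can be attached to the execution role in place of the wildcard ACM statement shown earlier.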
Configure a custom certificate
To use a custom certificate, add the customCertificateConfig section
to your tlsConfig in the InferenceEndpointConfig or
JumpStartModel spec. The tlsConfig and
dnsConfig fields are identical in both CRDs.
The following fields are available in customCertificateConfig and
tlsConfig:
- tlsConfig.customCertificateConfig.acmArn (Required, String): The ARN of your ACM certificate. Must be in the Issued state.
- tlsConfig.customCertificateConfig.domainName (Required, String): The domain name to use from the certificate.
  - Must be a specific domain, not a wildcard. For wildcard certificates (for example, *.example.com), specify the specific subdomain (for example, api.example.com).
  - Must be lowercase.
  - Must match one of the domain names listed in the certificate.
- tlsConfig.tlsCertificateOutputS3Uri (Conditional, String): Amazon S3 URI where the operator uploads the public certificate. Required unless the operator was installed with the TLS_CERTIFICATE_OUTPUT_S3URI environment variable configured. If you're unsure whether this was set, specify it explicitly.
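The domainName rules above can be checked locally before you apply the manifest. The following Python is a rough sketch, not the operator's actual validation logic. The helpers wildcard_matches and check_domain_name are hypothetical, and the sketch assumes that a wildcard name such as *.example.com covers exactly one leftmost label.

```python
def wildcard_matches(pattern, domain):
    """Return True if a certificate name covers the domain.

    Assumption: "*." stands for exactly one leftmost label, so
    *.example.com covers api.example.com but not a.b.example.com.
    """
    if not pattern.startswith("*."):
        return pattern == domain
    suffix = pattern[1:]  # for example ".example.com"
    if not domain.endswith(suffix):
        return False
    label = domain[: -len(suffix)]
    return bool(label) and "." not in label


def check_domain_name(domain_name, certificate_names):
    """Return a list of rule violations; an empty list means the value looks usable."""
    problems = []
    if "*" in domain_name:
        problems.append("domainName must be a specific domain, not a wildcard")
    if domain_name != domain_name.lower():
        problems.append("domainName must be lowercase")
    # Compare case-insensitively so a casing problem is reported only once.
    candidate = domain_name.lower()
    if not any(wildcard_matches(name.lower(), candidate) for name in certificate_names):
        problems.append("domainName does not match any name in the certificate")
    return problems


if __name__ == "__main__":
    print(check_domain_name("api.example.com", ["*.example.com"]))
```

The certificate names passed in would come from the certificate's domain name and subject alternative names, for example from the output of aws acm describe-certificate.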
The following is an example YAML file for creating an endpoint with a custom certificate.
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: my-model
  namespace: my-namespace
spec:
  modelName: my-llm
  instanceType: ml.g5.24xlarge
  invocationEndpoint: v1/chat/completions
  replicas: 2
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: my-model-bucket
      region: us-west-2
    modelLocation: models/my-llm
  tlsConfig:
    customCertificateConfig:
      acmArn: arn:aws:acm:us-west-2:123456789012:certificate/abc12345-1234-1234-1234-abc123456789
      domainName: api.example.com
    tlsCertificateOutputS3Uri: s3://my-tls-bucket
  worker:
    image: my-inference-image:latest
    modelInvocationPort:
      containerPort: 8000
      name: http
    modelVolumeMount:
      name: model-weights
      mountPath: /opt/ml/model
    resources:
      limits:
        nvidia.com/gpu: "4"
      requests:
        cpu: "6"
        memory: 30Gi
        nvidia.com/gpu: "4"
You can add customCertificateConfig to a deployment that is already
running. The operator detects the change on the next reconciliation, validates the
certificate, attaches it to the ALB, and uploads the public certificate to Amazon S3. No
downtime or redeployment is required.
Important
If you remove the customCertificateConfig section from a running
deployment, the operator falls back to generating a new self-signed certificate.
This replaces your CA-signed certificate on the ALB.
Configure Route 53 DNS management
Route 53 DNS management requires a custom certificate to be configured and uses
the domain name from
tlsConfig.customCertificateConfig.domainName. If you have not
configured a custom certificate, see Configure a custom certificate
first.
To enable Route 53 DNS management, add the dnsConfig section to your
spec:
spec:
  tlsConfig:
    customCertificateConfig:
      acmArn: arn:aws:acm:us-west-2:123456789012:certificate/abc12345-1234-1234-1234-abc123456789
      domainName: api.example.com
    tlsCertificateOutputS3Uri: s3://my-tls-bucket
  dnsConfig:
    hostedZoneId: Z1234567890ABC
The dnsConfig section can be added at the same time as
customCertificateConfig, or added later to an existing deployment.
The operator creates the DNS records on the next reconciliation without restarting
your pods.
When you deploy with dnsConfig, the operator:
- Validates the hosted zone exists and that your domain belongs to the hosted zone (for example, api.example.com requires a hosted zone for example.com).
- Checks for existing A records at the target domain to prevent accidental overwrites.
- Creates an A record (alias) pointing your domain to the ALB, and a TXT record with an ownership marker (hyperpod-inference/owner=<namespace>/<name>) to prevent conflicts between endpoints. Do not modify or delete the TXT record manually.
- Polls DNS resolution every 30 seconds until the domain resolves, then marks the status as Active. If the record doesn't resolve within 10 minutes, the status transitions to Error.
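The polling behavior in the last step can be pictured with a short Python sketch. This illustrates the timing only (a check every 30 seconds, giving up after 10 minutes) and is not the operator's implementation. The functions resolves and wait_for_dns are hypothetical, and the resolver and sleep parameters exist only so the loop can be exercised without real DNS lookups or delays.

```python
import socket
import time


def resolves(domain):
    """Return True if the local resolver returns at least one address for the domain."""
    try:
        socket.getaddrinfo(domain, 443)
        return True
    except socket.gaierror:
        return False


def wait_for_dns(domain, resolver=resolves, interval=30, timeout=600, sleep=time.sleep):
    """Poll until the domain resolves ("Active") or the timeout elapses ("Error")."""
    waited = 0
    while True:
        if resolver(domain):
            return "Active"
        if waited >= timeout:
            return "Error"
        # While waiting, the status would be reported as Pending.
        sleep(interval)
        waited += interval


if __name__ == "__main__":
    print(wait_for_dns("api.example.com"))
```

Because the lookup uses the local resolver, results on your workstation can differ from what the operator sees through the pod's DNS resolver, especially with private hosted zones.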
Note
Route 53 DNS management is non-blocking. Errors in DNS record creation or
propagation do not prevent your inference endpoint from reaching the
Ready state. The endpoint remains accessible through the ALB's
default hostname. Check the dnsStatus section in the resource
status for DNS-specific errors.
Use a custom certificate without Route 53 DNS management
Route 53 DNS management is optional. If you want to manage your own DNS records,
omit the dnsConfig section and configure your domain manually.
To find the ALB hostname for your deployment, query the ingress that the operator created. The ingress name and namespace depend on whether intelligent routing is enabled.
Without intelligent routing: The ingress is named
alb-<deployment-name> in your namespace.
kubectl get ingress alb-<deployment-name> -n <namespace> \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
With intelligent routing: The ingress is named
alb-<deployment-name>-<namespace> in the
hyperpod-inference-system namespace.
kubectl get ingress alb-<deployment-name>-<namespace> -n hyperpod-inference-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
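If you script this lookup, the two naming rules can be captured in one place. The following Python is a small sketch that only assembles the kubectl arguments. The helpers ingress_location and kubectl_hostname_command are hypothetical names introduced for illustration.

```python
def ingress_location(deployment_name, namespace, intelligent_routing):
    """Return (ingress name, ingress namespace) following the naming rules above."""
    if intelligent_routing:
        return f"alb-{deployment_name}-{namespace}", "hyperpod-inference-system"
    return f"alb-{deployment_name}", namespace


def kubectl_hostname_command(deployment_name, namespace, intelligent_routing=False):
    """Build the kubectl argument list that prints the ALB hostname."""
    name, ingress_namespace = ingress_location(deployment_name, namespace, intelligent_routing)
    return [
        "kubectl", "get", "ingress", name,
        "-n", ingress_namespace,
        # No shell quoting is needed when the arguments are passed as a list.
        "-o", "jsonpath={.status.loadBalancer.ingress[0].hostname}",
    ]


if __name__ == "__main__":
    print(" ".join(kubectl_hostname_command("my-model", "my-namespace")))
```

To run the command rather than print it, pass the list to subprocess.run with capture_output=True and text=True, and read the hostname from the stdout attribute of the result.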
Create a DNS record in your DNS provider pointing your domain to this ALB hostname:
- Route 53: Create an A record with Alias enabled, pointing to the ALB.
- Other DNS providers: Create a CNAME record pointing your domain to the ALB DNS name. CNAME records cannot be used at the root domain (for example, example.com); use a subdomain like api.example.com.
Note
If the ALB is recreated (for example, after deleting and redeploying the endpoint), the ALB hostname changes. You must update your DNS record manually. Route 53 DNS management handles this automatically.
Verify the status of your deployment
Check the custom certificate status
kubectl describe InferenceEndpointConfig my-model -n my-namespace
Look for the tlsCertificate section in the status:
Status:
  Tls Certificate:
    Certificate ARN: arn:aws:acm:us-west-2:123456789012:certificate/abc12345-...
    Certificate Health: Valid
    Certificate Domain Names: api.example.com
    Last Cert Expiry Time: 2027-04-23T00:00:00Z
Certificate health values:
- Valid: Certificate is more than 60 days from expiration.
- Expiring: Certificate expires within 60 days. A Kubernetes warning event (CertificateExpiring) is emitted.
- Expired: Certificate has expired. A Kubernetes warning event (CertificateExpired) is emitted.
The operator checks certificate expiration every 24 hours. When it detects
that a certificate has been renewed in ACM (the NotAfter date
changes), it automatically re-uploads the public certificate to Amazon S3 and emits a
CertificateRenewed event.
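The health values map onto a simple comparison against the certificate's expiry time. The following Python is a sketch of that mapping, not the operator's code. The function certificate_health is hypothetical, and the boundary handling (exactly 60 days remaining counts as Expiring, and the expiry instant itself counts as Expired) is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EXPIRING_WINDOW = timedelta(days=60)


def certificate_health(not_after: datetime, now: Optional[datetime] = None) -> str:
    """Classify a certificate by how far its NotAfter time is from now."""
    now = now or datetime.now(timezone.utc)
    if not_after <= now:
        return "Expired"
    if not_after - now <= EXPIRING_WINDOW:
        # Assumption: exactly 60 days remaining is treated as Expiring.
        return "Expiring"
    return "Valid"


if __name__ == "__main__":
    expiry = datetime(2027, 4, 23, tzinfo=timezone.utc)
    print(certificate_health(expiry))
```

Both arguments should be timezone-aware, because Python refuses to compare aware and naive datetime values.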
Check the DNS status
Look for the dnsStatus section:
Status:
  Dns Status:
    Managed By Operator: true
    Record Name: api.example.com
    Hosted Zone Id: Z1234567890ABC
    Dns Health: Active
    Message: DNS record resolves successfully
DNS health values:
- Active: DNS record resolves successfully. Your custom domain is ready to use.
- Pending: DNS records have been created in Route 53 but haven't propagated yet. The operator rechecks every 30 seconds.
- Error: DNS record creation or propagation failed. Check the Message field for details.
Test the deployed endpoint
If you configured Route 53 DNS management, invoke your endpoint using your
custom domain after the DNS status shows Active. If you configured
only a custom certificate without DNS management, use the ALB hostname from the
ingress (see Use a custom certificate without Route 53 DNS management).
curl -X POST https://api.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-llm", "messages": [{"role": "user", "content": "Hello"}]}'
Note
To invoke through the SageMaker AI endpoint, set endpointName in your
InferenceEndpointConfig or
sageMakerEndpoint.name in your JumpStartModel
spec. If endpointName is not set, no SageMaker AI endpoint is created
and only direct ALB invocation is available.
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name my-model \
  --content-type "application/json" \
  --body '{"model": "my-llm", "messages": [{"role": "user", "content": "Hello"}]}' \
  --region us-west-2 \
  --cli-binary-format raw-in-base64-out \
  /dev/stdout
Manage your custom certificates and DNS records
Changing the domain or hosted zone
You can update the domainName (in
customCertificateConfig) or hostedZoneId (in
dnsConfig) on a running deployment. Changing the domain name
triggers both certificate re-validation and DNS cutover. The new domain must be
valid in your ACM certificate (as a SAN or wildcard match).
The operator performs a safe cutover:
- Creates new DNS records in the new zone or for the new domain.
- Verifies the new records resolve.
- Deletes the old DNS records only after the new records are confirmed active.
During the transition, both old and new domains resolve to the ALB. The TXT ownership record has a TTL of 300 seconds (5 minutes), so DNS clients may cache the old record for up to 5 minutes after cleanup.
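The ordering of the cutover is what makes it safe: nothing is deleted until the replacement is known to work. A minimal Python sketch of that ordering follows. The function cutover and its callable parameters are hypothetical, and leaving the old records in place when verification fails is an assumption about the failure path rather than documented behavior.

```python
def cutover(old_record, new_record, create, verify, delete):
    """Create the new record, confirm it resolves, and only then remove the old one.

    Returns True if the cutover completed. Returns False if the new record
    could not be verified, in which case the old record is left untouched.
    """
    create(new_record)
    if not verify(new_record):
        # Assumption: on failure, keep serving traffic through the old record.
        return False
    delete(old_record)
    return True


if __name__ == "__main__":
    log = []
    done = cutover(
        "api.example.com",
        "chat.example.com",
        create=lambda r: log.append(("create", r)),
        verify=lambda r: True,
        delete=lambda r: log.append(("delete", r)),
    )
    print(done, log)
```

In practice, create, verify, and delete would wrap Route 53 and DNS calls. They are passed in here so that the ordering can be shown on its own.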
Cleanup
When you delete the InferenceEndpointConfig or remove the
dnsConfig section, the operator automatically deletes the Route 53
A and TXT records it created. The operator only deletes records that it owns
(verified by the ownership TXT record).
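The ownership check can be sketched as a comparison between the TXT record values and the marker for a given endpoint. The marker format comes from the description earlier on this page. The helper names ownership_marker and is_owned_by are hypothetical, and stripping surrounding double quotes reflects how Route 53 returns TXT values rather than anything specific to the operator.

```python
MARKER_PREFIX = "hyperpod-inference/owner="


def ownership_marker(namespace, name):
    """Return the marker text expected in the TXT record for an endpoint."""
    return f"{MARKER_PREFIX}{namespace}/{name}"


def is_owned_by(txt_values, namespace, name):
    """Return True if any TXT value carries this endpoint's ownership marker."""
    expected = ownership_marker(namespace, name)
    # Route 53 returns TXT values wrapped in double quotes, so strip them first.
    return any(value.strip().strip('"') == expected for value in txt_values)


if __name__ == "__main__":
    values = ['"hyperpod-inference/owner=my-namespace/my-model"']
    print(is_owned_by(values, "my-namespace", "my-model"))
```

A check like this is useful when you audit a hosted zone by hand and want to see which endpoint an A record belongs to before touching it.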
Troubleshooting
Use these debugging steps if your custom certificate or DNS configuration isn't working as expected.
- Deployment fails with Amazon S3 access error. Verify the Amazon S3 bucket specified in tlsCertificateOutputS3Uri exists and is in the same Region. Verify the operator's execution role has s3:PutObject and s3:PutObjectTagging permissions on the bucket. The operator validates Amazon S3 write access by uploading a zero-byte test object during initial deployment.
- Certificate validation fails. Verify the ACM certificate is in the ISSUED state: aws acm describe-certificate --certificate-arn <arn> --region <region>. Verify the domainName matches a domain or SAN in the certificate. For wildcard certificates (*.example.com), use a specific subdomain like api.example.com.
- DNS record creation fails. Verify the hosted zone ID is correct and the operator's execution role has Route 53 permissions. Verify the domain belongs to the hosted zone (for example, api.example.com requires a hosted zone for example.com). If you see an NS delegation conflict, use the hosted zone ID of the delegated zone instead. If you see a record conflict, another endpoint or external process owns the A record at that domain.
- DNS record shows Pending for extended time. Verify the hosted zone's NS records are properly delegated from the parent domain registrar. The operator uses the pod's DNS resolver (typically CoreDNS), which may cache results. After 10 minutes without resolution, the status transitions to Error.
- Certificate expiration warnings. Renew or replace the certificate in ACM. For ACM-issued certificates, ACM handles renewal automatically. For imported certificates, import a new certificate. The operator detects renewal automatically and re-uploads the public certificate to Amazon S3.
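When DNS record creation fails, a quick local check of whether the domain belongs to the hosted zone can rule out one common cause. The following Python sketch compares names by suffix. The function domain_in_zone is hypothetical, and the check does not account for NS delegation to a more specific zone, which still has to be verified separately.

```python
def domain_in_zone(domain, zone_name):
    """Return True if the domain is the zone apex or a name beneath it."""
    # Hosted zone names are returned with a trailing dot, so normalize both sides.
    d = domain.lower().rstrip(".")
    z = zone_name.lower().rstrip(".")
    return d == z or d.endswith("." + z)


if __name__ == "__main__":
    print(domain_in_zone("api.example.com", "example.com."))
```

The leading dot in the suffix comparison matters: without it, a name such as api.notexample.com would be treated as part of example.com.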