Troubleshoot Amazon SageMaker AI model deployments - Amazon SageMaker AI

Troubleshoot Amazon SageMaker AI model deployments

If you encounter an issue when deploying machine learning models in Amazon SageMaker AI, see the following guidance.

Detection Errors in the Active CPU Count

If you deploy a SageMaker AI model with a Linux Java Virtual Machine (JVM), you might encounter detection errors that prevent using available CPU resources. This issue affects some JVMs that support Java 8 and Java 9, and most that support Java 10 and Java 11. These JVMs implement a mechanism that detects and handles the CPU count and the maximum memory available when running a model in a Docker container, and, more generally, within Linux taskset commands or control groups (cgroups). SageMaker AI deployments take advantage of some of the settings that the JVM uses for managing these resources. Currently, this causes the container to incorrectly detect the number of available CPUs.

SageMaker AI doesn't limit access to CPUs on an instance. However, the JVM might detect the CPU count as 1 when more CPUs are available for the container. As a result, the JVM adjusts all of its internal settings to run as if only 1 CPU core is available. These settings affect garbage collection, locks, compiler threads, and other JVM internals that negatively affect the concurrency, throughput, and latency of the container.

For an example of the misdetection, in a container configured for SageMaker AI that is deployed with a JVM that is based on Java8_191 and that has four available CPUs on the instance, run the following command to start your JVM:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintActiveCpus -version

This generates the following output:

active_processor_count: sched_getaffinity processor count: 4 active_processor_count: determined by OSContainer: 1 active_processor_count: sched_getaffinity processor count: 4 active_processor_count: determined by OSContainer: 1 active_processor_count: sched_getaffinity processor count: 4 active_processor_count: determined by OSContainer: 1 active_processor_count: sched_getaffinity processor count: 4 active_processor_count: determined by OSContainer: 1 openjdk version "1.8.0_191" OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12) OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

Many of the JVMs affected by this issue have an option to disable this behavior and reestablish full access to all of the CPUs on the instance. Disable the unwanted behavior and establish full access to all instance CPUs by including the -XX:-UseContainerSupport parameter when starting Java applications. For example, run the java command to start your JVM as follows:

java -XX:-UseContainerSupport -XX:+UnlockDiagnosticVMOptions -XX:+PrintActiveCpus -version

This generates the following output:

active_processor_count: sched_getaffinity processor count: 4 active_processor_count: sched_getaffinity processor count: 4 active_processor_count: sched_getaffinity processor count: 4 active_processor_count: sched_getaffinity processor count: 4 openjdk version "1.8.0_191" OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12) OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

Check whether the JVM used in your container supports the -XX:-UseContainerSupport parameter. If it does, always pass the parameter when you start your JVM. This provides access to all of the CPUs in your instances.

You might also encounter this issue when indirectly using a JVM in SageMaker AI containers. For example, when using a JVM to support SparkML Scala. The -XX:-UseContainerSupport parameter also affects the output returned by the Java Runtime.getRuntime().availableProcessors() API .

Issues with deploying a model.tar.gz file

When you deploy a model using a model.tar.gz file, the model tarball should not include any symlinks. Symlinks cause the model creation to fail. Also, we recommend that you do not include any unnecessary files in the tarball.

Primary container did not pass ping health checks

If your primary container fails ping health checks with the following error message, it indicates that there is an issue with your container or script:

The primary container for production variant beta did not pass the ping health check. Please check CloudWatch Logs logs for this endpoint.

To troubleshoot this issue, you should check the CloudWatch Logs logs for the endpoint in question to see if there are any errors or issues that are preventing the container from responding to /ping or /invocations. The logs may provide an error message that could point to the issue. Once you have identified the error and failure reason you should resolve the error.

It is also good practice to test the model deployment locally before creating an endpoint.

  • Use local mode in the SageMaker SDK to imitate the hosted environment by deploying the model to a local endpoint. For more information, see Local Mode.

  • Use vanilla docker commands to test the container responds to /ping and /invocations. For more information, see local_test.