Managed Spot Training Lifecycle - Amazon SageMaker AI

Managed Spot Training Lifecycle

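You can monitor a training job using the TrainingJobStatus and SecondaryStatus values returned by DescribeTrainingJob. The list below shows how the two values change in each training scenario. Each numbered step gives a TrainingJobStatus followed, after the colon, by the sequence of SecondaryStatus values reported while the job is in that status: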

  • Spot instances acquired with no interruption during training

    1. InProgress: Starting → Downloading → Training → Uploading

  • Spot instances interrupted once. Later, enough Spot instances were acquired to finish the training job.

    1. InProgress: Starting → Downloading → Training → Interrupted → Starting → Downloading → Training → Uploading

  • Spot instances interrupted twice and MaxWaitTimeInSeconds exceeded.

    1. InProgress: Starting → Downloading → Training → Interrupted → Starting → Downloading → Training → Interrupted → Downloading → Training

    2. Stopping: Stopping

    3. Stopped: MaxWaitTimeExceeded

  • Spot instances were never launched.

    1. InProgress: Starting

    2. Stopping: Stopping

    3. Stopped: MaxWaitTimeExceeded
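
The scenarios above can be observed by polling DescribeTrainingJob until the job reaches a terminal status. The following is a minimal sketch rather than an official implementation. It assumes `client` is a boto3 SageMaker client (or any object with a compatible `describe_training_job` method) and that the response includes the `SecondaryStatusTransitions` history. The helper names `summarize` and `wait_for_job` and the job name in the usage comment are hypothetical.

```python
import time

# TrainingJobStatus values after which the job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def summarize(description):
    """Extract the status fields from a DescribeTrainingJob response.

    Returns (TrainingJobStatus, SecondaryStatus, history, interruptions),
    where history is the ordered list of SecondaryStatus values seen so far.
    """
    transitions = description.get("SecondaryStatusTransitions", [])
    history = [t["Status"] for t in transitions]
    return (
        description["TrainingJobStatus"],
        description["SecondaryStatus"],
        history,
        history.count("Interrupted"),
    )


def wait_for_job(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll until the job is Completed, Failed, or Stopped.

    Prints a line each time the (TrainingJobStatus, SecondaryStatus) pair
    changes and returns the final (status, secondary_status, history).
    """
    last_seen = None
    while True:
        description = client.describe_training_job(TrainingJobName=job_name)
        status, secondary, history, interruptions = summarize(description)
        if (status, secondary) != last_seen:
            print(f"{status}: {secondary} (interruptions so far: {interruptions})")
            last_seen = (status, secondary)
        if status in TERMINAL_STATUSES:
            return status, secondary, history
        sleep(poll_seconds)


# Usage (assumes boto3 is installed and AWS credentials are configured;
# "my-spot-training-job" is a placeholder job name):
#
#   import boto3
#   wait_for_job(boto3.client("sagemaker"), "my-spot-training-job")
```

A final result of Stopped with a SecondaryStatus of MaxWaitTimeExceeded corresponds to the last two scenarios in the list. Polling can miss short-lived SecondaryStatus values between calls, which is why the sketch counts interruptions from the transition history instead of from the values it happened to observe.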