Stop Training Jobs Early

A hyperparameter tuning job can stop the training jobs it launches early when they are not improving significantly, as measured by the objective metric. Stopping training jobs early can reduce compute time and helps you avoid overfitting your model. To configure a hyperparameter tuning job to stop training jobs early, do one of the following:

  • If you are using the AWS SDK for Python (Boto3), set the TrainingJobEarlyStoppingType field of the HyperParameterTuningJobConfig object that you use to configure the tuning job to AUTO.

  • If you are using the Amazon SageMaker Python SDK, set the early_stopping_type parameter of the HyperParameterTuner object to Auto, as shown in the sketch after this list.

  • In the Amazon SageMaker AI console, in the Create hyperparameter tuning job workflow, under Early stopping, choose Auto.
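For example, the following is a minimal sketch that uses the SageMaker Python SDK. The estimator is assumed to be configured elsewhere, and the objective metric name, regex, and hyperparameter range are placeholders:

from sagemaker.tuner import ContinuousParameter, HyperParameterTuner

# 'estimator' is an already-configured SageMaker Estimator (not shown);
# the metric name, regex, and hyperparameter range are placeholders.
tuner = HyperParameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(0.001, 0.1)},
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "validation:accuracy=([0-9\\.]+)"}
    ],
    max_jobs=20,
    max_parallel_jobs=2,
    early_stopping_type="Auto",  # stop underperforming training jobs early
)
tuner.fit({"training": "s3://amzn-s3-demo-bucket/train"})  # channel name and URI are placeholders

The Boto3 equivalent is to include "TrainingJobEarlyStoppingType": "AUTO" in the HyperParameterTuningJobConfig dictionary that you pass to create_hyper_parameter_tuning_job.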

For a sample notebook that demonstrates how to use early stopping, see https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/image_classification_early_stopping/hpo_image_classification_early_stopping.ipynb, or open the hpo_image_classification_early_stopping.ipynb notebook in the Hyperparameter Tuning section of the SageMaker AI Examples in a notebook instance. For information about using sample notebooks in a notebook instance, see Access example notebooks.

How Early Stopping Works

When you enable early stopping for a hyperparameter tuning job, SageMaker AI evaluates each training job the hyperparameter tuning job launches as follows:

  • After each epoch of training, get the value of the objective metric.

  • Compute the running average of the objective metric for all previous training jobs up to the same epoch, and then compute the median of all of the running averages.

  • If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the median of the running averages of the objective metric for previous training jobs up to the same epoch, SageMaker AI stops the current training job. A minimal sketch of this rule follows the list.
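The following is an illustrative sketch of this median-based stopping rule. The function and variable names are hypothetical, and SageMaker AI's internal implementation is not published:

from statistics import median

def should_stop(current_value, previous_traces, epoch, minimize=True):
    """Illustrative median rule: stop if the current job's objective-metric
    value at `epoch` is worse than the median of the running averages of
    all previous jobs' metric traces up to the same epoch."""
    running_averages = [
        sum(trace[: epoch + 1]) / (epoch + 1)
        for trace in previous_traces
        if len(trace) > epoch  # only jobs that reached this epoch
    ]
    if not running_averages:
        return False  # nothing to compare against yet
    med = median(running_averages)
    # "Worse" means higher when minimizing, lower when maximizing.
    return current_value > med if minimize else current_value < med

# Example: maximizing accuracy; previous jobs average 0.75 by epoch 1 and
# the current job is at 0.52, so it would be stopped.
should_stop(0.52, [[0.7, 0.8], [0.6, 0.9]], epoch=1, minimize=False)  # -> True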

Algorithms That Support Early Stopping

To support early stopping, an algorithm must emit objective metrics for each epoch. The following built-in SageMaker AI algorithms support early stopping:

Note

This list of built-in algorithms that support early stopping is current as of December 13, 2018. Other built-in algorithms might support early stopping in the future. If an algorithm emits a metric that can be used as an objective metric for a hyperparameter tuning job (preferably a validation metric), then it supports early stopping.

To use early stopping with your own algorithm, you must write your algorithm so that it emits the value of the objective metric after each epoch. The following list shows how you can do that in different frameworks:

TensorFlow

Use the tf.keras.callbacks.ProgbarLogger class. For information, see the tf.keras.callbacks.ProgbarLogger API.
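For example, in a minimal Keras script (the data below is random and purely for illustration), calling fit with verbose=2 causes the built-in ProgbarLogger to print one line per epoch containing the metric values, which a tuning job's metric regex can then capture from the training log:

import numpy as np
import tensorflow as tf

# Random data purely for illustration.
train_x, train_y = np.random.rand(256, 8), np.random.randint(0, 10, 256)
val_x, val_y = np.random.rand(64, 8), np.random.randint(0, 10, 64)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# verbose=2 prints one log line per epoch (via ProgbarLogger) that includes
# values such as val_accuracy, which a metric regex can parse.
model.fit(train_x, train_y, validation_data=(val_x, val_y),
          epochs=5, verbose=2)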

MXNet

Use the mxnet.callback.LogValidationMetricsCallback class. For information, see the mxnet.callback APIs.
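The following sketch uses MXNet's legacy symbolic Module API with random data, purely for illustration. The callback writes the validation metric to the log at the end of each epoch, where a tuning job's metric regex can match it:

import logging
import mxnet as mx
import numpy as np

logging.basicConfig(level=logging.INFO)  # the callback logs via `logging`

# Random data purely for illustration.
X = np.random.rand(100, 8).astype("float32")
y = np.random.randint(0, 2, 100).astype("float32")
train_iter = mx.io.NDArrayIter(X, y, batch_size=10)
val_iter = mx.io.NDArrayIter(X, y, batch_size=10)

# A tiny two-class network.
data = mx.sym.Variable("data")
net = mx.sym.FullyConnected(data, num_hidden=2)
net = mx.sym.SoftmaxOutput(net, name="softmax")

mod = mx.mod.Module(symbol=net, context=mx.cpu())
mod.fit(train_iter,
        eval_data=val_iter,
        eval_metric="accuracy",
        num_epoch=5,
        # Logs the validation metric at the end of every epoch.
        eval_end_callback=mx.callback.LogValidationMetricsCallback())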

Chainer

Extend Chainer by using the extensions.Evaluator class. For information, see the chainer.training.extensions.Evaluator API.
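A minimal sketch follows; the model (for example, a chain wrapped in L.Classifier), optimizer, and iterators are assumed to be set up elsewhere in the training script:

from chainer import training
from chainer.training import extensions

# 'model', 'optimizer', 'train_iter', and 'test_iter' are assumed to be
# defined elsewhere in the training script.
updater = training.StandardUpdater(train_iter, optimizer)
trainer = training.Trainer(updater, (10, "epoch"))

# Evaluator runs the model on the validation iterator once per epoch and
# reports values such as 'validation/main/accuracy'; LogReport and
# PrintReport write them to stdout for a metric regex to capture.
trainer.extend(extensions.Evaluator(test_iter, model))
trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport(
    ["epoch", "main/loss", "validation/main/accuracy"]))
trainer.run()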

PyTorch and Spark

There is no high-level support. You must explicitly write your training code so that it computes objective metrics and writes them to logs after each epoch.
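For example, a PyTorch training loop might print the objective metric in a fixed, regex-friendly format at the end of each epoch. The helper functions below are hypothetical; the important part is the one log line per epoch:

# 'train_one_epoch' and 'evaluate' are hypothetical helpers defined
# elsewhere in the training script.
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)
    accuracy = evaluate(model, val_loader)
    # A metric definition such as "validation:accuracy=([0-9\\.]+)" can
    # parse this line from the training log.
    print(f"validation:accuracy={accuracy:.4f}")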