Stop Training Jobs Early
Stop training jobs that a hyperparameter tuning job has launched when they are not improving significantly, as measured by the objective metric. Stopping training jobs early can reduce compute time and help you avoid overfitting your model. To configure a hyperparameter tuning job to stop training jobs early, do one of the following:
- If you are using the AWS SDK for Python (Boto3), set the TrainingJobEarlyStoppingType field of the HyperParameterTuningJobConfig object that you use to configure the tuning job to AUTO.
- If you are using the Amazon SageMaker Python SDK, set the early_stopping_type parameter of the HyperParameterTuner object to Auto.
- In the Amazon SageMaker AI console, in the Create hyperparameter tuning job workflow, under Early stopping, choose Auto.
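As a rough illustration of the Boto3 option, the following sketch builds only the configuration dictionary and does not call AWS. The helper name build_tuning_job_config, the metric name, and the resource limits are hypothetical placeholders. The sketch spells the value as Auto, which is how the CreateHyperParameterTuningJob API reference appears to list it; confirm the exact casing against your SDK version.

```python
def build_tuning_job_config(early_stopping="Auto"):
    """Return a dict shaped like HyperParameterTuningJobConfig.

    Hypothetical helper for illustration. The metric name and limits
    are placeholder values, not recommendations.
    """
    return {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Minimize",
            "MetricName": "validation:error",  # placeholder metric name
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": 20,  # placeholder limit
            "MaxParallelTrainingJobs": 2,   # placeholder limit
        },
        # The field that turns on early stopping for the tuning job.
        # Casing of the value is assumed; verify it for your SDK version.
        "TrainingJobEarlyStoppingType": early_stopping,
    }


config = build_tuning_job_config()
print(config["TrainingJobEarlyStoppingType"])
```

In a real job, a dictionary like this would be passed as the HyperParameterTuningJobConfig argument of the Boto3 SageMaker client's create_hyper_parameter_tuning_job call, together with the training job definition.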
For a sample notebook that demonstrates how to use early stopping, see https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/image_classification_early_stopping/hpo_image_classification_early_stopping.ipynb, or open the hpo_image_classification_early_stopping.ipynb notebook in the Hyperparameter Tuning section of the SageMaker AI Examples in a notebook instance. For information about using sample notebooks in a notebook instance, see Access example notebooks.
How Early Stopping Works
When you enable early stopping for a hyperparameter tuning job, SageMaker AI evaluates each training job the hyperparameter tuning job launches as follows:
- After each epoch of training, get the value of the objective metric.
- Compute the running average of the objective metric for all previous training jobs up to the same epoch, and then compute the median of all of the running averages.
- If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the median value of running averages of the objective metric for previous training jobs up to the same epoch, SageMaker AI stops the current training job.
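The steps above can be sketched in a few lines of Python. This is an illustration of the described rule only, not SageMaker AI's internal implementation. The function names are hypothetical, and skipping previous jobs that never reached the current epoch is an assumption the description does not spell out.

```python
from statistics import median


def running_average(history, epoch):
    """Mean of a job's objective metric from epoch 1 through `epoch`."""
    window = history[:epoch]
    return sum(window) / len(window)


def should_stop(current_value, epoch, previous_jobs, minimize=True):
    """Apply the median rule described above at a given epoch.

    previous_jobs is a list of per-epoch metric histories, one list per
    earlier training job. Jobs with fewer than `epoch` values are skipped
    (an assumption made for this sketch).
    """
    averages = [
        running_average(history, epoch)
        for history in previous_jobs
        if len(history) >= epoch
    ]
    if not averages:
        return False  # nothing to compare against yet
    threshold = median(averages)
    # "Worse" means higher when minimizing, lower when maximizing.
    if minimize:
        return current_value > threshold
    return current_value < threshold


# Three earlier jobs, each with three epochs of a loss-like metric.
previous = [[0.9, 0.7, 0.5], [0.8, 0.6, 0.4], [1.0, 0.9, 0.8]]
print(should_stop(0.85, epoch=2, previous_jobs=previous))
print(should_stop(0.75, epoch=2, previous_jobs=previous))
```

At epoch 2, the running averages of the earlier jobs are approximately 0.8, 0.7, and 0.95, so the median is about 0.8. A current value of 0.85 is worse than that when minimizing, and 0.75 is not.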
Algorithms That Support Early Stopping
To support early stopping, an algorithm must emit objective metrics for each epoch. The following built-in SageMaker AI algorithms support early stopping:
- Linear Learner Algorithm: Supported only if you use objective_loss as the objective metric.
Note
This list of built-in algorithms that support early stopping is current as of December 13, 2018. Other built-in algorithms might support early stopping in the future. If an algorithm emits a metric that can be used as an objective metric for a hyperparameter tuning job (preferably a validation metric), then it supports early stopping.
To use early stopping with your own algorithm, you must write your algorithm so that it emits the value of the objective metric after each epoch. The following list shows how you can do that in different frameworks:
- TensorFlow: Use the tf.keras.callbacks.ProgbarLogger class. For information, see the tf.keras.callbacks.ProgbarLogger API.
- MXNet: Use the mxnet.callback.LogValidationMetricsCallback. For information, see the mxnet.callback APIs.
- Chainer: Extend Chainer by using the extensions.Evaluator class. For information, see the chainer.training.extensions.Evaluator API.
- PyTorch and Spark: There is no high-level support. You must explicitly write your training code so that it computes objective metrics and writes them to logs after each epoch.
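For frameworks without high-level support, the training loop itself has to write the metric to the logs. The following framework-agnostic sketch shows one way to do that. The metric name validation_loss, the log line format, the matching regular expression, and the stand-in loss computation are all assumptions made for illustration; use whatever format your tuning job's metric definition is configured to match.

```python
import re

METRIC_NAME = "validation_loss"               # hypothetical metric name
METRIC_REGEX = r"validation_loss=([0-9.]+)"   # hypothetical pattern for the log line


def format_metric_line(epoch, value):
    """Build one log line that carries the objective metric for an epoch."""
    return f"epoch={epoch} {METRIC_NAME}={value:.4f}"


def train(num_epochs):
    """Toy training loop that emits the objective metric after each epoch."""
    lines = []
    for epoch in range(1, num_epochs + 1):
        # Placeholder for a real training and validation pass.
        val_loss = 1.0 / epoch
        line = format_metric_line(epoch, val_loss)
        # Write to stdout right away so the value lands in the job logs.
        print(line, flush=True)
        lines.append(line)
    return lines


log_lines = train(3)
# Check that the assumed pattern recovers the value from a log line.
match = re.search(METRIC_REGEX, log_lines[1])
print(match.group(1))
```

Emitting one line per epoch, rather than only a final value, is what makes a per-epoch comparison against other training jobs possible.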