

# Checkpoints in Amazon SageMaker AI
<a name="model-checkpoints"></a>

Use checkpoints in Amazon SageMaker AI to save the state of machine learning (ML) models during training. Checkpoints are snapshots of the model and can be configured by the callback functions of ML frameworks. You can use the saved checkpoints to restart a training job from the last saved checkpoint. 

Using checkpoints, you can do the following:
+ Save your model snapshots during training in case of an unexpected interruption to the training job or instance.
+ Resume training the model in the future from a checkpoint.
+ Analyze the model at intermediate stages of training.
+ Use checkpoints with S3 Express One Zone for increased access speeds.
+ Use checkpoints with SageMaker AI managed spot training to save on training costs.

SageMaker AI training runs in containers on Amazon EC2 instances, and checkpoint files are saved under a local directory of the container (the default is `/opt/ml/checkpoints`). SageMaker AI automatically syncs the checkpoints in that directory with Amazon S3. Existing checkpoints in S3 are written to the container at the start of the job, enabling jobs to resume from a checkpoint, and new checkpoints that the container writes during training are copied to S3. Checkpoints added to the S3 folder after the job has started are not copied to the training container. If a checkpoint is deleted in the container, it is also deleted from the S3 folder.
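
For example, here is a minimal, framework-agnostic sketch of what writing a checkpoint from a training script looks like; the file name and contents are illustrative:

```
import json
import os

# Default local checkpoint directory that SageMaker AI syncs with S3.
checkpoint_dir = "/opt/ml/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Any file written here is copied to the configured S3 location during
# training; deleting it here also deletes it from the S3 folder.
with open(os.path.join(checkpoint_dir, "checkpoint-epoch-1.json"), "w") as f:
    json.dump({"epoch": 1, "loss": 0.42}, f)
```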

You can use checkpoints in Amazon SageMaker AI with the Amazon S3 Express One Zone storage class (S3 Express One Zone) for faster access to checkpoints. When you enable checkpointing and specify the S3 URI for your checkpoint storage destination, you can provide an S3 URI for a folder in either an S3 general purpose bucket or an S3 directory bucket. S3 directory buckets that are integrated with SageMaker AI can only be encrypted with server-side encryption with Amazon S3 managed keys (SSE-S3). Server-side encryption with AWS KMS keys (SSE-KMS) is not currently supported. For more information on S3 Express One Zone and S3 directory buckets, see [What is S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-one-zone.html).
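
For example, the two kinds of checkpoint destinations look like the following; both bucket names are hypothetical:

```
# S3 general purpose bucket destination.
checkpoint_s3_uri = "s3://amzn-s3-demo-bucket/checkpoints"

# S3 directory bucket (S3 Express One Zone) destination. Directory bucket
# names embed an Availability Zone ID and end with "--x-s3".
checkpoint_s3_uri = "s3://amzn-s3-demo-bucket--usw2-az1--x-s3/checkpoints"
```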

If you are using checkpoints with SageMaker AI managed spot training, SageMaker AI manages checkpointing your model training on a spot instance and resuming the training job on the next spot instance. With SageMaker AI managed spot training, you can significantly reduce the billable time for training ML models. For more information, see [Managed Spot Training in Amazon SageMaker AI](model-managed-spot-training.md).
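
A minimal sketch of combining checkpointing with managed spot training using the SageMaker Python SDK `Estimator` (constructing estimators is covered later on this page); the time limits and bucket name are illustrative:

```
from sagemaker.estimator import Estimator

# use_spot_instances requests Spot capacity; max_wait is the total time to
# wait for the job, including Spot interruptions, and must be >= max_run.
estimator = Estimator(
    ...
    use_spot_instances=True,
    max_run=3600,      # maximum training time, in seconds
    max_wait=7200,     # maximum total wait time, in seconds
    checkpoint_s3_uri="s3://amzn-s3-demo-bucket/checkpoints",  # hypothetical bucket
    checkpoint_local_path="/opt/ml/checkpoints",
)
```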

**Topics**
+ [Checkpoints for frameworks and algorithms in SageMaker AI](#model-checkpoints-whats-supported)
+ [Considerations for checkpointing](#model-checkpoints-considerations)
+ [Enable checkpointing](model-checkpoints-enable.md)
+ [Browse checkpoint files](model-checkpoints-saved-file.md)
+ [Resume training from a checkpoint](model-checkpoints-resume.md)
+ [Cluster repairs for GPU errors](model-checkpoints-cluster-repair.md)

## Checkpoints for frameworks and algorithms in SageMaker AI
<a name="model-checkpoints-whats-supported"></a>

Use checkpoints to save snapshots of ML models built on your preferred frameworks within SageMaker AI.

**SageMaker AI frameworks and algorithms that support checkpointing**

SageMaker AI supports checkpointing for AWS Deep Learning Containers and a subset of built-in algorithms without requiring training script changes. SageMaker AI saves the checkpoints to the default local path `'/opt/ml/checkpoints'` and copies them to Amazon S3. 
+ Deep Learning Containers: [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html), [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html), [MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/sagemaker.mxnet.html), and [HuggingFace](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html)
**Note**  
If you are using the HuggingFace framework estimator, you need to specify a checkpoint output path through hyperparameters. For more information, see [Run training on Amazon SageMaker AI](https://huggingface.co/docs/sagemaker/train) in the *HuggingFace documentation*. A minimal configuration sketch follows this list.
+ Built-in algorithms: [Image Classification](https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html), [Object Detection](https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection.html), [Semantic Segmentation](https://docs.aws.amazon.com/sagemaker/latest/dg/semantic-segmentation.html), and [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) (0.90-1 or later)
**Note**  
If you are using the XGBoost algorithm in framework mode (script mode), you must bring your own XGBoost training script with checkpointing manually configured. For more information about the XGBoost training methods that save model snapshots, see [Training XGBoost](https://xgboost.readthedocs.io/en/latest/python/python_intro.html#training) in the *XGBoost Python documentation*.
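
For the HuggingFace case called out in the first note above, here is a minimal sketch of passing the checkpoint output path through hyperparameters. The script name, role, bucket, and container versions are illustrative:

```
from sagemaker.huggingface import HuggingFace

# The training script (assumed to honor "output_dir") writes checkpoints to
# the local path that SageMaker AI syncs with S3.
hyperparameters = {
    "output_dir": "/opt/ml/checkpoints",
    "epochs": 3,
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",                 # hypothetical training script
    role="<your-iam-role-arn>",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters=hyperparameters,
    checkpoint_s3_uri="s3://amzn-s3-demo-bucket/checkpoints",  # hypothetical bucket
    checkpoint_local_path="/opt/ml/checkpoints",
)
```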

If a built-in algorithm that does not support checkpointing is used in a managed spot training job, SageMaker AI does not allow a maximum wait time greater than an hour for the job, in order to limit wasted training time from interruptions.

**For custom training containers and other frameworks**

If you are using your own training containers, training scripts, or other frameworks not listed in the previous section, you must set up your training script to use callbacks or training APIs that save checkpoints to the local path (`'/opt/ml/checkpoints'`) and load them from that path when training resumes. The SageMaker AI estimator then syncs this local path with Amazon S3.
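
A minimal sketch of the save-and-resume pattern for a custom training script; the file naming is illustrative, and the actual save and load calls depend on your framework:

```
import glob
import os

checkpoint_dir = "/opt/ml/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

def latest_checkpoint():
    # At job start, SageMaker AI restores any existing checkpoints from S3
    # into this directory, so check it before training from scratch.
    candidates = sorted(glob.glob(os.path.join(checkpoint_dir, "checkpoint-*")))
    return candidates[-1] if candidates else None

resume_from = latest_checkpoint()
if resume_from:
    print(f"Resuming training from {resume_from}")
    # Load model and optimizer state here with your framework's load function.
else:
    print("No checkpoint found; starting training from scratch.")
```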

## Considerations for checkpointing
<a name="model-checkpoints-considerations"></a>

Consider the following when using checkpoints in SageMaker AI.
+ To avoid overwrites in distributed training with multiple instances, you must manually configure the checkpoint file names and paths in your training script, as shown in the sketch that follows this list. The high-level SageMaker AI checkpoint configuration specifies a single Amazon S3 location without additional suffixes or prefixes to tag checkpoints from multiple instances.
+ The SageMaker Python SDK does not support high-level configuration for checkpointing frequency. To control the checkpointing frequency, modify your training script using the framework's model save functions or checkpoint callbacks.
+ If you use SageMaker AI checkpoints with SageMaker Debugger and SageMaker AI distributed and are facing issues, see the following pages for troubleshooting and considerations.
  + [Distributed training supported by Amazon SageMaker Debugger](debugger-reference.md#debugger-considerations)
  + [Troubleshooting for distributed training in Amazon SageMaker AI](distributed-troubleshooting-data-parallel.md)
  + [Model Parallel Troubleshooting](distributed-troubleshooting-model-parallel.md)
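
For the first consideration above, here is a minimal sketch of disambiguating checkpoint file names per instance. It uses the `SM_CURRENT_HOST` environment variable that SageMaker AI sets inside training containers; the file layout is illustrative.

```
import os

# SageMaker AI sets SM_CURRENT_HOST inside training containers,
# for example "algo-1", "algo-2", and so on.
host = os.environ.get("SM_CURRENT_HOST", "algo-1")

checkpoint_dir = "/opt/ml/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Give each instance its own file name so that instances don't
# overwrite one another's checkpoints in the shared S3 location.
checkpoint_path = os.path.join(checkpoint_dir, f"checkpoint-{host}.ckpt")
```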

# Enable checkpointing
<a name="model-checkpoints-enable"></a>

After you enable checkpointing, SageMaker AI saves checkpoints to Amazon S3 and syncs your training job with the checkpoint S3 bucket. You can use either S3 general purpose or S3 directory buckets for your checkpoint S3 bucket. 

![\[Architecture diagram of writing checkpoints during training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/checkpoints_write.png)


The following example shows how to configure checkpoint paths when you construct a SageMaker AI estimator. To enable checkpointing, add the `checkpoint_s3_uri` and `checkpoint_local_path` parameters to your estimator. 

The following example template shows how to create a generic SageMaker AI estimator and enable checkpointing. You can use this template for the supported algorithms by specifying the `image_uri` parameter. To find Docker image URIs for the algorithms that support checkpointing in SageMaker AI, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths). You can also replace `estimator` and `Estimator` with other SageMaker AI frameworks' estimator classes, such as [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#create-an-estimator), [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#create-an-estimator), [MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#create-an-estimator), [HuggingFace](https://huggingface.co/docs/sagemaker/train#create-a-hugging-face-estimator), and [XGBoost](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html#create-an-estimator).

```
import sagemaker
from sagemaker.estimator import Estimator

bucket=sagemaker.Session().default_bucket()
base_job_name="sagemaker-checkpoint-test"
checkpoint_in_bucket="checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

# The local path where the model will save its checkpoints in the training container
checkpoint_local_path="/opt/ml/checkpoints"

estimator = Estimator(
    ...
    image_uri="<ecr_path>/<algorithm-name>:<tag>",  # Specify to use a built-in algorithm
    output_path=bucket,
    base_job_name=base_job_name,
    
    # Parameters required to enable checkpointing
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path
)
```

The following two parameters specify paths for checkpointing:
+ `checkpoint_local_path` – Specify the local path where the model saves the checkpoints periodically in a training container. The default path is set to `'/opt/ml/checkpoints'`. If you are using other frameworks or bringing your own training container, ensure that your training script's checkpoint configuration specifies the path to `'/opt/ml/checkpoints'`.
**Note**  
We recommend specifying the local paths as `'/opt/ml/checkpoints'` to be consistent with the default SageMaker AI checkpoint settings. If you prefer to specify your own local path, make sure you match the checkpoint saving path in your training script and the `checkpoint_local_path` parameter of the SageMaker AI estimators.
+ `checkpoint_s3_uri` – The URI to an S3 bucket where the checkpoints are stored in real time. You can specify either an S3 general purpose or S3 directory bucket to store your checkpoints. For more information on S3 directory buckets, see [Directory buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html) in the *Amazon Simple Storage Service User Guide*. 

To find a complete list of SageMaker AI estimator parameters, see the [Estimator API](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) in the *[Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation*.

# Browse checkpoint files
<a name="model-checkpoints-saved-file"></a>

Locate checkpoint files using the SageMaker Python SDK and the Amazon S3 console.

**To find the checkpoint files programmatically**

To retrieve the S3 bucket URI where the checkpoints are saved, check the following estimator attribute:

```
estimator.checkpoint_s3_uri
```

This returns the S3 output path for checkpoints that was configured in the `CreateTrainingJob` request.
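
You can also list the checkpoint files that have already been synced to S3. A minimal sketch using the SageMaker Python SDK's `S3Downloader`, assuming `estimator` was created with checkpointing enabled:

```
from sagemaker.s3 import S3Downloader

# List the checkpoint objects under the configured checkpoint S3 URI.
for uri in S3Downloader.list(estimator.checkpoint_s3_uri):
    print(uri)
```

To find the saved checkpoint files using the Amazon S3 console, use the following procedure.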

**To find the checkpoint files from the S3 console**

1. Sign in to the AWS Management Console and open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Training jobs**.

1. Choose the link to the training job with checkpointing enabled to open **Job settings**.

1. On the **Job settings** page of the training job, locate the **Checkpoint configuration** section.  
![\[Checkpoint configuration section in the Job settings page of a training job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/checkpoints_trainingjob.png)

1. Use the link to the S3 bucket to access the checkpoint files.

# Resume training from a checkpoint
<a name="model-checkpoints-resume"></a>

To resume a training job from a checkpoint, run a new estimator with the same `checkpoint_s3_uri` that you created in the [Enable checkpointing](model-checkpoints-enable.md) section. After training resumes, the checkpoints from this S3 bucket are restored to `checkpoint_local_path` in each instance of the new training job. Ensure that the S3 bucket is in the same AWS Region as the current SageMaker AI session.

![\[Architecture diagram of syncing checkpoints to resume training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/checkpoints_resume.png)
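
A minimal sketch of resuming, reusing the `checkpoint_s3_bucket` variable from the earlier example; the elided parameters are the same as for any estimator:

```
# A new estimator pointed at the same checkpoint location. At job start,
# SageMaker AI restores the existing checkpoints from this S3 URI into
# checkpoint_local_path on each instance.
estimator_resume = Estimator(
    ...
    checkpoint_s3_uri=checkpoint_s3_bucket,       # same URI as the original job
    checkpoint_local_path="/opt/ml/checkpoints",
)

# Your training script must detect and load the restored checkpoint.
estimator_resume.fit()
```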


# Cluster repairs for GPU errors
<a name="model-checkpoints-cluster-repair"></a>

If you are running a training job that fails on a GPU, SageMaker AI will run a GPU health check to see whether the failure is related to a GPU issue. SageMaker AI takes the following actions based on the health check results:
+ If the error is recoverable, and can be fixed by rebooting the instance or resetting the GPU, SageMaker AI will reboot the instance.
+ If the error is not recoverable, and caused by a GPU that needs to be replaced, SageMaker AI will replace the instance.

The instance is either replaced or rebooted as part of a SageMaker AI cluster repair process. During this process, you will see the following message in your training job status:

`Repairing training cluster due to hardware failure`

SageMaker AI attempts to repair the cluster up to `10` times. If the cluster repair succeeds, SageMaker AI automatically restarts the training job from the previous checkpoint. If the cluster repair fails, the training job also fails. You are not billed for the cluster repair process, and cluster repairs are not initiated unless your training job fails. If a GPU issue is detected for a warm pool cluster, the cluster enters repair mode to either reboot or replace the faulty instance. After repair, the cluster can still be used as a warm pool cluster.
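
To observe repair activity on a job, you can inspect the job's secondary status transitions, where the repair message above appears. A sketch using boto3; the training job name is hypothetical:

```
import boto3

sagemaker_client = boto3.client("sagemaker")

# Describe the training job and print its secondary status history;
# cluster repairs appear here with the message shown above.
response = sagemaker_client.describe_training_job(
    TrainingJobName="my-training-job"  # hypothetical job name
)
for transition in response["SecondaryStatusTransitions"]:
    print(transition["Status"], "-", transition.get("StatusMessage", ""))
```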

The previously described cluster and instance repair process is depicted in the following diagram:

![\[The cluster and instance repair process.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-cluster-repair.png)