Configure data input mode using the SageMaker Python SDK - Amazon SageMaker AI

Configure data input mode using the SageMaker Python SDK

SageMaker Python SDK provides the generic Estimator class and its variations for ML frameworks for launching training jobs. You can specify one of the data input modes while configuring the SageMaker AI Estimator class or the Estimator.fit method. The following code templates show the two ways to specify input modes.

To specify the input mode using the Estimator class

from sagemaker.estimator import Estimator from sagemaker.inputs import TrainingInput estimator = Estimator( checkpoint_s3_uri='s3://amzn-s3-demo-bucket/checkpoint-destination/', output_path='s3://amzn-s3-demo-bucket/output-path/', base_job_name='job-name', input_mode='File' # Available options: File | Pipe | FastFile ... ) # Run the training job estimator.fit( inputs=TrainingInput(s3_data="s3://amzn-s3-demo-bucket/my-data/train") )

For more information, see the sagemaker.estimator.Estimator class in the SageMaker Python SDK documentation.

To specify the input mode through the estimator.fit() method

from sagemaker.estimator import Estimator from sagemaker.inputs import TrainingInput estimator = Estimator( checkpoint_s3_uri='s3://amzn-s3-demo-bucket/checkpoint-destination/', output_path='s3://amzn-s3-demo-bucket/output-path/', base_job_name='job-name', ... ) # Run the training job estimator.fit( inputs=TrainingInput( s3_data="s3://amzn-s3-demo-bucket/my-data/train", input_mode='File' # Available options: File | Pipe | FastFile ) )

For more information, see the sagemaker.estimator.Estimator.fit class method and the sagemaker.inputs.TrainingInput class in the SageMaker Python SDK documentation.

Tip

To learn more about how to configure Amazon FSx for Lustre or Amazon EFS with your VPC configuration using the SageMaker Python SDK estimators, see Use File Systems as Training Inputs in the SageMaker AI Python SDK documentation.

Tip

The data input mode integrations with Amazon S3, Amazon EFS, and FSx for Lustre are recommended ways to optimally configure data source for the best practices. You can strategically improve data loading performance using the SageMaker AI managed storage options and input modes, but it's not strictly constrained. You can write your own data reading logic directly in your training container. For example, you can set to read from a different data source, write your own S3 data loader class, or use third-party frameworks' data loading functions within your training script. However, you must make sure that you specify the right paths that SageMaker AI can recognize.

Tip

If you use a custom training container, make sure you install the SageMaker training toolkit that helps set up the environment for SageMaker training jobs. Otherwise, you must specify the environment variables explicitly in your Dockerfile. For more information, see Create a container with your own algorithms and models.

For more information about how to set the data input modes using the low-level SageMaker APIs, see How Amazon SageMaker AI Provides Training Information, the CreateTrainingJob API, and the TrainingInputMode in AlgorithmSpecification.