Configure data input channel to use Amazon FSx for Lustre
Learn how to use Amazon FSx for Lustre as the data source for your training jobs. It provides higher throughput and speeds up training by reducing data loading time.
Note
When you use EFA-enabled instances such as P4d and P3dn, make sure that you set appropriate inbound and outbound rules in the security group. Specifically, the required ports must be open so that SageMaker AI can access the Amazon FSx file system in the training job. To learn more, see File System Access Control with Amazon VPC.
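As a rough illustration of the security group rules mentioned in the note, the following Python sketch builds the `IpPermissions` structure accepted by the EC2 `authorize_security_group_ingress` API. The port ranges (TCP 988 and 1018-1023) are the Lustre client ports listed in the Amazon FSx for Lustre User Guide, not values stated on this page. The helper name and the security group ID are hypothetical. EFA traffic may need additional rules that this sketch does not cover.

```python
from typing import Dict, List

# Assumption: Lustre client TCP port ranges as documented in the
# Amazon FSx for Lustre User Guide (not stated on this page).
LUSTRE_TCP_PORT_RANGES = [(988, 988), (1018, 1023)]


def lustre_ingress_permissions(security_group_id: str) -> List[Dict]:
    """Build IpPermissions entries allowing Lustre traffic from the same security group."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": start,
            "ToPort": end,
            # Self-referencing rule: traffic is allowed from members of this group.
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }
        for start, end in LUSTRE_TCP_PORT_RANGES
    ]


if __name__ == "__main__":
    # Hypothetical security group ID, for illustration only.
    permissions = lustre_ingress_permissions("sg-0123456789abcdef0")
    for rule in permissions:
        print(rule["IpProtocol"], rule["FromPort"], rule["ToPort"])
    # To apply the rules (requires boto3 and AWS credentials):
    # import boto3
    # boto3.client("ec2").authorize_security_group_ingress(
    #     GroupId="sg-0123456789abcdef0", IpPermissions=permissions
    # )
```

The sketch only constructs the rule definitions. Applying them is left as a commented-out call so that the rules can be reviewed first.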
Sync Amazon S3 and Amazon FSx for Lustre
To link your Amazon S3 bucket to an Amazon FSx for Lustre file system and upload your training datasets, do the following.
- Prepare your dataset and upload it to an Amazon S3 bucket. For example, assume that the Amazon S3 paths for a train dataset and a test dataset are in the following format.
s3://amzn-s3-demo-bucket/data/train
s3://amzn-s3-demo-bucket/data/test
- To create an FSx for Lustre file system linked to the Amazon S3 bucket that contains the training data, follow the steps at Linking your file system to an Amazon S3 bucket in the Amazon FSx for Lustre User Guide. Make sure that you add an endpoint to your VPC allowing Amazon S3 access. For more information, see Create an Amazon S3 VPC Endpoint. When you specify Data repository path, provide the Amazon S3 URI of the folder that contains your datasets. For example, based on the example S3 paths in step 1, the data repository path should be the following.
s3://amzn-s3-demo-bucket/data
- After the FSx for Lustre file system is created, check the configuration information by running the following commands.
aws fsx describe-file-systems && \
aws fsx describe-data-repository-associations
These commands return FileSystemId, MountName, FileSystemPath, and DataRepositoryPath. For example, the outputs should look like the following.
# Output of aws fsx describe-file-systems
"FileSystemId": "fs-0123456789abcdef0"
"MountName": "1234abcd"

# Output of aws fsx describe-data-repository-associations
"FileSystemPath": "/ns1",
"DataRepositoryPath": "s3://amzn-s3-demo-bucket/data/"
After the sync between Amazon S3 and Amazon FSx has completed, your datasets are saved in Amazon FSx in the following directories.
/ns1/train  # synced with s3://amzn-s3-demo-bucket/data/train
/ns1/test   # synced with s3://amzn-s3-demo-bucket/data/test
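The correspondence between S3 prefixes and FSx directories shown above can be expressed as a small Python helper. This is a minimal sketch, assuming that the `DataRepositoryPath` and `FileSystemPath` values are copied from the command output. The function name `s3_uri_to_fsx_path` is hypothetical and is not part of any AWS SDK.

```python
import posixpath


def s3_uri_to_fsx_path(s3_uri: str, data_repository_path: str, file_system_path: str) -> str:
    """Map an S3 URI under the linked prefix to its directory inside the FSx file system."""
    repo = data_repository_path.rstrip("/")
    uri = s3_uri.rstrip("/")
    # Reject URIs that are outside the linked data repository path.
    if uri != repo and not uri.startswith(repo + "/"):
        raise ValueError(f"{s3_uri} is not under {data_repository_path}")
    relative = uri[len(repo):].lstrip("/")
    # The repository root maps to the file system path itself.
    return posixpath.join(file_system_path, relative) if relative else file_system_path


if __name__ == "__main__":
    # Values taken from the example outputs in this section.
    repo_path = "s3://amzn-s3-demo-bucket/data/"
    fs_path = "/ns1"
    for uri in ("s3://amzn-s3-demo-bucket/data/train", "s3://amzn-s3-demo-bucket/data/test"):
        print(uri, "->", s3_uri_to_fsx_path(uri, repo_path, fs_path))
```

With the example values, the train and test prefixes map to `/ns1/train` and `/ns1/test`, which matches the directories listed above.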
Set the Amazon FSx file system path as the data input channel for SageMaker training
The following procedures show how to set the Amazon FSx file system as the data source for SageMaker training jobs.
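As a preview of what those procedures configure, the sketch below builds one input channel in the shape used by the `InputDataConfig` parameter of the SageMaker `CreateTrainingJob` API. The helper name `fsx_lustre_channel` is hypothetical. The directory path `/1234abcd/ns1/train` assumes that the path starts with the mount name followed by the file system path, using the example values from the previous section. Verify this against your own file system.

```python
from typing import Dict


def fsx_lustre_channel(
    channel_name: str,
    file_system_id: str,
    directory_path: str,
    access_mode: str = "ro",
) -> Dict:
    """Build one InputDataConfig channel that reads from an FSx for Lustre file system."""
    if access_mode not in ("ro", "rw"):
        raise ValueError("access_mode must be 'ro' or 'rw'")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": access_mode,
                "DirectoryPath": directory_path,
            }
        },
    }


if __name__ == "__main__":
    # Example values from this page; the mount-name prefix is an assumption.
    channel = fsx_lustre_channel(
        channel_name="train",
        file_system_id="fs-0123456789abcdef0",
        directory_path="/1234abcd/ns1/train",
    )
    print(channel)
```

Read-only access (`ro`) is used as the default here because a training job typically only needs to read its input data.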