Common Data Formats for Training
To prepare for training, you can preprocess your data using a variety of AWS services, including AWS Glue, Amazon EMR, Amazon Redshift, Amazon Relational Database Service, and Amazon Athena. After preprocessing, publish the data to an Amazon S3 bucket. For training, the data must go through a series of conversions and transformations, including:
- Training data serialization (handled by you)
- Training data deserialization (handled by the algorithm)
- Training model serialization (handled by the algorithm)
- Trained model deserialization (optional, handled by you)
When you use Amazon SageMaker AI for the training portion of the algorithm, make sure to upload all of the data at once. If more data is added to that location later, you must make a new training call to construct a brand new model.
Content Types Supported by Built-In Algorithms
The following table lists some of the commonly supported ContentType values and the algorithms that use them:
ContentTypes for Built-in Algorithms
| ContentType | Algorithm |
| --- | --- |
| application/x-image | Object Detection Algorithm, Semantic Segmentation |
| application/x-recordio | Object Detection Algorithm |
| application/x-recordio-protobuf | Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence |
| application/jsonlines | BlazingText, DeepAR |
| image/jpeg | Object Detection Algorithm, Semantic Segmentation |
| image/png | Object Detection Algorithm, Semantic Segmentation |
| text/csv | IP Insights, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, XGBoost |
| text/libsvm | XGBoost |
For a summary of the parameters used by each algorithm, see the documentation for the individual algorithms.
Using Pipe Mode
In Pipe mode, your training job streams data directly from Amazon Simple Storage Service (Amazon S3). Streaming can provide faster start times for training jobs and better throughput. This is in contrast to File mode, in which your data from Amazon S3 is stored on the training instance volumes. File mode uses disk space to store both your final model artifacts and your full training dataset. By streaming your data directly from Amazon S3 in Pipe mode, you reduce the size of the Amazon Elastic Block Store volumes of your training instances. Pipe mode needs only enough disk space to store your final model artifacts. See AlgorithmSpecification for additional details on the training input mode.
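For example, with the SageMaker Python SDK you can request Pipe mode through the estimator's input_mode parameter. The following is a minimal sketch; the algorithm image URI, execution role, and S3 paths are placeholders you would replace with your own values.

```python
# Sketch: requesting Pipe mode for a training job with the SageMaker Python SDK.
# The image URI, role, and S3 paths below are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<algorithm-image-uri>",        # built-in algorithm image for your Region
    role="<your-sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",                        # stream data from Amazon S3 instead of copying it
    output_path="s3://<your-bucket>/output",
    sagemaker_session=session,
)

estimator.fit({"train": "s3://<your-bucket>/train/"})
```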
Using CSV Format
Many Amazon SageMaker AI algorithms support training with data in CSV format. To use data in CSV format for training, specify text/csv as the ContentType in the input data channel specification. Amazon SageMaker AI requires that the CSV file have no header record and that the target variable be in the first column. To run unsupervised learning algorithms that don't have a target, specify the number of label columns in the content type; for example, 'content_type=text/csv;label_size=0'. For more information, see Now use Pipe mode with CSV datasets for faster training on Amazon SageMaker AI built-in algorithms.
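As an illustration, one way to attach this content type to a training channel with the SageMaker Python SDK is sketched below; the S3 path is a placeholder.

```python
# Sketch: attaching the CSV content type to a training channel (SageMaker Python SDK).
# The S3 path is a placeholder; label_size=0 indicates there is no label column.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://<your-bucket>/train/",
    content_type="text/csv;label_size=0",   # unsupervised: no target column
)

# estimator.fit({"train": train_input})
```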
Using RecordIO Format
In the protobuf recordIO format, SageMaker AI converts each observation in the dataset into a binary representation as a set of 4-byte floats, then loads it into the protobuf values field. If you are using Python for your data preparation, we strongly recommend that you use these existing transformations. However, if you are using another language, the protobuf definition file below provides the schema that you use to convert your data into the SageMaker AI protobuf format.
Note
For an example that shows how to convert the commonly used NumPy array into the protobuf recordIO format, see An Introduction to Factorization Machines with MNIST.
```protobuf
syntax = "proto2";

package aialgs.data;

option java_package = "com.amazonaws.aialgorithms.proto";
option java_outer_classname = "RecordProtos";

// A sparse or dense rank-R tensor that stores data as 32-bit floats (float32).
message Float32Tensor {
    // Each value in the vector. If keys is empty, this is treated as a
    // dense vector.
    repeated float values = 1 [packed = true];

    // If keys is not empty, the vector is treated as sparse, with
    // each key specifying the location of the value in the sparse vector.
    repeated uint64 keys = 2 [packed = true];

    // An optional shape that allows the vector to represent a matrix.
    // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
    // and keys[i] % 20 gives the column.
    // This also supports n-dimensional tensors.
    // Note: If the tensor is sparse, you must specify this value.
    repeated uint64 shape = 3 [packed = true];
}

// A sparse or dense rank-R tensor that stores data as doubles (float64).
message Float64Tensor {
    // Each value in the vector. If keys is empty, this is treated as a
    // dense vector.
    repeated double values = 1 [packed = true];

    // If this is not empty, the vector is treated as sparse, with
    // each key specifying the location of the value in the sparse vector.
    repeated uint64 keys = 2 [packed = true];

    // An optional shape that allows the vector to represent a matrix.
    // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
    // and keys[i] % 20 gives the column.
    // This also supports n-dimensional tensors.
    // Note: If the tensor is sparse, you must specify this value.
    repeated uint64 shape = 3 [packed = true];
}

// A sparse or dense rank-R tensor that stores data as 32-bit ints (int32).
message Int32Tensor {
    // Each value in the vector. If keys is empty, this is treated as a
    // dense vector.
    repeated int32 values = 1 [packed = true];

    // If this is not empty, the vector is treated as sparse, with
    // each key specifying the location of the value in the sparse vector.
    repeated uint64 keys = 2 [packed = true];

    // An optional shape that allows the vector to represent a matrix.
    // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
    // and keys[i] % 20 gives the column.
    // This also supports n-dimensional tensors.
    // Note: If the tensor is sparse, you must specify this value.
    repeated uint64 shape = 3 [packed = true];
}

// Support for storing binary data for parsing in other ways (such as JPEG/etc).
// This is an example of another type of value and may not immediately be supported.
message Bytes {
    repeated bytes value = 1;

    // If the content type of the data is known, stores it.
    // This allows for the possibility of using decoders for common formats
    // in the future.
    optional string content_type = 2;
}

message Value {
    oneof value {
        // The numbering assumes the possible use of:
        // - float16, float128
        // - int8, int16, int32
        Float32Tensor float32_tensor = 2;
        Float64Tensor float64_tensor = 3;
        Int32Tensor int32_tensor = 7;
        Bytes bytes = 9;
    }
}

message Record {
    // Map from the name of the feature to the value.
    //
    // For vectors and libsvm-like datasets,
    // a single feature with the name `values`
    // should be specified.
    map<string, Value> features = 1;

    // An optional set of labels for this record.
    // Similar to the features field above, the key used for
    // generic scalar / vector labels should be 'values'.
    map<string, Value> label = 2;

    // A unique identifier for this record in the dataset.
    //
    // Whilst not necessary, this allows better
    // debugging where there are data issues.
    //
    // This is not used by the algorithm directly.
    optional string uid = 3;

    // Textual metadata describing the record.
    //
    // This may include JSON-serialized information
    // about the source of the record.
    //
    // This is not used by the algorithm directly.
    optional string metadata = 4;

    // An optional serialized JSON object that allows per-record
    // hyper-parameters/configuration/other information to be set.
    //
    // The meaning/interpretation of this field is defined by
    // the algorithm author and may not be supported.
    //
    // This is used to pass additional inference configuration
    // when batch inference is used (e.g. types of scores to return).
    optional string configuration = 5;
}
```
After creating the protocol buffer, store it in an Amazon S3 location that Amazon SageMaker AI can access and that can be passed as part of InputDataConfig in create_training_job.
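If you prepare your data in Python, the SageMaker Python SDK includes helpers for this conversion. The following is a minimal sketch that serializes a NumPy feature matrix and label vector to protobuf recordIO and uploads the result to Amazon S3; the bucket and key names are placeholders.

```python
# Sketch: converting NumPy data to protobuf recordIO with the SageMaker Python SDK
# helpers, then uploading the result to Amazon S3. Bucket and key names are placeholders.
import io

import boto3
import numpy as np
import sagemaker.amazon.common as smac

features = np.random.rand(100, 5).astype("float32")         # example feature matrix
labels = np.random.randint(0, 2, size=100).astype("float32") # example labels

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)      # serialize to recordIO-protobuf
buf.seek(0)

boto3.resource("s3").Bucket("<your-bucket>").Object("train/data.protobuf").upload_fileobj(buf)
```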
Note
For all Amazon SageMaker AI algorithms, the ChannelName in InputDataConfig must be set to train. Some algorithms also support a validation or test input channel. These are typically used to evaluate the model's performance by using a hold-out dataset. Hold-out datasets are not used in the initial training but can be used to further tune the model.
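As an illustration, a train channel with its ContentType might look like the following in a create_training_job call made through boto3. The job name, image URI, role ARN, and S3 paths are placeholders.

```python
# Sketch: a "train" channel in InputDataConfig for create_training_job (boto3).
# ARNs, image URI, and S3 paths are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="my-training-job",
    AlgorithmSpecification={
        "TrainingImage": "<algorithm-image-uri>",
        "TrainingInputMode": "Pipe",
    },
    RoleArn="<your-sagemaker-execution-role-arn>",
    InputDataConfig=[
        {
            "ChannelName": "train",                         # required channel name
            "ContentType": "application/x-recordio-protobuf",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<your-bucket>/train/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://<your-bucket>/output"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```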
Trained Model Deserialization
Amazon SageMaker AI models are stored as model.tar.gz in the S3 bucket specified in the OutputDataConfig S3OutputPath parameter of the create_training_job call. The S3 bucket must be in the same AWS Region as the notebook instance. You can specify most of these model artifacts when creating a hosting model. You can also open and review them in your notebook instance.
When model.tar.gz is untarred, it contains model_algo-1, which is a serialized Apache MXNet object.
For example, use the following to load the k-means model into memory and view it:
```python
import mxnet as mx

print(mx.ndarray.load('model_algo-1'))
```
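If the artifact is still in Amazon S3, you might first download and extract it before loading it, along the lines of the following sketch; the bucket and key names are placeholders for your training job's S3OutputPath.

```python
# Sketch: downloading and inspecting a trained model artifact.
# The bucket and object key are placeholders for your training job's output location.
import tarfile

import boto3
import mxnet as mx

boto3.resource("s3").Bucket("<your-bucket>").download_file(
    "output/<training-job-name>/output/model.tar.gz", "model.tar.gz"
)

with tarfile.open("model.tar.gz") as tar:
    tar.extractall()                       # extracts model_algo-1

print(mx.ndarray.load("model_algo-1"))     # serialized Apache MXNet object
```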