Common Data Formats for Training

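To prepare for training, you can preprocess your data with a variety of AWS services, including Amazon EMR, AWS Glue, Amazon Redshift, Amazon Relational Database Service, and Amazon Athena. After preprocessing, publish the data to an Amazon S3 bucket. For training, the data must go through a series of conversions and transformations, including:

Training data serialization (handled by you)
Training data deserialization (handled by the algorithm)
Training model serialization (handled by the algorithm)
Trained model deserialization (optional, handled by you)

When you use Amazon SageMaker AI for the training portion of an algorithm, make sure to upload all of the data at once. If more data is added to that location, you need to make a new training call to build a brand new model.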

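Content Types Supported by Built-In Algorithms

The following table lists some of the commonly supported ContentType values and the algorithms that use them:

ContentTypes for Built-in Algorithms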

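ContentType	Algorithm
application/x-image	Object Detection Algorithm, Semantic Segmentation
application/x-recordio	Object Detection Algorithm
application/x-recordio-protobuf	Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence
application/jsonlines	BlazingText, DeepAR
image/jpeg	Object Detection Algorithm, Semantic Segmentation
image/png	Object Detection Algorithm, Semantic Segmentation
text/csv	IP Insights, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, XGBoost
text/libsvm	XGBoost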

For a summary of the parameters used by each algorithm, see the documentation for the individual algorithms or this table.

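Using Pipe Mode

In Pipe mode, your training job streams data directly from Amazon Simple Storage Service (Amazon S3). Streaming can give training jobs faster start times and higher throughput. This is in contrast to File mode, in which your data from Amazon S3 is stored on the training instance volumes. File mode uses disk space to store both your final model artifacts and your full training dataset. By streaming your data directly from Amazon S3 in Pipe mode, you reduce the size of the Amazon Elastic Block Store volumes of your training instances. Pipe mode needs only enough disk space to store your final model artifacts. See AlgorithmSpecification for additional details on the training input mode.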
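
As a rough illustration of where the input mode is selected, the sketch below builds the AlgorithmSpecification portion of a create_training_job request as a plain Python dictionary. The helper name build_algorithm_specification and the image URI placeholder are assumptions made for this example; TrainingImage and TrainingInputMode are the request field names.

```python
def build_algorithm_specification(training_image, use_pipe_mode=True):
    """Return an AlgorithmSpecification dict for a create_training_job request.

    build_algorithm_specification is a hypothetical helper for illustration.
    """
    return {
        # Container image of the algorithm (placeholder value supplied by the caller).
        "TrainingImage": training_image,
        # "Pipe" streams from Amazon S3; "File" copies the dataset to the instance volume.
        "TrainingInputMode": "Pipe" if use_pipe_mode else "File",
    }


# The image URI below is a placeholder, not a real repository.
spec = build_algorithm_specification(
    "<account>.dkr.ecr.<region>.amazonaws.com/<algorithm>:<tag>"
)
print(spec["TrainingInputMode"])
```

The resulting dictionary would be passed as the AlgorithmSpecification argument of the training call, alongside the other required request fields.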

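Using CSV Format

Many Amazon SageMaker AI algorithms support training with data in CSV format. To train with CSV data, specify text/csv as the ContentType in the input data channel specification. Amazon SageMaker AI requires that a CSV file has no header record and that the target variable is in the first column. To run unsupervised learning algorithms that have no target, specify the number of label columns in the content type. For example, in this case, 'content_type=text/csv;label_size=0'. For more information, see Now use Pipe mode with CSV datasets for faster training on Amazon SageMaker AI built-in algorithms.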
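
The CSV layout described above can be sketched with the Python standard library. The function names to_training_csv and csv_content_type are hypothetical helpers introduced for this example, and the sample values are made up; the sketch only shows a header-less file with the target in the first column, plus the content type string for the unlabeled case.

```python
import csv
import io


def to_training_csv(features, labels=None):
    """Serialize rows to header-less CSV text, placing the label first when given."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for i, row in enumerate(features):
        if labels is None:
            writer.writerow(row)  # unsupervised: feature columns only
        else:
            writer.writerow([labels[i], *row])  # supervised: target in column one
    return buffer.getvalue()


def csv_content_type(has_labels):
    """Return the ContentType string for labeled or unlabeled CSV data."""
    return "text/csv" if has_labels else "text/csv;label_size=0"


supervised = to_training_csv([[5.1, 3.5], [6.2, 2.9]], labels=[0, 1])
unsupervised = to_training_csv([[5.1, 3.5], [6.2, 2.9]])
print(supervised)
print(csv_content_type(has_labels=False))
```

No header row is written in either case, which matches the requirement stated above.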

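Using RecordIO Format

In the protobuf recordIO format, SageMaker AI converts each observation in the dataset into a binary representation as a set of 4-byte floats, then loads it into the protobuf values field. If you use Python for data preparation, we strongly recommend that you use these existing transformations. If you use another language, however, the protobuf definition file below provides the schema that you use to convert your data into the SageMaker AI protobuf format.

Note

For an example that shows how to convert a commonly used NumPy array into the protobuf recordIO format, see An Introduction to Factorization Machines with MNIST.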


syntax = "proto2";

 package aialgs.data;

 option java_package = "com.amazonaws.aialgorithms.proto";
 option java_outer_classname = "RecordProtos";

 // A sparse or dense rank-R tensor that stores data as floats (float32).
 message Float32Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated float values = 1 [packed = true];

     // If keys is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as doubles (float64).
 message Float64Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated double values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as 32-bit ints (int32).
 message Int32Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated int32 values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // Support for storing binary data for parsing in other ways (such as JPEG/etc).
 // This is an example of another type of value and may not immediately be supported.
 message Bytes {
     repeated bytes value = 1;

     // If the content type of the data is known, stores it.
     // This allows for the possibility of using decoders for common formats
     // in the future.
     optional string content_type = 2;
 }

 message Value {
     oneof value {
         // The numbering assumes the possible use of:
         // - float16, float128
         // - int8, int16, int32
         Float32Tensor float32_tensor = 2;
         Float64Tensor float64_tensor = 3;
         Int32Tensor int32_tensor = 7;
         Bytes bytes = 9;
     }
 }

 message Record {
     // Map from the name of the feature to the value.
     //
     // For vectors and libsvm-like datasets,
     // a single feature with the name `values`
     // should be specified.
     map<string, Value> features = 1;

     // An optional set of labels for this record.
     // Similar to the features field above, the key used for
     // generic scalar / vector labels should be 'values'.
     map<string, Value> label = 2;

     // A unique identifier for this record in the dataset.
     //
     // Whilst not necessary, this allows better
     // debugging where there are data issues.
     //
     // This is not used by the algorithm directly.
     optional string uid = 3;

     // Textual metadata describing the record.
     //
     // This may include JSON-serialized information
     // about the source of the record.
     //
     // This is not used by the algorithm directly.
     optional string metadata = 4;

     // An optional serialized JSON object that allows per-record
     // hyper-parameters/configuration/other information to be set.
     //
     // The meaning/interpretation of this field is defined by
     // the algorithm author and may not be supported.
     //
     // This is used to pass additional inference configuration
     // when batch inference is used (e.g. types of scores to return).
     optional string configuration = 5;
 }
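
The shape comments in the schema describe how a flat key maps to a row and column. A minimal sketch of that arithmetic follows, assuming row-major ordering (which the two-dimensional example in the comments implies) and extending it to n dimensions. The function names key_to_index and index_to_key are hypothetical and are not part of any SageMaker library.

```python
def key_to_index(key, shape):
    """Convert a flat sparse key into a row-major multi-dimensional index."""
    index = []
    for dim in reversed(shape):
        index.append(key % dim)  # position along the current (innermost) axis
        key //= dim              # remaining key addresses the outer axes
    if key != 0:
        raise ValueError("key is out of range for the given shape")
    return tuple(reversed(index))


def index_to_key(index, shape):
    """Convert a multi-dimensional index back into a flat row-major key."""
    key = 0
    for position, dim in zip(index, shape):
        key = key * dim + position
    return key


# With shape [10, 20]: row = floor(45 / 20) = 2, column = 45 % 20 = 5.
print(key_to_index(45, [10, 20]))      # (2, 5)
print(index_to_key((2, 5), [10, 20]))  # 45
```

Because the column count (20) is the divisor, the same formula applies to all three tensor message types.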

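After creating the protocol buffer, store it in an Amazon S3 location that Amazon SageMaker AI can access and that can be passed as part of InputDataConfig in create_training_job.

Note

For all Amazon SageMaker AI algorithms, the ChannelName in InputDataConfig must be set to train. Some algorithms also support validation or test input channels. These are typically used to evaluate the model's performance with a hold-out dataset. Hold-out datasets are not used in the initial training but can be used to further tune the model.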
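
To make the channel naming concrete, the following sketch assembles an InputDataConfig list with a required train channel and an optional validation channel. The helper name build_channel and the bucket name amzn-s3-demo-bucket are placeholders assumed for this example, and whether a validation channel is accepted depends on the algorithm.

```python
def build_channel(name, s3_uri, content_type="application/x-recordio-protobuf"):
    """Return one InputDataConfig channel entry (hypothetical helper)."""
    return {
        "ChannelName": name,
        "ContentType": content_type,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",  # treat the URI as a key prefix
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Placeholder bucket and prefixes; replace with your own S3 locations.
input_data_config = [
    build_channel("train", "s3://amzn-s3-demo-bucket/train/"),
    build_channel("validation", "s3://amzn-s3-demo-bucket/validation/"),
]
print([channel["ChannelName"] for channel in input_data_config])
```

The list would be passed as the InputDataConfig argument of create_training_job.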

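Trained Model Deserialization

Amazon SageMaker AI models are stored as model.tar.gz in the S3 bucket specified in the OutputDataConfig S3OutputPath parameter of the create_training_job call. The S3 bucket must be in the same AWS Region as the notebook instance. You can specify most of these model artifacts when you create a hosting model. You can also open and review them in your notebook instance. When model.tar.gz is untarred, it contains model_algo-1, which is a serialized Apache MXNet object. For example, you can load the k-means model into memory and view it as follows: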


import mxnet as mx

# Load the serialized MXNet object from the extracted artifact and print it.
print(mx.ndarray.load('model_algo-1'))
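
Before the artifact can be loaded, model.tar.gz has to be unpacked. One possible way to do that with the Python standard library is sketched below. The function name extract_model_artifact is hypothetical, and the sketch assumes the archive has already been downloaded from Amazon S3 to the local working directory.

```python
import os
import tarfile


def extract_model_artifact(archive_path, dest_dir, member_name="model_algo-1"):
    """Copy one file out of a model.tar.gz archive and return the new path."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Match on the base name so a leading "./" or folder does not matter.
            if member.isfile() and os.path.basename(member.name) == member_name:
                target = os.path.join(dest_dir, member_name)
                with archive.extractfile(member) as source, open(target, "wb") as out:
                    out.write(source.read())
                return target
    raise FileNotFoundError(f"{member_name} not found in {archive_path}")


# Runs only when a downloaded archive is present in the working directory.
if os.path.exists("model.tar.gz"):
    print(extract_model_artifact("model.tar.gz", "."))
```

Copying only the named member, rather than extracting the whole archive, keeps unexpected paths inside the archive from being written to disk.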
