Sequence-to-Sequence Algorithm
Amazon SageMaker Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. Example applications include: machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), and speech-to-text (input audio clips and output sentences in tokens). Recently, problems in this domain have been successfully modeled with deep neural networks that show a significant performance boost over previous methodologies. Amazon SageMaker seq2seq uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.
Topics
- Input/Output Interface for the Sequence-to-Sequence Algorithm
- EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm
- Sequence-to-Sequence Sample Notebooks
Input/Output Interface for the Sequence-to-Sequence Algorithm
Training
SageMaker seq2seq expects data in RecordIO-Protobuf format. However, the tokens are expected as integers, not as floating points, as is usually the case.
A script to convert data from tokenized text files to the protobuf format is included in the seq2seq example notebook.
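For illustration only (this is not the conversion script from the notebook), the following minimal Python sketch shows the idea of turning whitespace-tokenized text into the integer token IDs the algorithm consumes. It assumes vocab.src.json is a simple token-to-ID mapping and that an "<unk>" entry exists; both are assumptions for this sketch.

import json

# Load the source vocabulary produced during preprocessing
# (assumed here to be a plain token -> integer ID mapping).
with open("vocab.src.json") as f:
    vocab = json.load(f)

UNK_ID = vocab.get("<unk>", 0)  # assumption: vocabulary contains an <unk> entry

def encode(sentence):
    """Map whitespace-separated tokens to integer IDs."""
    return [vocab.get(token, UNK_ID) for token in sentence.split()]

print(encode("hello world !"))  # for example, [42, 7, 15]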
After preprocessing is done, the algorithm can be invoked for training. The algorithm expects three channels:
- train: It should contain the training data (for example, the train.rec file generated by the preprocessing script).
- validation: It should contain the validation data (for example, the val.rec file generated by the preprocessing script).
- vocab: It should contain two vocabulary files (vocab.src.json and vocab.trg.json).
If the algorithm doesn't find data in any of these three channels, training results in an error.
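As a sketch of how these three channels are typically wired up with the SageMaker Python SDK: the role ARN, bucket paths, and instance type below are placeholders, hyperparameters are omitted, and the "seq2seq" image identifier is assumed to resolve to the built-in algorithm container in your Region.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role ARN

# Retrieve the built-in seq2seq container image for the current Region.
container = image_uris.retrieve("seq2seq", region)

seq2seq = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,                      # seq2seq trains on a single machine
    instance_type="ml.p3.2xlarge",         # GPU instance; placeholder choice
    output_path="s3://my-bucket/seq2seq/output",  # placeholder bucket
    sagemaker_session=session,
)

# Training fails if any of these three channels is missing.
seq2seq.fit({
    "train": TrainingInput("s3://my-bucket/seq2seq/train"),
    "validation": TrainingInput("s3://my-bucket/seq2seq/validation"),
    "vocab": TrainingInput("s3://my-bucket/seq2seq/vocab"),
})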
Inference
For hosted endpoints, inference supports two data formats. To perform inference using space-separated text tokens, use the application/json format. Otherwise, use the recordio-protobuf format to work with the integer-encoded data. Both modes support batching of input data. The application/json format also allows you to visualize the attention matrix.
- application/json: Expects the input in JSON format and returns the output in JSON format. Both content and accept types should be application/json. Each sequence is expected to be a string with whitespace-separated tokens. This format is recommended when the number of source sequences in the batch is small. It also supports the following additional configuration option: configuration: {attention_matrix: true} returns the attention matrix for the particular input sequence.
- application/x-recordio-protobuf: Expects the input in recordio-protobuf format and returns the output in recordio-protobuf format. Both content and accept types should be application/x-recordio-protobuf. For this format, the source sequences must be converted into a list of integers for subsequent protobuf encoding. This format is recommended for bulk inference.
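The following sketch shows the application/json mode against a hosted endpoint. The endpoint name is a placeholder, and the "instances"/"data" payload keys follow the convention used in the seq2seq sample notebook; treat the exact payload shape as an assumption and confirm it against the notebook.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Each sequence is a single string of whitespace-separated tokens.
# Payload keys ("instances", "data") are assumed from the sample notebook.
payload = {
    "instances": [{"data": "you are so good !"}],
    "configuration": {"attention_matrix": "true"},  # optional: return the attention matrix
}

response = runtime.invoke_endpoint(
    EndpointName="my-seq2seq-endpoint",   # placeholder endpoint name
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)

print(json.loads(response["Body"].read()))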
For batch transform, inference supports JSON Lines format. Batch transform expects the
input in JSON Lines format and returns the output in JSON Lines format. Both content and
accept types should be application/jsonlines
. The format for input is as
follows:
content-type: application/jsonlines
{"source": "source_sequence_0"}
{"source": "source_sequence_1"}
The format for response is as follows:
accept: application/jsonlines
{"target": "predicted_sequence_0"}
{"target": "predicted_sequence_1"}
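A minimal sketch of running batch transform with this format using the SageMaker Python SDK follows; the model name, bucket paths, and instance type are placeholders. Splitting the input by line lets each JSON Lines record shown above be scored as its own record.

from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="my-seq2seq-model",        # placeholder: a model created from the training job
    instance_count=1,
    instance_type="ml.p3.2xlarge",        # placeholder GPU instance
    accept="application/jsonlines",
    assemble_with="Line",
    output_path="s3://my-bucket/seq2seq/transform-output",  # placeholder
)

# Each line of the input file is one {"source": ...} record, as shown above.
transformer.transform(
    data="s3://my-bucket/seq2seq/input.jsonl",  # placeholder input location
    content_type="application/jsonlines",
    split_type="Line",
)
transformer.wait()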
For additional details on how to serialize and deserialize the inputs and outputs to specific formats for inference, see the Sequence-to-Sequence Sample Notebooks.
EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm
The Amazon SageMaker seq2seq algorithm supports only GPU instance types and can train only on a single machine. However, you can use instances with multiple GPUs. The seq2seq algorithm supports the P2, P3, G4dn, and G5 GPU instance families.
Sequence-to-Sequence Sample Notebooks
For a sample notebook that shows how to use the SageMaker Sequence to Sequence algorithm to
train an English-German translation model, see Machine Translation English-German Example Using SageMaker Seq2Seq.