Sequence-to-Sequence Algorithm - Amazon SageMaker

Sequence-to-Sequence Algorithm

Amazon SageMaker Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. Example applications include: machine translation (input a sentence in one language and predict that sentence in another language), text summarization (input a longer string of words and predict a shorter string that summarizes it), and speech-to-text (convert audio clips into sequences of output tokens). Recently, problems in this domain have been successfully modeled with deep neural networks, which show a significant performance boost over previous methodologies. Amazon SageMaker seq2seq uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.

Input/Output Interface for the Sequence-to-Sequence Algorithm

Training

SageMaker seq2seq expects data in RecordIO-Protobuf format. However, unlike the floating-point data that is usually used with this format, the tokens must be provided as integers.

A script to convert data from tokenized text files to the protobuf format is included in the seq2seq example notebook. In general, it packs the data into 32-bit integer tensors and generates the necessary vocabulary files, which are needed for metric calculation and inference.
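The details of the packing are handled by that script, but its vocabulary-building step can be sketched in plain Python. The following is a minimal illustration only; the file names and special-token IDs are placeholders, and the script in the example notebook remains the authoritative implementation.

import json
from collections import Counter

# Placeholder paths; the example notebook derives vocab.src.json and vocab.trg.json
# from the tokenized source and target training files.
SOURCE_FILE = "train.source.tok"
VOCAB_FILE = "vocab.src.json"

counts = Counter()
with open(SOURCE_FILE, encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())  # whitespace-separated tokens

# Reserve low IDs for special tokens. The exact special tokens and IDs are set by
# the preprocessing script, so treat these values as assumptions.
vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}
for token, _ in counts.most_common():
    vocab[token] = len(vocab)

with open(VOCAB_FILE, "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

# A tokenized sentence then becomes a sequence of 32-bit integers:
ids = [vocab.get(tok, vocab["<unk>"]) for tok in "guten morgen".split()]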

After preprocessing is done, the algorithm can be invoked for training. The algorithm expects three channels:

  • train: It should contain the training data (for example, the train.rec file generated by the preprocessing script).

  • validation: It should contain the validation data (for example, the val.rec file generated by the preprocessing script).

  • vocab: It should contain two vocabulary files (vocab.src.json and vocab.trg.json).

If the algorithm doesn't find data in any of these three channels, training results in an error.
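Assuming the preprocessed files have been uploaded to Amazon S3, a training job that supplies all three channels might be configured with the SageMaker Python SDK as in the following sketch. The bucket, prefix, and IAM role shown here are placeholders.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::111122223333:role/SageMakerRole"  # placeholder execution role

# Retrieve the built-in seq2seq container image for this Region.
container = image_uris.retrieve("seq2seq", region)

bucket = "amzn-s3-demo-bucket"   # placeholder
prefix = "seq2seq/data"          # placeholder

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,                 # seq2seq trains on a single machine
    instance_type="ml.p3.8xlarge",    # GPU instance; multiple GPUs are used if present
    output_path=f"s3://{bucket}/{prefix}/output",
    sagemaker_session=session,
)

# All three channels are required; training fails if any of them is missing.
estimator.fit({
    "train": f"s3://{bucket}/{prefix}/train",            # contains train.rec
    "validation": f"s3://{bucket}/{prefix}/validation",  # contains val.rec
    "vocab": f"s3://{bucket}/{prefix}/vocab",            # vocab.src.json and vocab.trg.json
})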

Inference

For hosted endpoints, inference supports two data formats. To perform inference using space-separated text tokens, use the application/json format. Otherwise, use the recordio-protobuf format to work with integer-encoded data. Both modes support batching of input data. The application/json format also allows you to visualize the attention matrix.

  • application/json: Expects the input in JSON format and returns the output in JSON format. Both content and accept types should be application/json. Each sequence is expected to be a string with whitespace-separated tokens. This format is recommended when the number of source sequences in the batch is small. It also supports the following additional configuration option (a request sketch follows this list):

    configuration: {attention_matrix: true}: Returns the attention matrix for the particular input sequence.

  • application/x-recordio-protobuf: Expects the input in recordio-protobuf format and returns the output in recordio-protobuf format. Both content and accept types should be application/x-recordio-protobuf. For this format, the source sequences must be converted into a list of integers for subsequent protobuf encoding. This format is recommended for bulk inference.
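As an illustration of the application/json mode, a hosted endpoint can be invoked through the SageMaker runtime as sketched below. The payload layout follows the sample notebook; the endpoint name is a placeholder, and the exact field names should be checked against the notebook.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
endpoint_name = "seq2seq-endpoint"  # placeholder

# Each instance carries a whitespace-tokenized source sequence; the optional
# configuration block requests the attention matrix for that sequence.
payload = {
    "instances": [
        {
            "data": "you are so good !",
            "configuration": {"attention_matrix": "true"},
        }
    ]
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))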

For batch transform, inference supports JSON Lines format. Batch transform expects the input in JSON Lines format and returns the output in JSON Lines format. Both content and accept types should be application/jsonlines. The format for input is as follows:

content-type: application/jsonlines

{"source": "source_sequence_0"}
{"source": "source_sequence_1"}

The format for response is as follows:

accept: application/jsonlines

{"target": "predicted_sequence_0"}
{"target": "predicted_sequence_1"}
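A batch transform job over such a JSON Lines file could be set up along the following lines. This sketch reuses the estimator, bucket, and prefix placeholders from the training example above and assumes the input file has already been uploaded to S3.

# Create a transformer from the trained estimator (see the training sketch above).
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path=f"s3://{bucket}/{prefix}/batch-output",  # placeholder
    accept="application/jsonlines",
)

# The input file contains one {"source": ...} object per line.
transformer.transform(
    data=f"s3://{bucket}/{prefix}/batch-input/input.jsonl",  # placeholder
    content_type="application/jsonlines",
    split_type="Line",
)
transformer.wait()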

For additional details on how to serialize and deserialize the inputs and outputs to specific formats for inference, see the Sequence-to-Sequence Sample Notebooks.

EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm

The Amazon SageMaker seq2seq algorithm supports only GPU instance types and can train only on a single machine. However, you can use instances with multiple GPUs. The seq2seq algorithm supports the P2, P3, G4dn, and G5 GPU instance families.

Sequence-to-Sequence Sample Notebooks

For a sample notebook that shows how to use the SageMaker Sequence to Sequence algorithm to train an English-German translation model, see Machine Translation English-German Example Using SageMaker Seq2Seq. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker, see Amazon SageMaker Notebook Instances. After you have created a notebook instance and opened it, select the SageMaker Examples tab to see a list of all of the SageMaker samples. The seq2seq example notebooks are located in the Introduction to Amazon algorithms section. To open a notebook, choose its Use tab and select Create copy.