

# Sequence-to-Sequence Algorithm
<a name="seq-2-seq"></a>

Amazon SageMaker AI Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text or audio) and the output generated is another sequence of tokens. Example applications include: machine translation (input a sentence in one language and predict that sentence in another language), text summarization (input a longer string of words and predict a shorter string that summarizes it), and speech-to-text (audio clips converted into output sentences as tokens). Recently, problems in this domain have been successfully modeled with deep neural networks, which show a significant performance boost over previous methodologies. Amazon SageMaker AI seq2seq uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.

**Topics**
+ [Input/Output Interface for the Sequence-to-Sequence Algorithm](#s2s-inputoutput)
+ [EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm](#s2s-instances)
+ [Sequence-to-Sequence Sample Notebooks](#seq-2-seq-sample-notebooks)
+ [How Sequence-to-Sequence Works](seq-2-seq-howitworks.md)
+ [Sequence-to-Sequence Hyperparameters](seq-2-seq-hyperparameters.md)
+ [Tune a Sequence-to-Sequence Model](seq-2-seq-tuning.md)

## Input/Output Interface for the Sequence-to-Sequence Algorithm
<a name="s2s-inputoutput"></a>

**Training**

SageMaker AI seq2seq expects data in RecordIO-Protobuf format. However, the tokens are expected as integers, not as floating-point values, as is usually the case.

A script to convert data from tokenized text files to the protobuf format is included in [the seq2seq example notebook](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.html). In general, it packs the data into 32-bit integer tensors and generates the necessary vocabulary files, which are needed for metric calculation and inference.
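
The core of that conversion is mapping whitespace-separated tokens to integer IDs via the vocabulary; a minimal sketch (the vocabulary below is a made-up toy, not the notebook's output):

```python
# Hypothetical mini-vocabulary; the real vocab.src.json is generated by the
# preprocessing script in the example notebook.
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "the": 3, "cat": 4, "sat": 5}

def tokens_to_ids(sentence, vocab):
    """Map whitespace-separated tokens to integer IDs, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]

ids = tokens_to_ids("the cat sat on", vocab)
print(ids)  # [3, 4, 5, 0] -- "on" is out of vocabulary
```

The notebook's script then packs lists of IDs like this into 32-bit integer tensors in the RecordIO-Protobuf files.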

After preprocessing is done, the algorithm can be invoked for training. The algorithm expects three channels:
+ `train`: It should contain the training data (for example, the `train.rec` file generated by the preprocessing script).
+ `validation`: It should contain the validation data (for example, the `val.rec` file generated by the preprocessing script).
+ `vocab`: It should contain two vocabulary files (`vocab.src.json` and `vocab.trg.json`).

If the algorithm doesn't find data in any of these three channels, training results in an error.
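
For example, the three channels might be laid out as follows (bucket name and prefixes are hypothetical):

```python
# Hypothetical S3 prefixes; the channel names must be exactly these three.
channels = {
    "train":      "s3://my-bucket/seq2seq/train",       # contains train.rec
    "validation": "s3://my-bucket/seq2seq/validation",  # contains val.rec
    "vocab":      "s3://my-bucket/seq2seq/vocab",       # vocab.src.json, vocab.trg.json
}
# With the SageMaker Python SDK, this mapping would be passed to estimator.fit().
```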

**Inference**

For hosted endpoints, inference supports two data formats. To perform inference using space-separated text tokens, use the `application/json` format. Otherwise, use the `recordio-protobuf` format to work with integer-encoded data. Both modes support batching of input data. The `application/json` format also allows you to visualize the attention matrix.
+ `application/json`: Expects the input in JSON format and returns the output in JSON format. Both content and accept types should be `application/json`. Each sequence is expected to be a string with whitespace-separated tokens. This format is recommended when the number of source sequences in the batch is small. It also supports the following additional configuration option:

  `configuration`: {`attention_matrix`: `true`}: Returns the attention matrix for the particular input sequence.
+ `application/x-recordio-protobuf`: Expects the input in `recordio-protobuf` format and returns the output in `recordio-protobuf` format. Both content and accept types should be `application/x-recordio-protobuf`. For this format, the source sequences must be converted into a list of integers for subsequent protobuf encoding. This format is recommended for bulk inference.
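
As a sketch of an `application/json` request body with the attention-matrix option enabled (the `instances`/`data` field names follow the example notebook and should be verified there; the sentence is a placeholder):

```python
import json

# Hypothetical request body for the application/json inference format.
payload = {
    "instances": [
        {
            "data": "hello world",
            "configuration": {"attention_matrix": "true"},
        }
    ]
}
body = json.dumps(payload)  # send with content type application/json
```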

For batch transform, inference supports JSON Lines format. Batch transform expects the input in JSON Lines format and returns the output in JSON Lines format. Both content and accept types should be `application/jsonlines`. The format for input is as follows:

```
content-type: application/jsonlines

{"source": "source_sequence_0"}
{"source": "source_sequence_1"}
```

The format for response is as follows:

```
accept: application/jsonlines

{"target": "predicted_sequence_0"}
{"target": "predicted_sequence_1"}
```
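
Building and parsing these payloads is plain line-delimited JSON; for example (the sequence strings are the placeholders from the format above):

```python
import json

# Build a JSON Lines request body: one JSON object per line.
sources = ["source_sequence_0", "source_sequence_1"]
request_body = "\n".join(json.dumps({"source": s}) for s in sources)

# Parse a response in the same line-delimited shape (contents hypothetical).
response_body = '{"target": "predicted_sequence_0"}\n{"target": "predicted_sequence_1"}'
targets = [json.loads(line)["target"] for line in response_body.splitlines()]
print(targets)  # ['predicted_sequence_0', 'predicted_sequence_1']
```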

For additional details on how to serialize and deserialize the inputs and outputs to specific formats for inference, see the [Sequence-to-Sequence Sample Notebooks](#seq-2-seq-sample-notebooks).

## EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm
<a name="s2s-instances"></a>

The Amazon SageMaker AI seq2seq algorithm supports only GPU instance types and can train only on a single machine. However, you can use instances with multiple GPUs. The seq2seq algorithm supports the P2, P3, G4dn, and G5 GPU instance families.

## Sequence-to-Sequence Sample Notebooks
<a name="seq-2-seq-sample-notebooks"></a>

For a sample notebook that shows how to use the SageMaker AI Sequence to Sequence algorithm to train an English-German translation model, see [Machine Translation English-German Example Using SageMaker AI Seq2Seq](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The example notebooks that use the seq2seq algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

# How Sequence-to-Sequence Works
<a name="seq-2-seq-howitworks"></a>

Typically, a neural network for sequence-to-sequence modeling consists of a few layers, including: 
+ An **embedding layer**. In this layer, the input matrix, which consists of input tokens encoded in a sparse way (for example, one-hot encoded), is mapped to a dense feature layer. This is required because a high-dimensional feature vector is more capable of encoding information regarding a particular token (a word for text corpora) than a simple one-hot-encoded vector. It is also standard practice to initialize this embedding layer with pre-trained word vectors like [FastText](https://fasttext.cc/) or [GloVe](https://nlp.stanford.edu/projects/glove/), or to initialize it randomly and learn the parameters during training. 
+ An **encoder layer**. After the input tokens are mapped into a high-dimensional feature space, the sequence is passed through an encoder layer to compress all the information from the input embedding layer (of the entire sequence) into a fixed-length feature vector. Typically, an encoder is made of RNN-type networks like long short-term memory (LSTM) or gated recurrent units (GRU). ([Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) explains LSTM in great detail.) 
+ A **decoder layer**. The decoder layer takes this encoded feature vector and produces the output sequence of tokens. This layer is also usually built with RNN architectures (LSTM and GRU). 

The whole model is trained jointly to maximize the probability of the target sequence given the source sequence. This model was first introduced by [Sutskever et al.](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) in 2014. 
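
As an illustration of how an embedding lookup and an encoder compress a variable-length sequence into a fixed-length vector, here is a toy NumPy sketch (a single tanh recurrence stands in for the LSTM/GRU cells used in practice; all sizes and weights are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 10, 4, 6

# Embedding layer: dense vectors looked up by token ID.
E = rng.normal(size=(vocab_size, embed_dim))
# Toy "encoder": one tanh recurrence (real encoders use LSTM/GRU cells).
W_x = rng.normal(size=(embed_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))

def encode(token_ids):
    h = np.zeros(hidden_dim)
    for t in token_ids:
        h = np.tanh(E[t] @ W_x + h @ W_h)
    return h  # fixed-length feature vector, regardless of input length

v = encode([1, 4, 7, 2])
print(v.shape)  # (6,)
```

Note that sequences of any length map to the same 6-dimensional vector, which is exactly the bottleneck that motivates the attention mechanism described next.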

**Attention mechanism**. The disadvantage of an encoder-decoder framework is that model performance decreases as the length of the source sequence increases, because of the limit on how much information the fixed-length encoded feature vector can contain. To tackle this problem, in 2015, Bahdanau et al. proposed the [attention mechanism](https://arxiv.org/pdf/1409.0473.pdf). With an attention mechanism, the decoder tries to find the location in the encoder sequence where the most important information could be located, and uses that information together with previously decoded words to predict the next token in the sequence. 
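
The attention computation can be sketched as follows: score each encoder state against the current decoder state, normalize the scores with a softmax, and return the weighted sum as the context vector (dot-product scoring is used here for simplicity; the algorithm's `rnn_attention_type` hyperparameter selects among several variants):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: weight encoder states by their relevance to the
    current decoder state, then return the weighted sum (context vector)."""
    scores = encoder_states @ decoder_state   # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over source positions
    return weights, weights @ encoder_states

# Toy example: 3 source positions, 2-dimensional hidden states.
enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w, ctx = attention_context(np.array([1.0, 0.0]), enc)
```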

For more details, see the whitepaper [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025) by Luong et al., which explains and simplifies calculations for various attention mechanisms. Additionally, the whitepaper [Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/abs/1609.08144) by Wu et al. describes Google's architecture for machine translation, which uses skip connections between encoder and decoder layers.

# Sequence-to-Sequence Hyperparameters
<a name="seq-2-seq-hyperparameters"></a>

The following table lists the hyperparameters that you can set when training with the Amazon SageMaker AI Sequence-to-Sequence (seq2seq) algorithm.


| Parameter Name | Description | 
| --- | --- | 
| batch\_size | Mini-batch size for gradient descent. **Optional** Valid values: positive integer Default value: 64 | 
| beam\_size | Length of the beam for beam search. Used during training for computing `bleu` and used during inference. **Optional** Valid values: positive integer Default value: 5 | 
| bleu\_sample\_size | Number of instances to pick from the validation dataset to decode and compute the `bleu` score during training. Set to -1 to use the full validation set (if `bleu` is chosen as `optimized_metric`). **Optional** Valid values: integer Default value: 0 | 
| bucket\_width | Returns (source, target) buckets up to (`max_seq_len_source`, `max_seq_len_target`). The longer side of the data uses steps of `bucket_width`, while the shorter side uses steps scaled down by the average target/source length ratio. If one side reaches its maximum length before the other, the width of the extra buckets on that side is fixed to that side's `max_len`. **Optional** Valid values: positive integer Default value: 10 | 
| bucketing\_enabled | Set to `false` to disable bucketing and unroll to maximum length. **Optional** Valid values: `true` or `false` Default value: `true` | 
| checkpoint\_frequency\_num\_batches | Checkpoint and evaluate every x batches. This checkpointing hyperparameter is passed to SageMaker AI's seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker AI checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in S3 after the training job has stopped. **Optional** Valid values: positive integer Default value: 1000 | 
| checkpoint\_threshold | Maximum number of checkpoints the model is allowed to not improve in `optimized_metric` on the validation dataset before training is stopped. This checkpointing hyperparameter is passed to SageMaker AI's seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker AI checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in S3 after the training job has stopped. **Optional** Valid values: positive integer Default value: 3 | 
| clip\_gradient | Clip absolute gradient values greater than this. Set to a negative value to disable. **Optional** Valid values: float Default value: 1 | 
| cnn\_activation\_type | The `cnn` activation type to be used. **Optional** Valid values: String. One of `glu`, `relu`, `softrelu`, `sigmoid`, or `tanh`. Default value: `glu` | 
| cnn\_hidden\_dropout | Dropout probability for dropout between convolutional layers. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| cnn\_kernel\_width\_decoder | Kernel width for the `cnn` decoder. **Optional** Valid values: positive integer Default value: 5 | 
| cnn\_kernel\_width\_encoder | Kernel width for the `cnn` encoder. **Optional** Valid values: positive integer Default value: 3 | 
| cnn\_num\_hidden | Number of `cnn` hidden units for the encoder and decoder. **Optional** Valid values: positive integer Default value: 512 | 
| decoder\_type | Decoder type. **Optional** Valid values: String. Either `rnn` or `cnn`. Default value: `rnn` | 
| embed\_dropout\_source | Dropout probability for source-side embeddings. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| embed\_dropout\_target | Dropout probability for target-side embeddings. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| encoder\_type | Encoder type. The `rnn` architecture is based on the attention mechanism by Bahdanau et al., and the `cnn` architecture is based on Gehring et al. **Optional** Valid values: String. Either `rnn` or `cnn`. Default value: `rnn` | 
| fixed\_rate\_lr\_half\_life | Half-life for the learning rate, in terms of number of checkpoints, for `fixed_rate_*` schedulers. **Optional** Valid values: positive integer Default value: 10 | 
| learning\_rate | Initial learning rate. **Optional** Valid values: float Default value: 0.0003 | 
| loss\_type | Loss function for training. **Optional** Valid values: String. `cross-entropy` Default value: `cross-entropy` | 
| lr\_scheduler\_type | Learning rate scheduler type. `plateau_reduce` reduces the learning rate whenever `optimized_metric` on the validation data plateaus. `inv_t` is inverse time decay: `learning_rate/(1 + decay_rate * t)`. **Optional** Valid values: String. One of `plateau_reduce`, `fixed_rate_inv_t`, or `fixed_rate_inv_sqrt_t`. Default value: `plateau_reduce` | 
| max\_num\_batches | Maximum number of updates/batches to process. Set to -1 for infinite. **Optional** Valid values: integer Default value: -1 | 
| max\_num\_epochs | Maximum number of epochs to pass through the training data before fitting is stopped. If this parameter is passed, training continues until this number of epochs even if validation accuracy is not improving. Ignored if not passed. **Optional** Valid values: positive integer Default value: none | 
| max\_seq\_len\_source | Maximum length for the source sequence. Sequences longer than this length are truncated to this length. **Optional** Valid values: positive integer Default value: 100  | 
| max\_seq\_len\_target | Maximum length for the target sequence. Sequences longer than this length are truncated to this length. **Optional** Valid values: positive integer Default value: 100 | 
| min\_num\_epochs | Minimum number of epochs the training must run before it is stopped via `early_stopping` conditions. **Optional** Valid values: positive integer Default value: 0 | 
| momentum | Momentum constant used for `sgd`. Don't pass this parameter if you are using `adam` or `rmsprop`. **Optional** Valid values: float Default value: none | 
| num\_embed\_source | Embedding size for source tokens. **Optional** Valid values: positive integer Default value: 512 | 
| num\_embed\_target | Embedding size for target tokens. **Optional** Valid values: positive integer Default value: 512 | 
| num\_layers\_decoder | Number of layers for the `rnn` or `cnn` decoder. **Optional** Valid values: positive integer Default value: 1 | 
| num\_layers\_encoder | Number of layers for the `rnn` or `cnn` encoder. **Optional** Valid values: positive integer Default value: 1 | 
| optimized\_metric | Metric to optimize with early stopping. **Optional** Valid values: String. One of `perplexity`, `accuracy`, or `bleu`. Default value: `perplexity` | 
| optimizer\_type | Optimizer to choose from. **Optional** Valid values: String. One of `adam`, `sgd`, or `rmsprop`. Default value: `adam` | 
| plateau\_reduce\_lr\_factor | Factor to multiply the learning rate with (for `plateau_reduce`). **Optional** Valid values: float Default value: 0.5 | 
| plateau\_reduce\_lr\_threshold | For the `plateau_reduce` scheduler, multiply the learning rate with the reduce factor if `optimized_metric` didn't improve for this many checkpoints. **Optional** Valid values: positive integer Default value: 3 | 
| rnn\_attention\_in\_upper\_layers | Pass the attention to the upper layers of the `rnn`, as in the Google NMT paper. Only applicable if more than one layer is used. **Optional** Valid values: boolean (`true` or `false`) Default value: `true` | 
| rnn\_attention\_num\_hidden | Number of hidden units for attention layers. Defaults to `rnn_num_hidden`. **Optional** Valid values: positive integer Default value: `rnn_num_hidden` | 
| rnn\_attention\_type | Attention model for encoders. `mlp` refers to concat and `bilinear` refers to general from the Luong et al. paper. **Optional** Valid values: String. One of `dot`, `fixed`, `mlp`, or `bilinear`. Default value: `mlp` | 
| rnn\_cell\_type | Specific type of `rnn` architecture. **Optional** Valid values: String. Either `lstm` or `gru`. Default value: `lstm` | 
| rnn\_decoder\_state\_init | How to initialize `rnn` decoder states from encoders. **Optional** Valid values: String. One of `last`, `avg`, or `zero`. Default value: `last` | 
| rnn\_first\_residual\_layer | First `rnn` layer to have a residual connection; only applicable if the number of layers in the encoder or decoder is more than 1. **Optional** Valid values: positive integer Default value: 2 | 
| rnn\_num\_hidden | The number of `rnn` hidden units for the encoder and decoder. This must be a multiple of 2 because the algorithm uses bidirectional long short-term memory (LSTM) by default. **Optional** Valid values: positive even integer Default value: 1024 | 
| rnn\_residual\_connections | Add residual connections to stacked `rnn`. The number of layers should be more than 1. **Optional** Valid values: boolean (`true` or `false`) Default value: `false` | 
| rnn\_decoder\_hidden\_dropout | Dropout probability for the hidden state that combines the context with the `rnn` hidden state in the decoder. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| training\_metric | Metric to track on validation data during training. **Optional** Valid values: String. Either `perplexity` or `accuracy`. Default value: `perplexity` | 
| weight\_decay | Weight decay constant. **Optional** Valid values: float Default value: 0 | 
| weight\_init\_scale | Weight initialization scale (for `uniform` and `xavier` initialization). **Optional** Valid values: float Default value: 2.34 | 
| weight\_init\_type | Type of weight initialization. **Optional** Valid values: String. Either `uniform` or `xavier`. Default value: `xavier` | 
| xavier\_factor\_type | Xavier factor type. **Optional** Valid values: String. One of `in`, `out`, or `avg`. Default value: `in` | 

# Tune a Sequence-to-Sequence Model
<a name="seq-2-seq-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).
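
As a rough sketch of what a tuning setup for this algorithm might look like, the plain dictionaries below mirror the ranges listed in the tables that follow; this structure is illustrative only, not an actual SageMaker API payload (with the SageMaker Python SDK, these would instead be expressed as `IntegerParameter`, `ContinuousParameter`, and `CategoricalParameter` objects passed to a `HyperparameterTuner`):

```python
# Illustrative tuning configuration; names match the hyperparameter tables.
hyperparameter_ranges = {
    "num_layers_encoder": {"type": "integer", "min": 1, "max": 10},
    "num_layers_decoder": {"type": "integer", "min": 1, "max": 10},
    "learning_rate": {"type": "continuous", "min": 0.00005, "max": 0.2},
    "optimizer_type": {"type": "categorical", "values": ["adam", "sgd", "rmsprop"]},
}
objective = {"metric_name": "validation:bleu", "direction": "Maximize"}
```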

## Metrics Computed by the Sequence-to-Sequence Algorithm
<a name="seq-2-seq-metrics"></a>

The sequence to sequence algorithm reports three metrics that are computed during training. Choose one of them as an objective to optimize when tuning the hyperparameter values.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy |  Accuracy computed on the validation dataset.  |  Maximize  | 
| validation:bleu |  [BLEU](https://en.wikipedia.org/wiki/BLEU) score computed on the validation dataset. Because BLEU computation is expensive, you can choose to compute BLEU on a random subsample of the validation dataset to speed up the overall training process. Use the `bleu_sample_size` parameter to specify the subsample.  |  Maximize  | 
| validation:perplexity |  [Perplexity](https://en.wikipedia.org/wiki/Perplexity) is a loss function computed on the validation dataset. Perplexity measures the cross-entropy between an empirical sample and the distribution predicted by a model, and so provides a measure of how well a model predicts the sample values. Models that are good at predicting a sample have a low perplexity.  |  Minimize  | 
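
As a worked example of the perplexity metric, a small sketch (the `perplexity` helper is hypothetical, written here just to illustrate the exponential-of-average-cross-entropy relationship):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood (in nats)
    over the reference tokens; lower is better.
    token_probs: model probability assigned to each reference token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0 -- uniform over 4 outcomes
```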

## Tunable Sequence-to-Sequence Hyperparameters
<a name="seq-2-seq-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the SageMaker AI Sequence to Sequence algorithm. The hyperparameters that have the greatest impact on sequence to sequence objective metrics are: `batch_size`, `optimizer_type`, `learning_rate`, `num_layers_encoder`, and `num_layers_decoder`.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| num\_layers\_encoder |  IntegerParameterRange  |  [1-10]  | 
| num\_layers\_decoder |  IntegerParameterRange  |  [1-10]  | 
| batch\_size |  CategoricalParameterRange  |  [16,32,64,128,256,512,1024,2048]  | 
| optimizer\_type |  CategoricalParameterRange  |  ['adam', 'sgd', 'rmsprop']  | 
| weight\_init\_type |  CategoricalParameterRange  |  ['xavier', 'uniform']  | 
| weight\_init\_scale |  ContinuousParameterRange  |  For the `xavier` type: MinValue: 2.0, MaxValue: 3.0; for the `uniform` type: MinValue: -1.0, MaxValue: 1.0  | 
| learning\_rate |  ContinuousParameterRange  |  MinValue: 0.00005, MaxValue: 0.2  | 
| weight\_decay |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.1  | 
| momentum |  ContinuousParameterRange  |  MinValue: 0.5, MaxValue: 0.9  | 
| clip\_gradient |  ContinuousParameterRange  |  MinValue: 1.0, MaxValue: 5.0  | 
| rnn\_num\_hidden |  CategoricalParameterRange  |  Applicable only to recurrent neural networks (RNNs). [128,256,512,1024,2048]  | 
| cnn\_num\_hidden |  CategoricalParameterRange  |  Applicable only to convolutional neural networks (CNNs). [128,256,512,1024,2048]  | 
| num\_embed\_source |  IntegerParameterRange  |  [256-512]  | 
| num\_embed\_target |  IntegerParameterRange  |  [256-512]  | 
| embed\_dropout\_source |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| embed\_dropout\_target |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| rnn\_decoder\_hidden\_dropout |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| cnn\_hidden\_dropout |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| lr\_scheduler\_type |  CategoricalParameterRange  |  ['plateau_reduce', 'fixed_rate_inv_t', 'fixed_rate_inv_sqrt_t']  | 
| plateau\_reduce\_lr\_factor |  ContinuousParameterRange  |  MinValue: 0.1, MaxValue: 0.5  | 
| plateau\_reduce\_lr\_threshold |  IntegerParameterRange  |  [1-5]  | 
| fixed\_rate\_lr\_half\_life |  IntegerParameterRange  |  [10-30]  | 