How Sequence-to-Sequence Works
Typically, a neural network for sequence-to-sequence modeling consists of a few layers, including:
- An embedding layer. In this layer, the input matrix, which contains the input tokens encoded in a sparse way (for example, one-hot encoded), is mapped to a dense feature space. This is required because a high-dimensional dense feature vector can encode more information about a particular token (a word, for text corpora) than a simple one-hot-encoded vector. It is also standard practice either to initialize this embedding layer with pre-trained word vectors such as FastText or GloVe, or to initialize it randomly and learn the parameters during training.
- An encoder layer. After the input tokens are mapped into a high-dimensional feature space, the sequence is passed through an encoder layer that compresses all the information from the input embeddings (of the entire sequence) into a fixed-length feature vector. Typically, the encoder is built from RNN-type networks such as long short-term memory (LSTM) or gated recurrent units (GRU). (Colah's blog explains LSTM in great detail.)
- A decoder layer. The decoder takes the encoded feature vector and produces the output sequence of tokens. This layer is also usually built with RNN architectures (LSTM or GRU). A minimal sketch of all three layers appears after this list.
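To make the three layers concrete, here is a minimal sketch in PyTorch of how they fit together. The class name, variable names, vocabulary sizes, embedding dimension, and hidden size are placeholder assumptions for illustration, not a prescribed implementation.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        # Embedding layers: map sparse token ids to dense feature vectors.
        self.src_embedding = nn.Embedding(src_vocab_size, emb_dim)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, emb_dim)
        # Encoder: compresses the whole source sequence into fixed-length states.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # Decoder: generates the output sequence conditioned on the encoder states.
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # Projects decoder hidden states to scores over the target vocabulary.
        self.output_proj = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens, tgt_tokens: (batch, seq_len) tensors of token ids
        src_emb = self.src_embedding(src_tokens)
        _, (hidden, cell) = self.encoder(src_emb)           # fixed-length summary of the source
        tgt_emb = self.tgt_embedding(tgt_tokens)
        dec_out, _ = self.decoder(tgt_emb, (hidden, cell))  # decode conditioned on that summary
        return self.output_proj(dec_out)                    # (batch, seq_len, tgt_vocab_size)
```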
The whole model is trained jointly to maximize the probability of the target sequence
given the source sequence. This model was first introduced by Sutskever et al.
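In practice, maximizing the probability of the target sequence given the source is done by minimizing the cross-entropy of the target tokens, with gradients flowing through the decoder, encoder, and embeddings together. The snippet below reuses the hypothetical Seq2Seq class from the sketch above; the toy tensors and the use of teacher forcing are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Toy batch: 2 source sequences (length 7) and 2 target sequences (length 5) of token ids.
model = Seq2Seq(src_vocab_size=1000, tgt_vocab_size=1000)
src = torch.randint(0, 1000, (2, 7))
tgt = torch.randint(0, 1000, (2, 5))

# Teacher forcing: the decoder input is the target shifted right by one position,
# and the model is trained to predict the next target token at each step.
logits = model(src, tgt[:, :-1])                  # (batch, tgt_len - 1, vocab)
loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),          # flatten batch and time steps
    tgt[:, 1:].reshape(-1),                       # next-token targets
)
loss.backward()                                   # joint gradients for all layers
```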
Attention mechanism. The disadvantage of this encoder-decoder framework is that model performance degrades as the length of the source sequence increases, because there is a limit to how much information the fixed-length encoded feature vector can hold. To tackle this problem, Bahdanau et al. proposed the attention mechanism in 2015. For more details, see the whitepaper Effective Approaches to Attention-based Neural Machine Translation.
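The sketch below shows one way to compute additive (Bahdanau-style) attention over the encoder outputs; the module structure, names, and shapes are assumptions for illustration and do not reproduce the exact formulation from either paper.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Illustrative additive attention over encoder outputs."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.w_query = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_keys = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, hidden_dim); encoder_outputs: (batch, src_len, hidden_dim)
        query = self.w_query(decoder_state).unsqueeze(1)             # (batch, 1, hidden_dim)
        keys = self.w_keys(encoder_outputs)                          # (batch, src_len, hidden_dim)
        scores = self.score(torch.tanh(query + keys)).squeeze(-1)    # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)                      # attention distribution
        # Context vector: weighted sum of encoder outputs.
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights
```

Instead of relying on a single fixed-length vector, the decoder recomputes this context at every decoding step, which is what lets the model cope with longer source sequences.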