Sequence-to-sequence (S2S) models are a special case of a general family of models called encoder–decoder models. An encoder–decoder model is a composition of two models, an “encoder” and a “decoder,” that are typically jointly trained. The encoder model takes an input and produces an encoding or a representation (ϕ) of the input, which is usually a vector.1 The goal of the encoder is to capture important properties of the input with respect to the task at hand. The goal of the decoder is to take the encoded input and produce a desired output. From this understanding of encoders and decoders, we define S2S models as encoder–decoder models in which the encoder and decoder are sequence models and the inputs and outputs are both sequences, possibly of different lengths.
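The composition described above can be sketched in PyTorch. This is a minimal illustration, not a production model; the class names, dimensions, and the choice of a GRU are arbitrary assumptions. The encoder compresses the source sequence into a representation ϕ (here, the GRU's final hidden state), which initializes the decoder:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, src_len) -> phi: (1, batch, hidden_dim)
        _, phi = self.rnn(self.emb(x))
        return phi

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, y, phi):
        # y: (batch, tgt_len); phi conditions the decoder as its initial state
        h, _ = self.rnn(self.emb(y), phi)
        return self.out(h)  # logits: (batch, tgt_len, vocab_size)

enc, dec = Encoder(100, 32, 64), Decoder(100, 32, 64)
src = torch.randint(0, 100, (2, 7))   # source sequence, length 7
tgt = torch.randint(0, 100, (2, 5))   # target sequence, length 5
logits = dec(tgt, enc(src))
print(logits.shape)  # torch.Size([2, 5, 100])
```

Note that the input and output lengths differ (7 versus 5), which is exactly what distinguishes S2S models from sequence labeling.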
One way to view encoder–decoder models is as a special case of models called conditioned generation models. In conditioned generation, instead of the input representation ϕ, a general conditioning context c influences a decoder to produce an output. When the conditioning context c comes from an encoder model, conditioned generation is the same as an encoder–decoder model. Not all conditioned generation models are encoder–decoder models, because it is possible for the conditioning context to be derived from a structured source. Consider the example of a weather report generator: the values of the temperature, humidity, and wind speed and direction could “condition” a decoder to generate the textual weather report.
The figure shows the encoder “encoding” the entire input into a representation, ϕ, that conditions the decoder to generate the right output. You can use any RNN as an encoder, be it an Elman RNN, LSTM, or GRU. In the next two sections, we introduce two vital components of modern-day S2S models. First, we look at the bidirectional recurrent model, which combines forward and backward passes over a sequence to create richer representations. Then, in “Capturing More from a Sequence: Attention”, we introduce and survey the attention mechanism, which is useful for focusing on the parts of the input that are relevant to the task. Both sections are vital for building nontrivial S2S model–based solutions.
Capturing More from a Sequence: Bidirectional Recurrent Models
One way to understand a recurrent model is to look at it as a black box that encodes a sequence into a vector. When modeling a sequence, it is useful to observe not just the words in the past but also the words that appear in the future. Consider the following sentence:
The man who hunts ducks out on the weekends.
If the model were to observe only from left to right, its representation for “ducks” would be different from that of a model that had also observed the words from right to left. Humans do this sort of retroactive meaning updating all the time. Taken together, information from the past and the future can robustly represent the meaning of a word in a sequence. This is the goal of bidirectional recurrent models. Any of the models in the recurrent family, such as Elman RNNs, LSTMs, or GRUs, could be used in such a bidirectional formulation. Notice how there is a “forward” representation and a “backward” representation for each word in the input, which are concatenated to produce the final representation for the word in question. What’s not shown here is the final classification layer, consisting of a Linear layer and a softmax, at each time step.
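The forward–backward concatenation comes for free in PyTorch's recurrent modules via the `bidirectional=True` flag; a minimal sketch (dimensions are arbitrary):

```python
import torch
import torch.nn as nn

# A bidirectional LSTM: PyTorch runs a forward and a backward pass over the
# sequence and concatenates the two hidden states at each time step.
hidden_dim = 16
birnn = nn.LSTM(input_size=8, hidden_size=hidden_dim,
                batch_first=True, bidirectional=True)

x = torch.randn(2, 10, 8)     # (batch, seq_len, features)
out, _ = birnn(x)
print(out.shape)              # (2, 10, 32): forward and backward concatenated

# The first hidden_dim features are the forward representation,
# the last hidden_dim features are the backward representation.
fwd, bwd = out[..., :hidden_dim], out[..., hidden_dim:]
```

Each word's final representation is thus twice the hidden size, combining what came before it and what comes after it.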
Capturing More from a Sequence: Attention
Using just the final hidden state of the encoder as the encoding is a limitation: for long inputs, a single fixed-size vector must summarize the entire sequence. Another problem with long inputs is that the gradients vanish when back-propagating through time, making training difficult.
This process of encode-first-then-decode might appear a little strange to bilingual or multilingual readers who have ever attempted to translate. As humans, we don’t usually distill the meaning of a sentence and then generate the translation from that meaning. For example, when we see the French word pour we know there will be a for; similarly, breakfast is on our mind when we see petit-déjeuner, and so on. In other words, our minds focus on the relevant parts of the input while producing output. This phenomenon is called attention. Attention has been widely studied in neuroscience and other allied fields, and it is what makes us quite successful despite having limited memories.
In an analogous fashion, we would like our sequence generation models to incorporate attention to different parts of the input and not just the final summary of the entire input. This is called the attention mechanism. The first models to incorporate a notion of attention for natural language processing were, incidentally, machine translation models by Bahdanau et al. (2015). Since then, several kinds of attention mechanisms and several approaches to improving attention have been proposed. In this section, we review some of the basic attention mechanisms and introduce some terminology related to attention. Attention has proven extremely useful in improving the performance of deep learning systems with complex inputs and complex outputs. In fact, Bahdanau et al. show that the performance of a machine translation system, as measured by “BLEU score,” degrades with increasing input length in the absence of an attention mechanism.
Attention in Deep Neural Networks
There are several ways to implement attention. The simplest and the most commonly used is the content-aware mechanism. You can see content-aware attention in action in “Example: Neural Machine Translation”. Another popular attention mechanism is location-aware attention, which depends only on the query vector and the position in the input. The attention weights are typically floating-point values between 0 and 1; this is called soft attention. In contrast, it is possible to learn a binary 0/1 vector for attention; this is called hard attention.
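A minimal sketch of content-aware soft attention, assuming dot-product scoring (the function name and dimensions here are illustrative, not from a specific library): scores are computed between a decoder query and every encoder state, a softmax turns them into weights between 0 and 1, and the weighted sum yields a context vector.

```python
import torch
import torch.nn.functional as F

def soft_attention(query, keys):
    # query: (batch, hidden) decoder state
    # keys:  (batch, seq_len, hidden) encoder states for every time step
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)   # (batch, seq_len)
    weights = F.softmax(scores, dim=1)        # soft: floats in (0, 1), sum to 1
    context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)  # (batch, hidden)
    return context, weights

query = torch.randn(2, 16)
keys = torch.randn(2, 7, 16)
context, weights = soft_attention(query, keys)
print(context.shape)          # torch.Size([2, 16])
print(weights.sum(dim=1))     # each row sums to 1
```

Hard attention would instead pick a single position (a 0/1 vector), which is not differentiable and typically requires techniques such as reinforcement learning to train.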
The attention mechanism illustrated in the following figure depends on the encoder states for all the time steps in the input. This is also known as global attention. In contrast, for local attention, you could devise an attention mechanism that depended only on a window of the input around the current time step.
Sometimes, especially in machine translation, the alignment information could be explicitly provided as a part of the training data. In such situations, a supervised attention mechanism could be devised to learn the attention function using a separate neural network that’s jointly trained. For large inputs such as documents, it is possible to design a coarse- to fine-grained (or hierarchical) attention mechanism that not only focuses on the immediate input but also takes into account the structure of the document: paragraphs, sections, chapters, and so on.
The work on transformer networks by Vaswani et al. (2017) introduces multiheaded attention, in which multiple attention vectors are used to track different regions of the input. They also popularized the concept of self-attention, a mechanism whereby the model learns which regions of the input influence one another.
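Self-attention, where a sequence attends to itself, is available in PyTorch as `nn.MultiheadAttention`; a small sketch with arbitrary dimensions (query, key, and value are all the same sequence):

```python
import torch
import torch.nn as nn

# Multiheaded self-attention: 4 heads each attend to the 32-dim input,
# letting different heads track different regions of the sequence.
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

x = torch.randn(2, 9, 32)    # (batch, seq_len, embed_dim)
out, attn = mha(x, x, x)     # every position attends to every other position
print(out.shape)             # torch.Size([2, 9, 32])
print(attn.shape)            # torch.Size([2, 9, 9]): weights over positions
```

The returned attention weights form a seq_len × seq_len matrix per example, showing how much each position influences each other position.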
When the input is multimodal — for example, both image and speech — it is possible to design a multimodal attention mechanism. The literature on attention, although new, is already vast, indicating the importance of this topic. Covering each of the approaches in detail is beyond the scope of this book, and we direct you to Luong, Pham, and Manning (2015) and Vaswani et al. (2017) as a starting point.
Sequence-to-sequence without attention:
Sequence-to-sequence with attention:
A great blog post about the transformer: The Illustrated Transformer