NLP Fundamentals — Sequence Modeling (P6)

Duy Anh Nguyen
4 min readMar 5, 2020


Sequence prediction tasks require us to label each item of a sequence. Such tasks are common in natural language processing. Some examples include language modeling. in which we predict the next word given a sequence of words at each step; part-of-speech tagging, in which we predict the grammatical part of speech for each word; named entity recognition, in which we predict whether each word is part of a named entity, such as Person, Location, Product, or Organization; and so on. Sometimes, in NLP literature, the sequence prediction tasks are also referred to as sequence labeling.

Although in theory we can use the Elman recurrent neural networks for sequence prediction tasks, they fail to capture long-range dependencies well and perform poorly in practice. In this section, we spend some time understanding why that is the case and learn about a new type of RNN architecture called the gated network. We also introduce the task of natural language generation as an application of sequence prediction and explore conditioned generation in which the output sequence is constrained in some manner.

The Problem with Vanilla RNNs (or Elman RNNs)

Even though the vanilla/Elman RNN is well suited for modeling sequences, it has two issues that make it unsuitable for many tasks: the inability to retain information for long-range predictions, and gradient stability. To understand these two issues, recall that at their core, RNNs are computing a hidden state vector at each time step using the hidden state vector of the previous time step and an input vector at the current time step. It is this core computation that makes the RNN so powerful, but it also creates drastic numerical issues.

The first issue with Elman RNNs is the difficulty in retaining long-range information. With the RNN , for example, at each time step we simply updated the hidden state vector regardless of whether it made sense. As a consequence, the RNN has no control over which values are retained and which are discarded in the hidden state — that is entirely determined by the input. Intuitively, that doesn’t make sense. What is desired is some way for the RNN to decide if the update is optional, or if the update happens, by how much and what parts of the state vector, and so on.

The second issue with Elman RNNs is their tendency to cause gradients to spiral out of control to zero or to infinity. Unstable gradients that can spiral out of control are called either vanishing gradients or exploding gradients depending on the direction in which the absolute values of the gradients are shrinking/growing. A really large absolute value of the gradient or a really small (less than 1) value can make the optimization procedure unstable (Hochreiter et al., 2001; Pascanu et al., 2013).

There are solutions to deal with these gradient problems in vanilla RNNs, such as the use of rectified linear units (ReLUs), gradient clipping, and careful initialization. But none of the proposed solutions work as reliably as the technique called gating.

Gating as a Solution to a Vanilla RNN’s Challenges

To intuitively understand gating, suppose that you were adding two quantities, a and b, but you wanted to control how much of b gets into the sum. Mathematically, you can rewrite the sum a + b as:


where λ is a value between 0 and 1. If λ = 0, there is no contribution from b and if λ = 1, b contributes fully. Looking at it this way, you can interpret that λ acts as a “switch” or a “gate” in controlling the amount of b that gets into the sum. This is the intuition behind the gating mechanism. Now let’s revisit the Elman RNN and see how gating can be incorporated into vanilla RNNs to make conditional updates. If the previous hidden state was ht−1 and the current input is xt, the recurrent update in the Elman RNN would look something like:


Now it becomes clear that the function λ controls how much of the current input gets to update the state ht−1. Further, the function λ is context-dependent. This is the basic intuition behind all gated networks. The function λ is usually a sigmoid function. In the case of the long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997), this basic intuition is extended carefully to incorporate not only conditional updates, but also intentional forgetting of the values in the previous hidden state ht−1. This “forgetting” happens by multiplying the previous hidden state value ht−1 with another function, μ, that also produces values between 0 and 1 and depends on the current input:


The LSTM is only one of the many gated variants of the RNN. Another variant that’s becoming increasingly popular is the gated recurrent unit (GRU; Chung et al., 2015)

Example: A Character RNN for Generating Surnames

In this example, we practice through a simple sequence prediction task: generating surnames using an RNN. In practice, this means that for each time step, the RNN is computing a probability distribution over the set of possible characters in the surname. We can use these probability distributions, either to optimize the network to improve its predictions (given that we know what characters should have been predicted), or to generate brand-new surnames.

Data could be found at: