NLP Fundamentals — Sequence Modeling (P5)

Duy Anh Nguyen
3 min read · Mar 5, 2020

A sequence is an ordered collection of items. Traditional machine learning assumes data points are independently and identically distributed (IID), but in many situations, such as language, speech, and time-series data, one data item depends on the items that precede or follow it. Such data is also called sequence data. Sequential information is everywhere in human language. For example, speech can be considered a sequence of basic units called phonemes, and in a language like English, the words in a sentence are not haphazard: they are constrained by the words that come before and after them.

In deep learning, modeling sequences involves maintaining hidden “state information,” or a hidden state. As each item in the sequence is encountered — for example, as each word in a sentence is seen by the model — the hidden state is updated. Thus, the hidden state (usually a vector) encapsulates everything seen in the sequence so far. This hidden state vector, also called a sequence representation, can then be used in many ways depending on the task we are solving, ranging from classifying entire sequences to predicting the next item in a sequence.

We begin by introducing the most basic neural network sequence model: the recurrent neural network (RNN). After this, we present an end-to-end example of the RNN in a classification setting.

Introduction to Recurrent Neural Networks

The goal of recurrent networks, whether the basic Elman form or more complicated variants, is to learn a representation of a sequence. This is done by maintaining a hidden state vector that captures the current state of the sequence. The hidden state vector is computed from the current input vector and the previous hidden state vector. These relationships are shown in the figure below, which gives both the functional (left) and the “unrolled” (right) view of the computational dependencies. In both illustrations, the output is the same as the hidden vector. This is not always the case, but for an Elman RNN, the hidden vector is what is predicted.

Let’s look at a slightly more specific description to understand what is happening in the Elman RNN. As shown in the unrolled view in Figure 6-1 (the unrolling that underlies backpropagation through time, or BPTT), the input vector from the current time step and the hidden state vector from the previous time step are mapped to the hidden state vector of the current time step.
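Concretely, the Elman update computes the new hidden state by applying a nonlinearity to a weighted combination of the current input and the previous hidden state: h_t = tanh(W_ih·x_t + b_ih + W_hh·h_{t-1} + b_hh). A minimal sketch of that single step in plain PyTorch might look like the following; the tensor names and sizes are illustrative, not fixed by the model.

```python
import torch

# Illustrative sizes: input features of size 4, hidden state of size 3.
input_size, hidden_size = 4, 3

# Shared parameters (reused at every time step).
W_ih = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
b_ih = torch.zeros(hidden_size)
b_hh = torch.zeros(hidden_size)

x_t = torch.randn(input_size)        # input vector at the current time step
h_prev = torch.zeros(hidden_size)    # hidden state from the previous time step

# Elman update: new hidden state from the current input and the previous hidden state.
h_t = torch.tanh(W_ih @ x_t + b_ih + W_hh @ h_prev + b_hh)
print(h_t.shape)  # torch.Size([3])
```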

Crucially, the hidden-to-hidden and input-to-hidden weights are shared across the different time steps. The intuition you should take away from this fact is that, during training, these weights will be adjusted so that the RNN is learning how to incorporate incoming information and maintain a state representation summarizing the input seen so far. The RNN does not have any way of knowing which time step it is on. Instead, it is simply learning how to transition from one time step to another and maintain a state representation that will minimize its loss function.

Because words and sentences can be of different lengths, the RNN or any sequence model should be equipped to handle variable-length sequences. One possible technique is to restrict sequences to a fixed length artificially.
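For instance, one way to impose a fixed length is to truncate longer sequences and pad shorter ones. A small sketch of that idea, assuming integer token indices with 0 reserved as the padding value and a hypothetical helper named to_fixed_length, might look like this:

```python
import torch

def to_fixed_length(sequences, max_length, pad_value=0):
    """Truncate or pad each 1-D index tensor in `sequences` to exactly `max_length`."""
    batch = torch.full((len(sequences), max_length), pad_value, dtype=torch.long)
    for i, seq in enumerate(sequences):
        length = min(len(seq), max_length)
        batch[i, :length] = seq[:length]   # truncate if too long, pad otherwise
    return batch

# Example: three token-index sequences of different lengths.
sequences = [torch.tensor([5, 2, 9]),
             torch.tensor([7, 1]),
             torch.tensor([3, 8, 4, 6, 2])]
print(to_fixed_length(sequences, max_length=4))
# tensor([[5, 2, 9, 0],
#         [7, 1, 0, 0],
#         [3, 8, 4, 6]])
```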

Implementing an Elman RNN

To explore the details of RNNs, let’s step through a simple implementation of the Elman RNN. PyTorch offers many useful classes and helper functions for building RNNs. The PyTorch RNN class itself implements the Elman RNN. Instead of using this class directly, we use RNNCell, an abstraction for a single time step of the RNN, and construct an RNN from it, as sketched below.
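A minimal sketch of such a model, assuming batch-first input of shape (batch, seq_len, features) and the illustrative class name ElmanRNN, could look like this:

```python
import torch
import torch.nn as nn

class ElmanRNN(nn.Module):
    """An Elman RNN built from nn.RNNCell, applied one time step at a time."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn_cell = nn.RNNCell(input_size, hidden_size)  # same cell (and weights) at every step

    def forward(self, x_in, initial_hidden=None):
        # x_in: (batch, seq_len, input_size)
        batch_size, seq_len, _ = x_in.size()
        hidden_t = initial_hidden
        if hidden_t is None:
            hidden_t = torch.zeros(batch_size, self.hidden_size, device=x_in.device)

        hiddens = []
        for t in range(seq_len):
            hidden_t = self.rnn_cell(x_in[:, t, :], hidden_t)  # one Elman update
            hiddens.append(hidden_t)

        # (batch, seq_len, hidden_size): the hidden state at every time step
        return torch.stack(hiddens, dim=1)
```

Because the same RNNCell is applied at every time step, the weights are shared across time, which is exactly the weight sharing described above; the output is the hidden state vector at every position in the sequence.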

Example: Using an RNN to Classify Names

The data can be found at:
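As a sketch of the classification setting itself (data loading is omitted, and the class name NameClassifier, the character-vocabulary size, and the number of classes are all hypothetical placeholders), a character-level name classifier could look like this, using PyTorch’s built-in RNN class, which, as noted above, implements the Elman RNN:

```python
import torch
import torch.nn as nn

class NameClassifier(nn.Module):
    """Classify a name (a padded sequence of character indices) into one of several classes."""
    def __init__(self, num_chars, embedding_size, hidden_size, num_classes, padding_idx=0):
        super().__init__()
        self.emb = nn.Embedding(num_chars, embedding_size, padding_idx=padding_idx)
        # nn.RNN implements the Elman RNN; batch_first expects (batch, seq_len, features).
        self.rnn = nn.RNN(embedding_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x_in):
        embedded = self.emb(x_in)          # (batch, seq_len, embedding_size)
        outputs, _ = self.rnn(embedded)    # hidden state at every time step
        last_hidden = outputs[:, -1, :]    # sequence representation: final hidden state
        return self.fc(last_hidden)        # unnormalized class scores (logits)

# Hypothetical usage: a vocabulary of 30 characters and 5 nationality classes.
model = NameClassifier(num_chars=30, embedding_size=16, hidden_size=32, num_classes=5)
names = torch.randint(1, 30, (8, 12))      # a batch of 8 padded names, 12 characters each
logits = model(names)                      # shape: (8, 5)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 5, (8,)))
loss.backward()                            # gradients flow back through time (BPTT)
```

Each name is encoded as a fixed-length sequence of character indices; the final hidden state serves as the sequence representation, and a linear layer maps it to class scores trained with cross-entropy loss.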
