Recurrent Neural Networks — Part 1

The neural network architectures you’ve seen so far (MLPs and CNNs) were trained using the current inputs only. We did not consider previous inputs when generating the current output. In other words, our systems did not have any memory elements. RNNs address this basic and important issue by using memory (i.e., past inputs to the network) when producing the current output.

RNN Introduction

RNNs are artificial neural networks that can capture temporal dependencies, which are dependencies over time.

If you look up the definition of the word recurrent, you will find that it simply means occurring often or repeatedly. So why are these networks called recurrent neural networks? It’s simply because with RNNs we perform the same task for each element in the input sequence.

RNN History

The first attempt to add memory to neural networks was the Time Delay Neural Network, or TDNN for short. In TDNNs, inputs from past time-steps were introduced to the network input, changing the actual external inputs. This had the clear advantage of allowing the network to look beyond the current time-step, but it also introduced a clear disadvantage: the temporal dependencies were limited to the chosen time window.

Simple RNNs, also known as Elman networks and Jordan networks, were next to follow. We will talk about all of those later. It was recognized in the early 90s that all of these networks suffer from what we call the vanishing gradient problem, in which contributions of information decay geometrically over time. So, capturing relationships that spanned more than eight or ten steps back was practically impossible.

Despite the elegance of these networks, they all had this key flaw. In the mid 90s, Long Short-Term Memory cells, or LSTMs in short, were invented to address this very problem. The key novelty in LSTMs was the idea that some signals, what we call state variables, can be kept fixed by using gates, and re-introduced or not at an appropriate time in the future. In this way, arbitrary time intervals can be represented, and temporal dependencies can be captured.

LSTM is one option to overcome the Vanishing Gradient problem in RNNs.

Please use these resources if you would like to read more about the Vanishing Gradient problem or understand further the concept of a Geometric Series and how its values may exponentially decrease.


RNN Applications

The world’s leading tech companies are all using RNNs and LSTMs in their applications. Let’s take a look at some of those. One is speech recognition, where a sequence of data samples extracted from an audio signal is continuously mapped to text. Good examples are Google Assistant, Apple’s Siri, Amazon’s Alexa, and Nuance’s Dragon solutions. All of these use RNNs as a part of their speech recognition software.

Another is time series prediction, where we predict traffic patterns on specific roads to help drivers optimize their driving paths, as Waze does, or predict what movie a consumer will want to watch next, as Netflix does. A further example is predicting stock price movements based on historical patterns of stock movements and potentially other market conditions that change over time. This is practiced by most quantitative hedge funds.

A third is Natural Language Processing, or NLP for short, such as the machine translation used by Google or Salesforce. Another example is question answering, as in Google Analytics: if you’ve got a question about your app, you’ll soon be able to ask Google Analytics directly. Many companies, such as Google, Baidu, and Slack, are using RNNs to drive the Natural Language Processing in their dialogue engines.

There are so many interesting applications, let’s look at a few more!

Feedforward Neural Network — A Reminder

The mathematical calculations needed for training RNN systems are fascinating. To deeply understand the process, we first need to feel confident with the vanilla FFNN system. We need to thoroughly understand the feedforward process, as well as the back-propagation process used in the training phases of such a system. Let’s recall the process we use in feedforward neural networks. We can have many hidden layers between the inputs and the outputs, but for simplicity, we will start with a single hidden layer.

You may be familiar with the concept of convolutional neural networks, or CNNs for short. When implementing your neural net, you will find that you can combine these techniques. For example, one can use CNNs in the first few layers for feature extraction, and then use RNNs in the final layer, where memory needs to be considered.

When working on a feedforward neural network, we actually simulate an artificial neural network by using a nonlinear function approximation. That function will act as a system that has n number of inputs, weights, and k number of outputs. We will use x as the input vector and y as the output vector. Inputs and outputs can also be many-to-many, many-to-one, and one-to-many.

There are two main types of applications. One is classification, where we identify which of a set of categories a new input belongs to. For example, in image classification the neural network receives an image as input and determines whether it shows a cat.

The other application is regression, where we approximate a function, so the network produces continuous values following a supervised training process. A simple example can be time series forecasting, where we predict the price of a stock tomorrow based on the price of the stock over the past five days. The input to the network would be five values representing the price of the stock for each of the past five days, and the output we want is tomorrow’s price.
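The five-day setup described above can be sketched in a few lines. This is only an illustration: the helper name `make_windows` and the price values are made up for the example, and NumPy is assumed as the array library.

```python
import numpy as np

def make_windows(prices, window=5):
    """Split a 1-D price series into (past `window` prices, next price) pairs."""
    prices = np.asarray(prices, dtype=float)
    # Each row holds `window` consecutive prices: the network input.
    inputs = np.stack([prices[i:i + window] for i in range(len(prices) - window)])
    # The target for each row is the price on the following day.
    targets = prices[window:]
    return inputs, targets

# Made-up closing prices, for illustration only.
prices = [10.0, 10.5, 10.2, 10.8, 11.0, 11.3, 11.1, 11.6]
X, y = make_windows(prices)
print(X.shape, y.shape)
```

Each row of `X` is one five-value input, and the matching entry of `y` is the value the network would be trained to predict.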

Our task in neural networks is to find the set of weights W that yields a good output y for the inputs x. We start with random weights. In feedforward neural networks, we have a static mapping from the inputs to the outputs. We use the word static because we have no memory, and the output depends only on the inputs and the weights. In other words, for the same input and the same weights, we always receive the same output.

Generally speaking, when working with neural networks, we have two primary phases: training and evaluation. In the training phase, we take a dataset called the training set, which includes many pairs of inputs and their corresponding targets or outputs. The goal is to find a set of weights that best maps the inputs to the desired outputs. In other words, the goal of the training phase is to yield a network that generalizes beyond the training set.

In the evaluation phase, we use the network that was created in the training phase, apply our new inputs, and expect to obtain the desired outputs.

Let’s look at a basic model of an artificial neural network, where we have only a single hidden layer. The inputs are each connected to the neurons in the hidden layer, and the neurons in the hidden layer are each connected to the neurons in the output layer, where each neuron represents a single output. We can look at it as a collection of mathematical functions. Each input is connected mathematically to a hidden layer of neurons through a set of weights we need to modify, and each hidden layer neuron is connected to the output layer in a similar way.

There is no limit to the number of inputs, the number of hidden neurons in a layer, or the number of outputs, so we can have n inputs, m hidden neurons, and k outputs. In a closer, even more simplistic look, we can see that each input is multiplied by its corresponding weight and added at the next layer’s neuron, together with a bias. The bias is an external parameter of the neuron and can be modeled by adding an external fixed-value input. This entire summation will usually go through an activation function to the next layer or to the output. This is the feedforward part of the training phase.
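To make the single-neuron computation concrete, here is a minimal sketch. The numeric values are arbitrary, and tanh is just one assumed choice of activation function; the point is that a bias gives the same result as an extra weight attached to a fixed input of 1.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # three inputs (illustrative values)
w = np.array([0.1, 0.4, -0.2])   # one weight per input
b = 0.3                          # bias of this neuron

# Weighted sum plus bias, passed through an activation (tanh assumed here).
z = np.dot(x, w) + b
out = np.tanh(z)

# Same neuron, with the bias modeled as a weight on a fixed input of 1.
x_ext = np.append(x, 1.0)
w_ext = np.append(w, b)
out_ext = np.tanh(np.dot(x_ext, w_ext))

print(out, out_ext)  # the two values agree
```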

In the backpropagation part, we will change the weights as we try to minimize the error, and start the feedforward process all over again.

The Feedforward Process

Let’s look at the feedforward part first. To make our computations easier, let’s decide to have n inputs, three neurons in a single hidden layer, and two outputs.

By the way, in practice, we can have thousands of neurons in a single hidden layer. We will use W_1 as the set of weights from x to h, and W_2 as the set of weights from h to y. Since we have only one hidden layer, we will have only two steps in each feedforward cycle. In step one, we find h from a given input and the set of weights W_1. In step two, we find the output y from the calculated h and the set of weights W_2. You will find that, other than the use of non-linear activation functions, all of the calculations involve linear combinations of inputs and weights. In other words, we will use matrix multiplications.

Let’s start with step number one, finding h. Notice that if we have more than one neuron in the hidden layer, which is usually the case, h is actually a vector. We have our initial inputs x, which is also a vector, and we want to find the values of the hidden neurons, h. Each input is connected to each neuron in the hidden layer. For simplicity, we will use the following indices: W_11 connects x_1 to h_1, W_13 connects x_1 to h_3, W_21 connects x_2 to h_1, W_n3 connects x_n to h_3, and so on.

The vector of inputs x_1, x_2, all the way up to x_n, is multiplied by the weight matrix W_1 to give us the hidden neurons. So the vector h' equals the vector x multiplied by the weight matrix W_1. In this case, we have a weight matrix with n rows, as we have n inputs, and three columns, as we have three neurons in the hidden layer. If you multiply the input vector by the weight matrix, you get a simple linear combination for each neuron in the hidden layer, giving us the vector h'. So, for example, h'_1 will be x_1 times W_11, plus x_2 times W_21, and so on, up to x_n times W_n1.

But we are not done with calculating the hidden layer yet. Notice the prime symbol I’ve been using? I used it to remind us that we are not done with finding h yet.
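Step one, before any activation function is applied, is a single vector-by-matrix product. The sketch below assumes n = 4 inputs and arbitrary weight values; the name `h_prime` is chosen here for the pre-activation result.

```python
import numpy as np

n = 4
x = np.array([1.0, 2.0, 3.0, 4.0])            # input vector, shape (n,)
W_1 = np.arange(1, 13).reshape(n, 3) / 10.0   # n rows (inputs), 3 columns (hidden neurons)

# Linear combination for all three hidden neurons at once.
h_prime = x @ W_1                             # shape (3,)

# The first hidden neuron written out element by element:
# x_1*W_11 + x_2*W_21 + ... + x_n*W_n1 (zero-based indices in code).
h1_manual = sum(x[i] * W_1[i, 0] for i in range(n))

print(h_prime.shape, h_prime[0], h1_manual)
```

The matrix product and the element-by-element sum give the same value for the first hidden neuron, which is the point of the W_ij indexing described above.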

To make sure that the values of h do not explode or increase too much in size, we need to use an activation function, usually denoted by the Greek letter phi. We can use a hyperbolic tangent. Using this function will ensure that our outputs are between one and negative one. We can also use a sigmoid. Using this function will ensure that our outputs are between one and zero.

We can also use a rectified linear unit, or ReLU for short, where the negative values are nulled and the positive values remain as they are. Each activation function has its advantages and disadvantages. What they all share is that they allow the network to represent nonlinear relationships between its inputs and its outputs. This is very important, since most real-world data is nonlinear. Mathematically, the linear combination and activation function can simply be written as h equals the output of an activation function applied to the input vector multiplied by the corresponding weight matrix. Using these functions can be a bit tricky, as they contribute to the vanishing gradient problem that we mentioned before.
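The three activation functions mentioned above can be written in a few lines. This is a sketch with arbitrary example values; tanh is used for the hidden layer only as one possible choice.

```python
import numpy as np

def sigmoid(z):
    """Squashes values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Nulls negative values and keeps positive values unchanged."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(np.tanh(z))   # values between -1 and 1
print(sigmoid(z))   # values between 0 and 1
print(relu(z))      # negatives replaced by 0

# Hidden layer: activation applied to the input vector times the weight matrix.
x = np.array([1.0, 2.0])
W_1 = np.array([[0.1, -0.2, 0.3],
                [0.4, 0.5, -0.6]])
h = np.tanh(x @ W_1)
```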

We finished step one, and will now start step number two, which is finding the output y by using the values of h we just calculated. Since we have more than one output, y will be a vector as well. Our inputs are now h, and we want to find the values of the output y. Mathematically, the idea is identical to what we just saw in step number one. We now have different inputs, which we call h, and a different weight matrix, which we call W_2. The output will be the vector y. Notice that the weight matrix has three rows, as we have three neurons in the hidden layer, and two columns, as we have only two outputs. And again, we have a vector by matrix multiplication.

Vector h, multiplied by the weight matrix W_2, gives us the output vector y. We can put it in a simple equation, where y equals h times W_2. Once we have the outputs, we don’t necessarily need an activation function. In some applications, however, it can be beneficial to use one, for example a softmax function, denoted σ(x), if we want the output values to be between zero and one and to sum to one.
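Step two follows the same pattern as step one. In this sketch the hidden values and weights are arbitrary, and the softmax is written in its usual numerically stable form (subtracting the maximum before exponentiating), which is an implementation detail not discussed in the text.

```python
import numpy as np

def softmax(z):
    """Map a vector to positive values that sum to one."""
    e = np.exp(z - np.max(z))   # subtracting the max avoids overflow
    return e / e.sum()

h = np.array([0.2, -0.5, 0.9])        # three hidden neuron values
W_2 = np.array([[0.1, 0.3],
                [0.5, -0.2],
                [-0.4, 0.6]])         # 3 rows (hidden neurons), 2 columns (outputs)

y = h @ W_2                           # output vector, shape (2,)
probs = softmax(y)                    # optional output activation

print(y, probs)
```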

To have a good approximation of the output y, we often need more than one hidden layer, maybe even 10 or more. We will use P for the general number of hidden layers. The number of neurons in each layer can change from one layer to the next and, as I mentioned before, can even be in the thousands.

To keep the notation simple, let’s keep calling the inputs to a layer x. h_1 is the output of an activation function of a sum, where the sum is over the multiplication of each input x_i by its corresponding weight component W_i1. In the same way, h_m is the output of an activation function of a sum, and the sum is over the multiplication of each input x_i by its corresponding weight component W_im. For example, if we have three inputs and we want to calculate h_1, it will be the output of an activation function of the following linear combination: x_1 W_11 + x_2 W_21 + x_3 W_31. These single-element calculations will be helpful in understanding back-propagation, which is why we want to understand them as well. But as before, we can also look at these calculations as a vector by matrix multiplication.

Let’s focus on an intuitive error calculation, which is simply finding the difference between the calculated and the desired output. This is our basic error. For our back-propagation calculations, we will use the squared error as our loss function.


We will now continue with an example focusing on the back-propagation process, and consider a network having two inputs [x_1, x_2], three neurons in a single hidden layer [h_1, h_2, h_3], and a single output y.

The weight matrices to update are W¹ from the input to the hidden layer, and W² from the hidden layer to the output. Notice that in our case W² is a vector, not a matrix, as we only have one output.

The chain of thought in the weight updating process is as follows:

To update the weights, we need the network error. To find the network error, we need the network output, and to find the network output we need the value of the hidden layer, vector h_bar.

Each element of vector h_bar is calculated by a simple linear combination of the input vector with its corresponding weight matrix W¹, followed by an activation function.

We now need to find the network’s output, y:

After computing the output, we can finally find the network error.

As a reminder, the two error functions most commonly used are the Mean Squared Error (MSE), usually used in regression problems, and the cross entropy, often used in classification problems.

In this example, we use a variation of the MSE:

where d is the desired output and y is the calculated one. Notice that y and d are not vectors in this case, as we have a single output.
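For reference, the forward computations and the error for this example can be written out as follows. This is a reconstruction from the definitions in the text, not the original equations: it assumes a linear output (no output activation) and takes half the squared error as the MSE variation, a common choice because the factor of one half cancels when differentiating.

```latex
h_j = \Phi\left( \sum_{i=1}^{2} x_i \, W^{1}_{ij} \right), \qquad j = 1, 2, 3

y = \sum_{j=1}^{3} h_j \, W^{2}_{j}

E = \frac{(d - y)^2}{2}
```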

The aim of the back-propagation process is to minimize the error, which in our case is the Loss Function. To do that we need to calculate its partial derivative with respect to all of the weights.

We will find all the elements of the gradient using the chain rule.

The back-propagation process will consist of two steps:

Step 1: Calculating the gradient with respect to the weight vector W² (from the output to the hidden layer).
Step 2: Calculating the gradient with respect to the weight matrix W¹ (from the hidden layer to the input).

Step 1 (Note that the weight vector referenced here will be W². All indices referring to W² have been omitted from the calculations to keep the notation simple).

As you may recall:

In this specific step, since the output is of only a single value, we can rewrite the equation the following way (in which we have a weights vector):

Since we already calculated the gradient, we now know that the incremental value we need for step one is:

Having calculated the incremental value, we can update vector W² the following way:
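Step 1 can be sketched in equations as follows. This is a reconstruction under stated assumptions: a linear output y equal to the sum of h_j W_j, the half squared error E = (d - y)^2 / 2, and a learning rate written as α (a symbol introduced here). As in the text, the superscript on W² is omitted.

```latex
\frac{\partial E}{\partial W_j}
  = -(d - y)\,\frac{\partial y}{\partial W_j}
  = -(d - y)\, h_j

\Delta W_j = -\alpha \,\frac{\partial E}{\partial W_j} = \alpha\,(d - y)\, h_j

W_j^{\text{new}} = W_j^{\text{previous}} + \Delta W_j
```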

Step 2 (In this step, we will need to use both weight matrices. Therefore we will not be omitting the weight indices.)

In our second step, we will update the weights of matrix W¹ by calculating the partial derivative of y with respect to the weight matrix W¹.

The chain rule will be used the following way:

In this example we have only three neurons in the single hidden layer, therefore this will be a linear combination of three elements:
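Step 2 can be sketched in equations in the same way. As before, this is a reconstruction that assumes a linear output, the half squared error, and a learning rate α; Φ' denotes the derivative of the activation function. The sum over the three hidden neurons is the linear combination of three elements mentioned above, and only the term with p = j survives, because W¹_ij feeds only hidden neuron h_j.

```latex
\frac{\partial E}{\partial W^{1}_{ij}} = -(d - y)\,\frac{\partial y}{\partial W^{1}_{ij}}

\frac{\partial y}{\partial W^{1}_{ij}}
  = \sum_{p=1}^{3} \frac{\partial y}{\partial h_p}\,\frac{\partial h_p}{\partial W^{1}_{ij}}
  = W^{2}_{j}\; \Phi'\!\left( \sum_{k=1}^{2} x_k W^{1}_{kj} \right) x_i

\Delta W^{1}_{ij} = \alpha\,(d - y)\; W^{2}_{j}\; \Phi'\!\left( \sum_{k=1}^{2} x_k W^{1}_{kj} \right) x_i
```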


After updating the weight matrices we begin once again with the Feedforward pass, starting the process of updating the weights all over again.
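The whole cycle for this example (two inputs, three hidden neurons, one output) can be sketched as a short training loop. This is an illustration under several assumptions not fixed by the text: tanh as the hidden activation, a linear output, half the squared error as the loss, a learning rate of 0.1, and arbitrary values for the input and the desired output. The function names are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.5, -1.0])                  # two inputs (illustrative values)
d = 0.8                                    # desired output (illustrative)
W1 = rng.normal(scale=0.5, size=(2, 3))    # input -> hidden weights
W2 = rng.normal(scale=0.5, size=3)         # hidden -> output weights (a vector)
alpha = 0.1                                # learning rate (assumed value)

def forward(x, W1, W2):
    h = np.tanh(x @ W1)                    # hidden layer
    y = h @ W2                             # single linear output
    return h, y

def loss(x, d, W1, W2):
    _, y = forward(x, W1, W2)
    return 0.5 * (d - y) ** 2

def gradients(x, d, W1, W2):
    h, y = forward(x, W1, W2)
    delta = d - y
    grad_W2 = -delta * h                                   # Step 1
    # Step 2: the derivative of tanh is 1 - tanh^2, i.e. 1 - h^2.
    grad_W1 = -delta * np.outer(x, W2 * (1.0 - h ** 2))
    return grad_W1, grad_W2

losses = []
for step in range(200):
    losses.append(loss(x, d, W1, W2))
    grad_W1, grad_W2 = gradients(x, d, W1, W2)
    # Both gradients were computed from the old weights before either update.
    W1 = W1 - alpha * grad_W1
    W2 = W2 - alpha * grad_W2

print(losses[0], losses[-1])
```

Each pass through the loop is one feedforward pass followed by one weight update, which is the repetition described above.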
