Building your first deep neural network — Practical Tricks (Part 2)

Full, batch, and stochastic gradient descent

As it turns out, this idea of learning one example at a time is a variant of gradient descent called stochastic gradient descent, and it's one of the handful of methods that can be used to learn an entire dataset. How does stochastic gradient descent work? As you saw in the previous example, it performs a prediction and weight update for each training example separately: it takes the first streetlight, tries to predict it, calculates the weight_delta, and updates the weights. Then it moves on to the second streetlight, and so on. It iterates through the entire dataset many times, until it finds a weight configuration that works well for all the training examples.
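As a concrete sketch, here's what that per-example loop looks like on the streetlight data. The starting weights, alpha, and iteration count are illustrative choices, not canonical values:

```python
import numpy as np

# Stochastic gradient descent: predict, compute weight_delta, and update
# after EVERY individual training example.
streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
walk_vs_stop = np.array([0, 1, 0, 1])   # goal prediction per streetlight

weights = np.array([0.5, 0.48, -0.7])   # illustrative starting weights
alpha = 0.1

for iteration in range(40):             # many passes over the dataset
    for i in range(len(walk_vs_stop)):
        input = streetlights[i]
        prediction = input.dot(weights)
        delta = prediction - walk_vs_stop[i]
        weights = weights - alpha * (input * delta)  # update per example
```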

(Full) gradient descent updates weights one dataset at a time

Another method for learning an entire dataset is gradient descent (also called average or full gradient descent). Instead of updating the weights once for each training example, the network calculates the average weight_delta over the entire dataset, changing the weights only after it has computed that full average.
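A sketch of the same loop reorganized for full gradient descent, with the same illustrative data and hyperparameters as above:

```python
import numpy as np

# Full (average) gradient descent: accumulate the average weight_delta
# over the whole dataset, then update the weights once per pass.
streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([0, 1, 0, 1])
weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1

for iteration in range(40):
    avg_weight_deltas = np.zeros(3)
    for i in range(len(walk_vs_stop)):
        delta = streetlights[i].dot(weights) - walk_vs_stop[i]
        avg_weight_deltas += streetlights[i] * delta / len(walk_vs_stop)
    weights = weights - alpha * avg_weight_deltas   # one update per pass
```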

Batch gradient descent updates weights after n examples

This will be covered in more detail later, but there’s also a third configuration that sort of splits the difference between stochastic gradient descent and full gradient descent. Instead of updating the weights after just one example or after the entire dataset of examples, you choose a batch size (typically between 8 and 256) of examples, after which the weights are updated.
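A minimal sketch of this in-between version; with only four toy examples, a batch size of 2 stands in for the 8–256 you'd use in practice:

```python
import numpy as np

# Mini-batch gradient descent: update the weights after every
# batch_size examples rather than after each one or after all of them.
streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([0, 1, 0, 1])
weights = np.array([0.5, 0.48, -0.7])
alpha, batch_size = 0.1, 2

for iteration in range(40):
    for start in range(0, len(walk_vs_stop), batch_size):
        batch_in = streetlights[start:start + batch_size]
        batch_goal = walk_vs_stop[start:start + batch_size]
        deltas = batch_in.dot(weights) - batch_goal
        avg_weight_deltas = batch_in.T.dot(deltas) / batch_size
        weights = weights - alpha * avg_weight_deltas  # every n examples
```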

Neural networks learn correlation

You just got done training a single-layer neural network to take a streetlight pattern and identify whether it was safe to cross the street. Let's take on the neural network's perspective for a moment. The neural network doesn't know that it was processing streetlight data. All it was trying to do was identify which input (of the three possible) correlated with the output.

You can see that it correctly identified the middle light by looking at the final weight positions of the network. Notice that the middle weight is very near 1, whereas the far-left and far-right weights are very near 0. At a high level, all the iterative, complex processes of learning accomplished something rather simple: the network identified correlation between the middle input and the output. The correlation is located wherever the weights were set to high numbers. Inversely, randomness with respect to the output was found at the far-left and far-right weights (where the weight values are very near 0).

How did the network identify correlation? Well, in the process of gradient descent, each training example asserts either up pressure or down pressure on the weights. On average, there was more up pressure for the middle weight and more down pressure for the other weights. Where does the pressure come from? Why is it different for different weights?
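You can verify this directly; a quick sketch (reusing the illustrative per-example setup from above) prints the trained weights:

```python
import numpy as np

# Train with per-example updates, then inspect where the correlation landed.
streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([0, 1, 0, 1])
weights = np.array([0.5, 0.48, -0.7])

for iteration in range(40):
    for i in range(len(walk_vs_stop)):
        delta = streetlights[i].dot(weights) - walk_vs_stop[i]
        weights = weights - 0.1 * streetlights[i] * delta

print(weights)  # roughly [0, 1, 0]: the correlation sits on the middle weight
```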

Up and down pressure

It comes from the data. Each node is individually trying to correctly predict the output given the input. For the most part, each node ignores all the other nodes when attempting to do so. The only cross-communication is that all three weights must share the same error measure. The weight update is nothing more than taking this shared error measure and multiplying it by each respective input. Why do you do this? A key part of why neural networks learn is error attribution: given a shared error, the network needs to figure out which weights contributed to it (so they can be adjusted) and which weights did not contribute (so they can be left alone).
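In code, that attribution step is a single multiplication; a sketch with the first streetlight and illustrative weights:

```python
import numpy as np

# One shared error, attributed per weight by scaling with each input.
input = np.array([1, 0, 1])             # first streetlight
weights = np.array([0.5, 0.48, -0.7])   # illustrative weights
goal = 0

delta = input.dot(weights) - goal       # the shared error measure
weight_deltas = input * delta           # [delta, 0, delta]
# The middle input is 0, so its weight feels no pressure at all:
# it didn't contribute to the error, and it's left alone.
```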

Edge case: Overfitting

Sometimes correlation happens accidentally. Consider again the first example in the training data. What if the far-left weight was 0.5 and the far-right weight was –0.5? Because the middle input is 0 for that example, the prediction would equal 0 regardless of the middle weight: the network would predict that example perfectly. But it hasn't remotely learned how to safely predict streetlights (those weights would fail in the real world). This phenomenon is known as overfitting.
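A sketch makes the failure concrete (the middle weight is set to 0 here purely for illustration; it's irrelevant to the first example because that input is 0):

```python
import numpy as np

# Accidentally perfect on the first example, wrong almost everywhere else.
weights = np.array([0.5, 0.0, -0.5])    # far-left 0.5, far-right -0.5
streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([0, 1, 0, 1])

print(streetlights.dot(weights))  # [ 0.  -0.5 -0.5  0. ]
print(walk_vs_stop)               # [ 0   1    0    1 ] -- only the first matches
```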

Stacking neural networks

When you look at the following architecture, the prediction occurs exactly as you might expect when I say, "stack neural networks." The output of the first, lower network (layer_0 to layer_1) is the input to the second, upper network (layer_1 to layer_2). The prediction for each of these networks is identical to what you saw before.
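A sketch of that forward pass; the layer sizes and random initialization here are illustrative:

```python
import numpy as np

# Stacked prediction: the lower network's output is the upper network's input.
np.random.seed(1)
weights_0_1 = 2 * np.random.random((3, 4)) - 1   # lower network: layer_0 -> layer_1
weights_1_2 = 2 * np.random.random((4, 1)) - 1   # upper network: layer_1 -> layer_2

layer_0 = np.array([[1, 0, 1]])                  # one streetlight pattern
layer_1 = layer_0.dot(weights_0_1)               # first network's prediction...
layer_2 = layer_1.dot(weights_1_2)               # ...is the second network's input
```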

Backpropagation: Long-distance error attribution

The weighted average error. What's the prediction from layer_1 to layer_2? It's a weighted average of the values at layer_1. So if layer_2 is too high by x amount, how do you know which values at layer_1 contributed to the error? The ones with higher weights (weights_1_2) contributed more; the ones with lower weights from layer_1 to layer_2 contributed less.

Consider the extreme. Let's say the far-left weight from layer_1 to layer_2 was zero. How much did that node at layer_1 cause the network's error? Zero. It's so simple it's almost hilarious. The weights from layer_1 to layer_2 exactly describe how much each layer_1 node contributes to the layer_2 prediction. This means those weights also exactly describe how much each layer_1 node contributes to the layer_2 error.

How do you use the delta at layer_2 to figure out the delta at layer_1? You multiply it by each of the respective weights for layer_1. It's like the prediction logic in reverse. This process of moving the delta signal around is called backpropagation.
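In code, that reverse step is a single matrix multiply; a sketch with made-up numbers:

```python
import numpy as np

# Backpropagating the delta: each layer_1 node receives the layer_2 delta
# scaled by its own weight into layer_2 (values here are illustrative).
weights_1_2 = np.array([[0.0], [0.5], [-0.3], [0.9]])
layer_2_delta = np.array([[0.25]])                # layer_2 too high by 0.25

layer_1_delta = layer_2_delta.dot(weights_1_2.T)  # prediction logic in reverse
print(layer_1_delta)  # [[ 0.     0.125 -0.075  0.225]]
# The far-left node's weight is 0, so it receives zero delta: it didn't
# contribute to the error, and its incoming weights are left alone.
```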

Backpropagation in code

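Putting the pieces together, here is a minimal sketch of a full training loop with backpropagation on the streetlight data. The hidden size, alpha, iteration count, and the relu nonlinearity between the layers are illustrative assumptions on my part (without some nonlinearity, the two stacked layers collapse into a single linear layer):

```python
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x        # pass positives through, zero out negatives

def relu_deriv(output):
    return output > 0         # 1 where the node was active, 0 where it wasn't

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([[0, 1, 0, 1]]).T

alpha, hidden_size = 0.2, 4
weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

for iteration in range(60):
    for i in range(len(streetlights)):
        # Forward pass: stacked prediction.
        layer_0 = streetlights[i:i + 1]
        layer_1 = relu(layer_0.dot(weights_0_1))
        layer_2 = layer_1.dot(weights_1_2)

        # Backward pass: long-distance error attribution. Push the delta
        # back through weights_1_2, then mask the nodes relu turned off.
        layer_2_delta = layer_2 - walk_vs_stop[i:i + 1]
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu_deriv(layer_1)

        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)
```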