Building your first deep neural network: introduction to back-propagation (Part 1)

The streetlight problem

This toy problem considers how a network learns entire datasets

Imagine yourself approaching a street corner in a foreign country. As you approach, you look up and realize that the streetlight is unfamiliar. How can you know when it’s safe to cross the street?

street light in a strange country

You can know when it’s safe to cross the street by interpreting the streetlight. But in this case, you don’t know how to interpret it. Which light combinations indicate when it’s time to walk? Which indicate when it’s time to stop? To solve this problem, you might sit at the street corner for a few minutes observing the correlation between each light combination and whether people around you choose to walk or stop. You take a seat and record the following pattern:

OK, nobody walked at the first light. At this point you’re thinking, “Wow, this pattern could be anything. The left light or the right light could be correlated with stopping, or the central light could be correlated with walking.” There’s no way to know. Let’s take another datapoint:

Now you’re getting somewhere. Only the middle light changed this time, and you got the opposite pattern. The working hypothesis is that the middle light indicates when people feel safe to walk. Over the next few minutes, you record the following six light patterns, noting when people walk or stop. Do you notice a pattern overall?

As hypothesized, there is a perfect correlation between the middle (crisscross) light and whether it’s safe to walk. You learned this pattern by observing all the individual data points and searching for correlation. This is what you’re going to train a neural network to do.
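The same search for correlation can be sketched in a few lines of plain Python. The six observations below are assumed values chosen to be consistent with the story (the original figure is not reproduced here), and `agreement` is a hypothetical helper that measures how often a light's on/off state matches whether people walked.

```python
# Assumed observations: each tuple is (left, middle, right), 1 = on, 0 = off.
observations = [
    (1, 0, 1),
    (0, 1, 1),
    (0, 0, 1),
    (1, 1, 1),
    (0, 1, 1),
    (1, 0, 1),
]
# Assumed outcomes: 1 = people walked, 0 = people stopped.
walked = [0, 1, 0, 1, 1, 0]


def agreement(light_index):
    """Fraction of observations where the light's state equals the walk outcome."""
    matches = sum(
        1 for lights, outcome in zip(observations, walked)
        if lights[light_index] == outcome
    )
    return matches / len(walked)


scores = {name: agreement(i) for i, name in enumerate(["left", "middle", "right"])}
print(scores)
```

Under these assumed observations, the middle light agrees with the walk outcome every time, while the left and right lights do not.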

Preparing the data: neural networks don’t read streetlights

You actually have two datasets. On the one hand, you have six streetlight states. On the other hand, you have six observations of whether people walked. You can train the neural network to convert from the dataset you know to the dataset that you want to know. In this particular real-world example, you know the state of the streetlight at any given time, and you want to know whether it’s safe to cross the street.

To prepare this data for the neural network, you first need to split it into these two groups: what you know and what you want to know.

Matrices and the matrix relationship: translate the streetlight into math

Math doesn’t understand streetlights. You want to teach a neural network to translate a streetlight pattern into the correct stop/walk pattern. What you really want to do is mimic the pattern of the streetlight in the form of numbers. Let me show you what I mean.

Notice that the pattern of numbers shown here mimics the pattern from the streetlights in the form of 1s and 0s. Each light gets a column (three columns total, because there are three lights). Notice also that there are six rows representing the six different observed streetlights. This structure of 1s and 0s is called a matrix. This relationship between the rows and columns is common in matrices, especially matrices of data (like the streetlights).

In data matrices, it’s convention to give each recorded example a single row and each thing being recorded a single column. This makes the matrix easy to read. So, a column contains every state in which a thing was recorded. In this case, a column contains every on/off state recorded for a particular light. Each row contains the simultaneous state of every light at a particular moment in time. Again, this is common.
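To make the row and column convention concrete, here is a small NumPy sketch. The six rows are assumed example values (the original figure is not reproduced here); only the layout, one row per observation and one column per light, comes from the text.

```python
import numpy as np

# Assumed example values: 6 observations (rows) x 3 lights (columns).
streetlights = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
])

print(streetlights.shape)             # (number of rows, number of columns)

first_observation = streetlights[0]   # one row: every light at one moment
middle_light = streetlights[:, 1]     # one column: one light across all moments

print(first_observation)
print(middle_light)
```

Indexing with a single integer selects a row, while `[:, 1]` selects the second column, that is, every recorded state of the middle light.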

Good data matrices perfectly mimic the outside world

The data matrix doesn’t have to be all 1s and 0s. What if the streetlights were on dimmers and turned on and off at varying degrees of intensity? Perhaps the streetlight matrix would look more like this:

Matrix A is perfectly valid. It’s mimicking the patterns that exist in the real world (streetlight), so you can ask the computer to interpret them. Would the following matrix still be valid?

Matrix B is also valid. It adequately captures the relationships between the various training examples (rows) and lights (columns). Note that A * 10 == B: every entry of Matrix B is the corresponding entry of Matrix A multiplied by 10. In other words, the two matrices are scalar multiples of each other, so they contain the same underlying pattern.
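A quick way to check the scalar-multiple relationship is sketched below. The dimmer readings in `matrix_a` and `matrix_b` are hypothetical values invented for illustration, since the original figures are not reproduced here.

```python
import numpy as np

# Hypothetical dimmer readings on a 0.0 (off) to 1.0 (fully on) scale.
matrix_a = np.array([
    [0.9, 0.0, 1.0],
    [0.2, 0.8, 1.0],
    [0.1, 0.0, 1.0],
    [0.8, 0.9, 1.0],
    [0.1, 0.7, 1.0],
    [0.9, 0.1, 0.0],
])

# The same hypothetical readings recorded on a 0 to 10 scale.
matrix_b = np.array([
    [9, 0, 10],
    [2, 8, 10],
    [1, 0, 10],
    [8, 9, 10],
    [1, 7, 10],
    [9, 1, 0],
])

# allclose tolerates tiny floating-point rounding differences.
is_scalar_multiple = np.allclose(matrix_a * 10, matrix_b)

# Scaling every entry by the same constant leaves the correlations
# between the columns (lights) unchanged.
same_correlations = np.allclose(
    np.corrcoef(matrix_a, rowvar=False),
    np.corrcoef(matrix_b, rowvar=False),
)

print(is_scalar_multiple, same_correlations)
```

The second check is one way to state what "same underlying pattern" means: the relationships between the lights are identical in both matrices, even though the raw numbers differ.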

Creating a matrix or two in Python

You’ve converted the streetlight pattern into a matrix (one with just 1s and 0s). Now let’s create that matrix (and, more important, its underlying pattern) in Python so the neural network can read it. Python’s NumPy library was built just for handling matrices. Let’s see it in action:
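The original listing is not reproduced here, so the following is a minimal sketch. The name `streetlights` comes from the text; the six rows are assumed values consistent with the description (three lights, six observations).

```python
import numpy as np

# One row per observation, one column per light (left, middle, right).
# The values are assumed for illustration.
streetlights = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
])

print(streetlights)
```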

If you’re a regular Python user, something should be striking in this code. A matrix is just a list of lists. It’s an array of arrays. What is NumPy? NumPy is really just a fancy wrapper for an array of arrays that provides special, matrix-oriented functions. Let’s create a NumPy matrix for the output data, too:
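Again as a sketch, with assumed values: the name `walk_vs_stop` comes from the text, and the six entries are chosen so that people walk exactly when the middle light in the assumed streetlight data is on.

```python
import numpy as np

# One row per observation: 1 = people walked, 0 = people stopped.
# A single column, because only one thing is being recorded.
walk_vs_stop = np.array([
    [0],
    [1],
    [0],
    [1],
    [1],
    [0],
])

print(walk_vs_stop.shape)
```

Storing the outcomes as a six-by-one matrix keeps the one-row-per-example convention; a flat array of six values would work equally well for a single output.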

What do you want the neural network to do? Take the streetlights matrix and learn to transform it into the walk_vs_stop matrix. More important, you want the neural network to take any matrix containing the same underlying pattern as streetlights and transform it into a matrix that contains the underlying pattern of walk_vs_stop. More on that later. Let’s start by trying to transform streetlights into walk_vs_stop using a neural network.

Building a neural network

The following code shows a basic neural network with randomly initialized weights that uses mean squared error as its loss function.
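The original listing is not reproduced here, so what follows is a minimal sketch rather than the author's exact code. It assumes a single linear layer with three weights, the example data used for illustration earlier, a learning rate `alpha` of 0.1, 200 passes over the data, and weight updates after every example. The seed, the initialization range, and all variable names other than `streetlights` and `walk_vs_stop` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for repeatability

# Assumed example data: six observations of three lights, and the outcomes.
streetlights = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
])
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

# Random weight initialization: one weight per light.
weights = rng.uniform(-1.0, 1.0, size=3)
alpha = 0.1  # learning rate (assumed value)

for epoch in range(200):
    squared_errors = []
    for lights, goal in zip(streetlights, walk_vs_stop):
        prediction = lights.dot(weights)   # weighted sum of the three lights
        delta = prediction - goal          # how far off the prediction is
        squared_errors.append(delta ** 2)
        # Gradient step for this example (the constant factor 2 from the
        # squared error is folded into alpha).
        weights = weights - alpha * delta * lights
    mse = np.mean(squared_errors)          # mean squared error for this pass
    if epoch % 50 == 0:
        print(f"epoch {epoch}: mse = {mse:.6f}")

predictions = streetlights.dot(weights)
print("weights:", weights)
print("predictions:", predictions)
```

With this assumed data, the only weights that reproduce every outcome put all of the weight on the middle light, so the learned weights should end up close to zero for the left and right lights and close to one for the middle light.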

In Part 2, we will discuss full, batch, and stochastic gradient descent, as well as overfitting and related topics.

The notebook for this section can be found at:

AI Researcher - NLP Practitioner
