Building your first deep neural network: modeling probabilities and nonlinearities (part 4)

What is an activation function?

An activation function is a function applied to the neurons in a layer during prediction. This should seem very familiar, because you’ve been using an activation function called relu (shown here in the three-layer neural network). The relu function has the effect of turning all negative numbers into 0. Oversimplified, an activation function is any function that takes one number and returns another number. But there are an infinite number of functions in the universe, and not all of them are useful as activation functions.
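
To make this concrete, here is a minimal NumPy sketch of relu; the helper name and the sample inputs are just for illustration.

    import numpy as np

    def relu(x):
        # Turn every negative number into 0; leave positive numbers unchanged
        return np.maximum(0, x)

    print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # -> [0.  0.  0.  0.5 2. ]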

Standard hidden-layer activation functions

sigmoid is the bread-and-butter activation

sigmoid is great because it smoothly squashes the infinite range of possible inputs into an output between 0 and 1. In many circumstances, this lets you interpret the output of any individual neuron as a probability. Thus, people use this nonlinearity both in hidden layers and output layers.
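
Here is a minimal sketch of sigmoid in NumPy (the helper name and the sample inputs are illustrative):

    import numpy as np

    def sigmoid(x):
        # Squashes any real-valued input into the open interval (0, 1)
        return 1 / (1 + np.exp(-x))

    # Large negative inputs approach 0, large positive inputs approach 1
    print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
    # -> approximately [0.00005  0.2689  0.5  0.7311  0.99995]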

Tanh is better than sigmoid for hidden layers

Here’s the cool thing about tanh. Remember modeling selective correlation? Well, sigmoid gives varying degrees of positive correlation. That’s nice. tanh is the same as sigmoid except it’s between –1 and 1! This means it can also throw in some negative correlation. Although it isn’t that useful for output layers (unless the data you’re predicting goes between –1 and 1), this aspect of negative correlation is powerful for hidden layers; on many problems, tanh will outperform sigmoid in hidden layers.
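
A quick side-by-side sketch, assuming NumPy, shows the difference in output range (the sample inputs are arbitrary):

    import numpy as np

    x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

    # tanh squashes inputs into (-1, 1), so a hidden neuron can express
    # negative correlation as well as positive correlation
    print(np.tanh(x))              # values between -1 and 1
    print(1 / (1 + np.exp(-x)))    # sigmoid: values between 0 and 1 only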

Standard output layer activation functions

Choosing the best one depends on what you’re trying to predict

It turns out that what’s best for hidden-layer activation functions can be quite different from what’s best for output-layer activation functions, especially when it comes to classification. Broadly speaking, there are three major types of output layer.

Predicting raw data values

This is perhaps the most straightforward but least common type of output layer. In some cases, people want to train a neural network to transform one matrix of numbers into another matrix of numbers, where the range of the output (the difference between the lowest and highest values) is something other than a probability. One example might be predicting the average temperature in Ha Noi given the temperature in the surrounding regions. The main thing to focus on here is ensuring that the output nonlinearity can predict the right answers. In this case, sigmoid and tanh would be inappropriate because they force every prediction into a fixed range (0 to 1 for sigmoid, –1 to 1 for tanh), whereas you want to be able to predict any temperature. If I were training a network to do this prediction, I’d very likely train the network without an activation function on the output.
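
As a sketch of what "no output activation" looks like in practice, here is a hypothetical forward pass; the layer sizes, random weights, and input temperatures are made up for illustration.

    import numpy as np

    np.random.seed(1)

    # Hypothetical sizes: temperatures from 3 nearby locations in, 1 prediction out
    weights_0_1 = 0.2 * np.random.random((3, 4)) - 0.1   # input -> hidden
    weights_1_2 = 0.2 * np.random.random((4, 1)) - 0.1   # hidden -> output

    def predict(temps):
        hidden = np.maximum(0, temps.dot(weights_0_1))    # relu on the hidden layer
        # No activation on the output layer: the prediction is an unbounded
        # real number, so the network can output any temperature
        return hidden.dot(weights_1_2)

    print(predict(np.array([18.0, 21.5, 19.3])))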

Predicting unrelated yes/no probabilities (sigmoid)

You’ll often want to make multiple binary predictions with one neural network: for example, predicting whether the team would win, whether there would be injuries, and the morale of the team (happy or sad) based on the input data. As an aside, when a neural network has hidden layers, predicting multiple things at once can be beneficial. Often the network will learn something when predicting one label that will be useful to one of the other labels. For example, if the network got really good at predicting whether the team would win ballgames, the same hidden layer would likely be very useful for predicting whether the team would be happy or sad. But the network might have a harder time predicting happiness or sadness without this extra signal. This tends to vary greatly from problem to problem, but it’s good to keep in mind. In these instances, it’s best to use the sigmoid activation function, because it models individual probabilities separately for each output node.
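
Here is a hedged sketch of that setup: a hypothetical network whose three output nodes each get their own sigmoid, so each yes/no probability is modeled independently (all sizes, weights, and inputs are illustrative).

    import numpy as np

    np.random.seed(1)

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    # Hypothetical network: 5 game statistics in, 3 unrelated yes/no outputs
    # (win?, injuries?, happy morale?)
    weights_0_1 = 0.2 * np.random.random((5, 8)) - 0.1
    weights_1_2 = 0.2 * np.random.random((8, 3)) - 0.1

    game_stats = np.array([0.3, 0.9, 0.1, 0.5, 0.2])
    hidden = np.tanh(game_stats.dot(weights_0_1))
    # sigmoid is applied to each output node separately, so the three
    # probabilities don't have to sum to 1
    predictions = sigmoid(hidden.dot(weights_1_2))
    print(predictions)  # e.g. [p(win), p(injury), p(happy)]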

Predicting which-one probabilities (softmax)

By far the most common use case in neural networks is predicting a single label out of many. For example, in the MNIST digit classifier, you want to predict which number is in the image. You know ahead of time that the image can’t be more than one number. You can train this network with a sigmoid activation function and declare that the output with the highest probability is the most likely. This will work reasonably well. But it’s far better to have an activation function that models the idea that “The more likely it’s one label, the less likely it’s any of the other labels.” Why do we like this phenomenon? Consider how weight updates are performed. Let’s say the MNIST digit classifier should predict that the image is a 9, and look at the raw weighted sums going into the final layer (before an activation function is applied).
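
The softmax function captures exactly this "which one" behavior by normalizing the raw weighted sums so they add up to 1. A minimal sketch, using hypothetical raw sums chosen only for illustration:

    import numpy as np

    def softmax(raw):
        # Subtract the max for numerical stability, exponentiate, then normalize
        # so the outputs sum to 1: the more weight one label gets, the less is
        # left over for every other label
        exps = np.exp(raw - np.max(raw))
        return exps / np.sum(exps)

    # Hypothetical raw weighted sums for digits 0..9 (illustrative values only)
    raw_sums = np.array([0.0, 0.1, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0, 0.0, 3.0])
    print(softmax(raw_sums))  # the 9 ends up with most of the probability mass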
