Applications of CNNs
CNNs achieve state-of-the-art results in a variety of problem areas, including Voice User Interfaces, Natural Language Processing, and computer vision. In the field of Voice User Interfaces, Google made use of CNNs in its recently released WaveNet. WaveNet takes any piece of text as input and does an excellent job of returning computer-generated audio of a human reading the text.
CNNs can also be used in the field of Natural Language Processing, where they extract information from sentences. This information can be used to classify sentiment: for example, is the writer happy or sad?
In this section, we’ll focus on applications in computer vision, and specifically work towards applying CNNs to image classification tasks. Given an image, your CNN will assign a corresponding label that you believe summarizes the content of the image. This is a core problem in computer vision and has applications in a wide range of problem areas. For instance, CNNs are used to teach artificially intelligent agents to play video games such as Atari Breakout.
Go, for example, is an ancient Chinese board game considered one of the most complex games in existence. It is said that there are more configurations in the game than there are atoms in the universe. Recently, researchers from Google’s DeepMind used CNNs to train an artificially intelligent agent to beat human professional Go players.
CNNs have also allowed drones to navigate unfamiliar territory.
How Computers Interpret Images
Any grayscale image is interpreted by a computer as an array: a grid of values, where each grid cell is called a pixel, and each pixel has a numerical value. Each image in the MNIST database is 28 pixels high and 28 pixels wide, and so it’s understood by a computer as a 28 by 28 array. In a typical grayscale image, white pixels are encoded as the value 255, and black pixels are encoded as zero. Gray pixels fall somewhere in between, with light gray being closer to 255.
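To make this concrete, here's a minimal NumPy sketch (not part of the original lesson) of a tiny grayscale "image" stored as an array, using the same 0-to-255 encoding described above:

```python
import numpy as np

# A tiny 4x4 "grayscale image": each entry is one pixel,
# 0 = black, 255 = white, values in between = shades of gray.
image = np.array([
    [  0,  60, 120, 255],
    [ 30,  90, 150, 220],
    [  0,  45, 200, 255],
    [ 10,  80, 170, 240],
], dtype=np.uint8)

print(image.shape)   # (4, 4) -- a real MNIST digit would be (28, 28)
print(image[0, 3])   # 255 -> the top-right pixel is white
```

A real MNIST digit works the same way, just with a 28 by 28 grid of values.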
How might we approach the task of classifying these images? Well, you already learned one method for classification, using a multi-layer perceptron. How might we input this image data into an MLP? Recall that MLPs only take vectors as input. So, in order to use an MLP with images, we have to first convert any image array into a vector. This process is so common that it has a name: flattening.
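Flattening is a one-line operation in practice. Here's an illustrative NumPy sketch (the array contents are placeholder values, not a real digit):

```python
import numpy as np

# Stand-in for a 28x28 MNIST digit (placeholder values, not real pixel data)
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# Flattening: unroll the 2D grid row by row into a 1D vector
vector = image.flatten()

print(image.shape)   # (28, 28)
print(vector.shape)  # (784,) -- a vector an MLP can accept as input
```

The 784-dimensional vector is what gets fed into the MLP's input layer.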
Consider breaking the image into four regions, here color-coded as red, green, yellow, and blue. Then, each hidden node could be connected to only the pixels in one of these four regions, so that each hidden node sees only a quarter of the original image. With this new regional breakdown and the assignment of small local groups of pixels to different hidden nodes, every hidden node finds patterns in only one of the four regions in the image. Each hidden node still reports to the output layer, where the output layer combines the findings from the patterns learned separately in each region. This so-called locally connected layer uses far fewer parameters than a fully connected layer.
Filters and the Convolutional Layer
A convolutional neural network is a special kind of neural network in that it can remember spatial information. The neural networks that you’ve seen so far only look at individual inputs. But convolutional neural networks can look at an image as a whole, or in patches, and analyze groups of pixels at a time. The key to preserving the spatial information is something called the convolutional layer. A convolutional layer applies a series of different image filters, also known as convolutional kernels, to an input image. The resulting filtered images have different appearances.
Frequency in images
We have an intuition of what frequency means when it comes to sound. High-frequency sounds are high pitched, like a bird chirp or a violin. And low-frequency sounds are low pitched, like a deep voice or a bass drum. For sound, frequency actually refers to how fast a sound wave is oscillating; oscillations are usually measured in cycles per second (Hz), and high pitches are made by high-frequency waves. Examples of low- and high-frequency sound waves are pictured below. On the y-axis is amplitude, which is a measure of sound pressure that corresponds to the perceived loudness of a sound, and on the x-axis is time.
High and low frequency
Similarly, frequency in images is a rate of change. But what does it mean for an image to change? Well, images change in space, and a high-frequency image is one where the intensity changes a lot, with the level of brightness changing quickly from one pixel to the next. A low-frequency image may be one that is relatively uniform in brightness or changes very slowly. This is easiest to see in an example.
Most images have both high-frequency and low-frequency components. In the image above, on the scarf and striped shirt, we have a high-frequency image pattern; this part changes very rapidly from one brightness to another. Higher up in this same image, we see parts of the sky and background that change very gradually, which is considered a smooth, low-frequency pattern.
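One rough way to quantify this idea of "rate of change" is to look at pixel-to-pixel brightness differences. Here's a small illustrative sketch (not from the original lesson; `mean_change` is a hypothetical helper) comparing a smooth row of pixels to a stripey one:

```python
import numpy as np

# Two 1D "rows of pixels": one smooth (low frequency), one stripey (high frequency)
low_freq  = np.array([100, 102, 104, 106, 108, 110], dtype=np.float32)
high_freq = np.array([  0, 255,   0, 255,   0, 255], dtype=np.float32)

def mean_change(row):
    # Mean absolute pixel-to-pixel change: a rough proxy for spatial frequency
    return np.abs(np.diff(row)).mean()

print(mean_change(low_freq))   # 2.0   -> brightness changes slowly
print(mean_change(high_freq))  # 255.0 -> brightness flips every pixel
```

A stripey scarf or shirt behaves like the second row; a smooth sky behaves like the first.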
CNNs are a kind of deep learning model that can learn to do things like image classification and object recognition. They keep track of spatial information and learn to extract features like the edges of objects in something called a convolutional layer. Below you’ll see a simple CNN structure, made of multiple layers, including this “convolutional layer”.
The convolutional layer is produced by applying a series of many different image filters, also known as convolutional kernels, to an input image.
In the example shown, 4 different filters produce 4 differently filtered output images. When we stack these images, we form a complete convolutional layer with a depth of 4!
Consider this image of a dog. A single region in this image may have many different patterns that we want to detect.
Let’s use four filters, each four pixels high and four pixels wide. Recall that each filter will be convolved across the height and width of the image to produce an entire collection of nodes in the convolutional layer. In this case, since we have four filters, we’ll have four collections of nodes. In practice, each of these four collections is referred to as either a feature map or an activation map. When we visualize these feature maps, we see that they look like filtered images.
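The steps above can be sketched in plain NumPy. This is an illustrative toy implementation (not the lesson's code; the 6x6 image and random filter values are made up) showing how convolving four filters over an image yields a stack of feature maps:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel --
    the sliding-window operation used in CNN convolutional layers."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the filter against this patch and sum the result
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6))                      # toy 6x6 grayscale image
filters = [rng.standard_normal((4, 4)) for _ in range(4)]  # four 4x4 filters

# Stacking the four feature maps gives a convolutional layer of depth 4
feature_maps = np.stack([convolve2d(image, f) for f in filters])
print(feature_maps.shape)  # (4, 3, 3): depth 4, each map 3x3
```

Each slice of the stack is one feature map, which is exactly the "collection of nodes" each filter produces.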
But what about color images? Well, we’ve seen that grayscale images are interpreted by the computer as a 2D array with height and width. Color images are interpreted by the computer as a 3D array with height, width, and depth. In the case of RGB images, the depth is three.
We can also stack multiple layers of filters to create a multi-layer CNN.
Stride and Padding
But there are even more hyperparameters that you can tune. One of these hyperparameters is referred to as the stride of the convolution. The stride is just the amount by which the filter slides over the image.
When we move the filter two more units to the right, the filter extends outside the image. What do we do now? Do we still want to keep the corresponding convolutional node? For now, let’s just populate the places where the filter extends outside with a question mark and proceed as planned.
We could plan ahead for this case by padding the image with zeros to give the filter more space to move. Now, when we populate the convolutional layer, we get contributions from every region in the image.
A complicated dataset with many different object categories will require a large number of filters, each responsible for finding a pattern in the image. More filters means a bigger stack, which means that the dimensionality of our convolutional layers can get quite large. Higher dimensionality means we’ll need to use more parameters, which can lead to overfitting. Thus, we need a method for reducing this dimensionality. This is the role of pooling layers within a convolutional neural network.

We’ll focus on two different types of pooling layers. The first type is a max pooling layer. Max pooling layers take a stack of feature maps as input. In this case, we’ll use a window size of two and a stride of two. To construct the max pooling layer, we work with each feature map separately. Let’s begin with the first feature map: we start with our window in the top-left corner of the image. The value of the corresponding node in the max pooling layer is calculated by just taking the maximum of the pixels contained in the window.
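The windowed-maximum procedure just described can be sketched directly in NumPy. This is an illustrative toy (not the lesson's code; the 4x4 feature map values are made up), using the window size of two and stride of two from the text:

```python
import numpy as np

def max_pool(feature_map, window=2, stride=2):
    """Max pooling over a single 2D feature map."""
    h = (feature_map.shape[0] - window) // stride + 1
    w = (feature_map.shape[1] - window) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # Take the maximum of the pixels inside this window
            patch = feature_map[i*stride:i*stride+window,
                                j*stride:j*stride+window]
            out[i, j] = patch.max()
    return out

fmap = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [3, 1, 4, 8],
], dtype=np.float32)

print(max_pool(fmap))
# [[6. 2.]
#  [3. 9.]]
```

Each 2x2 window collapses to its maximum, so a 4x4 feature map shrinks to 2x2: the dimensionality reduction the text motivates.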
Groundbreaking CNN Architectures
- Check out the AlexNet paper!
- Read more about VGGNet here.
- The ResNet paper can be found here.
- Here’s the Keras documentation for accessing some famous CNN architectures.
- Read this detailed treatment of the vanishing gradients problem.
- Here’s a GitHub repository containing benchmarks for different CNN architectures.
- Visit the ImageNet Large Scale Visual Recognition Competition (ILSVRC) website.