Convolutional Neural Network
Reference: http://cs231n.github.io/convolutional-networks/
Convolutional Neural Networks (CNNs / ConvNets)
Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous chapter: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural Networks still apply.
(A convolutional neural network is still a feed-forward neural network, and it is trained with backpropagation to reduce the loss, just like a regular network.)
So what does change? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the amount of parameters in the network.
(We assume the input to a CNN is an image; this assumption lets us greatly reduce the number of parameters and train the model efficiently.)
I. Architecture Overview
Recall: Regular Neural Nets. As we saw in the previous chapter, Neural Networks receive an input (a single vector), and transform it through a series of hidden layers. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the “output layer” and in classification settings it represents the class scores.
Recall from the previous chapter: in a regular feed-forward network the neurons of adjacent layers are fully connected, and the last fully-connected layer is the output layer that produces the score for each class.
Regular Neural Nets don’t scale well to full images. In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have 200*200*3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
Regular neural networks do not scale well to full images. In CIFAR-10 an image is 32x32x3, so a single fully-connected neuron in the first hidden layer already has 32*32*3 = 3072 weights. For a more realistic 200x200x3 image that becomes 200*200*3 = 120,000 weights per neuron, and with many such neurons the parameter count explodes. Full connectivity is therefore wasteful, and the huge number of parameters quickly leads to overfitting.
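To make this parameter counting concrete, here is a minimal pure-Python sketch; the hidden layer of 1000 neurons at the end is a hypothetical size chosen only for illustration:

```python
# Weights of a single fully-connected neuron = number of input values it sees.
def fc_neuron_weights(width, height, channels):
    return width * height * channels

# CIFAR-10 image: 32x32x3 -> 3072 weights per neuron
print(fc_neuron_weights(32, 32, 3))            # 3072

# A larger image: 200x200x3 -> 120,000 weights per neuron
print(fc_neuron_weights(200, 200, 3))          # 120000

# With a (hypothetical) hidden layer of 1000 such neurons,
# the fully-connected weight matrix alone has 120 million entries.
print(1000 * fc_neuron_weights(200, 200, 3))   # 120000000
```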
3D volumes of neurons. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension. Here is a visualization:
In short: a ConvNet arranges its neurons in 3D volumes with width, height and depth (here depth means the third dimension of an activation volume, not the number of layers in the network). Each neuron connects only to a small region of the previous layer rather than to all of its neurons, and for CIFAR-10 the final output volume is 1x1x10, a single vector of class scores arranged along the depth dimension.
Top: a regular 3-layer neural network. Bottom: a ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume into a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height are the dimensions of the image, and the depth is 3 (the red, green and blue channels).
II. Layers used to build ConvNets
As we described above, a simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture.
In short: a ConvNet is a stack of layers, each of which transforms one volume of activations to another through a differentiable function; the three main layer types are the convolutional layer, the pooling layer and the fully-connected layer.
Example Architecture: Overview. We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail:
We walk through a CNN with the architecture INPUT - CONV - RELU - POOL - FC (a shape walk-through sketch follows this list):
- INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.
- CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in a volume such as [32x32x12] if we decided to use 12 filters.
- RELU layer will apply an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
- POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
- FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer is connected to all the numbers in the previous volume.
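The sketch below (numpy, shapes only) tracks how the volume shape changes through this INPUT - CONV - RELU - POOL - FC pipeline. It assumes 12 conv filters with padding that preserves the 32x32 spatial size, 2x downsampling in the pool layer, and 10 output classes; it is bookkeeping for the shapes, not a trained network:

```python
import numpy as np

# Shape bookkeeping for the example CIFAR-10 ConvNet [INPUT - CONV - RELU - POOL - FC].
x = np.zeros((32, 32, 3))                 # INPUT: raw image volume

conv_out = np.zeros((32, 32, 12))         # CONV: 12 filters, spatial size preserved (assumed padding)
relu_out = np.maximum(conv_out, 0)        # RELU: elementwise max(0, x), shape unchanged
pool_out = relu_out[::2, ::2, :]          # POOL: 2x downsampling along width/height -> (16, 16, 12)
scores   = np.zeros((1, 1, 10))           # FC: one score per CIFAR-10 class

for name, v in [("INPUT", x), ("CONV", conv_out), ("RELU", relu_out),
                ("POOL", pool_out), ("FC", scores)]:
    print(name, v.shape)
# INPUT (32, 32, 3), CONV (32, 32, 12), RELU (32, 32, 12), POOL (16, 16, 12), FC (1, 1, 10)
```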
In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and other don’t. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image.
- A ConvNet architecture is, in the simplest case, a list of layers that transform the image volume into an output volume (e.g. holding the class scores).
- There are a few distinct types of layers (CONV/FC/RELU/POOL are by far the most popular).
- Each layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function.
- Each layer may or may not have parameters (CONV/FC do, RELU/POOL don't).
- Each layer may or may not have additional hyperparameters (CONV/FC/POOL do, RELU doesn't).
We now describe the individual layers and the details of their hyperparameters and their connectivities.
Next, we describe each layer and the details of its hyperparameters and connectivity.
1. Convolutional Layer
The Conv layer is the core building block of a Convolutional Network that does most of the computational heavy lifting.
Overview and intuition without brain stuff. Lets first discuss what the CONV layer computes without brain/neuron analogies. The CONV layer’s parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.
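As a minimal, unoptimized sketch of what the CONV forward pass computes, the following numpy code slides each filter over the input and takes dot products to build one 2D activation map per filter, then stacks the maps along depth. It assumes stride 1 and no zero-padding; the 12 filters of size 5x5x3 are just the example numbers used above:

```python
import numpy as np

def conv_forward_naive(x, filters, biases):
    """x: input volume (H, W, D); filters: (num_filters, F, F, D); biases: (num_filters,).
    Stride 1, no zero-padding. Returns an output volume (H-F+1, W-F+1, num_filters)."""
    H, W, D = x.shape
    num_filters, F, _, _ = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, num_filters))
    for k in range(num_filters):                  # one 2D activation map per filter
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                # dot product between the filter and the local region it currently covers
                region = x[i:i + F, j:j + F, :]
                out[i, j, k] = np.sum(region * filters[k]) + biases[k]
    return out

# Example: a 32x32x3 image and twelve 5x5x3 filters -> output volume 28x28x12
x = np.random.randn(32, 32, 3)
w = np.random.randn(12, 5, 5, 3)
b = np.zeros(12)
print(conv_forward_naive(x, w, b).shape)   # (28, 28, 12)
```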
The brain view. If you’re a fan of the brain/neuron analogies, every entry in the 3D output volume can also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with all neurons to the left and right spatially (since these numbers all result from applying the same filter). We now discuss the details of the neuron connectivities, their arrangement in space, and their parameter sharing scheme.
From the brain's point of view, we now discuss the details of how the neurons are connected, how they are arranged in space, and how they share parameters.
Local Connectivity. When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently this is the filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.
With local connectivity, each neuron processes only a local region of the image; the size of that region is the receptive field, which is equivalent to the filter size.
Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.
For example: input [32x32x3] and filter size 5x5, so each neuron in the Conv layer connects to a [5x5x3] local region of the input, i.e. 5*5*3 = 75 weights (plus 1 bias).
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
Example 2, in short: the connectivity is local in space (3x3) but always spans the full input depth (20), so each neuron has 3*3*20 = 180 connections to the input volume.
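A small numpy sketch of the two examples above (random values, purely illustrative): it counts the weights per neuron and shows the dot product a single neuron computes over its receptive field:

```python
import numpy as np

# Example 1: input volume 32x32x3, receptive field 5x5 -> 5*5*3 = 75 weights (+1 bias)
x = np.random.randn(32, 32, 3)
w = np.random.randn(5, 5, 3)
b = 0.0
print(w.size)                              # 75 weights per neuron
# One neuron's output: dot product of its weights with the local region it looks at
out = np.sum(w * x[0:5, 0:5, :]) + b

# Example 2: input volume 16x16x20, receptive field 3x3 -> 3*3*20 = 180 connections
print(np.prod((3, 3, 20)))                 # 180
```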
Top: an example input volume in red (e.g. a 32x32x3 CIFAR-10 image) and an example volume of neurons in the first convolutional layer. Each neuron in the convolutional layer is connected only to a local region of the input volume spatially, but to the full depth (i.e. all color channels). Note that there are multiple neurons along the depth (5 in this example), all looking at the same region of the input - see the discussion of depth columns in the text below. Bottom: the neurons from the Neural Network chapter remain unchanged: they still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.
Spatial arrangement. We have explained the connectivity of each neuron in the Conv Layer to the input volume, but we haven’t yet discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the depth, stride and zero-padding. We discuss these next:
Spatial arrangement. Three hyperparameters control the size of the output volume: the depth, the stride and the zero-padding. We discuss these next:
- First, the depth of the output volume is a hyperparameter: it corresponds to the number of filters we would like to use, each learning to look for something different in the input. For example, if the first Convolutional Layer takes as input the raw image, then different neurons along the depth dimension may activate in presence of various oriented edges, or blobs of color.