Convolutional Neural Network
Convolutional Neural Networks (CNNs / ConvNets)
Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous chapter: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural Networks still apply.
So what does change? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
Recall: Regular Neural Nets. As we saw in the previous chapter, Neural Networks receive an input (a single vector), and transform it through a series of hidden layers. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the “output layer” and in classification settings it represents the class scores.
Regular Neural Nets don’t scale well to full images. In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have 200*200*3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
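In short: regular neural networks do not scale well to large images. A CIFAR-10 image is 32x32x3, so each fully-connected neuron in the first hidden layer has 32*32*3 = 3072 weights. A more realistic 200x200x3 image gives 200*200*3 = 120,000 weights per neuron. The parameter count grows by orders of magnitude while we still need fast computation, so full connectivity is clearly not suitable.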
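The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name `fc_weights_per_neuron` and the hidden layer width of 1000 neurons are assumptions made for illustration and do not come from the notes.

```python
def fc_weights_per_neuron(width, height, depth):
    """Weights of one fully-connected neuron that sees the whole input volume."""
    return width * height * depth

cifar = fc_weights_per_neuron(32, 32, 3)      # CIFAR-10 sized input
larger = fc_weights_per_neuron(200, 200, 3)   # the larger image from the text

# Hypothetical hidden layer of 1000 such neurons (the width is an assumption).
hidden_units = 1000
print(cifar, larger, larger * hidden_units)
```

Even a modest layer of this kind on the 200x200x3 image would already need over a hundred million weights.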
3D volumes of neurons. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension. Here is a visualization:
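Top: A regular 3-layer Neural Network. Bottom: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).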
II. Layers used to build ConvNets
As we described above, a simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture.
Example Architecture: Overview. We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail:
We explain CNNs using the INPUT - CONV - RELU - POOL - FC architecture:
- INPUT [32x32x3] will hold the raw pixel values of the image, in this
case an image of width 32, height 32, and with three color channels
- CONV layer will compute the output of neurons that are connected to
local regions in the input, each computing a dot product between
their weights and a small region they are connected to in the input
volume. This may result in volume such as [32x32x12] if we decided to
use 12 filters.
- RELU layer will apply an elementwise activation function, such as
the max(0,x) thresholding at zero. This leaves the size of the
volume unchanged ([32x32x12]).
- POOL layer will perform a downsampling operation along the spatial
dimensions (width, height), resulting in volume such as [16x16x12].
- FC (i.e. fully-connected) layer will compute the class scores,
resulting in volume of size [1x1x10], where each of the 10 numbers
corresponds to a class score, such as among the 10 categories of
CIFAR-10. As with ordinary Neural Networks and as the name implies,
each neuron in this layer will be connected to all the numbers in
the previous volume.
In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and others don’t. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image.
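- There are a few distinct types of layers (e.g. CONV/FC/RELU/POOL are by far the most popular)
- Each layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don’t)
- Each layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)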
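To make the example architecture concrete, the following sketch traces the volume shape and the parameter count through INPUT - CONV - RELU - POOL - FC. The CONV settings (3x3 filters, stride 1, zero-padding 1) and the 2x2 pooling with stride 2 are assumptions chosen so that the shapes match the [32x32x12] and [16x16x12] volumes quoted above; the notes only fix the number of filters at 12.

```python
# Shapes are (width, height, depth), following the text.
input_shape = (32, 32, 3)

# CONV: 12 filters. F=3, S=1, P=1 are assumed so that 32x32 is preserved.
K, F, S, P = 12, 3, 1, 1
conv_w = (input_shape[0] - F + 2 * P) // S + 1
conv_shape = (conv_w, conv_w, K)
conv_params = (F * F * input_shape[2]) * K + K   # shared weights + biases

# RELU: elementwise, same shape, no parameters.
relu_shape, relu_params = conv_shape, 0

# POOL: 2x2 max pooling with stride 2 (assumed), no parameters.
pool_shape = (relu_shape[0] // 2, relu_shape[1] // 2, relu_shape[2])
pool_params = 0

# FC: 10 class scores, each connected to every number in the pooled volume.
num_classes = 10
fc_inputs = pool_shape[0] * pool_shape[1] * pool_shape[2]
fc_shape = (1, 1, num_classes)
fc_params = fc_inputs * num_classes + num_classes

layers = [("INPUT", input_shape, 0), ("CONV", conv_shape, conv_params),
          ("RELU", relu_shape, relu_params), ("POOL", pool_shape, pool_params),
          ("FC", fc_shape, fc_params)]
for name, shape, params in layers:
    print(f"{name:5s} {shape!s:14s} params={params}")
```

Only the CONV and FC rows report a non-zero parameter count, which matches the summary above.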
We now describe the individual layers and the details of their hyperparameters and their connectivities.
The Conv layer is the core building block of a Convolutional Network that does most of the computational heavy lifting.
Overview and intuition without brain stuff. Let’s first discuss what the CONV layer computes without brain/neuron analogies. The CONV layer’s parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.
The brain view. If you’re a fan of the brain/neuron analogies, every entry in the 3D output volume can also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with all neurons to the left and right spatially (since these numbers all result from applying the same filter). We now discuss the details of the neuron connectivities, their arrangement in space, and their parameter sharing scheme.
Local Connectivity. When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently this is the filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.
For local connectivity, we use a receptive field (equivalently, the filter size), so that each neuron processes only one region of the image.
Example 1. For example, suppose that the input volume has size [32x32x3] (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.
In short: for a [32x32x3] input and a 5x5 filter, each neuron sees a local [5x5x3] region of the input.
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
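Top: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note that there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input; see the discussion of depth columns in the text below. Bottom: The neurons from the Neural Network chapter remain unchanged: they still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.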
Spatial arrangement. We have explained the connectivity of each neuron in the Conv Layer to the input volume, but we haven’t yet discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the depth, stride and zero-padding. We discuss these next:
- First, the depth of the output volume is a hyperparameter: it
corresponds to the number of filters we would like to use, each
learning to look for something different in the input. For example,
if the first Convolutional Layer takes as input the raw image, then
different neurons along the depth dimension may activate in presence
of various oriented edges, or blobs of color. We will refer to a set
of neurons that are all looking at the same region of the input as a
depth column (some people also prefer the term fibre).
- Second, we must specify the stride with which we slide the filter. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2 (or, rarely in practice, 3 or more) then the filters jump 2 pixels at a time as we slide them around. This will produce smaller output volumes spatially.
- As we will soon see, sometimes it will be convenient to pad the
input volume with zeros around the border. The size of this
zero-padding is a hyperparameter. The nice feature of zero padding
is that it will allow us to control the spatial size of the output
volumes (most commonly as we’ll see soon we will use it to exactly
preserve the spatial size of the input volume so the input and
output width and height are the same).
We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” is given by (W−F+2P)/S+1. For example, for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output. Let’s also see one more graphical example:
In short: for a volume of size W, receptive field F, stride S and zero-padding P, the output size is (W−F+2P)/S+1.
For the 7x7 input above: (7−3+2*0)/1+1 = 5, i.e. a 5x5 output, and with stride 2: (7−3+2*0)/2+1 = 3.
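The formula can be wrapped in a small helper. This is a minimal sketch: the function name `conv_output_size` is hypothetical, and raising an exception when the neurons do not fit is just one possible way to handle an invalid setting.

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial output size (W - F + 2P)/S + 1; raises if the filter does not fit."""
    span = W - F + 2 * P
    if span < 0 or span % S != 0:
        raise ValueError(f"hyperparameters do not fit: W={W}, F={F}, S={S}, P={P}")
    return span // S + 1

print(conv_output_size(7, 3, S=1, P=0))   # the 7x7 input with stride 1
print(conv_output_size(7, 3, S=2, P=0))   # the same input with stride 2
```

The two calls reproduce the 5x5 and 3x3 outputs of the 7x7 example.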
Use of zero-padding. In the example above on left, note that the input dimension was 5 and the output dimension was equal: also 5. This worked out so because our receptive fields were 3 and we used zero padding of 1. If there was no zero-padding used, then the output volume would have had spatial dimension of only 3, because that is how many neurons would have “fit” across the original input. In general, setting zero padding to be P=(F−1)/2 when the stride is S=1 ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way and we will discuss the full reasons when we talk more about ConvNet architectures.
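A one-dimensional sketch of this size-preserving padding, using `np.pad`. The input values and the filter `w` are arbitrary choices for illustration; only the sizes (input 5, filter 3, padding 1) come from the text.

```python
import numpy as np

x = np.arange(1, 6, dtype=float)         # 1-D input of size W = 5
F, S = 3, 1
P = (F - 1) // 2                         # P = 1 keeps the size when S = 1

x_pad = np.pad(x, P, mode="constant")    # zeros on both ends, size 7
w = np.array([1.0, 0.0, -1.0])           # arbitrary example filter

out_size = (x.size - F + 2 * P) // S + 1
out = np.array([np.dot(x_pad[i * S:i * S + F], w) for i in range(out_size)])

no_pad_size = (x.size - F) // S + 1      # without padding only 3 positions fit
print(x_pad, out.shape, no_pad_size)
```

With padding the output has 5 entries, the same as the input; without it only 3 positions fit.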
Constraints on strides. Note again that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size W=10, no zero-padding is used (P=0), and the filter size is F=3, then it would be impossible to use stride S=2, since (W−F+2P)/S+1=(10−3+0)/2+1=4.5, i.e. not an integer, indicating that the neurons don’t “fit” neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library could throw an exception or zero pad the rest to make it fit, or crop the input to make it fit, or something. As we will see in the ConvNet architectures section, sizing the ConvNets appropriately so that all the dimensions “work out” can be a real headache, which the use of zero-padding and some design guidelines will significantly alleviate.
Real-world example. The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.
Parameter Sharing. A parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high.
It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: that if one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a depth slice (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice.
Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a convolution of the neuron’s weights with the input volume (Hence the name: Convolutional Layer). This is why it is common to refer to the sets of weights as a filter (or a kernel), that is convolved with the input.
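Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55*55 neurons in one depth slice. Notice that the parameter sharing assumption is relatively reasonable: if detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well, due to the translationally-invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55*55 distinct locations in the Conv layer output volume.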
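The parameter counts for this layer, with and without sharing, can be reproduced directly. The variable names below are illustrative only.

```python
out_w, out_h, K = 55, 55, 96     # output volume of the first Conv layer
F, D1 = 11, 3                    # receptive field size and input depth

weights_per_neuron = F * F * D1              # 11*11*3
neurons = out_w * out_h * K                  # 55*55*96

# Without sharing: every neuron owns its weights and one bias.
params_unshared = neurons * (weights_per_neuron + 1)

# With sharing: one filter (and one bias) per depth slice.
params_shared = K * weights_per_neuron + K

print(params_unshared, params_shared)
```

Sharing reduces the layer from roughly 105.7 million parameters to 34,944.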
Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. One practical example is when the inputs are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.
Numpy examples. To make the discussion above more concrete, let’s express the same ideas but in code and with a specific example. Suppose that the input volume is a numpy array X. Then:
A depth column (or a fibre) at position (x,y) would be the activations X[x,y,:]. A depth slice, or equivalently an activation map at depth d would be the activations X[:,:,d].
Conv Layer Example. Suppose that the input volume X has shape X.shape: (11,11,4). Suppose further that we use no zero padding (P=0), that the filter size is F=5, and that the stride is S=2. The output volume would therefore have spatial size (11-5)/2+1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it V), would then look as follows (only some of the elements are computed in this example):
    V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
    V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
    V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
    V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
Remember that in numpy, the operation * above denotes elementwise multiplication between the arrays. Notice also that the weight vector W0 is the weight vector of that neuron and b0 is the bias. Here, W0 is assumed to be of shape W0.shape: (5,5,4), since the filter size is 5 and the depth of the input volume is 4. Notice that at each point, we are computing the dot product as seen before in ordinary neural networks. Also, we see that we are using the same weight and bias (due to parameter sharing), and where the dimensions along the width are increasing in steps of 2 (i.e. the stride). To construct a second activation map in the output volume, we would have:
    V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1
    V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1
    V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1
    V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1
    V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1    # example of going along y
    V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1  # or along both
where we see that we are indexing into the second depth dimension in V (at index 1) because we are computing the second activation map, and that a different set of parameters (W1) is now used. In the example above, we are for brevity leaving out some of the other operations the Conv Layer would perform to fill the other parts of the output array V. Additionally, recall that these activation maps are often passed elementwise through an activation function such as ReLU, but this is not shown here.
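For completeness, a naive loop that fills in every entry of V might look like the sketch below. The function name `conv_forward_naive` and the layout of the filter array (filters stacked along the last axis, so that `W[..., 0]` plays the role of W0 and `W[..., 1]` of W1) are assumptions made for this illustration, and the random inputs are placeholders.

```python
import numpy as np

def conv_forward_naive(X, W, b, S=1, P=0):
    """X: (W1, H1, D1) input; W: (F, F, D1, K) filters; b: (K,) biases."""
    F, _, _, K = W.shape
    Xp = np.pad(X, ((P, P), (P, P), (0, 0)), mode="constant")
    W2 = (X.shape[0] - F + 2 * P) // S + 1
    H2 = (X.shape[1] - F + 2 * P) // S + 1
    V = np.zeros((W2, H2, K))
    for k in range(K):                      # one activation map per filter
        for i in range(W2):
            for j in range(H2):
                patch = Xp[i * S:i * S + F, j * S:j * S + F, :]
                V[i, j, k] = np.sum(patch * W[:, :, :, k]) + b[k]
    return V

rng = np.random.default_rng(0)
X = rng.standard_normal((11, 11, 4))
W = rng.standard_normal((5, 5, 4, 2))       # two filters: W0 and W1
b = rng.standard_normal(2)
V = conv_forward_naive(X, W, b, S=2, P=0)
print(V.shape)
```

The loops are written for readability rather than speed; each entry is computed exactly as in the hand-written lines above.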
Summary. To summarize, the Conv Layer:
- Accepts a volume of size W1×H1×D1
- Requires four hyperparameters:
  - Number of filters K,
  - their spatial extent F,
  - the stride S,
  - the amount of zero padding P.
- Produces a volume of size W2×H2×D2 where:
  - W2=(W1−F+2P)/S+1
  - H2=(H1−F+2P)/S+1 (i.e. width and height are computed equally by symmetry)
  - D2=K
- With parameter sharing, it introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases.
- In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by d-th bias.
A common setting of the hyperparameters is F=3, S=1, P=1. However, there are common conventions and rules of thumb that motivate these hyperparameters. See the ConvNet architectures section below.
Convolution Demo. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size W1=5,H1=5,D1=3, and the CONV layer parameters are K=2,F=3,S=2,P=1. That is, we have two filters of size 3×3, and they are applied with a stride of 2. Therefore, the output volume size has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of P=1 is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.
Implementation as Matrix Multiplication. Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows:
- The local regions in the input image are stretched out into columns in an operation commonly called im2col. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11*11*3 = 363. Iterating this process in the input at stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix X_col of im2col of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.
- The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11x11x3] this would give a matrix W_row of size [96 x 363].
- The result of a convolution is now equivalent to performing one large matrix multiply np.dot(W_row, X_col), which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96 x 3025], giving the output of the dot product of each filter at each location.
- The result must finally be reshaped back to its proper output dimension [55x55x96].
This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in X_col. However, the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used BLAS API). Moreover, the same im2col idea can be reused to perform the pooling operation, which we discuss next.
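The im2col steps can be sketched in NumPy as follows. This is an illustrative, loop-based version and not how an optimized library implements it; the helper name `im2col`, the column ordering, and the random data are assumptions. Biases and zero-padding are left out for brevity.

```python
import numpy as np

def im2col(X, F, S):
    """Stretch every F x F x D1 receptive field of X (W1, H1, D1) into a column."""
    W1, H1, D1 = X.shape
    W2 = (W1 - F) // S + 1
    H2 = (H1 - F) // S + 1
    cols = np.empty((F * F * D1, W2 * H2))
    for i in range(W2):
        for j in range(H2):
            patch = X[i * S:i * S + F, j * S:j * S + F, :]
            cols[:, i * H2 + j] = patch.ravel()
    return cols, (W2, H2)

rng = np.random.default_rng(0)
X = rng.standard_normal((227, 227, 3))
filters = rng.standard_normal((96, 11, 11, 3))    # K filters, each F x F x D1

X_col, (W2, H2) = im2col(X, F=11, S=4)            # [363 x 3025]
W_row = filters.reshape(96, -1)                   # [96 x 363]
out = np.dot(W_row, X_col)                        # [96 x 3025]
V = out.reshape(96, W2, H2).transpose(1, 2, 0)    # [55 x 55 x 96]
print(X_col.shape, W_row.shape, out.shape, V.shape)
```

Each patch and each filter are flattened in the same order, so a row of W_row times a column of X_col is exactly the dot product between one filter and one receptive field.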
Backpropagation. The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters). This is easy to derive in the 1-dimensional case with a toy example (not expanded on for now).
1x1 convolution. As an aside, several papers use 1x1 convolutions, as first investigated by Network in Network. Some people are at first confused to see 1x1 convolutions, especially when they come from a signal processing background. Normally signals are 2-dimensional so 1x1 convolutions do not make sense (it’s just pointwise scaling). However, in ConvNets this is not the case because one must remember that we operate over 3-dimensional volumes, and that the filters always extend through the full depth of the input volume. For example, if the input is [32x32x3] then doing 1x1 convolutions would effectively be doing 3-dimensional dot products (since the input depth is 3 channels).
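A short NumPy sketch of this idea: a 1x1 convolution is a dot product over the depth dimension, applied at every spatial position. The number of filters K=8 and the random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 32, 3))     # input volume (width, height, depth)
K = 8                                    # number of 1x1 filters (assumed)
W = rng.standard_normal((3, K))          # each column is one 1x1x3 filter
b = rng.standard_normal(K)

# Matrix product over the last axis: a 3-dimensional dot product per position.
V = X @ W + b                            # shape (32, 32, K)
print(V.shape)
```

The spatial size is unchanged; only the depth is remapped from 3 to K.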
Dilated convolutions. A recent development (e.g. see paper by Fisher Yu and Vladlen Koltun) is to introduce one more hyperparameter to the CONV layer called the dilation. So far we’ve only discussed CONV filters that are contiguous. However, it’s possible to have filters that have spaces between each cell, called dilation. As an example, in one dimension a filter w of size 3 would compute over input x the following: w[0]*x[0] + w[1]*x[1] + w[2]*x[2]. This is dilation of 0. For dilation 1 the filter would instead compute w[0]*x[0] + w[1]*x[2] + w[2]*x[4]. In other words, there is a gap of 1 between the applications. This can be very useful in some settings to use in conjunction with 0-dilated filters because it allows you to merge spatial information across the inputs much more aggressively with fewer layers. For example, if you stack two 3x3 CONV layers on top of each other then you can convince yourself that the neurons on the 2nd layer are a function of a 5x5 patch of the input (we would say that the effective receptive field of these neurons is 5x5). If we use dilated convolutions then this effective receptive field would grow much quicker.
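A one-dimensional sketch of a dilated filter, using the convention of this text where dilation 0 means a contiguous filter (some libraries, e.g. PyTorch, count the contiguous case as dilation 1). The function name `dilated_conv1d` and the example values are assumptions for illustration.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=0):
    """Apply 1-D filter w with `dilation` skipped inputs between taps (0 = contiguous)."""
    step = dilation + 1
    span = (len(w) - 1) * step + 1        # input positions covered by one application
    n_out = len(x) - span + 1
    return np.array([np.dot(w, x[i:i + span:step]) for i in range(n_out)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 10.0, 100.0])

y0 = dilated_conv1d(x, w, dilation=0)     # first entry: w[0]*x[0] + w[1]*x[1] + w[2]*x[2]
y1 = dilated_conv1d(x, w, dilation=1)     # first entry: w[0]*x[0] + w[1]*x[2] + w[2]*x[4]
print(y0[0], y1[0])
```

With the same three weights, the dilated filter covers 5 input positions instead of 3, which is why the effective receptive field grows faster when such layers are stacked.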
It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the number of parameters and the amount of computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers (little 2x2 region in some depth slice). The depth dimension remains unchanged. More generally, the pooling layer:
- Accepts a volume of size W1×H1×D1
- Requires two hyperparameters:
  - their spatial extent F,
  - the stride S.
- Produces a volume of size W2×H2×D2 where:
  - W2=(W1−F)/S+1
  - H2=(H1−F)/S+1
  - D2=D1
- Introduces zero parameters since it computes a fixed function of the input
- Note that it is not common to use zero-padding for Pooling layers
It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with F=3,S=2 (also called overlapping pooling), and more commonly F=2,S=2. Pooling sizes with larger receptive fields are too destructive.
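A naive max pooling forward pass could be sketched as follows. The function name `max_pool_forward` and the small test volumes are illustrative assumptions; the two calls correspond to the F=2,S=2 and F=3,S=2 settings mentioned above.

```python
import numpy as np

def max_pool_forward(X, F=2, S=2):
    """Max pooling on X of shape (W1, H1, D1), independently per depth slice."""
    W1, H1, D1 = X.shape
    W2 = (W1 - F) // S + 1
    H2 = (H1 - F) // S + 1
    out = np.empty((W2, H2, D1))
    for i in range(W2):
        for j in range(H2):
            window = X[i * S:i * S + F, j * S:j * S + F, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max over the spatial window only
    return out

X = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
Y = np.arange(7 * 7 * 3, dtype=float).reshape(7, 7, 3)

pooled = max_pool_forward(X, F=2, S=2)        # the most common setting
overlapped = max_pool_forward(Y, F=3, S=2)    # overlapping pooling
print(pooled.shape, overlapped.shape)
```

In both cases the depth (3 here) is preserved and only width and height shrink.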
General pooling. In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.
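The pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Top: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into an output volume of size [112x112x64]. Notice that the volume depth is preserved. Bottom: The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (a little 2x2 square).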
Backpropagation. Recall from the backpropagation chapter that the backward pass for a max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
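The idea of storing the switches can be sketched for a single depth slice. The function names and the small example matrix are assumptions for illustration; a real implementation would vectorize this and handle all depth slices at once.

```python
import numpy as np

def max_pool_forward_2d(x, F=2, S=2):
    """Pool one depth slice x (W1, H1); also return the argmax positions (switches)."""
    W2 = (x.shape[0] - F) // S + 1
    H2 = (x.shape[1] - F) // S + 1
    out = np.empty((W2, H2))
    switches = np.empty((W2, H2, 2), dtype=int)
    for i in range(W2):
        for j in range(H2):
            window = x[i * S:i * S + F, j * S:j * S + F]
            di, dj = np.unravel_index(np.argmax(window), window.shape)
            switches[i, j] = (i * S + di, j * S + dj)
            out[i, j] = window[di, dj]
    return out, switches

def max_pool_backward_2d(dout, switches, x_shape):
    """Route each upstream gradient to the input position that won the max."""
    dx = np.zeros(x_shape)
    for i in range(dout.shape[0]):
        for j in range(dout.shape[1]):
            r, c = switches[i, j]
            dx[r, c] += dout[i, j]
    return dx

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 1., 2., 3.],
              [0., 2., 4., 1.]])
out, switches = max_pool_forward_2d(x)
dx = max_pool_backward_2d(np.ones_like(out), switches, x.shape)
print(out)
print(dx)
```

All inputs that did not win a max receive zero gradient, so dx is non-zero in exactly one position per pooling window.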
Getting rid of pooling. Many people dislike the pooling operation and think that we can get away without it. For example, Striving for Simplicity: The All Convolutional Net proposes to discard the pooling layer in favor of an architecture that only consists of repeated CONV layers. To reduce the size of the representation they suggest using a larger stride in a CONV layer once in a while. Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will feature very few to no pooling layers.
Many types of normalization layers have been proposed for use in ConvNet architectures, sometimes with the intentions of implementing inhibition schemes observed in the biological brain. However, these layers have since fallen out of favor because in practice their contribution has been shown to be minimal, if any. For various types of normalizations, see the discussion in Alex Krizhevsky’s cuda-convnet library API.
Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the Neural Network section of the notes for more information.
5. Converting FC layers to CONV layers
It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it’s possible to convert between FC and CONV layers:
- For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
- Conversely, any FC layer can be converted to a CONV layer. For example, an FC layer with K=4096 that is looking at some input volume of size 7×7×512 can be equivalently expressed as a CONV layer with F=7, P=0, S=1, K=4096. In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be 1×1×4096, since only a single depth column “fits” across the input volume, giving an identical result to the initial FC layer.
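The FC-to-CONV direction can be checked numerically. The sketch below uses small stand-in sizes (7×7×8 input, K=16) instead of 7×7×512 and K=4096 so that it runs quickly; the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W_in, C, K = 7, 7, 8, 16            # small stand-ins for 7x7x512 and K=4096

x = rng.standard_normal((H, W_in, C))          # one input volume
W_fc = rng.standard_normal((K, H * W_in * C))  # FC weight matrix
b = rng.standard_normal(K)

# FC layer: flatten the volume and do one matrix-vector product.
fc_out = W_fc @ x.reshape(-1) + b              # shape (K,)

# The same weights viewed as K filters of size 7x7xC (F=7, P=0, S=1).
filters = W_fc.reshape(K, H, W_in, C)

# With F equal to the input size there is exactly one filter position,
# so the convolution reduces to one dot product per filter.
conv_out = np.einsum('khwc,hwc->k', filters, x) + b
conv_out = conv_out.reshape(1, 1, K)           # output volume 1x1xK

print(np.allclose(fc_out, conv_out.reshape(-1)))  # True
```

The only work in the conversion is the reshape of the weight matrix; no weights change.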
FC->CONV conversion. Of these two conversions, the ability to convert an FC layer to a CONV layer is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an AlexNet architecture that we’ll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7). From there, an AlexNet uses two FC layers of size 4096 and finally a last FC layer with 1000 neurons that computes the class scores. We can convert each of these three FC layers to CONV layers as described above:
- Replace the first FC layer that looks at [7x7x512] volume with a CONV layer that uses filter size F=7, giving output volume [1x1x4096].
- Replace the second FC layer with a CONV layer that uses filter size F=1, giving output volume [1x1x4096].
- Replace the last FC layer similarly, with F=1, giving final output [1x1x1000].
Each of these conversions could in practice involve manipulating (e.g. reshaping) the weight matrix W in each FC layer into CONV layer filters. It turns out that this conversion allows us to “slide” the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.
For example, if a 224x224 image gives a volume of size [7x7x512], i.e. a reduction by a factor of 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume of size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we’re now getting an entire 6x6 array of class scores across the 384x384 image.
- Evaluating the original ConvNet (with FC layers) independently across
224x224 crops of the 384x384 image in strides of 32 pixels gives an
identical result to forwarding the converted ConvNet one time.
Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation. This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores.
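The sliding behaviour can be sketched at the feature-map level. The code below assumes a 12x12 feature volume (as a 384x384 image would produce after a reduction by 32) and uses small stand-in depths (C=8 channels, K=5 classes); `conv_valid` is a hypothetical helper written as a plain loop for clarity, not for speed.

```python
import numpy as np

def conv_valid(x, filters, b):
    """Stride-1, no-padding convolution of x (H, W, C) with filters (K, F, F, C)."""
    K, F, _, _ = filters.shape
    H, W, _ = x.shape
    out = np.empty((H - F + 1, W - F + 1, K))
    for i in range(H - F + 1):
        for j in range(W - F + 1):
            window = x[i:i + F, j:j + F, :]
            out[i, j] = np.einsum('kfgc,fgc->k', filters, window) + b
    return out

rng = np.random.default_rng(0)
C, K, F = 8, 5, 7                       # stand-ins for 512 channels, 1000 classes
W_fc = rng.standard_normal((K, F * F * C))
b = rng.standard_normal(K)
filters = W_fc.reshape(K, F, F, C)      # FC weights reshaped into CONV filters

features = rng.standard_normal((12, 12, C))
scores = conv_valid(features, filters, b)
print(scores.shape)                     # (6, 6, 5)

# The original FC layer applied to a single 7x7 window gives the same scores.
i, j = 2, 3
fc_scores = W_fc @ features[i:i + F, j:j + F, :].reshape(-1) + b
print(np.allclose(scores[i, j], fc_scores))  # True
```

Each step of one position in the feature map corresponds to a shift of 32 pixels in the input image, which is where the 6x6 = 36 crop locations come from.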
Lastly, what if we wanted to efficiently apply the original ConvNet over the image but at a stride smaller than 32 pixels? We could achieve this with multiple forward passes. For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height.
An IPython Notebook on Net Surgery shows how to perform the conversion in practice, in code (using Caffe)
We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies elementwise non-linearity. In this section we discuss how these are commonly stacked together to form entire ConvNets.
The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern:
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3). For example, here are some common ConvNet architectures you may see that follow this pattern:
- INPUT -> FC, implements a linear classifier. Here N = M = K = 0.
- INPUT -> CONV -> RELU -> FC
- INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. Here we see that there is a single CONV layer between every POOL layer.
- INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC. Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
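The pattern above can be expanded mechanically from N, M and K. The helper below is a hypothetical illustration (the name `convnet_pattern` and the `pool` flag are assumptions); it only builds the layer sequence as a string and does not construct a network.

```python
def convnet_pattern(N, M, K, pool=True):
    """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC."""
    layers = ["INPUT"]
    for _ in range(M):
        layers += ["CONV", "RELU"] * N
        if pool:                      # POOL? is optional in the pattern
            layers.append("POOL")
    layers += ["FC", "RELU"] * K
    layers.append("FC")
    return " -> ".join(layers)

print(convnet_pattern(0, 0, 0))               # INPUT -> FC
print(convnet_pattern(1, 1, 0, pool=False))   # INPUT -> CONV -> RELU -> FC
print(convnet_pattern(1, 2, 1))
print(convnet_pattern(2, 3, 2))
```

The last two calls reproduce the third and fourth architectures in the list, written out layer by layer.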
Prefer a stack of small-filter CONV layers to one CONV layer with a large receptive field. Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume. Suppose that instead of these three layers of 3x3 CONV, we only wanted to use a single CONV layer with 7x7 receptive fields. These neurons would have a receptive field on the input volume that is identical in spatial extent (7x7), but with several disadvantages. First, the neurons would be computing a linear function over the input, while the stack of three CONV layers contains non-linearities that make its features more expressive. Second, if we suppose that all the volumes have C channels, then it can be seen that the single 7x7 CONV layer would contain C×(7×7×C) = 49C² parameters, while the three 3x3 CONV layers would only contain 3×(C×(3×3×C)) = 27C² parameters. Intuitively, stacking CONV layers with tiny filters as opposed to having one CONV layer with big filters allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.
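The receptive-field and parameter arithmetic in this paragraph can be reproduced with a short sketch. The function names are illustrative assumptions, biases are ignored, and C=64 is an arbitrary example value.

```python
def stacked_receptive_field(filter_sizes):
    """Receptive field on the input of a stack of stride-1 CONV layers."""
    field = 1
    for f in filter_sizes:
        field += f - 1      # each stride-1 layer widens the view by F - 1
    return field

def conv_weights(filter_size, channels):
    """Weights (no biases) of a CONV layer with C input and C output channels."""
    return channels * (filter_size * filter_size * channels)

print(stacked_receptive_field([3]))        # 3
print(stacked_receptive_field([3, 3]))     # 5
print(stacked_receptive_field([3, 3, 3]))  # 7

C = 64
three_small = 3 * conv_weights(3, C)   # 27 * C^2
one_large = conv_weights(7, C)         # 49 * C^2
print(three_small, one_large)          # 110592 200704
```

For any C the ratio is 27/49, so the three-layer stack covers the same 7x7 extent with roughly 55% of the weights.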
Recent departures. It should be noted that the conventional paradigm of a linear list of layers has recently been challenged, in Google’s Inception architectures and also in current (state of the art) Residual Networks from Microsoft Research Asia. Both of these (see details below in case studies section) feature more intricate and different connectivity structures.
In practice: use whatever works best on ImageNet. If you’re feeling a bit of a fatigue in thinking about the architectural decisions, you’ll be pleased to know that in 90% or more of applications you should not have to worry about these. I like to summarize this point as “don’t be a hero”: Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data. You should rarely ever have to train a ConvNet from scratch or design one from scratch. I also made this point at the Deep Learning school.
2，Layer Sizing Patterns
Until now we’ve omitted mentions of common hyperparameters used in each of the layers in a ConvNet. We will first state the common rules of thumb for sizing the architectures and then follow the rules with a discussion of the notation:
The input layer (that contains the image) should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), or 224 (e.g. common ImageNet ConvNets), 384, and 512.
The conv layers should be using small filters (e.g. 3x3 or at most 5x5), using a stride of S=1, and crucially, padding the input volume with zeros in such way that the conv layer does not alter the spatial dimensions of the input. That is, when F=3, then using P=1 will retain the original size of the input. When F=5, P=2. For a general F, it can be seen that P=(F−1)/2 preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that is looking at the input image.
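The padding rule can be checked against the usual output-size formula (W - F + 2P)/S + 1, the same formula used for (12 - 7)/1 + 1 = 6 earlier. This is a small sketch with assumed helper names; it raises an error when the filter does not tile the input evenly, rather than silently rounding.

```python
def conv_output_size(W, F, P, S):
    """Spatial output size (W - F + 2P)/S + 1 for input size W."""
    span = W - F + 2 * P
    if span < 0 or span % S != 0:
        raise ValueError(f"W={W}, F={F}, P={P}, S={S} does not fit evenly")
    return span // S + 1

def same_padding(F):
    """Zero-padding that preserves the spatial size at stride 1 (odd F only)."""
    if F % 2 == 0:
        raise ValueError("P = (F - 1)/2 is an integer only for odd F")
    return (F - 1) // 2

for F in (3, 5, 7):
    P = same_padding(F)
    print(F, P, conv_output_size(32, F, P, 1))
# 3 1 32
# 5 2 32
# 7 3 32
```

With S=1 and P=(F-1)/2 the output size equals the input size for every odd F, which is why the CONV layers in this scheme never change the spatial dimensions.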
The pool layers are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. F=2), and with a stride of 2 (i.e. S=2). Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height). Another slightly less common setting is to use 3x3 receptive fields with a stride of 2. It is very uncommon to see receptive field sizes for max pooling that are larger than 3 because the pooling is then too lossy and aggressive. This usually leads to worse performance.
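To make the 75% figure concrete, here is a minimal NumPy sketch of 2x2 max pooling with stride 2. The reshape-based implementation is one possible way to write it and assumes that the height and width are even.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) volume; H and W must be even."""
    H, W, C = x.shape
    # Group the pixels into non-overlapping 2x2 blocks, then take each block's max.
    blocks = x.reshape(H // 2, 2, W // 2, 2, C)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
y = max_pool_2x2(x)

print(y[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
print(1 - y.size / x.size)   # 0.75
```

Only one value out of every four survives, independently in each depth slice.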
Reducing sizing headaches. The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially. In an alternative scheme where we use strides greater than 1 or don’t zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters “work out”, and that the ConvNet architecture is nicely and symmetrically wired.
Why use stride of 1 in CONV? Smaller strides work better in practice. Additionally, as already mentioned stride 1 allows us to leave all spatial down-sampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise.
Why use padding? In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be “washed away” too quickly.
Compromising based on memory constraints. In some cases (especially early in the ConvNet architectures), the amount of memory can build up very quickly with the rules of thumb presented above. For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64]. This amounts to a total of about 10 million activations, or 72MB of memory (per image, for both activations and gradients). Since GPUs are often bottlenecked by memory, it may be necessary to compromise. In practice, people prefer to make the compromise at only the first CONV layer of the network. For example, one compromise might be to use a first CONV layer with filter sizes of 7x7 and stride of 2 (as seen in a ZF net). As another example, an AlexNet uses filter sizes of 11x11 and stride of 4.
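The activation count in this example is easy to recompute. The sketch below assumes 4-byte floats and counts activations plus gradients; the exact megabyte figure depends on rounding and on whether MB means 10^6 or 2^20 bytes, so it lands near, not exactly on, the 72MB quoted above.

```python
def activation_count(height, width, depth, num_layers):
    """Number of activations in num_layers volumes of size height x width x depth."""
    return height * width * depth * num_layers

n = activation_count(224, 224, 64, 3)
bytes_fwd_bwd = n * 4 * 2          # float32, activations plus their gradients

print(n)                           # 9633792
print(bytes_fwd_bwd / 1024**2)     # 73.5
```

This is the cost per image, so it is multiplied again by the batch size during training.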
There are several architectures in the field of Convolutional Networks that have a name. The most common are:
LeNet. The first successful applications of Convolutional Networks
were developed by Yann LeCun in the 1990s. Of these, the best known is
the LeNet architecture that was used to read zip codes, digits, etc.
AlexNet. The first work that popularized Convolutional Networks in
Computer Vision was the AlexNet, developed by Alex Krizhevsky, Ilya
Sutskever and Geoff Hinton. The AlexNet was submitted to the
ImageNet ILSVRC challenge in 2012 and significantly outperformed the
runner-up (top 5 error of 16% compared to the runner-up with 26%
error). The Network had a very similar architecture to LeNet, but
was deeper, bigger, and featured Convolutional Layers stacked on top
of each other (previously it was common to only have a single CONV
layer always immediately followed by a POOL layer).
ZF Net. The ILSVRC 2013 winner was a Convolutional Network from
Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short
for Zeiler & Fergus Net). It was an improvement on AlexNet by
tweaking the architecture hyperparameters, in particular by
expanding the size of the middle convolutional layers and making the
stride and filter size on the first layer smaller.
GoogLeNet. The ILSVRC 2014 winner was a Convolutional Network from
Szegedy et al. from Google. Its main contribution was the
development of an Inception Module that dramatically reduced the
number of parameters in the network (4M, compared to AlexNet with
60M). Additionally, this paper uses Average Pooling instead of Fully
Connected layers at the top of the ConvNet, eliminating a large
amount of parameters that do not seem to matter much. There are also
several followup versions to the GoogLeNet, most recently
VGGNet. The runner-up in ILSVRC 2014 was the network from Karen
Simonyan and Andrew Zisserman that became known as the VGGNet. Its
main contribution was in showing that the depth of the network is a
critical component for good performance. Their final best network
contains 16 CONV/FC layers and, appealingly, features an extremely
homogeneous architecture that only performs 3x3 convolutions and 2x2
pooling from the beginning to the end. Their pretrained model is
available for plug and play use in Caffe. A downside of the VGGNet
is that it is more expensive to evaluate and uses a lot more memory
and parameters (140M). Most of these parameters are in the first
fully connected layer, and it was since found that these FC layers
can be removed with no performance downgrade, significantly reducing
the number of necessary parameters.
ResNet. Residual Network developed by Kaiming He et al. was the
winner of ILSVRC 2015. It features special skip connections and a
heavy use of batch normalization. The architecture is also missing
fully connected layers at the end of the network. The reader is also
referred to Kaiming’s presentation (video, slides), and some recent
experiments that reproduce these networks in Torch. ResNets are
currently by far state of the art Convolutional Neural Network
models and are the default choice for using ConvNets in practice (as
of May 10, 2016). In particular, also see more recent developments
that tweak the original architecture from Kaiming He et al. Identity
Mappings in Deep Residual Networks (published March 2016).
VGGNet in detail. Let’s break down the VGGNet in more detail as a case study. The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding). We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights:
| Layer | Output volume | Memory | Weights |
|---|---|---|---|
| INPUT | [224x224x3] | 224\*224\*3=150K | 0 |
| CONV3-64 | [224x224x64] | 224\*224\*64=3.2M | (3\*3\*3)\*64 = 1,728 |
| CONV3-64 | [224x224x64] | 224\*224\*64=3.2M | (3\*3\*64)\*64 = 36,864 |
| POOL2 | [112x112x64] | 112\*112\*64=800K | 0 |
| CONV3-128 | [112x112x128] | 112\*112\*128=1.6M | (3\*3\*64)\*128 = 73,728 |
| CONV3-128 | [112x112x128] | 112\*112\*128=1.6M | (3\*3\*128)\*128 = 147,456 |
| POOL2 | [56x56x128] | 56\*56\*128=400K | 0 |
| CONV3-256 | [56x56x256] | 56\*56\*256=800K | (3\*3\*128)\*256 = 294,912 |
| CONV3-256 | [56x56x256] | 56\*56\*256=800K | (3\*3\*256)\*256 = 589,824 |
| CONV3-256 | [56x56x256] | 56\*56\*256=800K | (3\*3\*256)\*256 = 589,824 |
| POOL2 | [28x28x256] | 28\*28\*256=200K | 0 |
| CONV3-512 | [28x28x512] | 28\*28\*512=400K | (3\*3\*256)\*512 = 1,179,648 |
| CONV3-512 | [28x28x512] | 28\*28\*512=400K | (3\*3\*512)\*512 = 2,359,296 |
| CONV3-512 | [28x28x512] | 28\*28\*512=400K | (3\*3\*512)\*512 = 2,359,296 |
| POOL2 | [14x14x512] | 14\*14\*512=100K | 0 |
| CONV3-512 | [14x14x512] | 14\*14\*512=100K | (3\*3\*512)\*512 = 2,359,296 |
| CONV3-512 | [14x14x512] | 14\*14\*512=100K | (3\*3\*512)\*512 = 2,359,296 |
| CONV3-512 | [14x14x512] | 14\*14\*512=100K | (3\*3\*512)\*512 = 2,359,296 |
| POOL2 | [7x7x512] | 7\*7\*512=25K | 0 |
| FC | [1x1x4096] | 4096 | 7\*7\*512\*4096 = 102,760,448 |
| FC | [1x1x4096] | 4096 | 4096\*4096 = 16,777,216 |
| FC | [1x1x1000] | 1000 | 4096\*1000 = 4,096,000 |

TOTAL memory: 24M \* 4 bytes ~= 93MB / image (only forward! ~\*2 for bwd)

TOTAL params: 138M parameters
As is common with Convolutional Networks, notice that most of the memory (and also compute time) is used in the early CONV layers, and that most of the parameters are in the last FC layers. In this particular case, the first FC layer contains 100M weights, out of a total of 140M.
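The weight counts in the breakdown above can be regenerated from the layer configuration. The sketch below encodes the configuration as a plain list (an assumed representation, with "P" marking a 2x2 pool) and counts weights only, without biases, as the listing does.

```python
# Output depth of each 3x3 CONV layer in order; "P" marks a 2x2 max pool.
cfg = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
       512, 512, 512, "P", 512, 512, 512, "P"]
fc_sizes = [4096, 4096, 1000]

size, depth = 224, 3          # input volume is 224x224x3
total_weights = 0

for item in cfg:
    if item == "P":
        size //= 2            # pooling halves width and height
        print(f"POOL2      [{size}x{size}x{depth}]  weights: 0")
    else:
        weights = 3 * 3 * depth * item
        depth = item          # pad 1, stride 1 keeps the spatial size
        total_weights += weights
        print(f"CONV3-{item:<4} [{size}x{size}x{depth}]  weights: {weights:,}")

fan_in = size * size * depth  # 7*7*512 inputs to the first FC layer
first_fc_weights = fan_in * fc_sizes[0]
for n in fc_sizes:
    weights = fan_in * n
    total_weights += weights
    print(f"FC         [1x1x{n}]  weights: {weights:,}")
    fan_in = n

print(f"TOTAL weights: {total_weights:,}")   # TOTAL weights: 138,344,128
```

The first FC layer alone accounts for 102,760,448 of these weights, which is the roughly 100M out of about 140M mentioned above.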
The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of:
- From the intermediate volume sizes: These are the raw number of activations at every layer of the ConvNet, and also their gradients (of equal size). Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below.
- From the parameter sizes: These are the numbers that hold the network parameters, their gradients during backpropagation, and commonly also a step cache if the optimization is using momentum, Adagrad, or RMSProp. Therefore, the memory to store the parameter vector alone must usually be multiplied by a factor of at least 3 or so.
- Every ConvNet implementation has to maintain miscellaneous memory, such as the image data batches, perhaps their augmented versions, etc.
Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn’t fit, a common heuristic to “make it fit” is to decrease the batch size, since most of the memory is usually consumed by the activations.
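The conversion described here is a few lines of arithmetic. In the sketch below the function name, the batch size of 64, and the use of 24M activations per image (the VGGNet forward figure from the case study, doubled for gradients) are illustrative assumptions; parameters and miscellaneous memory are left out.

```python
def values_to_memory(num_values, bytes_per_value=4):
    """Convert a count of stored numbers to (KB, MB, GB), dividing by 1024 each time."""
    num_bytes = num_values * bytes_per_value   # use 8 for double precision
    kb = num_bytes / 1024
    mb = kb / 1024
    gb = mb / 1024
    return kb, mb, gb

per_image = 24_000_000 * 2        # activations plus gradients (assumed figure)
for batch_size in (64, 32, 16):
    _, _, gb = values_to_memory(per_image * batch_size)
    print(batch_size, round(gb, 2))
```

Because the activation memory scales linearly with the batch size, halving the batch halves this part of the estimate, which is why shrinking the batch is the usual first step when a network does not fit.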
Additional resources related to implementation:
- [Soumith benchmarks for CONV performance](https://github.com/soumith/convnet-benchmarks)
- ConvNetJS CIFAR-10 demo allows you to play with ConvNet architectures and see the results and computations in real time, in the browser.
- Caffe, one of the popular ConvNet libraries.
- [State of the art ResNets in Torch7](http://torch.ch/blog/2016/02/04/resnets.html)