Convolutional Neural Networks (CNNs / ConvNets) — Translation, Part 2

Architecture Overview


Recall: Regular Neural Nets. As we saw in the previous chapter, Neural Networks receive an input (a single vector), and transform it through a series of hidden layers. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the “output layer” and in classification settings it represents the class scores.
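To make the paragraph above concrete, here is a minimal NumPy sketch (illustrative only, not code from the notes) of one fully-connected hidden layer: every neuron connects to all inputs, and neurons within the layer share no connections with one another. The layer width of 100 is an arbitrary choice for the example.

```python
import numpy as np

# Illustration: a 3072-dim input (a flattened 32x32x3 CIFAR-10 image)
# feeding a fully-connected hidden layer of 100 neurons.
rng = np.random.default_rng(0)
x = rng.standard_normal(3072)          # input: a single flattened image vector
W = rng.standard_normal((100, 3072))   # each neuron connects to all 3072 inputs
b = np.zeros(100)

# Each hidden neuron computes an independent dot product of the whole
# input; neurons in the same layer do not interact with each other.
h = np.maximum(0, W @ x + b)           # ReLU activation
print(h.shape)                         # (100,)
```

Note that `W` holds one row of 3072 weights per neuron — this is the full connectivity that the next paragraph argues does not scale.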

Regular Neural Nets don’t scale well to full images. In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have 200*200*3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
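The weight counts in the paragraph above follow directly from the input dimensions; a small helper (my own, for illustration) reproduces the arithmetic:

```python
# Number of weights for ONE neuron fully connected to an input volume
# of the given width, height, and number of color channels.
def fc_weights(width, height, channels):
    return width * height * channels

print(fc_weights(32, 32, 3))     # 3072   (a CIFAR-10 image)
print(fc_weights(200, 200, 3))   # 120000 (a 200x200 RGB image)
```

And this is per neuron: a hidden layer of, say, 100 such neurons on the 200x200x3 input would already need 12 million weights.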
3D volumes of neurons. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension. Here is a visualization:
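A sketch of the volumes described above (my own illustration, using NumPy arrays with an assumed (width, height, depth) shape convention): the input and output of the CIFAR-10 ConvNet are 3D volumes of activations, and a neuron sees only a small local region of the previous volume. The 5x5 region size is an arbitrary example.

```python
import numpy as np

# Activation volumes as arrays of shape (width, height, depth).
input_volume = np.zeros((32, 32, 3))    # CIFAR-10 image: 32 wide, 32 high, 3 channels
output_volume = np.zeros((1, 1, 10))    # final class scores arranged along depth

# A ConvNet neuron connects only to a small local region of the previous
# layer, e.g. a 5x5 patch extending through the full input depth:
local_region = input_volume[0:5, 0:5, :]
print(local_region.size)   # 75 inputs per neuron, vs. 3072 for full connectivity
```

Comparing 75 against 3072 shows why constraining connectivity to local regions makes the architecture scale to images.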


What these paragraphs are telling us is that the architecture of a regular neural network is too bloated and has an inherent disadvantage when processing images, whereas the architecture of a Convolutional Neural Network is naturally suited to handling them.
