Convolutional Neural Networks (CNNs / ConvNets): Translation, Part 2

Architecture Overview

Recall: Regular Neural Nets. As we saw in the previous chapter, Neural Networks receive an input (a single vector), and transform it through a series of hidden layers. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the “output layer” and in classification settings it represents the class scores.
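
The recap above can be made concrete with a small sketch in plain Python. It is only an illustration under stated assumptions: the layer sizes (8, 5, 3), the random initialization, and the ReLU non-linearity are choices made here and do not come from the text. What it does show is the structure described: every neuron in a layer reads every value from the previous layer, and neurons within a layer share nothing.

```python
import random

def fully_connected(inputs, weights, biases):
    """One fully-connected layer: each output neuron has one weight per input."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def relu(values):
    # Non-linearity between layers (an assumption; the text does not name one).
    return [max(0.0, v) for v in values]

def init_layer(n_in, n_out, rng):
    # One independent weight list per neuron: no sharing within a layer.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
x = [rng.random() for _ in range(8)]   # toy input vector (size is hypothetical)
w1, b1 = init_layer(8, 5, rng)         # hidden layer with 5 neurons
w2, b2 = init_layer(5, 3, rng)         # output layer with 3 class scores

hidden = relu(fully_connected(x, w1, b1))
scores = fully_connected(hidden, w2, b2)
print(len(hidden), len(scores))        # prints: 5 3
```

Each row of w1 has as many entries as the input vector, which is exactly the "fully connected" property; the final list plays the role of the class scores.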

Regular Neural Nets don’t scale well to full images. In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have 200*200*3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
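
The arithmetic above is easy to check with a few lines of Python. This is a minimal sketch; the function names are made up for illustration, and the layer width of 1,000 neurons is a hypothetical figure chosen only to show how fast the total grows (the text says just "several such neurons"). Biases are left out, matching the weight counts in the text.

```python
def fc_weights_per_neuron(width, height, channels):
    """Weights of one fully-connected neuron that sees the whole image."""
    return width * height * channels

def fc_layer_weights(width, height, channels, num_neurons):
    """Total weights in a first hidden layer with num_neurons such neurons."""
    return fc_weights_per_neuron(width, height, channels) * num_neurons

cifar_neuron = fc_weights_per_neuron(32, 32, 3)       # CIFAR-10 image
larger_neuron = fc_weights_per_neuron(200, 200, 3)    # 200x200x3 image
larger_layer = fc_layer_weights(200, 200, 3, 1000)    # 1000 neurons: hypothetical

print(cifar_neuron)    # prints: 3072
print(larger_neuron)   # prints: 120000
print(larger_layer)    # prints: 120000000
```

Under that assumed layer width, a single hidden layer on a 200x200x3 image already needs 120 million weights.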

3D volumes of neurons. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of to all of its neurons in a fully-connected manner. Moreover, the final output layer for CIFAR-10 would have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension.

[Figure: visualization of the 3D volumes of neurons in a ConvNet]
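
To make the volume terminology concrete, here is a minimal sketch that builds the input and output volumes as nested Python lists and compares a fully-connected neuron with a locally connected one. The helper names and the 5x5 region size are hypothetical: the text only says "a small region". The sketch also assumes that the region spans the full depth of the input, which the text has not yet stated.

```python
def make_volume(width, height, depth, fill=0.0):
    """Activation volume stored as volume[y][x][d]."""
    return [[[fill] * depth for _ in range(width)] for _ in range(height)]

def volume_shape(volume):
    """Return (width, height, depth), in the order used in the text."""
    height = len(volume)
    width = len(volume[0])
    depth = len(volume[0][0])
    return (width, height, depth)

input_volume = make_volume(32, 32, 3)    # CIFAR-10 image: 32x32x3
output_volume = make_volume(1, 1, 10)    # final layer: 1x1x10
class_scores = output_volume[0][0]       # the 10 scores lie along depth

REGION = 5                                # hypothetical local region size
local_weights = REGION * REGION * 3       # neuron that sees a 5x5x3 patch
full_weights = 32 * 32 * 3                # neuron that sees the whole image

print(volume_shape(input_volume))         # prints: (32, 32, 3)
print(volume_shape(output_volume))        # prints: (1, 1, 10)
print(local_weights, full_weights)        # prints: 75 3072
```

With these assumed numbers, a locally connected neuron needs 75 weights where a fully-connected one needs 3072, which is the saving the paragraph hints at.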





