4 Reducing Overfitting
“4 Reducing Overfitting” (Krizhevsky et al., 2017, p. 5)
“Our neural network architecture has 60 million parameters. Although the 1000 classes of ILSVRC make each training example impose 10 bits of constraint on the mapping from image to label, this turns out to be insufficient to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat overfitting.” (Krizhevsky et al., 2017, p. 5)
Our neural network architecture has 60 million parameters. Although the 1000 classes of ILSVRC mean that each training example imposes 10 bits of constraint on the mapping from image to label, this is not enough to learn so many parameters without considerable overfitting. Below we describe the two main ways in which we combat overfitting.
Interpretation
(1) The "10 bits of constraint": with 1000 classes, a label carries roughly log2(1000) ≈ 10 bits of information, since a 10-bit binary number can represent up to 1024 distinct values.
(2) With 1000 classes and a limited number of examples per class, a model with this many parameters can easily overfit.
“4.1 Data Augmentation” (Krizhevsky et al., 2017, p. 5)
4.1 Data Augmentation
“The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., [25, 4, 5]). We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.” (Krizhevsky et al., 2017, p. 5)
The easiest and most common way to reduce overfitting on image data is to artificially enlarge the dataset with label-preserving transformations (e.g., [25, 4, 5]). Two distinct forms of data augmentation are used, both of which produce transformed images from the originals with very little computation, so the transformed images never need to be stored on disk. In this implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images, so these augmentation schemes are effectively computationally free.
Interpretation
(1) The simplest and most common way to reduce overfitting is label-preserving transformation: apply operations such as translation, reflection, or scaling to an image while keeping its label unchanged, thereby enlarging the dataset.
(2) The image transformations (preprocessing) run on the CPU while the GPU trains on the previous batch, so the augmentation adds essentially no extra cost; a sketch of this overlap follows.
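A minimal sketch of this overlap, assuming a simple producer thread on the CPU and a placeholder `train_step` standing in for the GPU work (all names and the toy data are illustrative, not from the paper's code):

```python
import queue
import threading

import numpy as np

def augment(batch):
    # Hypothetical CPU-side augmentation: flip a random subset of images horizontally.
    flip = np.random.rand(len(batch)) < 0.5
    batch[flip] = batch[flip, :, ::-1, :]
    return batch

def train_step(batch):
    # Stand-in for the GPU forward/backward pass on one batch.
    pass

raw_batches = (np.random.rand(8, 256, 256, 3) for _ in range(10))  # toy data
q = queue.Queue(maxsize=2)

def prefetch():
    # Producer: the CPU prepares the next batch while the GPU trains on the current one.
    for b in raw_batches:
        q.put(augment(b))
    q.put(None)  # sentinel marking the end of the epoch

threading.Thread(target=prefetch, daemon=True).start()
while (batch := q.get()) is not None:
    train_step(batch)
```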
“The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256×256 images and training our network on these extracted patches. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.” (Krizhevsky et al., 2017, p. 5)
The first form of data augmentation consists of generating image translations and horizontal reflections. This is done by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training the network on these extracted patches. This increases the size of the training set by a factor of 2048, although the resulting training examples are of course highly interdependent. Without this scheme the network suffers from substantial overfitting, which would have forced the use of much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (ten patches in all) and averaging the predictions of the softmax layer over the ten patches.
Interpretation
(1) The first form of augmentation: image translations and horizontal reflections (left-right flips).
(2) Extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images enlarges the dataset by a factor of 2048 (32 × 32 possible offsets × 2 for the flip), which greatly reduces overfitting. A sketch of the scheme follows.
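A minimal numpy sketch of this scheme (`random_crop_flip` and `ten_crop` are illustrative names, not from the paper's code; images are assumed to be H × W × C arrays):

```python
import numpy as np

def random_crop_flip(img, crop=224):
    # 32 possible offsets per axis and 2 flips give 32 * 32 * 2 = 2048 variants per image.
    top = np.random.randint(0, img.shape[0] - crop)
    left = np.random.randint(0, img.shape[1] - crop)
    patch = img[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]          # horizontal reflection
    return patch

def ten_crop(img, crop=224):
    # Test time: four corner patches + the center patch, each with its horizontal reflection.
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    patches = [img[t:t + crop, l:l + crop] for t, l in offsets]
    patches += [p[:, ::-1] for p in patches]
    return np.stack(patches)            # softmax predictions are averaged over these 10 patches

img = np.random.rand(256, 256, 3)       # toy image
print(random_crop_flip(img).shape, ten_crop(img).shape)   # (224, 224, 3) (10, 224, 224, 3)
```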
“The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore to each RGB image pixel $I_{xy} = [I^{R}_{xy}, I^{G}_{xy}, I^{B}_{xy}]^{T}$ we add the following quantity:” (Krizhevsky et al., 2017, p. 5)
The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, PCA is performed on the set of RGB pixel values over the whole ImageNet training set. To each training image we add multiples of the principal components found, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore, to each RGB image pixel $I_{xy} = [I^{R}_{xy}, I^{G}_{xy}, I^{B}_{xy}]^{T}$ we add the following quantity:
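The quantity referred to here, reconstructed from the paper, is:

$$
[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3]\,[\alpha_1 \lambda_1,\ \alpha_2 \lambda_2,\ \alpha_3 \lambda_3]^{T}
$$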
“where $p_i$ and $\lambda_i$ are the $i$th eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is the aforementioned random variable. Each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again, at which point it is re-drawn. This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination. This scheme reduces the top-1 error rate by over 1%.” (Krizhevsky et al., 2017, p. 6)
Here $p_i$ and $\lambda_i$ are the $i$th eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is the random variable mentioned above. Each $\alpha_i$ is drawn only once for all the pixels of a particular training image and is not re-drawn until that image is used for training again. The scheme approximately captures an important property of natural images: object identity is invariant to changes in the intensity and color of the illumination. It reduces the top-1 error rate by over 1%.
Interpretation
(1) The second form of augmentation alters the intensity of each RGB channel using PCA on the set of RGB pixel values (here PCA is used to find the principal directions of color variation to perturb along, rather than to reduce dimensionality). The standard PCA procedure is:
1. Center the data: subtract the mean of each feature so that the origin of the new coordinate system lies at the center of the data.
2. Compute the covariance matrix of the centered data. The covariance matrix describes the correlations between features; each entry is the covariance between a pair of features.
3. Perform an eigen-decomposition of the covariance matrix. This yields its eigenvalues and eigenvectors: the eigenvectors give the directions of the new coordinate axes, and the eigenvalues give the variance of the data along those directions.
4. Select the principal components: sort the eigenvalues in descending order and keep the eigenvectors corresponding to the k largest eigenvalues. These eigenvectors are the principal components, and their eigenvalues indicate how much of the variance each one captures.
5. Project the original data onto the selected principal components to obtain the reduced representation; this amounts to multiplying the data by the matrix formed from the selected components.
(2) For each training image, a random coefficient $\alpha_i$ is drawn for each principal component from a Gaussian with mean 0 and standard deviation 0.1, and the perturbation added along component $i$ is proportional to $\alpha_i \lambda_i$.
(3) Each $\alpha_i$ is drawn once per image and re-drawn the next time that image is used for training; it is not a learned parameter.
(4) This scheme reduces the top-1 error rate by over 1% and reflects the fact that object identity is invariant to changes in the intensity and color of the illumination. A sketch of the augmentation follows.
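A hedged numpy sketch of the scheme. The paper computes the PCA once over the RGB pixel values of the entire training set; here it is computed per image purely for brevity, and the function name is illustrative:

```python
import numpy as np

def pca_color_augment(img, std=0.1):
    # Fancy-PCA style color jitter on an H x W x 3 image.
    pixels = img.reshape(-1, 3).astype(np.float64)
    cov = np.cov(pixels, rowvar=False)            # 3x3 covariance of the (centered) RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigen-decomposition: lambda_i, p_i
    alphas = np.random.normal(0.0, std, size=3)   # alpha_i ~ N(0, 0.1^2), drawn once per image
    shift = eigvecs @ (alphas * eigvals)          # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return img + shift                            # the same shift is added to every pixel

img = np.random.rand(256, 256, 3)                 # toy image in [0, 1]
print(pca_color_augment(img).shape)               # (256, 256, 3)
```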
“4.2 Dropout” (Krizhevsky et al., 2017, p. 6)
4.2 Dropout
“Combining the predictions of many different models is a very successful way to reduce test errors [1, 3], but it appears to be too expensive for big neural networks that already take several days to train. There is, however, a very efficient version of model combination that only costs about a factor of two during training. The recently-introduced technique, called “dropout” [10], consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks.” (Krizhevsky et al., 2017, p. 6)
Combining the predictions of many different models is a very successful way to reduce test error [1, 3], but it appears too expensive for large neural networks that already take several days to train. There is, however, a very efficient form of model combination that only costs about a factor of two during training. The recently introduced technique called "dropout" [10] sets the output of each hidden neuron to zero with probability 0.5. Neurons that are "dropped out" in this way contribute nothing to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all of these architectures share weights. The technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons; it is therefore forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time all neurons are used, but their outputs are multiplied by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially many dropout networks.
Interpretation
(1) Combining the predictions of several models (ensembling) is an effective way to reduce test error, but it is too expensive for a large network that already takes days to train; dropout provides a much cheaper approximation to model combination.
Ensembling: average the predictions of several independently trained models.
Dropout: drop each neuron with a certain probability; dropped neurons take no part in the forward or backward pass.
(2) Every input therefore samples a different architecture, but all of these architectures share the same weights; dropout reduces the co-adaptation (mutual dependence) between neurons.
(3) At test time all neurons are used and their outputs are multiplied by 0.5; this approximates the geometric mean of the predictions of the exponentially many thinned networks. A sketch follows.
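A minimal numpy sketch of this "vanilla" dropout (as in the paper, the scaling by 0.5 happens at test time, unlike the inverted-dropout variant common today; the function name is illustrative):

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, train=True):
    # Training: zero each activation with probability p_drop; dropped units take no
    # part in the forward or backward pass for that input.
    # Test: keep every unit but scale its output by (1 - p_drop), approximating the
    # geometric mean over the exponentially many thinned networks.
    if train:
        mask = (np.random.rand(*x.shape) >= p_drop).astype(x.dtype)
        return x * mask, mask          # the mask is reused in backpropagation
    return x * (1.0 - p_drop), None

h = np.random.rand(4, 8)               # toy hidden activations
train_out, mask = dropout_forward(h, train=True)
test_out, _ = dropout_forward(h, train=False)
```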
“We use dropout in the first two fully-connected layers of Figure 2. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.” (Krizhevsky et al., 2017, p. 6)
Dropout is used in the first two fully-connected layers of Figure 2. Without dropout, the network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
Interpretation
(1) Dropout greatly reduces the network's overfitting, at the cost of roughly doubling the number of iterations needed to converge.
“5 Details of learning” (Krizhevsky et al., 2017, p. 6)
5 Details of learning
“We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error. The update rule for weight w was” (Krizhevsky et al., 2017, p. 6)
The models are trained with stochastic gradient descent, using a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. This small amount of weight decay turned out to be important for the model to learn; in other words, weight decay here is not merely a regularizer, it reduces the model's training error. The update rule for weight $w$ is:
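The update rule, reconstructed from the paper, is:

$$
v_{i+1} := 0.9\, v_i - 0.0005\,\epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i},
\qquad
w_{i+1} := w_i + v_{i+1}
$$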
“where $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and $\left\langle \frac{\partial L}{\partial w}\big|_{w_i} \right\rangle_{D_i}$ is the average over the $i$th batch $D_i$ of the derivative of the objective with respect to $w$, evaluated at $w_i$.” (Krizhevsky et al., 2017, p. 6)
Here $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and $\left\langle \frac{\partial L}{\partial w}\big|_{w_i} \right\rangle_{D_i}$ is the average, over the $i$th batch $D_i$, of the derivative of the objective with respect to $w$, evaluated at $w_i$.
Interpretation
(1) The model is trained with SGD: batch size = 128, momentum = 0.9, weight decay = 0.0005; the learning rate is initialized to 0.01 (0.9 is the momentum, not the learning rate).
(2) The small weight decay is not just a regularizer here; it also reduces the training error. A numerical sketch of the update follows.
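A small numpy sketch of this update with the paper's hyperparameters (the function name and toy tensors are illustrative):

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    # v_{i+1} = 0.9 * v_i - 0.0005 * lr * w_i - lr * <dL/dw>
    # w_{i+1} = w_i + v_{i+1}
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v

w = np.random.randn(10) * 0.01        # weights drawn from N(0, 0.01^2)
v = np.zeros_like(w)                   # momentum buffer
grad = np.random.randn(10)             # stand-in for the averaged batch gradient
w, v = sgd_step(w, v, grad)
```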
“Figure 3: 96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2. See Section 6.1 for details.” (Krizhevsky et al., 2017, p. 6)
Figure 3: the 96 convolutional kernels of size 11 × 11 × 3 learned by the first convolutional layer on the 224 × 224 × 3 input images. The top 48 kernels were learned on GPU 1 and the bottom 48 on GPU 2. See Section 6.1 for details.
“We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron biases in the remaining layers with the constant 0.” (Krizhevsky et al., 2017, p. 6)
The weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. The neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, are initialized to the constant 1; this accelerates the early stages of learning by providing the ReLUs with positive inputs. The neuron biases in the remaining layers are initialized to the constant 0.
Interpretation
(1) Initialization settings:
weights: drawn from a Gaussian with mean 0 and standard deviation 0.01;
biases of convolutional layers 2, 4, and 5 and of the fully-connected layers: 1;
biases of the remaining layers: 0.
(2) Initializing these biases to 1 gives the ReLUs positive inputs from the start, which accelerates the early stages of learning. A sketch follows.
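A minimal sketch of this initialization, with layer shapes flattened purely for illustration (the helper name is mine, not from the paper's code):

```python
import numpy as np

def init_layer(shape, bias_one=False):
    # Weights: zero-mean Gaussian with standard deviation 0.01.
    w = np.random.normal(0.0, 0.01, size=shape)
    # Biases: 1 for conv layers 2, 4, 5 and the fully-connected layers
    # (gives the ReLUs positive inputs early on), 0 everywhere else.
    b = np.ones(shape[-1]) if bias_one else np.zeros(shape[-1])
    return w, b

w1, b1 = init_layer((11 * 11 * 3, 96), bias_one=False)   # conv1-style layer, bias 0
w2, b2 = init_layer((5 * 5 * 48, 256), bias_one=True)    # conv2-style layer, bias 1
```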
“We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination. We trained the network for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.” (Krizhevsky et al., 2017, p. 6)
The same learning rate is used for all layers and is adjusted manually during training. The heuristic followed was to divide the learning rate by 10 whenever the validation error rate stopped improving at the current learning rate. The learning rate was initialized at 0.01 and reduced three times before termination. The network was trained for roughly 90 passes through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.
Interpretation
(1) All layers share the same learning rate, which is adjusted manually during training.
(2) Adjustment rule: when the validation error stops improving, divide the learning rate by 10; starting from 0.01, it was reduced three times before training ended.
(3) Roughly 90 epochs over 1.2 million images took five to six days on two NVIDIA GTX 580 3GB GPUs. A sketch of the decay heuristic follows.
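A sketch of the decay heuristic. The paper adjusted the rate manually; the "patience" counter below (number of epochs without improvement before dividing by 10) is an added assumption that makes the rule programmatic:

```python
def adjust_lr(lr, val_error, best_error, patience_left, patience=3, factor=10.0):
    # Divide the learning rate by 10 when the validation error stops improving.
    if val_error < best_error:
        return lr, val_error, patience          # improvement: reset the counter
    patience_left -= 1
    if patience_left == 0:
        return lr / factor, best_error, patience  # no improvement for `patience` epochs: decay
    return lr, best_error, patience_left

lr, best, wait = 0.01, float("inf"), 3
for val_error in [0.60, 0.55, 0.54, 0.54, 0.54, 0.54]:   # toy validation errors
    lr, best, wait = adjust_lr(lr, val_error, best, wait)
print(lr)   # 0.001 after three epochs without improvement
```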