1 Introduction
Original text
“Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small — on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations.” (Krizhevsky et al., 2017, p. 1)
Interpretation
(1) Methods used for object recognition:
CNN (convolutional neural network)
SVM: a supervised learning algorithm that can be used for binary or multi-class classification.
Decision trees and random forests: a decision tree is a tree-structured model that can be used for classification; a random forest is an ensemble of decision trees whose votes are combined to improve accuracy.
K-nearest neighbors (KNN): an instance-based learning method that classifies an object according to its nearest neighbors; in object recognition, KNN can be used to find training samples similar to the object to be recognized.
(2) Preventing overfitting: collect larger datasets and use stronger models.
Overfitting: the model performs well on the training set but poorly on the test set.
Underfitting: the model performs poorly even on the training set.
(3) Data augmentation: for small datasets, the data can be augmented with label-preserving transformations, i.e., applying translations, rotations, scaling, and similar operations to enlarge the dataset while keeping the original labels unchanged. This introduces diversity and helps the model generalize to different transformations and viewpoints, improving its performance (see the sketch below).
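Below is a minimal sketch of label-preserving augmentation, assuming PyTorch/torchvision are available; the specific transforms, their parameters, and the CIFAR-10 example are illustrative choices, not the paper's actual pipeline.

```python
# Minimal sketch: label-preserving augmentation with torchvision.
# The transforms and parameters below are illustrative, not the exact
# augmentation used in the AlexNet paper.
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),      # random translation via padded crop
    T.RandomHorizontalFlip(p=0.5),    # horizontal reflection
    T.RandomRotation(degrees=10),     # small rotation
    T.ToTensor(),
])

# Only the pixels are transformed; each augmented sample keeps its
# original class label.
train_set = CIFAR10(root="./data", train=True, download=True,
                    transform=train_transform)
```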
Original text
“For example, the current-best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.” (Krizhevsky et al., 2017, p. 1)
Interpretation
(1) Some digit-recognition tasks are already solved very well; much larger training sets are needed to handle the variability of objects in realistic settings.
(2) Large labeled datasets: LabelMe (also a data-annotation tool) and ImageNet.
Original text
“To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don’t have. Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.” (Krizhevsky et al., 2017, p. 1)
Interpretation
(1) Prior knowledge: information about the problem domain that is available before learning, inference, or decision-making begins.
For example, for image problems the most important priors are locality of pixel dependencies and translation invariance.
Acquiring prior knowledge usually means combining multiple sources of information and consulting domain experts and other stakeholders.
(2) The number and complexity of the features and patterns a CNN can learn and represent can be controlled by changing its depth (the number of convolutional layers) and its breadth (the number of neurons/channels per layer).
(3) A standard feedforward neural network usually means a fully connected multi-layer perceptron (MLP). Compared with an MLP, a CNN uses local connectivity and weight sharing, so on data with local structure such as images it uses parameters far more efficiently: it reduces the number of parameters to learn while improving the extraction of local features (see the parameter-count sketch below).
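To make point (3) concrete, here is a small sketch comparing parameter counts, assuming PyTorch; the channel counts, kernel size, and image resolution are arbitrary illustrative values, not AlexNet's.

```python
# Sketch: parameter count of a 3x3 conv layer (weight sharing) vs. a fully
# connected layer over the same-sized input. Sizes are illustrative only.
import torch.nn as nn

in_ch, out_ch, h, w, k = 3, 64, 224, 224, 3

# Convolution: the same k x k filters slide over every spatial position,
# so the parameter count does not depend on the image resolution.
conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())  # 3*3*3*64 + 64 = 1,792

# A fully connected layer producing the same output shape would need one
# weight per (input pixel, output unit) pair; we only compute the count,
# since actually allocating it would require roughly 2 TB of memory.
fc_params = (in_ch * h * w) * (out_ch * h * w) + out_ch * h * w

print(f"conv: {conv_params:,} parameters")
print(f"fully connected equivalent: {fc_params:,} parameters")
```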
Original text
“Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.” (Krizhevsky et al., 2017, p. 2)
Interpretation
(1) Applying CNNs at large scale to high-resolution images is still very expensive.
(2) GPUs combined with a highly optimized 2D-convolution implementation make it feasible to train large CNNs, and a dataset the size of ImageNet provides enough labeled examples to train them without severe overfitting.
Examples of optimized 2D-convolution variants (a code sketch follows the list):
- Grouped convolution: split the input channels and the filters into several groups and convolve each group independently. This reduces the parameter count and the amount of computation, and is common in lightweight models.
- Depthwise separable convolution: split a standard convolution into a depthwise convolution followed by a pointwise (1×1) convolution, which greatly reduces the number of parameters and speeds up computation.
- Dilated (atrous) convolution (used in the DeepLabv3 model): insert gaps between kernel elements to enlarge the receptive field without adding parameters, which helps capture features over a wider context.
- Winograd convolution: use the Winograd transform to reduce the number of multiplications in the convolution; it is typically applied to small kernels to speed up the operation.
- Fast Fourier transform (FFT) convolution: use the FFT to accelerate convolution, especially for large kernels and inputs.
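A short PyTorch sketch of three of the variants above (grouped, depthwise separable, and dilated convolution); the channel counts and sizes are arbitrary, and Winograd/FFT kernels are chosen internally by libraries such as cuDNN rather than exposed as layer types, so they are not shown.

```python
# Sketch: grouped, depthwise-separable, and dilated 2D convolutions in
# PyTorch. Channel counts and kernel sizes are arbitrary illustrative values.
import torch
import torch.nn as nn

x = torch.randn(1, 32, 56, 56)  # (batch, channels, height, width)

# Grouped convolution: channels are split into 4 independent groups.
grouped = nn.Conv2d(32, 64, kernel_size=3, padding=1, groups=4)

# Depthwise separable convolution: per-channel 3x3 conv (groups = channels)
# followed by a 1x1 pointwise conv that mixes channels.
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)
pointwise = nn.Conv2d(32, 64, kernel_size=1)

# Dilated (atrous) convolution: dilation=2 gives a 5x5 receptive field
# with only 3x3 worth of parameters.
dilated = nn.Conv2d(32, 64, kernel_size=3, padding=2, dilation=2)

print(grouped(x).shape, pointwise(depthwise(x)).shape, dilated(x).shape)
# all -> torch.Size([1, 64, 56, 56])
```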
Original text
“The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly. Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3. The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4. Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance.” (Krizhevsky et al., 2017, p. 2)
Interpretation
(1) Training a very large CNN on the ImageNet subsets produced the best results reported so far on these datasets.
(2) A highly optimized GPU implementation of 2D convolution (and the other operations needed for training) was written and released.
(3) Section 3 describes the new and unusual features of the network that improve its performance and reduce its training time.
(4) Section 4 describes several effective techniques used to prevent overfitting.
(5) The network has 5 convolutional layers and 3 fully connected layers; removing any convolutional layer (each holding no more than 1% of the model's parameters) degrades performance (a structural sketch follows).
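As a rough illustration of the 5-conv + 3-FC structure, here is an AlexNet-style sketch in PyTorch; the channel counts and pooling follow the common single-GPU simplification (e.g., torchvision's `alexnet`), not the original two-GPU layout.

```python
# Rough sketch of an AlexNet-style network: 5 convolutional layers followed
# by 3 fully connected layers. A single-GPU simplification, not a faithful
# reproduction of the original two-GPU architecture.
import torch
import torch.nn as nn

class AlexNetLike(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)       # 5 conv layers -> (N, 256, 6, 6) for 224x224 input
        x = torch.flatten(x, 1)
        return self.classifier(x)  # 3 fully connected layers

print(AlexNetLike()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```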
Original text
“In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.” (Krizhevsky et al., 2017, p. 2)
Interpretation
(1) The network's size is limited by GPU memory and by the acceptable training time; faster GPUs and larger datasets would give better results.