The AlexNet Model

Contents

1. Abstract: background and the AlexNet model, winner of ILSVRC-2012

2. Introduction: the paper's main contributions; the results owe mainly to large amounts of data and high-performance GPUs

3. The Dataset: overview of the ILSVRC-2012 dataset; image preprocessing details

4. The Architecture: the AlexNet network structure and its internals (ReLU, GPUs, LRN, overlapping pooling)

5. Reducing Overfitting: preventing overfitting with data augmentation and dropout

6. Details of Learning: hyperparameter settings and weight initialization

7. Experimental results and analysis

8. Paper summary

(1) Key points

(2) Innovations

(3) Takeaways

9. Code reproduction

(1) Model structure

(2) Architectural characteristics

(3) Code implementation


1. Abstract: background and the AlexNet model, winner of ILSVRC-2012


ILSVRC, the ImageNet Large Scale Visual Recognition Challenge, uses a subset of the full ImageNet dataset (21,841 categories, 14,197,122 images): 1,000 classes with roughly 1.2 million training images. AlexNet obtained the best results in ILSVRC-2012, with “top-1 and top-5 error rates of 37.5% and 17.0%” (Krizhevsky et al., 2017, p. 84)


“The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax.” (Krizhevsky et al., 2017, p. 84)


“To make training faster, we used nonsaturating neurons and a very efficient GPU implementation of the convolution operation.” (Krizhevsky et al., 2017, p. 84)


“To reduce overfitting in the fully connected layers we employed a recently developed regularization method called “dropout” that proved to be very effective.” (Krizhevsky et al., 2017, p. 84)

2. Introduction: the paper's main contributions; the results owe mainly to large amounts of data and high-performance GPUs


“But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Ref.25), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe,28 which consists of hundreds of thousands of fully segmented images, and ImageNet,7 which consists of over 15 million labeled high-resolution images in over 22,000 categories.” (Krizhevsky et al., 2017, p. 85)


“To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we do not have.” (Krizhevsky et al., 2017, p. 85)


“In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate.” (Krizhevsky et al., 2017, p. 85)

3. The Dataset: overview of the ILSVRC-2012 dataset; image preprocessing details


“ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.” (Krizhevsky et al., 2017, p. 85)


“ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256 × 256 patch from the resulting image. We did not preprocess the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels.” (Krizhevsky et al., 2017, p. 85)

4. The Architecture: the AlexNet network structure and its internals (ReLU, GPUs, LRN, overlapping pooling)

(1) tanh and sigmoid are saturating activation functions; ReLU and its variants are non-saturating. Non-saturating activations have two main advantages (see the sketch below):

• They alleviate the vanishing-gradient problem.

• They speed up convergence.
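
A minimal sketch (not from the paper) that makes the difference concrete: at a large positive input, the gradient of the saturating tanh has essentially vanished, while ReLU still passes a full gradient.

import tensorflow as tf

# Compare the gradients of tanh (saturating) and ReLU (non-saturating).
x = tf.constant([5.0])

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y_tanh = tf.tanh(x)      # tanh(5) is nearly 1, so its gradient has vanished
    y_relu = tf.nn.relu(x)   # ReLU does not saturate for positive inputs

print(tape.gradient(y_tanh, x).numpy())  # ~1.8e-4: vanishing gradient
print(tape.gradient(y_relu, x).numpy())  # [1.]: full gradient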


“This is demonstrated in Figure 1, which shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network.” (Krizhevsky et al., 2017, p. 85)


“A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line).” (Krizhevsky et al., 2017, p. 86)


(2) Why two GPUs were used

“A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU.” (Krizhevsky et al., 2017, p. 86)

“Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through host machine memory.” (Krizhevsky et al., 2017, p. 86)


How the two GPUs are used: “The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU.” (Krizhevsky et al., 2017, p. 86)
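
A minimal sketch of this idea in TensorFlow terms, assuming two visible devices '/GPU:0' and '/GPU:1'. It only illustrates splitting one layer's kernels across devices; it does not reproduce the paper's restricted cross-GPU connectivity between specific layers.

import tensorflow as tf

x = tf.random.uniform((1, 55, 55, 96))

# Put half of the 256 kernels on each device.
with tf.device('/GPU:0'):
    half_a = tf.keras.layers.Conv2D(128, 5, padding='same', activation='relu')(x)
with tf.device('/GPU:1'):
    half_b = tf.keras.layers.Conv2D(128, 5, padding='same', activation='relu')(x)

# Concatenating the two halves gives the full set of 256 feature maps.
features = tf.concat([half_a, half_b], axis=-1)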


(3) Local response normalization (LRN):

“local normalization scheme aids generalization” (Krizhevsky et al., 2017, p. 86)

Generalization refers to how well a model performs on data it has never seen, i.e., its ability to adapt to new data; it is a standard evaluation criterion for machine learning algorithms.


“We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization.” (Krizhevsky et al., 2017, p. 86)
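
A minimal sketch of LRN via tf.nn.local_response_normalization, using the constants reported in the paper (k = 2, n = 5, alpha = 1e-4, beta = 0.75). Mapping depth_radius = 2 to a window of n = 5 adjacent channels is my reading of the two parameterizations, not something stated in the paper.

import tensorflow as tf

x = tf.random.uniform((1, 55, 55, 96))
# Normalize each activation by the sum of squares over 5 adjacent channels.
y = tf.nn.local_response_normalization(
    x, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75)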

(4) Overlapping pooling: pooling with overlapping windows, which also makes the model slightly harder to overfit


“To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at the location of the pooling unit. If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling. This is what we use throughout our network, with s = 2 and z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme s = 2, z = 2, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.” (Krizhevsky et al., 2017, p. 87)
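
A minimal sketch contrasting the overlapping scheme (z = 3, s = 2) with the traditional non-overlapping scheme (z = 2, s = 2). On a 55 × 55 input both produce 27 × 27 outputs, matching the remark about equivalent dimensions.

import tensorflow as tf

x = tf.random.uniform((1, 55, 55, 96))

overlapping = tf.keras.layers.MaxPool2D(pool_size=3, strides=2)(x)   # z = 3, s = 2
traditional = tf.keras.layers.MaxPool2D(pool_size=2, strides=2)(x)   # z = 2, s = 2

print(overlapping.shape)  # (1, 27, 27, 96)
print(traditional.shape)  # (1, 27, 27, 96)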

(5) The AlexNet network architecture:


“the net contains eight layers with weights; the first five are convolutional and the remaining three are fully connected. The output of the last fully connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels.” (Krizhevsky et al., 2017, p. 87)

5. Reducing Overfitting: preventing overfitting with data augmentation and dropout

(1) Two forms of data augmentation:

“We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk” (Krizhevsky et al., 2017, p. 87)


“The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence 10 patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.” (Krizhevsky et al., 2017, p. 87)

Training phase (a TensorFlow sketch of both phases follows this list):

• Resize every image to 256 × 256.

• Crop a 224 × 224 patch at a random position.

• Apply a random horizontal flip.

Test phase:

• Resize every image to 256 × 256.

• Crop five 224 × 224 regions (four corners and the center).

• Horizontally flip each of them, giving 10 patches of 224 × 224 in total.
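
A minimal sketch of the two pipelines above, assuming images arrive as 256 × 256 × 3 float tensors; the corner offsets are mine, not the authors'.

import tensorflow as tf

def train_augment(image_256):
    patch = tf.image.random_crop(image_256, size=(224, 224, 3))  # random position
    return tf.image.random_flip_left_right(patch)                # random mirror

def ten_crop(image_256):
    # Four corner crops plus the center crop, then their horizontal reflections.
    offsets = [(0, 0), (0, 32), (32, 0), (32, 32), (16, 16)]
    crops = [tf.image.crop_to_bounding_box(image_256, y, x, 224, 224)
             for y, x in offsets]
    crops += [tf.image.flip_left_right(c) for c in crops]
    return tf.stack(crops)  # (10, 224, 224, 3); average the softmax over these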


“The second form of data augmentation consists of altering the intensities of the RGB channels in training images.” (Krizhevsky et al., 2017, p. 88)

PCA is applied to the RGB channel values, and the principal components are then perturbed with small random magnitudes; these perturbations change the image colors slightly, augmenting the data and adding diversity and richness to the images. Its effect, however, is limited (a sketch follows).
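
A minimal sketch of this PCA-based color perturbation. The paper computes the RGB covariance over the whole ImageNet training set; for simplicity this sketch computes it per image, and it assumes a float RGB image scaled to [0, 1].

import numpy as np

def fancy_pca(image, std=0.1):
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)            # 3 x 3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues and eigenvectors
    alphas = np.random.normal(0.0, std, size=3)   # random magnitude per component
    shift = eigvecs @ (alphas * eigvals)          # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return np.clip(image + shift, 0.0, 1.0)       # add the shift to every pixel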

(2) Dropout

“The recently introduced technique, called “dropout”,12 consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights.” (Krizhevsky et al., 2017, p. 88)


“This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks.” (Krizhevsky et al., 2017, p. 88)


“We use dropout in the first two fully connected layers of Figure 2. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.” (Krizhevsky et al., 2017, p. 88)
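
A minimal NumPy sketch of the behavior described above: zero each activation with probability 0.5 during training, and at test time use all neurons but scale their outputs by 0.5 (modern frameworks usually apply the equivalent "inverted dropout" scaling at training time instead).

import numpy as np

def dropout_train(activations, p=0.5):
    # Zero each activation independently with probability p.
    mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * mask

def dropout_test(activations, p=0.5):
    # Scale by the keep probability so the expectation matches training.
    return activations * (1.0 - p)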

6. Details of Learning: hyperparameter settings and weight initialization

(1) Hyperparameter settings

“We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error. The update rule for weight w was

$$v_{i+1} := 0.9\, v_i - 0.0005\, \varepsilon\, w_i - \varepsilon \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} := w_i + v_{i+1}$$

where i is the iteration index, v is the momentum variable, ε is the learning rate, and $\left\langle \frac{\partial L}{\partial w}\big|_{w_i} \right\rangle_{D_i}$ is the average over the ith batch $D_i$ of the derivative of the objective with respect to w, evaluated at $w_i$.” (Krizhevsky et al., 2017, p. 88)
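
A minimal sketch of this update rule for one weight tensor (momentum 0.9, weight decay 0.0005, learning rate eps); it is illustrative, not the authors' code.

import numpy as np

def sgd_step(w, v, grad, eps, momentum=0.9, weight_decay=0.0005):
    # v_{i+1} = 0.9 * v_i - 0.0005 * eps * w_i - eps * <dL/dw>_{D_i}
    v = momentum * v - weight_decay * eps * w - eps * grad
    # w_{i+1} = w_i + v_{i+1}
    return w + v, v

# Example: one update of a randomly initialized weight matrix.
w = np.random.normal(0.0, 0.01, size=(3, 3))
v = np.zeros_like(w)
grad = np.random.normal(size=(3, 3))   # stand-in for the batch-averaged gradient
w, v = sgd_step(w, v, grad, eps=0.01)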


(2) Weight initialization

“We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully connected hidden layers, with the constant 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron biases in the remaining layers with the constant 0.” (Krizhevsky et al., 2017, p. 88)
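
In Keras terms, this initialization might look like the following sketch (the specific layer is illustrative, not the authors' code): weights from a zero-mean Gaussian with standard deviation 0.01, and a bias of 1 for the layers named in the quote (0 elsewhere).

import tensorflow as tf

conv = tf.keras.layers.Conv2D(
    filters=256, kernel_size=5, padding='same', activation='relu',
    kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01),
    bias_initializer=tf.keras.initializers.Constant(1.0))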

7. Experimental results and analysis

(1) Similar images produce feature vectors at the second fully connected (last hidden) layer that are close in Euclidean distance

(2) AlexNet can therefore be used to extract high-level features for image retrieval, image clustering, and image encoding


“Computing similarity by using Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient, but it could be made efficient by training an autoencoder to compress these vectors to short binary codes. This should produce a much better image retrieval method than applying autoencoders to the raw pixels,16 which does not make use of image labels and hence has a tendency to retrieve images with similar patterns of edges, whether or not they are semantically similar.” (Krizhevsky et al., 2017, p. 90)
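
A minimal sketch of comparing two images through their 4096-dimensional feature activations; `alexnet` and the layer name 'fc7' are hypothetical placeholders for a trained model and its second fully connected layer.

import tensorflow as tf

def feature_distance(alexnet, img_a, img_b):
    # Truncate the model at the 4096-d hidden layer.
    fc7 = tf.keras.Model(inputs=alexnet.input,
                         outputs=alexnet.get_layer('fc7').output)
    feats = fc7(tf.stack([img_a, img_b]))        # shape (2, 4096)
    return tf.norm(feats[0] - feats[1]).numpy()  # small distance => similar images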

8. Paper summary

(1) Key points

• Large amounts of labeled data: ImageNet (data)

• High-performance computing resources: GPUs (compute)

• A suitable model: a deep convolutional neural network (algorithm)

(2) Innovations

• ReLU to speed up training of large networks

• LRN to improve the generalization of large networks

• Random cropping, flipping, and color perturbation to increase data diversity

• Dropout to reduce overfitting

(3) Takeaways

• Depth and breadth determine a network's capacity.


“Their capacity can be controlled by varying their depth and breadth” (Krizhevsky et al., 2017, p. 85)

• Larger datasets and faster GPUs can further improve model performance.


“All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.” (Krizhevsky et al., 2017, p. 85)

• When rescaling an image, scale the shorter side first, then crop.


“Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256 × 256 patch from the resulting image” (Krizhevsky et al., 2017, p. 85)

• ReLU does not require input normalization to prevent saturation, which implies that sigmoid and tanh do.


“ReLUs have the desirable property that they do not require input normalization to prevent them from saturating.” (Krizhevsky et al., 2017, p. 86)

• The convolutional kernels learn frequency-, orientation-, and color-selective features.


“The network has learned a variety of frequency- and orientation-selective kernels, as well as various colored blobs.” (Krizhevsky et al., 2017, p. 89)

• Similar images have “similar” high-level features.


“If two images produce feature activation vectors with a small Euclidean separation, we can say that the higher levels of the neural network consider them to be similar.” (Krizhevsky et al., 2017, p. 89)

• Image retrieval can be based on high-level features, which should work better than retrieval based on raw pixels.


“This should produce a much better image retrieval method than applying autoencoders to the raw pixels,16 which does not make use of image labels and hence has a tendency to retrieve images with similar patterns of edges, whether or not they are semantically similar.” (Krizhevsky et al., 2017, p. 90)

• The layers of the network depend on one another; no single layer can be casually removed.


“It is notable that our network’s performance degrades if a single convolutional layer is removed.” (Krizhevsky et al., 2017, p. 90)

• Using video data might lead to new breakthroughs.


“Ultimately we would like to use very large and deep convolutional nets on video sequences where the temporal structure provides very helpful information that is missing or far less obvious in static images.” (Krizhevsky et al., 2017, p. 90)

9. Code reproduction

(1) Model structure

(2) Architectural characteristics

  • AlexNet consists of 8 weighted layers: 5 convolutional and 3 fully connected
  • The kernels in the first convolutional layer are 11 × 11, the second layer shrinks to 5 × 5, and all later layers use 3 × 3 kernels
  • All pooling layers are 3 × 3 max pooling with stride 2
  • ReLU replaces sigmoid as the activation function, making gradients simpler to compute and the model easier to train
  • Dropout controls model complexity and prevents overfitting
  • Heavy image augmentation (flipping, cropping, color changes) enlarges the dataset and reduces overfitting

(3) Code implementation

# Imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Build the model
net = keras.models.Sequential([
    # Convolution: 96 kernels, size 11x11, stride 4, ReLU activation
    layers.Conv2D(filters=96, kernel_size=11, strides=4, activation='relu'),
    # Max pooling: size 3x3, stride 2
    layers.MaxPool2D(pool_size=3, strides=2),
    # Convolution: 256 kernels, size 5x5, ReLU activation, 'same' padding
    layers.Conv2D(filters=256, kernel_size=5, padding='same', activation='relu'),
    # Max pooling: size 3x3, stride 2
    layers.MaxPool2D(pool_size=3, strides=2),
    # Convolution: 384 kernels, size 3x3, ReLU activation, 'same' padding
    layers.Conv2D(filters=384, kernel_size=3, padding='same', activation='relu'),
    # Convolution: 384 kernels, size 3x3, ReLU activation, 'same' padding
    layers.Conv2D(filters=384, kernel_size=3, padding='same', activation='relu'),
    # Convolution: 256 kernels, size 3x3, ReLU activation, 'same' padding
    layers.Conv2D(filters=256, kernel_size=3, padding='same', activation='relu'),
    # Max pooling: size 3x3, stride 2
    layers.MaxPool2D(pool_size=3, strides=2),
    # Flatten the feature maps
    layers.Flatten(),
    # Fully connected: 4096 units, ReLU
    layers.Dense(4096, activation='relu'),
    # Dropout
    layers.Dropout(0.5),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    # Output layer: softmax for multi-class classification (sigmoid for binary)
    layers.Dense(10, activation='softmax')],
    name='AlexNet')

# Dummy input (a single-channel 227x227 image) to build the network
x = tf.random.uniform((1, 227, 227, 1))
y = net(x)

net.summary()