Paper Notes: AlexNet

The Architecture

It contains eight learned layers — five convolutional and three fully-connected.

ReLU Nonlinearity

$f(x) = \max(0, x)$
Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units.
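For reference, a one-line check of the nonlinearity using PyTorch's built-in ReLU:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(torch.relu(x))   # elementwise max(0, x): [0.0, 0.0, 0.0, 1.5]
```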

Local Response Normalization

ReLUs do not require input normalization to prevent them from saturating. However, we still find that the following local normalization scheme aids generalization.

Denoting by $a^i_{x,y}$ the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU nonlinearity, the response-normalized activity $b^i_{x,y}$ is given by the expression

$$b^i_{x,y} = a^i_{x,y} \bigg/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^j_{x,y} \right)^2 \right)^{\beta}$$

where the sum runs over $n$ "adjacent" kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer.

The constants $k$, $n$, $\alpha$, and $\beta$ are hyper-parameters whose values are determined using a validation set; we used $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$.
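Written out directly, the formula looks as follows (an unoptimized sketch; `local_response_norm` is an illustrative name, and in practice one would use a built-in such as `torch.nn.LocalResponseNorm`, whose parameterization of α may differ slightly from the paper's):

```python
import torch

def local_response_norm(a: torch.Tensor, k: float = 2.0, n: int = 5,
                        alpha: float = 1e-4, beta: float = 0.75) -> torch.Tensor:
    """a has shape (batch, N, H, W); N is the number of kernel maps in the layer."""
    N = a.size(1)
    sq = a.pow(2)
    b = torch.empty_like(a)
    for i in range(N):
        # sum over the n "adjacent" kernel maps centred on i, clipped to [0, N-1]
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * sq[:, lo:hi + 1].sum(dim=1)).pow(beta)
        b[:, i] = a[:, i] / denom
    return b
```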

Overlapping Pooling

A pooling layer can be thought of as a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at its location. If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling; the network uses s = 2 and z = 3 throughout, as in the snippet below.
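In framework terms (a minimal PyTorch sketch), overlapping pooling is simply a pooling window larger than its stride:

```python
import torch.nn as nn

# Overlapping pooling as in the paper: z = 3, s = 2 (stride smaller than window).
overlapping_pool = nn.MaxPool2d(kernel_size=3, stride=2)

# Traditional, non-overlapping pooling sets s = z, e.g. a 2x2 window with stride 2.
traditional_pool = nn.MaxPool2d(kernel_size=2, stride=2)
```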

Overall Architecture

The net contains eight layers with weights: the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels.

The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.
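In code, maximizing this objective is the same as minimizing the average cross-entropy of the softmax output, e.g. with random placeholder data (a minimal sketch, not the authors' code):

```python
import torch
import torch.nn as nn

logits = torch.randn(128, 1000)            # unnormalized scores for a batch of 128 images
labels = torch.randint(0, 1000, (128,))    # ground-truth class indices
# CrossEntropyLoss applies the 1000-way softmax and averages the negative
# log-probability of the correct label, so minimizing it maximizes the paper's objective.
loss = nn.CrossEntropyLoss()(logits, labels)
```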

[Figure: the overall architecture of the network, split across two GPUs]

  • Input image 224×224×3
  • First convolutional layer 96 kernels of size 11×11×3 stride of 4 pixels
  • Second convolutional layer 256 kernels of size 5 × 5 × 48
  • The third convolutional layer 384 kernels of size 3 × 3 × 256
  • The fourth convolutional layer 384 kernels of size 3 × 3 × 192
  • The fifth convolutional layer 256 kernels of size 3 × 3 × 192
  • The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU
  • Response-normalization layers follow the first and second convolutional layers.
  • Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer.
  • The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.
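The layer list above maps fairly directly onto a modern framework. Below is a minimal single-process PyTorch sketch, not the authors' implementation: `AlexNetSketch` is an illustrative name, the two-GPU split of the second, fourth, and fifth convolutional layers is approximated with `groups=2`, `padding=2` is added to the first layer so a 224×224 input produces the feature-map sizes the paper implies, and the paper's LRN constants are passed to `nn.LocalResponseNorm`, whose parameterization of α may not match the paper exactly.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Eight weight layers: five convolutional, three fully-connected."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),        # conv1: 96 kernels of 11x11x3
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),   # LRN after conv1
            nn.MaxPool2d(kernel_size=3, stride=2),                        # overlapping pooling (z=3, s=2)
            nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2),       # conv2: 256 kernels of 5x5x48
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),   # LRN after conv2
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),                # conv3: 384 kernels of 3x3x256
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2),      # conv4: 384 kernels of 3x3x192
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2),      # conv5: 256 kernels of 3x3x192
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                        # pooling after conv5
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                                 # fc6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                        # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                                 # fc8 -> 1000-way softmax
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```

As a quick shape check, `AlexNetSketch()(torch.randn(1, 3, 224, 224))` returns a tensor of shape `(1, 1000)`, one score per class.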

Reducing Overfitting

1. Data Augmentation

The first form of data augmentation consists of generating image translations and horizontal reflections.

  • Extracting random 224 × 224 patches (and their horizontal reflections) from the 256×256 images and training the network on these extracted patches. This increases the size of the training set by a factor of 2048 (32 × 32 possible crop positions × 2 for the reflection).
  • Altering the intensities of the RGB channels in training images.

Perform PCA on the set of RGB pixel values throughout the ImageNet training set

To each training image, add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.

To each RGB image pixel $I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T$, add the following quantity:

$$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$$

where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, and $\alpha_i$ is the aforementioned random variable. Each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again, at which point it is re-drawn.
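A NumPy sketch of this color augmentation (`fancy_pca` is an illustrative name, not the authors' code). One simplification: the paper performs PCA once over the RGB pixels of the whole ImageNet training set, whereas this sketch computes it from the single input image to stay self-contained.

```python
import numpy as np

def fancy_pca(image: np.ndarray, std: float = 0.1, rng=None) -> np.ndarray:
    """image: H x W x 3 array of RGB values (float); returns a color-jittered copy."""
    rng = np.random.default_rng() if rng is None else rng
    pixels = image.reshape(-1, 3).astype(np.float64)

    # 3x3 covariance matrix of the RGB pixel values, and its eigendecomposition
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # columns of eigvecs are the p_i

    # one alpha_i per principal component, drawn once per presentation of the image
    alphas = rng.normal(0.0, std, size=3)

    # [p_1, p_2, p_3][alpha_1*lambda_1, alpha_2*lambda_2, alpha_3*lambda_3]^T, added to every pixel
    shift = eigvecs @ (alphas * eigvals)
    return image + shift.reshape(1, 1, 3)
```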

2. Dropout

Dropout is applied in the first two fully-connected layers: during training the output of each hidden neuron is set to zero with probability 0.5, and at test time all neurons are used with their outputs multiplied by 0.5.

Details of learning

Use SGD with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The update rule for weight $w$ was

$$v_{i+1} = 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} = w_i + v_{i+1}$$

where $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and $\langle \frac{\partial L}{\partial w} |_{w_i} \rangle_{D_i}$ is the average over the $i$-th batch $D_i$ of the derivative of the objective with respect to $w$, evaluated at $w_i$.
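A literal transcription of this rule as an in-place step on a single parameter tensor (a sketch; `sgd_step` is an illustrative name). In practice `torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=5e-4)` implements a closely related update, though it folds the learning rate into the velocity differently.

```python
import torch

def sgd_step(w: torch.Tensor, v: torch.Tensor, grad: torch.Tensor,
             lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0.0005) -> None:
    """w, v, grad have the same shape; grad is the batch-averaged dL/dw evaluated at w_i."""
    # v_{i+1} = 0.9 * v_i - 0.0005 * lr * w_i - lr * grad
    v.mul_(momentum).add_(w, alpha=-weight_decay * lr).add_(grad, alpha=-lr)
    # w_{i+1} = w_i + v_{i+1}
    w.add_(v)
```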

Initialization

  • Initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01.
  • Initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1, which accelerates the early stages of learning by providing the ReLUs with positive inputs.
  • Initialized the neuron biases in the remaining layers with the constant 0.
  • Used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination.
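As a sketch of this scheme (assuming the `AlexNetSketch` module defined earlier; `init_alexnet_weights` is an illustrative name, and layers are indexed in definition order):

```python
import torch.nn as nn

def init_alexnet_weights(model: nn.Module) -> None:
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    linears = [m for m in model.modules() if isinstance(m, nn.Linear)]

    for i, m in enumerate(convs, start=1):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)           # zero-mean Gaussian, std 0.01
        # conv2, conv4, conv5 biases start at 1 so their ReLUs see positive inputs early on
        nn.init.constant_(m.bias, 1.0 if i in (2, 4, 5) else 0.0)

    for i, m in enumerate(linears, start=1):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        # hidden fully-connected layers get bias 1, the final 1000-way layer gets 0
        nn.init.constant_(m.bias, 1.0 if i < len(linears) else 0.0)
```

The learning-rate heuristic corresponds roughly to `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)` stepped with the validation error.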

Qualitative Evaluations

The kernels on GPU 1 are largely color-agnostic, while the kernels on GPU 2 are largely color-specific. This kind of specialization occurs during every run and is independent of any particular random weight initialization.
[Figure: the 96 convolutional kernels of size 11×11×3 learned by the first convolutional layer (top 48 on GPU 1, bottom 48 on GPU 2)]
