Paper Notes: AlexNet

The Architecture

It contains eight learned layers — five convolutional and three fully-connected.

ReLU Nonlinearity

$f(x) = \max(0, x)$
Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units.
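For reference, a one-line check of the nonlinearity using PyTorch's built-in ReLU:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(torch.relu(x))   # elementwise max(0, x): [0.0, 0.0, 0.0, 1.5]
```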

Local Response Normalization

ReLUs do not require input normalization to prevent them from saturating. However, we still find that the following local normalization scheme aids generalization.

Denoting by $a^i_{x,y}$ the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU nonlinearity, the response-normalized activity $b^i_{x,y}$ is given by the expression

$$b^i_{x,y} = a^i_{x,y} \bigg/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^j_{x,y} \right)^2 \right)^{\beta}$$

where the sum runs over $n$ "adjacent" kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer.

The constants $k$, $n$, $\alpha$, and $\beta$ are hyper-parameters whose values are determined using a validation set; we used $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$.
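Written out directly, the formula looks as follows (an unoptimized sketch; `local_response_norm` is an illustrative name, and in practice one would use a built-in such as `torch.nn.LocalResponseNorm`, whose parameterization of α may differ slightly from the paper's):

```python
import torch

def local_response_norm(a: torch.Tensor, k: float = 2.0, n: int = 5,
                        alpha: float = 1e-4, beta: float = 0.75) -> torch.Tensor:
    """a has shape (batch, N, H, W); N is the number of kernel maps in the layer."""
    N = a.size(1)
    sq = a.pow(2)
    b = torch.empty_like(a)
    for i in range(N):
        # sum over the n "adjacent" kernel maps centred on i, clipped to [0, N-1]
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * sq[:, lo:hi + 1].sum(dim=1)).pow(beta)
        b[:, i] = a[:, i] / denom
    return b
```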

Overlapping Pooling

A pooling layer can be thought of as a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at its location. If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling; the network uses s = 2 and z = 3 throughout, as in the snippet below.
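In framework terms (a minimal PyTorch sketch), overlapping pooling is simply a pooling window larger than its stride:

```python
import torch.nn as nn

# Overlapping pooling as in the paper: z = 3, s = 2 (stride smaller than window).
overlapping_pool = nn.MaxPool2d(kernel_size=3, stride=2)

# Traditional, non-overlapping pooling sets s = z, e.g. a 2x2 window with stride 2.
traditional_pool = nn.MaxPool2d(kernel_size=2, stride=2)
```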

Overall Architecture

The net contains eight layers with weights: the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels.

The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.
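In code, maximizing this objective is the same as minimizing the average cross-entropy of the softmax output, e.g. with random placeholder data (a minimal sketch, not the authors' code):

```python
import torch
import torch.nn as nn

logits = torch.randn(128, 1000)            # unnormalized scores for a batch of 128 images
labels = torch.randint(0, 1000, (128,))    # ground-truth class indices
# CrossEntropyLoss applies the 1000-way softmax and averages the negative
# log-probability of the correct label, so minimizing it maximizes the paper's objective.
loss = nn.CrossEntropyLoss()(logits, labels)
```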

[Figure: the overall architecture of the network, split across two GPUs]

  • Input image 224×224×3
  • First convolutional layer 96 kernels of size 11×11×3 stride of 4 pixels
  • Second convolutional layer 256 kernels of size 5 × 5 × 48
  • The third convolutional layer 384 kernels of size 3 × 3 × 256
  • The fourth convolutional layer 384 kernels of size 3 × 3 × 192
  • The fifth convolutional layer 256 kernels of size 3 × 3 × 192
  • The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU
  • Response-normalization layers follow the first and second convolutional layers.
  • Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer.
  • The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.
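The layer list above maps fairly directly onto a modern framework. Below is a minimal single-process PyTorch sketch, not the authors' implementation: `AlexNetSketch` is an illustrative name, the two-GPU split of the second, fourth, and fifth convolutional layers is approximated with `groups=2`, `padding=2` is added to the first layer so a 224×224 input produces the feature-map sizes the paper implies, and the paper's LRN constants are passed to `nn.LocalResponseNorm`, whose parameterization of α may not match the paper exactly.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Eight weight layers: five convolutional, three fully-connected."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),        # conv1: 96 kernels of 11x11x3
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),   # LRN after conv1
            nn.MaxPool2d(kernel_size=3, stride=2),                        # overlapping pooling (z=3, s=2)
            nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2),       # conv2: 256 kernels of 5x5x48
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),   # LRN after conv2
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),                # conv3: 384 kernels of 3x3x256
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2),      # conv4: 384 kernels of 3x3x192
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2),      # conv5: 256 kernels of 3x3x192
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                        # pooling after conv5
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                                 # fc6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                        # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                                 # fc8 -> 1000-way softmax
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```

As a quick shape check, `AlexNetSketch()(torch.randn(1, 3, 224, 224))` returns a tensor of shape `(1, 1000)`, one score per class.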

Reducing Overfitting

1. Data Augmentation

The first form of data augmentation consists of generating image translations and horizontal reflections.

  • Extracting random 224 × 224 patches (and their horizontal reflections) from the 256×256 images and training the network on these extracted patches. This increases the size of the training set by a factor of 2048 (32 × 32 possible crop positions × 2 for the reflection).
  • Altering the intensities of the RGB channels in training images.

Perform PCA on the set of RGB pixel values throughout the ImageNet training set

To each training image, add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.

To each RGB image pixel $I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T$, add the following quantity:

$$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$$

where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, and $\alpha_i$ is the aforementioned random variable. Each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again, at which point it is re-drawn.
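A NumPy sketch of this color augmentation (`fancy_pca` is an illustrative name, not the authors' code). One simplification: the paper performs PCA once over the RGB pixels of the whole ImageNet training set, whereas this sketch computes it from the single input image to stay self-contained.

```python
import numpy as np

def fancy_pca(image: np.ndarray, std: float = 0.1, rng=None) -> np.ndarray:
    """image: H x W x 3 array of RGB values (float); returns a color-jittered copy."""
    rng = np.random.default_rng() if rng is None else rng
    pixels = image.reshape(-1, 3).astype(np.float64)

    # 3x3 covariance matrix of the RGB pixel values, and its eigendecomposition
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # columns of eigvecs are the p_i

    # one alpha_i per principal component, drawn once per presentation of the image
    alphas = rng.normal(0.0, std, size=3)

    # [p_1, p_2, p_3][alpha_1*lambda_1, alpha_2*lambda_2, alpha_3*lambda_3]^T, added to every pixel
    shift = eigvecs @ (alphas * eigvals)
    return image + shift.reshape(1, 1, 3)
```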

2. Dropout

Dropout is applied in the first two fully-connected layers: during training the output of each hidden neuron is set to zero with probability 0.5, and at test time all neurons are used with their outputs multiplied by 0.5.

Details of learning

Use SGD with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The update rule for weight $w$ was

$$v_{i+1} = 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} = w_i + v_{i+1}$$

where $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and $\langle \frac{\partial L}{\partial w} |_{w_i} \rangle_{D_i}$ is the average over the $i$-th batch $D_i$ of the derivative of the objective with respect to $w$, evaluated at $w_i$.
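A literal transcription of this rule as an in-place step on a single parameter tensor (a sketch; `sgd_step` is an illustrative name). In practice `torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=5e-4)` implements a closely related update, though it folds the learning rate into the velocity differently.

```python
import torch

def sgd_step(w: torch.Tensor, v: torch.Tensor, grad: torch.Tensor,
             lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0.0005) -> None:
    """w, v, grad have the same shape; grad is the batch-averaged dL/dw evaluated at w_i."""
    # v_{i+1} = 0.9 * v_i - 0.0005 * lr * w_i - lr * grad
    v.mul_(momentum).add_(w, alpha=-weight_decay * lr).add_(grad, alpha=-lr)
    # w_{i+1} = w_i + v_{i+1}
    w.add_(v)
```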

Initialization

  • Initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01.
  • Initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1, which accelerates the early stages of learning by providing the ReLUs with positive inputs.
  • Initialized the neuron biases in the remaining layers with the constant 0.
  • Used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination.
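As a sketch of this scheme (assuming the `AlexNetSketch` module defined earlier; `init_alexnet_weights` is an illustrative name, and layers are indexed in definition order):

```python
import torch.nn as nn

def init_alexnet_weights(model: nn.Module) -> None:
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    linears = [m for m in model.modules() if isinstance(m, nn.Linear)]

    for i, m in enumerate(convs, start=1):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)           # zero-mean Gaussian, std 0.01
        # conv2, conv4, conv5 biases start at 1 so their ReLUs see positive inputs early on
        nn.init.constant_(m.bias, 1.0 if i in (2, 4, 5) else 0.0)

    for i, m in enumerate(linears, start=1):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        # hidden fully-connected layers get bias 1, the final 1000-way layer gets 0
        nn.init.constant_(m.bias, 1.0 if i < len(linears) else 0.0)
```

The learning-rate heuristic corresponds roughly to `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)` stepped with the validation error.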

Qualitative Evaluations

The kernels on GPU 1 are largely color-agnostic, while the kernels on GPU 2 are largely color-specific. This kind of specialization occurs during every run and is independent of any particular random weight initialization.
[Figure: the 96 convolutional kernels of size 11×11×3 learned by the first convolutional layer (top 48 on GPU 1, bottom 48 on GPU 2)]
