cs231n-notes-Lecture-4/5/6: Backpropagation / Activation Functions / Data Preprocessing / Weight Initialization / Batch Norm

Lecture-4 Backpropagation and Neural Networks

Computational Graphs

  • Node gradient = [local gradient] x [upstream gradient]
  • add gate: gradient distributor
  • max gate: gradient router (passes the gradient only to the input that achieved the max)
  • mul gate: gradient switcher (each input receives the upstream gradient scaled by the other input)
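To make the gate rules concrete, here is a small Python sketch (my own illustration, not from the lecture) that backpropagates through a tiny graph f = (x + y) * max(z, w) using [local gradient] x [upstream gradient]:

```python
# Forward pass through a tiny graph: f = (x + y) * max(z, w)
x, y, z, w = 3.0, -1.0, 2.0, 5.0
a = x + y          # add gate
b = max(z, w)      # max gate
f = a * b          # mul gate

# Backward pass: start from the upstream gradient df/df = 1
df = 1.0
# mul gate "switches": each input's gradient is the other input times the upstream gradient
da = b * df        # df/da = b = 5
db = a * df        # df/db = a = 2
# add gate distributes the upstream gradient unchanged to both inputs
dx = 1.0 * da      # 5.0
dy = 1.0 * da      # 5.0
# max gate routes the gradient only to the input that achieved the max
dz = db if z > w else 0.0   # 0.0 (z lost the max)
dw = db if w >= z else 0.0  # 2.0
```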

Lecture-5 Convolutional Neural Networks

Image N*N, filter F*F, stride S: the feature map size is (N - F)/S + 1.
Common settings: F = 2 or 3 with S = 2.
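As a quick sanity check of the formula, a small Python helper (the function name and the optional zero-padding argument P are my additions; with P = 0 it reduces to the formula above):

```python
def conv_output_size(N, F, S, P=0):
    """Feature-map size for an N x N input, F x F filter, stride S, zero-padding P."""
    assert (N - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (N - F + 2 * P) // S + 1

print(conv_output_size(32, 3, 1, P=1))  # 32 -> 32 ('same' conv on CIFAR-10)
print(conv_output_size(32, 2, 2))       # 32 -> 16 (2x2 filter, stride 2)
```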

ConvNetJS demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html

Lecture-6 Training Neural Networks

Activation function

Sigmoid
  • Pros:
    • squashes numbers into range [0,1].
    • nice interpretation as a saturating “firing rate” of a neuron
  • Cons:
    • Saturated neurons kill the gradients
    • Not zero-centered
    • exp() is somewhat computationally expensive
tanh
  • Squashes numbers into range [-1, 1]
  • Zero-centered
  • Saturated neurons still kill the gradients
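A tiny numpy sketch (mine, not from the slides) showing how saturation kills the local gradient of both squashing functions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# local gradients of the two squashing nonlinearities
dsigmoid = sigmoid(x) * (1 - sigmoid(x))  # peaks at 0.25, ~0 for large |x|
dtanh = 1 - np.tanh(x) ** 2               # peaks at 1.0,  ~0 for large |x|

print(np.round(dsigmoid, 4))  # [0.     0.105  0.25   0.105  0.    ]
print(np.round(dtanh, 4))     # [0.     0.0707 1.     0.0707 0.    ]
```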
Relu
  • Pros:
    • Does not saturate (in the positive region)
    • Computationally efficient
    • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
    • Actually more biologically plausible than sigmoid.
  • Cons:
    • Not zero-centered
    • Kills the gradient for x < 0; a dead ReLU will never update its weights
Leaky Relu

$f(x) = \max(0.01x, x)$

  • Pros:
    • Does not saturate
    • Computationally efficient
    • Converges much faster than sigmoid/tanh in practice (e.g. 6x)
    • Will not “die”
  • Parametric ReLU: $f(x) = \max(\alpha x, x)$, where $\alpha$ is learned by backprop
Exponential Linear Units(ELU)

$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{x} - 1) & \text{if } x \le 0 \end{cases}$

Maxout

$f(x) = \max(W_1^T x + b_1,\ W_2^T x + b_2)$
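A compact numpy sketch of the ReLU-family activations above (my own formulation; the default α values are common choices, not mandated by the lecture):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(x, W1, b1, W2, b2):
    # element-wise max of two affine transforms of x
    return np.maximum(W1.T @ x + b1, W2.T @ x + b2)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.     1.5  ]
print(elu(x))         # ~[-0.865 -0.393  0.     1.5  ]
```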

Data Preprocessing

Preprocess Data
  • Normalization
    • For images, e.g. consider CIFAR-10 example with [32,32,3] images.
      • Subtract the mean image (e.g. AlexNet)(mean image = [32,32,3] array)
      • Subtract per-channel mean (e.g. VGGNet)(mean along each channel = 3 numbers)
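A minimal numpy sketch of the two schemes, assuming a CIFAR-10-shaped training set X of shape [N, 32, 32, 3] (variable names are mine):

```python
import numpy as np

# fake CIFAR-10-shaped training data: N images of 32x32x3
X = np.random.randint(0, 256, size=(50000, 32, 32, 3)).astype(np.float32)

# AlexNet-style: subtract the mean image (a [32, 32, 3] array)
mean_image = X.mean(axis=0)            # shape (32, 32, 3)
X_alex = X - mean_image

# VGGNet-style: subtract the per-channel mean (3 numbers)
channel_mean = X.mean(axis=(0, 1, 2))  # shape (3,)
X_vgg = X - channel_mean

# the same training-set statistics are reused at validation/test time
```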

Weight Initialization

  • Initialize from a pre-trained network and fine-tune
  • Small random numbers
    • e.g. N(0, 1e-2). Works for shallow nets but breaks in deep ones: the activations of deeper layers shrink toward zero, so the backpropagated gradients are tiny and the weights barely update.
  • Large random numbers: neurons saturate easily (with tanh/sigmoid), so the gradients again vanish.
  • Xavier initialization: $W_{a \times b} = \frac{N(0,1)}{\sqrt{a}}$, where $a$ is the fan-in.
    • Performs well with tanh but breaks with ReLU; for ReLU, use $W_{a \times b} = \frac{N(0,1)}{\sqrt{a/2}}$ (He initialization), as in the sketch below.

ref: https://www.leiphone.com/news/201703/3qMp45aQtbxTdzmK.html
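Below is a small numpy experiment (reconstructed from memory in the spirit of the lecture's activation-statistics demo, not the original code) that pushes data through a 10-layer stack and reports the spread of the last-layer activations under each scheme:

```python
import numpy as np

def last_layer_std(init, nonlin, n_layers=10, dim=500):
    """Std of the final activations of a deep stack under a given init scheme."""
    x = np.random.randn(1000, dim)
    for _ in range(n_layers):
        if init == "small":
            W = np.random.randn(dim, dim) * 0.01              # small random numbers
        elif init == "xavier":
            W = np.random.randn(dim, dim) / np.sqrt(dim)      # Xavier
        elif init == "he":
            W = np.random.randn(dim, dim) / np.sqrt(dim / 2)  # Xavier with a/2 (He)
        x = nonlin(x @ W)
    return x.std()

tanh, relu = np.tanh, lambda z: np.maximum(0, z)
print(last_layer_std("small", tanh))   # ~0: activations collapse, gradients vanish
print(last_layer_std("xavier", tanh))  # stays healthy with tanh
print(last_layer_std("xavier", relu))  # shrinks layer by layer with ReLU
print(last_layer_std("he", relu))      # stays healthy with ReLU
```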

Batch Normalization

Input x: N*D (N: batch size; D: feature dimension)

  1. Compute the per-dimension mean and variance over the mini-batch.
  2. Normalize: $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$
  3. Scale and shift with learnable parameters: $y = \gamma \hat{x} + \beta$
  • Usually placed after fully connected or convolutional layers and before the nonlinearity.
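A minimal sketch of the training-time forward pass, assuming inputs x of shape [N, D] and learnable gamma, beta (the function name is mine):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # 1. per-dimension mean over the batch, shape (D,)
    var = x.var(axis=0)                    #    and per-dimension variance, shape (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # 2. normalize to zero mean, unit variance
    return gamma * x_hat + beta            # 3. scale and shift with learnable parameters

x = np.random.randn(64, 100) * 5 + 3       # N=64 samples, D=100 features
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())               # ~0 and ~1
# at test time, running averages of mu/var collected during training are used instead
```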
Tip:

Before training on the full dataset, check that the model can overfit a small subset of it; if it cannot reach near-100% training accuracy on a handful of examples, something is wrong with the model or the training loop. A toy version of this check is sketched below.
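A toy, self-contained version of the check (random data and a small two-layer numpy net of my own; in practice you would use a slice of your real training set and your real model):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(20, 3072)             # 20 fake CIFAR-10 images, flattened
y = np.random.randint(0, 10, size=20)     # 20 labels

D, H, C, lr = 3072, 100, 10, 1e-2
W1 = np.random.randn(D, H) / np.sqrt(D / 2); b1 = np.zeros(H)
W2 = np.random.randn(H, C) / np.sqrt(H / 2); b2 = np.zeros(C)

for step in range(500):
    h = np.maximum(0, X @ W1 + b1)                    # ReLU hidden layer
    scores = h @ W2 + b2
    scores -= scores.max(axis=1, keepdims=True)       # numerically stable softmax
    p = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    dscores = p.copy(); dscores[np.arange(20), y] -= 1; dscores /= 20
    dW2, db2 = h.T @ dscores, dscores.sum(axis=0)
    dh = dscores @ W2.T; dh[h <= 0] = 0
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

print((scores.argmax(axis=1) == y).mean())            # should reach 1.0 (memorized)
```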
