Lecture-4 Backpropagation and Neural Networks
Computational Graphs
- Node gradient = [local gradient] x [upstream gradient]
- add gate: gradient distributor
- max gate: gradient router (routes the entire upstream gradient to the input that was larger; the other input gets zero)
- mul gate: gradient switcher (each input's gradient is the upstream gradient times the *other* input)
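The three gate rules above can be sketched in plain Python (my own minimal example, not lecture code), using node gradient = local gradient × upstream gradient on the graph f = (x + y) * z:

```python
# Minimal sketch of backprop through the three gates discussed above.

def add_gate_backward(upstream):
    # add gate distributes the upstream gradient unchanged to both inputs
    return upstream, upstream

def max_gate_backward(x, y, upstream):
    # max gate routes the full upstream gradient to the larger input
    return (upstream, 0.0) if x > y else (0.0, upstream)

def mul_gate_backward(x, y, upstream):
    # mul gate "switches": each local gradient is the other input's value
    return upstream * y, upstream * x

# forward pass: f = (x + y) * z with x=2, y=3, z=4  ->  f = 20
x, y, z = 2.0, 3.0, 4.0
s = x + y
f = s * z

# backward pass, starting from df/df = 1
ds, dz = mul_gate_backward(s, z, 1.0)   # ds = z = 4, dz = s = 5
dx, dy = add_gate_backward(ds)          # add gate: dx = dy = ds = 4
print(dx, dy, dz)  # 4.0 4.0 5.0
```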
Lecture-5 Convolutional Neural Networks
image N×N, filter F×F, stride S, then the feature map size is: (N-F)/S + 1.
common settings (e.g. for pooling): F=2 or F=3, with S=2.
ConvNetJS demo:http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
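The output-size formula above is easy to check in code. A small sketch (the padding parameter `P` is my addition; with zero-padding the formula becomes (N − F + 2P)/S + 1):

```python
# Feature-map size for an N x N image, F x F filter, stride S, padding P.

def conv_output_size(N, F, S, P=0):
    size = (N - F + 2 * P) / S + 1
    assert size == int(size), "filter does not tile the image cleanly"
    return int(size)

print(conv_output_size(32, 5, 1))       # 28
print(conv_output_size(7, 3, 2))        # 3
print(conv_output_size(32, 3, 1, P=1))  # 32 (padding preserves spatial size)
```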
Lecture-6 Training Neural Networks
Activation function
Sigmoid
- Pros:
- squashes numbers into range [0,1].
- nice interpretation as a saturating “firing rate” of a neuron
- Cons:
- Saturated neurons kill the gradients
- not zero-centered
- exp() is somewhat computationally expensive
tanh
- squashes numbers into range [-1,1]
- zero-centered
- Saturated neurons still kill the gradients
ReLU
- Pros:
- Does not saturate (in the positive region)
- Computationally efficient
- Converges much faster than sigmoid/tanh (e.g. 6x)
- Actually more biologically plausible than sigmoid
- Cons:
- not zero-centered
- kills the gradient for x < 0; a dead ReLU will never update its weights
Leaky ReLU
$f(x) = \max(0.01x, x)$
- Pros:
- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh (e.g. 6x)
- will not “die”
- Parametric ReLU: $f(x) = \max(\alpha x, x)$, where the slope $\alpha$ is learned
Exponential Linear Units (ELU)
Maxout
$f(x) = \max(W_1^T x + b_1, W_2^T x + b_2)$
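The activations above can be sketched in a few lines of NumPy (my own sketch, not lecture code; `alpha` defaults are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                         # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0, x)                   # zero gradient for x < 0

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)           # small negative slope, never "dies"

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))        # [0. 0. 2.]
print(leaky_relu(x))  # [-0.02  0.    2.  ]
```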
Data Preprocessing
Preprocess Data
- Normalization
- For images, e.g. consider CIFAR-10 example with [32,32,3] images.
- Subtract the mean image (e.g. AlexNet; mean image = a [32,32,3] array)
- Subtract the per-channel mean (e.g. VGGNet; mean along each channel = 3 numbers)
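Both mean-subtraction schemes are one line each in NumPy. A sketch assuming a CIFAR-10-shaped batch `X` of shape [N, 32, 32, 3] (random data here, just to show the shapes):

```python
import numpy as np

X = np.random.rand(50, 32, 32, 3).astype(np.float32)

# AlexNet style: subtract the full mean image ([32, 32, 3] array)
mean_image = X.mean(axis=0)
X_centered = X - mean_image

# VGGNet style: subtract one mean per channel (3 numbers)
channel_mean = X.mean(axis=(0, 1, 2))
X_channel_centered = X - channel_mean

print(mean_image.shape, channel_mean.shape)  # (32, 32, 3) (3,)
```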
Weight Initialization
- pre-training or fine-tuning
- Small random numbers.
- e.g. $W \sim 0.01 \cdot N(0,1)$. This works for small networks but breaks in deep ones: the activations shrink toward zero layer by layer, so the gradients vanish and the deep layers stop learning.
- Large random numbers: the neurons saturate easily, so the gradients are again near zero.
- Xavier Initialization: $W_{a*b} = \frac{N(0,1)}{\sqrt{a}}$, where $a$ is the fan-in
- performs well with tanh but breaks with ReLU. Hence He initialization: $W_{a*b} = \frac{N(0,1)}{\sqrt{a/2}}$
ref: https://www.leiphone.com/news/201703/3qMp45aQtbxTdzmK.html
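A sketch of the three initializations above for one layer with fan-in `a` and fan-out `b` (the variable names are mine):

```python
import numpy as np

a, b = 512, 256  # fan-in, fan-out of a hypothetical fully-connected layer

# small random numbers: activations shrink toward zero in deep networks
W_small = 0.01 * np.random.randn(a, b)

# Xavier: scale by 1/sqrt(fan-in); keeps activation variance stable with tanh
W_xavier = np.random.randn(a, b) / np.sqrt(a)

# He: scale by 1/sqrt(fan-in / 2); corrects Xavier for ReLU's halved variance
W_he = np.random.randn(a, b) / np.sqrt(a / 2)

# the resulting weight std is roughly 1/sqrt(a) vs 1/sqrt(a/2)
print(W_xavier.std(), W_he.std())
```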
Batch Normalization
Input X: N*D (N: number of samples; D: the dimension of the features)
- compute the mean and variance of each dimension over the mini-batch
- normalize
- usually inserted after fc/conv layers and before the nonlinearity
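The steps above amount to a short forward pass. A minimal training-mode sketch (inference with running statistics, and the backward pass, are omitted; `gamma`/`beta` are the learned scale and shift):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    # X has shape [N, D]; normalize each of the D dimensions over the batch
    mu = X.mean(axis=0)                    # per-dimension mean, shape [D]
    var = X.var(axis=0)                    # per-dimension variance, shape [D]
    X_hat = (X - mu) / np.sqrt(var + eps)  # normalize: zero mean, unit variance
    return gamma * X_hat + beta            # learned scale and shift

X = np.random.randn(128, 10) * 5 + 3       # deliberately off-center input
out = batchnorm_forward(X, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1
```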
Tip:
Before training on the full dataset, check that the model can overfit a small subset of the data; if it cannot, there is likely a bug in the model or training loop.