Dive into Deep Learning

Basics

Standard notations

  • Variable: X (uppercase and no bold)
  • Matrix: X (upper-case and bold)
  • Vector: x (lower-case and bold)
  • Element/Scalar: x (lower-case and no bold)



Basic Steps for Deep Learning

  1. Define the model structure
  2. Initialize the model’s parameters
  3. Loop:
    • Calculate current loss (forward propagation)
    • Calculate current gradient (backward propagation)
    • Update parameters (gradient descent)



Backpropagation

Here are some notations we will need later. We use $w^l_{jk}$ to denote the weight for the connection from the $k^{th}$ neuron in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer. We use $z^l_j$ to represent the input of the $j^{th}$ neuron in the $l^{th}$ layer, and $a^l_j$ to represent the activation output of the $j^{th}$ neuron in the $l^{th}$ layer. Similarly, $b^l_j$ represents the bias of the $j^{th}$ neuron in the $l^{th}$ layer.

Why use this cumbersome notation? It might seem more natural to use $j$ for the input neuron and $k$ for the output neuron. Why do we do the opposite? The reason is that the activation output of the $j^{th}$ neuron in the $l^{th}$ layer can then be expressed as,

$$a^l_j = \sigma\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right)$$
This expression can be rewritten in matrix form as follows,
$$a^l = \sigma(W^l a^{l-1} + b^l)$$
where $a^l$, $a^{l-1}$ and $b^l$ are vectors, and $W^l$ is the weight matrix of the $l^{th}$ layer whose element in the $j^{th}$ row and $k^{th}$ column is $w^l_{jk}$. The elements in the $j^{th}$ row of $W^l$ are the weights of the connections from the neurons in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer.

Then, we define the loss function $C$. Here we use the mean squared error (MSE) as an example,
$$C = \frac{1}{2}\frac{1}{m}\sum_{i}^{m}\left\|y^{(i)} - a^L(x^{(i)})\right\|^2$$
where $L$ denotes the number of layers in the network and $a^L$ denotes the final output of the network. The loss of a single training example is $C_{x^{(i)}} = \frac{1}{2}\left\|y^{(i)} - a^L\right\|^2$.

Note: Backpropagation actually computes the partial derivatives $\frac{\partial C_{x^{(i)}}}{\partial w}$ and $\frac{\partial C_{x^{(i)}}}{\partial b}$ for a single training example. Then, we calculate $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ by averaging over training samples (this step is for GD or mini-batch GD). Here we suppose the training example $x$ has been fixed, and in order to simplify notation, we drop the $x$ subscript and write the loss $C_{x^{(i)}}$ as $C$.
So, for each single training sample $x$, the loss may be written as,
$$C = \frac{1}{2}\left\|y - a^L\right\|^2 = \frac{1}{2}\sum_j (y_j - a^L_j)^2$$
Here, we define $\delta^l_j$ as
$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$

$\delta^l_j$ measures how much a change in the input of the $j^{th}$ neuron in the $l^{th}$ layer affects the network loss (details can be obtained from here).

Understanding: $\delta^l_j$ expresses how much a change in the input value of the $j^{th}$ neuron in the $l^{th}$ layer affects the final loss function.
And we have,

$$z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$$
$$a^l_j = \sigma(z^l_j)$$
Then,
$$\delta^L_j = \frac{\partial C}{\partial z^L_j} = \sum_k \frac{\partial C}{\partial a^L_k}\frac{\partial a^L_k}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\sigma'(z^L_j)$$
Moreover,
$$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k}\frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$$
Because
$$z^{l+1}_k = \sum_i w^{l+1}_{ki} a^l_i + b^{l+1}_k = \sum_i w^{l+1}_{ki}\sigma(z^l_i) + b^{l+1}_k$$
Differentiating, we obtain
$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj}\sigma'(z^l_j) \qquad (i = j)$$
Then, we get
$$\delta^l_j = \sum_k \delta^{l+1}_k w^{l+1}_{kj}\sigma'(z^l_j)$$

Understanding: $w^{l+1}_{kj}$ denotes the weight connecting the $k^{th}$ neuron in the $(l+1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer. The formula says that we take the gradient $\delta^{l+1}_k$ of every neuron in the $(l+1)^{th}$ layer, multiply each by its weight to the $j^{th}$ neuron in the $l^{th}$ layer, and sum the results.
Our goal is to update $w^l_{jk}$ and $b^l_j$, so we need to calculate the partial derivatives,
$$\frac{\partial C}{\partial w^l_{jk}} = \sum_i \frac{\partial C}{\partial z^l_i}\frac{\partial z^l_i}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j}\frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j a^{l-1}_k$$
$$\frac{\partial C}{\partial b^l_j} = \sum_i \frac{\partial C}{\partial z^l_i}\frac{\partial z^l_i}{\partial b^l_j} = \delta^l_j$$
So far, we have the four key formulas of backpropagation,
$$\delta^L_j = \frac{\partial C}{\partial a^L_j}\sigma'(z^L_j) \qquad (1)$$
$$\delta^l_j = \sum_k \delta^{l+1}_k w^{l+1}_{kj}\sigma'(z^l_j) \qquad (2)$$
$$\frac{\partial C}{\partial w^l_{jk}} = \delta^l_j a^{l-1}_k \qquad (3)$$
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad (4)$$


Deduce BP with Vectorization

Here we use the concept of the differential:
- Single-variable calculus: $\mathrm{d}f = f'(x)\,\mathrm{d}x$
- Multivariable calculus:
- Scalar by vector:

$$
\mathrm{d}f = \sum_i \frac{\partial f}{\partial x_i}\mathrm{d}x_i = {\frac{\partial f}{\partial \mathbf{x}}}^T\mathrm{d}\mathbf{x}
$$


- Scalar by matrix:

    Based on the trace of a matrix,
    $$
    \sum_i \sum_j a_{ij}b_{ij} = \mathrm{Tr}(A^TB)
    $$

    $$
    \mathrm{Tr}(AB) = \mathrm{Tr}(BA)
    $$

    we have,
    $$
    \mathrm{d}f = \sum_i \sum_j \frac{\partial f}{\partial x_{ij}}\mathrm{d}x_{ij} = \sum_i \sum_j \left[\frac{\partial f}{\partial \mathbf{X}}\right]_{ij} [\mathrm{d}\mathbf{X}]_{ij} = \mathrm{Tr}\left[{\left(\frac{\partial f}{\partial \mathbf{X}}\right)}^T \mathrm{d}\mathbf{X}\right]
    $$

    so,
    $$
    \mathrm{d}f = \mathrm{Tr}\left[{\frac{\partial f}{\partial \mathbf{X}}}^T \mathrm{d}\mathbf{X}\right]
    $$

We already have,

$$z^l = W^l a^{l-1} + b^l$$
$$a^l = \sigma(z^l)$$
And,
$$\frac{\partial J}{\partial W^l} = \frac{\partial J}{\partial a^l}\frac{\partial a^l}{\partial z^l}\frac{\partial z^l}{\partial W^l} = \frac{\partial J}{\partial z^l}\frac{\partial z^l}{\partial W^l}$$
$$\mathrm{d}z^l = \mathrm{d}W^l\, a^{l-1}$$
so,
$$\frac{\partial z^l}{\partial W^l} = (a^{l-1})^T$$
Then, we calculate $\frac{\partial J}{\partial z^l}$,
$$\mathrm{d}J = \mathrm{Tr}\left[\left(\frac{\partial J}{\partial a^l}\right)^T \mathrm{d}a^l\right] = \mathrm{Tr}\left[\left(\frac{\partial J}{\partial a^l}\right)^T \left(\sigma'(z^l)\odot\mathrm{d}z^l\right)\right]$$
$$\frac{\partial J}{\partial z^l} = \frac{\partial J}{\partial a^l}\odot\sigma'(z^l)$$
and,
$$\mathrm{d}J = \mathrm{Tr}\left[\left(\frac{\partial J}{\partial z^{l+1}}\right)^T \mathrm{d}z^{l+1}\right] = \mathrm{Tr}\left[\left(\frac{\partial J}{\partial z^{l+1}}\right)^T \mathrm{d}(W^{l+1}a^l)\right] = \mathrm{Tr}\left[\left(\frac{\partial J}{\partial z^{l+1}}\right)^T W^{l+1}\mathrm{d}a^l\right]$$
$$\frac{\partial J}{\partial a^l} = (W^{l+1})^T\frac{\partial J}{\partial z^{l+1}}$$
Until now we have,
$$\frac{\partial J}{\partial W^l} = \frac{\partial J}{\partial z^l}(a^{l-1})^T$$
$$\frac{\partial J}{\partial z^l} = \left((W^{l+1})^T\frac{\partial J}{\partial z^{l+1}}\right)\odot\sigma'(z^l)$$

We denote,
$$\delta^l = \frac{\partial J}{\partial z^l}$$
and we can rewrite these formulas in matrix-based form as,
$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L) \qquad (1)$$
$$\delta^l = \left((W^{l+1})^T\delta^{l+1}\right)\odot\sigma'(z^l) \qquad (2)$$
$$\nabla_{W^l} C = \delta^l (a^{l-1})^T \qquad (3)$$
$$\nabla_{b^l} C = \delta^l \qquad (4)$$
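
To make the matrix-based formulas concrete, here is a minimal NumPy sketch of one backward pass for a tiny two-layer fully connected network with a sigmoid activation and the MSE loss used above. The layer sizes, random data, and variable names (`W`, `b`, `delta`) are illustrative assumptions, not code from any particular source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Toy 2-layer network (sizes are arbitrary assumptions for illustration).
rng = np.random.default_rng(0)
sizes = [3, 4, 2]                      # input dim, hidden dim, output dim
W = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(2)]
b = [rng.standard_normal((sizes[i + 1], 1)) for i in range(2)]

x = rng.standard_normal((3, 1))        # a single training example
y = rng.standard_normal((2, 1))        # its target

# Forward pass: store z^l and a^l for every layer.
a, zs = [x], []
for Wl, bl in zip(W, b):
    z = Wl @ a[-1] + bl
    zs.append(z)
    a.append(sigmoid(z))

# Backward pass: formulas (1)-(4) in matrix form.
delta = (a[-1] - y) * sigmoid_prime(zs[-1])       # (1): delta^L, with grad of MSE
grad_W = [None, None]
grad_b = [None, None]
grad_W[1] = delta @ a[1].T                        # (3) for the last layer
grad_b[1] = delta                                 # (4)
delta = (W[1].T @ delta) * sigmoid_prime(zs[0])   # (2): back-propagate delta
grad_W[0] = delta @ a[0].T                        # (3)
grad_b[0] = delta                                 # (4)
```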

References:
- 知乎:矩阵求导术(上)
- Neural Networks and Deep Learning: How the backpropagation algorithm works
- The Matrix Cookbook
- Caltech: EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 5


Regularization

The key idea is to add another term to the loss that penalizes large weights.

L1 regularization

$$\lambda\sum_{i=1}^{n}|w_i| = \lambda\|w\|_1$$

By using L1 regularization, w will be sparse.

L2 regularization

L2 regularization is used much more often when training neural networks; it makes the weights more uniform,

$$\lambda\sum_{i=1}^{n}w_i^2 = \lambda\|w\|_2^2$$

In a neural network, the loss function with regularization is written as,

$$J(w^1,b^1,w^2,b^2,...,w^L,b^L) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|w^l\|_2^2$$
where $w^l$ is the weight matrix and $b^l$ is the bias vector of the $l^{th}$ layer.
When we do backpropagation to update the weights (here assume we use SGD), the gradient of $w^L$ is,
$$\frac{\partial J}{\partial w^L} = \frac{\partial \mathcal{L}(\hat{y}^{(i)}, y^{(i)})}{\partial w^L} + \lambda w^L$$
We denote $\frac{\partial \mathcal{L}(\hat{y}^{(i)}, y^{(i)})}{\partial w^L}$ as $\mathrm{d}w^L$, so the update is
$$w^L := w^L - \alpha\,\mathrm{d}w^L - \alpha\lambda w^L$$
Here, the $\lambda$ term is called **weight decay**: no matter what the value of $w^L$ is, this term decays the weights (makes the weights' absolute values smaller).
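
As a concrete illustration, here is a minimal NumPy sketch of one SGD step with L2 weight decay; the learning rate `lr`, decay strength `lam`, shapes, and the gradient `dW` are placeholder values assumed only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))      # current weight matrix
dW = rng.standard_normal((4, 3))     # gradient of the data loss w.r.t. W
lr, lam = 0.1, 0.01                  # learning rate and weight-decay strength

# SGD step with L2 regularization: the extra -lr * lam * W term
# shrinks every weight toward zero on every update ("weight decay").
W = W - lr * dW - lr * lam * W
```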

Dropout

Core concept:
1. Dropout randomly knocks out units in the network, so it's as if on every iteration we are working with a smaller neural network, and using a smaller neural network seems like it should have a regularization effect.
2. Dropout makes each neuron unable to rely on any single feature, so it spreads out the weights (see the sketch after this list).
    Understanding: on every iteration, dropout discards part of the inputs (sets some of them to 0), so the weights do not concentrate on one or a few input features but are distributed more evenly. This can be understood as shrinking the weights, which is why it behaves similarly to L2 regularization.
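
Here is a minimal NumPy sketch of inverted dropout applied to one activation matrix, assuming a keep probability of 0.8; the array shapes and variable names are illustrative, not taken from any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 5))   # activations of some layer, shape (units, batch)
keep_prob = 0.8                    # probability of keeping a unit

# Inverted dropout: zero out units at random, then rescale so the
# expected value of the activations stays the same at training time.
mask = (rng.random(a.shape) < keep_prob)
a = a * mask / keep_prob
```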

Tips for using Dropout:
1. Dropout is for preventing over-fitting. If the model is not over-fitting, it's better not to use dropout. (Understanding: dropout is there to solve over-fitting; if the model does not over-fit, there is no need to use it.)

2. Because of dropout, the loss function J cannot be defined explicitly, so it's hard to check whether the loss is decreasing correctly. A good practice is to turn dropout off first and check that the loss decreases as expected to make sure your code has no bugs, and then turn dropout back on.

Other Regularization Methods

  1. Data augmentation

    • Flipping
    • Random rotation
    • Cropping
    • Distortion
  2. Early stopping
    Check the dev set error and stop training early. $w$ is small at initialization and grows as the iterations go on; early stopping yields a mid-sized $w$, so it is similar to L2 regularization.

Preprocessing

Most datasets contain images of different sizes, so a common preprocessing pipeline is:
1. Scale images to the same size, or scale one side (width or height, often the shorter one) to the same size
2. Do data augmentation: flipping, random rotation
3. Crop a square from each image randomly
4. Mean subtraction

Per-pixel mean subtract

Subtract the per-pixel mean from the input image. If the whole training set has shape (N, C, H, W), the per-pixel mean is computed, for each channel C, by averaging the values at the same pixel position over all images, giving a mean array of shape (C, H, W).

# X size is (N, C, H, W)
mean = np.mean(X, axis=0)
mean.shape
>>> (C, H, W)
`Caffe` uses per-pixel mean subtraction in its tutorial.
Note: in per-pixel mean subtraction, each channel is handled independently, because pixel values are not stationary across channels (stationarity would mean that different parts of the image share the same statistics), and the mean is computed at each pixel position over all samples.

Per-channel mean subtract

Subtract the mean of each channel computed over all images. If the training set has shape (N, C, H, W), the mean is computed per channel over all images, giving a `mean vector` of shape (C,).
# X size is (N, C, H, W)
mean = np.mean(X, axis=(0, 2, 3))
mean.shape
>>> (C,)

Whether we use **per-pixel mean subtract** or **per-channel mean subtract**, both serve to "center" the data, i.e. to make the mean of the dataset approximately zero, which helps train the network (keeps the gradients healthy). As far as I know, **per-channel mean subtract** is the better and more common choice for preprocessing.

References:
- [Github: KaimingHe/deep-residual-networks: preprocessing? #5](https://github.com/KaimingHe/deep-residual-networks/issues/5)
- [caffe: Brewing ImageNet](http://caffe.berkeleyvision.org/gathered/examples/imagenet.html)
- [Google Groups: Subtract mean image/pixel](https://groups.google.com/forum/#!topic/digits-users/FfeFp0MHQfQ)
- [StackExchange: Why do we normalize images by subtracting the dataset's image mean and not the current image mean in deep learning?](https://stats.stackexchange.com/questions/211436/why-do-we-normalize-images-by-subtracting-the-datasets-image-mean-and-not-the-c)
- [MathWorks: What is per-pixel mean?](https://cn.mathworks.com/matlabcentral/answers/292415-what-is-per-pixel-mean)
Batch Normalization

Assume $X$ is a 4d input of shape (N, C, H, W); the output of the batch normalization layer is

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}\gamma + \beta$$
where $x$ is the 3d slice of shape (N, H, W) for one channel of the mini-batch. $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are calculated per dimension over the mini-batch, and $\gamma$ and $\beta$ are learnable parameter vectors of size C (the input size).

Understanding: along the C dimension, the values of all other dimensions are flattened into one vector to compute the mean and variance, and then normalized; that is, for each channel, compute the mean and variance over all values of all mini-batch samples and normalize.

A toy example:

import torch
import numpy as np
from torch import nn
from torch.autograd import Variable

x = np.array([
             [[[1,1], [1,1]], [[1,1], [1,1]]],
             [[[2,2], [2,2]], [[2,2], [2,2]]]
             ], dtype=np.float32)
x = Variable(torch.from_numpy(x))
# Batch normalization over 2 channels, no affine parameters.
bn = nn.BatchNorm2d(2, affine=False)
output = bn(x)
>>> Variable containing:
(0 ,0 ,.,.) =
 -1.0000 -1.0000
 -1.0000 -1.0000

(0 ,1 ,.,.) =
 -1.0000 -1.0000
 -1.0000 -1.0000

(1 ,0 ,.,.) =
  1.0000  1.0000
  1.0000  1.0000

(1 ,1 ,.,.) =
  1.0000  1.0000
  1.0000  1.0000
[torch.FloatTensor of size 2x2x2x2]
References:
- [PyTorch: BatchNorm2d](http://pytorch.org/docs/master/nn.html#batchnorm2d)
- [pytorch: 利用batch normalization对Variable进行normalize/instance normalize](http://blog.csdn.net/u014722627/article/details/68947016)
Weight Initialization

Input features $x \sim (\mu, \sigma^2)$; the output is $a = \sum_{i=1}^{n} w_i x_i$, and its variance is

$$\mathrm{Var}(a) = \mathrm{Var}\left(\sum_{i=1}^{n} w_i x_i\right) = \sum_{i=1}^{n}\mathrm{Var}(w_i x_i)$$
$$= \sum_{i=1}^{n}\left[\mathrm{E}(w_i)\right]^2\mathrm{Var}(x_i) + \left[\mathrm{E}(x_i)\right]^2\mathrm{Var}(w_i) + \mathrm{Var}(w_i)\mathrm{Var}(x_i)$$
$$= \sum_{i=1}^{n}\mathrm{Var}(w_i)\mathrm{Var}(x_i)$$
$$= n\,\mathrm{Var}(w)\,\mathrm{Var}(x)$$
Here, we assumed zero-mean inputs and weights, so $\mathrm{E}[x_i] = 0$ and $\mathrm{E}[w_i] = 0$; $w_i$ and $x_i$ are independent of each other, the $x_i\ (i=1,2,...,n)$ are independent and identically distributed, and the $w_i\ (i=1,2,...,n)$ are also independent and identically distributed.
If we want the output $a$ to have the same variance as each of its inputs $x$, the variance of $w$ needs to be $\frac{1}{n}$, i.e. $\mathrm{Var}(w) = \frac{1}{n}$, which means $w \sim (0, \frac{1}{n})$.

Understanding: we assumed that the input features and the weights both have zero mean, $\mathrm{E}[x_i]=0$ and $\mathrm{E}(w_i)=0$, that $w_i$ and $x_i$ are mutually independent, and that the $x_i$ are i.i.d. and the $w_i$ are i.i.d. Therefore, if we want $a$ to have the same variance as $x$ (so that the distribution does not change between the input and output of the layer), we need $\mathrm{Var}(w) = \frac{1}{n}$. Since $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$, we can use `w = np.random.randn(n) / sqrt(n)`. In addition, deep learning implementations often initialize the parameters as shown below.

import math
import numpy as np

n = 64                       # example value: number of inputs to the neuron (fan-in)
# Calculate standard deviation.
stdv = 1 / math.sqrt(n)
# Sample weights uniformly in [-stdv, stdv] (NumPy).
w = np.random.uniform(-stdv, stdv, size=n)
That is, we sample uniformly within one standard deviation around the mean 0, which keeps the weights $w$ close to 0.

References:
- [cs231n: Weight Initialization](http://cs231n.github.io/neural-networks-2/#init)
- [Wiki: Variance](https://en.wikipedia.org/wiki/Variance)
- [知乎: 为什么神经网络在考虑梯度下降的时候,网络参数的初始值不能设定为全0,而是要采用随机初始化思想?](https://www.zhihu.com/question/36068411)
Optimization Methods

The loss function is defined as $J(W, X)$, where $X$ is an $m \times n$ matrix ($m$ samples, $n$ features) and $\eta$ is the learning rate.

Batch Gradient Descent (BGD)

BGD sums the gradients of all samples and takes the mean,

$$w_i := w_i - \eta\frac{1}{m}\sum_{k=1}^{m}\nabla_{w_i}J(w,x)^{(k)}$$

Advantages:
- Simpleness

Disadvantages:
- Large amount of computation
- Memory may not be enough to hold all samples
- Difficult to update weights online

When the training set is very large, BGD takes a lot of time.

Stochastic Gradient Descent

SGD takes one sample and calculates its gradient to update the weights,

$$w_i := w_i - \eta\nabla_{w_i}J(w,x)^{(k)}$$
A drawback of SGD is that the update direction does not always point toward the minimum, because you only calculate the gradient of one sample at a time.

Mini-batch Gradient Descent

This method calculates the gradient of a mini-batch of samples and takes the mean to update the weights,

$$w_i := w_i - \eta\frac{1}{b}\sum_{k=j}^{j+b}\nabla_{w_i}J(w,x)^{(k)}$$
n_batches = m // batch_size
for i in range(n_batches):
    # Use matrix operations to compute the loss of one mini-batch.
    output = ...  # forward pass on the i-th mini-batch
    loss = loss_function(output, target)
    # Update weights with the gradient of the mini-batch loss.
    w = w - lr * gradient
Mini-batch GD is much faster than batch GD. The loss of batch GD goes down monotonically (assuming the learning rate is suitable), while the loss of SGD or mini-batch GD is noisy: sometimes it decreases and sometimes it increases.

Summary
In one epoch (a single pass through the training set), BGD updates the weights once; SGD updates the weights $m$ times; mini-batch GD updates the weights $\frac{m}{\text{batch size}}$ times.

Note: an epoch means one full pass over the entire training set.

How to choose mini-batch size?
- If the training set is small ($m \le 2000$): use batch gradient descent
- Typical mini-batch sizes: 64 ~ 512, a power of 2 (because of the way computer memory is laid out and accessed, this makes computation run faster)


The following methods are optimizations built on top of gradient descent. We use $g$ to denote the gradient $\nabla J(w,x)$ (this gradient can be the mean over all samples, the gradient of a single sample, or the mean over a batch of samples).

Note: in deep learning we often say SGD, but you should know that SGD here usually refers to mini-batch gradient descent.

Exponentially Weighted Averages

  • $v_t$: exponentially weighted average (a moving average) at time $t$
  • $\theta_t$: current value

$$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$
It can be rewritten as,
$$v_t = (1-\beta)\theta_t + (1-\beta)\beta\theta_{t-1} + (1-\beta)\beta^2\theta_{t-2} + ...$$

$v_t$ is approximately an average over the previous $\frac{1}{1-\beta}$ data points, and $v_0$ is 0.

Understanding: $v_t$ approximates the average of the previous $\frac{1}{1-\beta}$ values. The method is a windowed averaging scheme in which $\beta$ determines the window size. (This average can also be used to predict the value at the next time step.)

In essence it is a moving average with exponentially decaying weights: more recent data receive larger weights, while older data still receive some weight.
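
A minimal NumPy sketch of the recursion above; the series `theta` and `beta = 0.9` are made-up example values.

```python
import numpy as np

theta = np.array([10.0, 12.0, 11.0, 13.0, 15.0, 14.0])  # observed values
beta = 0.9                                               # decay factor

v = 0.0
ewa = []
for x in theta:
    v = beta * v + (1 - beta) * x   # v_t = beta * v_{t-1} + (1 - beta) * theta_t
    ewa.append(v)                   # roughly an average over the last 1/(1-beta) points

print(ewa)
```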

Momentum

The basic idea is to compute an exponentially weighted average of the gradients, and use this average gradient to update the weights.

  • $\mathrm{d}w$: current gradient
  • $\eta$: learning rate

$$v_{dw} := \beta v_{dw} + (1-\beta)\mathrm{d}w$$
$$w := w - \eta v_{dw}$$

Some versions of momentum are written as (like PyTorch),

$$v_{dw} := \beta v_{dw} + \mathrm{d}w$$

This means that $v$ is effectively divided by $1-\beta$ (i.e. scaled up) compared with the first version.

The most common value for $\beta$ is 0.9 (roughly an average over the last 10 gradients) for both versions of momentum. The difference is that the second version's $v$ is larger than the first one's, which only has an influence on the effective learning rate.

Understanding: momentum uses the exponentially weighted average of recent gradients as the gradient for updating the parameters, i.e. it combines current and historical information to correct the gradient and obtain a better optimization direction.
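
A minimal NumPy sketch of the momentum update above applied to a toy quadratic loss; the loss, learning rate, and `beta = 0.9` are illustrative assumptions.

```python
import numpy as np

def grad(w):
    # Gradient of a toy quadratic loss J(w) = 0.5 * ||w||^2.
    return w

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
lr, beta = 0.1, 0.9

for _ in range(100):
    dw = grad(w)
    # Version 1 (exponentially weighted average of gradients):
    v = beta * v + (1 - beta) * dw
    # Version 2 (PyTorch-style) would instead be: v = beta * v + dw
    w = w - lr * v

print(w)   # close to the minimum at [0, 0]
```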

References:
- deeplearning.ai: Momentum

Nesterov Momentum

Init $v_{dw} = 0$.
Then in each iteration $t$:
Compute $\mathrm{d}w$ and $\mathrm{d}b$ on the current mini-batch, then

$$v := \mu v_{t-1} + g$$
$$w^{\text{next}}_i = w_i - \eta v$$
$$v := \mu v_{t-1} + g_{w^{\text{next}}_i}$$
$$w_i := w_i - \eta v$$

That is, we first take a provisional momentum step to get a look-ahead position $w^{\text{next}}_i$, evaluate the gradient there ($g_{w^{\text{next}}_i}$), and then use that gradient in the actual momentum update.

Reference:
- 知乎专栏:深度学习最全优化方法总结比较(SGD,Adagrad,Adadelta,Adam,Adamax,Nadam)
- 卷积神经网络中的优化算法比较 (Note: this blog post contains some mistakes; focus on the ideas it explains)
- 知乎:在神经网络中weight decay起到的做用是什么?momentum呢?normalization呢?

RMSprop (Root Mean Square)

Init $s_{dw} = 0$.
Then in iteration $t$:
Compute $\mathrm{d}w$ and $\mathrm{d}b$ on the current mini-batch, then

$$s_{dw} := \beta s_{dw} + (1-\beta)\mathrm{d}w^2$$
$$w := w - \eta\frac{\mathrm{d}w}{\sqrt{s_{dw}}}$$
Here $\mathrm{d}w^2$ is the element-wise square of $\mathrm{d}w$.

Understanding: intuitively, through $s$, a parameter with a large gradient is divided by a large value so that its update becomes smaller, while a parameter with a small gradient is divided by a small value so that its update becomes larger.

In order to avoid $s_{dw}$ being zero or near zero, in practice we often add a small value $\epsilon$ (about $10^{-8}$) to the denominator, $\frac{\mathrm{d}w}{\sqrt{s_{dw}} + \epsilon}$, to avoid getting inf or a very large value.

Adam (Adaptive Moment Estimation)

A combination of Momentum and RMSprop

Init $v_{dw} = 0$, $s_{dw} = 0$.
Then in iteration $t$:
Compute $\mathrm{d}w$ and $\mathrm{d}b$ on the current mini-batch, then

$$v_{dw} := \beta_1 v_{dw} + (1-\beta_1)\mathrm{d}w$$
$$s_{dw} := \beta_2 s_{dw} + (1-\beta_2)\mathrm{d}w^2$$
Do bias correction,
$$v^{\text{corrected}}_{dw} = \frac{v_{dw}}{1-\beta_1^t}$$
$$s^{\text{corrected}}_{dw} = \frac{s_{dw}}{1-\beta_2^t}$$
Update weights,
$$w := w - \eta\frac{v^{\text{corrected}}_{dw}}{\sqrt{s^{\text{corrected}}_{dw}} + \epsilon}$$

A common value for $\beta_1$ is 0.9, and for $\beta_2$ it is 0.999.
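
A minimal NumPy sketch of the Adam update above on a toy quadratic loss; the loss, learning rate, and hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def grad(w):
    # Gradient of a toy quadratic loss J(w) = 0.5 * ||w||^2.
    return w

w = np.array([1.0, -2.0])
v = np.zeros_like(w)          # first moment (momentum term)
s = np.zeros_like(w)          # second moment (RMSprop term)
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    dw = grad(w)
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_hat = v / (1 - beta1 ** t)      # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)

print(w)   # approaches the minimum at [0, 0]
```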

Learning Rate Decay

During training, decrease the learning rate as the number of epochs increases. In earlier epochs the network can accept a relatively large learning rate, which accelerates training, but as the loss decreases we get closer to the optimal solution, and a smaller learning rate helps us settle in a tighter region around the minimum.
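
One common form is to shrink the rate by a fixed factor as the epochs go by; the decay rule and constants below are just one illustrative choice, not a prescription from the text.

```python
# Illustrative decay schedule: eta = eta0 / (1 + decay_rate * epoch)
eta0, decay_rate = 0.1, 0.05

for epoch in range(10):
    lr = eta0 / (1 + decay_rate * epoch)
    print(epoch, round(lr, 4))   # the learning rate shrinks as training progresses
```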


1D, 2D, 3D Convolutions

  • 1D convolution:
    • Input: a vector [Cin, Lin]
    • Kernel: [Cin, k]
    • Output (one kernel): a vector [Lout,]
  • 2D convolution:
    • Input: an image [1, H, W] or [Cin, H, W]
    • Kernel: [Cin, k, k]
    • Output (one kernel): a feature map [Hout, Wout]
  • 3D convolution:
    • Input: a video or CT volume [Cin, D, H, W]
    • Kernel: [Cin, k, k, k]
    • Output (one kernel): [Dout, Hout, Wout]

Notice that the number of dimensions of the output produced by a single kernel determines what kind of convolution it is (1D, 2D, or 3D).
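
A small PyTorch sketch checking these shapes (PyTorch is already used in the Batch Normalization example above); the tensor sizes are arbitrary examples.

```python
import torch
from torch import nn

x1 = torch.randn(1, 3, 32)              # (batch, C_in, L_in)
x2 = torch.randn(1, 3, 32, 32)          # (batch, C_in, H, W)
x3 = torch.randn(1, 3, 8, 32, 32)       # (batch, C_in, D, H, W)

conv1 = nn.Conv1d(3, 1, kernel_size=3)  # one kernel of shape (C_in, k)
conv2 = nn.Conv2d(3, 1, kernel_size=3)  # one kernel of shape (C_in, k, k)
conv3 = nn.Conv3d(3, 1, kernel_size=3)  # one kernel of shape (C_in, k, k, k)

print(conv1(x1).shape)   # torch.Size([1, 1, 30])         -> 1D output
print(conv2(x2).shape)   # torch.Size([1, 1, 30, 30])     -> 2D output
print(conv3(x3).shape)   # torch.Size([1, 1, 6, 30, 30])  -> 3D output
```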

References:
- 网易-deeplearning.ai: Convolution over volumes

Loss Function

Classification

Cross Entropy

$$H(Y, \hat{Y}) = E_Y\left[\log\frac{1}{\hat{Y}}\right] = -E_Y\left[\log\hat{Y}\right]$$

Basic knowledge:
- Entropy (Shannon entropy): Shannon defined the entropy $H$ of a discrete random variable $X$ with possible values $x_1, x_2, ..., x_n$ and probability mass function $P(X)$ as:

$$
H(X) = E[I(X)] = E[-\ln(P(X))]
$$

Here $E$ is the *expected value operator*, and $I$ is the *information content* of $X$.

It can be written explicitly as,

$$
H(X) = \sum_{i=1}^{n}P(x_i)I(x_i) = -\sum_{i=1}^{n}P(x_i)\log_b P(x_i)
$$
where $b$ is the base of the logarithm used. Common values of $b$ are 2, Euler's number $e$, and 10. In machine learning and deep learning, people often use $e$.


  • KL divergence from $\hat{Y}$ to $Y$ is the difference between the cross entropy and the entropy,

$$KL(Y\|\hat{Y}) = \sum_i y_i\log\frac{1}{\hat{y}_i} - \sum_i y_i\log\frac{1}{y_i} = \sum_i y_i\log\frac{y_i}{\hat{y}_i}$$

Note: the essence of entropy is the expectation of the Shannon information content, where the information content is the $I(X)$ in the formula above.
- Information content: a measure of the information carried by the occurrence of the event represented by the random variable. Low-probability events carry a lot of information, while high-probability events carry little, so the amount of information is inversely related to the probability of the event.
- Entropy: the average information content of the random variable $X$.
- Cross entropy: the amount of information needed when using an estimated distribution $q$ to approximate the true distribution $p$.
- KL divergence: the difference between the cross entropy and the entropy.
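
A minimal NumPy sketch of the cross-entropy loss for a single classification example; the predicted distribution and one-hot label are made-up values.

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])          # one-hot true label
y_hat = np.array([0.2, 0.7, 0.1])      # predicted probabilities (sum to 1)

# Cross entropy H(y, y_hat) = -sum_i y_i * log(y_hat_i), using the natural log.
cross_entropy = -np.sum(y * np.log(y_hat))
print(cross_entropy)   # = -log(0.7), about 0.357
```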

References:
- A Friendly Introduction to Cross-Entropy Loss
- 知乎:如何通俗的解释交叉熵与相对熵?




Awesome Papers

Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv-2014

This paper proposes VGG Net. The core idea is to use small kernels (3 * 3) to build a fairly deep network; using small kernels keeps the number of parameters under control while the network gets deeper.
The network uses fully connected layers during training, and at test time, in order to handle images of different sizes, the parameters of the fully connected layers are converted into convolution kernels of the corresponding size, turning them into fully convolutional layers.

Deep Residual Learning for Image Recognition, CVPR-2016

The paper describes the degradation problem of deep networks: as depth increases, the training loss of a deep network can become larger than that of a shallower one; such networks are referred to as plain networks. The paper argues that the problem is not caused by vanishing gradients, because these plain networks use BN, which ensures that the forward-propagated signals have non-zero variance, and the authors verify experimentally that, thanks to BN, the gradients in backward propagation behave normally. The authors conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error, and that simply increasing the number of iterations cannot solve this problem. My understanding: the parameter space of these ill-behaved networks contains many saddle-like regions where the gradient barely changes, which hampers optimization.
Therefore, the paper proposes residual learning:

$$\mathcal{F}(x) = \mathcal{H}(x) - x$$

where $\mathcal{H}(x)$ is the output of a few stacked layers and $x$ denotes the input of the first of these layers. This makes the network learn the residual between the input and the output, and this residual learning is realized by skip connections.
Note: a residual network learns the residual between the output and the input; that is, the output equals the input $x$ plus $\mathcal{F}(x)$. Previous methods learn a direct mapping from input to output, $x \rightarrow \text{output}$, with no residual learning.

Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

This is CMU's multi-person pose estimation paper, and its results are quite good. The core idea is heatmap + PAF: the heatmaps predict the keypoints of multiple people, and the PAFs (part affinity fields) predict the direction of each bone. By computing the direction of each bone on the image (represented as a unit vector), it builds a representation of the relationships between keypoints. The network's labels are a heatmap for every keypoint and a PAF for every bone (twice the number of bones, describing the x and y directions respectively). In the data preprocessing, a MATLAB script turns the annotation of a single image containing multiple people into multiple samples, i.e. split into self_joints and others_joints; the data augmentation is embedded in the Caffe code and applies scale, rotate, crop and pad, and flip to each image in sequence.
