Basics
Standard notations
- Variable: $X$ (upper-case and no bold)
- Matrix: $\mathbf{X}$ (upper-case and bold)
- Vector: $\mathbf{x}$ (lower-case and bold)
- Element/Scalar: $x$ (lower-case and no bold)
Basic Steps for Deep Learning
- Define the model structure
- Initialize the model’s parameters
- Loop:
  - Calculate the current loss (forward propagation)
  - Calculate the current gradient (backward propagation)
  - Update the parameters (gradient descent)
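The steps above can be sketched on the smallest possible model. The example below is only an illustration: the linear model, the MSE loss, the toy data and the helper name `train_linear_model` are all assumptions, not something these notes prescribe.

```python
import numpy as np

def train_linear_model(X, y, lr=0.1, n_iters=500):
    """Fit y ~ X @ w + b with plain gradient descent on the MSE loss."""
    n_samples, n_features = X.shape
    # Define the model structure (one linear layer) and initialize its parameters.
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    for _ in range(n_iters):
        # Forward propagation: predictions and current loss.
        err = X @ w + b - y
        losses.append(0.5 * np.mean(err ** 2))
        # Backward propagation: gradient of the loss w.r.t. the parameters.
        dw = X.T @ err / n_samples
        db = err.mean()
        # Gradient descent update.
        w -= lr * dw
        b -= lr * db
    return w, b, losses

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.3          # noiseless toy targets
w, b, losses = train_linear_model(X, y)
```

On this noiseless toy problem the loss shrinks steadily and `w`, `b` approach the values used to generate the data.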
Backpropagation
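Here are some notations we will need later. We use $w^l_{jk}$ to denote the weight of the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer, $b^l_j$ for the bias of the $j$th neuron in the $l$th layer, $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ for its weighted input, and $a^l_j = \sigma(z^l_j)$ for its activation.

Why use this cumbersome notation? Maybe it seems better to use $j$ to refer to the input neuron and $k$ to refer to the output neuron. The reason is that with this ordering a whole layer can be written in matrix form without a transpose, $a^l = \sigma(w^l a^{l-1} + b^l)$.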
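Then we define the loss function $C$. Here we use the mean squared error (MSE) as an example,
$$
C = \frac{1}{m}\sum_{i=1}^{m} C_{x^{(i)}}, \qquad C_{x^{(i)}} = \frac{1}{2}\left\| y^{(i)} - a^L(x^{(i)}) \right\|^2
$$
where $m$ is the number of training samples, $y^{(i)}$ is the target of sample $x^{(i)}$, and $a^L$ is the activation of the output layer $L$.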
Note: Backpropagation actually computes the partial derivatives $\partial C_{x^{(i)}} / \partial w$ and $\partial C_{x^{(i)}} / \partial b$ for a single training example. We then obtain $\partial C / \partial w$ and $\partial C / \partial b$ by averaging over the training samples (this step is for GD or mini-batch GD). Here we suppose the training example $x$ has been fixed, and to simplify the notation we drop the $x$ subscript, writing the loss $C_{x^{(i)}}$ as $C$.
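So, for each single training sample, we define the error of the $j$th neuron in the $l$th layer as
$$
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}
$$
where $z^l_j$ is the weighted input of that neuron. $\delta^l_j$ measures how strongly a change in the input of the $j$th neuron in the $l$th layer changes the network loss (details can be found in the references at the end of this section).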
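And we have,
$$
\delta^l_j = \sum_k w^{l+1}_{kj}\,\delta^{l+1}_k\,\sigma'(z^l_j)
$$
Understanding: $w^{l+1}_{kj}$ is the weight connecting the $k$th neuron in the $(l+1)$th layer to the $j$th neuron in the $l$th layer. The formula says: multiply the error of every neuron in the $(l+1)$th layer by its weight to the $j$th neuron in the $l$th layer, sum the products, and scale the sum by the derivative of the activation function at the weighted input $z^l_j$.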
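Our goal is to update $w^l_{jk}$ and $b^l_j$, so we need to calculate the partial derivatives,
$$
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\delta^l_j, \qquad \frac{\partial C}{\partial b^l_j} = \delta^l_j
$$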
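A minimal NumPy sketch of backpropagation for one training sample, checked against a finite-difference gradient. The sigmoid activation, the per-sample loss $C = \frac{1}{2}\|y - a^L\|^2$, the layer sizes and all function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Return the weighted inputs z^l and the activations a^l of every layer."""
    a, zs, activations = x, [], [x]
    for W, b in params:
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def loss(params, x, y):
    # Assumed per-sample loss: C = 0.5 * ||y - a^L||^2
    return 0.5 * np.sum((y - forward(params, x)[1][-1]) ** 2)

def backprop(params, x, y):
    """Return [(dC/dW, dC/db)] per layer for one training sample."""
    zs, activations = forward(params, x)
    grads = [None] * len(params)
    # Output layer error: delta^L = (a^L - y) * sigma'(z^L)
    s = sigmoid(zs[-1])
    delta = (activations[-1] - y) * s * (1 - s)
    for l in range(len(params) - 1, -1, -1):
        # dC/dw_jk = a_k (previous layer) * delta_j ; dC/db_j = delta_j
        grads[l] = (np.outer(delta, activations[l]), delta)
        if l > 0:
            # Propagate the error: delta_j = sum_k w_kj * delta_k * sigma'(z_j)
            s = sigmoid(zs[l - 1])
            delta = (params[l][0].T @ delta) * s * (1 - s)
    return grads

rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), rng.normal(size=4)),
          (rng.normal(size=(2, 4)), rng.normal(size=2))]
x, y = rng.normal(size=3), np.array([0.0, 1.0])
grads = backprop(params, x, y)

# Finite-difference check of one weight in the first layer.
eps = 1e-6
W0 = params[0][0]
W0[1, 2] += eps
c_plus = loss(params, x, y)
W0[1, 2] -= 2 * eps
c_minus = loss(params, x, y)
W0[1, 2] += eps
numeric = (c_plus - c_minus) / (2 * eps)
```

`numeric` should agree with `grads[0][0][1, 2]` up to finite-difference error, which is a cheap sanity check for a hand-written backward pass.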
Deduce BP with Vectorization
Here we use the concept of differential:
- Single-variable calculus: $\mathrm{d}f = f'(x)\mathrm{d}x$
- Multivariable calculus:
  - Scalar to vector
<p>
$$
\mathrm{d}f = \sum_i \frac{\partial f}{\partial x_i}\mathrm{d}x_i = {\frac{\partial f}{\partial \mathbf{x}}}^T\mathrm{d}\mathbf{x}
$$
</p>
  - Scalar to matrix
<p>
Based on the following properties of the trace of a matrix,
$$
\sum_i \sum_j a_{ij}b_{ij} = \mathrm{Tr}(A^TB)
$$
$$
\mathrm{Tr}(AB) = \mathrm{Tr}(BA)
$$
we can have,
$$
\mathrm{d}f = \sum_i \sum_j \frac{\partial f}{\partial x_{ij}}\mathrm{d}x_{ij} = \sum_i \sum_j \left[\frac{\partial f}{\partial \mathbf{X}}\right]_{ij} [\mathrm{d}\mathbf{X}]_{ij} = \mathrm{Tr}\left[{\left(\frac{\partial f}{\partial \mathbf{X}}\right)}^T \mathrm{d}\mathbf{X}\right]
$$
so,
$$
\mathrm{d}f = \mathrm{Tr}\left({\frac{\partial f}{\partial \mathbf{X}}}^T\mathrm{d}\mathbf{X}\right)
$$
</p>
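We already have,
$$
\mathbf{z}^l = \mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l, \qquad \mathbf{a}^l = \sigma(\mathbf{z}^l)
$$
We note,
$$
\boldsymbol{\delta}^l = \frac{\partial C}{\partial \mathbf{z}^l}
$$
Then, holding $\mathbf{a}^{l-1}$ and $\mathbf{b}^l$ fixed,
$$
\mathrm{d}C = {\boldsymbol{\delta}^l}^T \mathrm{d}\mathbf{z}^l = \mathrm{Tr}\left({\boldsymbol{\delta}^l}^T \mathrm{d}\mathbf{W}^l\,\mathbf{a}^{l-1}\right) = \mathrm{Tr}\left(\mathbf{a}^{l-1} {\boldsymbol{\delta}^l}^T \mathrm{d}\mathbf{W}^l\right)
$$
so $\frac{\partial C}{\partial \mathbf{W}^l} = \boldsymbol{\delta}^l {\mathbf{a}^{l-1}}^T$, i.e. $\partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j$ element-wise. In the same way, $\frac{\partial C}{\partial \mathbf{b}^l} = \boldsymbol{\delta}^l$.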
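The trace identities and the "read the gradient off the differential" rule can be checked numerically. The function $f(\mathbf{X}) = \mathbf{u}^T \mathbf{X} \mathbf{v}$ below is a hypothetical example chosen because its differential is easy to write down: $\mathrm{d}f = \mathbf{u}^T \mathrm{d}\mathbf{X}\,\mathbf{v} = \mathrm{Tr}(\mathbf{v}\mathbf{u}^T \mathrm{d}\mathbf{X})$, hence $\partial f / \partial \mathbf{X} = \mathbf{u}\mathbf{v}^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))

# sum_ij a_ij * b_ij == Tr(A^T B)
lhs = np.sum(A * B)
rhs = np.trace(A.T @ B)

# Tr(MN) == Tr(NM) for compatible shapes
M = rng.normal(size=(3, 4))
N = rng.normal(size=(4, 3))
tr_mn = np.trace(M @ N)
tr_nm = np.trace(N @ M)

# f(X) = u^T X v, whose gradient read off the differential is u v^T.
u = rng.normal(size=3)
v = rng.normal(size=4)
X = rng.normal(size=(3, 4))
f = lambda X: u @ X @ v
analytic = np.outer(u, v)

# Finite-difference gradient, entry by entry.
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
```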
Reference:
- Zhihu: Matrix Differentiation Techniques, Part 1 (矩阵求导术(上))
- Neural Networks and Deep Learning: How the backpropagation algorithm works
- The Matrix Cookbook
- Caltech: EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 5
Regularization
The key idea is to add another term to the loss that penalizes large weights.
L1 regularization
The penalty term is $\lambda\|w\|_1 = \lambda\sum_i |w_i|$. With L1 regularization, $w$ becomes sparse.
L2 regularization
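L2 regularization is used much more often when training neural networks; it makes the weights small and more uniform. The penalty term is
$$
\frac{\lambda}{2}\|w\|_2^2 = \frac{\lambda}{2}\sum_i w_i^2
$$
In a neural network, the loss function with regularization is written as,
$$
J(w) = L(w) + \frac{\lambda}{2}\sum_l \|W^l\|_F^2
$$
where $L(w)$ is the unregularized loss and $W^l$ is the weight matrix of layer $l$. When we do backpropagation to update the weights (here assume we use SGD), the gradient of the penalty term with respect to $W^l$ is $\lambda W^l$, so the update with learning rate $\eta$ becomes
$$
W^l := W^l - \eta\left(\frac{\partial L}{\partial W^l} + \lambda W^l\right) = (1 - \eta\lambda)W^l - \eta\frac{\partial L}{\partial W^l}
$$
which is why L2 regularization is also called weight decay.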
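A small check that an SGD step on an L2-regularized loss is the same as shrinking the weights and then taking a plain gradient step. It assumes the penalty is written as $\frac{\lambda}{2}\|w\|^2$; `grad_loss`, `lr` and `lam` are placeholder values, not taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
grad_loss = rng.normal(size=5)   # stand-in for dL/dw from backpropagation
lr, lam = 0.1, 0.01              # hypothetical learning rate and lambda

# Gradient step on the regularized loss J = L + (lam / 2) * ||w||^2
w_regularized = w - lr * (grad_loss + lam * w)

# The same step written as "decay the weights, then take a plain gradient step"
w_decay = (1 - lr * lam) * w - lr * grad_loss

# L1 penalty for comparison: its (sub)gradient lam * sign(w) has constant
# magnitude, which keeps pushing small weights all the way to zero.
w_l1 = w - lr * (grad_loss + lam * np.sign(w))
```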
Dropout
Core concept:
1. Dropout randomly knocks out units in the network, so on every iteration we are effectively training a smaller neural network, and a smaller network should have a regularizing effect.
2. A neuron cannot rely on any single feature, so it has to spread out its weights.

Understanding: on every iteration dropout discards part of the input (sets some inputs to 0), so the weights do not concentrate on one or a few input features but are distributed more evenly. This can be seen as shrinking the weights, which makes it similar to L2 regularization.
Tips of using Dropout:
1. Dropout is for preventing over-fitting. If the model is not over-fitting, it is better not to use dropout.
2. Because of dropout, the loss function $J$ is no longer well defined, so it is hard to check whether the loss decreases correctly. A good practice is to turn dropout off first, check that the loss decreases as expected to make sure the code has no bugs, and then turn dropout on.
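A sketch of how dropout is commonly implemented. Note the assumption: this is the "inverted dropout" variant, which rescales the kept units by `1 / keep_prob` so that the expected activation is unchanged; the notes above do not say which variant they have in mind. The `training` flag is how dropout gets turned off for debugging or at test time.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout: zero units with prob 1 - keep_prob, rescale the rest."""
    if not training or keep_prob >= 1.0:
        return a, np.ones_like(a)
    mask = (rng.random(a.shape) < keep_prob) / keep_prob
    return a * mask, mask

def dropout_backward(da, mask):
    # Gradients flow only through the units that were kept.
    return da * mask

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
out_eval, _ = dropout_forward(a, keep_prob=0.8, rng=rng, training=False)
```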
Other Regularization Methods
Data augmentation
- Flipping
- Random rotation
- Clip
- Distortion
Early stopping
Monitor the dev set error and stop training early. $w$ is small at initialization and grows as training proceeds, so stopping early yields a mid-sized $w$, which makes the effect similar to L2 regularization.
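One possible stopping rule, as a sketch: stop once the dev error has not improved for `patience` epochs. The patience criterion, the function name and the error values are assumptions for illustration; the notes only say to watch the dev set error.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return the index of the epoch at which training would stop."""
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(dev_errors):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(dev_errors) - 1

dev_errors = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.60, 0.65]
stop_epoch = early_stopping_epoch(dev_errors)
best_epoch = min(range(stop_epoch + 1), key=dev_errors.__getitem__)
```

In practice the weights saved at `best_epoch` are the ones to keep.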
Preprocessing
Most datasets contain images of different sizes. A common preprocessing pipeline is:
1. Scale the images to the same size, or scale one side (width or height, often the shorter one) to the same size
2. Do data augmentation: flipping, random rotation
3. Crop a square from each image randomly
4. Subtract the mean
Per-pixel mean subtract
Subtract the per-pixel mean from each input image. The whole training set has shape $(N, C, H, W)$. The per-pixel mean is calculated, for each channel $C$, by averaging the pixels at the same position over all images, which gives a `mean matrix` of size $(C, H, W)$.
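```python
import numpy as np

# X has shape (N, C, H, W); random data stands in for the training set
X = np.random.rand(8, 3, 32, 32)
mean = np.mean(X, axis=0)        # average over the N images
print(mean.shape)                # (3, 32, 32), i.e. (C, H, W)
X_centered = X - mean            # broadcasts over N
```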
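`Caffe` uses per-pixel mean subtraction in its [tutorial]().

**Note:** with the per-pixel mean, each channel is processed independently, because pixels in different channels are not stationary (stationarity meaning that different parts of the image share the same statistics), and the mean at each pixel position is computed over all samples.

### Per-channel mean subtract

Subtract the per-channel mean calculated over all images. The training set has shape $(N, C, H, W)$; the mean is calculated for each channel over all images, which gives a `mean vector` of size $(C,)$.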
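```python
import numpy as np

# X has shape (N, C, H, W); random data stands in for the training set
X = np.random.rand(8, 3, 32, 32)
mean = np.mean(X, axis=(0, 2, 3))            # average over N, H and W
print(mean.shape)                            # (3,), i.e. (C,)
X_centered = X - mean[None, :, None, None]   # broadcast over N, H, W
```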
Both **per-pixel mean subtract** and **per-channel mean subtract** serve to "center" the data, i.e. to make the mean of the dataset approximately zero, which helps train the network (it keeps the gradients healthy). As far as I know, **per-channel mean subtract** is the better and more common choice for preprocessing.

***References:***
- [Github: KaimingHe/deep-residual-networks: preprocessing? #5](https://github.com/KaimingHe/deep-residual-networks/issues/5)
- [caffe: Brewing ImageNet](http://caffe.berkeleyvision.org/gathered/examples/imagenet.html)
- [Google Groups: Subtract mean image/pixel](https://groups.google.com/forum/#!topic/digits-users/FfeFp0MHQfQ)
- [StackExchange: Why do we normalize images by subtracting the dataset’s image mean and not the current image mean in deep learning?](https://stats.stackexchange.com/questions/211436/why-do-we-normalize-images-by-subtracting-the-datasets-image-mean-and-not-the-c)
- [MathWorks: What is per-pixel mean?](https://cn.mathworks.com/matlabcentral/answers/292415-what-is-per-pixel-mean)
## Batch Normalization
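Assume $X$ is a 4d input of shape $(N, C, H, W)$. The output of the batch normalization layer is
$$
y_{n,c,h,w} = \gamma_c\,\frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c
$$
where $\mu_c$ and $\sigma_c^2$ are the mean and (biased) variance of channel $c$ computed over the $N$, $H$ and $W$ dimensions of the mini-batch, $\gamma_c$ and $\beta_c$ are learnable affine parameters, and $\epsilon$ is a small constant for numerical stability.

Understanding: the statistics are computed per channel, so every channel is normalized independently over all samples and all spatial positions.

A toy example: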
```python
import numpy as np
import torch
from torch import nn

# Shape (N, C, H, W) = (2, 2, 2, 2): sample 0 is all ones, sample 1 is all twos.
x = np.array([
    [[[1, 1], [1, 1]], [[1, 1], [1, 1]]],
    [[[2, 2], [2, 2]], [[2, 2], [2, 2]]]
], dtype=np.float32)
x = torch.from_numpy(x)
# No affine parameters.
bn = nn.BatchNorm2d(2, affine=False)
output = bn(x)
print(output)
# Each channel has mean 1.5 and variance 0.25 over (N, H, W), so every element
# of output[0] is approximately -1 and every element of output[1] is
# approximately 1 (output size 2x2x2x2).
```
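The same normalization can be written by hand in NumPy, which makes the "per channel over $(N, H, W)$" reduction explicit. This is a sketch without affine parameters and without running statistics; `eps = 1e-5` is chosen to match the default of PyTorch's `BatchNorm2d`.

```python
import numpy as np

def batch_norm_2d(x, eps=1e-5):
    """Normalize a (N, C, H, W) array per channel, without affine parameters."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)     # biased variance
    return (x - mean) / np.sqrt(var + eps)

# Same toy input: sample 0 is all ones, sample 1 is all twos.
x = np.stack([np.ones((2, 2, 2)), 2 * np.ones((2, 2, 2))]).astype(np.float32)
out = batch_norm_2d(x)
```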
***Reference:***
- [PyTorch: BatchNorm2d](http://pytorch.org/docs/master/nn.html#batchnorm2d)
- [pytorch: using batch normalization to normalize / instance-normalize a Variable](http://blog.csdn.net/u014722627/article/details/68947016)

## Weight Initialization
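For input features $x_i \sim (\mu, \sigma^2)$ and output $a = \sum_{i=1}^{n} w_i x_i$, the variance of the output is
$$
\mathrm{Var}(a) = \sum_{i=1}^{n}\mathrm{Var}(w_i x_i) = \sum_{i=1}^{n}\left(E[w_i]^2\,\mathrm{Var}(x_i) + E[x_i]^2\,\mathrm{Var}(w_i) + \mathrm{Var}(x_i)\,\mathrm{Var}(w_i)\right) = n\,\mathrm{Var}(w)\,\mathrm{Var}(x)
$$
If we want the output $a$ to have the same variance as each of its inputs $x_i$, we need $n\,\mathrm{Var}(w) = 1$, i.e. $\mathrm{Var}(w) = 1/n$.

Understanding: we assume that the input features and the weights both have zero mean, $E[x_i] = 0$ and $E[w_i] = 0$, that the $w_i$ and $x_i$ are mutually independent, and that the $x_i$ are i.i.d. and the $w_i$ are i.i.d. Therefore, if we want $a$ to have the same variance as $x$, the standard deviation of $w$ should be $1/\sqrt{n}$.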
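```python
import math
import numpy as np

n = 256  # number of inputs to the neuron (example value)
# Calculate standard deviation.
stdv = 1 / math.sqrt(n)
# Numpy (note: U(-stdv, stdv) itself has standard deviation stdv / sqrt(3))
w = np.random.uniform(-stdv, stdv, size=n)
```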
That is, the weights are sampled randomly within one standard deviation of the mean 0, which keeps the weights $w$ close to 0.
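The variance argument can be checked empirically. The sketch below draws the weights from a normal distribution with standard deviation $1/\sqrt{n}$; the normal distribution, `n` and `n_trials` are assumptions chosen so that $\mathrm{Var}(w) = 1/n$ holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trials = 100, 10000

x = rng.normal(0.0, 1.0, size=(n_trials, n))   # zero-mean inputs, Var(x) = 1

# Weights with Var(w) = 1 / n: the output variance matches the input variance.
w_scaled = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n_trials, n))
var_scaled = np.var(np.sum(w_scaled * x, axis=1))

# Weights with Var(w) = 1: the output variance grows to about n.
w_unit = rng.normal(0.0, 1.0, size=(n_trials, n))
var_unit = np.var(np.sum(w_unit * x, axis=1))
```

`var_scaled` comes out close to 1 while `var_unit` comes out close to `n`, which is the reason for scaling the initial weights by the fan-in.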
***Reference:***
- [cs231n: Weight Initialization](http://cs231n.github.io/neural-networks-2/#init)
- [Wiki: Variance](https://en.wikipedia.org/wiki/Variance)
- [Zhihu: Why can't the initial network parameters all be set to 0 when using gradient descent, and why is random initialization used instead?](https://www.zhihu.com/question/36068411)
## Optimization Methods
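The loss function is defined as
$$
J(w) = \frac{1}{m}\sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)
$$
where $m$ is the number of training samples.
Batch Gradient Descent
Batch gradient descent (BGD) uses the gradient over the whole training set to update the weights,
$$
w := w - \eta\,\frac{1}{m}\sum_{i=1}^{m} \nabla_w L\left(\hat{y}^{(i)}, y^{(i)}\right)
$$
Advantages:
- Simplicity
Disadvantages:
- Large amount of computation per update
- Memory may not be large enough to hold all samples
- Difficult to update the weights online
When the training set is very large, BGD takes a lot of time.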
Stochastic Gradient Descent
SGD takes one sample at a time and uses its gradient to update the weights,
$$
w := w - \eta\,\nabla_w L\left(\hat{y}^{(i)}, y^{(i)}\right)
$$
Mini-batch Gradient Descent
This method calculates the gradient on a mini-batch of samples and uses the mean to update the weights,
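```python
import numpy as np

X, target = np.random.randn(256, 3), np.random.randn(256)  # toy data
w, lr, batch_size = np.zeros(3), 0.1, 64
n_batches = len(X) // batch_size
for i in range(n_batches):
    batch = slice(i * batch_size, (i + 1) * batch_size)
    # use matrix operations to calculate the loss of the mini-batch
    output = X[batch] @ w            # toy linear model as a stand-in
    loss = 0.5 * np.mean((output - target[batch]) ** 2)
    # update weights with the mean gradient over the mini-batch
    gradient = X[batch].T @ (output - target[batch]) / batch_size
    w = w - lr * gradient
```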
Mini-batch GD is much faster than batch GD. With batch GD the loss goes down on every iteration (assuming the learning rate is suitable), but with SGD or mini-batch GD the loss is noisy: sometimes it decreases and sometimes it increases.
Summary
In one epoch (a single pass through the training set), BGD updates the weights once, SGD updates the weights $m$ times, and mini-batch GD updates the weights $m / b$ times, where $b$ is the mini-batch size.
How to choose mini-batch size?
- If the training set is small ($m \le 2000$): use batch gradient descent
- Typical mini-batch sizes: 64 ~ 512, chosen as powers of 2 (because of the way computer memory is laid out and accessed, this makes computation run faster)
The following methods are all built on gradient descent; we use $g$ to denote the gradient.
Note: In deep learning we often say SGD, but you should know that SGD there usually means mini-batch gradient descent.
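Exponentially weighted averages
$$
v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t
$$
- $v_t$: exponentially weighted average (a moving average) at time $t$
- $\theta_t$: current value

$v_t$ is approximately an average over the previous $\frac{1}{1-\beta}$ data points, and $v_0 = 0$.

Understanding: the method is a windowed average in which $\beta$ determines the window size (the average can be used to predict the value at the next time step). In essence it is a moving average with exponentially decaying weights: the weight of each value decays exponentially with time, so more recent data are weighted more heavily, while older data still receive some weight.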
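A direct implementation of the recursion $v_t = \beta v_{t-1} + (1-\beta)\theta_t$ with $v_0 = 0$; the function name and the constant input sequence are made up for the example.

```python
def exponentially_weighted_average(values, beta=0.9):
    """Return the running averages v_1..v_T, starting from v_0 = 0."""
    v, averages = 0.0, []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

values = [10.0] * 50
averages = exponentially_weighted_average(values, beta=0.9)
```

Because $v_0 = 0$, the first averages are biased toward zero (`averages[0]` is 1.0 for a constant input of 10), and the estimate only approaches the true level after roughly $\frac{1}{1-\beta}$ steps.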
Momentum
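The basic idea is to compute an exponentially weighted average of the gradients and use this averaged gradient to update the weights.
$$
\begin{aligned}
v &= \beta v + (1 - \beta)\,dw \\
w &= w - \eta v
\end{aligned}
$$
- $dw$: current gradient
- $\eta$: learning rate

Some versions of momentum are written as (like PyTorch),
$$
\begin{aligned}
v &= \beta v + dw \\
w &= w - \eta v
\end{aligned}
$$
This means that, compared with the first version, $v$ is divided by $(1 - \beta)$.

The most common value for $\beta$ is 0.9 (an average over roughly the last 10 gradients) for both versions of momentum. The difference is that the second version's $v$ is larger than the first one's by a factor of $\frac{1}{1-\beta}$, which only affects the effective learning rate.

Understanding: momentum uses the (exponentially weighted) average of the gradients up to the current step as the gradient for updating the parameters, i.e. it combines the current information with historical information to correct the gradient and obtain a better optimization direction.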
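A sketch comparing the two ways of writing momentum on a fixed sequence of gradients. It illustrates the claim that they differ only by a rescaling of the learning rate: running the form without the $(1-\beta)$ factor with learning rate $\eta(1-\beta)$ gives the same weights. Function names and the random gradients are placeholders.

```python
import numpy as np

def momentum_ewa(grads, lr, beta=0.9):
    """v = beta * v + (1 - beta) * dw ; w -= lr * v"""
    w, v = 0.0, 0.0
    for dw in grads:
        v = beta * v + (1 - beta) * dw
        w -= lr * v
    return w

def momentum_sum(grads, lr, beta=0.9):
    """v = beta * v + dw ; w -= lr * v  (form without the 1 - beta factor)"""
    w, v = 0.0, 0.0
    for dw in grads:
        v = beta * v + dw
        w -= lr * v
    return w

grads = np.random.default_rng(0).normal(size=100)
beta = 0.9
w_first = momentum_ewa(grads, lr=0.1, beta=beta)
w_second = momentum_sum(grads, lr=0.1 * (1 - beta), beta=beta)
```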
References:
- deeplearning.ai: Momentum
Nesterov Momentum
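Init $v = 0$

Then in each iteration $t$, compute the gradient at the look-ahead position and update,
$$
\begin{aligned}
g &= \nabla_w J(w - \eta\beta v) \\
v &= \beta v + g \\
w &= w - \eta v
\end{aligned}
$$
Compared with plain momentum, the gradient is evaluated after tentatively applying the momentum step.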
Reference:
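- Zhihu column: A comprehensive summary and comparison of deep learning optimization methods (SGD, Adagrad, Adadelta, Adam, Adamax, Nadam)
- Comparison of optimization algorithms in convolutional neural networks (note: this blog post contains some mistakes; read it mainly for the ideas it explains)
- Zhihu: What role does weight decay play in neural networks? What about momentum? And normalization?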
RMSprop (Root Mean Square)
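Init $s_{dw} = 0$

Then in iteration $t$, compute $dw$ on the current mini-batch and
$$
\begin{aligned}
s_{dw} &= \beta s_{dw} + (1 - \beta)\,dw^2 \\
w &= w - \eta\,\frac{dw}{\sqrt{s_{dw}} + \epsilon}
\end{aligned}
$$
where $dw^2$ is the element-wise square.

Understanding: intuitively, through $s$, parameters with large gradients are divided by a large value, which reduces the size of their updates, while parameters with small gradients are divided by a small value, which enlarges their updates.

The small constant $\epsilon$ is added to the denominator in order to avoid inf or a very large value when $\sqrt{s_{dw}}$ is close to zero.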
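A sketch of one RMSprop step, assuming the usual form $s = \beta s + (1-\beta)\,dw^2$, $w = w - \eta\,dw / (\sqrt{s} + \epsilon)$. The quadratic test function and the hyperparameter values are chosen only to show that coordinates with very different gradient scales take steps of the same size.

```python
import numpy as np

def rmsprop_step(w, dw, s, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update; returns the new weights and the new running average."""
    s = beta * s + (1 - beta) * dw ** 2
    w = w - lr * dw / (np.sqrt(s) + eps)
    return w, s

# Minimize f(w) = 0.5 * (100 * w0^2 + w1^2): gradient scales differ by 100x.
w, s = np.array([1.0, 1.0]), np.zeros(2)
scale = np.array([100.0, 1.0])
first_w, _ = rmsprop_step(w, scale * w, s)
for _ in range(1000):
    w, s = rmsprop_step(w, scale * w, s)
```

After the first step both coordinates have moved by the same amount, `lr / sqrt(1 - beta)`, even though their gradients differ by a factor of 100.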
Adam (Adaptive Moment Estimation)
A combination of Momentum and RMSprop
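Init $v_{dw} = 0$, $s_{dw} = 0$

Then in iteration $t$, compute $dw$ on the current mini-batch and
$$
\begin{aligned}
v_{dw} &= \beta_1 v_{dw} + (1 - \beta_1)\,dw \\
s_{dw} &= \beta_2 s_{dw} + (1 - \beta_2)\,dw^2 \\
\hat{v}_{dw} &= \frac{v_{dw}}{1 - \beta_1^t}, \qquad \hat{s}_{dw} = \frac{s_{dw}}{1 - \beta_2^t} \\
w &= w - \eta\,\frac{\hat{v}_{dw}}{\sqrt{\hat{s}_{dw}} + \epsilon}
\end{aligned}
$$
A common value for $\beta_1$ is 0.9, and for $\beta_2$ is 0.999.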
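A sketch of one Adam step in its standard form with bias correction: the momentum term `v` and the RMSprop term `s` are combined, and both are divided by $1 - \beta^t$ to compensate for the zero initialization. The learning rate and the toy gradient `2 * w` are assumptions for the example.

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dw            # momentum term
    s = beta2 * s + (1 - beta2) * dw ** 2       # RMSprop term
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w = np.array([1.0, -2.0])
v, s = np.zeros(2), np.zeros(2)
w1, v1, s1 = adam_step(w, dw=2 * w, v=v, s=s, t=1, lr=0.1)
```

With bias correction, the very first step has magnitude `lr` in every coordinate, regardless of the size of the gradient.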
Learning Rate Decay
During training, the learning rate is decreased as the number of epochs grows. In earlier epochs the network can tolerate a relatively large learning rate, which accelerates training; as the loss decreases and we get closer to the optimal solution, a smaller learning rate helps us settle in a tighter region around the minimum.
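One common schedule, shown as an assumption since the notes do not name a specific one, is inverse-time decay, where the learning rate is divided by $1 + \text{decay\_rate} \cdot \text{epoch}$. The function name and the values are illustrative.

```python
def decayed_learning_rate(lr0, epoch, decay_rate=1.0):
    """Inverse-time decay: lr0 / (1 + decay_rate * epoch)."""
    return lr0 / (1.0 + decay_rate * epoch)

schedule = [decayed_learning_rate(0.2, epoch) for epoch in range(4)]
```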
1D, 2D, 3D Convolutions
- 1D convolution:
  - Input: a sequence $[C_{in}, L_{in}]$
  - Kernel: $[C_{in}, k]$
  - Output (one kernel): a vector $[L_{out},]$
- 2D convolution:
  - Input: an image $[1, H, W]$ or $[C_{in}, H, W]$
  - Kernel: $[C_{in}, k, k]$
  - Output (one kernel): a feature map $[H_{out}, W_{out}]$
- 3D convolution:
  - Input: a video or CT scan $[C_{in}, D, H, W]$
  - Kernel: $[C_{in}, k, k, k]$
  - Output (one kernel): $[D_{out}, H_{out}, W_{out}]$

Note that a convolution is named after the number of dimensions of the output produced by a single kernel.
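A naive NumPy sketch of the 2D case, to make the shapes concrete: one kernel spans all input channels and produces a single 2D feature map. Stride 1 and no padding are assumed, and, as in most deep learning libraries, the operation is implemented as cross-correlation.

```python
import numpy as np

def conv2d_single_kernel(x, kernel):
    """Valid cross-correlation of a [C_in, H, W] input with one [C_in, k, k] kernel."""
    c_in, h, w = x.shape
    _, k, _ = kernel.shape
    h_out, w_out = h - k + 1, w - k + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # Sum over all input channels and the k x k window.
            out[i, j] = np.sum(x[:, i:i + k, j:j + k] * kernel)
    return out

x = np.ones((3, 8, 8))         # C_in = 3, H = W = 8
kernel = np.ones((3, 3, 3))    # C_in = 3, k = 3
feature_map = conv2d_single_kernel(x, kernel)
```

The input has three dimensions but the output of one kernel has two, which is why this is a 2D convolution.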
References:
- NetEase deeplearning.ai: Convolution over volumes
Loss Function
Classification
Cross Entropy
Basic knowledge:
- Entropy (Shannon entropy): Shannon defined the entropy $H$ of a discrete random variable $X$ with probability mass function $P(X)$ as,
<p>
$$
H(X) = E[I(X)] = E[-\ln(P(X))]
$$
</p>
Here $E$ is the *expected value operator*, and $I$ is the *information content* of $X$.<br>
It can be written explicitly as,
<p>
$$
H(X) = \sum_{i=1}^{n}P(x_i)I(x_i) = -\sum_{i=1}^{n}P(x_i)\log_b P(x_i)
$$
where $b$ is the base of the logarithm used. Common values of $b$ are 2, Euler's number $e$, and 10. In machine learning and deep learning, people often use $e$.
</p>
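- Cross entropy between the true distribution $Y$ and the estimated distribution $\hat{Y}$: $H(Y, \hat{Y}) = -\sum_i y_i \log \hat{y}_i$
- KL divergence from $\hat{Y}$ to $Y$ is the difference between cross entropy and entropy: $D_{KL}(Y \| \hat{Y}) = H(Y, \hat{Y}) - H(Y) = \sum_i y_i \log \frac{y_i}{\hat{y}_i}$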
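Note: entropy is essentially the expectation of the Shannon information content, where the information content is $I(X)$ in the formula above.
- Information content: a measure of information, i.e. how much information the occurrence of the event represented by a random variable carries. Events with a small probability carry more information, and the larger the probability of an event, the less information it carries; that is, information content decreases as the probability of the event increases.
- Entropy: measures the average information content of the random variable $X$.
- Cross entropy: the information needed when the estimated distribution $q$ is used to approximate the true distribution $p$.
- KL divergence: the difference between cross entropy and entropy.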
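The three quantities can be computed directly from their definitions. The sketch uses the natural logarithm and two made-up distributions `p` (true) and `q` (estimated); terms with $p_i = 0$ are skipped, following the convention $0 \log 0 = 0$.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # 0 * log(0) is taken as 0
    return -np.sum(p[nz] * np.log(p[nz]))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

def kl_divergence(p, q):
    # KL divergence = cross entropy - entropy
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]      # "true" distribution
q = [0.4, 0.4, 0.2]        # estimated distribution
h, ce, kl = entropy(p), cross_entropy(p, q), kl_divergence(p, q)
```

For a one-hot label the entropy is zero, so the cross entropy equals the KL divergence and reduces to $-\log \hat{y}_c$ for the true class $c$, which is the usual classification loss.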
References:
- A Friendly Introduction to Cross-Entropy Loss
- Zhihu: How to explain cross entropy and relative entropy in plain language?
Awesome Papers
Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv-2014
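This paper proposes VGG Net. The core idea is to use small kernels (3 * 3) to build a relatively deep network. The reason for using small kernels is to keep the number of parameters under control while making the network deeper.
The network uses fully connected layers during training. At test time, in order to handle images of different sizes, the parameters of the fully connected layers are converted into convolution kernels of the corresponding size, which turns them into fully convolutional layers.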
Deep Residual Learning for Image Recognition, CVPR-2016
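The paper identifies the degradation problem of deep networks: when trained, a deeper network ends up with a higher training loss than a shallower one. The paper calls such networks plain networks. It argues that the problem is not caused by vanishing gradients, because these plain networks use BN, which ensures that the forward-propagated signals have non-zero variance, and the authors verify experimentally that the backward-propagated gradients are healthy thanks to BN. The authors conjecture that "the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error", and simply increasing the number of iterations cannot solve the problem. My understanding: the parameter space of these ill-conditioned networks contains many saddle-like surfaces, where the gradient changes little, which hampers optimization.
Therefore, the paper proposes Residual Learning:
$$
\mathcal{H}(x) = \mathcal{F}(x) + x
$$
where $\mathcal{F}(x)$ is the output of a few stacked layers and $x$ denotes the input to the first of these layers. This makes the network learn the residual between the output and the input. Residual learning is realized by skip connections.
Note: a residual network learns the residual between the output and the input, that is, the output equals the input $x$ plus $\mathcal{F}(x)$. Earlier methods learn the mapping from input to output directly, $x \rightarrow \text{output}$, without learning a residual.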
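A toy sketch of the skip connection. Two linear layers with a ReLU stand in for the stacked layers $\mathcal{F}$; the paper's blocks use convolutions and BN, so this is only an illustration of $\mathcal{F}(x) + x$, with hypothetical names and sizes.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is two linear layers with a ReLU in between."""
    f = W2 @ np.maximum(W1 @ x, 0.0)
    return f + x   # skip connection

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))
y_identity = residual_block(x, zeros, zeros)
```

With all weights at zero the block reduces to the identity mapping, so adding such a block cannot make it harder to represent what the shallower network already computes.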
Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
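CMU's multi-person pose estimation paper, which works very well. The core idea is heatmap + PAF, where the heatmaps predict the keypoints of all persons and the PAFs (part affinity fields) predict the direction of each bone. The relationship between joints is represented by computing the direction of each bone on the image (as a unit vector). The network labels are one heatmap per joint and the PAFs of each bone (twice the number of bones, describing the x and y directions respectively). In the data preprocessing, MATLAB is used to turn the annotation of a single image containing several persons into several samples, split into `self_joints` and `others_joints`. The data augmentation is embedded in the Caffe code and applies `scale`, `rotate`, `crop and pad`, and `flip` to the image in that order.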