Dive into Deep Learning

最新推荐文章于 2024-05-14 11:18:11 发布

rivergold_jobs


Category: Deep-Learn. Tags: deep-learning

Article link: https://blog.csdn.net/u012841277/article/details/78948674


Basics

Standard notations

Variable: $X$ (upper-case, not bold)
Matrix: $\mathbf{X}$ (upper-case, bold)
Vector: $\mathbf{x}$ (lower-case, bold)
Element/Scalar: $x$ (lower-case, not bold)

Basic Steps for Deep Learning

Define the model structure
Initialize the model’s parameters
Loop:
- Calculate current loss(forward propagation)
- Calculate current gradient(backward propagation)
- Update parameters(gradient descent)
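
The steps above can be sketched on a toy problem. The following is a minimal illustration, assuming a single linear layer trained with full-batch gradient descent on a mean squared error loss; the function name `train_linear_model`, the synthetic data and the hyperparameters are illustrative choices, not part of the original notes.

```python
import numpy as np

def train_linear_model(X, y, lr=0.1, n_steps=200):
    """Fit y ~ X @ w + b with full-batch gradient descent on an MSE loss."""
    n_samples, n_features = X.shape
    # 1. Define the model structure: here a single linear layer (assumption).
    # 2. Initialize the model's parameters.
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    # 3. Loop.
    for _ in range(n_steps):
        # Forward propagation: predictions and current loss.
        err = X @ w + b - y
        losses.append(0.5 * np.mean(err ** 2))
        # Backward propagation: gradients of the loss w.r.t. w and b.
        grad_w = X.T @ err / n_samples
        grad_b = np.mean(err)
        # Gradient descent: update the parameters.
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b, losses

# Synthetic data generated from known parameters (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
w, b, losses = train_linear_model(X, y)
```

On this noise-free data the recovered `w` and `b` should end up close to the generating values, and the recorded loss should shrink over the iterations.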

Backpropagation

Here is some notation we will need later. We use $w^l_{jk}$ to denote the weight for the connection from the $k^{th}$ neuron in the $(l - 1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer. We use $z^{l}_j$ for the weighted input of the $j^{th}$ neuron in the $l^{th}$ layer and $a^l_j$ for its activation output. Similarly, $b^l_j$ is the bias of the $j^{th}$ neuron in the $l^{th}$ layer.

Why use this seemingly cumbersome notation? It might look more natural to let $j$ index the input neuron and $k$ the output neuron. The reason for the reversed order is that the activation output of the $j^{th}$ neuron in the $l^{th}$ layer can then be expressed as

$$a^l_j = \sigma\left(\sum_k w^l_{jk}a^{l-1}_k + b^l_j\right)$$

This expression can be rewritten in matrix form as

$$\mathbf{a}^l = \sigma(\mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l)$$

where $\mathbf{a}^{l}$, $\mathbf{a}^{l-1}$ and $\mathbf{b}^l$ are vectors, and $\mathbf{W}^l$ is the weight matrix of the $l^{th}$ layer, whose entry in the $j^{th}$ row and $k^{th}$ column is $w^l_{jk}$. The elements in the $j^{th}$ row of $\mathbf{W}^l$ are the weights connecting the neurons of the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer.

Next, we define the loss function $C$. Here we use the mean squared error (MSE) as an example,

$$C = \frac{1}{2}\frac{1}{m}\sum_{i=1}^{m}\| \mathbf{y}^{(i)} - \mathbf{a}^{L}(\mathbf{x}^{(i)}) \|^2$$

where $L$ denotes the number of layers in the network and $\mathbf{a}^L$ denotes the final output of the network. The loss of a single training example is $C_{\mathbf{x}^{(i)}} = \frac{1}{2}\|\mathbf{y}^{(i)} - \mathbf{a}^L\|^2$.

Note: Backpropagation actually computes the partial derivatives $\frac{\partial C_{\mathbf{x}^{(i)}}}{\partial w}$ and $\frac{\partial C_{\mathbf{x}^{(i)}}}{\partial b}$ for a single training example. We then obtain $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ by averaging over training samples (this step is for GD or mini-batch GD). Here we suppose the training example $\mathbf{x}$ is fixed, and to simplify notation we drop the $\mathbf{x}$ subscript, writing the loss $C_{\mathbf{x}^{(i)}}$ as $C$.

So, for a single training sample $\mathbf{x}$, the loss can be written as

$$C = \frac{1}{2}\| \mathbf{y} - \mathbf{a}^L \|^2 = \frac{1}{2}\sum_j (y_j - a^L_j)^2$$

Here, we define $\delta^l_j$ as

$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$

$\delta^l_j$ measures how strongly a change in the input $z^l_j$ of the $j^{th}$ neuron in the $l^{th}$ layer changes the final loss of the network (the references at the end of this section give more detail).
And we have,

$$z^l_j = \sum_k w_{jk}^l a_k^{l-1} + b^l_j$$

$$a_j^l = \sigma(z_j^l)$$

Then, because $a^L_k$ depends on $z^L_j$ only when $k = j$,

$$\delta_j^L = \frac{\partial C}{\partial z^L_j} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a_k^L}{\partial z_j^L} = \frac{\partial C}{\partial a^L_j} \sigma^{'}(z^L_j)$$

Moreover,

$$\delta^l_j = \frac{\partial C}{\partial z_j^l} = \sum_k \frac{\partial C}{\partial z^{l + 1}_k} \frac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \delta^{l+1}_k \frac{\partial z_k^{l+1}}{\partial z_j^l}$$

Because

$$z_k^{l+1} = \sum_i w_{ki}^{l+1}a_i^l + b^{l+1}_k = \sum_i w^{l+1}_{ki}\sigma(z^{l}_i) + b^{l+1}_k$$

differentiating with respect to $z^l_j$ leaves only the term with $i = j$,

$$\frac{\partial z_k^{l+1}}{\partial z_j^l} = w^{l+1}_{kj}\sigma^{'}(z^l_j)$$

Then, we get

$$\delta_j^l = \sum_k \delta_k^{l+1} w^{l+1}_{kj} \sigma^{'}(z^l_j)$$
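Understanding: $w^{l+1}_{kj}$ is the weight connecting the $j^{th}$ neuron in the $l^{th}$ layer to the $k^{th}$ neuron in the $(l+1)^{th}$ layer. The formula says that the error of every neuron in the $(l+1)^{th}$ layer is multiplied by its weight to the $j^{th}$ neuron in the $l^{th}$ layer, the products are summed, and the sum is scaled by $\sigma^{'}(z^l_j)$.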
Our goal is to update $w^l_{jk}$ and $b^l_j$, so we need the partial derivatives

$$\frac{\partial C}{\partial w_{jk}^{l}} = \sum_i \frac{\partial C}{\partial z^l_{i}} \frac{\partial z^l_i}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_{j}} \frac{\partial z^l_{j}}{\partial w^l_{jk}} = \delta^{l}_j a^{l-1}_k$$

$$\frac{\partial C}{\partial b^l_j} = \sum_i \frac{\partial C}{\partial z^l_i} \frac{\partial z^l_i}{\partial b^l_j} = \delta^l_j$$

So far, we have the four key formulas of backpropagation,

$$\begin{aligned} & \delta_j^L = \frac{\partial C}{\partial a^L_j} \sigma^{'}(z^L_j) & ~(1) \\ & \delta_j^l = \sum_k \delta_k^{l+1} w^{l+1}_{kj} \sigma^{'}(z^l_j) & ~(2) \\ & \frac{\partial C}{\partial w_{jk}^{l}} = \delta^{l}_j a^{l-1}_k &~(3)\\ & \frac{\partial C}{\partial b^l_j} = \delta_j^l &~(4) \\ \end{aligned}$$

Deriving BP with Vectorization

Here we use the concept of the differential:
- Single-variable calculus: $\mathrm{d}f = f^{'}(x)\mathrm{d}x$
- Multivariable calculus:
- Scalar function of a vector

$$
\mathrm{d}f = \sum_i \frac{\partial f}{\partial x_i}\mathrm{d}x_i = {\left(\frac{\partial f}{\partial \mathbf{x}}\right)}^T\mathrm{d}\mathbf{x}
$$

- Scalar function of a matrix

Based on the trace identities

$$
\sum_i \sum_j a_{ij}b_{ij} = \mathrm{Tr}(A^TB)
$$

$$
\mathrm{Tr}(AB) = \mathrm{Tr}(BA)
$$

we have

$$
\mathrm{d}f = \sum_i \sum_j \frac{\partial f}{\partial x_{ij}}\mathrm{d}x_{ij} = \sum_i \sum_j \left[\frac{\partial f}{\partial \mathbf{X}}\right]_{ij} [\mathrm{d}\mathbf{X}]_{ij} = \mathrm{Tr}\left[{\left(\frac{\partial f}{\partial \mathbf{X}}\right)}^T \mathrm{d}\mathbf{X}\right]
$$

so

$$
\mathrm{d}f = \mathrm{Tr}\left[{\left(\frac{\partial f}{\partial \mathbf{X}}\right)}^T\mathrm{d}\mathbf{X}\right]
$$

In other words, whenever $\mathrm{d}f$ can be brought into the form $\mathrm{Tr}(\mathbf{G}^T \mathrm{d}\mathbf{X})$, the gradient is $\frac{\partial f}{\partial \mathbf{X}} = \mathbf{G}$.

We already have,

$$\mathbf{z}^l = \mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l$$

$$\mathbf{a}^l = \sigma(\mathbf{z}^l)$$

and, by the chain rule (written informally),

$$\frac{\partial J}{\partial \mathbf{W}^l} = \frac{\partial J}{\partial \mathbf{a}^{l}} \frac{\partial \mathbf{a}^l}{\partial \mathbf{z}^l} \frac{\partial \mathbf{z}^l}{\partial \mathbf{W}^l} = \frac{\partial J}{\partial \mathbf{z}^l} \frac{\partial \mathbf{z}^l} {\partial \mathbf{W}^l}$$

With $\mathbf{a}^{l-1}$ held fixed, $\mathrm{d}\mathbf{z}^l = \mathrm{d} \mathbf{W}^l \mathbf{a}^{l-1}$, so

$$\mathrm{d} J = \mathrm{Tr}\left[{\left(\frac{\partial J}{\partial \mathbf{z}^l}\right)}^T \mathrm{d} \mathbf{z}^l\right] = \mathrm{Tr}\left[{\left(\frac{\partial J}{\partial \mathbf{z}^l}\right)}^T \mathrm{d} \mathbf{W}^l \mathbf{a}^{l-1}\right] = \mathrm{Tr}\left[\mathbf{a}^{l-1} {\left(\frac{\partial J}{\partial \mathbf{z}^l}\right)}^T \mathrm{d} \mathbf{W}^l\right]$$

and therefore

$$\frac{\partial J}{\partial \mathbf{W}^l} = \frac{\partial J}{\partial \mathbf{z}^l} {(\mathbf{a}^{l-1})}^T$$

Next, calculate $\frac{\partial J}{\partial \mathbf{z}^l}$. Since $\mathrm{d}\mathbf{a}^l = \sigma^{'}(\mathbf{z}^l) \odot \mathrm{d}\mathbf{z}^l$,

$$\mathrm{d} J = \mathrm{Tr}\left[{\left(\frac{\partial J}{\partial \mathbf{a}^l}\right)}^T \mathrm{d} \mathbf{a}^l\right] = \mathrm{Tr}\left[{\left(\frac{\partial J}{\partial \mathbf{a}^l}\right)}^T \left(\sigma^{'}(\mathbf{z}^l) \odot \mathrm{d} \mathbf{z}^l\right)\right] = \mathrm{Tr}\left[{\left(\frac{\partial J}{\partial \mathbf{a}^l} \odot \sigma^{'}(\mathbf{z}^l)\right)}^T \mathrm{d} \mathbf{z}^l\right]$$

$$\frac{\partial J}{\partial \mathbf{z}^l} = \frac{\partial J}{\partial \mathbf{a}^l} \odot \sigma^{'}(\mathbf{z}^l)$$

and, with $\mathbf{W}^{l+1}$ held fixed, $\mathrm{d}\mathbf{z}^{l+1} = \mathbf{W}^{l+1}\mathrm{d}\mathbf{a}^l$, so

$$\mathrm{d}J = \mathrm{Tr}\left[{\left(\frac{\partial J}{\partial \mathbf{z}^{l+1}}\right)}^T \mathrm{d}\mathbf{z}^{l+1}\right] = \mathrm{Tr}\left[{\left(\frac{\partial J}{\partial \mathbf{z}^{l+1}}\right)}^T \mathbf{W}^{l+1}\mathrm{d} \mathbf{a}^l\right]$$

$$\frac{\partial J}{\partial \mathbf{a}^l} = {(\mathbf{W}^{l+1})}^T \frac{\partial J}{\partial \mathbf{z}^{l+1}}$$

So far we have,

$$\frac{\partial J}{\partial \mathbf{W}^l} = \frac{\partial J}{\partial \mathbf{z}^l} {(\mathbf{a}^{l-1})}^T$$

$$\frac{\partial J}{\partial \mathbf{z}^l} = \left({(\mathbf{W}^{l+1})}^T \frac{\partial J}{\partial \mathbf{z}^{l+1}}\right) \odot \sigma^{'}(\mathbf{z}^l)$$

Writing

$$\delta^l = \frac{\partial J}{\partial \mathbf{z}^{l}}$$

we can collect these formulas in matrix form (with the loss written as $C$, as in the element-wise derivation),

$$\begin{aligned} & \delta^L = \nabla_{\mathbf{a}^L} C \odot \sigma^{'}(\mathbf{z}^L) & ~(1) \\ & \delta^l = ({(\mathbf{W}^{l+1})}^T \delta^{l+1}) \odot \sigma^{'}(\mathbf{z}^l) & ~(2) \\ & \nabla_{\mathbf{W}^l}C = \delta^{l} {(\mathbf{a}^{l - 1})}^T & ~(3) \\ & \nabla_{\mathbf{b}^l}C = \delta^l & ~(4) \\ \end{aligned}$$
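
The matrix form of the four backpropagation equations can be checked numerically. The sketch below assumes a small fully connected network with sigmoid activations and the single-sample loss $C = \frac{1}{2}\|\mathbf{y} - \mathbf{a}^L\|^2$; the helper names (`forward`, `loss`, `backprop`), the layer sizes and the random data are illustrative. One analytic gradient entry is compared against a central finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Return the lists of z^l and a^l for one input column vector x."""
    zs, activations = [], [x]
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    return zs, activations

def loss(weights, biases, x, y):
    _, activations = forward(weights, biases, x)
    return 0.5 * np.sum((y - activations[-1]) ** 2)

def backprop(weights, biases, x, y):
    """Apply equations (1)-(4) for C = 0.5 * ||y - a^L||^2."""
    zs, activations = forward(weights, biases, x)
    # sigma'(z) = sigma(z) * (1 - sigma(z)) for the sigmoid.
    sigma_prime = [a * (1 - a) for a in activations[1:]]
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    # (1): delta^L = grad_a C * sigma'(z^L), with grad_a C = a^L - y.
    delta = (activations[-1] - y) * sigma_prime[-1]
    for l in reversed(range(len(weights))):
        grads_W[l] = delta @ activations[l].T  # (3)
        grads_b[l] = delta                     # (4)
        if l > 0:
            # (2): propagate the error to the previous layer.
            delta = (weights[l].T @ delta) * sigma_prime[l - 1]
    return grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # illustrative layer sizes
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(n_out, 1)) for n_out in sizes[1:]]
x, y = rng.normal(size=(3, 1)), rng.normal(size=(2, 1))
grads_W, grads_b = backprop(weights, biases, x, y)

# Central finite difference for one weight of the first layer.
eps = 1e-6
weights[0][1, 2] += eps
loss_plus = loss(weights, biases, x, y)
weights[0][1, 2] -= 2 * eps
loss_minus = loss(weights, biases, x, y)
weights[0][1, 2] += eps  # restore the original value
numeric_grad = (loss_plus - loss_minus) / (2 * eps)
```

If the implementation matches the equations, `numeric_grad` and `grads_W[0][1, 2]` should agree to several decimal places.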


Reference:
- Zhihu: Matrix Differentiation Techniques, Part 1 (矩阵求导术（上）)
- Neural Networks and Deep Learning: How the backpropagation algorithm works
- The Matrix Cookbook
- Caltech: EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 5

Regularization

The key idea is to add another term to the loss that penalizes large weights.

$L_1$ regularization

$$\lambda \sum_{i=1}^{n} | w_i | = \lambda {\|\mathbf{w}\|}_1$$

With $L_1$ regularization, $\mathbf{w}$ tends to become sparse.

$L_2$ regularization

$L_2$ regularization is used much more often when training neural networks. It pushes the weights toward small, evenly spread values,

$$\lambda \sum_{i=1}^{n} w_i^2 = \lambda {\|\mathbf{w}\|}_2^2$$

In a neural network, the loss function with regularization is written as

$$J(\mathbf{w}^1, b^1, \mathbf{w}^2, b^2, ..., \mathbf{w}^L, b^L) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|\mathbf{w}^l\|^2_2$$

where $\mathbf{w}^l$ is the weight matrix and $b^l$ is the bias vector of layer $l$.
When we do backpropagation to update the weights, the gradient with respect to $\mathbf{w}^l$ is

$$\frac{\partial J}{\partial \mathbf{w}^l} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial \mathbf{w}^l} + \frac{\lambda}{m} \mathbf{w}^l$$

(with SGD or mini-batch GD, the average is taken over the current sample or mini-batch). Writing the first term, the gradient of the unregularized loss, as $\mathrm{d}\mathbf{w}^l$, the update becomes

$$\mathbf{w}^l := \mathbf{w}^l - \alpha \mathrm{d} \mathbf{w}^l - \frac{\alpha \lambda}{m} \mathbf{w}^l = \left(1 - \frac{\alpha\lambda}{m}\right)\mathbf{w}^l - \alpha \mathrm{d} \mathbf{w}^l$$

This is why $L_2$ regularization is called **weight decay**: whatever the value of $\mathbf{w}^l$ is, every update first shrinks it by the factor $1 - \frac{\alpha\lambda}{m}$, which makes the weights' absolute values smaller.
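
The decay effect can be seen in a few lines. This is a minimal sketch, assuming the regularization term $\frac{\lambda}{2m}\|\mathbf{w}\|_2^2$ from the loss above; the function name `l2_regularized_step` and the numbers are illustrative. The data gradient is set to zero so that only the decay term acts.

```python
import numpy as np

def l2_regularized_step(w, dw, lr, lam, m):
    """One gradient step on a loss that includes (lam / (2 * m)) * ||w||^2."""
    return w - lr * dw - lr * (lam / m) * w

w = np.array([1.0, -2.0, 0.5])
dw = np.zeros_like(w)  # pretend the data gradient is zero
w_new = l2_regularized_step(w, dw, lr=0.1, lam=0.5, m=10)
# Every weight is multiplied by (1 - lr * lam / m) = 0.995.
```

With a non-zero data gradient the same shrink factor is applied before the usual gradient step.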

Dropout

Core concept:
1. Dropout randomly knocks out units in the network, so on every iteration we effectively train a smaller neural network, and training a smaller network has a regularizing effect.
2. A neuron cannot rely on any single input feature, because that feature may be dropped at any time, so the weights get spread out. Dropout discards part of the inputs on each iteration (sets them to 0), which keeps the weights from concentrating on one or a few input features and distributes them more evenly. This can be seen as shrinking the weights, which makes dropout similar to $L_2$ regularization.

Tips for using Dropout:
1. Dropout is for preventing over-fitting. If the model is not over-fitting, it is better not to use dropout.
2. Because of dropout, the loss function $J$ is no longer well defined (different units are dropped on every iteration), so it is hard to check whether the loss decreases as it should. A good practice is to turn dropout off, check that the loss decreases as expected to make sure the code has no bug, and then turn dropout on.
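
A minimal sketch of a dropout forward pass follows. It assumes the "inverted dropout" variant, in which the surviving activations are divided by the keep probability so that their expected value is unchanged; this variant is a common implementation choice and is not spelled out in the notes above. The name `dropout_forward` and the `keep_prob` value are illustrative.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training:
        # At test time dropout is switched off and activations pass through.
        return a
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
dropped = dropout_forward(a, keep_prob=0.8, rng=rng)
kept_fraction = np.mean(dropped != 0)
unchanged = dropout_forward(a, keep_prob=0.8, rng=rng, training=False)
```

Passing `training=False` corresponds to the tip above of switching dropout off while checking that the loss decreases.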

Other Regularization Methods

Data augmentation
- Flipping
- Random rotation
- Clip
- Distortion
Early stopping
Monitor the dev set error and stop training early when it stops improving. $\mathbf{w}$ is small at initialization and grows as training proceeds, so stopping early yields a mid-sized $\mathbf{w}$, which makes early stopping similar to $L_2$ regularization.

Preprocessing

Most datasets contain images of different sizes. A common preprocessing pipeline is:
1. Scale each image to the same size, or scale one side (width or height, often the shorter one) to the same size
2. Do data augmentation: flipping, random rotation
3. Crop a square from each image randomly
4. Subtract the mean

Per-pixel mean subtract

Subtract the per-pixel mean from each input image. If the whole training set has shape $(N, C, H, W)$, the per-pixel mean is computed, for each channel $C$, by averaging the pixels at the same position over all images, which gives a mean array of shape $(C, H, W)$.

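import numpy as np

# X size is (N, C, H, W); random data stands in for a real training set
X = np.random.rand(8, 3, 32, 32)
mean = np.mean(X, axis=0)
mean.shape
>>> (3, 32, 32)  # (C, H, W)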

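`Caffe` uses per-pixel mean subtraction in its tutorial.
**Note:** in per-pixel mean subtraction each channel is processed independently, because pixels in different channels are not stationary (stationarity means that different parts of an image share the same statistics), and the mean at each position is computed over all samples.

### Per-channel mean subtract

Subtract the per-channel mean calculated over all images. If the training set has shape $(N, C, H, W)$, the mean is calculated for each channel over all images and all positions, which gives a `mean vector` of shape $(C, )$.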

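import numpy as np

# X size is (N, C, H, W); random data stands in for a real training set
X = np.random.rand(8, 3, 32, 32)
mean = np.mean(X, axis=(0, 2, 3))
mean.shape
>>> (3,)  # (C,)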

Both **per-pixel mean subtract** and **per-channel mean subtract** serve to "center" the data, i.e. to make the mean of the dataset approximately zero, which helps train the network (it keeps the gradients healthy). As far as I know, **per-channel mean subtract** is the better and more common choice for preprocessing.

***References:***
- [Github: KaimingHe/deep-residual-networks: preprocessing? #5](https://github.com/KaimingHe/deep-residual-networks/issues/5)
- [caffe: Brewing ImageNet](http://caffe.berkeleyvision.org/gathered/examples/imagenet.html)
- [Google Groups: Subtract mean image/pixel](https://groups.google.com/forum/#!topic/digits-users/FfeFp0MHQfQ)
- [StackExchange: Why do we normalize images by subtracting the dataset’s image mean and not the current image mean in deep learning?](https://stats.stackexchange.com/questions/211436/why-do-we-normalize-images-by-subtracting-the-datasets-image-mean-and-not-the-c)
- [MathWorks: What is per-pixel mean?](https://cn.mathworks.com/matlabcentral/answers/292415-what-is-per-pixel-mean)

## Batch Normalization

Assume the input $\mathbf{X}$ is 4D with shape $(N, C, H, W)$. The output of the batch normalization layer is

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta$$

where $x$ is one channel of the mini-batch, i.e. a 3D slice of shape $(N, H, W)$. $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are calculated per channel over the mini-batch, and $\gamma$ and $\beta$ are learnable parameter vectors of size $C$ (the number of input channels).

Understanding: along the $C$ dimension, the values of all other dimensions are flattened into one vector whose mean and variance are computed and used for normalization. In other words, for each channel the mean and variance are taken over all values of all samples in the mini-batch.

A toy example:

import numpy as np
import torch
from torch import nn

x = np.array([
             [[[1,1], [1,1]], [[1,1], [1,1]]],
             [[[2,2],[2,2]], [[2,2], [2,2]]]
             ], dtype=np.float32)
# Tensors can be used directly; the deprecated Variable wrapper is not needed.
x = torch.from_numpy(x)
# No affine parameters.
bn = nn.BatchNorm2d(2, affine=False)
output = bn(x)
print(output)
>>> tensor([[[[-1.0000, -1.0000],
          [-1.0000, -1.0000]],

         [[-1.0000, -1.0000],
          [-1.0000, -1.0000]]],


        [[[ 1.0000,  1.0000],
          [ 1.0000,  1.0000]],

         [[ 1.0000,  1.0000],
          [ 1.0000,  1.0000]]]])
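
The same per-channel computation can be written by hand in NumPy. This is a sketch of the training-mode normalization without the affine parameters $\gamma$ and $\beta$; it assumes the biased variance and $\epsilon = 10^{-5}$, and the function name `batch_norm_2d` is illustrative.

```python
import numpy as np

def batch_norm_2d(x, eps=1e-5):
    """Normalize an (N, C, H, W) array per channel over N, H and W."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)  # biased variance
    return (x - mean) / np.sqrt(var + eps)

# Same toy input as above: sample 0 is all ones, sample 1 is all twos.
x = np.stack([np.ones((2, 2, 2)), 2 * np.ones((2, 2, 2))]).astype(np.float32)
out = batch_norm_2d(x)
```

For each channel the mean is 1.5 and the variance is 0.25, so the first sample maps to approximately -1 and the second to approximately 1.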

***Reference:***
- [PyTorch: BatchNorm2d](http://pytorch.org/docs/master/nn.html#batchnorm2d)
- [pytorch: using batch normalization to normalize/instance-normalize a Variable](http://blog.csdn.net/u014722627/article/details/68947016)

## Weight Initialization

Suppose the input features satisfy $\mathbf{x} \sim \mathcal{N}(\mu, \sigma^2)$ and the output is $a = \sum_{i=1}^{n}w_ix_i$. Its variance is

$$\begin{aligned} \mathrm{Var}(a) &= \mathrm{Var}\left(\sum_{i=1}^{n}w_ix_i\right) = \sum_{i=1}^{n}\mathrm{Var}(w_ix_i) \\ &= \sum_{i=1}^{n}[\mathrm{E}(w_i)]^2\mathrm{Var}(x_i) + [\mathrm{E}(x_i)]^2 \mathrm{Var}(w_i) + \mathrm{Var}(w_i)\mathrm{Var}(x_i) \\ &= \sum_{i=1}^{n} \mathrm{Var}(w_i) \mathrm{Var}(x_i) \\ &= n\mathrm{Var}(w) \mathrm{Var}(x) \end{aligned}$$

Here we assumed zero-mean inputs and weights, so $\mathrm{E}[x_i] = 0, \mathrm{E}[w_i] = 0$; that $w_i$ and $x_i$ are independent of each other; and that the $x_i$ $(i = 1,2,..,n)$ are independent and identically distributed, as are the $w_i$ $(i = 1,2,..,n)$.
If we want the output $a$ to have the same variance as each of its inputs $x$, the variance of $w$ needs to be $\frac{1}{n}$, i.e. $\mathrm{Var}(w) = \frac{1}{n}$, which means $w \sim \mathcal{N}(0, \frac{1}{n})$.

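Understanding: we assumed that the input features and the weights both have zero mean, $\mathrm{E}[x_i] = 0$ and $\mathrm{E}(w_i) = 0$, that $w_i$ and $x_i$ are mutually independent, and that the $x_i$ are i.i.d. and the $w_i$ are i.i.d. Therefore, if we want $a$ and $x$ to have the same variance (so that the distribution does not change between the input and the output of the layer), we need $\mathrm{Var}(w) = \frac{1}{n}$, i.e. $w \sim \mathcal{N}(0, \frac{1}{n})$. Because $\mathrm{Var}(cx) = c^2\mathrm{Var}(x)$ for a constant $c$, dividing standard normal samples by $\sqrt{n}$ gives exactly this variance: `w = np.random.randn(n) / sqrt(n)`. In addition, deep learning code often initializes parameters as follows: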

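import math
import numpy as np

n = 256  # number of inputs (fan-in); example value
# Calculate the bound 1 / sqrt(n).
stdv = 1 / math.sqrt(n)
# Numpy: sample n weights uniformly from [-stdv, stdv)
w = np.random.uniform(-stdv, stdv, size=n)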

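That is, the weights are sampled uniformly within one standard deviation ($\frac{1}{\sqrt{n}}$) of the mean 0, which keeps the weights $w$ closer to 0: a uniform distribution on $[-s, s]$ has variance $\frac{s^2}{3}$, so here $\mathrm{Var}(w) = \frac{1}{3n}$ rather than $\frac{1}{n}$.

***Reference:***
- [cs231n: Weight Initialization](http://cs231n.github.io/neural-networks-2/#init)
- [Wiki: Variance](https://en.wikipedia.org/wiki/Variance)
- [Zhihu: Why can't the initial network parameters all be set to 0 when using gradient descent, and why is random initialization used instead?](https://www.zhihu.com/question/36068411)
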
## Optimization Methods

The loss function is defined as $J(\mathbf{W}, \mathbf{X})$, with $\mathbf{X} \in \mathbb{R}^{m \times n}$ ($m$ samples).
- $\eta$ : learning rate

### Batch Gradient Descent (BGD)

BGD sums the gradients of all samples and takes the mean,

$$w_i := w_i - \eta \frac{1}{m}\sum_{k=1}^{m}\nabla_{w_i}J(\mathbf{w}, \mathbf{x})^{(k)}$$

Advantages:
- Simplicity

Disadvantages:
- Large amount of computation per update
- Memory may not be large enough to hold all samples
- Difficult to update weights online

When the training set is very large, BGD takes a lot of time.

Stochastic Gradient Descent

SGD takes one sample at a time and uses its gradient to update the weights,

$$w_i := w_i - \eta \nabla_{w_i} J(\mathbf{w}, \mathbf{x})^{(k)}$$

A drawback of SGD is that the update direction does not always point toward the minimum, because only one sample's gradient is used each time.

Mini-batch Gradient Descent

This method calculates the gradients of a mini-batch of $b$ samples and uses their mean to update the weights,

$$w_i := w_i - \eta \frac{1}{b}\sum_{k=j}^{j+b-1}\nabla_{w_i}J(\mathbf{w}, \mathbf{x})^{(k)}$$

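import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 3)), rng.normal(size=256)  # toy data for a linear model (illustrative)
w, lr, batch_size = np.zeros(3), 0.1, 32
n_batches = len(X) // batch_size
for i in range(n_batches):
    xb, yb = X[i * batch_size:(i + 1) * batch_size], y[i * batch_size:(i + 1) * batch_size]
    # use matrix operations to calculate the loss of the mini-batch
    output = xb @ w
    loss = 0.5 * np.mean((output - yb) ** 2)
    # update weights with the mean gradient of the mini-batch
    gradient = xb.T @ (output - yb) / batch_size
    w = w - lr * gradient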
Mini-batch GD is much faster than batch GD. With batch GD the loss goes down at every step (assuming a suitable learning rate), whereas the loss of SGD or mini-batch GD is noisy: sometimes it decreases and sometimes it increases.

Summary
In one epoch (a single pass through the training set), BGD updates the weights once; SGD updates the weights $m$ times; mini-batch GD updates the weights $\frac{m}{\mathrm{batch~size}}$ times.

How to choose mini-batch size?
- If the training set is small ( $m \le 2000$ ): use batch gradient descent
- Typical mini-batch size: 64 ~ 512, usually a power of 2 (because of the way computer memory is laid out and accessed, this tends to make computation faster)

The following methods build on gradient descent. We use $g$ to denote the gradient $\nabla J(\mathbf{w}, \mathbf{x})$ (this gradient can be the mean over all samples, the gradient of a single sample, or the mean over a batch of samples).

Note: In deep learning we often say SGD, but the term usually refers to mini-batch gradient descent.

Exponentially weighted averages

$v_t$ : exponentially weighted average (a moving average) at time $t$
$\theta_t$ : current value

$$v_t = \beta v_{t-1} + (1-\beta) \theta_t$$

It can be rewritten as,

$$v_t = (1-\beta)\theta_t + (1-\beta) \beta \theta_{t-1} + (1-\beta)\beta^2 \theta_{t-2} + ...$$

$v_t$ is approximately an average over the previous $\frac{1}{1-\beta}$ data points, and $v_0$ is 0.

Understanding: this is a windowed way of computing an average, and $\beta$ determines the size of the window (the average can be used to predict the value at the next time step). In essence it is a moving average with exponentially decaying weights: the weight of each value decays exponentially with its age, so more recent data is weighted more heavily while older data still keeps some weight.
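
The recursive definition and its expanded form can be compared directly. A minimal sketch, with $v_0 = 0$ as in the definition above; the function name `exp_weighted_average` and the sample values are illustrative.

```python
import numpy as np

def exp_weighted_average(thetas, beta):
    """Return v_1..v_T with v_t = beta * v_{t-1} + (1 - beta) * theta_t and v_0 = 0."""
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

thetas = np.array([1.0, 2.0, 3.0, 4.0])
beta = 0.9
v = exp_weighted_average(thetas, beta)

# Expanded form of the last value: sum_k (1 - beta) * beta^k * theta_{T-k}.
T = len(thetas)
expanded = sum((1 - beta) * beta ** k * thetas[T - 1 - k] for k in range(T))
```

Both ways of computing the last value should agree up to floating-point rounding.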

Momentum

The basic idea is to compute an exponentially weighted average of the gradients and use this averaged gradient to update the weights.

$\mathrm{dw}$ : current gradient
$\eta$ : learning rate

$$v_{dw} := \beta v_{dw} + (1-\beta) \mathrm{dw}$$

$$w := w - \eta v_{dw}$$

Some versions of momentum (as in PyTorch) are written as

$$v_{dw} := \beta v_{dw} + \mathrm{dw}$$

which means that this $v$ equals the first version's $v$ divided by $1-\beta$.

The most common value for $\beta$ is $0.9$ (roughly an average over the last 10 gradients) for both versions of momentum. The only difference is that the second version's $v$ is larger than the first one's by the factor $\frac{1}{1-\beta}$, which only affects the effective learning rate.

Understanding: Momentum uses the exponentially weighted average of the gradients up to the current step as the update direction, i.e. it combines current and historical information to correct the gradient and obtain a better optimization direction.
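
The relation between the two versions of momentum can be verified on any gradient sequence. In this sketch the function names and the gradient values are illustrative; both accumulators start from zero.

```python
import numpy as np

def momentum_v1(grads, beta):
    """Accumulate v := beta * v + (1 - beta) * dw."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
    return v

def momentum_v2(grads, beta):
    """Accumulate v := beta * v + dw (the form without the 1 - beta factor)."""
    v = 0.0
    for g in grads:
        v = beta * v + g
    return v

grads = np.array([0.5, -1.0, 2.0, 0.3])
beta = 0.9
v1 = momentum_v1(grads, beta)
v2 = momentum_v2(grads, beta)
```

Here `v2` should equal `v1 / (1 - beta)`, so using the second form with learning rate $\eta$ gives the same updates as the first form with learning rate $\frac{\eta}{1-\beta}$.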

References:
- deeplearning.ai: Momentum

Nesterov Momentum

Init $v = 0$
Then in each iteration $t$:
Compute the gradient $g$ at the current weights $w$ on the current mini-batch and take a provisional momentum step,

$$v_{\text{tmp}} = \mu v_{t-1} + g$$

$$w_{\text{next}} = w - \eta v_{\text{tmp}}$$

Then compute the gradient $g_{w_{\text{next}}}$ at the look-ahead point $w_{\text{next}}$ and use it for the actual update,

$$v_t = \mu v_{t-1} + g_{w_{\text{next}}}$$

$$w := w - \eta v_t$$
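
A minimal sketch of one Nesterov step, following the update order described in this section: a provisional momentum step, a gradient evaluation at the look-ahead point, then the real update. The toy objective $\frac{1}{2}w^2$, the function name `nesterov_step` and the hyperparameters are illustrative assumptions.

```python
def nesterov_step(w, v, grad_fn, lr, mu):
    """One Nesterov momentum update; returns the new weight and velocity."""
    # Provisional step using the gradient at the current point.
    v_tmp = mu * v + grad_fn(w)
    w_next = w - lr * v_tmp
    # Actual update using the gradient at the look-ahead point.
    v_new = mu * v + grad_fn(w_next)
    return w - lr * v_new, v_new

def grad_fn(w):
    # Gradient of the toy objective 0.5 * w**2.
    return w

w, v = 1.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn, lr=0.1, mu=0.9)
```

On this quadratic the iterate should move toward the minimum at 0.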

Reference:
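- Zhihu column: A complete summary and comparison of deep learning optimization methods (SGD, Adagrad, Adadelta, Adam, Adamax, Nadam)
- Comparison of optimization algorithms in convolutional neural networks (note: that blog post contains some errors; read it mainly for the ideas it explains)
- Zhihu: What roles do weight decay, momentum and normalization play in neural networks?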

RMSprop (Root Mean Square Propagation)

Init $s_{dw}=0$
Then in iteration $t$:
Compute $\mathrm{dw}$ and $\mathrm{db}$ on the current mini-batch, then

$$s_{dw} := \beta s_{dw} + (1 - \beta) {\mathrm{dw}}^2$$

$$w := w - \eta \frac{\mathrm{dw}}{\sqrt{s_{dw}}}$$

Here ${\mathrm{dw}}^2$ is the element-wise square of $\mathrm{dw}$.

Understanding: intuitively, through $s$, parameters with large gradients are divided by a large value, so their updates become smaller, while parameters with small gradients are divided by a small value, so their updates become larger.

To avoid dividing by a $\sqrt{s_{dw}}$ that is zero or close to zero, in practice we add a small value $\epsilon$ (e.g. $10^{-8}$) to the denominator, $\frac{\mathrm{dw}}{\sqrt{s_{dw}}+\epsilon}$, which prevents inf or very large values.
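
The equalizing effect of the division by $\sqrt{s_{dw}}$ shows up already in the first step. A minimal sketch of the update with the $\epsilon$ term included; the function name `rmsprop_step`, the hyperparameters and the gradient values are illustrative.

```python
import numpy as np

def rmsprop_step(w, dw, s, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update; returns the new weights and the new running average s."""
    s = beta * s + (1 - beta) * dw ** 2
    w = w - lr * dw / (np.sqrt(s) + eps)
    return w, s

w = np.array([1.0, 1.0])
s = np.zeros_like(w)
dw = np.array([10.0, 0.1])  # one large and one small gradient component
w_new, s_new = rmsprop_step(w, dw, s)
step = w - w_new
```

Although the two gradient components differ by a factor of 100, both coordinates should move by almost the same amount, about `lr / sqrt(1 - beta)` on the first step.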

Adam (Adaptive Moment Estimation)

A combination of Momentum and RMSprop.

Init $v_{dw}=0$ , $s_{dw}=0$
Then in iteration $t$:
Compute $\mathrm{dw}$ and $\mathrm{db}$ on the current mini-batch, then

$$v_{dw} := \beta_1 v_{dw} + (1-\beta_1) \mathrm{dw}$$

$$s_{dw} := \beta_2 s_{dw} + (1-\beta_2) {\mathrm{dw}}^2$$

Do bias correction,

$$v_{dw}^{correct} = \frac{v_{dw}}{(1 - \beta_1^t)}$$

$$s_{dw}^{correct} = \frac{s_{dw}}{(1 - \beta_2^t)}$$

Update weights,

$$w := w - \eta \frac{v_{dw}^{correct}}{\sqrt{s_{dw}^{correct}} + \epsilon}$$

A common value for $\beta_1$ is 0.9, and for $\beta_2$ it is 0.999.
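
The update rule can be sketched as a single function. This is a minimal illustration that uses the squared gradient in the second-moment estimate, as in RMSprop; the function name `adam_step`, the toy objective $\frac{1}{2}\|w\|^2$ and the learning rate are illustrative assumptions.

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dw        # momentum term
    s = beta2 * s + (1 - beta2) * dw ** 2   # RMSprop term
    v_hat = v / (1 - beta1 ** t)            # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w = np.array([1.0, -3.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
dw = w.copy()  # gradient of 0.5 * ||w||^2 at w
w1, v, s = adam_step(w, dw, v, s, t=1, lr=0.1)
```

Thanks to the bias correction, the very first step has magnitude close to the learning rate in every coordinate, regardless of the size of the gradient.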

Learning Rate Decay

During training, the learning rate is decreased as the number of epochs grows. In early epochs the network can tolerate a relatively large learning rate, which accelerates training. As the loss decreases and we get closer to the optimal solution, a smaller learning rate helps the solution settle in a tighter region around the minimum.
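
As an illustration, one common schedule divides the initial learning rate by a term that grows with the epoch number. The specific formula $\eta = \frac{\eta_0}{1 + \text{decay rate} \cdot \text{epoch}}$, the function name `decayed_learning_rate` and the values below are assumptions chosen for this sketch; the notes above do not prescribe a particular schedule.

```python
def decayed_learning_rate(lr0, epoch, decay_rate=1.0):
    """Inverse-time decay: lr0 / (1 + decay_rate * epoch). Assumed schedule."""
    return lr0 / (1.0 + decay_rate * epoch)

rates = [decayed_learning_rate(0.1, epoch) for epoch in range(5)]
```

The resulting rates start at the initial value and decrease monotonically from epoch to epoch.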

1D, 2D, 3D Convolutions

1D convolution:
- Input: a sequence $[C_{in}, L_{in}]$
- Kernel: $[C_{in}, k]$
- Output (one kernel): a vector $[L_{out},]$
2D convolution:
- Input: an image $[1, H, W]$ or $[C_{in}, H, W]$
- Kernel: $[C_{in}, k, k]$
- Output (one kernel): a feature map $[H_{out}, W_{out}]$
3D convolution:
- Input: a video or CT scan $[C_{in}, D, H, W]$
- Kernel: $[C_{in}, k, k, k]$
- Output (one kernel): $[D_{out}, H_{out}, W_{out}]$

Notice that the number of dimensions of the output produced by a single kernel is what gives each kind of convolution its name.
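
The 2D case can be made concrete with a naive NumPy implementation for a single kernel. This sketch assumes stride 1, no padding, and the cross-correlation convention used by deep learning frameworks; the function name `conv2d_single_kernel` and the shapes are illustrative.

```python
import numpy as np

def conv2d_single_kernel(x, kernel):
    """Slide one [C_in, k, k] kernel over a [C_in, H, W] input (stride 1, no padding)."""
    c_in, h, w = x.shape
    _, k, _ = kernel.shape
    h_out, w_out = h - k + 1, w - k + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # The kernel covers all input channels at once, so the channel
            # dimension is summed away and the output is two-dimensional.
            out[i, j] = np.sum(x[:, i:i + k, j:j + k] * kernel)
    return out

x = np.ones((3, 5, 5))       # C_in = 3, H = W = 5
kernel = np.ones((3, 3, 3))  # C_in = 3, k = 3
out = conv2d_single_kernel(x, kernel)
```

The output of one kernel is a 2D feature map even though the input has three dimensions, which is why this operation is called a 2D convolution.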

References:
- NetEase - deeplearning.ai: Convolution over volumes

Loss Function

Classification

Cross Entropy

$$H(Y, \hat{Y}) = E_Y \left[\log \frac{1}{\hat{Y}}\right] = E_Y [-\log \hat{Y}]$$

Basic knowledge:
- Entropy (Shannon entropy): Shannon defined the entropy $H$ of a discrete random variable $X$ with possible values $\{x_1, x_2, ..., x_n\}$ and probability mass function $P(X)$ as:

$$
H(X) = E[I(X)] = E[-\ln(P(X))]
$$

Here $E$ is the *expected value operator*, and $I$ is the *information content* of $X$.

It can be written explicitly as,

$$
H(X) = \sum_{i=1}^{n}P(x_i)I(x_i) = -\sum_{i=1}^{n}P(x_i)\log_b P(x_i)
$$

where $b$ is the base of the logarithm used. Common values of $b$ are 2, Euler's number $e$, and 10. In machine learning and deep learning, people often use $e$.

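The KL divergence from $\hat{Y}$ to $Y$ is the difference between the cross entropy and the entropy,

$$\mathrm{KL}(Y\|\hat{Y}) = \sum_{i}y_i\log\frac{1}{\hat{y_i}} - \sum_{i}y_i\log\frac{1}{y_i} = \sum_{i}y_i\log\frac{y_i}{\hat{y_i}}$$

Note: entropy is the expectation of the Shannon information content, i.e. of $I(X)$ in the formula above.
- Information content: a measure of information, namely how much information the occurrence of the event represented by the random variable conveys. Rare events carry more information, and the more probable an event is, the less information it carries, so the information content decreases as the probability of the event increases.
- Entropy: measures the average information content of the random variable $X$.
- Cross entropy: the amount of information needed when the estimated distribution q is used to approximate the true distribution p.
- KL divergence: the difference between the cross entropy and the entropy.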
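
The three quantities and the relation between them can be computed in a few lines. This sketch assumes discrete distributions with strictly positive probabilities and natural logarithms; the function names and the example distributions are illustrative.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i * log(p_i)."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), with p the true distribution."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # true distribution
q = np.array([0.5, 0.3, 0.2])  # estimated distribution
h = entropy(p)
ce = cross_entropy(p, q)
kl = kl_divergence(p, q)
```

The KL divergence should equal `ce - h`, it is non-negative, and it is zero when the estimate matches the true distribution.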

References:
- A Friendly Introduction to Cross-Entropy Loss
- Zhihu: How to explain cross entropy and relative entropy in plain terms?

Awesome Papers

Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv-2014

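This paper proposes VGG Net. The core idea is to use small kernels (3 * 3) to build a deeper network. Small kernels are used so that the number of parameters stays under control while the network gets deeper.
The network uses fully connected layers during training. At test time, in order to handle images of different sizes, the parameters of the fully connected layers are converted into convolution kernels of the corresponding size, which turns the network into a fully convolutional one.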

Deep Residual Learning for Image Recognition, CVPR-2016

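The paper describes the degradation problem of deep networks: during training, a deeper network can end up with a higher training loss than a shallower one. The paper calls such networks plain networks. The authors argue that this problem is not caused by vanishing gradients, because these plain networks use BN, which ensures that the forward-propagated signals have non-zero variance, and they verify experimentally that, thanks to BN, the gradients in backpropagation are healthy as well. They conjecture instead that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error, and that simply increasing the number of iterations does not solve the problem. My understanding: the parameter space of these ill-conditioned networks contains many saddle-like regions where the gradient changes little, which hampers optimization.
Therefore, the paper proposes Residual Learning:

$$\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$$

where $\mathcal{H}(\mathbf{x})$ is the output of a few stacked layers and $\mathbf{x}$ denotes the input of the first of these layers. The network thus learns the residual between the output and the input, and this residual learning is realized by skip connections.
Note: a residual network learns the residual between the output and the input, i.e. the output equals the input $\mathbf{x}$ plus $\mathcal{F}(\mathbf{x})$. Earlier approaches learn the mapping from input to output directly, $\mathbf{x} \to \mathrm{output}$, without learning a residual.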

Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

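A multi-person pose estimation paper from CMU with very good results. The core idea is heatmap + PAF: the heatmaps predict the keypoints of all people, and the PAFs (part affinity fields) predict the direction of each bone. The relationship between skeleton keypoints is represented by computing, on the image, the direction of each bone as a unit vector. The labels of the network are the heatmap of each keypoint and the PAF of each bone (twice as many as there are bones, describing the x and y directions respectively). In the data preprocessing, MATLAB is used to turn the annotation of one image containing several people into multiple samples, split into self_joints and others_joints. Data augmentation is built into the Caffe code and applies scale, rotate, crop and pad, and flip to each image in turn.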
