An Introduction to Neural Networks and CNNs

不吃草的小猪

Category: Machine learning in Action. Tags: deep learning, PyTorch, machine learning, neural networks

Article link: https://blog.csdn.net/Convolution_ZQ/article/details/104352384

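1. Activation functions

1.1. Commonly used in fully connected layers

1.1.1. Sigmoid function

During gradient descent, sigmoid units saturate easily, which stops the gradient from propagating, and their outputs are not zero-centered.
$\sigma(z) = \frac{1}{1 + e^{-z}}$

saturated neurons can kill off the gradients
sigmoid outputs are not zero-centered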
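
To make the saturation point concrete, here is a minimal NumPy sketch that evaluates the sigmoid and its derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The function names `sigmoid` and `sigmoid_grad` are illustrative choices, not taken from the original post.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z))       # every output lies in (0, 1), so none is negative
print(sigmoid_grad(z))  # largest at z = 0 (0.25), tiny for large |z|
```

The gradient at $z = \pm 10$ is several orders of magnitude smaller than at $z = 0$, which is what "killing the gradient" refers to.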

1.1.2. tanh function

$f(x) = \tanh(x)$

squashes numbers to the range [-1, 1]
zero-centered
still kills gradients when saturated

[Figure: tanh graph]

1.2. Commonly used in convolutional layers

ReLU function (Rectified Linear Unit)
Converges quickly, and the gradient is simple to compute.
But it still kills off half of the gradients (those for x < 0).
$f(x) = \max(0, x)$
Be careful with your learning rate.

1.3. Other activation functions

1.3.1. Dead ReLU and active ReLU

1.3.2. Leaky ReLU

$f(x) = \max(0.01x, x)$

does not saturate
computationally efficient
Converges much faster than sigmoid/tanh in practice
will not kill off gradients
sometimes the slope is written as a parameter: $\begin{cases} \alpha x, \quad x<0 \\ x, \quad x\geq 0 \end{cases}$ where $\alpha$ is a small number.

1.3.3. Parametric Rectifier (PReLU)

$f(x) = \max(\alpha x, x)$

1.3.4. Exponential Linear Units (ELU)

$f(x) = \begin{cases} x, \quad x>0 \\ \alpha (e^{x} - 1), \quad x\leq 0 \end{cases}$

all benefits of ReLU
closer to zero mean outputs
negative saturation regime compared with leaky ReLU adds some robustness to noise

1.3.5. Maxout

$\max(W_{1}^{T}x + b_{1},\, W_{2}^{T}x + b_{2})$

ReLU and Leaky ReLU are special cases of Maxout
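
The activation functions in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's code: the function names are made up here, `alpha = 1.0` for ELU is an assumed default, and `maxout` is simplified to scalar weights so the special-case claim is easy to check.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # valid as a max() because alpha < 1
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def maxout(x, w1, b1, w2, b2):
    """Two-piece maxout with scalar weights: max(w1*x + b1, w2*x + b2)."""
    return np.maximum(w1 * x + b1, w2 * x + b2)

x = np.linspace(-3.0, 3.0, 7)
# ReLU is maxout with one piece fixed at 0 and the other the identity
relu_as_maxout = maxout(x, 0.0, 0.0, 1.0, 0.0)
# Leaky ReLU is maxout with slopes 0.01 and 1
leaky_as_maxout = maxout(x, 0.01, 0.0, 1.0, 0.0)
print(np.allclose(relu_as_maxout, relu(x)), np.allclose(leaky_as_maxout, leaky_relu(x)))
```

The price of Maxout's generality is that each unit carries two sets of weights instead of one.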

2. CNN

2.1. Computer vision

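A CNN compares images piece by piece. These small pieces are called features.

2.2. Convolution

Convolution is the operation of taking the inner product (multiply element by element, then sum) of the image (the data in each successive data window) with a filter matrix (a fixed set of weights; because each neuron's weights are fixed, the set can be viewed as a constant filter).

In the figure below, the left part is the raw input data, the middle part is the filter, and the right part is the two-dimensional output.

$(I * K)(i, j) = \sum_{m}\sum_{n} I(m, n)\,K(i-m, j-n)$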

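A single operation (one layer) uses several convolution kernels to produce several feature maps at that scale.
Multiple layers extract different feature information at different scales.

Because of the first point above, even if the first input image has only one channel, the input to every later layer is multi-channel. The convolution kernel therefore has to be multi-channel as well. Both the input image and the kernel gain a channel dimension, and the convolution operation in a convolution layer becomes the following formula, where $c$ indexes the kernel and $c'$ runs over the input channels:
$(I * K_{c})(i, j) = \sum_{c'}\sum_{m}\sum_{n} I(m, n, c')\,K_{c}(i-m, j-n, c')$
Summary of the feedforward calculation of convolution and the corresponding function:

The most important question for the convolution operation is how to determine the convolution kernels (their number and values).
Backpropagation (BP) tells us how to optimise the kernel values through supervised learning, so that we can find the kernels (features) that perform best on the task at hand.
The class that implements our convolution layer should also contain a backward method, which computes the derivatives for backpropagation.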
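
The feedforward calculation can be sketched as a naive sliding-window loop. This is a minimal illustration under stated assumptions rather than the author's implementation: the name `conv2d_forward`, the `(H, W, C)` and `(F, kH, kW, C)` array layouts, and the absence of padding are all choices made here, and the filter depth is assumed to already match the image depth. Like most deep-learning code, it slides the kernel without flipping it (cross-correlation); flipping each kernel along both spatial axes first would recover the $K(i-m, j-n)$ form of the formula.

```python
import numpy as np

def conv2d_forward(image, kernels, stride=1):
    """Naive 'valid' sliding-window convolution.

    image:   (H, W, C) input
    kernels: (F, kH, kW, C) bank of F filters
    returns: (out_h, out_w, F) stack of feature maps
    """
    H, W, C = image.shape
    F, kH, kW, _ = kernels.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w, F))
    for f in range(F):
        for i in range(out_h):
            for j in range(out_w):
                r, c = i * stride, j * stride
                window = image[r:r + kH, c:c + kW, :]
                # inner product: multiply element-wise, then sum over rows, columns, channels
                out[i, j, f] = np.sum(window * kernels[f])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((7, 7, 3))
kernels = rng.standard_normal((2, 3, 3, 3))
print(conv2d_forward(image, kernels, stride=2).shape)  # (3, 3, 2)
```

Each of the two kernels produces one feature map, so the output depth equals the number of kernels.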

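2.3. Convolution on images

The input is the data in a region of a given size (width*height); taking its inner product with the filter produces new two-dimensional data.

Basically, the left side is the image input and the middle part is the filter. Different filters produce different outputs, such as colour intensity or contours. In other words, to extract different features of an image, use different filters, each of which pulls out one specific kind of information: colour intensity or contours.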

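2.4. Animated convolution (GIF)

In a CNN, after each filter computation the data window moves on, until all of the data has been processed.

depth: the number of neurons; it determines the depth (thickness) of the output and equals the number of filters
stride: the step size of the sliding window, which determines how many steps it takes to reach the edge
zero-padding: add several rings of zeros around the border so that the window, starting from the initial position and moving in units of stride, lands exactly on the end position; put simply, it makes the total length divisible by the stride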

In the figure below, the parameters are:

depth = 2
stride = 2
zero-padding = 1

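Each of the two filters is then slid over the array to carry out the convolution calculation.
The left side is the input (7*7*3: 7*7 is the width and height of the image in pixels, and 3 is the R, G, B colour channels).
The middle shows the 2 filters.
The right side is the result.
As the window on the left slides, the filters convolve different local patches of the data.
The data window slides, so the input keeps changing, but the filter stays the same: this is the parameter (weight) sharing mechanism of CNNs.
If the input is an m*m matrix, the filter is n*n, and the stride is k, the output is a square matrix of side (m-n)/k + 1, which must be an integer or the filter won't fit.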
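
The output-size rule is easy to check in code. The sketch below is a hedged illustration: the function name `conv_output_size` is made up here, and the padding term `p` (zeros added on each side, giving $(m + 2p - n)/k + 1$) is an extension of the formula in the text, which is stated without padding. The 3*3 filter size in the usage lines is also an assumption, since the text does not give the filter size of the figure.

```python
def conv_output_size(m, n, k, p=0):
    """Side length of the output for an m*m input, n*n filter, stride k, padding p."""
    span = m + 2 * p - n
    if span < 0 or span % k != 0:
        raise ValueError("filter does not fit: (m + 2p - n) must be a non-negative multiple of k")
    return span // k + 1

print(conv_output_size(7, 3, 2))       # 3
print(conv_output_size(5, 3, 2, p=1))  # 3
```

A stride of 3 with the same 7*7 input and 3*3 filter raises the error, because (7 - 3) is not divisible by 3.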

2.5. Pooling layer

Pooling basically takes the average or the maximum over each region.

The figure below shows max pooling, which is more commonly used than average pooling.

[Figure: max pooling]

makes the representation smaller and more manageable
operates over each activation map independently
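
A minimal max-pooling sketch follows. The name `max_pool2d`, the `(H, W, C)` layout, and the 2*2 window with stride 2 are assumptions chosen for illustration.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over an (H, W, C) array, each channel handled independently."""
    H, W, C = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # max over the spatial window only, so channels never mix
            out[i, j, :] = x[r:r + size, c:c + size, :].max(axis=(0, 1))
    return out

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool2d(x)[:, :, 0])  # [[5, 7], [13, 15]] as floats
```

The 4*4 map shrinks to 2*2, which is the "smaller and more manageable" representation mentioned above.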

2.6. Padding in practice

In practice, we usually add a padding border around the input matrix, and normally we pad the border with zeros.
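
As a small illustration, NumPy's `np.pad` can add the zero border. The array values and the pad width of 1 are arbitrary choices for the example.

```python
import numpy as np

x = np.arange(1, 10).reshape(3, 3)
# one ring of zeros around a single-channel input
padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (5, 5)

# for an (H, W, C) image, pad only the two spatial axes
image = np.ones((7, 7, 3))
padded_image = np.pad(image, pad_width=((1, 1), (1, 1), (0, 0)), mode="constant")
print(padded_image.shape)  # (9, 9, 3)
```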

3. Pre-work and corresponding process

3.1. Weight initialisation

W = 0.01 * np.random.randn(D, H)
# works for small networks, but causes problems in deeper neural networks

We should not initialise all weights to zero, because we want our neurons to do different things (identical weights would produce identical outputs and identical gradient updates).
There is another method of initialising weights that has proven practical: calibrating the variances with $\frac{1}{\sqrt n}$.

w = np.random.randn(n) / np.sqrt(n)  # where n is the number of inputs
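
To see why the 0.01 scaling struggles in deeper networks, the sketch below pushes random data through a stack of tanh layers and records the standard deviation of the activations at each layer. Everything about the setup is an assumption made for the demonstration: 10 layers, width 500, tanh activations, and the helper name `activation_stds`.

```python
import numpy as np

def activation_stds(init, n_layers=10, width=500, seed=0):
    """Return the std of the activations after each tanh layer."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((1000, width))
    stds = []
    for _ in range(n_layers):
        if init == "small":
            W = 0.01 * rng.standard_normal((width, width))
        else:  # "calibrated": scale by 1 / sqrt(number of inputs)
            W = rng.standard_normal((width, width)) / np.sqrt(width)
        h = np.tanh(h @ W)
        stds.append(h.std())
    return stds

small = activation_stds("small")
calibrated = activation_stds("calibrated")
print(small[-1], calibrated[-1])
```

With the small initialisation the activations collapse towards zero layer by layer, so the gradients on the weights vanish too; with the calibrated version they stay at a usable scale.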

3.2. Batch normalization

Input:
$x: N \times D$
Learnable params:
$\gamma, \beta: D$
Intermediates:
$\mu, \sigma: D$
$\hat{x}: N \times D$
Output:
$y: N \times D$
Update:
$\mu_{j} = \frac{1}{N}\sum_{i=1}^{N}x_{i, j}$
$\sigma_{j}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_{i, j} - \mu_{j})^2$
$\hat{x}_{i, j} = \frac{x_{i, j} - \mu_{j}}{\sqrt{\sigma_{j}^2 + \varepsilon}}$
$y_{i, j} = \gamma_{j}\,\hat{x}_{i, j} + \beta_{j}$
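
The four update equations map directly onto NumPy. This is a minimal training-time sketch only: the name `batchnorm_forward` and the value `eps=1e-5` are assumptions, and the running averages used at test time are left out.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization over an (N, D) batch, normalising each feature j."""
    mu = x.mean(axis=0)                    # (D,)  per-feature mean
    var = x.var(axis=0)                    # (D,)  per-feature variance (divides by N)
    x_hat = (x - mu) / np.sqrt(var + eps)  # (N, D) normalised input
    return gamma * x_hat + beta            # (N, D) scaled and shifted output

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((64, 4)) + 5.0
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))  # close to 0 and 1 for every feature
```

With $\gamma = 1$ and $\beta = 0$ the output is simply the normalised input; learning other values lets the network undo the normalisation where that helps.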

3.3. Learning rate

3.4. Hyperparameter search

grid search
random search
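
The two strategies can be contrasted in a short sketch. Note the assumptions: `toy_score` is a hypothetical stand-in for a real validation score, the hyperparameter names `lr` and `reg` and their ranges are chosen only for illustration, and sampling on a log scale is a common convention rather than something stated in the text.

```python
import itertools
import numpy as np

def toy_score(lr, reg):
    """Hypothetical stand-in for validation accuracy; peaks at lr=1e-3, reg=1e-4."""
    return -((np.log10(lr) + 3.0) ** 2) - 0.1 * (np.log10(reg) + 4.0) ** 2

# Grid search: try every combination of a few fixed values
lrs = [1e-5, 1e-4, 1e-3, 1e-2]
regs = [1e-6, 1e-5, 1e-4, 1e-3]
grid_best = max(itertools.product(lrs, regs), key=lambda p: toy_score(*p))

# Random search: sample each hyperparameter log-uniformly from a range
rng = np.random.default_rng(0)
samples = [(10 ** rng.uniform(-5, -1), 10 ** rng.uniform(-6, -2)) for _ in range(16)]
random_best = max(samples, key=lambda p: toy_score(*p))

print(grid_best, random_best)
```

Both runs evaluate 16 configurations, but the grid only ever tries 4 distinct learning rates, while random search tries 16.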

3.5. Optimization

3.5.1. SGD

$x_{t+1}= x_{t}\, -\,\alpha\nabla f(x_{t})$

while True:
    dx = compute_gradient(x)
    x -= learning_rate * dx  # step against the gradient

3.5.2. SGD + Momentum

$v_{t+1}=\rho v_{t} \, +\, \nabla f(x_{t})$
$x_{t+1} = x_{t}\, -\, \alpha v_{t+1}$

Builds up 'velocity' as a running mean of gradients.
Rho gives 'friction'; typically rho = 0.9 or 0.99.

vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx
    x -= learning_rate * vx  # step against the accumulated velocity

3.5.3. Nesterov Momentum

$v_{t+1}= \rho v_{t} - \alpha \nabla f(x_{t} + \rho v_{t})$
$x_{t+1} = x_{t} \, +\, v_{t+1}$
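
The two update equations can be written as a short runnable loop. This is a sketch on a toy problem: the objective $f(x) = \frac{1}{2}\lVert x \rVert^2$, the helper name `grad`, and the values `rho = 0.9` and `learning_rate = 0.1` are assumptions made for the demonstration.

```python
import numpy as np

def grad(x):
    """Gradient of the toy objective f(x) = 0.5 * ||x||^2."""
    return x

x = np.array([5.0, -3.0])
v = np.zeros_like(x)
rho, learning_rate = 0.9, 0.1

for _ in range(200):
    # look ahead to x + rho * v and evaluate the gradient there
    v = rho * v - learning_rate * grad(x + rho * v)
    x = x + v

print(x)  # close to the minimum at the origin
```

The only difference from plain momentum is where the gradient is evaluated: at the look-ahead point rather than at the current position.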

3.5.4. AdaGrad

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension.
Not so commonly used in practice.

3.5.5. RMSProp

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

SGD + Momentum can do better than RMSProp, which in turn behaves a little differently from, and often better than, plain SGD.

3.5.6. Adam

first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):  # start at 1 so the bias correction never divides by zero
    dx = compute_gradient(x)
    # Momentum
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    # Bias correction
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    # AdaGrad / RMSProp
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)  # 1e-7 avoids division by zero

Because second_moment is initialised to zero, it is still very close to zero after the first update, so without a correction the first steps would be very large.
The bias correction compensates for the fact that the first and second moment estimates start at zero.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a good starting point for many models.
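
For a version that runs end to end, the update can be wrapped in a function and tried on a toy problem. The wrapper name `adam_minimize`, its signature, and the quadratic objective are assumptions made for this sketch; the hyperparameter defaults are the starting values suggested above.

```python
import numpy as np

def adam_minimize(grad_fn, x0, num_iterations=5000,
                  learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    """Run Adam from x0 and return the final parameters."""
    x = np.array(x0, dtype=float)
    first_moment = np.zeros_like(x)
    second_moment = np.zeros_like(x)
    for t in range(1, num_iterations + 1):
        dx = grad_fn(x)
        first_moment = beta1 * first_moment + (1 - beta1) * dx
        second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
        # bias-corrected estimates of the two moments
        first_unbias = first_moment / (1 - beta1 ** t)
        second_unbias = second_moment / (1 - beta2 ** t)
        x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + eps)
    return x

# Toy objective f(x) = sum(x^2): its gradient is 2x and its minimum is at 0
x_final = adam_minimize(lambda x: 2.0 * x, x0=[1.0, -1.0])
print(x_final)  # much closer to the origin than the starting point
```

Because the step is normalised by the second moment, each update moves by roughly `learning_rate` regardless of the gradient's scale, which is why Adam needs little tuning across problems.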

3.6. Sanity check tips

Checks worth running before training in earnest:
- Look for correct loss at chance performance
- Overfit a tiny subset of data
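
The first check can be sketched for a softmax classifier: with tiny random weights the predictions are nearly uniform, so the loss should be close to $\ln C$ for $C$ classes. The softmax loss, the 10 classes, the data shapes, and the helper name `softmax_cross_entropy` are all assumptions made for this example.

```python
import numpy as np

def softmax_cross_entropy(scores, labels):
    """Mean cross-entropy loss for (N, C) scores and integer labels of shape (N,)."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
num_classes = 10
X = rng.standard_normal((100, 32))
y = rng.integers(0, num_classes, size=100)
W = 1e-4 * rng.standard_normal((32, num_classes))  # tiny weights: near-uniform predictions

loss = softmax_cross_entropy(X @ W, y)
print(loss, np.log(num_classes))  # the two values should be close
```

If the initial loss is far from this value, the loss function or the initialisation is likely wrong.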

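4. Code implementation

The filter size should be odd, and the filter should be a square matrix.
The function first makes sure that the depth of each filter equals the number of image channels. The if statement first checks whether the image and the filter both have a depth channel; if they do, it checks whether their channel counts are equal, and raises an error if they do not match.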
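
The checks described above might look like the following sketch. It is a hypothetical reconstruction, not the author's function: the name `check_filters`, the `(F, k, k)` or `(F, k, k, C)` filter layout, and the use of `ValueError` are all assumptions.

```python
import numpy as np

def check_filters(image, filters):
    """Validate a filter bank of shape (F, k, k) or (F, k, k, C) against an image."""
    if filters.shape[1] != filters.shape[2]:
        raise ValueError("each filter must be a square matrix")
    if filters.shape[1] % 2 == 0:
        raise ValueError("filter size must be odd")
    # compare depths only when both the image and the filters have a channel axis
    if image.ndim > 2 and filters.ndim > 3:
        if image.shape[-1] != filters.shape[-1]:
            raise ValueError("filter depth must equal the number of image channels")
    return True

image = np.zeros((7, 7, 3))
print(check_filters(image, np.zeros((2, 3, 3, 3))))  # True
```

An odd filter size gives the filter a well-defined centre pixel, which is the usual reason for this requirement.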
