The whole BP (backpropagation) neural network computation repeats the procedures below:
- Forward Propagation
- Compute cost
- Backward Propagation
- Update Parameters
Today I want to summarize them per my understanding, since I learned this a long time ago but sometimes cannot remember the details of each step.
First, let’s get an overview of the whole FP and BP computation procedure from the picture below:
Some diagram notations here:
- Every rectangle represents a single hidden layer in the NN
- Black rectangles represent the Forward Propagation computation sequence
- Red rectangles represent the Backward Propagation computation sequence
- The formulas inside each rectangle are the computations performed for that layer in FP or BP; we will discuss them in later sections
See Denotation for all the notation used in the above picture.
Forward Propagation
Forward Propagation is the sequence that computes from the left (input X) of the NN to the right (output Y^).

For a NN with depth L, it repeats the computation below for each of the first L−1 layers:
- Input: the output of the previous layer, A[ℓ−1]
- Compute the linear output Z[ℓ]:

Z[ℓ] = W[ℓ] A[ℓ−1] + b[ℓ]    (1)

- Output: the activation of the current layer, A[ℓ]. Here g represents the activation function, such as ReLU, tanh, or sigmoid; we use ReLU here for illustration:

A[ℓ] = g(Z[ℓ])    (2)
The activation output of each layer is the input of the next layer; it’s like a chain.
In the last layer L, we need to compute the probability that the prediction belongs to the positive class, so the sigmoid activation is used there instead of ReLU.
In practice, we also store the intermediate outputs and the parameters of each layer in a cache, as they are needed when doing BP, as you can see in the above picture.
python numpy implementation
formula (1):

```python
import numpy as np

def linear_forward(A, W, b):
    """Compute the linear part of a layer's forward propagation, Z = WA + b."""
    Z = np.dot(W, A) + b
    return Z
```
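In a full forward pass the linear step is paired with the activation of formula (2), and the intermediates are cached for BP. A minimal sketch of that pairing (the helper name `linear_activation_forward` and the cache layout are my own illustration, not from any particular framework):

```python
import numpy as np

def linear_activation_forward(A_prev, W, b, activation="relu"):
    """One forward step: linear transform (formula 1) followed by activation (formula 2).

    Returns the layer's activation plus a cache of intermediates needed for BP.
    """
    Z = np.dot(W, A_prev) + b            # formula (1)
    if activation == "relu":
        A = np.maximum(0, Z)             # g = ReLU for hidden layers
    else:
        A = 1 / (1 + np.exp(-Z))         # g = sigmoid for the output layer
    cache = (A_prev, W, b, Z)            # stored for the backward pass
    return A, cache
```

Chaining calls to this helper, with each layer's `A` feeding the next layer's `A_prev`, gives the whole forward pass.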
Compute Cost (Loss/Error)
Here the cost function defines “how well our algorithm performs when our prediction is Y^ while the actual class is Y”. The lower the cost, the better our algorithm works.
From another angle, you can think of it as the error between our prediction and the actual value: a lower cost means our prediction has higher accuracy, and thus the model works better.
The cross-entropy cost, summing over the m training examples:

J = −(1/m) Σ [ Y log(AL) + (1−Y) log(1−AL) ]    (3)
python numpy implementation
formula (3):

```python
def compute_cost(AL, Y):
    """Compute the cross-entropy cost of formula (3) from predictions AL and labels Y."""
    m = Y.shape[1]
    cost = -np.sum(np.multiply(Y, np.log(AL)) + np.multiply((1 - Y), np.log(1 - AL))) / m
    cost = np.squeeze(cost)  # turn e.g. [[cost]] into a scalar
    return cost
```
Backward Propagation
BP is critical in the whole NN algorithm. It computes the partial derivatives of the cost function J with respect to the parameters, and these derivatives will then be used to update all the parameters (W[ℓ] and b[ℓ] of every layer).
The BP algorithm computes from the rightmost layer L to the leftmost (first) layer - see the picture at the beginning.
Within each layer (a red rectangle in the picture), the BP algorithm computes two kinds of derivatives: the activation derivative and the linear parameter derivatives.
Activation derivative

dZ[ℓ] = dA[ℓ] * g′(Z[ℓ])    (4)

g′ is the derivative of the activation function used in the current layer (sigmoid/ReLU/tanh).
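As a concrete sketch of applying g′ for the two activations used in this post (the helper names are my own illustration):

```python
import numpy as np

def relu_backward(dA, Z):
    """dZ = dA * g'(Z) for g = ReLU: the gradient passes only where Z > 0."""
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ

def sigmoid_backward(dA, Z):
    """dZ = dA * g'(Z) for g = sigmoid, using g'(Z) = s * (1 - s) with s = sigmoid(Z)."""
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)
```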
Linear derivative

dZ[ℓ] will be used to compute the three outputs below:

dW[ℓ] = (1/m) dZ[ℓ] A[ℓ−1]^T    (5)
db[ℓ] = (1/m) Σ dZ[ℓ]    (6)
dA[ℓ−1] = W[ℓ]^T dZ[ℓ]    (7)
A small trick here: the initial derivative dA[L] in the BP computation graph of layer L is not computed by formula (7); instead it is the derivative of the cost (3) with respect to the final activation:

dA[L] = −( Y / A[L] − (1−Y) / (1−A[L]) )
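In numpy, initializing BP this way is just one line (a sketch; the helper name is my own):

```python
import numpy as np

def init_backprop(AL, Y):
    """dA[L] = -(Y/AL - (1-Y)/(1-AL)), the derivative of the cross-entropy cost w.r.t. AL."""
    return -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
```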
python numpy implementation
formula (5/6/7):

```python
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m                # formula (5)
    db = np.sum(dZ, axis=1, keepdims=True) / m   # formula (6)
    dA_prev = np.dot(W.T, dZ)                    # formula (7)
    return (dA_prev, dW, db)
```
Update parameters
Since the parameter derivatives of all layers are now available, we can update the parameters with the formulas below using gradient descent (α is the learning rate):

W[ℓ] = W[ℓ] − α dW[ℓ]
b[ℓ] = b[ℓ] − α db[ℓ]
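The update step can be sketched as follows, assuming the parameters and gradients are kept in dictionaries keyed "W1", "b1", ..., "dW1", "db1", ... (an illustrative layout, not from any particular framework):

```python
import numpy as np

def update_parameters(params, grads, learning_rate=0.01):
    """Gradient descent step: subtract learning_rate times each gradient from its parameter."""
    L = len(params) // 2          # params holds one W and one b per layer
    for l in range(1, L + 1):
        params["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        params["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return params
```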
Summary
By now the whole BP algorithm should be clearer: it simply repeats the steps listed at the beginning. After each iteration, the cost should decrease.
Actually, the number of iterations is a hyperparameter of the gradient descent algorithm in open-source ML frameworks such as TensorFlow. The more iterations we run, the lower the cost may get, the better the parameters fit, and the higher the prediction accuracy on the training set - but too many iterations may also cause overfitting.
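Putting the four steps together, here is a minimal end-to-end sketch: a single-layer, sigmoid-output network trained on toy data (all data and names are illustrative; a real L-layer network would loop the FP/BP helpers over every layer):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(2, 50)                        # 2 features, 50 examples
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)     # toy labels
W = np.random.randn(1, 2) * 0.01                  # small random init
b = np.zeros((1, 1))
m = X.shape[1]

costs = []
for i in range(200):
    # 1. Forward propagation (formulas (1) and (2), sigmoid output)
    Z = np.dot(W, X) + b
    A = 1 / (1 + np.exp(-Z))
    # 2. Compute cost (formula (3))
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    costs.append(cost)
    # 3. Backward propagation (for sigmoid + cross-entropy, dZ simplifies to A - Y)
    dZ = A - Y
    dW = np.dot(dZ, X.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    # 4. Update parameters (gradient descent, learning rate 0.5)
    W -= 0.5 * dW
    b -= 0.5 * db
```

After the loop, `costs` should be decreasing across iterations, which is the easiest sanity check that the four steps are wired together correctly.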