Implementing neural-network gradient descent in Python: back-propagation formulas for a multi-layer neural network (using stochastic gradient descent)

Using the notations from Backpropagation calculus | Deep learning, chapter 4, I have this back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:

def sigmoid_prime(z):
    return z * (1 - z)  # because σ'(x) = σ(x) (1 - σ(x)), and z here is already an activation σ(x)
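
For context: train below is a method of a class that holds self.weights and self.learning_rate, and the forward pass uses a sigmoid helper that is not shown in the question. A minimal sketch of that surrounding context (the class name Network, the layer sizes and the initialization are my own assumptions, not part of the original code) could look like:

import numpy as np

def sigmoid(x):
    # standard logistic function; note that sigmoid_prime above expects
    # its *output* (an activation), not the pre-activation x
    return 1.0 / (1.0 + np.exp(-x))

class Network:
    def __init__(self, sizes=(784, 64, 32, 10), learning_rate=0.1):
        # one weight matrix per layer transition (3 matrices for a 4-layer net);
        # the layer sizes and the scaled Gaussian initialization are arbitrary choices
        self.weights = [0.1 * np.random.randn(sizes[k + 1], sizes[k])
                        for k in range(len(sizes) - 1)]
        self.learning_rate = learning_rate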

def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T

    # forward pass
    A = [a]
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)
    # Now A has 4 elements: the input vector + the 3 output vectors

    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1)
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)

It works, but:

1. The accuracy at the end (for my use case: MNIST digit recognition) is just OK, but not very good. It is much better (i.e. the convergence is much better) when line (1) is replaced by:

   delta = np.dot(self.weights[k].T, delta)  # (2)

2. The reference code of the other article quoted further down (the one using output_errors and weights_matrices) also does the equivalent of:

   delta = np.dot(self.weights[k].T, delta)

   instead of:

   delta = np.dot(self.weights[k].T, tmp)

   (With the notations of that article, it is:

   output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)

   )

These 2 arguments seem to be concordant: code (2) appears to be better than code (1).

However, the math seems to show the contrary (see the video mentioned above; one more detail: my loss function is multiplied by 1/2, whereas it is not in the video):

Question: which implementation is correct, (1) or (2)?

In LaTeX:

$$C = \frac{1}{2} (a^L - y)^2$$

$$a^L = \sigma(\underbrace{w^L a^{L-1} + b^L}_{z^L}) = \sigma(z^L)$$

$$\frac{\partial{C}}{\partial{w^L}} = \frac{\partial{z^L}}{\partial{w^L}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=a^{L-1} \sigma'(z^L)(a^L-y)$$

$$\frac{\partial{C}}{\partial{a^{L-1}}} = \frac{\partial{z^L}}{\partial{a^{L-1}}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=w^L \sigma'(z^L)(a^L-y)$$

$$\frac{\partial{C}}{\partial{w^{L-1}}} = \frac{\partial{z^{L-1}}}{\partial{w^{L-1}}} \frac{\partial{a^{L-1}}}{\partial{z^{L-1}}} \frac{\partial{C}}{\partial{a^{L-1}}}=a^{L-2} \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L-y)$$
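
For reference, the same chain rule, written for a generic layer $l$ with $\odot$ denoting the element-wise (Hadamard) product, gives the standard back-propagation recursion:

$$\delta^L = (a^L - y) \odot \sigma'(z^L), \qquad \delta^{l} = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^{l})$$

$$\frac{\partial{C}}{\partial{w^{l}}} = \delta^{l} \, (a^{l-1})^T$$

(Here $\delta^L$ is what the code stores in tmp at the first iteration of the backward loop.)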

Solution

I spent two days analyzing this problem and filled a few notebook pages with partial-derivative computations... and I can confirm:

the maths written in LaTeX in the question are correct

the code (1) is the correct one, and it agrees with the math computations:

delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, tmp)
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)

code (2) is wrong:

delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, delta)  # WRONG HERE
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)

Accordingly, with the notations of the other article mentioned in the question, the line

output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)

should be

output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))
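
A quick, independent way to confirm this is a finite-difference gradient check on a tiny network: compare the analytic gradients produced by the loop of code (1) with a numerical estimate of the derivatives of the cost. The sketch below is my own addition (the layer sizes, helper names and tolerances are arbitrary), not part of the original question or answer:

import numpy as np

np.random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(a):
    # a is already sigmoid(x)
    return a * (1.0 - a)

def forward(weights, x):
    # returns the list of activations, input included
    A = [x]
    a = x
    for w in weights:
        a = sigmoid(np.dot(w, a))
        A.append(a)
    return A

def cost(weights, x, y):
    a = forward(weights, x)[-1]
    return 0.5 * np.sum((a - y) ** 2)

def analytic_grads(weights, x, y):
    # back-propagation exactly as in code (1)
    A = forward(weights, x)
    grads = [None] * len(weights)
    delta = A[-1] - y
    for k in reversed(range(len(weights))):
        tmp = delta * sigmoid_prime(A[k + 1])
        delta = np.dot(weights[k].T, tmp)
        grads[k] = np.dot(tmp, A[k].T)
    return grads

def numeric_grads(weights, x, y, eps=1e-6):
    # central finite differences, one weight at a time
    grads = []
    for w in weights:
        g = np.zeros_like(w)
        for idx in np.ndindex(*w.shape):
            w[idx] += eps
            c_plus = cost(weights, x, y)
            w[idx] -= 2 * eps
            c_minus = cost(weights, x, y)
            w[idx] += eps  # restore the original value
            g[idx] = (c_plus - c_minus) / (2 * eps)
        grads.append(g)
    return grads

# tiny 4-layer network: 5 -> 4 -> 3 -> 2 (sizes are arbitrary)
sizes = [5, 4, 3, 2]
weights = [0.5 * np.random.randn(sizes[k + 1], sizes[k]) for k in range(3)]
x = np.random.randn(sizes[0], 1)
y = np.random.rand(sizes[-1], 1)

for ga, gn in zip(analytic_grads(weights, x, y), numeric_grads(weights, x, y)):
    print(np.max(np.abs(ga - gn)))  # should be tiny (~1e-9 or smaller)

If the delta-update line inside analytic_grads is changed to np.dot(weights[k].T, delta), i.e. code (2), the check still passes for the last layer (its tmp is computed from the correct delta) but fails for the lower layers.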

Now the difficult part that took me days to realize:

Apparently, code (2) converges far better than code (1); that is why I was misled into thinking code (2) was correct and code (1) was wrong...

...but in fact this is just a coincidence, because the learning_rate was set too low. Here is the reason: with code (2), delta grows much faster than with code (1) (printing np.linalg.norm(delta) helps to see this).

Thus the incorrect code (2) simply compensated for the too-low learning rate with a larger delta, which in some cases led to an apparently faster convergence.

Now solved!
