Implementing neural-network gradient descent in Python: back-propagation formulas for a multi-layer neural network (using stochastic gradient descent)

Using the notations from Backpropagation calculus | Deep learning, chapter 4, I have this back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:

def sigmoid_prime(z):
    return z * (1 - z)  # because σ'(x) = σ(x) (1 - σ(x)), and z here is already an activation σ(x)
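
For context: train below is a method of a class that holds self.weights and self.learning_rate, and the forward pass uses a sigmoid helper that is not shown in the question. A minimal sketch of that surrounding context (the class name Network, the layer sizes and the initialization are my own assumptions, not part of the original code) could look like:

import numpy as np

def sigmoid(x):
    # standard logistic function; note that sigmoid_prime above expects
    # its *output* (an activation), not the pre-activation x
    return 1.0 / (1.0 + np.exp(-x))

class Network:
    def __init__(self, sizes=(784, 64, 32, 10), learning_rate=0.1):
        # one weight matrix per layer transition (3 matrices for a 4-layer net);
        # the layer sizes and the scaled Gaussian initialization are arbitrary choices
        self.weights = [0.1 * np.random.randn(sizes[k + 1], sizes[k])
                        for k in range(len(sizes) - 1)]
        self.learning_rate = learning_rate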

def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T

    # forward pass
    A = [a]
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)
    # Now A has 4 elements: the input vector + the 3 output vectors

    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1)
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)

It works, but:

1. The accuracy at the end (for my use case: MNIST digit recognition) is just OK, but not very good. It is much better (i.e. the convergence is much better) when line (1) is replaced by:

   delta = np.dot(self.weights[k].T, delta)  # (2)

2. The reference code of the other article quoted further down (the one using output_errors and weights_matrices) also does the equivalent of:

   delta = np.dot(self.weights[k].T, delta)

   instead of:

   delta = np.dot(self.weights[k].T, tmp)

   (With the notations of that article, it is:

   output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)

   )

These 2 arguments seem to be concordant: code (2) appears to be better than code (1).

However, the math seems to show the contrary (see the video mentioned above; one more detail: my loss function is multiplied by 1/2, whereas it is not in the video):

Question: which implementation is correct, (1) or (2)?

In LaTeX:

$$C = \frac{1}{2} (a^L - y)^2$$

$$a^L = \sigma(\underbrace{w^L a^{L-1} + b^L}_{z^L}) = \sigma(z^L)$$

$$\frac{\partial{C}}{\partial{w^L}} = \frac{\partial{z^L}}{\partial{w^L}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=a^{L-1} \sigma'(z^L)(a^L-y)$$

$$\frac{\partial{C}}{\partial{a^{L-1}}} = \frac{\partial{z^L}}{\partial{a^{L-1}}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=w^L \sigma'(z^L)(a^L-y)$$

$$\frac{\partial{C}}{\partial{w^{L-1}}} = \frac{\partial{z^{L-1}}}{\partial{w^{L-1}}} \frac{\partial{a^{L-1}}}{\partial{z^{L-1}}} \frac{\partial{C}}{\partial{a^{L-1}}}=a^{L-2} \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L-y)$$
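
For reference, the same chain rule, written for a generic layer $l$ with $\odot$ denoting the element-wise (Hadamard) product, gives the standard back-propagation recursion:

$$\delta^L = (a^L - y) \odot \sigma'(z^L), \qquad \delta^{l} = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^{l})$$

$$\frac{\partial{C}}{\partial{w^{l}}} = \delta^{l} \, (a^{l-1})^T$$

(Here $\delta^L$ is what the code stores in tmp at the first iteration of the backward loop.)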

Solution

I spent two days analyzing this problem and filled a few notebook pages with partial-derivative computations... and I can confirm:

the maths written in LaTeX in the question are correct

the code (1) is the correct one, and it agrees with the math computations:

delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, tmp)
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)

code (2) is wrong:

delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, delta)  # WRONG HERE
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)

Accordingly, with the notations of the other article mentioned in the question, the line

output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)

should be

output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))
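
A quick, independent way to confirm this is a finite-difference gradient check on a tiny network: compare the analytic gradients produced by the loop of code (1) with a numerical estimate of the derivatives of the cost. The sketch below is my own addition (the layer sizes, helper names and tolerances are arbitrary), not part of the original question or answer:

import numpy as np

np.random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(a):
    # a is already sigmoid(x)
    return a * (1.0 - a)

def forward(weights, x):
    # returns the list of activations, input included
    A = [x]
    a = x
    for w in weights:
        a = sigmoid(np.dot(w, a))
        A.append(a)
    return A

def cost(weights, x, y):
    a = forward(weights, x)[-1]
    return 0.5 * np.sum((a - y) ** 2)

def analytic_grads(weights, x, y):
    # back-propagation exactly as in code (1)
    A = forward(weights, x)
    grads = [None] * len(weights)
    delta = A[-1] - y
    for k in reversed(range(len(weights))):
        tmp = delta * sigmoid_prime(A[k + 1])
        delta = np.dot(weights[k].T, tmp)
        grads[k] = np.dot(tmp, A[k].T)
    return grads

def numeric_grads(weights, x, y, eps=1e-6):
    # central finite differences, one weight at a time
    grads = []
    for w in weights:
        g = np.zeros_like(w)
        for idx in np.ndindex(*w.shape):
            w[idx] += eps
            c_plus = cost(weights, x, y)
            w[idx] -= 2 * eps
            c_minus = cost(weights, x, y)
            w[idx] += eps  # restore the original value
            g[idx] = (c_plus - c_minus) / (2 * eps)
        grads.append(g)
    return grads

# tiny 4-layer network: 5 -> 4 -> 3 -> 2 (sizes are arbitrary)
sizes = [5, 4, 3, 2]
weights = [0.5 * np.random.randn(sizes[k + 1], sizes[k]) for k in range(3)]
x = np.random.randn(sizes[0], 1)
y = np.random.rand(sizes[-1], 1)

for ga, gn in zip(analytic_grads(weights, x, y), numeric_grads(weights, x, y)):
    print(np.max(np.abs(ga - gn)))  # should be tiny (~1e-9 or smaller)

If the delta-update line inside analytic_grads is changed to np.dot(weights[k].T, delta), i.e. code (2), the check still passes for the last layer (its tmp is computed from the correct delta) but fails for the lower layers.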

Now the difficult part that took me days to realize:

Apparently, code (2) converges far better than code (1); that is why I was misled into thinking code (2) was correct and code (1) was wrong...

...but in fact this is just a coincidence, because the learning_rate was set too low. Here is the reason: with code (2), delta grows much faster than with code (1) (printing np.linalg.norm(delta) helps to see this).

Thus the incorrect code (2) simply compensated for the too-low learning rate with a larger delta, which in some cases led to an apparently faster convergence.

Now solved!
