Gradient Checking

Sometimes, after we finish implementing backward propagation, we cannot be sure whether our implementation is actually correct (this post only concerns networks written by hand; if you simply use an existing framework you do not need to worry about this). So we usually run a gradient check to verify the backprop. Gradient checking simply means applying the definition of the derivative to compute the gradients of W and b numerically, then comparing them with the gradients produced by backprop. If the difference falls within a very small tolerance, we can conclude that our backprop implementation is correct.

First, recall the definition of the derivative:

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

When applying this to gradient checking in a neural network, we cannot actually let $\Delta x \to 0$; we can only pick a fairly small number. To make the result more accurate, we use the two-sided (centered) form of the formula above:

$$f'(x) \approx \frac{f(x + \varepsilon) - f(x - \varepsilon)}{2\varepsilon}$$

Setting $\varepsilon = 10^{-7}$ is usually good enough.
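
To see why the centered form is preferred, here is a tiny sketch; the function and the numbers are made up purely for illustration:

# for f(x) = x^2 at x = 3 the true derivative is 6; the centered difference
# is far more accurate than the one-sided one for the same epsilon
def f(x):
	return x ** 2

x, eps = 3.0, 1e-7
one_sided = (f(x + eps) - f(x)) / eps              # error on the order of eps
two_sided = (f(x + eps) - f(x - eps)) / (2 * eps)  # error on the order of eps^2
print(abs(one_sided - 6.0), abs(two_sided - 6.0))
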
In a neural network, the gradient-checking procedure usually goes as follows:

  1. Flatten $W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}$ into a single vector $\theta$.
  2. Likewise flatten $dW^{[1]}, db^{[1]}, \ldots, dW^{[L]}, db^{[L]}$ into a single vector $d\theta$.
  3. Apply the derivative definition to each component of $\theta$ to build $d\theta_{approx}$:
     $$d\theta_{approx}[i] = \frac{J(\theta_1, \ldots, \theta_i + \varepsilon, \ldots) - J(\theta_1, \ldots, \theta_i - \varepsilon, \ldots)}{2\varepsilon}$$
  4. Compare $d\theta_{approx}$ and $d\theta$ to see whether they are roughly equal, mainly by computing the normalized Euclidean distance between the two vectors:
     $$difference = \frac{\lVert d\theta_{approx} - d\theta \rVert_2}{\lVert d\theta_{approx} \rVert_2 + \lVert d\theta \rVert_2}$$

The threshold is usually set to $10^{-7}$: if the difference is smaller than the threshold, the backprop implementation can be considered correct.


Let's look at the code for each of these steps.

1. Flatten $W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}$ into a vector $\theta$:

# convert the parameters dictionary into a single vector
import numpy as np

def dictionary_to_vector(parameters):
	"""
	Roll all our parameters dictionary into a single vector satisfying our specific required shape.
	"""
	count = 0
	for key in parameters:
		# flatten the parameter matrix into a column vector
		new_vector = np.reshape(parameters[key], (-1, 1))
		if count == 0:  # first parameter: start a new vector
			theta = new_vector
		else:
			theta = np.concatenate((theta, new_vector), axis=0)  # append to the existing vector
		count = count + 1

	return theta
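
As a quick sanity check, one might feed in a tiny hand-made parameters dictionary; the shapes below are just an illustrative assumption:

# hypothetical single-layer network: W1 has shape (1, 2), b1 has shape (1, 1)
params = {"W1": np.array([[0.1, 0.2]]), "b1": np.array([[0.3]])}
theta = dictionary_to_vector(params)
print(theta.shape)    # (3, 1): the 2 weights and 1 bias stacked into one column vector
print(theta.ravel())  # [0.1 0.2 0.3]
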

2. Likewise flatten $dW^{[1]}, db^{[1]}, \ldots, dW^{[L]}, db^{[L]}$ into a vector $d\theta$.
Note: be careful here, because the gradients dictionary returned by backprop is stored in the order {dWL, dbL, ..., dW2, db2, dW1, db1}. Since we will later compute a Euclidean distance element by element against $\theta$, the order must be converted to [dW1, db1, ..., dWL, dbL]. I got bitten by this and spent a long time tracking down the bug.

# convert the gradients dictionary into a single vector
def gradients_to_vector(gradients):
	"""
	Roll all our gradients dictionary into a single vector satisfying our specific required shape.
	"""
	# The gradients are stored in the order {dWL, dbL, ..., dW2, db2, dW1, db1},
	# so rebuild the key order as [dW1, db1, ..., dWL, dbL] to match theta element-wise
	# when we later compute the Euclidean distance.
	L = len(gradients) // 2
	keys = []
	for l in range(L):
		keys.append("dW" + str(l + 1))
		keys.append("db" + str(l + 1))
	count = 0
	for key in keys:
		# flatten the gradient matrix into a column vector
		new_vector = np.reshape(gradients[key], (-1, 1))
		if count == 0:  # first gradient: start a new vector
			theta = new_vector
		else:
			theta = np.concatenate((theta, new_vector), axis=0)  # append to the existing vector
		count = count + 1

	return theta
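
A tiny illustration of the reordering; the gradient values below are made up for the example:

# gradients as backprop would store them: deepest layer first
grads = {"dW2": np.array([[0.5]]), "db2": np.array([[0.6]]),
         "dW1": np.array([[0.1, 0.2]]), "db1": np.array([[0.3]])}
dtheta = gradients_to_vector(grads)
print(dtheta.ravel())  # [0.1 0.2 0.3 0.5 0.6]: reordered to dW1, db1, dW2, db2
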

The code for steps 3 and 4 is as follows:

def gradient_check(parameters, gradients, X, Y, layer_dims, epsilon=1e-7):
	"""
	Checks whether backward propagation computes the gradient of the cost produced by forward propagation correctly.

	Arguments:
	parameters -- python dictionary containing the parameters "W1", "b1", ..., "WL", "bL"
	gradients -- output of backward propagation, contains the gradients of the cost with respect to the parameters
	X -- input data, of shape (input size, number of examples)
	Y -- true "label"
	layer_dims -- the layer dimensions of the network
	epsilon -- tiny shift applied to each parameter to compute the approximated gradient

	Returns:
	difference -- normalized Euclidean distance between the approximated gradient and the backward propagation gradient
	"""

	parameters_vector = dictionary_to_vector(parameters)
	grad = gradients_to_vector(gradients)
	num_parameters = parameters_vector.shape[0]
	J_plus = np.zeros((num_parameters, 1))
	J_minus = np.zeros((num_parameters, 1))
	gradapprox = np.zeros((num_parameters, 1))

	# Compute gradapprox component by component
	for i in range(num_parameters):
		# cost with the i-th parameter shifted by +epsilon
		thetaplus = np.copy(parameters_vector)
		thetaplus[i] = thetaplus[i] + epsilon
		AL, _ = forward_propagation(X, vector_to_dictionary(thetaplus, layer_dims))
		J_plus[i] = compute_cost(AL, Y)

		# cost with the i-th parameter shifted by -epsilon
		thetaminus = np.copy(parameters_vector)
		thetaminus[i] = thetaminus[i] - epsilon
		AL, _ = forward_propagation(X, vector_to_dictionary(thetaminus, layer_dims))
		J_minus[i] = compute_cost(AL, Y)

		# centered-difference approximation of the i-th gradient component
		gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)

	# normalized Euclidean distance between the two gradient vectors
	numerator = np.linalg.norm(grad - gradapprox)
	denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
	difference = numerator / denominator

	if difference > 2e-7:
		print(
			"\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
	else:
		print(
			"\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

	return difference

Each time we evaluate the cost here, we also need to convert the vector back into the parameter matrices. The implementation is as follows:

# convert a vector back into the parameters dictionary
def vector_to_dictionary(theta, layer_dims):
	"""
	Unroll all our parameters dictionary from a single vector satisfying our specific required shape.
	"""
	parameters = {}
	L = len(layer_dims)  # the number of layers in the network
	start = 0
	end = 0
	for l in range(1, L):
		# the next layer_dims[l] * layer_dims[l-1] entries belong to W{l}
		end += layer_dims[l] * layer_dims[l - 1]
		parameters["W" + str(l)] = theta[start:end].reshape((layer_dims[l], layer_dims[l - 1]))
		start = end
		# the next layer_dims[l] entries belong to b{l}
		end += layer_dims[l] * 1
		parameters["b" + str(l)] = theta[start:end].reshape((layer_dims[l], 1))
		start = end
	return parameters
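
A simple round trip is a useful sanity check that the two conversion helpers agree with each other; the layer sizes here are an assumption made up for the example:

layer_dims = [2, 3, 1]  # hypothetical 2-3-1 network
params = {"W1": np.random.randn(3, 2), "b1": np.zeros((3, 1)),
          "W2": np.random.randn(1, 3), "b2": np.zeros((1, 1))}
recovered = vector_to_dictionary(dictionary_to_vector(params), layer_dims)
print(all(np.allclose(params[k], recovered[k]) for k in params))  # True
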

To test whether the backprop we implemented is actually correct (and, at the same time, whether our gradient-checking code is correct), we again use the breast_cancer dataset that ships with sklearn. The test output is:

> Your backward propagation works perfectly fine! difference = 5.649104934345307e-11

As we can see, the gap between the gradient computed by our backprop and the gradient obtained from the derivative definition is on the order of $10^{-11}$, so our backprop implementation is correct.
The complete code is on GitHub: gradient_checking.py
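
For reference, a driver along these lines could run the check above. This is only a minimal sketch: initialize_parameters, forward_propagation, backward_propagation and compute_cost are assumed to come from the earlier posts (the GitHub file above), and their exact signatures may differ.

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data.T                     # (30, 569): features x examples
Y = data.target.reshape(1, -1)      # (1, 569)

layer_dims = [X.shape[0], 5, 3, 1]  # a made-up architecture for the check
parameters = initialize_parameters(layer_dims)   # assumed helper from earlier posts
AL, caches = forward_propagation(X, parameters)  # assumed helper from earlier posts
gradients = backward_propagation(AL, Y, caches)  # assumed helper; signature may differ
gradient_check(parameters, gradients, X, Y, layer_dims)
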


The cs231n notes also cover gradient checking. Their definition of the $relative\ error$ differs slightly from the one Ng uses, but that is not a big deal; it is just a different definition, and you only need to adjust the threshold accordingly. See: CS231n Convolutional Neural Networks for Visual Recognition.



References - Andrew Ng, Coursera: "Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization"
