The Various Ways to Handle Gradients in TensorFlow

edward_zcl

Article link: https://blog.csdn.net/edward_zcl/article/details/90345318

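I have been wanting to write my own op by hand for a while. That seems to require understanding TensorFlow's rules for custom APIs/ops: designing the forward and backward passes, registering names, taking care of interfaces and file organization, and probably recompiling before the op can be used. I remember the TensorFlow website (possibly an older version) covering this, but I did not study it carefully at the time, and it may not have been written very clearly; I plan to write a separate post on it later. This post covers how to get as much freedom as possible in computing and processing gradients without defining a custom op.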

1. `tf.gradients`, `tf.stop_gradient()`, and higher-order derivatives

For this part, see:
https://blog.csdn.net/u012436149/article/details/53905797
https://blog.csdn.net/u012871493/article/details/71841709
https://blog.csdn.net/Invokar/article/details/86565232

gradient

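TensorFlow has a function for computing gradients, tf.gradients(ys, xs). Note that every x in xs must be connected to ys in the graph: for an unconnected x, tf.gradients returns None, and fetching that None with sess.run raises an error.
The code below defines two variables, w1 and w2, but res depends only on w1.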

# wrong: w2 is not connected to res, so its gradient is None
import tensorflow as tf

w1 = tf.Variable([[1,2]])
w2 = tf.Variable([[3,4]])

res = tf.matmul(w1, [[2],[1]])

grads = tf.gradients(res,[w1,w2])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    re = sess.run(grads)
    print(re)

Error message:
TypeError: Fetch argument None has invalid type

# right
import tensorflow as tf

w1 = tf.Variable([[1,2]])
w2 = tf.Variable([[3,4]])

res = tf.matmul(w1, [[2],[1]])

grads = tf.gradients(res,[w1])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    re = sess.run(grads)
    print(re)
#  [array([[2, 1]], dtype=int32)]

Testing grad_ys:

import tensorflow as tf

w1 = tf.get_variable('w1', shape=[3])
w2 = tf.get_variable('w2', shape=[3])

w3 = tf.get_variable('w3', shape=[3])
w4 = tf.get_variable('w4', shape=[3])

z1 = w1 + w2+ w3
z2 = w3 + w4

grads = tf.gradients([z1, z2], [w1, w2, w3, w4], grad_ys=[tf.convert_to_tensor([2.,2.,3.]),
                                                          tf.convert_to_tensor([3.,2.,4.])])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(grads))

[array([ 2.,  2.,  3.],dtype=float32),
 array([ 2.,  2.,  3.], dtype=float32), 
 array([ 5.,  4.,  7.], dtype=float32), 
 array([ 3.,  2.,  4.], dtype=float32)]

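As the output shows, grad_ys is the head gradient of ys: the incoming gradient that each y's local derivative is multiplied by before the contributions are summed. Here w3 feeds both z1 and z2, so it receives [2, 2, 3] + [3, 2, 4] = [5, 4, 7].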
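To make the role of grad_ys concrete, here is a minimal pure-Python sketch (no TensorFlow) that reproduces the numbers above. The helper name `weighted_grads` and the `depends_on` table are made up for illustration. The sketch relies on the fact that z1 = w1 + w2 + w3 and z2 = w3 + w4 are plain sums, so every local derivative is 1 and each input simply collects the head gradients of the outputs it feeds.

```python
# Which inputs feed which outputs: z1 = w1 + w2 + w3, z2 = w3 + w4.
depends_on = {
    "z1": ["w1", "w2", "w3"],
    "z2": ["w3", "w4"],
}

# Head gradients, one vector per output (same values as grad_ys above).
grad_ys = {
    "z1": [2.0, 2.0, 3.0],
    "z2": [3.0, 2.0, 4.0],
}


def weighted_grads(depends_on, grad_ys, xs):
    """Sum the head gradient of every output that an input contributes to.

    Hypothetical helper: it only handles pure sums, where each local
    derivative dz/dw is 1.
    """
    size = len(next(iter(grad_ys.values())))
    grads = {x: [0.0] * size for x in xs}
    for y, inputs in depends_on.items():
        for x in inputs:
            grads[x] = [g + h for g, h in zip(grads[x], grad_ys[y])]
    return grads


grads = weighted_grads(depends_on, grad_ys, ["w1", "w2", "w3", "w4"])
for name in ["w1", "w2", "w3", "w4"]:
    print(name, grads[name])
```

If grad_ys is left out, tf.gradients uses a tensor of ones for each y, which in this sketch corresponds to setting every entry of `grad_ys` to 1.0.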

tf.stop_gradient()

It blocks the back-propagated gradient at a node.

import tensorflow as tf

w1 = tf.Variable(2.0)
w2 = tf.Variable(2.0)

a = tf.multiply(w1, 3.0)
a_stoped = tf.stop_gradient(a)

# b=w1*3.0*w2
b = tf.multiply(a_stoped, w2)
gradients = tf.gradients(b, xs=[w1, w2])
print(gradients)
# output
#[None, <tf.Tensor 'gradients/Mul_1_grad/Reshape_1:0' shape=() dtype=float32>]

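So once a node is stopped, the gradient at that node can no longer be back-propagated past it. Since w1's gradient can only come through node a, the gradient computed for w1 is None.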

import tensorflow as tf

x = tf.constant([2.0, 2.1, 3.2, 4.1])
G = tf.get_default_graph()
# Inside this block the Sign op back-propagates as Identity (a straight-through
# estimator), and E is treated as a constant because of tf.stop_gradient.
with G.gradient_override_map({"Sign": "Identity"}):
    E = tf.stop_gradient(tf.reduce_mean(tf.abs(x)))
    y = tf.sign(x / E)

grad = tf.gradients(y, x)
with tf.Session() as sess:
    print(sess.run(grad))  # every entry is 1 / E = 1 / 2.85, about 0.351

import tensorflow as tf

a = tf.Variable(1.0)
b = tf.Variable(1.0)

c = tf.add(a, b)

c_stoped = tf.stop_gradient(c)

d = tf.add(a, b)

e = tf.add(c_stoped, d)

gradients = tf.gradients(e, xs=[a, b])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(gradients))  # [1.0, 1.0]: only the path through d contributes

Although node c is stopped, a and b still receive gradients through d, so gradient values are still produced.
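The two behaviours above (None when the only path is stopped, a normal value when another path exists) can be reproduced with a tiny reverse-mode sketch in pure Python. Everything here (`Node`, `add`, `mul`, `stop_gradient`, `gradients`) is a hypothetical stand-in written for this illustration, not TensorFlow's implementation; it walks every path from the output back to the inputs, which is fine for small graphs.

```python
class Node:
    """A scalar value in a tiny computation graph (illustrative only)."""

    def __init__(self, value, parents=()):
        self.value = value
        # parents: list of (parent_node, local_derivative) pairs
        self.parents = list(parents)


def add(x, y):
    return Node(x.value + y.value, [(x, 1.0), (y, 1.0)])


def mul(x, y):
    return Node(x.value * y.value, [(x, y.value), (y, x.value)])


def stop_gradient(x):
    # Same forward value, but no edge back to x for the backward pass.
    return Node(x.value)


def gradients(y, xs):
    """Return dy/dx for each x in xs, or None when no path reaches x."""
    grads = {}

    def backprop(node, upstream):
        grads[id(node)] = grads.get(id(node), 0.0) + upstream
        for parent, local in node.parents:
            backprop(parent, upstream * local)

    backprop(y, 1.0)
    return [grads.get(id(x)) for x in xs]


# First example: b = stop_gradient(w1 * 3) * w2
w1, w2 = Node(2.0), Node(2.0)
b = mul(stop_gradient(mul(w1, Node(3.0))), w2)
first = gradients(b, [w1, w2])
print(first)   # [None, 6.0]

# Second example: e = stop_gradient(a + b) + (a + b)
a, b2 = Node(1.0), Node(1.0)
e = add(stop_gradient(add(a, b2)), add(a, b2))
second = gradients(e, [a, b2])
print(second)  # [1.0, 1.0]
```

The design choice worth noticing is that a stopped node keeps its forward value and only loses its backward edges, which is why w1 ends up with no gradient at all instead of a gradient of 0.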

Beyond that, some special settings such as domain adaptation, meta-learning, and reinforcement learning use it in particular ways, for example DQN:
https://blog.csdn.net/u013745804/article/details/79589514
https://blog.csdn.net/zbrwhut/article/details/83341869
In these cases the network may be run in stages, with several run calls, as in input-data preparation, GANs, and reinforcement learning.

import tensorflow as tf

w1 = tf.Variable(2.0)
w2 = tf.Variable(2.0)
a = tf.multiply(w1, 3.0)
a_stoped = tf.stop_gradient(a)

# b=w1*3.0*w2
b = tf.multiply(a_stoped, w2)

opt = tf.train.GradientDescentOptimizer(0.1)

gradients = tf.gradients(b, xs=tf.trainable_variables())

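# tf.summary.histogram(gradients[0].name, gradients[0])
# The line above is commented out because it fails: gradients[0] is None.
# Everything else runs normally, both the gradient computation and the variable update.
# I feel this design is a bit unfortunate; a zero gradient would be friendlier than None.
train_op = opt.apply_gradients(zip(gradients, tf.trainable_variables()))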

print(gradients)
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(train_op))
    print(sess.run([w1, w2]))

Higher-order derivatives

Higher-order derivatives in TensorFlow can be computed by applying tf.gradients repeatedly.

import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.constant(1.)
    b = tf.pow(a, 2)
    grad = tf.gradients(ys=b, xs=a) # first derivative
    print(grad[0])
    grad_2 = tf.gradients(ys=grad[0], xs=a) # second derivative
    grad_3 = tf.gradients(ys=grad_2[0], xs=a) # third derivative
    print(grad_3)

with tf.Session() as sess:
    print(sess.run(grad_3))

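Note: for some ops a higher-order gradient comes back as None. tf.add is an example: its first derivative is a constant that no longer depends on the input, so the graph has no path from the first-order gradient back to x, and tf.gradients returns None rather than 0.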
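For the a^2 example above, the analytic values at a = 1 are easy to check by hand: the first derivative is 2a = 2, the second is 2, and the third is 0. The following pure-Python sketch applies the same idea of differentiating repeatedly, using a coefficient list as a stand-in for the graph. The helpers `differentiate` and `evaluate` are made up for this illustration and are unrelated to TensorFlow's API.

```python
def differentiate(coeffs):
    """Differentiate c0 + c1*a + c2*a^2 + ... given as [c0, c1, c2, ...]."""
    return [k * c for k, c in enumerate(coeffs)][1:] or [0.0]


def evaluate(coeffs, a):
    """Evaluate the polynomial at the point a."""
    return sum(c * a ** k for k, c in enumerate(coeffs))


b = [0.0, 0.0, 1.0]              # b = a^2
grad = differentiate(b)          # first derivative: 2a
grad_2 = differentiate(grad)     # second derivative: 2
grad_3 = differentiate(grad_2)   # third derivative: 0

values = [evaluate(p, 1.0) for p in (grad, grad_2, grad_3)]
print(values)  # [2.0, 2.0, 0.0]
```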

This also ties in with another post of mine: https://blog.csdn.net/edward_zcl/article/details/89338166

2. Gradient clipping with `apply_gradients` and `compute_gradients`

For this part, see:
https://blog.csdn.net/hekkoo/article/details/53896598?utm_source=blogxgwz1

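This write-up came about because I wanted to use a step function as my loss function, but using it directly makes the gradient impossible to compute. While reading the TensorFlow documentation earlier, I had noticed that minimize can be viewed as compute_gradients followed by apply_gradients. In other words, we can compute the gradients first, process them, and then call apply_gradients.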

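I originally planned to implement this myself, but being new to TensorFlow I ran into many walls. Eventually, while searching on Zhihu, I found an article on gradient accumulation and asynchronous updates in distributed TensorFlow, and its code made clear how to do it.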

Definition
1
gradient_all = optimizer.compute_gradients(loss)
Compute all gradients.

2
grads_vars = [v for (g,v) in gradient_all if g is not None]
Keep the variables whose gradient can be computed.

3
gradient = optimizer.compute_gradients(loss, grads_vars)
Get the gradients that are needed.

4
grads_holder = [(tf.placeholder(tf.float32, shape=g.get_shape()), v) for (g,v) in gradient]
Create one placeholder per gradient.

5
train_op = optimizer.apply_gradients(grads_holder)
Carry on with the back-propagation algorithm, using the fed-in gradients.

Usage
1
gradient_result = sess.run(gradient, feed_dict={x:x_i,y_:y_real})
Run the graph to compute the loss and the gradients.

2
grads_dict={}
for i in range(len(gradient_result)):
    k = grads_holder[i][0] # take the placeholder, used as a feed_dict key below
    grads_dict[k] = DealTheGradientFunction(gradient_result[i][0]) # process the gradient however you like

3
_ = sess.run(train_op,feed_dict=grads_dict)
Continue by updating the weights.
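The compute, process, apply pattern above can be sketched without TensorFlow. In this toy version the three functions are hypothetical stand-ins (they only borrow the names `compute_gradients` and `apply_gradients` for readability), the model is a plain weighted sum with a squared-error loss, and clipping by global norm plays the part of the free-form processing step.

```python
import math


def compute_gradients(w, x, y_true):
    """Gradient of (sum(w_i * x_i) - y_true)^2 with respect to each weight.

    Toy stand-in for optimizer.compute_gradients.
    """
    error = sum(wi * xi for wi, xi in zip(w, x)) - y_true
    return [2.0 * error * xi for xi in x]


def clip_by_global_norm(grads, max_norm):
    """Processing step: rescale so the overall L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


def apply_gradients(w, grads, learning_rate):
    """Toy stand-in for optimizer.apply_gradients: one plain SGD update."""
    return [wi - learning_rate * g for wi, g in zip(w, grads)]


w = [1.0, 1.0]
raw = compute_gradients(w, x=[3.0, 4.0], y_true=2.0)  # error = 5, so raw = [30.0, 40.0]
clipped = clip_by_global_norm(raw, max_norm=5.0)      # norm 50 is scaled down to 5
w = apply_gradients(w, clipped, learning_rate=0.1)    # roughly [0.7, 0.6]
print(raw, clipped, w)
```

Because the processing happens between the two calls, any function of the raw gradients can be substituted for `clip_by_global_norm`, which is the freedom this section is after.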

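As shown above, we can compute the gradients first, process them as needed, and only then apply them; together these two steps give a custom gradient operation. It is still fairly limited, though: in practice it is mostly used for gradient clipping, aimed at the exploding (and vanishing) gradient problems. See also: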
https://blog.csdn.net/NockinOnHeavensDoor/article/details/80632677#%E6%A2%AF%E5%BA%A6%E4%BF%AE%E5%89%AA%E4%B8%BB%E8%A6%81%E9%81%BF%E5%85%8D%E8%AE%AD%E7%BB%83%E6%A2%AF%E5%BA%A6%E7%88%86%E7%82%B8%E5%92%8C%E6%B6%88%E5%A4%B1%E9%97%AE%E9%A2%98

Here is another well-written and comprehensive post for further study:
https://blog.csdn.net/lenbow/article/details/52218551

So compute_gradients is consistent with tf.gradients: to clip, you first obtain the gradients, process them, and then apply them.
For example:
https://blog.csdn.net/u012436149/article/details/53006953
This also reminds me of another post of mine:
https://blog.csdn.net/edward_zcl/article/details/89418268
https://blog.csdn.net/diligent_321/article/details/53130913

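In asynchronous communication, asynchronous updates, multi-GPU parallelism, and distributed training, this is often necessary, which brings optimizers into the picture. For a lower-level treatment, see: https://www.cnblogs.com/marsggbo/p/10056057.html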
https://blog.csdn.net/c2a2o2/article/details/65633147
For a higher-level treatment, see: https://blog.csdn.net/u014665013/article/details/84404204
This author's summary of the workflow is also practical and worth a look:
https://blog.csdn.net/weixin_36474809/article/details/88031720

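I also seem to have seen "gradient cancelling" somewhere, perhaps in the transfer-learning literature, or in a GitHub repository for quantized networks (Keras-based); I may also be misremembering.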

3. `tf.get_default_graph().gradient_override_map`

I have covered this several times already; see my post: https://blog.csdn.net/edward_zcl/article/details/89338166

Here is one more example.

import tensorflow as tf

sess=tf.Session()

a=tf.constant([-1.0,-0.5,0.0,0.5,1.0])

with tf.get_default_graph().gradient_override_map({"Relu": "Identity"}):
    result = tf.nn.relu(a)

grad=tf.gradients(result, a)

print(sess.run(grad))
 

import tensorflow as tf

sess=tf.Session()

a=tf.constant([-1.0,-0.5,0.0,0.5,1.0])

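# "TL_Sign_QuantizeGrad" must already be registered with tf.RegisterGradient
# (the name appears to come from TensorLayer); otherwise tf.gradients raises a LookupError.
with tf.get_default_graph().gradient_override_map({"Relu": "TL_Sign_QuantizeGrad"}):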
    result = tf.nn.relu(a)

grad=tf.gradients(result, a)

print(sess.run(grad))

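In fact, every TensorFlow operator (op) is entered in a lookup table, a map keyed by identifier; when a gradient is needed, the corresponding registered identifier is simply looked up.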

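In the example from my post you can even define your own function for back-propagating the gradient, but it is quite limited: you can only do simple processing on the gradient, much like the approach above, to stand in for a non-differentiable op or to customize how gradients flow backward.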
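The lookup-table idea can be sketched in a few lines of pure Python. This is a simplified model written for illustration, not TensorFlow's code: `GRADIENT_REGISTRY`, `register_gradient`, `relu`, and `gradient` are hypothetical names, and the sketch assumes that the override is recorded on the op when the op is created inside the `with` block.

```python
import contextlib

# Registry: gradient name -> function(inputs, upstream_grad) -> grad wrt inputs.
GRADIENT_REGISTRY = {}
# Stack of active overrides: op type -> gradient name to use instead.
_override_stack = [{}]


def register_gradient(name):
    """Decorator that stores a gradient function under a name."""
    def decorator(fn):
        GRADIENT_REGISTRY[name] = fn
        return fn
    return decorator


@contextlib.contextmanager
def gradient_override_map(mapping):
    """Ops created inside the block remember the overridden gradient name."""
    _override_stack.append({**_override_stack[-1], **mapping})
    try:
        yield
    finally:
        _override_stack.pop()


@register_gradient("Relu")
def relu_grad(x, upstream):
    return [g if v > 0 else 0.0 for v, g in zip(x, upstream)]


@register_gradient("Identity")
def identity_grad(x, upstream):
    return list(upstream)


def relu(x):
    """Create a 'Relu' op; its gradient name is resolved at creation time."""
    grad_name = _override_stack[-1].get("Relu", "Relu")
    return {"input": x, "output": [max(v, 0.0) for v in x], "grad_name": grad_name}


def gradient(op):
    """Look up the op's gradient function by name and feed it ones."""
    ones = [1.0] * len(op["output"])
    return GRADIENT_REGISTRY[op["grad_name"]](op["input"], ones)


a = [-1.0, -0.5, 0.0, 0.5, 1.0]
plain = relu(a)
with gradient_override_map({"Relu": "Identity"}):
    overridden = relu(a)

print(gradient(plain))       # [0.0, 0.0, 0.0, 1.0, 1.0]
print(gradient(overridden))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

Note that the forward output is the same in both cases; only the function used on the backward pass changes.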

4. `tf.py_func` + `tf.RegisterGradient` + `tf.get_default_graph().gradient_override_map`

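This combination is very powerful. To use it you need to understand TensorFlow's name-mapping mechanism, decorators, lambda expressions, TensorFlow and NumPy data types, and member functions and data members (and, going further, attributes, properties, and so on). With it you can apparently implement custom operations, even large network layers, without compiling anything. That leaves me curious about when compilation is still needed, or what requirements a compiled custom op serves. The official site used to cover building custom ops from source (in an older version); I will discuss that in a later post.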

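First, here is a piece of code that a junior labmate asked me about one day. After studying it for a while I understood the gist, though not every detail, and came away convinced that tf.py_func is remarkably powerful.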

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops

# define common custom relu function
#def my_relu_def(x, threshold=0.05):
#    if x<threshold:
#        return 0.0
#    else:
#        return x
# 
#def my_relu_grad_def(x, threshold=0.05):
#    if x<threshold:
#        return 0.0
#    else:
#        return 1.2
        
def my_relu_def(x, threshold1=3, threshold2=-3):
    if x<threshold2:
        return -3.0
    elif x<threshold1:
        return x
    else:
        return 3.0

def my_relu_grad_def(x, threshold1=3, threshold2=-3):
    if x<threshold2:
        return 0.0
    elif x<threshold1:
        return 1.0
    else:
        return 0.0
 
# making a common function into a numpy function
my_relu_np = np.vectorize(my_relu_def)
my_relu_grad_np = np.vectorize(my_relu_grad_def)
# numpy uses float64 but tensorflow uses float32
my_relu_np_32 = lambda x: my_relu_np(x).astype(np.float32)
my_relu_grad_np_32 = lambda x: my_relu_grad_np(x).astype(np.float32)


def my_relu_grad_tf(x, name=None):
    with ops.name_scope(name, "my_relu_grad_tf", [x]) as name:
        y = tf.py_func(my_relu_grad_np_32,
                       [x],
                       [tf.float32],
                       name=name,
                       stateful=False)
        return y[0]

def my_py_func(func, inp, Tout, stateful=False, name=None, my_grad_func=None):
    # Need to generate a unique name to avoid duplicates:
    random_name = 'PyFuncGrad' + str(np.random.randint(0, int(1e8)))
    tf.RegisterGradient(random_name)(my_grad_func)  # see _my_relu_grad for grad example
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": random_name, "PyFuncStateless": random_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)
 
# The grad function we need to pass to the above my_py_func function takes a special form:
# It needs to take in (an operation, the previous gradients before the operation)
# and propagate(i.e., return) the gradients backward after the operation.
def _my_relu_grad(op, pre_grad):
    x = op.inputs[0]
    cur_grad = my_relu_grad_tf(x)
    next_grad = pre_grad * cur_grad
    return next_grad

def my_relu_tf(x, name=None):
    with ops.name_scope(name, "my_relu_tf", [x]) as name:
        y = my_py_func(my_relu_np_32,
                       [x],
                       [tf.float32],
                       stateful=False,
                       name=name,
                       my_grad_func=_my_relu_grad)  # <-- here's the call to the gradient
        return y[0]

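with tf.Session() as sess:
    # Use floats: the py_func output and its gradient are declared as tf.float32.
    x = tf.constant([-3., -4., 1., 40.])
    y = my_relu_tf(x)
    tf.global_variables_initializer().run()
    print(x.eval())                        # values: -3, -4, 1, 40
    print(y.eval())                        # values: -3, -3, 1, 3
    print(tf.gradients(y, [x])[0].eval())  # values: 1, 0, 1, 0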
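When the backward function is written by hand, as it is here, it is worth checking it against a numerical derivative before wiring it into a graph. The sketch below does that in pure Python for the same clipped activation. The names `clipped_linear`, `clipped_linear_grad`, and `numeric_grad` are made up for this illustration, and the test points are chosen away from the kinks at -3 and 3, where a finite difference is not meaningful.

```python
def clipped_linear(x, upper=3.0, lower=-3.0):
    """Forward pass: identity inside [lower, upper), saturating outside."""
    if x < lower:
        return lower
    if x < upper:
        return x
    return upper


def clipped_linear_grad(x, upper=3.0, lower=-3.0):
    """Hand-written derivative: 1 inside [lower, upper), 0 where the output saturates."""
    return 1.0 if lower <= x < upper else 0.0


def numeric_grad(f, x, eps=1e-4):
    """Central finite difference, used only to sanity-check the hand-written gradient."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


points = [-4.0, -1.5, 1.0, 2.5, 40.0]
analytic = [clipped_linear_grad(p) for p in points]
numeric = [numeric_grad(clipped_linear, p) for p in points]
print(analytic)
print(numeric)
```

If the two lists disagree at a smooth point, the gradient function is wrong, and training through the custom op would quietly optimize the wrong objective.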

This appears to implement a custom threshold function. I do not know where it originally came from, but I found a similar one: https://blog.csdn.net/mmc2015/article/details/71250090

For tf.py_func itself, start with these:
https://www.jianshu.com/p/bac384d34c47
https://blog.csdn.net/DaVinciL/article/details/80615526
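Both links above are decent, but they disagree on one point: whether tf.py_func is part of the computation graph. I think it depends on how you use it. It can be coupled into the graph or kept independent of it, depending on whether you implement a gradient function for it; in the latter case you may need something like stop_gradient.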

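In short, this function does increase the flexibility of TensorFlow as a static-graph framework, even though TensorFlow keeps evolving and already seems to have Lite, Serving, eager execution, and even dynamic-graph support, among many other new features.
For now, let us still treat the framework as a static computation graph. See the following links for further understanding: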
https://blog.csdn.net/DaVinciL/article/details/80615526
https://blog.csdn.net/aaon22357/article/details/82996436
https://blog.csdn.net/weixin_41950276/article/details/83590058 (this author may have been using the wrong Python version)
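This one:
https://blog.csdn.net/tiankongtiankong01/article/details/80568311
is more detailed and comprehensive. It covers the basic usage of the function, how NumPy arrays and tensors interact, how the two sets of operations complement each other, and how they differ (conditionals, obtaining sizes in advance, and so on). What it leaves out is how to embed tf.py_func in our computation graph and train the model end to end. For that, see the links above, https://blog.csdn.net/mmc2015/article/details/71250090 and
https://blog.csdn.net/caorui_nk/article/details/82898200. With the explanations so far, about 99% of it should be understandable; the one remaining point is this statement: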

with g.gradient_override_map({"PyFunc": random_name, "PyFuncStateless": random_name}):
    return tf.py_func(func, inp, Tout, stateful=stateful, name=name)

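Why are two entries needed in the map? Because tf.py_func creates an op of type PyFunc when stateful=True and of type PyFuncStateless when stateful=False, so mapping both names makes the override take effect either way. The stateful parameter itself is also worth a look if you are interested; it is a bit puzzling at first.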
https://blog.csdn.net/u011436429/article/details/80420700
https://blog.csdn.net/xiaoYAN174/article/details/79090382
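From these two links you can see that, when no gradient is needed, tf.py_func is used in the RPN of Faster R-CNN specifically for complex computations. No back-propagation is involved there; it only serves to transform values.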

That's all.
