Recently, while writing some RL code, I noticed an interesting phenomenon. Without further ado, here is a simplified version of the code:
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

x = tf.constant(3.0)
with tf.GradientTape(persistent=False) as tape:
    tape.watch(x)
    y = x * x * x * x
with tf.GradientTape(persistent=False) as tape:
    dy = tape.gradient(target=y, sources=x)
print(dy)
versus
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

x = tf.constant(3.0)
with tf.GradientTape(persistent=False) as tape:
    tape.watch(x)
    y = x * x * x * x
dy = tape.gradient(target=y, sources=x)
print(dy)
The results differ: the first snippet prints None, while the second prints the correct derivative. Why does this happen?
In the first snippet, wrapping the gradient call in a second tape rebinds the name tape, so it no longer refers to the same object:
<tensorflow.python.eager.backprop.GradientTape object at 0x7fbfd1303110>
<tensorflow.python.eager.backprop.GradientTape object at 0x7fbfd1022290>
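The name shadowing can be isolated in a tiny standalone example (a sketch of the same situation; first_tape is an extra reference I added purely for illustration):

```python
import tensorflow as tf

x = tf.constant(3.0)

with tf.GradientTape() as tape:
    tape.watch(x)       # constants are not watched automatically
    y = x * x * x * x

first_tape = tape       # keep a handle on the tape that actually recorded y

with tf.GradientTape() as tape:
    pass                # rebinds the name `tape` to a brand-new, empty tape

g_old = first_tape.gradient(y, x)  # the original tape still has the recording
g_new = tape.gradient(y, x)        # the new tape never saw y or x
print(g_old)  # tf.Tensor(108.0, shape=(), dtype=float32)
print(g_new)  # None
```

So the recording itself is not lost; it simply lives on the first tape object, and the rebound name points at a tape that recorded nothing.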
As the printed addresses show, the two tapes are different objects, so the second tape never recorded the computation above it, and tape.gradient comes back as None. I ran into exactly this problem inside a fairly complicated double loop in my RL code, shown below:
import gym
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Sequential, optimizers
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import numpy as np
import matplotlib.pyplot as plt

learning_rate = 0.001
gamma = 0.98
x = []
y = []

class Policy(keras.Model):
    def __init__(self):
        super(Policy, self).__init__()
        self.layer = layers.Dense(128, kernel_initializer='he_normal')
        self.out = layers.Dense(2, kernel_initializer='he_normal')
        self.data = []
        self.optimizer = optimizers.Adam(learning_rate=learning_rate)

    def call(self, inputs):
        out1 = self.layer(inputs)
        out2 = self.out(out1)
        return out2

    def put_data(self, item):
        self.data.append(item)

    def train_net(self, tape, score, step):
        R = 0
        for r, prob, a, t in self.data[::-1]:  # consume the buffer in reverse order
            label = tf.one_hot(a, depth=2)
            R = r + gamma ** (t - 1) * R
            prob = tf.constant(prob)
            label = tf.expand_dims(label, axis=0)
            loss = tf.compat.v1.losses.sigmoid_cross_entropy(label, prob, weights=r)
            loss = (score / step) * loss
            with tape.stop_recording():
                grads = tape.gradient(loss, self.trainable_variables)
                self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.data = []

pi = Policy()
for _ in range(10000):
    score = 0
    step = 0
    env = gym.make("CartPole-v1")
    s = env.reset()  # reset the environment; returns the initial observation
    with tf.GradientTape(persistent=True) as tape:
        print(tape)
        for t in range(1, 501):
            s = tf.constant(s, dtype=tf.float32)
            s = tf.expand_dims(s, axis=0)
            prob = pi(s)
            a = tf.random.categorical(prob, 1)[0]
            a = int(a)
            s_prime, r, done, info = env.step(a)
            pi.put_data((r, prob, a, t))
            s = s_prime
            score += r
            step += 1
            if done:
                break
    pi.train_net(tape, score, step)
    del tape
    print(_, score)
    x.append(_)
    y.append(score)
plt.plot(x, y, linewidth=5)
plt.show()
If an episode hits the break, the inner loop ends immediately, and self.data can still be holding entries that were never passed to the train function. The next outer iteration then appends the new episode's data on top of them, so the list contains data from two episodes when it reaches train_net. Since the buffer is consumed in reverse, grabbing the newer data from the current outer iteration works fine, but the moment it reaches the previous episode's data things go wrong: by then the tape in the outer loop has been recreated, it is no longer the tape from the previous iteration, so those operations were never watched by the current tape at all, and the gradient comes back as None.
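This leftover-data failure mode can be reproduced without gym in a few lines (a minimal sketch; w and data stand in for the policy weights and self.data, and the two with blocks stand in for two outer-loop iterations):

```python
import tensorflow as tf

w = tf.Variable(2.0)
data = []  # survives across episodes, like self.data above

# "Episode" 1: record an op but, as after an early break, never train on it
with tf.GradientTape() as tape:
    data.append(w * w)

# "Episode" 2: `tape` is rebound to a fresh tape, yet the stale entry remains
with tf.GradientTape(persistent=True) as tape:
    data.append(3.0 * w)

g_new = tape.gradient(data[1], w)  # recorded by the current tape
g_old = tape.gradient(data[0], w)  # recorded by the dead tape
print(g_new)  # tf.Tensor(3.0, shape=(), dtype=float32)
print(g_old)  # None
del tape
```

The practical fix, then, is to make sure the buffer never carries entries recorded under a tape that no longer exists, for example by emptying self.data before each new tape starts recording, or by training on an episode's data inside the same with block that recorded it.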