BCQ
Paper Notes
1. Key Problems
- Missing data: offline RL learns only from a fixed dataset, which cannot cover every state-action pair;
- Distribution shift: the mismatch between the learned policy and the behavior policy, which arises because offline RL, unlike online RL, cannot interact with the environment while learning.
2. Methods and Framework
2.1 Policy Constraint
If each selected state-action pair (s, a) stays as close as possible to the data in dataset D, the problems above can be alleviated. The paper therefore proposes the following objectives:
(1) minimize the distance between the selected actions and the actions present in the dataset;
(2) lead to states similar to the states observed in the dataset;
(3) maximize the value function.
The first point is the most important: only when it is satisfied can the second and third be estimated accurately.
2.1.1 Addressing Extrapolation Error
First, the transition probability $p_B$ of the MDP $M_B$ induced by the batch B is defined:
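Following the original paper, it is estimated by counting the transitions observed in the batch:

$$p_B(s' \mid s, a) = \frac{N(s, a, s')}{\sum_{\tilde{s}} N(s, a, \tilde{s})}$$

where $N(s, a, s')$ is the number of times the tuple $(s, a, s')$ appears in B.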
Next, an extrapolation error term is defined to explain where these errors come from (here $Q^\pi$ is the true value function we want, and $Q^\pi_B$ is the value function estimated from the batch):
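In the paper's notation, this is simply the gap between the two value functions:

$$\epsilon_{MDP}(s, a) = Q^\pi(s, a) - Q^\pi_B(s, a)$$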
Expanding both value functions with their Bellman equations, the error can be rewritten in the following recursive form:
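Roughly, following the derivation in the paper:

$$\epsilon_{MDP}(s,a) = \sum_{s'} \Big[ \big(p_M(s'\mid s,a) - p_B(s'\mid s,a)\big)\Big(r(s,a,s') + \gamma \sum_{a'} \pi(a'\mid s')\, Q^\pi_B(s',a')\Big) + p_M(s'\mid s,a)\, \gamma \sum_{a'} \pi(a'\mid s')\, \epsilon_{MDP}(s',a') \Big]$$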
From this form we can see that when $p_M$ and $p_B$ agree on every state-action pair visited by the policy, $\epsilon_{MDP}(s,a) = 0$, and therefore the total extrapolation error below is also zero:
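The total error is the absolute per-pair error weighted by the policy's visitation distribution $\mu_\pi$:

$$\epsilon_{MDP}(\pi) = \sum_{s} \mu_\pi(s) \sum_{a} \pi(a\mid s)\, \big|\epsilon_{MDP}(s,a)\big|$$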
With this result, if the extrapolation error can be driven to zero and the initial state $s_0$ is contained in the batch, the batch constraint can be combined with Q-learning, giving Batch-Constrained Q-Learning (BCQL) with the following update:
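As in the paper, the maximization in the target is restricted to actions that appear in the batch together with $s'$:

$$Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha\Big(r + \gamma \max_{a'\ \text{s.t.}\ (s',a') \in B} Q(s', a')\Big)$$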
Two theorems follow from this construction:
1. With learning rate $\alpha$ (under the standard stochastic-approximation conditions) and sampling from the true environment, BCQL converges to the optimal action-value function $Q^*$.
2. Given a deterministic MDP and a coherent dataset B, with learning rate $\alpha$, BCQL converges to $Q^{\pi^*}_B(s,a)$, where $\pi^*$ is the optimal batch-constrained policy.
2.2 The BCQ Algorithm
To extend BCQL to continuous action spaces, the paper proposes the BCQ algorithm. To satisfy the batch-constrained condition, BCQ uses a generative model, a VAE. For a given state, the generative model produces a set of actions similar to those in the batch, and a Q-network selects the highest-valued one among them. In addition, the value-estimation step penalizes unfamiliar future states, in a way similar to Clipped Double Q-learning. As a result, BCQ learns a policy whose state-action visitation distribution stays close to that of the dataset.
Ideally, for a given state we would choose actions according to the state-conditioned likelihood $P^G_B(a\mid s)$, i.e. how similar a candidate pair (s, a) is to the data in D, in order to reduce the errors caused by extrapolation. Estimating this likelihood directly is difficult, so a generative model is trained to approximate it: a VAE $G_\omega(s)$ samples actions that are likely under the batch, and the action with the highest Q-value estimate is then chosen among them. A perturbation model $\xi_\phi(s, a, \Phi)$ is added to increase the diversity of the candidate actions; its output is constrained to the range $[-\Phi, \Phi]$. The resulting policy is:
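In the paper's notation:

$$\pi(s) = \operatorname*{argmax}_{a_i + \xi_\phi(s, a_i, \Phi)} Q_\theta\big(s,\; a_i + \xi_\phi(s, a_i, \Phi)\big), \qquad \{a_i \sim G_\omega(s)\}_{i=1}^{n}$$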
In this expression, n and $\Phi$ determine whether the algorithm behaves more like imitation learning or like reinforcement learning. When $n = 1$ and $\Phi = 0$, the algorithm reduces to imitation learning, reproducing the behavior policy in dataset D one-to-one; as n grows and $\Phi$ approaches the full action range, BCQ becomes similar to Q-learning. The perturbation model is trained with an objective similar to DDPG's actor update:
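As given in the paper:

$$\phi \leftarrow \operatorname*{argmax}_{\phi} \sum_{(s,a) \in B} Q_\theta\big(s,\; a + \xi_\phi(s, a, \Phi)\big)$$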
Finally, BCQ estimates action values with a variant of Clipped Double Q-learning: two Q-networks are trained, and instead of simply taking their minimum as the target, the two estimates are combined in a softer way:
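The target used in the paper (and in the code in Section 3.3.4) is:

$$r + \gamma \max_{a_i} \Big[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a_i) + (1-\lambda) \max_{j=1,2} Q_{\theta'_j}(s', a_i) \Big]$$

where $\lambda$ controls how strongly the minimum is weighted (the lmbda hyper-parameter in the code, default 0.75).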
The resulting pseudocode is shown below:
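In outline (a summary of the paper's BCQ pseudocode; the full implementation is the train method in Section 3.3.4), each training iteration does the following:
1. Sample a mini-batch of transitions (s, a, r, s') from the batch B.
2. Train the VAE $G_\omega$ on the reconstruction loss plus the KL regularizer.
3. For each s', sample n candidate actions from the VAE, perturb them with the target perturbation network $\xi_{\phi'}$, and compute the soft-clipped target using the two target Q-networks.
4. Update both Q-networks toward that target.
5. Sample actions from the VAE for s, perturb them with $\xi_\phi$, and update $\phi$ by gradient ascent on $Q_{\theta_1}$ (the DPG-style update).
6. Soft-update the target networks with rate $\tau$.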
The VAE component works as follows:
The VAE $G_\omega$ is defined by two networks, an encoder $E_{\omega_1}(s, a)$ and a decoder $D_{\omega_2}(s, z)$, where $\omega = \{\omega_1, \omega_2\}$. The encoder takes a state-action pair and outputs the mean $\mu$ and standard deviation $\sigma$ of a Gaussian $N(\mu, \sigma)$. The state s, together with a latent vector z sampled from this Gaussian, is passed to the decoder $D_{\omega_2}(s, z)$, which outputs an action. The network follows the default architecture (Figure 10 in the paper), except with two hidden layers of size 750 instead of 400 and 300, and is trained on the reconstruction mean-squared error plus a KL regularization term:
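The training objective, roughly as given in the paper, is:

$$\mathcal{L}(\omega) = \sum_{(s,a)\in B} \big(a - \tilde{a}\big)^2 + D_{KL}\big(N(\mu, \sigma)\,\|\,N(0, 1)\big), \qquad \tilde{a} = D_{\omega_2}(s, z),\ \ z = \mu + \sigma \cdot \epsilon,\ \ \epsilon \sim N(0, 1)$$

(In the implementation below, the KL term is weighted by 0.5.)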
Since both distributions are Gaussian, the KL divergence term has a closed form:
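$$D_{KL}\big(N(\mu, \sigma)\,\|\,N(0, 1)\big) = -\frac{1}{2} \sum_{j} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$$

which is exactly the KL_loss computed in the code in Section 3.3.4.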
3. Code Walkthrough
The code consists of four parts: main, DDPG, BCQ, and utils.
3.1 main
main has three parts:
Part 1: use DDPG to interact with the environment and collect the roughly 1M transitions we need;
Part 2: train BCQ on the collected data;
Part 3: the entry point, which sets the relevant hyper-parameters and dispatches to either data collection or BCQ training.
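Assuming the entry script is named main.py, as in the original BCQ repository, the three phases are typically run as `python main.py --train_behavioral` (train the DDPG behavioral policy), `python main.py --generate_buffer` (roll the trained policy out and save the buffer), and finally `python main.py` with no flag (train BCQ offline on the saved buffer).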
3.1.1 interact_with_environment
# Interacts with the environment: trains the behavioral DDPG policy, or rolls it out to generate the buffer
def interact_with_environment(env, state_dim, action_dim, max_action, device, args):
    # For saving files
    setting = f"{args.env}_{args.seed}"
    buffer_name = f"{args.buffer_name}_{setting}"

    # Initialize and load policy
    policy = DDPG.DDPG(state_dim, action_dim, max_action, device)  # , args.discount, args.tau)
    if args.generate_buffer: policy.load(f"./models/behavioral_{setting}")

    # Initialize buffer
    replay_buffer = utils.ReplayBuffer(state_dim, action_dim, device)

    evaluations = []

    state, done = env.reset(), False
    episode_reward = 0
    episode_timesteps = 0
    episode_num = 0

    # Interact with the environment for max_timesteps
    for t in range(int(args.max_timesteps)):

        episode_timesteps += 1

        # Select action with noise
        if (
            (args.generate_buffer and np.random.uniform(0, 1) < args.rand_action_p) or
            (args.train_behavioral and t < args.start_timesteps)
        ):
            action = env.action_space.sample()
        else:
            action = (
                policy.select_action(np.array(state))
                + np.random.normal(0, max_action * args.gaussian_std, size=action_dim)
            ).clip(-max_action, max_action)

        # Perform action
        next_state, reward, done, _ = env.step(action)
        done_bool = float(done) if episode_timesteps < env._max_episode_steps else 0

        # Store data in replay buffer
        replay_buffer.add(state, action, next_state, reward, done_bool)

        state = next_state
        episode_reward += reward

        # Train agent after collecting sufficient data
        if args.train_behavioral and t >= args.start_timesteps:
            policy.train(replay_buffer, args.batch_size)

        if done:
            # +1 to account for 0 indexing. +0 on ep_timesteps since it will increment +1 even if done=True
            print(f"Total T: {t+1} Episode Num: {episode_num+1} Episode T: {episode_timesteps} Reward: {episode_reward:.3f}")
            # Reset environment
            state, done = env.reset(), False
            episode_reward = 0
            episode_timesteps = 0
            episode_num += 1

        # Evaluate episode
        if args.train_behavioral and (t + 1) % args.eval_freq == 0:
            evaluations.append(eval_policy(policy, args.env, args.seed))
            np.save(f"./results/behavioral_{setting}", evaluations)
            policy.save(f"./models/behavioral_{setting}")

    # Save final policy
    if args.train_behavioral:
        policy.save(f"./models/behavioral_{setting}")

    # Save final buffer and performance
    else:
        evaluations.append(eval_policy(policy, args.env, args.seed))
        np.save(f"./results/buffer_performance_{setting}", evaluations)
        replay_buffer.save(f"./buffers/{buffer_name}")
3.1.2 train_BCQ
# Trains BCQ offline on the saved buffer
def train_BCQ(state_dim, action_dim, max_action, device, args):
    # For saving files
    setting = f"{args.env}_{args.seed}"
    buffer_name = f"{args.buffer_name}_{setting}"

    # Initialize policy
    policy = BCQ.BCQ(state_dim, action_dim, max_action, device, args.discount, args.tau, args.lmbda, args.phi)

    # Load buffer
    replay_buffer = utils.ReplayBuffer(state_dim, action_dim, device)
    replay_buffer.load(f"./buffers/{buffer_name}")

    evaluations = []
    episode_num = 0
    done = True
    training_iters = 0

    while training_iters < args.max_timesteps:
        pol_vals = policy.train(replay_buffer, iterations=int(args.eval_freq), batch_size=args.batch_size)

        evaluations.append(eval_policy(policy, args.env, args.seed))
        np.save(f"./results/BCQ_{setting}", evaluations)

        training_iters += args.eval_freq
        print(f"Training iterations: {training_iters}")
3.1.3 main
if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--env", default="Hopper-v3")                # OpenAI Gym environment name
    parser.add_argument("--seed", default=0, type=int)               # Sets Gym, PyTorch and NumPy seeds
    parser.add_argument("--buffer_name", default="Robust")           # Prefix for the saved buffer files
    parser.add_argument("--eval_freq", default=5e3, type=float)      # How often (time steps) the policy is evaluated
    parser.add_argument("--max_timesteps", default=1e6, type=int)    # Max time steps to run the environment or train for (also the buffer size)
    parser.add_argument("--start_timesteps", default=25e3, type=int) # Time steps the initial random policy is used before training the behavioral policy
    parser.add_argument("--rand_action_p", default=0.3, type=float)  # Probability of taking a random action when generating the buffer
    parser.add_argument("--gaussian_std", default=0.3, type=float)   # Std of the Gaussian exploration noise
    parser.add_argument("--batch_size", default=100, type=int)       # Mini-batch size sampled from the buffer
    parser.add_argument("--discount", default=0.99)                  # Reward discount factor
    parser.add_argument("--tau", default=0.005)                      # Target network update rate
    parser.add_argument("--lmbda", default=0.75)                     # Weight for clipped double Q-learning in BCQ
    parser.add_argument("--phi", default=0.05)                       # Maximum perturbation hyper-parameter in BCQ
    parser.add_argument("--train_behavioral", action="store_true")   # If true, train behavioral policy (DDPG)
    parser.add_argument("--generate_buffer", action="store_true")    # If true, generate buffer
    args = parser.parse_args()

    print("---------------------------------------")
    if args.train_behavioral:
        print(f"Setting: Training behavioral, Env: {args.env}, Seed: {args.seed}")
    elif args.generate_buffer:
        print(f"Setting: Generating buffer, Env: {args.env}, Seed: {args.seed}")
    else:
        print(f"Setting: Training BCQ, Env: {args.env}, Seed: {args.seed}")
    print("---------------------------------------")

    if args.train_behavioral and args.generate_buffer:
        print("Train_behavioral and generate_buffer cannot both be true.")
        exit()

    if not os.path.exists("./results"):
        os.makedirs("./results")

    if not os.path.exists("./models"):
        os.makedirs("./models")

    if not os.path.exists("./buffers"):
        os.makedirs("./buffers")

    env = gym.make(args.env)

    env.seed(args.seed)
    env.action_space.seed(args.seed)
    torch.manual_seed(args.seed)
    np.random.seed(args.seed)

    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = float(env.action_space.high[0])

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    if args.train_behavioral or args.generate_buffer:
        interact_with_environment(env, state_dim, action_dim, max_action, device, args)
    else:
        train_BCQ(state_dim, action_dim, max_action, device, args)
3.2 DDPG
DDPG consists of three parts:
Part 1: the Actor network, which selects the action a;
Part 2: the Critic network, which estimates the action-value function Q;
Part 3: the DDPG class, which trains the Actor and Critic and updates their parameters.
3.2.1 Actor
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()

        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        return self.max_action * torch.tanh(self.l3(a))
3.2.2 Critic
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()

        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

    def forward(self, state, action):
        q = F.relu(self.l1(torch.cat([state, action], 1)))
        q = F.relu(self.l2(q))
        return self.l3(q)
3.2.3 DDPG
class DDPG(object):
    # Initialize networks, optimizers and hyper-parameters
    def __init__(self, state_dim, action_dim, max_action, device, discount=0.99, tau=0.005):
        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters())

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters())

        self.discount = discount
        self.tau = tau
        self.device = device

    # Select an action for a single state
    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).to(self.device)
        return self.actor(state).cpu().data.numpy().flatten()

    # One training step
    def train(self, replay_buffer, batch_size=100):
        # Sample replay buffer
        state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

        # Compute the target Q value
        target_Q = self.critic_target(next_state, self.actor_target(next_state))
        target_Q = reward + (not_done * self.discount * target_Q).detach()

        # Get the current Q estimate
        current_Q = self.critic(state, action)

        # Compute critic loss
        critic_loss = F.mse_loss(current_Q, target_Q)

        # Optimize the critic
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Compute actor loss
        actor_loss = -self.critic(state, self.actor(state)).mean()

        # Optimize the actor
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Soft-update the target networks
        for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

        for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

    # Save model weights and optimizer states
    def save(self, filename):
        torch.save(self.critic.state_dict(), filename + "_critic")
        torch.save(self.critic_optimizer.state_dict(), filename + "_critic_optimizer")
        torch.save(self.actor.state_dict(), filename + "_actor")
        torch.save(self.actor_optimizer.state_dict(), filename + "_actor_optimizer")

    # Load model weights and optimizer states
    def load(self, filename):
        self.critic.load_state_dict(torch.load(filename + "_critic"))
        self.critic_optimizer.load_state_dict(torch.load(filename + "_critic_optimizer"))
        self.critic_target = copy.deepcopy(self.critic)

        self.actor.load_state_dict(torch.load(filename + "_actor"))
        self.actor_optimizer.load_state_dict(torch.load(filename + "_actor_optimizer"))
        self.actor_target = copy.deepcopy(self.actor)
3.3 BCQ
BCQ consists of four parts:
Part 1: the Actor network, which here acts as the perturbation model ξφ that adjusts a candidate action a;
Part 2: the Critic network, which contains two Q-networks for value estimation;
Part 3: the VAE, which imitates the behavior policy in dataset D and generates candidate actions;
Part 4: training.
3.3.1 Actor
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action, phi=0.05):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action
        self.phi = phi

    def forward(self, state, action):
        a = F.relu(self.l1(torch.cat([state, action], 1)))
        a = F.relu(self.l2(a))
        a = self.phi * self.max_action * torch.tanh(self.l3(a))
        return (a + action).clamp(-self.max_action, self.max_action)
3.3.2 Critic
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

        self.l4 = nn.Linear(state_dim + action_dim, 400)
        self.l5 = nn.Linear(400, 300)
        self.l6 = nn.Linear(300, 1)

    def forward(self, state, action):
        q1 = F.relu(self.l1(torch.cat([state, action], 1)))
        q1 = F.relu(self.l2(q1))
        q1 = self.l3(q1)

        q2 = F.relu(self.l4(torch.cat([state, action], 1)))
        q2 = F.relu(self.l5(q2))
        q2 = self.l6(q2)
        return q1, q2

    def q1(self, state, action):
        q1 = F.relu(self.l1(torch.cat([state, action], 1)))
        q1 = F.relu(self.l2(q1))
        q1 = self.l3(q1)
        return q1
3.3.3 VAE
class VAE(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim, max_action, device):
        super(VAE, self).__init__()
        # Encoder
        self.e1 = nn.Linear(state_dim + action_dim, 750)
        self.e2 = nn.Linear(750, 750)

        # Mean of the latent Gaussian
        self.mean = nn.Linear(750, latent_dim)
        # Log standard deviation of the latent Gaussian
        self.log_std = nn.Linear(750, latent_dim)

        # Decoder
        self.d1 = nn.Linear(state_dim + latent_dim, 750)
        self.d2 = nn.Linear(750, 750)
        self.d3 = nn.Linear(750, action_dim)

        self.max_action = max_action
        self.latent_dim = latent_dim
        self.device = device

    def forward(self, state, action):
        z = F.relu(self.e1(torch.cat([state, action], 1)))
        z = F.relu(self.e2(z))

        mean = self.mean(z)
        # Clamped for numerical stability
        log_std = self.log_std(z).clamp(-4, 15)
        std = torch.exp(log_std)
        # Reparameterization trick
        z = mean + std * torch.randn_like(std)

        # Decode the latent vector into an action
        u = self.decode(state, z)

        return u, mean, std

    # Decoder: maps (state, latent vector) to an action
    def decode(self, state, z=None):
        # When sampling from the VAE, the latent vector is clipped to [-0.5, 0.5]
        if z is None:
            z = torch.randn((state.shape[0], self.latent_dim)).to(self.device).clamp(-0.5, 0.5)

        a = F.relu(self.d1(torch.cat([state, z], 1)))
        a = F.relu(self.d2(a))
        return self.max_action * torch.tanh(self.d3(a))
3.3.4 BCQ
class BCQ(object):
    def __init__(self, state_dim, action_dim, max_action, device, discount=0.99, tau=0.005, lmbda=0.75, phi=0.05):
        latent_dim = action_dim * 2

        self.actor = Actor(state_dim, action_dim, max_action, phi).to(device)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=1e-3)

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=1e-3)

        self.vae = VAE(state_dim, action_dim, latent_dim, max_action, device).to(device)
        self.vae_optimizer = torch.optim.Adam(self.vae.parameters())

        self.max_action = max_action
        self.action_dim = action_dim
        self.discount = discount
        self.tau = tau
        self.lmbda = lmbda
        self.device = device

    # Select an action: sample candidates from the VAE, perturb them, and pick the one with the highest Q1 value
    def select_action(self, state):
        with torch.no_grad():
            state = torch.FloatTensor(state.reshape(1, -1)).repeat(100, 1).to(self.device)
            action = self.actor(state, self.vae.decode(state))
            q1 = self.critic.q1(state, action)
            ind = q1.argmax(0)
        return action[ind].cpu().data.numpy().flatten()

    def train(self, replay_buffer, iterations, batch_size=100):
        for it in range(iterations):
            # Sample replay buffer / batch
            state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

            # Variational Auto-Encoder training
            recon, mean, std = self.vae(state, action)
            recon_loss = F.mse_loss(recon, action)
            KL_loss = -0.5 * (1 + torch.log(std.pow(2)) - mean.pow(2) - std.pow(2)).mean()
            vae_loss = recon_loss + 0.5 * KL_loss

            self.vae_optimizer.zero_grad()
            vae_loss.backward()
            self.vae_optimizer.step()

            # Critic training
            with torch.no_grad():
                # Duplicate next state 10 times
                next_state = torch.repeat_interleave(next_state, 10, 0)

                # Compute value of perturbed actions sampled from the VAE
                target_Q1, target_Q2 = self.critic_target(next_state, self.actor_target(next_state, self.vae.decode(next_state)))

                # Soft Clipped Double Q-learning
                target_Q = self.lmbda * torch.min(target_Q1, target_Q2) + (1. - self.lmbda) * torch.max(target_Q1, target_Q2)
                # Take max over each action sampled from the VAE
                target_Q = target_Q.reshape(batch_size, -1).max(1)[0].reshape(-1, 1)

                target_Q = reward + not_done * self.discount * target_Q

            current_Q1, current_Q2 = self.critic(state, action)
            critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Perturbation model / action training
            sampled_actions = self.vae.decode(state)
            perturbed_actions = self.actor(state, sampled_actions)

            # Update through DPG
            actor_loss = -self.critic.q1(state, perturbed_actions).mean()

            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Soft-update the target networks
            for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

            for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
3.4 utils
Mainly handles storing and loading data; this is mostly the usual replay-buffer boilerplate.
3.4.1 ReplayBuffer
class ReplayBuffer(object):
    def __init__(self, state_dim, action_dim, device, max_size=int(1e6)):
        self.max_size = max_size
        # Index of the next slot to write to
        self.ptr = 0
        # Number of transitions currently stored
        self.size = 0

        self.state = np.zeros((max_size, state_dim))
        self.action = np.zeros((max_size, action_dim))
        self.next_state = np.zeros((max_size, state_dim))
        self.reward = np.zeros((max_size, 1))
        self.not_done = np.zeros((max_size, 1))

        self.device = device

    # Add a new transition
    def add(self, state, action, next_state, reward, done):
        self.state[self.ptr] = state
        self.action[self.ptr] = action
        self.next_state[self.ptr] = next_state
        self.reward[self.ptr] = reward
        self.not_done[self.ptr] = 1. - done

        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    # Sample a mini-batch
    def sample(self, batch_size):
        # Indices of the sampled transitions
        ind = np.random.randint(0, self.size, size=batch_size)

        return (
            torch.FloatTensor(self.state[ind]).to(self.device),
            torch.FloatTensor(self.action[ind]).to(self.device),
            torch.FloatTensor(self.next_state[ind]).to(self.device),
            torch.FloatTensor(self.reward[ind]).to(self.device),
            torch.FloatTensor(self.not_done[ind]).to(self.device)
        )

    # Save the buffer to disk
    def save(self, save_folder):
        np.save(f"{save_folder}_state.npy", self.state[:self.size])
        np.save(f"{save_folder}_action.npy", self.action[:self.size])
        np.save(f"{save_folder}_next_state.npy", self.next_state[:self.size])
        np.save(f"{save_folder}_reward.npy", self.reward[:self.size])
        np.save(f"{save_folder}_not_done.npy", self.not_done[:self.size])
        np.save(f"{save_folder}_ptr.npy", self.ptr)

    # Load a saved buffer from disk
    def load(self, save_folder, size=-1):
        reward_buffer = np.load(f"{save_folder}_reward.npy")

        # Adjust size if we're using a custom size
        size = min(int(size), self.max_size) if size > 0 else self.max_size
        self.size = min(reward_buffer.shape[0], size)

        self.state[:self.size] = np.load(f"{save_folder}_state.npy")[:self.size]
        self.action[:self.size] = np.load(f"{save_folder}_action.npy")[:self.size]
        self.next_state[:self.size] = np.load(f"{save_folder}_next_state.npy")[:self.size]
        self.reward[:self.size] = reward_buffer[:self.size]
        self.not_done[:self.size] = np.load(f"{save_folder}_not_done.npy")[:self.size]
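As a quick illustration of how the buffer is used, here is a minimal, hypothetical snippet (assuming the class above lives in utils.py; the dimensions and sizes are placeholders) that fills a small buffer with random transitions, samples a mini-batch, and round-trips it to disk the same way main.py does:

import os
import numpy as np
import torch
import utils  # the ReplayBuffer module shown above

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
state_dim, action_dim = 11, 3  # placeholder dimensions (roughly those of Hopper-v3)

buffer = utils.ReplayBuffer(state_dim, action_dim, device, max_size=1000)

# Fill with random transitions (stand-ins for real environment steps)
for _ in range(500):
    s = np.random.randn(state_dim)
    a = np.random.uniform(-1, 1, action_dim)
    s2 = np.random.randn(state_dim)
    buffer.add(s, a, s2, reward=np.random.randn(), done=0.0)

# Sample a training mini-batch of tensors on the chosen device
state, action, next_state, reward, not_done = buffer.sample(batch_size=64)
print(state.shape, action.shape, not_done.shape)  # (64, 11), (64, 3), (64, 1)

# Round-trip to disk, mirroring how main.py saves/loads "./buffers/..."
os.makedirs("./buffers", exist_ok=True)
buffer.save("./buffers/demo")
restored = utils.ReplayBuffer(state_dim, action_dim, device, max_size=1000)
restored.load("./buffers/demo")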
4. Summary
This section summarizes the methods BCQ uses, considers some possible extensions, and draws comparisons.
4.1 Data Collection
The paper uses DDPG to collect the dataset, mainly for the following reasons:
- DDPG is an off-policy algorithm that learns from a replay buffer, which matches the way offline RL samples from a fixed dataset; it also provides a natural baseline to show why an ordinary off-policy algorithm cannot simply be run on offline data.
- Several parts of BCQ are modified directly from DDPG and share its structure, for example the DPG-style update of the perturbation model. BCQ is also very similar to TD3: in action selection, both use a DPG-style objective plus a perturbation term.
4.2 The Batch-Constrained Idea
BCQ introduces the batch constraint to deal with out-of-distribution actions and distribution shift. The core idea is to tie the learned policy to the behavior policy by restricting the selected state-action pairs, as far as possible, to those covered by the known dataset B; this is implemented with a VAE that imitates the batch plus a perturbation model. The drawback is a strong dependence on the quality of the dataset, because the policies BCQ can explore remain highly similar to those already in the data. One possible extension: rather than fully imitating the dataset's state-action distribution, we could require only that the selected state-action pairs appear in the dataset, without matching their probabilities, by defining a divergence between the learned policy and the data and keeping it below a chosen threshold (a rough sketch of this idea follows below). This would still guard against out-of-distribution actions, but distribution shift could remain, so how to handle distribution shift is still an open question.
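A minimal sketch of this idea, purely illustrative and not part of BCQ: penalize the perturbed policy with a sample-based divergence, for example an MMD estimate between the actions the policy proposes and the actions in the batch, and only apply the penalty when it exceeds a chosen threshold. All names, the threshold eps, and the weight alpha below are hypothetical tuning knobs.

import torch
import torch.nn.functional as F

def mmd_gaussian(x, y, sigma=10.0):
    """Sample-based MMD^2 between two action batches, using a Gaussian kernel."""
    def k(a, b):
        d = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(-1)  # pairwise squared distances
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()

# Hypothetical modification of the actor update in Section 3.3.4:
# sampled_actions = vae.decode(state)                  # actions the batch supports
# perturbed_actions = actor(state, sampled_actions)    # actions the policy proposes
# div = mmd_gaussian(perturbed_actions, action)        # distance to dataset actions
# eps, alpha = 0.05, 10.0                              # threshold and penalty weight
# actor_loss = -critic.q1(state, perturbed_actions).mean() + alpha * F.relu(div - eps)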
4.3 Improved Q-Value Estimation
The paper also modifies the function used to estimate the Q-value target:
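This is the same soft-clipped target introduced in Section 2.2:

$$r + \gamma \max_{a_i} \Big[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a_i) + (1-\lambda) \max_{j=1,2} Q_{\theta'_j}(s', a_i) \Big]$$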
It modifies Clipped Double Q-learning: instead of taking the hard minimum of the two estimates, it takes a convex combination of the two values while giving higher weight to the minimum. This still curbs overestimation, but reduces the penalty on uncommon, rarely-seen states.