(9) Reinforcement Learning: Policy Gradient with Baseline, REINFORCE with Baseline, Advantage Actor-Critic (A2C)

1. Introduction
        The previous section derived the policy gradient and introduced two policy gradient algorithms, REINFORCE and Actor-Critic (see (8) Reinforcement Learning: Policy Gradient, REINFORCE, Actor-Critic). Those methods are correct in theory, but in practice they do not perform very well. The policy gradient with baseline introduced in this section can greatly improve the performance of policy gradient methods. With a baseline, REINFORCE becomes REINFORCE with Baseline, and Actor-Critic becomes Advantage Actor-Critic (A2C).
2. Baseline
The policy gradient theorem:

∇θJ(θ) = ES[ EA∼π(·|S;θ)[ Qπ(S,A) · ∇θ ln π(A|S;θ) ] ]

The policy gradient theorem with baseline:

∇θJ(θ) = ES[ EA∼π(·|S;θ)[ (Qπ(S,A) − b) · ∇θ ln π(A|S;θ) ] ]

        Comparing the two theorems, the only difference is that a quantity b is subtracted from Qπ(S,A). Here b can be any function, as long as it does not depend on the action A. Using such a b as a baseline for the action value Qπ(S,A) has no effect on the policy gradient. The reason, in one line, is that EA∼π(·|s;θ)[b · ∇θ ln π(A|s;θ)] = b · ∇θ Σa π(a|s;θ) = b · ∇θ 1 = 0, so the extra term vanishes in expectation; the full proof is omitted here (see Wang Shusen's course).
The theorem expresses the policy gradient as an expectation, and we approximate that expectation by Monte Carlo. Observe a state s from the environment and sample an action a ∼ π(·|s;θ) from the policy network. The policy gradient ∇θJ(θ) can then be approximated by the following stochastic gradient:
gb(s,a;θ) = (Qπ(s,a) − b) · ∇θ ln π(a|s;θ)
Whether b is 0 or Vπ(s), the resulting stochastic gradient gb(s,a;θ) is an unbiased estimate of ∇θJ(θ):
ES,A[gb(S,A;θ)] = ∇θJ(θ)
Although the choice of b has no effect on the expectation ES,A[gb(S,A;θ)], it does affect the stochastic gradient gb(s,a;θ) itself. Different choices of b lead to different variances:
Var = ES,A[ ‖gb(S,A;θ) − ∇θJ(θ)‖² ]

If b is close to the mean of Qπ(s,a) over the actions a, the variance is small. Since Vπ(s) = EA∼π(·|s;θ)[Qπ(s,A)] is exactly that mean, we usually use b = Vπ(s) as the baseline.
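As a quick sanity check, here is a minimal, self-contained NumPy sketch of this claim for a single state with three actions. The logits theta and the table Q standing in for Qπ(s,·) are made-up numbers, not taken from the text; the point is only that the Monte Carlo gradient estimates have the same mean for b = 0 and b = Vπ(s), while the variance shrinks with the baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-state example: 3 actions, softmax policy pi(a|s; theta).
theta = np.array([0.2, -0.1, 0.4])   # policy parameters (logits), made up
Q = np.array([1.0, 3.0, 2.0])        # stand-in for Q_pi(s, a), made up

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(a, theta):
    # Gradient of log softmax: one_hot(a) - pi(theta)
    g = -pi(theta)
    g[a] += 1.0
    return g

def sampled_grads(b, n=50_000):
    # Draw n samples of g_b(s, a; theta) = (Q_pi(s, a) - b) * grad log pi(a|s)
    p = pi(theta)
    actions = rng.choice(len(p), size=n, p=p)
    return np.array([(Q[a] - b) * grad_log_pi(a, theta) for a in actions])

V = pi(theta) @ Q                    # V_pi(s) = E_A[Q_pi(s, A)]
for b in (0.0, V):
    g = sampled_grads(b)
    print(f"b = {b:.3f}  mean ~ {g.mean(axis=0).round(3)}  "
          f"total variance ~ {g.var(axis=0).sum():.3f}")
```

Both runs should print approximately the same mean gradient, with a noticeably smaller variance when b = Vπ(s).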

3. REINFORCE with Baseline
        REINFORCE with baseline needs two neural networks: a policy network π(a|s;θ) and a value network v(s;w). The policy network is the same as before: its input is the state s, and its output is a vector whose elements are the probabilities of the individual actions. The value network v(s;w), however, is quite different from the value network q(s,a;w) used earlier: v(s;w) approximates the state value Vπ rather than the action value Qπ. Its input is the state s and its output is a single real number. This value network does not play the role of a "critic"; it serves only as the baseline, and its purpose is to reduce variance and speed up convergence.
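A minimal PyTorch sketch of the two networks (the framework, the hidden width of 64, and the class names PolicyNet and ValueNet are illustrative choices, not from the text):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy network pi(a|s; theta): state -> probability of each action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)

class ValueNet(nn.Module):
    """Value network v(s; w): state -> one real number approximating V_pi(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)
```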
Training procedure (one update per completed episode):

1. Use the current policy π(·|·;θ) to play one episode and record the trajectory s1, a1, r1, …, sn, an, rn.
2. For every step t, compute the return ut = Σk=t…n γ^(k−t) · rk.
3. Compute the error δt = v(st;w) − ut.
4. Update the value network by gradient descent: w ← w − α · δt · ∇w v(st;w).
5. Update the policy network by stochastic gradient ascent: θ ← θ − β · δt · ∇θ ln π(at|st;θ).
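Below is a hedged PyTorch sketch of one such update, assuming the PolicyNet/ValueNet classes above, a discrete action space, and two optimizers such as torch.optim.Adam(policy.parameters(), lr=...) and torch.optim.Adam(value.parameters(), lr=...) passed in as policy_opt and value_opt. It illustrates the procedure rather than reproducing the book's reference implementation.

```python
import torch

def reinforce_with_baseline_update(policy, value, policy_opt, value_opt,
                                    episode, gamma=0.99):
    """One update from a finished episode.

    episode: list of (state, action, reward); states are 1-D float tensors.
    """
    # Step 2: returns u_t, accumulated backwards through the episode.
    returns, u = [], 0.0
    for _, _, r in reversed(episode):
        u = r + gamma * u
        returns.append(u)
    returns.reverse()

    states = torch.stack([s for s, _, _ in episode])
    actions = torch.tensor([a for _, a, _ in episode])
    u_t = torch.tensor(returns, dtype=torch.float32)

    # Step 3: delta_t = v(s_t; w) - u_t.
    v_t = value(states)
    delta = v_t - u_t

    # Step 4: regress v(s_t; w) toward the observed return u_t.
    value_loss = delta.pow(2).mean()
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # Step 5: minimizing delta_t * log pi(a_t|s_t) is gradient ascent
    # on (u_t - v_t) * log pi(a_t|s_t), i.e. on the baselined objective.
    log_pi = torch.log(policy(states).gather(1, actions.unsqueeze(1)).squeeze(1))
    policy_loss = (delta.detach() * log_pi).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```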

4. Advantage Actor-Critic (A2C)
        A2C belongs to the family of Actor-Critic methods. There is a policy network π(a|s;θ), the "actor", which controls the agent, and a value network v(s;w), the "critic", whose scores help the policy network (the actor) improve. The two networks have exactly the same structure as in REINFORCE with baseline, but the way their parameters are updated during training is different.
        The relationship between the policy network (actor) and the value network (critic) in A2C is shown in the figure below. The agent is controlled by the policy network π: it interacts with the environment and collects states, actions, and rewards. The policy network (actor) chooses the action at based on the state st. The value network (critic) computes the TD error δt from st, st+1, and rt. The policy network (actor) relies on δt to judge how good its action was and thereby improves its "acting", i.e. the parameters θ.
[Figure: the policy network (actor), the value network (critic), and the environment interacting in A2C]
Training steps (one update per transition):

1. Observe the state st, sample at ∼ π(·|st;θ), execute at, and observe the reward rt and the next state st+1.
2. Compute the TD target: ŷt = rt + γ · v(st+1;w).
3. Compute the TD error: δt = v(st;w) − ŷt.
4. Update the value network (critic) by gradient descent: w ← w − α · δt · ∇w v(st;w).
5. Update the policy network (actor) by stochastic gradient ascent: θ ← θ − β · δt · ∇θ ln π(at|st;θ).
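A minimal one-transition A2C update in PyTorch, again assuming the PolicyNet/ValueNet classes sketched earlier, discrete actions, and placeholder hyperparameters; it is a sketch of the steps listed above, not a production implementation.

```python
import torch

def a2c_update(policy, value, policy_opt, value_opt,
               s_t, a_t, r_t, s_next, done, gamma=0.99):
    """One A2C update from a single transition (s_t, a_t, r_t, s_{t+1})."""
    # Step 2: TD target y_t = r_t + gamma * v(s_{t+1}; w); no bootstrap if done.
    with torch.no_grad():
        y_t = r_t + gamma * value(s_next) * (1.0 - float(done))

    # Step 3: TD error delta_t = v(s_t; w) - y_t.
    v_t = value(s_t)
    delta = v_t - y_t

    # Step 4 (critic): shrink the TD error by regressing v(s_t; w) toward y_t.
    value_loss = delta.pow(2)
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # Step 5 (actor): minimizing delta_t * log pi(a_t|s_t; theta) performs
    # gradient ascent on (y_t - v_t) * log pi, with y_t - v_t serving as
    # the advantage estimate.
    log_pi = torch.log(policy(s_t)[a_t])
    policy_loss = delta.detach() * log_pi
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```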

5. Summary
        This section presented REINFORCE with baseline and A2C, which improve the earlier REINFORCE and Actor-Critic methods by adding a baseline to the policy gradient. We usually use Vπ(s) as the baseline. Actor-Critic and A2C have the same policy network structure, but their value networks differ: Actor-Critic's value network approximates Qπ, while A2C's approximates Vπ. As a result, their update rules also differ.
References
Deep Reinforcement Learning (深度强化学习), by Wang Shusen (王树森) and Zhang Zhihua (张志华).

