强化学习ActorCritic

最新推荐文章于 2024-04-13 14:00:00 发布

JasonSparrow_1

最新推荐文章于 2024-04-13 14:00:00 发布

阅读量465

点赞数

分类专栏： RLpython 文章标签： RL

本文链接：https://blog.csdn.net/JasonSparrow_1/article/details/88232753

版权

RLpython 专栏收录该内容

5 篇文章 0 订阅

订阅专栏

ActorCritic

Actor负责进行动作的奖惩，而Critic将对奖惩进行评估，从而对下一步的奖惩做出影响
Actor的算法基础是PolicyGradients，Critic的算法基础是Q-learning
Actor只能回合更新，而Critic部分可以单步更新
缺点就是空间的连续性，从而导致神经网络学不到东西，相关性比较强
为了解决这种问题，DeepMind将ActorCritic和DQN结合，从而产生了Deep Deterministic Policy Gradient

DDPG

在这里插入图片描述

Deep方面，网络相比较DQN复杂一点点
Deterministic方面，在连续动作上生成一个动作值

网络设计上面，Critic部分需要利用Actor的网络输出结果进行最后的状态值输出

Actor和Critic两部分的网络构建

	with tf.variable_scope('Actor'):
        # input s, output a
        self.a = self._build_net(S, scope='eval_net', trainable=True)

        # 输出的a_是用于给Critic网络提供参考，从而进行更好的预测下一步的action
        self.a_ = self._build_net(S_, scope='target_net', trainable=False)
        
     with tf.variable_scope('Critic'):
        # Input (s, a), output q
        self.a = tf.stop_gradient(a)    # stop critic update flows to actor
        self.q = self._build_net(S, self.a, 'eval_net', trainable=True)

        # 此处的a_正是从Actor中得到的
        self.q_ = self._build_net(S_, a_, 'target_net', trainable=False)

Actor的学习方式及更新内容
根据下图的网络结构，Actor根据Critic得到了a_grad，并利用S计算出了下一步的a，
在这里插入图片描述

	 def choose_action(self, s):
        s = s[np.newaxis, :]    # single state
        return self.sess.run(self.a, feed_dict={S: s})[0]  # single action

    def add_grad_to_graph(self, a_grads):
        with tf.variable_scope('policy_grads'):
            # ys = policy;
            # xs = policy's parameters;
            # a_grads = the gradients of the policy to get more Q
            # tf.gradients will calculate dys/dxs with a initial gradients for ys, so this is dq/da * da/dparams
            self.policy_grads = tf.gradients(ys=self.a, xs=self.e_params, grad_ys=a_grads)

    with tf.variable_scope('A_train'):
        opt = tf.train.AdamOptimizer(-self.lr)  # (- learning rate) for ascent policy
        self.train_op = opt.apply_gradients(zip(self.policy_grads, self.e_params))

	# 在Critic class中
	with tf.variable_scope('a_grad'):
		# 这个a是choose_action中计算出来的
        self.a_grads = tf.gradients(self.q, a)[0]

Critic中的网络结构和代码
相比Actor中的内容，Critic就比较简单和常规
利用TD_error进行反向误差传播，target_q进行reward计算
在这里插入图片描述

	# 其中的self.q_是根据Actor的target_q计算出来的
	with tf.variable_scope('target_q'):
        self.target_q = R + self.gamma * self.q_

    with tf.variable_scope('TD_error'):
        self.loss = tf.reduce_mean(tf.squared_difference(self.target_q, self.q))

    with tf.variable_scope('C_train'):
        self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)

A3C

通过在三个环境中进行学习，将reward反馈到最终要执行的环境中，从而得出更优的结果
基于python多线程
中央大脑有global_net，而每个玩家有一个local_net，不定时将local_net的参数更新到global_net中
在这里插入图片描述

同步的实现

	 with tf.name_scope('sync'):
            with tf.name_scope('pull'):
            	# 从global中获取参数集合
                self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.a_params, globalAC.a_params)]
                self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.c_params, globalAC.c_params)]
            with tf.name_scope('push'):
           		# 将worker中的参数更新到global参数中
                self.update_a_op = OPT_A.apply_gradients(zip(self.a_grads, globalAC.a_params))
                self.update_c_op = OPT_C.apply_gradients(zip(self.c_grads, globalAC.c_params))

利用正态分布进行action选择

	with tf.name_scope('choose_a'):  # use local params to choose action
        self.A = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=[0, 1]), A_BOUND[0], A_BOUND[1])

python多线程

	COORD = tf.train.Coordinator()
	worker_threads = []
    for worker in workers:
        job = lambda: worker.work()
        t = threading.Thread(target=job)
        t.start()
        worker_threads.append(t)
    COORD.join(worker_threads)

worker的数据更新

	for ep_t in range(MAX_EP_STEP):
            # if self.name == 'W_0':
            #     self.env.render()
            a = self.AC.choose_action(s)
            s_, r, done, info = self.env.step(a)
            done = True if ep_t == MAX_EP_STEP - 1 else False
			
			# 添加缓存
            ep_r += r
            buffer_s.append(s)
            buffer_a.append(a)
            buffer_r.append((r+8)/8)    # normalize

            if total_step % UPDATE_GLOBAL_ITER == 0 or done:   # update global and assign to local net
                if done:
                    v_s_ = 0   # terminal 根据原算法
                else:
                    v_s_ = SESS.run(self.AC.v, {self.AC.s: s_[np.newaxis, :]})[0, 0]
                buffer_v_target = []
                # 计算TD error
                for r in buffer_r[::-1]:    # reverse buffer r
                    v_s_ = r + GAMMA * v_s_
                    buffer_v_target.append(v_s_)
                buffer_v_target.reverse()

                buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(buffer_v_target)
                feed_dict = {
                    self.AC.s: buffer_s,
                    self.AC.a_his: buffer_a,
                    self.AC.v_target: buffer_v_target,
                }
                # 推送更新到global
                self.AC.update_global(feed_dict)
                # 清空缓存
                buffer_s, buffer_a, buffer_r = [], [], []
                # 获取最新的global参数
                self.AC.pull_global()

JasonSparrow_1

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
强化学习ActorCritic

ActorCriticActor负责进行动作的奖惩，而Critic将对奖惩进行评估，从而对下一步的奖惩做出影响Actor的算法基础是PolicyGradients，Critic的算法基础是Q-learningActor只能回合更新，而Critic部分可以单步更新缺点就是空间的连续性，从而导致神经网络学不到东西，相关性比较强为了解决这种问题，DeepMind将ActorCritic和DQN...
复制链接

扫一扫