Actor-Critic
In one sentence: Actor-Critic combines Policy Gradient (the Actor) with Function Approximation (the Critic). The Actor chooses actions based on probabilities, the Critic scores each action the Actor takes, and the Actor then adjusts its action probabilities according to the Critic's score.
Advantage: it supports single-step updates, making it faster than traditional Policy Gradient, which only updates once per episode.
Disadvantage: it depends on the Critic's value judgments, but the Critic is hard to train to convergence, and updating the Actor on top of it makes convergence even harder. To solve this, Google DeepMind proposed an upgraded Actor-Critic, Deep Deterministic Policy Gradient (DDPG), which incorporates the strengths of DQN and addresses the convergence problem.
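This interplay is easiest to see as a single training step. Below is a minimal sketch; the actor, critic, and Gym-style env.step interfaces are assumptions for illustration, matching the class structures that follow:

    # One Actor-Critic training step (sketch; interfaces assumed, see the classes below)
    a = actor.choose_action(s)           # Actor samples an action from pi(a|s)
    s_, r, done, info = env.step(a)      # environment returns next state and reward
    td_error = critic.learn(s, r, s_)    # Critic: td_error = r + gamma * V(s_) - V(s)
    actor.learn(s, a, td_error)          # Actor scales grad log pi(a|s) by td_error
    s = s_                               # runs at every step, not once per episode

Because the TD error is available after every transition, both networks update at each step, which is where the speed advantage over episode-based Policy Gradient comes from.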
Actor and Critic structure:
import tensorflow as tf

class Actor(object):
    def __init__(self, sess, n_features, n_actions, lr=0.001):
        self.sess = sess
        # Inputs: current state, the action taken, and the TD error supplied by the Critic
        self.s = tf.placeholder(tf.float32, [1, n_features], "state")
        self.a = tf.placeholder(tf.int32, None, "act")
        self.td_error = tf.placeholder(tf.float32, None, "td_error")  # TD error from the Critic
        with tf.variable_scope('Actor'):
            l1 = tf.layers.dense(self.s, 20, tf.nn.relu, name='l1')  # hidden layer; 20 units is an assumption
            self.acts_prob = tf.layers.dense(l1, n_actions, tf.nn.softmax, name='acts_prob')  # pi(a|s)
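The Critic is the other half of the structure: it learns the state value V(s) and emits the TD error that the Actor consumes. Below is a minimal sketch in the same TF1 style; the hidden-layer size, learning rate, and GAMMA are illustrative assumptions, not values from the original:

    import numpy as np
    import tensorflow as tf

    GAMMA = 0.9  # discount factor (illustrative assumption)

    class Critic(object):
        def __init__(self, sess, n_features, lr=0.01):
            self.sess = sess
            # Inputs: current state, the next state's value V(s'), and the reward
            self.s = tf.placeholder(tf.float32, [1, n_features], "state")
            self.v_ = tf.placeholder(tf.float32, [1, 1], "v_next")
            self.r = tf.placeholder(tf.float32, None, "reward")
            with tf.variable_scope('Critic'):
                l1 = tf.layers.dense(self.s, 20, tf.nn.relu, name='l1')
                self.v = tf.layers.dense(l1, 1, None, name='V')  # state value V(s)
            with tf.variable_scope('squared_TD_error'):
                # TD error = r + gamma * V(s') - V(s); also serves as the Actor's learning signal
                self.td_error = self.r + GAMMA * self.v_ - self.v
                self.loss = tf.square(self.td_error)  # Critic minimizes the squared TD error
            with tf.variable_scope('train'):
                self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss)

        def learn(self, s, r, s_):
            # One-step TD update: evaluate V(s'), then fit V(s) toward r + gamma * V(s')
            s, s_ = s[np.newaxis, :], s_[np.newaxis, :]
            v_ = self.sess.run(self.v, {self.s: s_})
            td_error, _ = self.sess.run([self.td_error, self.train_op],
                                        {self.s: s, self.v_: v_, self.r: r})
            return td_error

The Actor's own loss (cut off in the snippet above) is typically -mean(log pi(a|s) * td_error): a positive TD error means the action did better than the Critic expected, so its probability is pushed up, which is exactly the scoring loop described at the top.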