Actor-Critic

Algorithm Idea

Actor-Critic is a reinforcement learning algorithm that combines policy-based and value-based methods. The Actor side descends from Policy Gradients, which lets it pick suitable actions in continuous action spaces with ease, something Q-learning cannot do. The Critic side descends from Q-learning or other value-based methods, which makes single-step updates possible, whereas vanilla Policy Gradients only updates once per episode, which lowers learning efficiency.

The Actor and the Critic can each be represented by its own neural network. The Actor is based on a policy function and is responsible for generating actions and interacting with the environment; the Critic is based on a value function and evaluates the Actor's behaviour. Because the Critic learns the relationship between the environment and the rewards, it can see the potential return of the current state, so its feedback lets the Actor update at every single step; with pure Policy Gradients, the Actor can only start updating once the episode ends. In other words, the Actor-Critic algorithm makes two approximations: one of the policy function and one of the value function.

Overall, the Critic uses its value (Q) network to estimate the value $v_t$ of the current state; the Actor uses $v_t$ to update its policy parameters $\theta$ and select an action, which produces a reward and a new state; the Critic then uses that reward and new state to update its network parameters $w$, and the updated $w$ is used to compute $v_t$ for the Actor on the next step.
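
Written out in the usual TD(0) actor-critic form (a standard textbook formulation rather than anything specific to this implementation, with $\hat{V}_w$ the Critic's value network, $\pi_\theta$ the Actor's policy, and $\alpha_w$, $\alpha_\theta$ the two learning rates):

$$\delta_t = r_{t+1} + \gamma\,\hat{V}_w(s_{t+1}) - \hat{V}_w(s_t)$$
$$w \leftarrow w + \alpha_w\,\delta_t\,\nabla_w \hat{V}_w(s_t)$$
$$\theta \leftarrow \theta + \alpha_\theta\,\delta_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

The TD error $\delta_t$ plays the role of the value signal $v_t$ above: a positive $\delta_t$ makes the chosen action $a_t$ more likely, a negative one less likely.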

Drawbacks

Actor-Critic involves two neural networks, and both are updated online from a continuous stream of states, so consecutive parameter updates are strongly correlated. This correlation gives the networks a narrow, one-sided view of the problem and can even prevent them from learning anything at all.

Algorithm Implementation

  • Actor network
# Shared imports for the Actor, Critic and main blocks below
import os

import gym
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt


class Actor():
    def __init__(self, env):
        # init some parameters
        self.state_dim = env.observation_space.shape[0]  # dimension of the observation vector
        self.action_dim = env.action_space.n  # number of discrete actions

        # Policy network: state -> action probabilities (softmax output)
        self.model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation='relu', use_bias=True),
            tf.keras.layers.Dense(self.action_dim, activation='softmax', use_bias=True)
        ])

        actor_optimizer = tf.keras.optimizers.Adam(1e-3)
        self.model.compile(
            loss='categorical_crossentropy',
            optimizer=actor_optimizer
        )

    # Select an action by sampling from the current softmax policy
    def choose_action(self, observation):
        prob_weights = self.model.predict(observation[np.newaxis, :]).ravel()
        prob_weights = prob_weights / prob_weights.sum()  # guard against float32 rounding in np.random.choice
        action = np.random.choice(self.action_dim, p=prob_weights)
        return action

    # Policy update: the cross-entropy target is td_error * one_hot(action),
    # so the fitted loss equals -td_error * log(pi(a|s))
    def learn(self, state, action, td_error):
        s = state[np.newaxis, :]
        one_hot_action = np.zeros(self.action_dim)
        one_hot_action[action] = 1
        a = one_hot_action[np.newaxis, :]
        self.model.fit(s, td_error * a, verbose=0)

    def saveModel(self):
        path = os.path.join('model', '_'.join([File, ALG_NAME, ENV_NAME]))
        if not os.path.exists(path):
            os.makedirs(path)
        self.model.save_weights(os.path.join(path, 'actor.tf'), save_format='tf')
        print('Saved weights.')

    def loadModel(self):
        path = os.path.join('model', '_'.join([File, ALG_NAME, ENV_NAME]))
        if os.path.exists(path):
            self.model.load_weights(os.path.join(path, 'actor.tf'))
            print('Load weights!')
        else:
            print("No model file find, please train model first...")

  • Critic network
class Critic():
    def __init__(self, env):
        self.state_dim = env.observation_space.shape[0]  # dimension of the observation vector
        self.action_dim = env.action_space.n  # number of discrete actions

        # Value network: state -> scalar estimate of V(s)
        self.model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation='relu', use_bias=True),
            tf.keras.layers.Dense(1, use_bias=True)
        ])

        critic_optimizer = tf.keras.optimizers.Adam(1e-3)
        self.model.compile(loss='mse', optimizer=critic_optimizer)

    # One TD(0) update of the value network; returns the TD error for the Actor
    def train_Q_network(self, state, reward, next_state, done):
        s, s_ = state[np.newaxis, :], next_state[np.newaxis, :]
        V = self.model(s).numpy()[0, 0]    # current estimate of V(s)
        V_ = self.model(s_).numpy()[0, 0]  # current estimate of V(s')

        # Bootstrap from V(s') only for non-terminal transitions
        td_target = reward if done else reward + GAMMA * V_
        td_error = td_target - V

        # Regress V(s) toward the TD target (MSE loss)
        self.model.fit(s, np.array([[td_target]], dtype=np.float32), verbose=0)

        return td_error

    def saveModel(self):
        path = os.path.join('model', '_'.join([File, ALG_NAME, ENV_NAME]))
        if not os.path.exists(path):
            os.makedirs(path)
        self.model.save_weights(os.path.join(path, 'critic.tf'), save_format='tf')
        print('Saved weights.')

    def loadModel(self):
        path = os.path.join('model', '_'.join([File, ALG_NAME, ENV_NAME]))
        if os.path.exists(path):
            self.model.load_weights(os.path.join(path, 'critic.tf'))
            print('Load weights!')
        else:
            print("No model file find, please train model first...")
  • main
# Hyper Parameters
GAMMA = 0.95  # discount factor
LEARNING_RATE = 0.01  # learning rate (not used below; the Adam optimizers above are created with 1e-3)
File = 'ActorCritic-1.1'
ALG_NAME = 'AC'
ENV_NAME = 'CartPole-v0'

def main():
    EPISODE = 200  # Episode limitation
    STEP = 1000  # Step limitation in an episode

    env = gym.make(ENV_NAME)
    actor = Actor(env)
    critic = Critic(env)
    all_rewards = []
    for episode in range(EPISODE):
        # initialize task
        state = env.reset()
        total_reward = 0
        # Train
        for step in range(STEP):
            action = actor.choose_action(state)  # sample an action from the current policy
            next_state, reward, done, _ = env.step(action)  # classic gym 4-tuple step API
            td_error = critic.train_Q_network(state, reward, next_state,
                                              done)  # gradient = grad[r + gamma * V(s_) - V(s)]
            actor.learn(state, action, td_error)  # true_gradient = grad[logPi(s,a) * td_error]
            total_reward += reward
            state = next_state
            if done:
                all_rewards.append(total_reward)
                break
        print("total_reward---------------", total_reward)
    plt.plot(all_rewards)
    plt.show()
    actor.saveModel()
    critic.saveModel()
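
After training, the saved weights can be reloaded for a quick evaluation run. Below is a minimal sketch, assuming the classic gym API used above and the loadModel() method of the Actor class; the evaluate helper and its default of 10 episodes are illustrative additions, not part of the original script (call it in place of main() if desired).

def evaluate(episodes=10):
    env = gym.make(ENV_NAME)
    actor = Actor(env)
    # Build the network once so the saved weights can be restored into it
    actor.model(np.zeros((1, env.observation_space.shape[0]), dtype=np.float32))
    actor.loadModel()
    for _ in range(episodes):
        state, total_reward, done = env.reset(), 0, False
        while not done:
            action = actor.choose_action(state)
            state, reward, done, _ = env.step(action)
            total_reward += reward
        print('eval reward:', total_reward)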

References

https://www.cnblogs.com/pinard/p/10272023.html

https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/intro-AC/
