(10-8) Reinforcement Learning for Recommendation: The A3C Algorithm

This article introduces the A3C (Asynchronous Advantage Actor-Critic) algorithm, a model that combines deep learning with reinforcement learning and works in both discrete and continuous action spaces. It covers A3C's parallel worker threads, asynchronous training, policy and value-function estimation, and a practical application in recommender systems.

10.8  The A3C Algorithm

A3C (Asynchronous Advantage Actor-Critic) is an algorithm that combines deep learning with reinforcement learning and can handle both discrete and continuous action spaces. A3C uses deep neural networks to estimate the policy and the value function simultaneously, and it improves learning efficiency and stability by training multiple parallel agents asynchronously.

10.8.1  Introduction to the A3C Algorithm

The core idea of A3C is to run multiple worker threads in parallel, each interacting with its own copy of the environment, which increases sample diversity and data efficiency. Each worker selects actions based on its current state and uses the resulting states, actions, and rewards to compute gradients that update the shared global network (both the Actor and the Critic). In this way, every worker learns independently and improves the policy from its own experience.
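As a minimal, runnable illustration of the asynchronous update pattern only (not of the full algorithm), the toy sketch below has several threads repeatedly pull a shared parameter vector, compute a stand-in "gradient", and apply it to the shared parameters without waiting for each other. All names and numbers here are illustrative assumptions and are independent of the example code later in this section.

# Toy illustration of asynchronous workers updating shared parameters.
import threading
import numpy as np

global_params = np.zeros(4)   # stands in for the shared Actor-Critic weights
lock = threading.Lock()       # protects only the in-place update itself

def worker(worker_id, steps=100):
    global global_params
    rng = np.random.default_rng(worker_id)
    for _ in range(steps):
        local_params = global_params.copy()            # pull the latest shared weights
        # Stand-in for a gradient computed from a rollout with the local copy
        fake_gradient = 0.01 * rng.normal(size=4) - 0.001 * local_params
        with lock:
            global_params += fake_gradient             # asynchronous update of the shared weights

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Shared parameters after asynchronous updates:", global_params)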

In A3C, each worker thread updates the parameters of the shared Actor and Critic networks asynchronously. Because the workers explore different parts of the environment at the same time, their experience is largely decorrelated, which stabilizes training and improves efficiency and convergence. In addition, A3C uses an advantage function, which measures how much better an action is than the average action in a given state, A(s, a) = Q(s, a) - V(s), to reduce the variance of the policy update.
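As a concrete sketch of how the advantage is estimated in practice, the snippet below computes n-step advantages A_t = R_t - V(s_t) from a short rollout, where R_t is the discounted return bootstrapped with the critic's value of the state after the rollout. The function name, the example numbers, and the discount factor 0.99 are illustrative assumptions, not part of the example code below.

# Illustrative n-step advantage estimate: A_t = (discounted return from step t) - V(s_t).
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    # rewards: rewards collected over one rollout
    # values: critic estimates V(s_t) for the same states
    # bootstrap_value: critic estimate for the state that follows the rollout
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # discounted return from step t
        returns[t] = running
    return returns - np.asarray(values, dtype=np.float32)   # advantage estimates

# Example: a 3-step rollout.
print(n_step_advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.6], bootstrap_value=0.3))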

The advantages of A3C include efficient parallel training, good scalability to large environments and complex tasks, and support for continuous state and action spaces. It has achieved notable results on a wide range of tasks, including game playing, robot control, and autonomous driving.

In short, A3C is a parallelized reinforcement learning algorithm. Through the asynchronous interaction and parameter updates of multiple worker threads, it can efficiently train deep neural networks to make decisions in large or continuous state and action spaces.

10.8.2  Training a Recommendation System with the A3C Algorithm

The following example uses the A3C algorithm to train a simple recommendation-system agent, which improves its recommendation performance by interacting with the environment and optimizing its policy.

Source code path: daima/10/a3c.py

import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam


# Define the A3C agent class
class A3CAgent:
    def __init__(self, num_actions, state_size):
        self.num_actions = num_actions
        self.state_size = state_size
        self.optimizer = Adam(learning_rate=0.001)

        self.actor, self.critic = self.build_models()

    def build_models(self):
        # Build the Actor model (outputs a probability distribution over items)
        actor_input = tf.keras.Input(shape=(self.state_size,))
        actor_dense1 = Dense(64, activation='relu')(actor_input)
        actor_dense2 = Dense(64, activation='relu')(actor_dense1)
        actor_output = Dense(self.num_actions, activation='softmax')(actor_dense2)
        actor = Model(inputs=actor_input, outputs=actor_output)

        # Build the Critic model (outputs a scalar state-value estimate)
        critic_input = tf.keras.Input(shape=(self.state_size,))
        critic_dense1 = Dense(64, activation='relu')(critic_input)
        critic_dense2 = Dense(64, activation='relu')(critic_dense1)
        critic_output = Dense(1)(critic_dense2)
        critic = Model(inputs=critic_input, outputs=critic_output)

        return actor, critic

    def get_action(self, state):
        state = np.asarray([state], dtype=np.float32)
        probabilities = self.actor(state).numpy()[0]
        probabilities = probabilities / probabilities.sum()  # guard against floating-point drift
        action = np.random.choice(self.num_actions, p=probabilities)
        return action

    def train(self, states, actions, rewards):
        discounted_rewards = self.calculate_discounted_rewards(rewards)

        states = tf.convert_to_tensor(states, dtype=tf.float32)
        actions = tf.convert_to_tensor(actions, dtype=tf.int32)
        returns = tf.convert_to_tensor(discounted_rewards, dtype=tf.float32)

        with tf.GradientTape() as tape:
            action_probs = self.actor(states)                  # (batch, num_actions)
            values = tf.squeeze(self.critic(states), axis=1)   # (batch,)
            advantages = returns - values

            # Actor loss: -log pi(a|s) weighted by the advantage (treated as a constant)
            neg_log_probs = tf.keras.losses.sparse_categorical_crossentropy(actions, action_probs)
            actor_loss = tf.reduce_mean(neg_log_probs * tf.stop_gradient(advantages))
            # Critic loss: regression of V(s) towards the discounted return
            critic_loss = tf.reduce_mean(tf.square(advantages))

            total_loss = actor_loss + critic_loss

        variables = self.actor.trainable_variables + self.critic.trainable_variables
        gradients = tape.gradient(total_loss, variables)
        self.optimizer.apply_gradients(zip(gradients, variables))

    def calculate_discounted_rewards(self, rewards):
        discounted_rewards = np.zeros(len(rewards), dtype=np.float32)
        running_reward = 0.0
        for t in reversed(range(len(rewards))):
            running_reward = rewards[t] + running_reward * 0.99  # discount factor: 0.99
            discounted_rewards[t] = running_reward
        return discounted_rewards


# Create a simple recommendation-system environment
class RecommendationEnv(gym.Env):
    def __init__(self):
        self.num_users = 10
        self.num_items = 5
        self.state_size = self.num_users + self.num_items
        self.max_steps = 10  # fixed episode length so that episodes terminate
        self.action_space = gym.spaces.Discrete(self.num_items)
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=(self.state_size,), dtype=np.float32)

    def reset(self):
        self.step_count = 0
        state = np.zeros(self.state_size, dtype=np.float32)
        state[:self.num_users] = np.random.randint(0, 2, size=self.num_users)  # user interests
        self.current_user = np.random.randint(0, self.num_users)  # current user
        return state

    def step(self, action):
        reward = 0
        # Toy preference rule: each user's preferred item is user_id % num_items
        if action == self.current_user % self.num_items:
            reward = 1  # the correct recommendation earns a reward of 1

        state = np.zeros(self.state_size, dtype=np.float32)
        state[:self.num_users] = np.random.randint(0, 2, size=self.num_users)  # user interests
        self.current_user = np.random.randint(0, self.num_users)  # current user

        self.step_count += 1
        done = self.step_count >= self.max_steps  # end the episode after max_steps interactions
        return state, reward, done, {}


# Train the A3C agent
def train_recommendation_system():
    env = RecommendationEnv()
    agent = A3CAgent(num_actions=env.num_items, state_size=env.state_size)

    episodes = 1000
    episode_rewards = []

    for episode in range(episodes):
        state = env.reset()
        done = False
        total_reward = 0

        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.train(np.array([state]), np.array([action]), np.array([reward]))

            state = next_state
            total_reward += reward

        episode_rewards.append(total_reward)
        print(f"Episode {episode + 1}: Reward = {total_reward}")

    return agent, episode_rewards


# Run the training process
agent, episode_rewards = train_recommendation_system()

# Print the total reward of each episode during training
print("Episode Rewards:", episode_rewards)

In the code above, the class A3CAgent is defined first; it contains methods for building the Actor and Critic models, selecting actions, training, and computing discounted rewards. Next, a simple recommendation-system environment is created, with its state space, action space, and state-transition function. Finally, the training function is defined: it runs the agent for a number of episodes, and in each episode it resets the environment and then, at every time step, selects an action, observes the state transition and the reward, and updates the model parameters. (Since this example uses a single worker, it demonstrates the Actor-Critic core of A3C rather than the asynchronous, multi-threaded variant.) The total reward of every episode is recorded and printed. Running the program produces output of the following form:

Episode 1: Reward = <total_reward>
Episode 2: Reward = <total_reward>
...
Episode n: Reward = <total_reward>

Here, <total_reward> is the total reward of each episode, that is, the reward the agent accumulates while performing the recommendation task within one episode. Printing the total reward of every episode makes it possible to observe how the agent's performance changes as training progresses.
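Because single-episode rewards are noisy, a smoothed curve is easier to interpret. The sketch below plots a moving average of episode_rewards; it assumes matplotlib is installed, which the example above does not otherwise require, and the window size of 50 is an arbitrary choice.

# Plot a moving average of the episode rewards to visualize the training trend.
import numpy as np
import matplotlib.pyplot as plt

window = 50  # size of the moving-average window
smoothed = np.convolve(episode_rewards, np.ones(window) / window, mode='valid')

plt.plot(smoothed)
plt.xlabel('Episode')
plt.ylabel(f'Average reward (window = {window})')
plt.title('Smoothed training reward')
plt.show()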

The following is a simple A3C implementation, provided for reference only:

```python
import gym
import numpy as np
import tensorflow as tf
import threading

global_episode = 0
global_rewards = []
global_episodes = 10000
episode_rewards = tf.keras.metrics.Mean('episode_rewards', dtype=tf.float32)


class A3C(tf.keras.Model):
    def __init__(self, state_size, action_size):
        super(A3C, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.dense1 = tf.keras.layers.Dense(64, activation='relu')
        self.dense2 = tf.keras.layers.Dense(64, activation='relu')
        self.policy_logits = tf.keras.layers.Dense(action_size)
        self.values = tf.keras.layers.Dense(1)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        logits = self.policy_logits(x)
        values = self.values(x)
        return logits, values


class Agent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.global_model = A3C(state_size, action_size)
        self.global_model(tf.zeros((1, state_size)))  # build the model's weights
        self.opt = tf.optimizers.Adam(learning_rate=.0001, clipnorm=1.0)
        self.gamma = 0.99
        self.tau = .125

    def train(self, state, action, reward, next_state, done):
        with tf.GradientTape() as tape:
            logits, value = self.global_model(tf.convert_to_tensor(state[None, :], dtype=tf.float32))
            _, next_value = self.global_model(tf.convert_to_tensor(next_state[None, :], dtype=tf.float32))
            # One-step TD advantage estimate
            advantage = reward + self.gamma * tf.stop_gradient(next_value[0]) * (1 - int(done)) - value[0]
            value_loss = advantage ** 2
            policy = tf.nn.softmax(logits)
            entropy = -tf.reduce_sum(policy * tf.math.log(policy + 1e-8))
            # Policy-gradient loss weighted by the (detached) advantage
            policy_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=[action], logits=logits) * tf.stop_gradient(advantage)
            total_loss = tf.reduce_mean(.5 * value_loss + policy_loss - .01 * entropy)
        grads = tape.gradient(total_loss, self.global_model.trainable_variables)
        self.opt.apply_gradients(zip(grads, self.global_model.trainable_variables))

    def get_action(self, state):
        logits, _ = self.global_model(tf.convert_to_tensor(state[None, :], dtype=tf.float32))
        probs = tf.nn.softmax(logits).numpy()[0]
        probs = probs / probs.sum()  # guard against floating-point drift
        action = np.random.choice(self.action_size, p=probs)
        return action

    def sync(self, local_model):
        # Soft-update the global parameters towards the local model's parameters
        for local, global_ in zip(local_model.trainable_variables, self.global_model.trainable_variables):
            global_.assign(self.tau * local + (1 - self.tau) * global_)


def test(env, agent):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = agent.get_action(state)
        next_state, reward, done, _ = env.step(action)
        state = next_state
        total_reward += reward
    return total_reward


def train(global_agent, num_episodes, lock):
    global global_episode, global_rewards
    env = gym.make('CartPole-v0')
    agent = Agent(env.observation_space.shape[0], env.action_space.n)
    for ep in range(num_episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.train(state, action, reward, next_state, done)
            state = next_state
            episode_reward += reward
        with lock:
            global_rewards.append(episode_reward)
            global_episode += 1
            episode_rewards(episode_reward)
            print("Episode: {}, Reward: {}".format(global_episode, episode_reward))
            global_agent.sync(agent.global_model)  # push the worker's weights into the shared model
        if global_episode % 100 == 0:
            test_reward = test(env, agent)
            print("Test Reward: {}".format(test_reward))


if __name__ == '__main__':
    lock = threading.Lock()
    global_agent = Agent(4, 2)
    threads = []
    for i in range(4):
        t = threading.Thread(target=train, args=(global_agent, global_episodes // 4, lock))
        threads.append(t)
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

In this implementation, we first define an A3C model and an Agent class. The A3C model has two outputs: the policy logits and the state-value estimate. The Agent class is responsible for interacting with the environment and updating the model by gradient descent. The simple CartPole environment is used to test the model. During training, four threads train in parallel, each with its own local model, and at the end of each episode the local parameters are blended into the shared global model. In addition, a test function evaluates the current policy every 100 episodes. Note that this is only a simple implementation and is not guaranteed to work in every environment; if you want to use A3C in your own project, it is recommended to consult mature open-source implementations.
