Reinforcement Learning - Proximal Policy Optimization (PPO)

What Is Proximal Policy Optimization (PPO)?

Proximal Policy Optimization (PPO) is a policy-gradient method for solving reinforcement learning problems. PPO aims to maximize the expected cumulative reward by optimizing the agent's policy. It introduces a clipping term that limits how far the new policy may move away from the old policy in a single update, which improves training stability.
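Concretely, PPO maximizes the clipped surrogate objective below, where the clipping parameter epsilon corresponds to clip_ratio in the code that follows and \hat{A}_t is the advantage estimate:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]
$$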

Below is an example of a simple Proximal Policy Optimization (PPO) implementation using Python and TensorFlow/Keras. In this example, we use the CartPole environment from OpenAI Gym.

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam
import gym

# Define the Proximal Policy Optimization agent
class PPOAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = 0.99  # discount factor
        self.learning_rate = 0.001
        self.clip_ratio = 0.2  # PPO clipping parameter (epsilon)
        self.epochs = 10  # gradient epochs per policy update

        # Build the actor (policy) network
        self.actor = self.build_actor()

        # Build the critic (value) network
        self.critic = self.build_critic()

    def build_actor(self):
        state_input = Input(shape=(self.state_size,))
        dense1 = Dense(64, activation='relu')(state_input)
        dense2 = Dense(64, activation='relu')(dense1)
        output = Dense(self.action_size, activation='softmax')(dense2)
        model = Model(inputs=state_input, outputs=output)
        # The PPO clipped loss is computed manually in train(), so compile only attaches the optimizer
        model.compile(optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def build_critic(self):
        state_input = Input(shape=(self.state_size,))
        dense1 = Dense(64, activation='relu')(state_input)
        dense2 = Dense(64, activation='relu')(dense1)
        output = Dense(1, activation='linear')(dense2)
        model = Model(inputs=state_input, outputs=output)
        model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def get_action(self, state):
        # Sample an action from the current policy's probability distribution
        state = np.reshape(state, [1, self.state_size])
        action_prob = self.actor.predict(state, verbose=0)[0]
        action = np.random.choice(self.action_size, p=action_prob)
        return action, action_prob

    def get_value(self, state):
        # Estimate the value of a state with the critic
        state = np.reshape(state, [1, self.state_size])
        value = self.critic.predict(state, verbose=0)[0][0]
        return value

    def train(self, states, actions, rewards, values, dones):
        states = np.vstack(states).astype(np.float32)
        actions = np.array(actions)
        advantages, target_values = self.compute_advantages_and_targets(rewards, values, dones)

        # Probabilities of the taken actions under the old (pre-update) policy
        old_probs = self.actor.predict(states, verbose=0)
        old_action_probs = old_probs[np.arange(len(actions)), actions].astype(np.float32)
        action_one_hot = tf.keras.utils.to_categorical(actions, self.action_size)

        # Train the actor network: several epochs of the PPO clipped surrogate objective
        for _ in range(self.epochs):
            with tf.GradientTape() as tape:
                probs = self.actor(states, training=True)
                action_probs = tf.reduce_sum(probs * action_one_hot, axis=1)
                ratio = action_probs / (old_action_probs + 1e-10)
                clipped_ratio = tf.clip_by_value(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio)
                actor_loss = -tf.reduce_mean(tf.minimum(ratio * advantages, clipped_ratio * advantages))
            grads = tape.gradient(actor_loss, self.actor.trainable_variables)
            self.actor.optimizer.apply_gradients(zip(grads, self.actor.trainable_variables))

        # Train the critic network toward the discounted returns
        self.critic.train_on_batch(states, target_values)

    def compute_advantages_and_targets(self, rewards, values, dones):
        # Discounted Monte Carlo returns are the critic targets; the advantage is the
        # return minus the critic's value estimate (used as a baseline).
        advantages = np.zeros_like(rewards, dtype=np.float32)
        target_values = np.zeros_like(rewards, dtype=np.float32)
        running_add = 0
        for t in reversed(range(len(rewards))):
            running_add = running_add * self.gamma * (1 - dones[t]) + rewards[t]
            advantages[t] = running_add - values[t]
            target_values[t] = running_add
        return advantages, target_values

# Initialize the environment and agent
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = PPOAgent(state_size, action_size)

# Train the PPO agent
num_episodes = 1000
for episode in range(num_episodes):
    state = env.reset()
    total_reward = 0
    states, actions, rewards, values, dones = [], [], [], [], []
    for time in range(500):  # cap the number of steps per episode to avoid an endless loop
        # env.render()  # uncomment this line to visualize training
        action, action_prob = agent.get_action(state)
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        value = agent.get_value(state)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        values.append(value)
        dones.append(done)
        state = next_state
        if done:
            print("Episode: {}, Total Reward: {}".format(episode + 1, total_reward))
            agent.train(states, actions, rewards, values, dones)
            break

# Close the environment
env.close()
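Note that the training loop above assumes the classic gym API (gym < 0.26), where env.reset() returns only the observation and env.step() returns four values. If you run it with gym >= 0.26 or gymnasium, a minimal adjustment to those two calls would look like this:

state, _ = env.reset()
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated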

In this example, we defined a simple Proximal Policy Optimization (PPO) agent consisting of two neural networks: an actor and a critic. During training, the PPO algorithm is used to update the parameters of both the actor network and the critic network.

Note that a PPO implementation may vary with the complexity of the problem and usually benefits from additional techniques and tuning, such as normalizing rewards or advantages and using more sophisticated network architectures; a sketch of advantage normalization is shown below.
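As one example of such a tweak, advantages are often standardized before the policy update. A minimal sketch, assuming a hypothetical helper named normalize_advantages that is not part of the code above:

import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    # Standardize the advantages to zero mean and unit variance to stabilize updates
    advantages = np.asarray(advantages, dtype=np.float32)
    return (advantages - advantages.mean()) / (advantages.std() + eps)

# Example usage inside PPOAgent.train(), right after the advantages are computed:
# advantages = normalize_advantages(advantages)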
