Policy Gradient Methods (REINFORCE)
Policy gradient methods are a class of reinforcement learning algorithms that learn the policy function directly, rather than a value function. Their core idea is to update the policy parameters so as to maximize the policy's expected cumulative return.
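For reference, the objective being maximized and the gradient estimator that REINFORCE uses can be written as follows (standard notation added here for context, not taken from the original text: pi_theta is the parameterized policy, gamma the discount factor, and G_t the discounted return from step t):

J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \gamma^{t} r_t\right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]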
Below is a short tutorial that implements a policy gradient method (the REINFORCE algorithm) in Python with TensorFlow/Keras. In this example we use the CartPole environment from OpenAI Gym.
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
import gym

# Define the policy gradient agent
class PolicyGradientAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = 0.99  # discount factor
        self.learning_rate = 0.01
        self.model = self.build_model()

    def build_model(self):
        # Policy network: state in, softmax probabilities over actions out
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='softmax'))
        model.compile(loss='categorical_crossentropy',
                      optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def get_action(self, state):
        # Sample an action from the probabilities produced by the policy network
        state = np.reshape(state, [1, self.state_size])
        action_prob = self.model.predict(state, verbose=0)[0]
        action_prob = action_prob / np.sum(action_prob)  # guard against float rounding
        action = np.random.choice(self.action_size, p=action_prob)
        return action

    def train(self, states, actions, rewards):
        # One-hot encode the taken actions and weight them by the normalized
        # discounted returns; with categorical cross-entropy this is equivalent
        # to the REINFORCE loss -G_t * log pi(a_t|s_t)
        action_one_hot = tf.keras.utils.to_categorical(actions, self.action_size)
        discounted_rewards = self.discount_rewards(rewards)
        discounted_rewards -= np.mean(discounted_rewards)
        discounted_rewards /= (np.std(discounted_rewards) + 1e-8)
        action_one_hot *= discounted_rewards[:, np.newaxis]
        states = np.reshape(states, [len(states), self.state_size])
        self.model.train_on_batch(states, action_one_hot)

    def discount_rewards(self, rewards):
        # Compute the discounted return G_t for every step of the episode
        discounted_rewards = np.zeros_like(rewards, dtype=np.float32)
        running_add = 0
        for t in reversed(range(len(rewards))):
            running_add = running_add * self.gamma + rewards[t]
            discounted_rewards[t] = running_add
        return discounted_rewards

# Initialize the environment and the agent
# (this assumes the classic Gym API, i.e. gym < 0.26)
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = PolicyGradientAgent(state_size, action_size)

# Train the policy gradient agent
num_episodes = 1000
for episode in range(num_episodes):
    state = env.reset()
    total_reward = 0
    states, actions, rewards = [], [], []
    for time in range(500):  # cap the steps per episode to avoid endless loops
        # env.render()  # uncomment to visualize training
        action = agent.get_action(state)
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            print("Episode: {}, Total Reward: {}".format(episode + 1, total_reward))
            agent.train(states, actions, rewards)
            break

# Close the environment
env.close()
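After training, the learned policy can be checked with a greedy rollout that picks the most probable action instead of sampling. This is a minimal sketch, assuming the same classic Gym API (gym < 0.26) as above; eval_env is created here only for evaluation.

eval_env = gym.make('CartPole-v1')
state = eval_env.reset()
done, total_reward = False, 0.0
while not done:
    # Greedy action: argmax over the policy's action probabilities
    probs = agent.model.predict(np.reshape(state, [1, state_size]), verbose=0)[0]
    state, reward, done, _ = eval_env.step(int(np.argmax(probs)))
    total_reward += reward
print("Greedy rollout reward:", total_reward)
eval_env.close()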
In this example we used a simple form of the policy gradient method, the REINFORCE algorithm. The agent's policy is a neural network that takes the state as input and outputs a probability for each action. At every time step the agent samples an action from these probabilities, and at the end of each episode the network parameters are updated with the policy gradient.
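The weighted one-hot trick used in train() above is compact but a little opaque. The same update can be written out explicitly; the sketch below is one possible formulation, assuming a separate tf.keras optimizer instance and the states, actions, and normalized discounted returns collected as in the tutorial (the function name reinforce_step is illustrative, not part of any library).

def reinforce_step(model, optimizer, states, actions, returns):
    # Explicit REINFORCE update: loss = -sum_t log pi(a_t|s_t) * G_t
    states = tf.convert_to_tensor(np.vstack(states), dtype=tf.float32)
    actions = tf.convert_to_tensor(actions, dtype=tf.int32)
    returns = tf.convert_to_tensor(returns, dtype=tf.float32)
    with tf.GradientTape() as tape:
        probs = model(states)  # action probabilities for every step
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        log_probs = tf.math.log(tf.gather_nd(probs, idx) + 1e-8)  # log pi(a_t|s_t)
        loss = -tf.reduce_sum(log_probs * returns)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)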
Note that REINFORCE uses discounted rewards, i.e. future rewards are discounted by gamma. The return credited to each action therefore weights its near-term consequences more heavily than rewards that arrive much later, which gives the agent a more useful learning signal.
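As a concrete check of discount_rewards() defined above: with gamma = 0.99, an episode whose rewards are [1.0, 1.0, 1.0] produces the returns below, where each entry is the sum of the remaining rewards weighted by powers of gamma.

rewards = [1.0, 1.0, 1.0]
print(agent.discount_rewards(rewards))
# -> [2.9701 1.99   1.    ]  (approximately)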
In addition, practical applications usually need further techniques and tuning, such as a baseline, reward normalization, or a more expressive network architecture; the baseline idea is sketched below.
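One common way to add a baseline is to subtract a learned state-value estimate V(s) from each return, which lowers the variance of the gradient estimate without biasing it. A minimal sketch, assuming the same state_size, Sequential/Dense/Adam imports, and training loop as above; value_net and advantages_with_baseline are illustrative names, not part of the tutorial.

value_net = Sequential([
    Dense(24, input_dim=state_size, activation='relu'),
    Dense(1, activation='linear'),
])
value_net.compile(loss='mse', optimizer=Adam(learning_rate=0.01))

def advantages_with_baseline(states, discounted_rewards):
    # Advantage A_t = G_t - V(s_t): feed these to the policy update instead of G_t
    states = np.reshape(states, [len(states), state_size])
    baseline = value_net.predict(states, verbose=0).flatten()  # current V(s_t)
    targets = np.asarray(discounted_rewards, dtype=np.float32).reshape(-1, 1)
    value_net.train_on_batch(states, targets)  # fit V(s) to the observed returns
    return discounted_rewards - baseline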