Machine Learning Column (80): Policy Gradient in Practice, from Theory to a Perfectly Balanced CartPole

Sonal_Lynn

Last modified: 2025-05-17 00:14:47

Category: Artificial Intelligence Topics

Tags: algorithms, python, deep learning, machine learning

First published: 2025-05-17 00:14:14

Article link: https://blog.csdn.net/Conan_0728/article/details/148019665

I. The REINFORCE Algorithm in Depth

1.1 Mathematical Principles

The policy gradient theorem, stated mathematically:
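A standard statement of this result, in the reward-to-go form that matches the code in section 1.2, is given here as a sketch. The notation is an assumption of this example: \theta for the network weights, \pi_\theta for the softmax policy, \tau for a sampled episode of length T, and G_t for the discounted return from step t.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right],
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k = r_t + \gamma\, G_{t+1}
```

Minimizing the loss -mean(log \pi_\theta(a_t | s_t) * G_t) with gradient descent therefore follows a sample estimate of this gradient, up to the 1/T scaling introduced by the mean.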

Key steps:
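The usual REINFORCE loop (sample an action from the policy, observe the return, weight the gradient of the log-probability by that return, take a gradient-ascent step) can be sketched on a toy problem. Everything below is an assumption made for illustration: the two-armed bandit, its fixed rewards, and the names softmax and run_reinforce_bandit are hypothetical, and the policy is a bare logit vector rather than the network used later.

```python
import numpy as np


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def run_reinforce_bandit(episodes=500, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    arm_rewards = np.array([0.0, 1.0])  # hypothetical environment: arm 1 is better
    theta = np.zeros(2)                 # policy parameters (one logit per arm)
    for _ in range(episodes):
        probs = softmax(theta)
        # Step 1: sample an action from the current policy.
        action = rng.choice(2, p=probs)
        # Step 2: observe the return (a one-step episode, so G is just the reward).
        G = arm_rewards[action]
        # Step 3: gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs.
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0
        # Step 4: gradient ascent on the expected return.
        theta += lr * G * grad_log_pi
    return softmax(theta)


final_probs = run_reinforce_bandit()
```

With these settings the policy should end up strongly preferring arm 1, since only pulls of that arm produce a nonzero update.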

1.2 Core Implementation

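import numpy as np
import tensorflow as tf


class PolicyGradientAgent:
    def __init__(self, state_dim, action_dim, lr=0.01, gamma=0.95):
        self.action_dim = action_dim
        self.gamma = gamma  # discount factor
        self.model = self.build_model(state_dim, action_dim)
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=lr)

    def build_model(self, state_dim, action_dim):
        # One hidden layer; the softmax output is a probability per action.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(state_dim,)),
            tf.keras.layers.Dense(action_dim, activation='softmax')
        ])
        return model

    def get_action(self, state):
        # Sample from the policy rather than taking the argmax, so the agent keeps exploring.
        prob = self.model.predict(state[np.newaxis], verbose=0)[0]
        return np.random.choice(len(prob), p=prob)

    def train(self, states, actions, rewards):
        # One update from a single episode: states (T, state_dim), actions (T,), rewards (T,).
        states = np.asarray(states, dtype=np.float32)
        actions = np.asarray(actions, dtype=np.int32)
        # Discounted, normalized returns (see discount_rewards in section 2.1).
        discounted_rewards = self.discount_rewards(rewards)

        with tf.GradientTape() as tape:
            probs = self.model(states)
            # Probability the policy assigned to each action that was actually taken.
            action_probs = tf.reduce_sum(probs * tf.one_hot(actions, depth=self.action_dim), axis=1)
            # The small constant guards against log(0).
            loss = -tf.reduce_mean(tf.math.log(action_probs + 1e-8) * discounted_rewards)

        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))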
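To make the loss in train concrete, the same arithmetic can be traced by hand in plain NumPy. This is only an illustrative sketch: the function name reinforce_loss and the three-step episode are made up for the example, and no gradient is computed here.

```python
import numpy as np


def reinforce_loss(probs, actions, returns, eps=1e-8):
    probs = np.asarray(probs, dtype=np.float64)
    actions = np.asarray(actions)
    # Rows of the identity matrix act as one-hot vectors for the chosen actions.
    one_hot = np.eye(probs.shape[1])[actions]
    # Keep only the probability of the action taken at each step.
    action_probs = np.sum(probs * one_hot, axis=1)
    # Negative mean of log-probability weighted by return.
    return -np.mean(np.log(action_probs + eps) * returns)


# Hypothetical three-step episode with two actions.
probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]])
actions = np.array([0, 1, 1])
returns = np.array([1.0, 0.0, -1.0])

loss = reinforce_loss(probs, actions, returns)
# Selected probabilities are 0.8, 0.7, 0.5, so
# loss = -(log(0.8) * 1 + log(0.7) * 0 + log(0.5) * (-1)) / 3
```

A step with a positive return pushes its action's probability up, a step with a negative return pushes it down, and a step whose normalized return is zero contributes nothing.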

II. Key Function Implementation Details

2.1 Optimized Discounted Return Computation

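def discount_rewards(self, rewards):
    # Method of PolicyGradientAgent (it reads self.gamma).
    discounted = np.zeros_like(rewards, dtype=np.float32)
    running_add = 0.0
    # Walk backwards through the episode so that G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running_add = running_add * self.gamma + rewards[t]
        discounted[t] = running_add
    # Normalize to zero mean and unit variance to reduce gradient variance.
    return (discounted - np.mean(discounted)) / (np.std(discounted) + 1e-8)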
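A small worked example shows what the backward pass and the normalization each do. The sketch below is a standalone rewrite with gamma passed as an argument; the names discounted_returns and normalize and the three-step reward list are assumptions for illustration, not part of the agent class.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    out = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        out[t] = running
    return out


def normalize(x, eps=1e-8):
    # Zero mean, unit standard deviation; eps avoids division by zero.
    return (x - x.mean()) / (x.std() + eps)


raw = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
# Step 2: 1.0; step 1: 1 + 0.5 * 1.0 = 1.5; step 0: 1 + 0.5 * 1.5 = 1.75
norm = normalize(raw)
```

After normalization, early steps (which have more future reward ahead of them) get positive weights and late steps get negative ones, so roughly half the actions in an episode are reinforced and half are discouraged.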