Deep Reinforcement Learning Series: Designing and Setting the Reward Function (Reward Shaping)

@RichardWang

Article link: https://blog.csdn.net/gsww404/article/details/80803295

Overview

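Earlier posts in this series covered an overview of reinforcement learning, the algorithms (DPG -> DDPG), installing the OpenAI gym environment, and running the baseline algorithms and working around their pitfalls. The algorithms run correctly and give decent results, but one very important topic has been neglected all along: how the **reward function** of reinforcement learning is set up.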

1. Analyzing the reward function through the Gym Pendulum-v0 example

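Why discuss the reward function at all? In the algorithms we ran earlier, we probably never touched the reward function directly; we simply called an interface function. Take running the DDPG algorithm on the 'Pendulum-v0' environment (the task is to swing the pendulum upright) as an example:
[Figure: the Pendulum-v0 environment]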

for j in range(MAX_EP_STEPS):
    if RENDER:
        env.render()
    # Choose an action from the current policy
    a = ddpg.choose_action(s)
    # Add Gaussian exploration noise, then clip to the torque range [-2, 2]
    a = np.clip(np.random.normal(a, var), -2, 2)
    s_, r, done, info = env.step(a)
    # Store the transition, with the reward scaled down by a factor of 10
    ddpg.store_transition(s, a, r / 10, s_)

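This is the loop that runs one episode. After an action is obtained from the environment observation (ddpg.choose_action(s)), it is passed straight to env.step(), which hands back the corresponding reward and the next state ($S'$). From there we went straight on to improving the core algorithm.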
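As a minimal sketch of the exploration step in the loop above, the snippet below perturbs a deterministic action with Gaussian noise and clips the result to Pendulum's torque range [-2, 2]. The helper name `noisy_action`, the fixed seed, and the sample values are assumptions made for illustration; they are not part of the original DDPG code.

```python
import numpy as np

def noisy_action(a, var, low=-2.0, high=2.0, rng=None):
    """Sample around action `a` with standard deviation `var`, then clip.

    Hypothetical helper for illustration; it mirrors
    np.clip(np.random.normal(a, var), -2, 2) from the loop above.
    """
    rng = np.random.default_rng() if rng is None else rng
    return np.clip(rng.normal(a, var), low, high)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
a = np.array([1.8])             # action proposed by the policy (made-up value)
samples = np.array([noisy_action(a, var=3.0, rng=rng) for _ in range(1000)])

# Every sample stays inside the torque range, however large the noise is.
print(samples.min(), samples.max())
```

With a large `var`, many samples land exactly on the boundaries, so the amount of exploration is limited by the clipping range as well as by the noise scale.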

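However, we skipped over how the reward function is set up, which matters a great deal in reinforcement learning because it determines how fast, and how well, the algorithm converges.
So what actually happens behind env.step()?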

    def step(self, action, **kwargs):
        self._observation, reward, done, info = self.env.step(action)
        self._observation = np.clip(self._observation, self.env.observation_space.low, self.env.observation_space.high)
        return self.observation, reward, done, info

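The np.clip() call on the third line clamps each value: anything above the maximum is set to the maximum, and anything below the minimum is set to the minimum. Tracing self.env.step(action) on the second line further leads to the following code: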

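    def step(self, u):
        th, thdot = self.state  # th := theta (angle), thdot := angular velocity
        g = 10.  # gravitational acceleration
        m = 1.   # pendulum mass
        l = 1.   # pendulum length
        dt = self.dt
        # Clamp the torque to the allowed range
        u = np.clip(u, -self.max_torque, self.max_torque)[0]
        self.last_u = u  # for rendering
        # Cost: squared angle from upright + 0.1 * squared angular velocity + 0.001 * squared torque
        costs = angle_normalize(th)**2 + .1*thdot**2 + .001*(u**2)

        # Integrate the dynamics over one time step
        newthdot = thdot + (-3*g/(2*l) * np.sin(th + np.pi) + 3./(m*l**2)*u) * dt
        newth = th + newthdot*dt
        newthdot = np.clip(newthdot, -self.max_speed, self.max_speed)  # pylint: disable=E1111

        self.state = np.array([newth, newthdot])
        # The reward is the negative cost; done is always False and info is empty,
        # matching the four values unpacked by the callers above
        return self._get_obs(), -costs, False, {}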
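To make the cost term concrete, here is a small standalone sketch that evaluates the same expression for a few hand-picked states. `angle_normalize` is written out here on the assumption that it wraps the angle into [-π, π), as gym's helper of the same name does. `pendulum_reward` is a hypothetical helper name, and the test states assume gym's Pendulum limits of max_speed = 8 and max_torque = 2.

```python
import numpy as np

def angle_normalize(x):
    # Wrap an angle into [-pi, pi); assumed to match gym's helper of the same name.
    return ((x + np.pi) % (2 * np.pi)) - np.pi

def pendulum_reward(th, thdot, u):
    # Hypothetical helper: the negative of the cost computed in step().
    cost = angle_normalize(th) ** 2 + 0.1 * thdot ** 2 + 0.001 * u ** 2
    return -cost

upright = pendulum_reward(th=0.0, thdot=0.0, u=0.0)    # upright and at rest
hanging = pendulum_reward(th=np.pi, thdot=0.0, u=0.0)  # hanging straight down
worst = pendulum_reward(th=np.pi, thdot=8.0, u=2.0)    # top speed, full torque

print(upright, hanging, worst)
```

Under these assumptions the reward is never positive. It reaches 0 only when the pendulum is upright, motionless, and no torque is applied, and it falls to about -16.27 in the worst case. The angle term dominates, the velocity term comes second, and the torque penalty is small by comparison.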