Summary of "Reinforcement Learning: An Introduction", Chapter 3: "Finite Markov Decision Processes"

A new member has joined our group and I need to walk him through the basics of RL, so we chose to start from Silver's course.

For myself, I am adding the requirement of carefully reading Reinforcement Learning: An Introduction.

I did not read it very carefully the first time, so this time I hope to be more thorough and also write a brief summary of the corresponding knowledge points.



When applying RL to real problems, the existing algorithms are, on the whole, good enough; the main work is to design states and rewards that reflect the essence of the problem (the actions are usually fairly clear-cut):

Of course, the particular states and actions vary greatly from task to task, and how they are represented can strongly affect performance. In reinforcement learning, as in other kinds of learning, such representational choices are at present more art than science. 

State representation: it should satisfy the Markov property as far as possible; the current state should summarize the useful information in the history (in general it cannot simply be the immediate sensations, nor will it be the complete history of all past sensations);
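When the raw observation alone is not Markov, one common practical choice (my own illustrative sketch, not something prescribed by this chapter) is to build the state from the last few observations, e.g. frame stacking:

```python
from collections import deque

import numpy as np


class StackedState:
    """Approximate the Markov property by using the last k observations
    as the state, rather than a single immediate sensation or the
    complete history of all past sensations."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # only the recent past is kept

    def reset(self, first_obs):
        # At the start of an episode, fill the stack with copies of the first observation.
        for _ in range(self.k):
            self.frames.append(first_obs)
        return self.state()

    def step(self, obs):
        # Push the newest observation; the oldest one drops out automatically.
        self.frames.append(obs)
        return self.state()

    def state(self):
        # The agent's state is the concatenation of the last k observations.
        return np.concatenate(list(self.frames), axis=0)


# Usage: stack the last 4 readings of a 2-dimensional sensor.
s = StackedState(k=4)
s.reset(np.zeros(2))
print(s.step(np.array([0.1, -0.3])).shape)  # (8,)
```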

Action: task specific; the granularity must be chosen appropriately.

Reward design: it must reflect our goal. Typically, an action that advances our goal gets a positive reward, and an action we do not want to see can get a negative one ("To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment 'jerkiness' of the motion").

Reward design must reflect the goal. A good example is Exercise 3.5: with a reward of +1 for escaping from the maze and a reward of zero at all other times, and an undiscounted return G_t, the return is exactly the same no matter how many time steps you spend, as long as you eventually get out. The agent then has no way to tell what you want it to learn; it can wander around inside as long as it finally escapes. For a maze task like this, each time step should instead carry a small negative reward, so the agent learns to leave the maze as quickly as possible in order to get a larger return. (Using a discount would also work, though the effect is probably less pronounced.)
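A quick numeric check of this point (my own sketch, not from the book): compare the undiscounted return of the sparse "+1 on escape" scheme with a "-1 per time step" scheme for episodes of different lengths.

```python
def sparse_return(steps):
    """Reward of +1 for escaping the maze, 0 at all other times, no discounting."""
    return 1.0  # identical for every episode length, so the agent is never pushed to hurry


def per_step_penalty_return(steps):
    """Reward of -1 on every time step until escape, no discounting."""
    return -1.0 * steps  # shorter episodes get strictly larger returns


for steps in (10, 100, 1000):
    print(steps, sparse_return(steps), per_step_penalty_return(steps))
# 10    1.0    -10.0
# 100   1.0   -100.0
# 1000  1.0  -1000.0
```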

Should reward design care about absolute magnitudes or only relative ones? Exercises 3.9/3.10 give a good example. For a continuing RL task, only the relative differences between the rewards matter (it does not matter if they are all negative!): adding a constant c to all the rewards adds a constant, $V_c = c\sum_{k=0}^{\infty}\gamma^k = \frac{c}{1-\gamma}$, to the values of all states, and thus does not affect the relative values of any states under any policy. For an episodic RL task, however, both the relative differences and the absolute magnitudes matter: the added constant becomes $V_c = c\sum_{k=0}^{K-1}\gamma^k$, where K is the number of time steps remaining in the episode, so states visited later in an episode can only accumulate a few of the $\gamma^k$ terms.
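A small sketch to make the asymmetry concrete (my own illustration; the value of γ and the step counts are made up): in the continuing case the added constant is the same c/(1-γ) for every state, while in the episodic case the shift depends on how many steps remain before termination.

```python
gamma = 0.9
c = 1.0  # constant added to every reward

# Continuing task: every state's value shifts by the same amount,
# c * sum_{k>=0} gamma^k = c / (1 - gamma), so relative values are unchanged.
print(round(c / (1 - gamma), 4))  # 10.0, the same shift for every state


def episodic_shift(K, gamma=gamma, c=c):
    """Shift added to the value of a state with K time steps remaining:
    c * sum_{k=0}^{K-1} gamma^k."""
    return c * sum(gamma ** k for k in range(K))


for K in (1, 5, 50):
    print(K, round(episodic_shift(K), 4))
# 1   1.0
# 5   4.0951
# 50  9.9485
```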



The reward hypothesis: the most essential starting point of RL

    what we want / what we mean by goals can be well thought of as the maximization of the expected cumulative sum of a scalar reward signal!

    maximization of the reward signal is one of the most distinctive features of RL

    it might at first appear limiting, but in practice it has proved to be flexible and widely applicable, for example, ...

    the reward hypothesis is the assumption that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (the reward)
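As a concrete form of the "expected cumulative sum" above, the discounted return that Chapter 3 asks the agent to maximize is (in the book's notation):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1,
$$

with γ < 1 required for continuing tasks; the reward hypothesis says that "what we want" can be expressed as choosing actions to maximize the expected value of this return.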
