Summary of Chapter 3, "Finite Markov Decision Processes", of "Reinforcement Learning: An Introduction"


Of course, the particular states and actions vary greatly from task to task, and how they are represented can strongly affect performance. In reinforcement learning, as in other kinds of learning, such representational choices are at present more art than science. 

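State representation: it should satisfy the Markov property as closely as possible, i.e. the current state should summarize the useful information in the history as well as it can (in general it cannot be just the immediate sensations, nor will it be the complete history of all past sensations).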

Action: task specific; the granularity must be appropriate.

Reward design: the reward must reflect our goal. Typically, actions that help achieve the goal receive a positive reward, and actions we do not want to see can receive a negative reward (To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment "jerkiness" of the motion).

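Reward design must reflect the goal. Exercise 3.5 is a good example: with a reward of +1 for escaping from the maze and a reward of zero at all other times, and return G_t without discount, the agent receives exactly the same return no matter how many time steps it takes, as long as it eventually escapes. The agent therefore may not be able to tell what we want it to learn: it can wander around inside the maze, since all that matters is getting out in the end. For a task like maze running, we should instead give a negative reward (a penalty) on every time step, so that the agent learns to leave the maze as quickly as possible and thereby obtains a larger return. (Discounting would also work, although I suspect the effect would be less pronounced.)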
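To make the maze argument concrete, here is a minimal sketch in Python. The helper names (`discounted_return`, `exit_only_rewards`, `step_penalty_rewards`) and the episode lengths are my own illustrative assumptions, not anything from the book; the sketch only compares the return at the start of an episode under the two reward schemes.

```python
def discounted_return(rewards, gamma=1.0):
    """Return G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def exit_only_rewards(steps):
    """Hypothetical scheme A: +1 on the step that escapes, 0 everywhere else."""
    return [0.0] * (steps - 1) + [1.0]


def step_penalty_rewards(steps):
    """Hypothetical scheme B: -1 on every time step until the agent escapes."""
    return [-1.0] * steps


for steps in (10, 100, 1000):
    a = discounted_return(exit_only_rewards(steps))             # undiscounted
    b = discounted_return(step_penalty_rewards(steps))          # undiscounted
    a_disc = discounted_return(exit_only_rewards(steps), 0.9)   # gamma = 0.9
    print(f"steps={steps:5d}  exit-only={a}  step-penalty={b}  exit-only(gamma=0.9)={a_disc:.3g}")
```

Under scheme A without discounting, the return is 1.0 for every episode length, so nothing distinguishes a fast escape from a slow one. Under scheme B the return is minus the number of steps, so shorter escapes are strictly better. With scheme A and gamma = 0.9 the return does depend on the length, but it becomes very small for long episodes.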

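Should reward design care about the absolute magnitude of the rewards, or only their relative magnitude? Exercises 3.9/3.10 give a good example. For a continuing RL task, only the [relative differences] between rewards matter (it is fine if they are all negative): adding a constant c to all the rewards adds a constant, V_c, to the values of all states, and thus does not affect the relative values of any states under any policies, where $V_c = c \sum_{k=0}^{\infty} \gamma^k = c/(1-\gamma)$ for $\gamma < 1$. For an episodic RL task, both the [relative differences] and the [absolute magnitudes] of the rewards matter: here $V_c = c \sum_{k=0}^{K} \gamma^k$, where K depends on how many steps remain in the episode, so states visited later accumulate only a few of the $\gamma^k$ terms and the shift is no longer the same for every state.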
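The effect of adding a constant can be checked numerically. The sketch below is only an illustration under assumed numbers (gamma = 0.9, c = 5, episodes of 3 and 30 steps with reward -1 per step); the helper names are hypothetical. It shows that in the episodic case the shift depends on the episode length and can even flip which episode looks better.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def shift(rewards, c):
    """Add the same constant c to every reward."""
    return [r + c for r in rewards]


gamma, c = 0.9, 5.0          # assumed values, chosen only for illustration
short_episode = [-1.0] * 3   # terminates after 3 steps
long_episode = [-1.0] * 30   # terminates after 30 steps

for name, rewards in (("short", short_episode), ("long", long_episode)):
    before = discounted_return(rewards, gamma)
    after = discounted_return(shift(rewards, c), gamma)
    # Episodic shift: c * sum_{k=0}^{T-1} gamma^k = c * (1 - gamma^T) / (1 - gamma)
    expected_shift = c * (1 - gamma ** len(rewards)) / (1 - gamma)
    print(f"{name}: before={before:.3f} after={after:.3f} "
          f"shift={after - before:.3f} expected={expected_shift:.3f}")

# In the continuing case the shift is the same for every state: c / (1 - gamma).
print("continuing-task shift:", c / (1 - gamma))
```

Before the shift the short episode has the higher return; after adding c = 5 to every reward, the long episode has the higher return, so the agent would now prefer to stay in the episode longer. In a continuing task every state is shifted by the same c / (1 - gamma), so no preference changes.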

Reward hypothesis: the most fundamental starting point of RL

    what we want/ what we mean by goals can be well thought of as the maximization of the expected cumulative sum of a scalar reward signal!

    maximizing the reward signal is one of the most distinctive features of RL

    this might at first appear limiting, but in practice it has proved to be flexible and widely applicable, for example, ...

    the reward hypothesis is

