Summary of "Reinforcement Learning: An Introduction", Chapter 3: "Finite Markov Decision Processes"

A new member has joined our group and I need to walk him through the basics of RL, so we chose to start from Silver's course.

For myself, I am adding the requirement of carefully reading Reinforcement Learning: An Introduction.

I did not read it very carefully the first time, so this time I hope to be more thorough and also write a brief summary of the corresponding knowledge points.



When applying RL to real problems, the existing algorithms are, on the whole, good enough; the main work is to design states and rewards that reflect the essence of the problem (the actions are usually fairly clear-cut):

Of course, the particular states and actions vary greatly from task to task, and how they are represented can strongly affect performance. In reinforcement learning, as in other kinds of learning, such representational choices are at present more art than science. 

State representation: it should satisfy the Markov property as far as possible; the current state should summarize the useful information in the history (in general it cannot simply be the immediate sensations, nor will it be the complete history of all past sensations);
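When the raw observation alone is not Markov, one common practical choice (my own illustrative sketch, not something prescribed by this chapter) is to build the state from the last few observations, e.g. frame stacking:

```python
from collections import deque

import numpy as np


class StackedState:
    """Approximate the Markov property by using the last k observations
    as the state, rather than a single immediate sensation or the
    complete history of all past sensations."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # only the recent past is kept

    def reset(self, first_obs):
        # At the start of an episode, fill the stack with copies of the first observation.
        for _ in range(self.k):
            self.frames.append(first_obs)
        return self.state()

    def step(self, obs):
        # Push the newest observation; the oldest one drops out automatically.
        self.frames.append(obs)
        return self.state()

    def state(self):
        # The agent's state is the concatenation of the last k observations.
        return np.concatenate(list(self.frames), axis=0)


# Usage: stack the last 4 readings of a 2-dimensional sensor.
s = StackedState(k=4)
s.reset(np.zeros(2))
print(s.step(np.array([0.1, -0.3])).shape)  # (8,)
```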

Action: task specific; the granularity must be chosen appropriately.

Reward design: it must reflect our goal. Typically, an action that advances our goal gets a positive reward, and an action we do not want to see can get a negative one ("To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment 'jerkiness' of the motion").

Reward design must reflect the goal. A good example is Exercise 3.5: with a reward of +1 for escaping from the maze and a reward of zero at all other times, and an undiscounted return G_t, the return is exactly the same no matter how many time steps you spend, as long as you eventually get out. The agent then has no way to tell what you want it to learn; it can wander around inside as long as it finally escapes. For a maze task like this, each time step should instead carry a small negative reward, so the agent learns to leave the maze as quickly as possible in order to get a larger return. (Using a discount would also work, though the effect is probably less pronounced.)
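A quick numeric check of this point (my own sketch, not from the book): compare the undiscounted return of the sparse "+1 on escape" scheme with a "-1 per time step" scheme for episodes of different lengths.

```python
def sparse_return(steps):
    """Reward of +1 for escaping the maze, 0 at all other times, no discounting."""
    return 1.0  # identical for every episode length, so the agent is never pushed to hurry


def per_step_penalty_return(steps):
    """Reward of -1 on every time step until escape, no discounting."""
    return -1.0 * steps  # shorter episodes get strictly larger returns


for steps in (10, 100, 1000):
    print(steps, sparse_return(steps), per_step_penalty_return(steps))
# 10    1.0    -10.0
# 100   1.0   -100.0
# 1000  1.0  -1000.0
```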

Should reward design care about absolute magnitudes or only relative ones? Exercises 3.9/3.10 give a good example. For a continuing RL task, only the relative differences between the rewards matter (it does not matter if they are all negative!): adding a constant c to all the rewards adds a constant, $V_c = c\sum_{k=0}^{\infty}\gamma^k = \frac{c}{1-\gamma}$, to the values of all states, and thus does not affect the relative values of any states under any policy. For an episodic RL task, however, both the relative differences and the absolute magnitudes matter: the added constant becomes $V_c = c\sum_{k=0}^{K-1}\gamma^k$, where K is the number of time steps remaining in the episode, so states visited later in an episode can only accumulate a few of the $\gamma^k$ terms.
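A small sketch to make the asymmetry concrete (my own illustration; the value of γ and the step counts are made up): in the continuing case the added constant is the same c/(1-γ) for every state, while in the episodic case the shift depends on how many steps remain before termination.

```python
gamma = 0.9
c = 1.0  # constant added to every reward

# Continuing task: every state's value shifts by the same amount,
# c * sum_{k>=0} gamma^k = c / (1 - gamma), so relative values are unchanged.
print(round(c / (1 - gamma), 4))  # 10.0, the same shift for every state


def episodic_shift(K, gamma=gamma, c=c):
    """Shift added to the value of a state with K time steps remaining:
    c * sum_{k=0}^{K-1} gamma^k."""
    return c * sum(gamma ** k for k in range(K))


for K in (1, 5, 50):
    print(K, round(episodic_shift(K), 4))
# 1   1.0
# 5   4.0951
# 50  9.9485
```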



The reward hypothesis: the most essential starting point of RL

    what we want / what we mean by goals can be well thought of as the maximization of the expected cumulative sum of a scalar reward signal!

    maximization of the reward signal is one of the most distinctive features of RL

    it might at first appear limiting, but in practice it has proved to be flexible and widely applicable, for example, ...

    the reward hypothesis is the assumption that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (the reward)
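As a concrete form of the "expected cumulative sum" above, the discounted return that Chapter 3 asks the agent to maximize is (in the book's notation):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1,
$$

with γ < 1 required for continuing tasks; the reward hypothesis says that "what we want" can be expressed as choosing actions to maximize the expected value of this return.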
