《Reinforcement Learning: An Introduction》 Reading Notes - Outline
Agent-Environment Interface
- agent
- learner and decision maker
- environment
- interacts with the agent; includes everything outside the agent
- environment’s state
- $S_t \in \mathcal{S}$
- action
- $A_t \in \mathcal{A}(s)$
- reward
- $R_t \in \mathcal{R} \subset \mathbb{R}$
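The interface above can be sketched as a simple interaction loop. Everything here (the random agent, the two-state toy environment) is illustrative and not from the book: at each step $t$ the agent observes $S_t$, picks $A_t$, and the environment returns $R_{t+1}$ and $S_{t+1}$.

```python
import random

random.seed(0)

def random_agent(state, actions):
    """A placeholder agent: picks an action uniformly at random."""
    return random.choice(actions)

def toy_environment(state, action):
    """A hypothetical 2-state environment: action 1 flips the state
    and pays reward 1.0; action 0 stays put and pays 0.0."""
    if action == 1:
        return 1 - state, 1.0
    return state, 0.0

state, total_reward = 0, 0.0
for t in range(10):
    action = random_agent(state, [0, 1])          # A_t
    state, reward = toy_environment(state, action)  # S_{t+1}, R_{t+1}
    total_reward += reward
```

The loop structure (observe, act, receive reward and next state) is the whole interface; any concrete agent or environment just fills in the two functions.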
MDP
Key elements
- the sets of states, actions, and rewards: $\mathcal{S}, \mathcal{A}, \mathcal{R}$
- in a Finite MDP, these sets are all finite
- the dynamics: $p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$
- the Markov property simplifies the problem
- only the most recent state and action need to be considered
- $S_{t-1}$ can still encode information from $S_{t-2}$ and earlier
- building on this, several related quantities can be derived, e.g.:
- state-transition probability $p(s' \mid s, a)$
- expected reward $r(s, a)$, $r(s, a, s')$
- examples
- recycling robot
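The dynamics table and its derived quantities can be sketched in code. The states and actions below echo the recycling-robot example, but the probabilities and rewards are made up for illustration, not the book's values:

```python
# p[(s, a)] maps (s', r) -> probability; entries per (s, a) sum to 1.
# All numbers here are invented for the sketch.
p = {
    ('high', 'search'):  {('high', 1.0): 0.7, ('low', 1.0): 0.3},
    ('high', 'wait'):    {('high', 0.5): 1.0},
    ('low', 'search'):   {('low', 1.0): 0.6, ('high', -3.0): 0.4},
    ('low', 'recharge'): {('high', 0.0): 1.0},
}

def transition_prob(s_next, s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

print(transition_prob('low', 'high', 'search'))  # 0.3
print(expected_reward('low', 'search'))          # 0.6*1.0 + 0.4*(-3.0) = -0.6
```

The point is that the four-argument dynamics $p(s', r \mid s, a)$ is the primitive: both $p(s' \mid s, a)$ and $r(s, a)$ fall out as marginals over it.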
Goal
- the agent's goal is to maximize $\mathbb{E}\left[\sum_t f(R_t)\right]$
- reward hypothesis:
That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Some concepts
- episode
- episodic task
- tasks that terminate, i.e., proceed episode by episode
- continuing task
- infinite, or not guaranteed to ever terminate (?)
- discounted return
- $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
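A minimal sketch (my own helper, not from the book) of computing $G_t$ from a finite reward sequence, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$:

```python
def discounted_return(rewards, gamma):
    """G_t for rewards = [R_{t+1}, R_{t+2}, ...] and discount gamma."""
    g = 0.0
    for r in reversed(rewards):  # apply G_t = R_{t+1} + gamma * G_{t+1} backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Iterating backwards avoids computing powers of $\gamma$ explicitly and works for any finite episode.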