Markov Processes
MDPs formally describe fully observable environments in reinforcement learning. Almost all RL problems can be formulated as MDPs: optimal control primarily deals with continuous MDPs, partially observable problems can be converted into MDPs, and bandits are MDPs with a single state.
Markov property: the future is independent of the past given the present; that is, the current state captures all relevant information from the history.
Markov Process (MP) / Markov Chain
A Markov process is defined by a set of states and a state transition probability matrix, i.e. $\langle S, P \rangle$.
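As a minimal sketch in Python (the state names and transition probabilities below are invented for illustration), a Markov chain is fully specified by its transition matrix, and sampling the next state only ever looks at the current one:

```python
import numpy as np

# Hypothetical 3-state Markov chain; states and probabilities are invented
# for illustration. Each row of P must sum to 1.
states = ["sunny", "cloudy", "rainy"]
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.5, 0.3]])
assert np.allclose(P.sum(axis=1), 1.0)

def sample_chain(P, start, steps, rng=np.random.default_rng(0)):
    """Sample a trajectory; the next state depends only on the current one."""
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])
        path.append(s)
    return [states[i] for i in path]

print(sample_chain(P, start=0, steps=5))
```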
Markov Reward Processes
Markov Reward Process (MRP)
A Markov chain with values: on top of the Markov process, each state has an associated reward, and a discount factor $\gamma$ is added for computing the return, i.e. $\langle S, P, R, \gamma \rangle$.
- value function: the long-term return starting from state s, i.e. $v(s) = E[G_t \mid S_t = s]$, where the return is $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
Bellman Equation
The value function can be decomposed into two parts: the immediate reward $R_{t+1}$ and the discounted value of the successor state $\gamma v(S_{t+1})$.
$$
\begin{aligned}
v(s) &= E[G_t \mid S_t = s] \\
&= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] \\
&= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \mid S_t = s] \\
&= E[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]
\end{aligned}
$$
Instead of summing rewards over all future time steps, we only need the value of the next state. Writing out the expectation explicitly (where $P_{ss'}$ is the state transition probability and $R_s$ the expected immediate reward in state s):
$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} v(s')$$
In matrix form:
$$v = R + \gamma P v$$
That is:
$$
\begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix} =
\begin{bmatrix} R_1 \\ \vdots \\ R_n \end{bmatrix} + \gamma
\begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix}
\begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix}
$$
To obtain $v$, this linear equation can be solved directly:
$$v = (I - \gamma P)^{-1} R$$
However, matrix inversion has complexity $O(n^3)$, which is too expensive for large MRPs. There are iterative methods for large MRPs, such as Dynamic Programming, Monte-Carlo evaluation, and Temporal-Difference learning.
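A minimal sketch of both routes in Python, using an invented 3-state MRP: the direct solve applies the closed form above, while the iterative version simply repeats the Bellman backup $v \leftarrow R + \gamma P v$ until the values stop changing:

```python
import numpy as np

# Hypothetical 3-state MRP: rewards and transitions are invented for illustration.
R = np.array([1.0, 0.0, -1.0])           # expected immediate reward per state
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])           # P[s, s'] = transition probability
gamma = 0.9

# Direct solve: v = (I - gamma * P)^{-1} R  -- O(n^3), fine for small n.
v_direct = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Iterative solve: repeatedly apply v <- R + gamma * P v until convergence.
v = np.zeros_like(R)
for _ in range(1000):
    v_new = R + gamma * P @ v
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print(v_direct, v)   # both converge to the same value function
```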
Markov Decision Processes
Markov Decision Process (MDP)
An MDP is an MRP with decisions: on top of the MRP, the agent chooses actions, i.e. $\langle S, A, P, R, \gamma \rangle$. It is an environment in which all states are Markov.
The change is that an MDP takes actions into account: the next state depends not only on the current state but also on the action taken. Note that after taking an action in a state, the successor state is not necessarily fixed; there may be several possible successors, as shown in the figure below:
Definition:
For example (at the node marked 0.2, 0.4, 0.4, taking the action can lead to several states; everywhere else an action leads to a single state):
Policy
A policy is a distribution over actions given states, i.e. $\pi(a \mid s) = P[A_t = a \mid S_t = s]$
Note:
- A policy fully defines the agent's behaviour.
- An MDP policy depends only on the current state, not on the history; it is stationary (time-independent): $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$
- Given a policy, the MDP reduces to an MRP and a Markov process: the state sequence is a Markov process, and the state-reward sequence is an MRP.
Note that the entries of the induced transition matrix have to be computed: the probability of moving from s to s' is the sum of $P^a_{ss'}$ over all actions a that can reach s', weighted by the policy (the reward vector is weighted analogously):
$$P^\pi_{ss'} = \sum_{a \in A} \pi(a \mid s) P^a_{ss'}, \qquad R^\pi_s = \sum_{a \in A} \pi(a \mid s) R^a_s$$
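A minimal sketch of this weighting in Python (the transition tensor, rewards, and policy below are invented for illustration):

```python
import numpy as np

# Hypothetical MDP with 2 actions and 3 states (numbers invented).
# P[a, s, s_next] = P^a_{ss'},  R[s, a] = R^a_s,  pi[s, a] = pi(a|s).
P = np.array([[[0.8, 0.2, 0.0],
               [0.0, 0.5, 0.5],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.3, 0.3, 0.4],
               [0.0, 0.6, 0.4]]])
R = np.array([[ 1.0, 0.0],
              [ 0.0, 2.0],
              [-1.0, 0.0]])
pi = np.array([[0.5, 0.5],
               [0.9, 0.1],
               [0.2, 0.8]])

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'};  R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)

print(P_pi, P_pi.sum(axis=1))   # each row of P^pi still sums to 1
print(R_pi)
```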
Value Function
An MDP has two value functions: the state-value function and the action-value function.
- state-value function
The expected return starting from state s and then following policy $\pi$: $v_\pi(s) = E_\pi[G_t \mid S_t = s]$
- action-value function
The expected return starting from state s, taking action a, and then following policy $\pi$: $q_\pi(s,a) = E_\pi[G_t \mid S_t = s, A_t = a]$
Bellman Expectation Equation
As with the MRP, both value functions can be decomposed into two parts.
Taking action a in state s yields reward $R^a_s$.
- state-value function
$$
\begin{aligned}
v_\pi(s) &= E_\pi[G_t \mid S_t = s] \\
&= E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&= \sum_{a \in A} \pi(a \mid s) \Big( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_\pi(s') \Big)
\end{aligned}
$$
In the last step, the return is first computed for each action and then summed over actions, weighted by the policy.
- action-value function
$$
\begin{aligned}
q_\pi(s,a) &= E_\pi[G_t \mid S_t = s, A_t = a] \\
&= E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \\
&= R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} \sum_{a' \in A} \pi(a' \mid s') q_\pi(s', a')
\end{aligned}
$$
The two can be expressed in terms of each other:
$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s,a), \qquad q_\pi(s,a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_\pi(s')$$
In matrix form:
$$v_\pi = R^\pi + \gamma P^\pi v_\pi$$
Solving:
$$v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$$
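A minimal sketch of the closed-form solve, assuming $P^\pi$ and $R^\pi$ have already been reduced from the MDP and the policy as in the earlier sketch (the numbers here are invented):

```python
import numpy as np

# Assume P_pi and R_pi were obtained by weighting the MDP's per-action
# transitions and rewards with the policy; values invented for illustration.
P_pi = np.array([[0.45, 0.55, 0.00],
                 [0.03, 0.48, 0.49],
                 [0.02, 0.48, 0.50]])
R_pi = np.array([0.5, 0.2, -0.2])
gamma = 0.9

# Closed-form policy evaluation: v_pi = (I - gamma * P_pi)^{-1} R_pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Sanity check: v_pi satisfies the Bellman expectation equation.
assert np.allclose(v_pi, R_pi + gamma * P_pi @ v_pi)
print(v_pi)
```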
Optimal Value Function
The optimal value function is defined over policies: it is the maximum value function over all policies,
$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s,a) = \max_\pi q_\pi(s,a)$$
Solving an MDP means finding the optimal value function.
Optimal Policy
Theorem:
- For any MDP there exists an optimal policy $\pi_*$ that is better than or equal to all other policies.
- All optimal policies achieve the optimal value function: $v_{\pi_*}(s) = v_*(s)$
- All optimal policies achieve the optimal action-value function: $q_{\pi_*}(s,a) = q_*(s,a)$
Finding an optimal policy
An optimal policy can be found by choosing, in each state, the action that maximizes the action-value function $q_*(s,a)$.
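Concretely, one such optimal policy is the deterministic greedy policy with respect to $q_*$ (stated here for completeness):

$$
\pi_*(a \mid s) =
\begin{cases}
1 & \text{if } a = \underset{a \in A}{\arg\max}\; q_*(s,a) \\
0 & \text{otherwise}
\end{cases}
$$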
Bellman Optimality Equation
- state-value function
$$
\begin{aligned}
v_*(s) &= \max_a q_*(s,a) \\
&= \max_a \Big( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_*(s') \Big)
\end{aligned}
$$
- action-value function
$$
\begin{aligned}
q_*(s,a) &= R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_*(s') \\
&= R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} \max_{a'} q_*(s', a')
\end{aligned}
$$
The Bellman Optimality Equation is non-linear and in general has no closed-form solution.
Solution methods include Value Iteration, Policy Iteration, Q-learning, and Sarsa.
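A minimal value-iteration sketch in Python (the MDP arrays are invented for illustration): it repeatedly applies the Bellman optimality backup and then reads off a greedy policy.

```python
import numpy as np

# Hypothetical MDP (2 actions, 3 states), invented for illustration.
# P[a, s, s_next] = P^a_{ss'},  R[s, a] = R^a_s.
P = np.array([[[0.8, 0.2, 0.0],
               [0.0, 0.5, 0.5],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.3, 0.3, 0.4],
               [0.0, 0.6, 0.4]]])
R = np.array([[ 1.0, 0.0],
              [ 0.0, 2.0],
              [-1.0, 0.0]])
gamma = 0.9

# Value iteration: v(s) <- max_a [ R^a_s + gamma * sum_s' P^a_{ss'} v(s') ]
v = np.zeros(3)
for _ in range(1000):
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # q[s, a]
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

greedy_policy = q.argmax(axis=1)   # optimal action index for each state
print(v, greedy_policy)
```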
Extensions to MDPs
- Infinite and continuous MDPs
  - Countably infinite state and/or action spaces: straightforward
  - Continuous state and/or action spaces: closed form for the linear quadratic model (LQR)
  - Continuous time: requires partial differential equations; the Hamilton-Jacobi-Bellman (HJB) equation is the limiting case of the Bellman equation as the time-step → 0
- Partially observable MDPs (POMDPs)
- Undiscounted, average reward MDPs
…