RL (2): Markov Decision Processes


The cleaning robot is simplified to the following setup:

State space:
$$\{0,1,2,3,4,5\}$$

Action space:
$$\{-1,+1\}$$

Transition function:
$$\bar f(0,\pm 1)=0,\quad \bar f(1,+1)=2,\quad \bar f(1,-1)=0,\quad \bar f(2,+1)=3,\quad \bar f(2,-1)=1$$
$$\bar f(3,+1)=4,\quad \bar f(3,-1)=2,\quad \bar f(4,+1)=5,\quad \bar f(4,-1)=4,\quad \bar f(5,\pm 1)=5$$

Reward function:
$$\rho(0,\pm 1,0)=0,\quad \rho(1,+1,2)=0,\quad \rho(1,-1,0)=1,\quad \rho(2,+1,3)=0,\quad \rho(2,-1,1)=0$$
$$\rho(3,+1,4)=0,\quad \rho(3,-1,2)=0,\quad \rho(4,+1,5)=5,\quad \rho(4,-1,3)=0,\quad \rho(5,\pm 1,5)=0$$
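For concreteness, here is a minimal Python sketch of this deterministic MDP (the helper names `transition` and `reward` are my own, not from the text):

```python
# Deterministic cleaning-robot MDP: states 0..5, actions -1 (left) and +1 (right).
# States 0 and 5 are absorbing; reaching 0 from 1 pays 1, reaching 5 from 4 pays 5.
STATES = [0, 1, 2, 3, 4, 5]
ACTIONS = [-1, +1]

def transition(s, a):
    """Deterministic transition function f(s, a) -> s'."""
    if s in (0, 5):          # absorbing terminal states
        return s
    return s + a

def reward(s, a, s_next):
    """Reward rho(s, a, s')."""
    if s == 1 and s_next == 0:
        return 1
    if s == 4 and s_next == 5:
        return 5
    return 0

# One step from state 4 moving right:
s, a = 4, +1
s_next = transition(s, a)
print(s_next, reward(s, a, s_next))   # -> 5 5
```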

1. Definitions

Definition 1.1:

Argument of the maximum: given a function $f(x)$ with maximum value $M$, the set of $x$ values at which $f(x)$ attains $M$ is written
$$\argmax_{x} f(x)$$
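For instance, if $f(x)=-(x-2)^2$ over the reals, the maximum is $M=0$ and $\argmax_x f(x)=\{2\}$.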

Definition 1.2:

Expected value of a random variable: suppose a random variable $X$ takes the value $x_1$ with probability $p_1$, $x_2$ with probability $p_2$, and so on up to $x_k$ with probability $p_k$; then the expectation of $X$ is
$$\mathbb E[X]=x_1p_1+x_2p_2+\dots+x_kp_k$$
$\mathbb E$ is called the expectation operator and has the following properties:
$$\mathbb E[X+Y]=\mathbb E[X]+\mathbb E[Y],\qquad \mathbb E[X+c]=\mathbb E[X]+c,\qquad \mathbb E[cX]=c\,\mathbb E[X]$$
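For example, for a fair six-sided die, $\mathbb E[X]=\tfrac16(1+2+\dots+6)=3.5$.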

Definition 1.3:

$L^\infty$-norm: the $L^\infty$-norm of a vector $\mathbf x=[x_1,x_2,\dots,x_n]^T$, written $\|\mathbf x\|_\infty$, is the largest absolute value among the elements of $\mathbf x$:
$$\|\mathbf x\|_\infty \overset{\Delta}{=} \max_i |x_i|$$
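For example, $\|[3,-7,2]^T\|_\infty=7$.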

2. Elements of a Markov Decision Process

2.1 States, Actions, Transitions, and Rewards

States:

$$S=\{s_1,s_2,\dots,s_{|S|}\}$$

Actions:

$$A(s)=\{a_1,a_2,\dots,a_{|A|}\}$$

Transition function:

Deterministic transition function:
$$\bar f : S\times A\rightarrow S,\qquad \bar f(s,a)=s'$$

Stochastic transition function:
$$f: S\times A\times S\rightarrow [0,1],\qquad f(s,a,s')=\mathbb P(S_{t+1}=s' \mid S_t=s,A_t=a)=p^a_{ss'}$$
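As a small illustration of the stochastic case, the sketch below turns the robot's moves into a conditional distribution over next states (the 0.8/0.2 probabilities are made up for the example):

```python
import random

# Hypothetical stochastic transition f(s, a, s'): the intended move succeeds
# with probability 0.8, otherwise the robot stays where it is.
def next_state_dist(s, a):
    """Return {s': f(s, a, s')}."""
    if s in (0, 5):                 # absorbing terminal states
        return {s: 1.0}
    return {s + a: 0.8, s: 0.2}

def sample_next_state(s, a):
    """Draw S_{t+1} ~ f(s, a, .)."""
    dist = next_state_dist(s, a)
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(next_state_dist(2, +1))   # {3: 0.8, 2: 0.2}
```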

Reward function:

Deterministic reward function:
$$\bar\rho: S\times A\times S\rightarrow \mathbb R,\qquad R_t=\bar\rho(S_{t-1},A_{t-1},S_t),\quad \bar\rho(s,a,s')=r$$

State-dependent reward distribution $\rho(r\mid s)$:
$$\rho: R\times S\rightarrow [0,1]$$
$$r(s)=\mathbb E[R_t\mid S_t=s]=r_1\rho(r_1\mid s)+\dots+r_m\rho(r_m\mid s)=\sum_r r\,\rho(r\mid s)$$

State-and-action-dependent reward distribution $\rho(r\mid s,a)$:
$$\rho: R\times S\times A\rightarrow [0,1]$$
$$\begin{aligned} r(s,a)&=\mathbb E[R_t\mid S_{t-1}=s,A_{t-1}=a]\\ &=r_1\rho(r_1\mid s,a)+\dots+r_m\rho(r_m\mid s,a)\\ &=\sum_r r\,\rho(r\mid s,a) \end{aligned}$$
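In the stochastic setting, $r(s,a)$ is just a probability-weighted sum over the reward distribution; a minimal sketch (the table `reward_dist` and its numbers are illustrative, not from the text):

```python
# Hypothetical reward distribution rho(r | s, a): for each (s, a),
# a list of (reward, probability) pairs summing to 1.
reward_dist = {
    (1, -1): [(1.0, 0.9), (0.0, 0.1)],   # moving left from state 1 usually pays 1
    (4, +1): [(5.0, 0.8), (0.0, 0.2)],   # moving right from state 4 usually pays 5
}

def expected_reward(s, a):
    """r(s, a) = sum_r r * rho(r | s, a)."""
    return sum(r * p for r, p in reward_dist.get((s, a), [(0.0, 1.0)]))

print(expected_reward(1, -1))   # 0.9
print(expected_reward(4, +1))   # 4.0
```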

2.2 The Four-Argument $p$ Function

$$p(s',r\mid s,a)=f(s,a,s')\cdot\rho(r\mid s'),\qquad p: S\times R\times S\times A\rightarrow [0,1]$$

From this function we can recover:

The stochastic transition function

$$f(s,a,s')=\mathbb P\{S_t=s'\mid S_{t-1}=s,A_{t-1}=a\}=\sum_{r\in R}p(s',r\mid s,a)$$

The state-dependent reward distribution

Since $p(s',r\mid s,a)=f(s,a,s')\cdot\rho(r\mid s')$ and $f(s,a,s')=\sum_{r'\in R}p(s',r'\mid s,a)$, dividing gives
$$\rho(r\mid s')=\frac{p(s',r\mid s,a)}{\sum_{r'\in R}p(s',r'\mid s,a)}$$

The state-and-action-dependent reward distribution

$$\rho(r\mid s,a)=\sum_{s'\in S}p(s',r\mid s,a)$$

The expected reward given a state

$$r(s)=\mathbb E[R_t\mid S_t=s]=\sum_r r\,\rho(r\mid s)$$

The expected reward given a state and an action

$$r(s,a)=\mathbb E[R_t\mid S_{t-1}=s,A_{t-1}=a]=\sum_{r\in R}\sum_{s'\in S}r\,p(s',r\mid s,a)$$

The expected reward given a state, an action, and the next state

$$\begin{aligned} r(s,a,s')&=\mathbb E[R_t\mid S_{t-1}=s,A_{t-1}=a,S_t=s']\\ &=\sum_r r\,\rho(r\mid s')\\ &=\sum_r r\,\frac{p(s',r\mid s,a)}{\sum_{r'\in R}p(s',r'\mid s,a)} \end{aligned}$$
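To make the bookkeeping concrete, the sketch below stores a made-up joint table $p(s',r\mid s,a)$ and recovers $f(s,a,s')$, $\rho(r\mid s,a)$, and $r(s,a)$ by exactly the sums above; the dictionary layout and function names are my own:

```python
# Hypothetical joint table p(s', r | s, a): keys are (s, a); values map (s', r) -> probability.
p = {
    (4, +1): {(5, 5.0): 0.8, (5, 0.0): 0.1, (4, 0.0): 0.1},
}

def f(s, a, s_next):
    """f(s, a, s') = sum_r p(s', r | s, a)."""
    return sum(prob for (sp, _), prob in p[(s, a)].items() if sp == s_next)

def rho_sa(r, s, a):
    """rho(r | s, a) = sum_{s'} p(s', r | s, a)."""
    return sum(prob for (_, rw), prob in p[(s, a)].items() if rw == r)

def r_sa(s, a):
    """r(s, a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(rw * prob for (_, rw), prob in p[(s, a)].items())

print(f(4, +1, 5))         # 0.9
print(rho_sa(5.0, 4, +1))  # 0.8
print(r_sa(4, +1))         # 4.0
```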

2.3 Return

The discounted sum of rewards:
$$G_t\overset{\Delta}{=}R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots=\sum_{k=0}^\infty\gamma^k R_{t+k+1}$$
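For a finite episode the infinite sum is simply truncated; a one-function sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Rewards collected after time t in the robot example: reach state 5 on the second step.
print(discounted_return([0, 5], gamma=0.9))   # 0 + 0.9*5 = 4.5
```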

2.4 Policy

$$\pi: S\times A\rightarrow[0,1],\qquad \pi(a\mid s)=\mathbb P\{A_t=a\mid S_t=s\}$$
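A stochastic policy is just a conditional distribution over actions, so acting with it means sampling; a minimal sketch (the 0.8/0.2 policy is made up):

```python
import random

# Hypothetical policy pi(a | s): move right with probability 0.8 in every non-terminal state.
policy = {s: {+1: 0.8, -1: 0.2} for s in (1, 2, 3, 4)}

def sample_action(s):
    """Draw A_t ~ pi(. | S_t = s)."""
    actions, probs = zip(*policy[s].items())
    return random.choices(actions, weights=probs)[0]

print(sample_action(2))   # +1 most of the time
```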

3. Value Functions and the Bellman Equation

State-value function:

$$\begin{aligned} v_\pi(s)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s]\\ &=\mathbb E_\pi\Big[\sum_{k=0}^\infty\gamma^k R_{t+k+1}\,\Big|\,S_t=s\Big],\quad\text{for all } s\in S \end{aligned}$$

Action-value function:

$$\begin{aligned} q_\pi(s,a)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s,A_t=a]\\ &=\mathbb E_\pi\Big[\sum_{k=0}^\infty\gamma^k R_{t+k+1}\,\Big|\,S_t=s,A_t=a\Big] \end{aligned}$$

Bellman equation:

$$\begin{aligned} v_\pi(s)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s]\\ &=\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots\mid S_t=s]\\ &=\mathbb E_\pi[R_{t+1}\mid S_t=s]+\gamma\,\mathbb E_\pi[R_{t+2}+\gamma R_{t+3}+\dots\mid S_t=s]\\ &=\mathbb E_\pi[R_{t+1}\mid S_t=s]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s] \end{aligned}$$
$$\mathbb E_\pi[R_{t+1}\mid S_t=s]=\sum_a\pi(a\mid s)\sum_{s'}\sum_r r\cdot p(s',r\mid s,a)$$
$$\begin{aligned} \mathbb E_\pi[G_{t+1}\mid S_t=s]&=\sum_a\pi(a\mid s)\sum_{s'}\sum_r p(s',r\mid s,a)\,\mathbb E_\pi[G_{t+1}\mid S_{t+1}=s']\\ &=\sum_a\pi(a\mid s)\sum_{s'}\sum_r p(s',r\mid s,a)\,v_\pi(s') \end{aligned}$$

$$\begin{aligned} v_\pi(s)&=\mathbb E_\pi[R_{t+1}\mid S_t=s]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s]\\ &=\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_\pi(s')\big] \end{aligned}$$
$$\begin{aligned} q_\pi(s,a)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s,A_t=a]\\ &=\mathbb E_\pi[R_{t+1}\mid S_t=s,A_t=a]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s,A_t=a] \end{aligned}$$
$$\mathbb E_\pi[R_{t+1}\mid S_t=s,A_t=a]=\sum_{s'}\sum_r r\cdot p(s',r\mid s,a)$$
$$\begin{aligned} \mathbb E_\pi[G_{t+1}\mid S_t=s,A_t=a]&=\sum_{s'}\sum_r p(s',r\mid s,a)\sum_{a'}\pi(a'\mid s')\,\mathbb E_\pi[G_{t+1}\mid S_{t+1}=s',A_{t+1}=a']\\ &=\sum_{s'}\sum_r p(s',r\mid s,a)\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a') \end{aligned}$$

The Bellman equation for $q_\pi$:
$$q_\pi(s,a)=\sum_{s',r}p(s',r\mid s,a)\Big[r+\gamma\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')\Big]$$

The relation between $v_\pi(s)$ and $q_\pi(s,a)$:
$$v_\pi(s)=\sum_a\pi(a\mid s)\,q_\pi(s,a)$$

If the policy $\pi$ is deterministic, i.e. it puts probability $1$ on a single action $a=\pi(s)$ in each state, this reduces to
$$v_\pi(s)=q_\pi(s,\pi(s))$$

Substituting $v_\pi(s')=\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')$ into the Bellman equation for $q_\pi$ gives
$$q_\pi(s,a)=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_\pi(s')\big]$$
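The Bellman equation suggests a fixed-point iteration: sweep over the states and replace $v_\pi(s)$ by its right-hand side until the values stop changing (iterative policy evaluation). Below is a sketch for the deterministic robot under a uniform random policy; the `step` and `policy_evaluation` helpers are my own names, and terminal states are fixed at value 0:

```python
# Iterative policy evaluation for the deterministic cleaning-robot MDP
# under the uniform random policy pi(a|s) = 0.5 for a in {-1, +1}.
GAMMA = 0.9
STATES = [0, 1, 2, 3, 4, 5]
TERMINAL = {0, 5}

def step(s, a):
    """Return (s', r): reward 1 for the move 1 -> 0, reward 5 for 4 -> 5."""
    if s in TERMINAL:
        return s, 0.0
    s_next = s + a
    r = 1.0 if (s, s_next) == (1, 0) else 5.0 if (s, s_next) == (4, 5) else 0.0
    return s_next, r

def policy_evaluation(tol=1e-8):
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            # v(s) = sum_a pi(a|s) * [ r + gamma * v(s') ]  (transitions are deterministic)
            new_v = sum(0.5 * (r + GAMMA * v[s_next])
                        for s_next, r in (step(s, a) for a in (-1, +1)))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

print(policy_evaluation())
```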

4. Optimal Policies and Optimal Value Functions

Optimal state-value and action-value functions:

$$v_*(s)\overset{\Delta}{=}\max_\pi v_\pi(s),\qquad q_*(s,a)\overset{\Delta}{=}\max_\pi q_\pi(s,a)$$

Bellman optimality equations:

$$\begin{aligned} v_*(s)&=\max_{a\in A(s)} q_*(s,a)\\ &=\max_a\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_*(s')\big]\\ q_*(s,a)&=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_*(s')\big]\\ &=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma\max_{a'}q_*(s',a')\big] \end{aligned}$$
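The Bellman optimality equation gives the same kind of fixed-point iteration, now with a max over actions (value iteration); the greedy policy can then be read off with a one-step lookahead. A sketch for the deterministic robot, repeating the `step` helper so the block stands alone (names are again my own):

```python
# Value iteration for the deterministic cleaning-robot MDP.
GAMMA = 0.9
STATES = [0, 1, 2, 3, 4, 5]
TERMINAL = {0, 5}

def step(s, a):
    """Return (s', r): reward 1 for the move 1 -> 0, reward 5 for 4 -> 5."""
    if s in TERMINAL:
        return s, 0.0
    s_next = s + a
    r = 1.0 if (s, s_next) == (1, 0) else 5.0 if (s, s_next) == (4, 5) else 0.0
    return s_next, r

def value_iteration(tol=1e-8):
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            # v*(s) = max_a [ r + gamma * v*(s') ]  (transitions are deterministic)
            new_v = max(r + GAMMA * v[s_next]
                        for s_next, r in (step(s, a) for a in (-1, +1)))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # Greedy (optimal) policy from a one-step lookahead on v*.
    greedy = {s: max((-1, +1), key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
              for s in STATES if s not in TERMINAL}
    return v, greedy

print(value_iteration())   # with gamma = 0.9: go right everywhere, v*(4) = 5, v*(3) = 4.5, ...
```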
