Reinforcement Learning: An Explanation of Formula (5.2)

The book does not explain formula (5.2) clearly, and the second and third lines of the formula on page 101 confused me. Here I work through them step by step.
First,
$$
q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s,a)
$$

Because $\pi'$ is the $\epsilon$-greedy policy with respect to $q_\pi$, with $A^* = \arg\max_a q_\pi(s,a)$, we have

$$
\pi'(a \mid s) =
\begin{cases}
1 - \epsilon + \epsilon / |\mathcal A(s)| & \text{if } a = A^* \\
\epsilon / |\mathcal A(s)| & \text{if } a \neq A^*
\end{cases}
$$

Therefore

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_{a \neq A^*} \frac{\epsilon}{|\mathcal A(s)|}\, q_\pi(s,a) + \Bigl(1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|}\Bigr) q_\pi(s, A^*) \\
&= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a \neq A^*} q_\pi(s,a) + \frac{\epsilon}{|\mathcal A(s)|}\, q_\pi(s, A^*) + (1-\epsilon)\, q_\pi(s, A^*) \\
&= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a)
\end{aligned}
$$

This is the second line of formula (5.2).
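The identity in the second line can be checked numerically. The snippet below is a minimal sketch, not anything from the book: the action values in `q_values`, the value of `epsilon`, and the helper names are all made up for illustration. It builds the $\epsilon$-greedy probabilities for one state and compares both sides of the equation.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi'(a|s) for the epsilon-greedy policy over one state's action values."""
    n = len(q_values)
    # A*: the greedy action (ties broken by lowest index, an arbitrary choice).
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n        # every action gets epsilon / |A(s)|
    probs[greedy] += 1.0 - epsilon   # the greedy action gets the extra 1 - epsilon
    return probs


def second_line_rhs(q_values, epsilon):
    """epsilon/|A(s)| * sum_a q(s,a) + (1 - epsilon) * max_a q(s,a)."""
    n = len(q_values)
    return epsilon / n * sum(q_values) + (1.0 - epsilon) * max(q_values)


# Hypothetical q_pi(s, a) for a state with four actions.
q_values = [0.5, -1.0, 2.0, 1.5]
epsilon = 0.1

probs = epsilon_greedy_probs(q_values, epsilon)
lhs = sum(p * q for p, q in zip(probs, q_values))  # sum_a pi'(a|s) q(s,a)
rhs = second_line_rhs(q_values, epsilon)
print(lhs, rhs)  # the two values should agree up to floating-point error
```

Both numbers come from the same action values, so any difference beyond rounding would point to a mistake in the algebra.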
Next, consider the value $x$ defined by

$$
x = \sum_a \Bigl[ \pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] q_\pi(s,a)
$$
Since $\pi$ is an $\epsilon$-soft policy, $\pi(a \mid s) \geq \epsilon / |\mathcal A(s)|$ for every action $a$, so every coefficient in this sum is nonnegative. The coefficients add up to $1 - \epsilon$:

$$
\sum_a \Bigl[ \pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] = 1 - |\mathcal A(s)| \cdot \frac{\epsilon}{|\mathcal A(s)|} = 1 - \epsilon
$$

So $x$ is $(1-\epsilon)$ times a weighted average of the values $q_\pi(s,a)$, and a weighted average can never exceed the maximum:

$$
x \leq (1-\epsilon) \max_a q_\pi(s,a)
$$

Equality holds in the special case where $\pi$ is itself $\epsilon$-greedy with respect to $q_\pi$. Then $\pi(a \mid s) = \epsilon / |\mathcal A(s)|$ whenever $a \neq A^*$, only the $A^*$ term survives, and

$$
\begin{aligned}
x &= \Bigl[ \pi(A^* \mid s) - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] q_\pi(s, A^*) \\
&= \Bigl[ 1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|} - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] q_\pi(s, A^*) \\
&= (1 - \epsilon)\, q_\pi(s, A^*) \\
&= (1-\epsilon) \max_a q_\pi(s,a)
\end{aligned}
$$
Also, pulling the factor $(1-\epsilon)$ out of the sum,

$$
x = (1-\epsilon) \sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|}}{1 - \epsilon}\, q_\pi(s,a)
$$
Therefore

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) \\
&\geq \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|}}{1 - \epsilon}\, q_\pi(s,a)
\end{aligned}
$$
This is the third line of formula (5.2). With each step written out, the formula should now be easier to follow.
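The step from the second line to the third can also be checked on random inputs. The following is a rough sketch under my own assumptions: the function names, the way an $\epsilon$-soft policy is sampled, and the range of the action values are all invented for the test. It confirms that the normalized weights are nonnegative and sum to 1, that the second line is never smaller than the third, and that the third line simplifies algebraically to $\sum_a \pi(a \mid s)\, q_\pi(s,a)$.

```python
import random


def random_epsilon_soft_policy(n_actions, epsilon, rng):
    """Sample pi(a|s) with pi(a|s) >= epsilon / n_actions for every action."""
    raw = [rng.random() for _ in range(n_actions)]
    total = sum(raw)
    # Mix a random distribution with the uniform one so that every action
    # keeps at least epsilon / n_actions probability.
    return [epsilon / n_actions + (1.0 - epsilon) * r / total for r in raw]


def check_third_line(q_values, pi, epsilon):
    """Return (second line, third line, sum_a pi(a|s) q(s,a), weights)."""
    n = len(q_values)
    uniform_part = epsilon / n * sum(q_values)
    second_line = uniform_part + (1.0 - epsilon) * max(q_values)
    # Normalized weights (pi(a|s) - epsilon/|A(s)|) / (1 - epsilon).
    weights = [(p - epsilon / n) / (1.0 - epsilon) for p in pi]
    weighted_avg = sum(w * q for w, q in zip(weights, q_values))
    third_line = uniform_part + (1.0 - epsilon) * weighted_avg
    v_pi = sum(p * q for p, q in zip(pi, q_values))
    return second_line, third_line, v_pi, weights


rng = random.Random(0)  # fixed seed so the run is repeatable
epsilon = 0.2
for _ in range(1000):
    q_values = [rng.uniform(-5.0, 5.0) for _ in range(4)]  # hypothetical q_pi(s, a)
    pi = random_epsilon_soft_policy(4, epsilon, rng)
    second, third, v_pi, weights = check_third_line(q_values, pi, epsilon)
    assert min(weights) >= -1e-12           # weights are nonnegative
    assert abs(sum(weights) - 1.0) < 1e-9   # weights sum to 1
    assert second >= third - 1e-9           # second line >= third line
    assert abs(third - v_pi) < 1e-9         # third line equals sum_a pi(a|s) q(s,a)
print("inequality held on all samples")
```

Negative action values are included on purpose: the inequality rests on the weights forming an average, not on the sign of $q_\pi$.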
