Reinforcement Learning Exercise 4.5

Exercise 4.5 How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$. Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.

Here, we can use the result of exercise 3.17:
$$
Q_\pi(s,a) = \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a')\,\pi(s',a') \Bigr] P_{s,s'}^a
$$
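For reference, the same relation written in the book's four-argument notation $p(s',r\mid s,a)$ (this restatement is mine, not part of the exercise answer above) is:

$$
q_\pi(s,a) = \sum_{s',\,r} p(s',r \mid s,a)\Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a') \Bigr]
$$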
Then the algorithm analogous to that on page 80 can be written as follows:
$$
\begin{aligned}
&\text{1. Initialization} \\
&\qquad Q_\pi(s,a) \in \mathbb{R} \text{ and } \pi(s) \in \mathcal{A}(s) \text{ arbitrarily, for all } s \in \mathcal{S} \text{ and } a \in \mathcal{A}(s) \\
&\text{2. Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad \qquad \Delta \leftarrow 0 \\
&\qquad \qquad \text{Loop for each } (s,a) \text{ pair:} \\
&\qquad \qquad \qquad q \leftarrow Q_\pi(s,a) \\
&\qquad \qquad \qquad Q_\pi(s,a) \leftarrow \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a')\,\pi(s',a') \Bigr] P_{s,s'}^a \\
&\qquad \qquad \qquad \Delta \leftarrow \max\bigl(\Delta,\ |q - Q_\pi(s,a)|\bigr) \\
&\qquad \text{until } \Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3. Policy Improvement} \\
&\qquad policy\text{-}stable \leftarrow true \\
&\qquad \text{For each } s \in \mathcal{S}: \\
&\qquad \qquad old\text{-}action \leftarrow \pi(s) \\
&\qquad \qquad \pi(s) \leftarrow \operatorname*{argmax}_a Q_\pi(s,a) \\
&\qquad \qquad \text{If } old\text{-}action \neq \pi(s) \text{, then } policy\text{-}stable \leftarrow false \\
&\qquad \text{If } policy\text{-}stable \text{, then stop and return } Q_\pi \approx q_* \text{ and } \pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$
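As a rough illustration (not part of the exercise itself), here is a minimal Python sketch of this algorithm for a finite MDP. The representation is an assumption made for the example: `P[s, a, s']` holds transition probabilities, `R[s, a, s']` holds expected rewards, and the policy is kept deterministic, so the inner sum over $a'$ collapses to $Q(s', \pi(s'))$.

```python
import numpy as np

def policy_iteration_q(P, R, gamma=0.9, theta=1e-8):
    """Policy iteration over action values Q(s, a) for a finite MDP.

    P[s, a, s'] : transition probability, R[s, a, s'] : expected reward
    (array layout is an assumption for this sketch).
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))      # 1. Initialization: arbitrary Q
    pi = np.zeros(n_states, dtype=int)       #    and an arbitrary deterministic policy

    while True:
        # 2. Policy evaluation: sweep the Bellman equation for Q_pi until convergence.
        while True:
            delta = 0.0
            for s in range(n_states):
                for a in range(n_actions):
                    q_old = Q[s, a]
                    # Expected reward plus discounted value of following pi afterwards;
                    # Q[arange, pi] picks Q(s', pi(s')) for every successor state s'.
                    Q[s, a] = np.sum(P[s, a] * (R[s, a] + gamma * Q[np.arange(n_states), pi]))
                    delta = max(delta, abs(q_old - Q[s, a]))
            if delta < theta:
                break

        # 3. Policy improvement: greedify with respect to the current Q.
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            pi[s] = int(np.argmax(Q[s]))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return Q, pi                     # Q approximates q_*, pi approximates pi_*
```

As with the state-value version in the book, inconsistent tie-breaking in the argmax could in principle keep the stopping test from triggering; here ties are always broken the same way (`np.argmax` takes the lowest index), which avoids that issue.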
