Reinforcement Learning Exercise 4.6

Exercise 4.6 Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that the probability of selecting each action in each state, $s$, is at least $\epsilon/|\mathcal A(s)|$. Describe qualitatively the changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration algorithm for $v_*$ on page 80.

The algorithm on page 80 in Section 4.2 assumes a deterministic policy. For the $\epsilon$-soft (stochastic) case, let $\pi(s)$ still denote the action that receives the extra probability mass in state $s$ (the greedy action), and let $\pi(a \mid s)$ denote the $\epsilon$-soft distribution built around it. The algorithm can then be modified like this:

$$
\begin{aligned}
&\text{1 Initialization} \\
&\qquad V(s) \in \mathbb R \text{ and } \pi(s) \in \mathcal A(s) \text{ arbitrarily, for all } s \in \mathcal S \\
&\text{2 Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad \qquad \Delta \leftarrow 0 \\
&\qquad \qquad \text{Loop for each } s \in \mathcal S: \\
&\qquad \qquad \qquad v \leftarrow V(s) \\
&\qquad \qquad \qquad V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \Bigl[ r + \gamma V(s') \Bigr] \\
&\qquad \qquad \qquad \Delta \leftarrow \max(\Delta, |v - V(s)|) \\
&\qquad \qquad \text{until } \Delta \lt \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3 Policy Improvement} \\
&\qquad policy\text{-}stable \leftarrow true \\
&\qquad \text{For each } s \in \mathcal S: \\
&\qquad \qquad old\text{-}action \leftarrow \pi(s) \\
&\qquad \qquad \pi(s) \leftarrow \text{argmax}_a \sum_{s',r} p(s',r \mid s,a) \Bigl[ r + \gamma V(s') \Bigr] \\
&\qquad \qquad \text{If } old\text{-}action \neq \pi(s) \text{, then } policy\text{-}stable \leftarrow false \\
&\qquad \text{If } policy\text{-}stable \text{, then stop and return } V \approx v_* \text{ and } \pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$
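As a concrete illustration, here is a minimal Python sketch of the modified algorithm above. The transition-model interface `P[s][a]` (a list of `(probability, next_state, reward)` triples), the function name, and the default parameter values are all illustrative assumptions, not something specified by the book.

```python
import numpy as np

def epsilon_soft_policy_iteration(P, n_states, n_actions,
                                  epsilon=0.1, gamma=0.9, theta=1e-8):
    """Sketch of policy iteration restricted to epsilon-soft policies.

    P[s][a] is assumed to be a list of (prob, next_state, reward) triples
    describing p(s', r | s, a); this interface is an assumption made for
    the sketch, not part of the book's pseudocode.
    """
    V = np.zeros(n_states)
    greedy = np.zeros(n_states, dtype=int)  # pi(s): action holding the extra mass

    def pi(a, s):
        # epsilon-soft probabilities built around the current greedy action
        base = epsilon / n_actions
        return base + (1.0 - epsilon) if a == greedy[s] else base

    def lookahead(s, a):
        # sum over s', r of p(s', r | s, a) * [r + gamma * V(s')]
        return sum(p * (r + gamma * V[sp]) for p, sp, r in P[s][a])

    while True:
        # 2. Policy evaluation of the epsilon-soft policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                V[s] = sum(pi(a, s) * lookahead(s, a) for a in range(n_actions))
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break

        # 3. Policy improvement: move the extra probability mass to the
        #    action that maximizes the one-step lookahead
        policy_stable = True
        for s in range(n_states):
            old_action = greedy[s]
            greedy[s] = int(np.argmax([lookahead(s, a) for a in range(n_actions)]))
            if greedy[s] != old_action:
                policy_stable = False

        if policy_stable:
            return V, greedy  # V ~ v_*; greedy defines the epsilon-soft policy
```

Note that every action always keeps probability at least `epsilon / n_actions`; the improvement step only moves the remaining `1 - epsilon` mass to the currently best-looking action.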

Because only $\epsilon$-soft policies are allowed, each of the other $|\mathcal A(s)| - 1$ actions must be selected with probability at least $\epsilon/|\mathcal A(s)|$, so the total probability reserved for actions other than the greedy action is at least $\frac{\epsilon}{|\mathcal A(s)|}\cdot (|\mathcal A(s)| - 1)$. Giving the greedy action $a = \pi(s)$ all of the remaining mass yields

$$
\begin{aligned}
\pi(a \mid s) &= 1 - \frac{\epsilon}{|\mathcal A(s)|}\cdot \bigl(|\mathcal A(s)| - 1\bigr) \\
&= 1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|}
\end{aligned}
$$

Substituting this $\pi(a \mid s)$ for the greedy action, and $\epsilon/|\mathcal A(s)|$ for each of the other actions, into the evaluation step above gives the final result.
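As a quick numeric check (toy values chosen for illustration, not from the book): with $\epsilon = 0.1$ and $|\mathcal A(s)| = 4$, the greedy action receives probability $1 - 0.1 + 0.1/4 = 0.925$ and each of the other three actions receives $0.025$, so the distribution sums to 1:

```python
epsilon, n_actions = 0.1, 4  # toy values chosen for illustration

# epsilon-soft probabilities around the greedy action
greedy_prob = 1 - epsilon + epsilon / n_actions  # 0.925
other_prob = epsilon / n_actions                 # 0.025 for each other action

assert abs(greedy_prob + (n_actions - 1) * other_prob - 1.0) < 1e-12
print(greedy_prob, other_prob)
```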
