Reinforcement Learning Exercise 4.1

This post discusses Example 4.1, a 4 × 4 gridworld with fourteen nonterminal states and four deterministic actions, and examines its expected reward function and the value function under the equiprobable random policy. It then calculates the action values for states 11 and 7 when taking the 'down' action (Exercise 4.1).

Example 4.1 Consider the $4 \times 4$ gridworld shown below.
[Figure: the $4 \times 4$ gridworld of Example 4.1]
The nonterminal states are $\mathcal S = \{1, 2, \ldots, 14\}$. There are four actions possible in each state, $\mathcal A = \{\mathit{up}, \mathit{down}, \mathit{right}, \mathit{left}\}$, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. Thus, for instance, $p(6, -1 \mid 5, \mathit{right}) = 1$, $p(7, -1 \mid 7, \mathit{right}) = 1$, and $p(10, r \mid 5, \mathit{right}) = 0$ for all $r \in \mathcal R$. This is an undiscounted, episodic task. The reward is $-1$ on all transitions until the terminal state is reached. The terminal state is shaded in the figure (although it is shown in two places, it is formally one state). The expected reward function is thus $r(s, a, s') = -1$ for all states $s$, $s'$ and actions $a$. Suppose the agent follows the equiprobable random policy (all actions equally likely). The left side of Figure 4.1 shows the sequence of value functions $\{v_k\}$ computed by iterative policy evaluation. The final estimate is in fact $v_\pi$, which in this case gives for each state the negation of the expected number of steps from that state until termination.
[Figure 4.1: the sequence of value functions $\{v_k\}$ computed by iterative policy evaluation, converging to $v_\pi$]
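
Before turning to the exercise, it may help to see how $v_\pi$ arises. Below is a minimal sketch of iterative policy evaluation for this gridworld, assuming a row-major cell numbering $0$–$15$ in which cells 0 and 15 both stand for the single terminal state (so the book's states 1–14 map to cells 1–14); helper names such as `step` and `evaluate_random_policy` are illustrative, not from the book.

```python
# A minimal sketch of iterative policy evaluation for this gridworld, assuming
# a row-major cell numbering 0..15 in which cells 0 and 15 both stand for the
# single terminal state (so the book's states 1..14 map to cells 1..14).
# Helper names such as `step` and `evaluate_random_policy` are illustrative.
import numpy as np

N = 4                                            # grid side length
TERMINALS = {0, 15}                              # two shaded corners, formally one state
ACTIONS = {"up": (-1, 0), "down": (1, 0),
           "left": (0, -1), "right": (0, 1)}     # deterministic moves

def step(s, a):
    """One transition; actions that would leave the grid keep the state unchanged."""
    if s in TERMINALS:
        return s, 0.0                            # the episode has already ended
    row, col = divmod(s, N)
    nr, nc = row + a[0], col + a[1]
    if 0 <= nr < N and 0 <= nc < N:
        s = nr * N + nc
    return s, -1.0                               # reward is -1 on every transition

def evaluate_random_policy(theta=1e-6, gamma=1.0):
    """Iterative policy evaluation under the equiprobable random policy."""
    v = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINALS:
                continue                         # v(terminal) stays 0
            new_v = sum(0.25 * (r + gamma * v[s2])        # pi(a|s) = 1/4
                        for s2, r in (step(s, a) for a in ACTIONS.values()))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v                         # in-place sweep
        if delta < theta:
            return v

if __name__ == "__main__":
    v_pi = evaluate_random_policy()
    print(np.round(v_pi.reshape(N, N)))          # reproduces the v_pi grid of Figure 4.1
```

Even with $\gamma = 1$ the sweep converges here, because the equiprobable random policy reaches the terminal state with probability 1 from every state.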

Exercise 4.1 In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_\pi(11, \mathit{down})$? What is $q_\pi(7, \mathit{down})$?
Here we can use the result of Exercise 3.13, which expresses $q_\pi$ in terms of $v_\pi$:
$$q_\pi(s,a) = \sum_{s'} \Bigl\{ P_{s,s'}^{a} \cdot \bigl[ R_{s,s'}^{a} + \gamma \cdot v_\pi(s') \bigr] \Bigr\}$$
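
Because the transitions in this gridworld are deterministic, the sum over $s'$ collapses to a single term. The short continuation below (reusing `step`, `ACTIONS`, and `evaluate_random_policy` from the sketch above, together with the same row-major numbering) applies the formula to the two requested action values; `q_value` is an illustrative helper name.

```python
# Continuation of the sketch above: reuses step, ACTIONS and
# evaluate_random_policy, plus the same row-major numbering 0..15.
def q_value(v, state, action, gamma=1.0):
    """q_pi(s, a): the transition is deterministic, so the sum has one term."""
    s2, r = step(state, ACTIONS[action])
    return r + gamma * v[s2]

v_pi = evaluate_random_policy()
print(q_value(v_pi, 11, "down"))   # -1:  one step straight into the terminal state
print(q_value(v_pi, 7, "down"))    # -15: -1 + v_pi(11) = -1 + (-14)
```

Moving down from state 11 steps directly into the terminal state, so only the $-1$ step cost is incurred; moving down from state 7 lands in state 11, whose converged value is $-14$, giving $q_\pi(7, \mathit{down}) = -1 + (-14) = -15$.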
