Reinforcement Learning Exercise 6.6

Exercise 6.6 In Example 6.2 we stated that the true values for the random walk example are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$, for states A through E. Describe at least two different ways that these could have been computed. Which would you guess we actually used? Why?

Example 6.2 Random walk
In this example we empirically compare the prediction abilities of TD(0) and constant-$\alpha$ MC when applied to the following Markov reward process:
[Figure: the five-state random walk Markov reward process, states A through E between two terminal states.]
A Markov reward process, or MRP, is a Markov decision process without actions. We will often use MRPs when focusing on the prediction problem, in which there is no need to distinguish the dynamics due to the environment from those due to the agent. In this MRP, all episodes start in the center state, C, then proceed either left or right by one state on each step, with equal probability. Episodes terminate either on the extreme left or the extreme right. When an episode terminates on the right, a reward of +1 occurs; all other rewards are zero. For example, a typical episode might consist of the following state-and-reward sequence: C, 0, B, 0, C, 0, D, 0, E, 1. Because this task is undiscounted, the true value of each state is the probability of terminating on the right if starting from that state. Thus, the true value of the center state is $v_\pi(C) = 0.5$. The true values of all the states, A through E, are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$.
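To make the MRP's dynamics concrete, here is a minimal Python sketch (my own illustration, not from the book) that samples one episode; the integer state encoding and the `sample_episode` helper are assumptions of this sketch.

```python
import random

# States A..E are indexed 1..5; 0 and 6 are the left and right terminal states.
STATE_NAMES = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E"}

def sample_episode(start=3, seed=None):
    """Sample one episode of the random-walk MRP and return the
    state-and-reward sequence, e.g. ['C', 0, 'B', 0, ..., 'E', 1]."""
    rng = random.Random(seed)
    s = start
    trajectory = [STATE_NAMES[s]]
    while True:
        s += rng.choice([-1, +1])        # step left or right with equal probability
        if s == 6:                       # terminated on the right: reward +1
            trajectory.append(1)
            return trajectory
        if s == 0:                       # terminated on the left: reward 0
            trajectory.append(0)
            return trajectory
        trajectory.append(0)             # every non-terminal transition has reward 0
        trajectory.append(STATE_NAMES[s])

if __name__ == "__main__":
    # Prints something like ['C', 0, 'B', 0, 'C', 0, 'D', 0, 'E', 1]
    print(sample_episode())
```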

In this case, $\pi(a \mid s) = 0.5$ for every state, because the left and right actions are taken with equal probability. Since each action deterministically moves to the corresponding neighboring state, $p(s', r \mid s, a) = 1$, and $\gamma = 1$ because the task is undiscounted.
Method 1:
Apply the Bellman equation for $v_\pi(s)$ (Equation 3.14) directly. Under the conditions above, it simplifies to:
$$v_\pi(s) = 0.5 \sum_{s',r}\bigl[ r + v_\pi(s') \bigr]$$
So, for state A:
$$v_\pi(A) = 0.5\bigl[ 0 + v_\pi(\text{terminal}) + 0 + v_\pi(B) \bigr] = 0.5\,v_\pi(B) \qquad \text{(1)}$$
For state B:
$$v_\pi(B) = 0.5\bigl[ 0 + v_\pi(A) + 0 + v_\pi(C) \bigr] = 0.5\,v_\pi(A) + 0.5\,v_\pi(C) \qquad \text{(2)}$$
And so on, we have:
$$
\begin{aligned}
v_\pi(C) &= 0.5\,v_\pi(B) + 0.5\,v_\pi(D) \qquad \text{(3)}\\
v_\pi(D) &= 0.5\,v_\pi(C) + 0.5\,v_\pi(E) \qquad \text{(4)}\\
v_\pi(E) &= 0.5\,v_\pi(D) + 0.5 \qquad \qquad \;\;\text{(5)}
\end{aligned}
$$
Solving equations (1) through (5), we obtain the state values for A through E: $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$.
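As a sanity check on Method 1, equations (1) through (5) can also be written as a linear system and solved numerically. The NumPy sketch below is only an illustration of that step, not the book's computation; the matrix `M` and vector `b` simply restate the five equations above.

```python
import numpy as np

# Unknowns: v = [v(A), v(B), v(C), v(D), v(E)].
# Rearranging equations (1)-(5) into the form M v = b:
M = np.array([
    [ 1.0, -0.5,  0.0,  0.0,  0.0],   # v(A) - 0.5 v(B)            = 0
    [-0.5,  1.0, -0.5,  0.0,  0.0],   # v(B) - 0.5 v(A) - 0.5 v(C) = 0
    [ 0.0, -0.5,  1.0, -0.5,  0.0],   # v(C) - 0.5 v(B) - 0.5 v(D) = 0
    [ 0.0,  0.0, -0.5,  1.0, -0.5],   # v(D) - 0.5 v(C) - 0.5 v(E) = 0
    [ 0.0,  0.0,  0.0, -0.5,  1.0],   # v(E) - 0.5 v(D)            = 0.5
])
b = np.array([0.0, 0.0, 0.0, 0.0, 0.5])

v = np.linalg.solve(M, b)
print(v)        # approximately [0.1667 0.3333 0.5 0.6667 0.8333]
print(v * 6)    # approximately [1. 2. 3. 4. 5.], i.e. 1/6 through 5/6
```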

Method 2:
I have not yet found another approach that is fundamentally different from Method 1. If anybody knows one, please tell me.
