Reinforcement Learning Exercise 5.1 & 5.2

Exercise 5.1 Consider the diagrams on the right in Figure 5.1. Why does the estimated value function jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left? Why are the frontmost values higher in the upper diagrams than in the lower?
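The estimated value function jumps up for the last two rows in the rear because those rows correspond to a player sum of 20 or 21, the only sums on which the player sticks. Sticking on 20 or 21 wins with much higher probability, whereas on every lower sum the player hits and risks going bust. The value drops off along the whole last row on the left because there the dealer is showing an ace, which can count as 1 or 11 and so gives the dealer a stronger hand, lowering the player's probability of winning. The frontmost values are higher in the upper diagrams than in the lower ones because in the upper diagrams the player has a usable ace: with a sum of 12 the ace can be recounted as 1 instead of 11, so the next hit cannot bust the player, which raises the probability of winning.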
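The effect of the dealer's showing card on a player who sticks can be checked with a small simulation. The sketch below is not from the original post: it assumes an infinite deck and a dealer who hits until reaching 17 or more (the rules of the book's blackjack example), and the names `draw_card`, `dealer_final`, and `stick_value` are hypothetical helpers. It ignores the special handling of naturals.

```python
import random

def draw_card(rng):
    # Infinite deck: ace is 1, face cards count as 10.
    return min(rng.randint(1, 13), 10)

def dealer_final(showing, rng):
    """Play out the dealer's hand and return the final sum (may exceed 21)."""
    hard_total = showing          # aces counted as 1
    has_ace = showing == 1
    while True:
        card = draw_card(rng)
        hard_total += card
        has_ace = has_ace or card == 1
        # Count one ace as 11 when that does not bust the hand.
        usable = has_ace and hard_total + 10 <= 21
        total = hard_total + 10 if usable else hard_total
        if total >= 17:           # assumed dealer rule: stick on 17 or more
            return total

def stick_value(player_sum, showing, rng, episodes=20000):
    """Monte Carlo estimate of the expected reward for sticking on player_sum."""
    total_reward = 0
    for _ in range(episodes):
        dealer = dealer_final(showing, rng)
        if dealer > 21 or dealer < player_sum:
            total_reward += 1     # win
        elif dealer > player_sum:
            total_reward -= 1     # loss (a draw adds 0)
    return total_reward / episodes

if __name__ == "__main__":
    rng = random.Random(0)
    for showing in (1, 6, 10):
        print(showing, stick_value(20, showing, rng))
```

Under these assumptions, the estimate for sticking on 20 should come out clearly lower when the dealer shows an ace (`showing=1`) than when the dealer shows a 6, which matches the drop along the leftmost row.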


Exercise 5.2 Suppose every-visit MC was used instead of first-visit MC on the blackjack task. Would you expect the results to be very different? Why or why not?

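The results would be the same. In blackjack, no state is visited twice within an episode: each hit either increases the player's sum or turns a usable ace into a non-usable one, and the dealer's showing card never changes. Every visit to a state is therefore also its first visit, so first-visit and every-visit MC average exactly the same returns. In addition, all rewards are zero except on the last step, so the return following any state in an episode is simply the final reward.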
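The claim that a state never recurs within an episode can be checked empirically. The following is a minimal sketch, not part of the original post: it assumes an infinite deck and a player who hits until reaching 20 or 21, and `player_states` is a hypothetical helper. The dealer's showing card is left out of the state because it is constant during an episode.

```python
import random

def draw_card(rng):
    # Infinite deck: ace is 1, face cards count as 10.
    return min(rng.randint(1, 13), 10)

def player_states(rng):
    """Return the (sum, usable_ace) states the player passes through in one episode."""
    hard_total, has_ace = 0, False   # hard_total counts every ace as 1
    states = []
    while True:
        card = draw_card(rng)
        hard_total += card
        has_ace = has_ace or card == 1
        usable = has_ace and hard_total + 10 <= 21
        total = hard_total + 10 if usable else hard_total
        if total > 21:
            break                    # bust: the episode ends
        if total >= 12:
            states.append((total, usable))  # sums below 12 always hit, so they are skipped
        if total >= 20:
            break                    # assumed policy: stick on 20 or 21
    return states

if __name__ == "__main__":
    rng = random.Random(0)
    repeats = 0
    for _ in range(100000):
        states = player_states(rng)
        if len(states) != len(set(states)):
            repeats += 1
    print("episodes with a repeated state:", repeats)
```

Since no episode contains a repeated state, the set of returns averaged by every-visit MC is identical to the set averaged by first-visit MC.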

