Exercise 5.1 Consider the diagrams on the right in Figure 5.1. Why does the estimated value function jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left? Why are the frontmost values higher in the upper diagrams than in the lower?
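The estimated value function jumps up for the last two rows in the rear because the policy sticks on 20 or 21, and with such high sums the player wins with much higher probability. For every lower sum the policy hits, which often makes the player go bust. The value drops off for the whole last row on the left because the dealer is showing an ace. An ace can count as 1 or 11, which makes the dealer less likely to go bust and so decreases the player's chance of winning. The frontmost values are higher in the upper diagrams than in the lower ones because in the upper diagrams the player holds a usable ace. If a hit would take the total over 21, the ace is counted as 1 instead, so hitting from a low sum such as 12 cannot bust the player immediately, which increases the probability of winning.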
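A small exact calculation can back up two of these explanations. The sketch below assumes an infinite deck (cards drawn with replacement, ten-valued cards with probability 4/13) and a dealer who hits below 17 and sticks on 17 or more. The helpers `hand_value`, `dealer_bust_prob` and `one_hit_bust_prob` are hypothetical names introduced only for illustration.

```python
from fractions import Fraction
from functools import lru_cache

# Infinite-deck assumption: every draw is independent. Values 1 (ace)
# through 9 each have probability 1/13; the four ten-valued ranks
# (10, J, Q, K) together have probability 4/13.
CARD_PROBS = {value: Fraction(1, 13) for value in range(1, 10)}
CARD_PROBS[10] = Fraction(4, 13)


def hand_value(hard, has_ace):
    """Best total of a hand whose aces are all counted as 1 in `hard`:
    one ace is promoted to 11 when that does not exceed 21."""
    if has_ace and hard + 10 <= 21:
        return hard + 10
    return hard


@lru_cache(maxsize=None)
def dealer_bust_prob(hard, has_ace):
    """Exact probability that the dealer eventually busts from this hand,
    assuming the dealer hits below 17 and sticks on 17 or more."""
    value = hand_value(hard, has_ace)
    if value > 21:
        return Fraction(1)
    if value >= 17:
        return Fraction(0)
    return sum(
        (p * dealer_bust_prob(hard + card, has_ace or card == 1)
         for card, p in CARD_PROBS.items()),
        Fraction(0),
    )


def one_hit_bust_prob(player_sum, usable_ace):
    """Probability that a single hit takes the player's best total over 21."""
    # With a usable ace, the hard total counts that ace as 1 instead of 11.
    hard = player_sum - 10 if usable_ace else player_sum
    return sum(
        (p for card, p in CARD_PROBS.items()
         if hand_value(hard + card, usable_ace or card == 1) > 21),
        Fraction(0),
    )


# The showing card is the dealer's first card; the hole card is the next draw.
for upcard in (1, 10, 6):
    print(upcard, float(dealer_bust_prob(upcard, upcard == 1)))

print(one_hit_bust_prob(12, True), one_hit_bust_prob(12, False))
```

Under these assumptions the dealer showing an ace busts less often than the dealer showing a ten or a six, and from a sum of 12 a single hit busts the player with probability 0 when the ace is usable versus 4/13 when it is not.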
Exercise 5.2 Suppose every-visit MC was used instead of first-visit MC on the blackjack task. Would you expect the results to be very different? Why or why not?
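The results would be the same. In blackjack the same state can never be visited twice in one episode. Without a usable ace every hit strictly increases the player's sum; with a usable ace a hit either increases the sum or forces the ace to be counted as 1, which changes the usable-ace part of the state, and a usable ace once lost cannot come back. Since each state occurs at most once per episode, its first visit is its only visit, so first-visit and every-visit MC average exactly the same returns. In addition, all rewards are zero except at the last step and there is no discounting, so the return from every state in an episode is simply the final reward.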
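A quick simulation can check this argument. The sketch below is a simplified blackjack under assumptions not taken from the exercise text: an infinite deck, naturals not treated specially, the player hitting automatically below 12, and the Figure 5.1 policy of sticking only on 20 or 21. The names `play_episode`, `mc_estimates` and `check` are hypothetical. It counts episodes in which any state repeats and compares the first-visit and every-visit averages.

```python
import random
from collections import defaultdict

# Infinite-deck assumption: cards are drawn with replacement.
# Ace = 1; 10, J, Q, K all count as 10.
CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]


def best_sum(hard, has_ace):
    """Return (best total, usable-ace flag) for a hand whose aces count as 1 in `hard`."""
    if has_ace and hard + 10 <= 21:
        return hard + 10, True
    return hard, False


def play_episode(rng):
    """Play one simplified hand under the policy 'stick on 20 or 21, else hit'.

    Returns the visited states (player sum, dealer showing card, usable ace)
    and the final reward (+1 win, 0 draw, -1 loss). Naturals are not special-cased.
    """
    player_hard, player_ace, dealt = 0, False, 0
    # Two initial cards, then automatic hits while the sum is below 12;
    # no decision is made there, so no state is recorded.
    while dealt < 2 or best_sum(player_hard, player_ace)[0] < 12:
        card = rng.choice(CARDS)
        player_hard, player_ace, dealt = player_hard + card, player_ace or card == 1, dealt + 1
    dealer_card = rng.choice(CARDS)

    states = []
    while True:
        total, usable = best_sum(player_hard, player_ace)
        if total > 21:
            return states, -1  # player goes bust
        states.append((total, dealer_card, usable))
        if total >= 20:
            break  # policy: stick on 20 or 21
        card = rng.choice(CARDS)
        player_hard, player_ace = player_hard + card, player_ace or card == 1

    # Dealer draws the hole card and keeps hitting below 17.
    dealer_hard, dealer_ace = dealer_card, dealer_card == 1
    while best_sum(dealer_hard, dealer_ace)[0] < 17:
        card = rng.choice(CARDS)
        dealer_hard, dealer_ace = dealer_hard + card, dealer_ace or card == 1

    dealer_total = best_sum(dealer_hard, dealer_ace)[0]
    if dealer_total > 21 or total > dealer_total:
        return states, 1
    return states, 0 if total == dealer_total else -1


def mc_estimates(episodes, first_visit):
    """Average the returns observed for each state. With no discounting and
    only a final reward, the return from each visit equals the episode reward."""
    sums, counts = defaultdict(float), defaultdict(int)
    for states, reward in episodes:
        seen = set()
        for s in states:
            if first_visit and s in seen:
                continue  # first-visit MC ignores later visits to s
            seen.add(s)
            sums[s] += reward
            counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}


def check(num_episodes=50_000, seed=0):
    """Return (number of episodes with a repeated state, first-visit == every-visit)."""
    rng = random.Random(seed)
    episodes = [play_episode(rng) for _ in range(num_episodes)]
    repeats = sum(len(states) != len(set(states)) for states, _ in episodes)
    first = mc_estimates(episodes, first_visit=True)
    every = mc_estimates(episodes, first_visit=False)
    return repeats, first == every
```

The two estimates can only diverge when a state repeats within an episode. For a made-up input where one does, such as `[(['s', 's'], 1.0), (['s'], 0.0)]`, `mc_estimates` gives 0.5 for first-visit and 2/3 for every-visit, which is exactly the situation the blackjack state space rules out.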