Exercise 5.5 Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability $p$ and transitions to the terminal state with probability $1-p$. Let the reward be $+1$ on all transitions, and let $\gamma = 1$. Suppose you observe one episode that lasts 10 steps, with a return of 10. What are the first-visit and every-visit estimators of the value of the nonterminal state?
For the first-visit estimator, only the return following the first visit to the state in each episode is used. The nonterminal state is first visited at $t = 0$, and the observed return from that point is 10, so

$$
\begin{aligned}
V(S_{nonterminal}) &= G_0 \\
&= 10
\end{aligned}
$$

For the every-visit estimator, the returns following every visit are averaged. With $\gamma = 1$ and a reward of $+1$ per step, the visits at $t = 0, 1, \dots, 9$ are followed by returns $10, 9, \dots, 1$, so

$$
\begin{aligned}
V(S_{nonterminal}) &= \frac{G_0 + G_1 + \dots + G_9}{10} \\
&= \frac{10 + 9 + \dots + 1}{10} = 5.5
\end{aligned}
$$
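As a quick sanity check, here is a minimal Python sketch (not part of the original solution; the variable names are illustrative assumptions) that computes both estimators directly from the single observed episode:

```python
# Minimal sketch: first-visit vs. every-visit Monte Carlo estimates
# for one observed episode of 10 steps, reward +1 per transition, gamma = 1.

rewards = [1] * 10  # the single observed episode
gamma = 1.0

# Return G_t following each time step; the nonterminal state is visited
# at every step t = 0, ..., 9 of this episode.
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()  # returns == [10, 9, ..., 1]

first_visit_estimate = returns[0]                    # use only the first visit: 10
every_visit_estimate = sum(returns) / len(returns)   # average over all visits: 5.5

print(first_visit_estimate)   # 10.0
print(every_visit_estimate)   # 5.5
```

The gap between the two estimates reflects the bias of the every-visit estimator on a finite amount of data; both estimators converge to the true value as the number of observed episodes grows.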