Chapter 6 Temporal-Difference Learning
- TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas.
- Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
6.1 TD Prediction
Constant-α MC (every visit):

V(St) ← V(St) + α[Gt − V(St)]

where Gt is the actual return following time t. Monte Carlo methods must wait until the end of the episode to determine the increment to V(St) (only then is Gt known), whereas TD methods need wait only until the next time step.
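A minimal sketch of the every-visit constant-α MC update in Python. The episode format (a list of `(state, reward)` pairs, where `reward` is received on leaving that state) is an assumption for illustration, not part of the source:

```python
def constant_alpha_mc(episodes, alpha=0.1, gamma=1.0):
    """Every-visit constant-alpha Monte Carlo prediction.

    Each episode is a finished trajectory; updates can only happen
    after the episode ends, because Gt is only known then.
    """
    V = {}  # state -> estimated value
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the return following each step.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            v = V.get(state, 0.0)
            # V(St) <- V(St) + alpha * (Gt - V(St))
            V[state] = v + alpha * (G - v)
    return V
```

Note that the inner loop touches every visit to a state, matching the "every visit" qualifier; a first-visit variant would update each state at most once per episode.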
The simplest TD method, known as TD(0):

V(St) ← V(St) + α[Rt+1 + γV(St+1) − V(St)]
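A sketch of a single TD(0) update, applied after every transition rather than at episode end; the helper name `td0_update` and the dict-backed value table are illustrative assumptions:

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """One TD(0) step: bootstrap from the current estimate of the
    next state instead of waiting for the actual return Gt."""
    target = reward + gamma * V.get(next_state, 0.0)  # TD target: Rt+1 + gamma*V(St+1)
    delta = target - V.get(state, 0.0)                # TD error
    V[state] = V.get(state, 0.0) + alpha * delta
    return V
```

Because the target uses the learned estimate V(St+1), TD(0) bootstraps in the DP sense while still learning from raw experience like MC.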