1. Stochastic Actor-Critic
Preface
- Vanilla policy gradients (REINFORCE) have some glaring issues: noisy gradient estimates and high variance. These issues contribute to instability and slow convergence.
- Traditional reinforcement learning methods such as Monte Carlo wait until an episode ends before updating parameters, whereas temporal-difference (TD) learning updates parameters at every time step.
- Improving policy gradients: reducing variance with a baseline.
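The Monte Carlo vs. TD contrast above can be sketched in code. The following is a minimal illustration, not part of the original notes, using a hypothetical deterministic chain task (states 0 through 4, reward 1 only at the end): Monte Carlo waits for the full episode and updates toward the observed return, while TD(0) updates at every step toward a bootstrapped target.

```python
# Minimal sketch (hypothetical toy chain MDP) contrasting Monte Carlo
# and TD(0) value-function updates.

N_STATES = 5          # states 0..4; the episode ends after state 4
GAMMA = 1.0           # no discounting on this short chain
ALPHA = 0.1           # step size

def run_episode():
    """Walk left-to-right through the chain; reward 1 only at the end."""
    states = list(range(N_STATES))
    rewards = [0.0] * (N_STATES - 1) + [1.0]
    return states, rewards

def mc_update(V):
    """Monte Carlo: wait for the whole episode, then move every visited
    state's value toward the observed return G."""
    states, rewards = run_episode()
    G = 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        G = r + GAMMA * G
        V[s] += ALPHA * (G - V[s])

def td0_update(V):
    """TD(0): update at every time step toward the bootstrapped target
    r + gamma * V(s')."""
    states, rewards = run_episode()
    for t, (s, r) in enumerate(zip(states, rewards)):
        v_next = V[states[t + 1]] if t + 1 < len(states) else 0.0
        V[s] += ALPHA * (r + GAMMA * v_next - V[s])

V_mc = [0.0] * N_STATES
V_td = [0.0] * N_STATES
for _ in range(200):
    mc_update(V_mc)
    td0_update(V_td)

# Both estimates converge toward the true value of 1.0 for every state;
# the difference is *when* the updates happen (end of episode vs. per step).
```

With gamma = 1 and a single terminal reward of 1, the true value of every state is 1.0, so both tables drift there; the per-step structure of `td0_update` is what lets actor-critic methods learn online instead of waiting for the episode to finish.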
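To see why a baseline reduces variance, the sketch below (my own illustrative example, not from the notes) uses a hypothetical 2-armed bandit with a sigmoid policy: subtracting a baseline near the average reward from the return leaves the expected gradient unchanged but shrinks its variance dramatically.

```python
# Hypothetical 2-armed bandit illustrating baseline variance reduction
# in the REINFORCE gradient estimator  g = (d/dtheta log pi(a)) * (R - b).
import math
import random
import statistics

random.seed(0)

THETA = 0.3                    # logit of choosing arm 1 under the policy
REWARDS = (10.0, 11.0)         # deterministic rewards of arms 0 and 1

def sample_grad(baseline):
    """One REINFORCE sample of d/dtheta E[R], with an optional baseline."""
    p1 = 1.0 / (1.0 + math.exp(-THETA))       # sigmoid policy: P(a=1)
    a = 1 if random.random() < p1 else 0
    r = REWARDS[a]
    # d log pi(a) / d theta for the sigmoid policy:
    #   a=1: 1 - p1,   a=0: -p1
    grad_logp = (1.0 - p1) if a == 1 else -p1
    return grad_logp * (r - baseline)

samples_no_b = [sample_grad(0.0) for _ in range(20000)]
samples_b = [sample_grad(10.5) for _ in range(20000)]   # baseline near E[R]

# The two estimators have the same mean (the baseline adds no bias, since
# E[grad_logp] = 0), but the baselined one has far smaller variance.
```

Because the rewards sit around 10 to 11, the raw estimator multiplies the score function by a large number that swings in sign with the sampled action; centering the return around the baseline removes that common offset, which is exactly the "reducing variance with a baseline" idea in the bullet above.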