Contents of the previous lecture and this lecture
Specifically:
Finite Difference Policy Gradient
Monte-Carlo Policy Gradient (REINFORCE; a minimal code sketch follows this list)
Actor-Critic Policy Gradient
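Of the three, Monte-Carlo Policy Gradient (REINFORCE) is the simplest to sketch in code. The following is a minimal illustrative sketch, not the course's implementation: the toy corridor environment, the tabular softmax policy, and all names (step, N_STATES, ALPHA, ...) are assumptions made here for the example.

```python
# A minimal Monte-Carlo policy gradient (REINFORCE) sketch.
# The corridor environment and all names below are illustrative assumptions,
# not taken from the lecture.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 2               # corridor of 5 cells; actions: left/right
GAMMA, ALPHA = 0.99, 0.1                 # discount factor and step size
theta = np.zeros((N_STATES, N_ACTIONS))  # tabular softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Move left (a=0) or right (a=1); reward 1 for reaching the right end."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

for episode in range(500):
    # Sample a full episode under the current stochastic policy pi_theta.
    s, trajectory, done = 0, [], False
    while not done:
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s2, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s2
    # REINFORCE update: for each step t, ascend grad log pi(a_t|s_t) * G_t,
    # where G_t is the Monte-Carlo return from time t onwards.
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + GAMMA * G
        probs = softmax(theta[s])
        grad_log = -probs
        grad_log[a] += 1.0               # one-hot(a) - probs: grad of log softmax
        theta[s] += ALPHA * G * grad_log

print("learned P(right) per state:",
      np.round([softmax(theta[s])[1] for s in range(N_STATES)], 2))
```

The key line is the update theta[s] += ALPHA * G * grad_log, which ascends the score function weighted by the sampled return; this is exactly the Monte-Carlo form of the policy gradient.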
Differences and connections (policy-based vs. value-based RL):
Advantages of Policy-Based RL:
Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies (the slides illustrate this with an Aliased Gridworld example: when two states look identical to the agent, any deterministic policy must act the same in both and can get stuck oscillating, whereas a stochastic policy that randomizes in the aliased states still reaches the goal)
Disadvantages of Policy-Based RL:
Typically converge to a local rather than global optimum
Evaluating a policy is typically inefficient and suffers from high variance
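A one-line illustration of the variance point (my addition, using the standard REINFORCE estimator rather than anything specific from the slides): the Monte-Carlo gradient estimate is

$$\nabla_\theta J(\theta) \;\approx\; \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t, \qquad G_t = \sum_{k \ge t} \gamma^{\,k-t} r_k$$

Because each $G_t$ is a single sampled return over a whole episode, the estimate is unbiased but noisy; this is exactly the high-variance problem that Actor-Critic methods later address by replacing $G_t$ with a learned critic.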
Mathematical formulation of the Policy-Gradient RL problem: