# Learning Reinforcement Learning (Part 30), 2021-01-30: An Introduction to Policy Optimization

#### Policy Optimization

##### Value-based vs. Policy-based RL
• Value-based RL
  • learns a value function
  • the policy is implicit, derived from the value function (e.g., acting greedily)
• Policy-based RL
  • no value function
  • learns the policy directly
• Actor-critic
  • learns both a policy and a value function
Advantages of policy-based RL:

• better convergence properties: we are guaranteed to converge at least to a local optimum (worst case), and possibly to a global optimum (best case)
• policy gradient methods are more effective in high-dimensional or continuous action spaces
• policy gradient methods can learn stochastic policies, which value-based methods cannot

Disadvantages of policy-based RL:

• typically converges only to a local optimum rather than a global one
• evaluating a policy typically has high variance
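The stochastic-policy point above can be made concrete with a minimal sketch (the Q-values and parameters below are made up for illustration): a value-based agent acts deterministically via an argmax over its value estimates, while a policy-based agent samples actions from a parameterized distribution such as a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Value-based: the policy is implicit and deterministic -
# always take the argmax of the learned Q-values.
Q = np.array([1.0, 2.5, 0.3])            # hypothetical Q-values for 3 actions
greedy_action = int(np.argmax(Q))        # always action 1

# Policy-based: an explicitly parameterized stochastic policy
# pi(a) = softmax(theta), from which actions are sampled.
theta = np.array([1.0, 2.5, 0.3])        # hypothetical policy parameters
probs = np.exp(theta - theta.max())      # subtract max for numerical stability
probs /= probs.sum()
sampled_action = int(rng.choice(3, p=probs))  # can differ from call to call

print(greedy_action, probs.round(3))
```

The greedy agent can never represent a policy that deliberately randomizes (useful, e.g., in partially observed or adversarial settings), whereas the softmax policy assigns nonzero probability to every action.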
##### Methods for Policy Optimization
• Policy-based RL is an optimization problem: find the $\theta$ that maximizes $J(\theta)$.
• If $J(\theta)$ is differentiable, we can use gradient-based methods:
  • quasi-Newton methods
• If $J(\theta)$ is non-differentiable or its derivative is hard to compute, we can use derivative-free, black-box optimization methods:
  • Cross-entropy method (CEM)
  • Hill climbing
  • Evolutionary algorithms
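Of the derivative-free options listed, the cross-entropy method is the simplest to sketch: repeatedly sample candidate parameters from a Gaussian, evaluate $J$ on each, and refit the Gaussian to the top-scoring elite fraction. Below is a minimal sketch on a toy quadratic objective; the function names and hyperparameters are illustrative choices, not from the original post.

```python
import numpy as np

def cem_maximize(J, dim, iters=50, pop=64, elite_frac=0.2, seed=0):
    """Derivative-free maximization of J(theta) via the cross-entropy method."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)   # initial sampling distribution
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        # Sample a population of candidate parameter vectors.
        thetas = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([J(t) for t in thetas])
        # Keep the n_elite highest-scoring candidates and refit the Gaussian.
        elites = thetas[np.argsort(scores)[-n_elite:]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6     # small floor avoids collapse
    return mu

# Toy differentiable-free objective with its maximum at theta = (1, -2).
J = lambda t: -np.sum((t - np.array([1.0, -2.0])) ** 2)
theta_star = cem_maximize(J, dim=2)
print(theta_star)
```

Note that CEM only ever queries $J$ at sampled points, so it applies equally well when $J(\theta)$ is a noisy return estimate from rolling out a policy with parameters $\theta$.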
