Applying a baseline to actor-critic yields Advantage Actor-Critic (A2C)
- A baseline is generally a reference policy or performance level against which other policies or algorithms are measured; in A2C, the state value $V_\pi(s)$ estimated by the critic serves as the baseline, reducing the variance of the policy gradient
A2C
- Policy network (actor): $\pi(a \mid s; \boldsymbol{\theta})$
  - An approximation of the policy function $\pi(a \mid s)$
  - Controls the agent
- Value network (critic): $v(s; \mathbf{w})$
  - An approximation of the state-value function $V_\pi(s)$
  - Evaluates how good state $s$ is
- The action-value function depends on both the action and the state
- The state-value function $V$ depends only on the state
- So $V$ is easier to train (a minimal PyTorch sketch of both networks follows this list)
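A minimal PyTorch sketch of the two networks, assuming a flat state vector and a discrete action space; the class names and layer sizes (`state_dim`, `action_dim`, `hidden_dim`) are illustrative choices, not from the source.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor: pi(a | s; theta), outputs a probability for each discrete action."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1),  # action probabilities
        )

    def forward(self, state):
        return self.net(state)

class ValueNetwork(nn.Module):
    """Critic: v(s; w), a scalar estimate of the state value V_pi(s)."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state):
        return self.net(state)
```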
Training of A2C
- At each step, observe one transition $(s_t, a_t, r_t, s_{t+1})$
- Compute the TD target: $y_t = r_t + \gamma \cdot v(s_{t+1}; \mathbf{w})$
- Compute the TD error: $\delta_t = v(s_t; \mathbf{w}) - y_t$
- Update the policy network: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$
- Update the value network: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}$ (a training-step sketch in PyTorch follows this list)
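A minimal sketch of one training step under the same assumptions, reusing the `PolicyNetwork` / `ValueNetwork` classes from the earlier sketch. The updates are written as SGD steps on surrogate losses whose gradients (with $\delta_t$ held constant) match the two update rules above; the hyperparameters and dimensions are illustrative.

```python
import torch

# Illustrative hyperparameters and dimensions (not from the source)
gamma, alpha, beta = 0.99, 1e-3, 1e-4
actor = PolicyNetwork(state_dim=4, action_dim=2)   # pi(a|s; theta)
critic = ValueNetwork(state_dim=4)                 # v(s; w)
actor_opt = torch.optim.SGD(actor.parameters(), lr=beta)
critic_opt = torch.optim.SGD(critic.parameters(), lr=alpha)

def a2c_step(s_t, a_t, r_t, s_next):
    """One A2C update from a single observed transition (s_t, a_t, r_t, s_{t+1})."""
    s_t = torch.as_tensor(s_t, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # TD target: y_t = r_t + gamma * v(s_{t+1}; w)
    with torch.no_grad():
        y_t = r_t + gamma * critic(s_next)

    # TD error: delta_t = v(s_t; w) - y_t
    v_t = critic(s_t)
    delta_t = v_t - y_t

    # Value network update: the gradient of 0.5 * delta_t^2 w.r.t. w is
    # delta_t * dv/dw, so one SGD step gives  w <- w - alpha * delta_t * dv/dw
    critic_opt.zero_grad()
    (0.5 * delta_t.pow(2)).backward()
    critic_opt.step()

    # Policy network update: with delta_t held constant (detached), the gradient
    # of delta_t * ln pi(a_t|s_t; theta) w.r.t. theta is delta_t * d ln pi / d theta,
    # so one SGD step gives  theta <- theta - beta * delta_t * d ln pi / d theta
    log_prob = torch.log(actor(s_t)[a_t])
    actor_opt.zero_grad()
    (delta_t.detach() * log_prob).backward()
    actor_opt.step()
```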
[Reference]: https://www.bilibili.com/video/BV1f34y1P7tu?p=16&vd_source=fdaf11557adf2f6bdc6ceed86a17b97e