Actor and critic
- Actor: the policy update, i.e., the part that improves the policy.
- Critic: policy evaluation based on value estimation.
$$\theta_{t+1}=\theta_t+\alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_t\left(s_t, a_t\right)$$
- The update above is the actor.
- Estimating $q_t\left(s_t, a_t\right)$ is the critic.
- If $q_t$ is estimated by Monte Carlo, as in the previous lecture, the algorithm is REINFORCE.
- If $q_t$ is estimated by TD learning, as in this lecture, we get actor-critic methods.
【QAC pseudocode】
- Goal: find an optimal policy by maximizing $J(\theta)$.
- At each step $t$: generate $a_t$ following $\pi\left(a \mid s_t, \theta_t\right)$, observe $r_{t+1}, s_{t+1}$, then generate $a_{t+1}$ following $\pi\left(a \mid s_{t+1}, \theta_t\right)$, giving the experience tuple $\left(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\right)$.
- Critic (value update):
$$w_{t+1}=w_t+\alpha_w\left[r_{t+1}+\gamma q\left(s_{t+1}, a_{t+1}, w_t\right)-q\left(s_t, a_t, w_t\right)\right] \nabla_w q\left(s_t, a_t, w_t\right)$$
- Actor (policy update):
$$\theta_{t+1}=\theta_t+\alpha_\theta \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q\left(s_t, a_t, w_{t+1}\right)$$
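A minimal runnable sketch of one QAC iteration under simplifying assumptions (tabular softmax actor, tabular critic $q(s,a,w)=w[s,a]$); all names and the environment-free setup are illustrative, not from the lecture:

```python
import numpy as np

n_states, n_actions = 5, 3
theta = np.zeros((n_states, n_actions))   # actor parameters: softmax logits
w = np.zeros((n_states, n_actions))       # critic parameters: q(s,a,w) = w[s,a]
alpha_theta, alpha_w, gamma = 0.01, 0.05, 0.9

def pi(s, theta):
    """Softmax policy pi(. | s, theta)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def qac_step(s, a, r, s_next, a_next, theta, w):
    # Critic (value update): SARSA-style TD update of q(s,a,w).
    # For the tabular critic, grad_w q is an indicator, so only w[s, a] moves.
    td_error = r + gamma * w[s_next, a_next] - w[s, a]
    w[s, a] += alpha_w * td_error
    # Actor (policy update): theta += alpha * grad log pi(a|s) * q(s,a,w_{t+1}).
    # For a softmax policy, grad_{theta[s]} log pi(a|s) = e_a - pi(.|s).
    grad_log_pi = -pi(s, theta)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * grad_log_pi * w[s, a]   # w already updated
    return theta, w

# One experience tuple (s, a, r, s', a') drives one QAC step:
theta, w = qac_step(s=0, a=1, r=1.0, s_next=2, a_next=0, theta=theta, w=w)
```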
【Advantage actor-critic (A2C)】
A2C extends QAC with a baseline to reduce the variance of the gradient estimate:
$$\begin{aligned} \nabla_\theta J(\theta) & =\mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) q_\pi(S, A)\right] \\ & =\mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right)\left(q_\pi(S, A)-b(S)\right)\right] \end{aligned}$$
- The subtracted $b(S)$ is a baseline, a function of $S$ only; the two expressions above are equal.
Question 1: why does the equality hold?
Answer: because
$$\mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) b(S)\right]=0$$
$$\begin{aligned} \mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) b(S)\right] & =\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi\left(a \mid s, \theta_t\right) \nabla_\theta \ln \pi\left(a \mid s, \theta_t\right) b(s) \\ & =\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi\left(a \mid s, \theta_t\right) b(s) \\ & =\sum_{s \in \mathcal{S}} \eta(s) b(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi\left(a \mid s, \theta_t\right) \\ & =\sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi\left(a \mid s, \theta_t\right) \\ & =\sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_\theta 1=0 \end{aligned}$$
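The final step uses $\sum_{a \in \mathcal{A}} \pi\left(a \mid s, \theta_t\right)=1$, whose gradient is zero. The identity is also easy to check numerically; a minimal sketch for a single state with a softmax policy (all names illustrative):

```python
import numpy as np

# Verify sum_a pi(a|s) * grad_theta log pi(a|s) * b(s) = 0 exactly,
# for a softmax policy at one state and an arbitrary baseline b(s).
rng = np.random.default_rng(0)
logits = rng.normal(size=4)                  # theta for one state
pi_s = np.exp(logits) / np.exp(logits).sum()
b = 3.7                                      # any function of s alone

total = np.zeros_like(logits)
for a in range(len(pi_s)):
    grad_log_pi = -pi_s.copy()               # grad_logits log pi(a|s) = e_a - pi
    grad_log_pi[a] += 1.0
    total += pi_s[a] * grad_log_pi * b       # exact expectation over actions

print(np.allclose(total, 0.0))               # True: the baseline is unbiased
```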
Question 2: why introduce a baseline at all?
Answer: although the baseline does not change the expectation, it does change the variance of the stochastic gradient; we want a baseline that makes this variance small. A simple and effective choice (close to, though not exactly, the variance-minimizing one) is
$$b(s)=\mathbb{E}_{A \sim \pi}[q(s, A)]=v_\pi(s)$$
✌Applying this baseline to actor-critic:
$$\begin{aligned} \theta_{t+1} & =\theta_t+\alpha \mathbb{E}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right)\left(q_\pi(S, A)-v_\pi(S)\right)\right] \\ & \doteq \theta_t+\alpha \mathbb{E}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) \delta_\pi(S, A)\right] \end{aligned}$$
where $\delta_\pi(S, A) \doteq q_\pi(S, A)-v_\pi(S)$ is the advantage function: it measures how much better action $A$ is than the average value of state $S$.
The stochastic (sample-based) version is:
$$\begin{aligned} \theta_{t+1} & =\theta_t+\alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) \delta_t\left(s_t, a_t\right) \\ & =\theta_t+\alpha \frac{\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)}{\pi\left(a_t \mid s_t, \theta_t\right)} \delta_t\left(s_t, a_t\right) \\ & =\theta_t+\alpha \underbrace{\left(\frac{\delta_t\left(s_t, a_t\right)}{\pi\left(a_t \mid s_t, \theta_t\right)}\right)}_{\text {step size }} \nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right) \end{aligned}$$
The coefficient $\delta_t / \pi$ acts as an adaptive step size: it is large when the advantage $\delta_t$ is large (exploitation) and when the current probability $\pi\left(a_t \mid s_t, \theta_t\right)$ of the chosen action is small (exploration).
In practice the advantage is approximated by the TD error:
$$\delta_t=q_t\left(s_t, a_t\right)-v_t\left(s_t\right) \rightarrow r_{t+1}+\gamma v_t\left(s_{t+1}\right)-v_t\left(s_t\right)$$
This is justified because $\mathbb{E}\left[r_{t+1}+\gamma v_\pi\left(s_{t+1}\right) \mid s_t, a_t\right]=q_\pi\left(s_t, a_t\right)$, and it means the critic only needs to estimate a single value function $v$ rather than both $q$ and $v$.
✌A2C (TD actor-critic) pseudocode:
- Goal: find an optimal policy by maximizing $J(\theta)$.
- At each step $t$: generate $a_t$ following $\pi\left(a \mid s_t, \theta_t\right)$, observe $r_{t+1}, s_{t+1}$.
- TD error (advantage function):
$$\delta_t=r_{t+1}+\gamma v\left(s_{t+1}, w_t\right)-v\left(s_t, w_t\right)$$
- Critic (value update):
$$w_{t+1}=w_t+\alpha_w \delta_t \nabla_w v\left(s_t, w_t\right)$$
- Actor (policy update):
$$\theta_{t+1}=\theta_t+\alpha_\theta \delta_t \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right)$$
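A minimal sketch of one A2C step, under the same tabular/softmax assumptions as the QAC sketch above but with a state-value critic $v(s,w)=w[s]$ (names illustrative):

```python
import numpy as np

n_states, n_actions = 5, 3
theta = np.zeros((n_states, n_actions))   # actor: softmax logits
w = np.zeros(n_states)                    # critic: v(s, w) = w[s]
alpha_theta, alpha_w, gamma = 0.01, 0.05, 0.9

def a2c_step(s, a, r, s_next, theta, w):
    # TD error doubles as the advantage estimate.
    delta = r + gamma * w[s_next] - w[s]
    # Critic (value update): tabular grad_w v is an indicator, so only w[s] moves.
    w[s] += alpha_w * delta
    # Actor (policy update): scale grad log pi(a|s) by the TD error.
    pi_s = np.exp(theta[s] - theta[s].max())
    pi_s /= pi_s.sum()
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log_pi
    return theta, w

# Note: unlike QAC, A2C only needs (s, a, r, s'), not the next action.
theta, w = a2c_step(s=0, a=2, r=1.0, s_next=3, theta=theta, w=w)
```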
【Off-policy actor-critic】
✌Importance sampling:
$$\mathbb{E}_{X \sim p_0}[X]=\sum_x p_0(x) x=\sum_x p_1(x) \underbrace{\frac{p_0(x)}{p_1(x)} x}_{f(x)}=\mathbb{E}_{X \sim p_1}[f(X)]$$
- How to estimate $\mathbb{E}_{X \sim p_1}[f(X)]$? With the sample mean:
$$\bar{f} \doteq \frac{1}{n} \sum_{i=1}^n f\left(x_i\right), \quad \text{where } x_i \sim p_1$$
Then:
$$\begin{aligned} \mathbb{E}_{X \sim p_1}[\bar{f}] & =\mathbb{E}_{X \sim p_1}[f(X)] \\ \operatorname{var}_{X \sim p_1}[\bar{f}] & =\frac{1}{n} \operatorname{var}_{X \sim p_1}[f(X)] \end{aligned}$$
So the quantity we want can be estimated as:
$$\mathbb{E}_{X \sim p_0}[X] \approx \bar{f}=\frac{1}{n} \sum_{i=1}^n f\left(x_i\right)=\frac{1}{n} \sum_{i=1}^n \frac{p_0\left(x_i\right)}{p_1\left(x_i\right)} x_i$$
- $\frac{p_0\left(x_i\right)}{p_1\left(x_i\right)}$ is called the importance weight.
- If $p_1\left(x_i\right)=p_0\left(x_i\right)$: the importance weight is 1 and $\bar{f}$ reduces to the ordinary sample mean $\bar{x}$.
- If $p_0\left(x_i\right) \geq p_1\left(x_i\right)$: $x_i$ is more likely under the target $p_0$ than under the sampling distribution $p_1$, so it receives an importance weight $\geq 1$ to compensate. A small numeric check follows below.
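A numeric check of the identity, with illustrative distributions $p_0$ and $p_1$ over $\{+1,-1\}$:

```python
import numpy as np

# Estimate E_{X~p0}[X] from samples drawn under p1, using importance
# weights p0(x)/p1(x). True mean under p0 = (0.5, 0.5) is 0.
rng = np.random.default_rng(0)
values = np.array([+1.0, -1.0])
p0 = np.array([0.5, 0.5])     # target distribution
p1 = np.array([0.8, 0.2])     # sampling (behavior) distribution

idx = rng.choice(2, size=100_000, p=p1)   # x_i ~ p1
x = values[idx]
weights = p0[idx] / p1[idx]               # importance weights

print(x.mean())               # ~ +0.6: the plain mean estimates E_{p1}[X]
print((weights * x).mean())   # ~  0.0: the weighted mean estimates E_{p0}[X]
```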
✌Off-policy policy gradient (with behavior policy $\beta$):
$$\nabla_\theta J(\theta)=\mathbb{E}_{S \sim \rho, A \sim \beta}\left[\frac{\pi(A \mid S, \theta)}{\beta(A \mid S)} \nabla_\theta \ln \pi(A \mid S, \theta) q_\pi(S, A)\right]$$
✌Off-policy actor-critic (policy update):
$$\theta_{t+1}=\theta_t+\alpha_\theta \frac{\pi\left(a_t \mid s_t, \theta_t\right)}{\beta\left(a_t \mid s_t\right)} \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right)\left(q_t\left(s_t, a_t\right)-v_t\left(s_t\right)\right)$$
where
$$q_t\left(s_t, a_t\right)-v_t\left(s_t\right) \approx r_{t+1}+\gamma v_t\left(s_{t+1}\right)-v_t\left(s_t\right) \doteq \delta_t\left(s_t, a_t\right)$$
Using $\nabla_\theta \ln \pi=\nabla_\theta \pi / \pi$, the update can be rewritten as:
$$\theta_{t+1}=\theta_t+\alpha_\theta \frac{\delta_t\left(s_t, a_t\right)}{\beta\left(a_t \mid s_t\right)} \nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)$$
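A sketch of how the importance weight changes the A2C step from the earlier sketch: the TD error is scaled by $\pi\left(a_t \mid s_t, \theta_t\right) / \beta\left(a_t \mid s_t\right)$; here the critic update also carries the weight, matching the off-policy TD form, and `beta_prob` (the behavior policy's probability of the logged action) is assumed to come with the sample:

```python
import numpy as np

def off_policy_a2c_step(s, a, r, s_next, theta, w, beta_prob,
                        alpha_theta=0.01, alpha_w=0.05, gamma=0.9):
    """One off-policy A2C step. beta_prob is beta(a|s), the probability with
    which the behavior policy chose a; tabular/softmax setup as in the A2C
    sketch above (illustrative)."""
    pi_s = np.exp(theta[s] - theta[s].max())
    pi_s /= pi_s.sum()
    rho = pi_s[a] / beta_prob               # importance weight pi / beta
    delta = r + gamma * w[s_next] - w[s]    # TD error
    w[s] += alpha_w * rho * delta           # critic: weighted off-policy TD update
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * rho * delta * grad_log_pi
    return theta, w
```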
【Deterministic actor-critic (DPG)】
Previously, the policy was represented as $\pi(a \mid s, \theta) \in[0,1]$, which can be either stochastic or deterministic. Now we represent a deterministic policy directly as:
$$a=\mu(s, \theta) \doteq \mu(s)$$
This is a mapping from the state space to the action space, often abbreviated as $\mu(s)$.
The objective is the expected return over an initial state distribution $d_0$:
$$J(\theta)=\mathbb{E}\left[v_\mu(s)\right]=\sum_{s \in \mathcal{S}} d_0(s) v_\mu(s)$$
- Case 1: $d_0\left(s_0\right)=1$ and $d_0\left(s \neq s_0\right)=0$, i.e., we only care about the return starting from one particular state $s_0$.
- Case 2: $d_0$ is the stationary distribution of a behavior policy; since that policy differs from $\mu$, this setting is off-policy.
Applying gradient ascent:
$$\theta_{t+1}=\theta_t+\alpha_\theta \mathbb{E}_{S \sim \rho_\mu}\left[\left.\nabla_\theta \mu(S)\left(\nabla_a q_\mu(S, a)\right)\right|_{a=\mu(S)}\right]$$
Replacing the expectation with a sample gives the stochastic version:
$$\theta_{t+1}=\theta_t+\left.\alpha_\theta \nabla_\theta \mu\left(s_t\right)\left(\nabla_a q_\mu\left(s_t, a\right)\right)\right|_{a=\mu\left(s_t\right)}$$
Note that no action has to be sampled from the policy being learned, so DPG is naturally off-policy and needs no importance sampling.
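A minimal sketch of this deterministic actor update, assuming a linear policy $\mu(s, \theta)=\theta^{\top} \phi(s)$ over a 1-D action and a hand-written critic gradient $\nabla_a q$ (the critic $q(s, a)=-(a-s)^2$ and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)      # actor parameters for mu(s, theta)
alpha_theta = 0.01

def phi(s):
    """Illustrative state features."""
    return np.array([1.0, s, s**2, np.sin(s)])

def mu(s, theta):
    """Deterministic policy: a scalar action, linear in theta."""
    return theta @ phi(s)

def dq_da(s, a):
    """Critic gradient w.r.t. the action; here q(s, a) = -(a - s)**2,
    so dq/da = -2 (a - s). A learned critic would supply this instead."""
    return -2.0 * (a - s)

def dpg_step(s, theta):
    a = mu(s, theta)
    # Chain rule: grad_theta q(s, mu(s,theta)) = grad_theta mu(s) * dq/da|_{a=mu(s)}.
    grad_theta_mu = phi(s)      # since mu is linear in theta
    return theta + alpha_theta * grad_theta_mu * dq_da(s, a)

theta = dpg_step(s=0.5, theta=theta)
# No action is sampled from the policy itself, so (s, a, r, s') can come
# from any behavior policy without importance weights.
```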