value-based methods -> policy-based methods
value function approximation -> policy function approximation
【The Idea of Policy Gradient】
Previously, the policy was represented as a table: each state had one entry per action.
Now the policy is a parameterized function:

$$\pi(a \mid s, \theta)$$

where $\theta \in \mathbb{R}^m$ is a parameter vector. Instead of looking up an entry in a table, we now have to evaluate the function (e.g., run a forward pass) to obtain the probability of each action.
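As a concrete illustration (not part of the original notes), here is a minimal sketch of one possible parameterization: a linear-softmax policy over state features. The names `policy`, `softmax`, the feature vector, and the dimensions are all hypothetical choices for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy(phi_s, theta):
    """pi(a | s, theta): one linear score per action followed by a softmax.

    phi_s: feature vector phi(s) of shape (d,)
    theta: parameter matrix of shape (n_actions, d)
    returns the probability of every action, shape (n_actions,)
    """
    return softmax(theta @ phi_s)

# toy usage: 3 actions, 4-dimensional state features
theta = 0.1 * np.random.randn(3, 4)
phi_s = np.array([1.0, 0.5, -0.2, 0.3])
print(policy(phi_s, theta))               # a probability vector that sums to 1
```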
【Metrics for Defining Optimal Policies】
- average state value:
  $$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]$$
  - $\bar{v}_\pi$ is a weighted average of the state values
  - $d(s) \geq 0$ is the weight of state $s$
  - $\sum_{s \in \mathcal{S}} d(s) = 1$, so $d(s)$ can also be read as the probability that state $s$ is selected
  - in vector form:
    $$\bar{v}_\pi = \mathbb{E}\left[v_\pi(S)\right], \qquad \bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s) = d^T v_\pi$$
    where
    $$v_\pi = \left[\ldots, v_\pi(s), \ldots\right]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d = \left[\ldots, d(s), \ldots\right]^T \in \mathbb{R}^{|\mathcal{S}|}$$
- How to choose $d$:
  - Case 1: $d$ is independent of the policy $\pi$. In this case we write $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$.
    - all states are weighted equally (uniform distribution): $d_0(s) = 1/|\mathcal{S}|$
    - some states are preferred; in the extreme case $d_0(s_0) = 1,\ d_0(s \neq s_0) = 0$
  - Case 2: $d$ depends on the policy $\pi$. A common choice is the stationary distribution $d_\pi$, which describes the long-run fraction of time spent in each state when the agent keeps interacting with the environment under $\pi$. It satisfies
    $$d_\pi^T P_\pi = d_\pi^T$$
    (a numerical check appears in the sketch after this list).
- average one-step reward:
  $$\bar{r}_\pi \doteq \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}\left[r_\pi(S)\right]$$
  - $r_\pi(s) \doteq \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s, a)$ is the mean of the one-step immediate rewards starting from state $s$
  - $r(s, a) = \mathbb{E}[R \mid s, a] = \sum_r r\, p(r \mid s, a)$
  - $\bar{r}_\pi$ also equals the long-run average reward per time step:
    $$\lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[\sum_{k=1}^n R_{t+k} \mid S_t = s_0\right] = \lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[\sum_{k=1}^n R_{t+k}\right] = \sum_s d_\pi(s)\, r_\pi(s) = \bar{r}_\pi$$
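To make the two metrics and the stationary distribution concrete, here is a small numerical sketch (the transition matrix $P_\pi$, the rewards $r_\pi$, and $\gamma$ are made up for this example): it solves the Bellman equation for $v_\pi$, finds $d_\pi$ by repeatedly applying $d^T \leftarrow d^T P_\pi$, and evaluates $\bar{v}_\pi = d_\pi^T v_\pi$ and $\bar{r}_\pi = d_\pi^T r_\pi$.

```python
import numpy as np

# A made-up 3-state Markov chain induced by some policy pi:
# P_pi[s, s'] = P(s' | s) under pi, r_pi[s] = expected one-step reward from s.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.3, 0.5]])
r_pi = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# state values from the Bellman equation: v_pi = r_pi + gamma * P_pi v_pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# stationary distribution: iterate d^T <- d^T P_pi until it stops changing
d_pi = np.full(3, 1 / 3)
for _ in range(1000):
    d_pi = d_pi @ P_pi
print(np.allclose(d_pi @ P_pi, d_pi))     # True: d_pi^T P_pi = d_pi^T

v_bar = d_pi @ v_pi                        # average state value with d = d_pi
r_bar = d_pi @ r_pi                        # average one-step reward
print(v_bar, r_bar)
```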
metrics 1:
- All of the metrics are functions of the policy $\pi$.
- Since the policy is a function parameterized by $\theta$, all of the metrics above are functions of $\theta$.
- We therefore look for the optimal $\theta$ that maximizes the metrics.
metrics 2:
- The metrics can be defined for the discounted case $\gamma \in [0, 1)$ or the undiscounted case $\gamma = 1$.
metrics 3:
$$\bar{r}_\pi = (1-\gamma)\, \bar{v}_\pi$$
- Optimizing one of the two metrics therefore also drives the other to its extremum.
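The relation in metrics 3 follows directly from the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ when the weight is the stationary distribution $d_\pi$ (a short derivation, added here for completeness):

$$\bar{v}_\pi = d_\pi^T v_\pi = d_\pi^T r_\pi + \gamma\, d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma\, d_\pi^T v_\pi = \bar{r}_\pi + \gamma\, \bar{v}_\pi \;\Longrightarrow\; \bar{r}_\pi = (1-\gamma)\, \bar{v}_\pi .$$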
【Gradients of the Metrics】
Computing the gradients of the metrics is the most involved part of policy gradient methods, because:
- we need to distinguish between the different metrics $\bar{v}_\pi, \bar{r}_\pi, \bar{v}_\pi^0$
- we need to distinguish between the discounted and the undiscounted case
All of the resulting gradients share the common form
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta)\, q_\pi(s, a)$$
- $J(\theta)$ can be $\bar{v}_\pi$, $\bar{r}_\pi$, or $\bar{v}_\pi^0$
- "=" may denote strict equality, approximation, or proportionality, depending on the case
- $\eta$ is a distribution (weight) over the states
The specific results are
$$\begin{gathered} \nabla_\theta \bar{r}_\pi \simeq \sum_s d_\pi(s) \sum_a \nabla_\theta \pi(a \mid s, \theta)\, q_\pi(s, a), \\ \nabla_\theta \bar{v}_\pi = \frac{1}{1-\gamma} \nabla_\theta \bar{r}_\pi, \\ \nabla_\theta \bar{v}_\pi^0 = \sum_{s \in \mathcal{S}} \rho_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta)\, q_\pi(s, a) \end{gathered}$$
The gradient above can be rewritten in expectation form:
$$\begin{aligned} \nabla_\theta J(\theta) & = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta)\, q_\pi(s, a) \\ & = \mathbb{E}\left[\nabla_\theta \ln \pi(A \mid S, \theta)\, q_\pi(S, A)\right] \end{aligned}$$
where $S \sim \eta$ and $A \sim \pi(A \mid S, \theta)$.
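The key step behind this rewriting is the log-derivative identity (a standard fact, spelled out here for completeness):
$$\nabla_\theta \ln \pi(a \mid s, \theta) = \frac{\nabla_\theta \pi(a \mid s, \theta)}{\pi(a \mid s, \theta)} \;\Longrightarrow\; \nabla_\theta \pi(a \mid s, \theta) = \pi(a \mid s, \theta)\, \nabla_\theta \ln \pi(a \mid s, \theta),$$
so
$$\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta)\, q_\pi(s, a) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a \mid s, \theta)\, \nabla_\theta \ln \pi(a \mid s, \theta)\, q_\pi(s, a),$$
which is exactly the expectation over $S \sim \eta$ and $A \sim \pi(\cdot \mid S, \theta)$. The expectation form is what allows the gradient to be approximated by samples.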
【Gradient Ascent Algorithm: REINFORCE】
$$\begin{aligned} \theta_{t+1} & = \theta_t + \alpha \nabla_\theta J(\theta) \\ & = \theta_t + \alpha\, \mathbb{E}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) q_\pi(S, A)\right] \end{aligned}$$
We cannot know all of the information about the environment, so the true gradient (the expectation) is replaced by a stochastic, sampled gradient:
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_\pi\left(s_t, a_t\right)$$
$q_\pi$ is the true action value under the policy $\pi$, which is also unknown, so it is replaced by an approximation $q_t$:
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_t\left(s_t, a_t\right)$$
- If $q_t(s_t, a_t)$ is obtained by Monte Carlo estimation (the discounted return collected along the episode), the combination is called REINFORCE; a sketch of a single such update step follows this list.
- $q_t$ can also be obtained by a TD algorithm.
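As an illustration of one stochastic update step (continuing the hypothetical linear-softmax policy sketched earlier; `q_est`, the step size, and all shapes are assumptions for this example, with `q_est` standing for whatever estimate of the action value is used):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, phi_s, a):
    """grad_theta ln pi(a | s, theta) for a linear-softmax policy.

    theta: (n_actions, d) parameters, phi_s: (d,) state features, a: chosen action.
    For this parameterization, the gradient w.r.t. row b of theta is
    (1[a == b] - pi(b | s)) * phi(s).
    """
    probs = softmax(theta @ phi_s)
    indicator = np.zeros(len(probs))
    indicator[a] = 1.0
    return np.outer(indicator - probs, phi_s)      # same shape as theta

# one stochastic policy-gradient step: theta <- theta + alpha * q_est * grad ln pi
alpha = 0.1
q_est = 2.5                                        # hypothetical estimate of q_t(s_t, a_t)
theta = np.zeros((3, 4))
phi_s = np.array([1.0, 0.5, -0.2, 0.3])
a_t = 2
theta = theta + alpha * q_est * grad_log_pi(theta, phi_s, a_t)
```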
The update can be rewritten as:
$$\begin{aligned} \theta_{t+1} & = \theta_t + \alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_t\left(s_t, a_t\right) \\ & = \theta_t + \alpha \underbrace{\left(\frac{q_t\left(s_t, a_t\right)}{\pi\left(a_t \mid s_t, \theta_t\right)}\right)}_{\beta_t} \nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right) . \end{aligned}$$
This form reveals an interesting property of the algorithm:
$$\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)$$
When the step size $\alpha \beta_t$ is sufficiently small:
- $\beta_t > 0$: the update is gradient ascent on $\pi(a_t \mid s_t, \theta)$, so
  $$\pi\left(a_t \mid s_t, \theta_{t+1}\right) > \pi\left(a_t \mid s_t, \theta_t\right)$$
- $\beta_t < 0$: the update is gradient descent on $\pi(a_t \mid s_t, \theta)$, so
  $$\pi\left(a_t \mid s_t, \theta_{t+1}\right) < \pi\left(a_t \mid s_t, \theta_t\right)$$
This follows from a first-order Taylor expansion: when $\theta_{t+1} - \theta_t$ is small,
$$\pi\left(a_t \mid s_t, \theta_{t+1}\right) \approx \pi\left(a_t \mid s_t, \theta_t\right) + \left(\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)\right)^T\left(\theta_{t+1} - \theta_t\right)$$
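Substituting the update $\theta_{t+1} - \theta_t = \alpha \beta_t \nabla_\theta \pi(a_t \mid s_t, \theta_t)$ into this first-order approximation makes the effect of $\beta_t$ explicit:
$$\pi\left(a_t \mid s_t, \theta_{t+1}\right) \approx \pi\left(a_t \mid s_t, \theta_t\right) + \alpha \beta_t \left\|\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)\right\|^2 ,$$
so the probability of $a_t$ increases when $\beta_t > 0$ and decreases when $\beta_t < 0$.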
The coefficient $\beta_t$ can therefore balance exploitation and exploration: a large action-value estimate $q_t(s_t, a_t)$ (exploitation) and a small current probability $\pi(a_t \mid s_t, \theta_t)$ (exploration) both make $\beta_t$ larger, which in turn increases the probability of choosing $a_t$ again.
✌ REINFORCE pseudocode
- For the $k$-th iteration:
  - Select an initial state and, following the current policy $\pi(\theta_k)$, interact with the environment to generate an episode $\left\{s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T\right\}$.
  - For each step $t = 0, 1, \ldots, T-1$ of the episode:
    - value update: $q_t\left(s_t, a_t\right) = \sum_{k=t+1}^T \gamma^{k-t-1} r_k$
    - policy update: $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_t\left(s_t, a_t\right)$
  - Set $\theta_k = \theta_T$.
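A minimal, self-contained sketch of this pseudocode, assuming a tiny made-up episodic environment (`toy_env_step`, the one-hot feature map, and all hyperparameters are assumptions chosen for illustration, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma, alpha = 0.9, 0.05

def phi(s):
    """One-hot state features (a hypothetical choice of feature map)."""
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pi(s, theta):
    """pi(. | s, theta) for a linear-softmax policy."""
    return softmax(theta @ phi(s))

def toy_env_step(s, a):
    """A made-up chain environment: action 1 moves right, action 0 moves left;
    reaching the rightmost state ends the episode with reward +1."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

theta = np.zeros((n_actions, n_states))
for k in range(500):                                       # k-th iteration
    # generate an episode {s_0, a_0, r_1, ..., s_{T-1}, a_{T-1}, r_T} under pi(theta_k)
    s, episode, done = 0, [], False
    while not done:
        a = rng.choice(n_actions, p=pi(s, theta))
        s_next, r, done = toy_env_step(s, a)
        episode.append((s, a, r))
        s = s_next

    # value update: Monte Carlo returns q_t(s_t, a_t) = sum_{k=t+1}^{T} gamma^{k-t-1} r_k
    returns, g = [], 0.0
    for (_, _, r) in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    # policy update for every step of the episode
    for (s_t, a_t, _), q_t in zip(episode, returns):
        probs = pi(s_t, theta)
        ind = np.zeros(n_actions)
        ind[a_t] = 1.0
        grad_log_pi = np.outer(ind - probs, phi(s_t))      # grad_theta ln pi(a_t | s_t, theta)
        theta += alpha * q_t * grad_log_pi                 # theta <- theta + alpha * grad * q_t

print(pi(0, theta))   # after training, the right-moving action should dominate
```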