Transfer Reinforcement Learning Paper Notes (1): Successor Features

1. Background and problem formulation

Consider a Markov decision process (MDP)
$$M \equiv (\mathcal{S}, \mathcal{A}, p, R, \gamma)$$
where:

$\mathcal{S}$: the state space

$\mathcal{A}$: the action space

$p$: $p(\cdot \mid s_t, a_t)$, the transition probabilities

$R$: $R(s_t, a_t, s_{t+1})$, the reward

$\gamma$: the discount factor, $\gamma \in [0, 1)$

2. Successor features

Suppose the reward function can be written as
$$r(s, a, s') = \boldsymbol{\phi}(s, a, s')^{\top} \mathbf{w},$$
where $\boldsymbol{\phi}(s, a, s') \in \mathbb{R}^d$ is a feature vector and $\mathbf{w} \in \mathbb{R}^d$ are the corresponding weights. With this form, and defining $\boldsymbol{\phi}_{t+1} = \boldsymbol{\phi}(s_t, a_t, s_{t+1})$, we obtain
$$\begin{aligned} Q^{\pi}(s, a) &= \mathrm{E}^{\pi}\left[r_{t+1} + \gamma r_{t+2} + \ldots \mid S_t = s, A_t = a\right] \\ &= \mathrm{E}^{\pi}\left[\boldsymbol{\phi}_{t+1}^{\top} \mathbf{w} + \gamma \boldsymbol{\phi}_{t+2}^{\top} \mathbf{w} + \ldots \mid S_t = s, A_t = a\right] \\ &= \mathrm{E}^{\pi}\left[\sum_{i=t}^{\infty} \gamma^{i-t} \boldsymbol{\phi}_{i+1} \mid S_t = s, A_t = a\right]^{\top} \mathbf{w} = \boldsymbol{\psi}^{\pi}(s, a)^{\top} \mathbf{w}. \end{aligned}$$
$\boldsymbol{\psi}^{\pi}(s, a)$ are the successor features (SFs) of $(s, a)$ under policy $\pi$.

By definition, $\boldsymbol{\psi}^{\pi}(s, a) = \mathrm{E}^{\pi}[\boldsymbol{\phi}_{t+1} + \gamma \boldsymbol{\phi}_{t+2} + \gamma^2 \boldsymbol{\phi}_{t+3} + \cdots \mid S_t = s, A_t = a]$, which yields the following Bellman equation:
$$\begin{aligned} \boldsymbol{\psi}^{\pi}(s, a) &= \mathrm{E}^{\pi}[\boldsymbol{\phi}_{t+1} + \gamma \boldsymbol{\phi}_{t+2} + \gamma^2 \boldsymbol{\phi}_{t+3} + \cdots \mid S_t = s, A_t = a] \\ &= \mathrm{E}_{S_{t+1}, A_{t+1}}[\boldsymbol{\phi}_{t+1} + \gamma\, \boldsymbol{\psi}^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \\ &= \mathrm{E}_{S_{t+1}}[\boldsymbol{\phi}_{t+1} + \gamma\, \boldsymbol{\psi}^{\pi}(S_{t+1}, \pi(S_{t+1})) \mid S_t = s, A_t = a] \quad \text{(for a deterministic policy } \pi\text{)}. \end{aligned}$$
This recursion lets us solve for $\boldsymbol{\psi}^{\pi}(s, a)$ iteratively, exactly as in TD learning of value functions. Estimating $\mathbf{w}$, by contrast, is a supervised-learning problem that many standard regression methods can handle.

In this way, a different task only requires solving for a different $\mathbf{w}$; the SFs themselves can be reused, as the sketch below illustrates.
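To make the two learning problems concrete, here is a minimal tabular sketch (the toy setup and all names are mine, not the paper's): a TD(0)-style update for $\boldsymbol{\psi}^{\pi}$ under a fixed deterministic policy, and ordinary least squares for $\mathbf{w}$ from observed rewards.

```python
import numpy as np

def td_update_sf(psi, s, a, phi, s_next, policy, gamma=0.9, alpha=0.1):
    # One TD(0) step on the SF Bellman equation:
    # psi(s,a) <- psi(s,a) + alpha * [phi + gamma * psi(s', pi(s')) - psi(s,a)]
    target = phi + gamma * psi[s_next, policy[s_next]]
    psi[s, a] += alpha * (target - psi[s, a])

def fit_w(Phi, rewards):
    # Supervised step: regress observed rewards on features,
    # i.e. solve min_w ||Phi w - r||^2 by least squares.
    w, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)
    return w

# Toy usage: Q^pi(s,a) is then the inner product psi(s,a)^T w.
n_states, n_actions, d = 5, 3, 4
psi = np.zeros((n_states, n_actions, d))
policy = np.zeros(n_states, dtype=int)    # some fixed deterministic policy
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, d))           # features of observed transitions
rewards = Phi @ rng.normal(size=d)        # rewards consistent with r = phi^T w
w = fit_w(Phi, rewards)
Q = psi @ w                               # (n_states, n_actions) value table
```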

3. Generalized policy improvement

The paper also proves generalization-error bounds for this transfer scheme, starting with the following result.

Theorem 1 (Generalized Policy Improvement). Let $\pi_1, \pi_2, \ldots, \pi_n$ be $n$ decision policies and let $\tilde{Q}^{\pi_1}, \tilde{Q}^{\pi_2}, \ldots, \tilde{Q}^{\pi_n}$ be approximations of their respective action-value functions such that
$$\left|Q^{\pi_i}(s, a) - \tilde{Q}^{\pi_i}(s, a)\right| \leq \epsilon \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}, \text{ and } i \in \{1, 2, \ldots, n\}.$$

Define
$$\pi(s) \in \underset{a}{\operatorname{argmax}} \max_i \tilde{Q}^{\pi_i}(s, a).$$

Then,
$$Q^{\pi}(s, a) \geq \max_i Q^{\pi_i}(s, a) - \frac{2}{1-\gamma} \epsilon$$
for any $s \in \mathcal{S}$ and $a \in \mathcal{A}$, where $Q^{\pi}$ is the action-value function of $\pi$.
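In code, GPI is just an argmax over actions of a max over the policies' value estimates. A minimal sketch, assuming the estimates $\tilde{Q}^{\pi_i}$ are stored in one array (layout and names are my own):

```python
import numpy as np

def gpi_action(q_tilde, s):
    # q_tilde: (n_policies, n_states, n_actions) array of estimates Q~^{pi_i}.
    # Returns pi(s) = argmax_a max_i q_tilde[i, s, a].
    return int(np.argmax(q_tilde[:, s, :].max(axis=0)))
```

With SFs, the estimates for a new task with weights $\mathbf{w}$ come almost for free, e.g. `q_tilde = np.einsum('isad,d->isa', psis, w)`, where `psis` stacks the $\boldsymbol{\psi}^{\pi_i}$ tables.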

Proof. To simplify notation, define
$$Q_{\max}(s, a) = \max_i Q^{\pi_i}(s, a) \quad \text{(the best true action value among the } \pi_i\text{)},$$
$$\tilde{Q}_{\max}(s, a) = \max_i \tilde{Q}^{\pi_i}(s, a) \quad \text{(its estimated counterpart)}.$$
With this notation we have
$$\left|Q_{\max}(s, a) - \tilde{Q}_{\max}(s, a)\right| = \left|\max_i Q^{\pi_i}(s, a) - \max_i \tilde{Q}^{\pi_i}(s, a)\right| \leq \max_i\left|Q^{\pi_i}(s, a) - \tilde{Q}^{\pi_i}(s, a)\right| \leq \epsilon.$$
Hence
$$Q_{\max}(s, a) - \epsilon \leq \tilde{Q}_{\max}(s, a).$$
We will use the Bellman operator $T^{\pi}$, defined by
$$T^{\pi} f(s, a) = r(s, a) + \gamma\, \mathrm{E}_{s' \sim p(s' \mid s, a)}[V(s')], \qquad V(s') = \mathrm{E}_{a \sim \pi(a \mid s')}[f(s', a)], \qquad r(s, a) = \mathrm{E}_{s' \sim p(s' \mid s, a)}[r(s, a, s')].$$
Since we use the deterministic policy $\pi$ (at each state it picks the action that maximizes the estimated value over all policies), $V(s') = f(s', \pi(s'))$.
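For a finite MDP, one application of $T^{\pi}$ can be coded directly from this definition. A small sketch, assuming tabular arrays for the expected rewards, transition probabilities, and a deterministic policy (names and layout are my own):

```python
import numpy as np

def bellman_operator(f, r, p, pi, gamma=0.9):
    # Apply T^pi once to a tabular f of shape (S, A):
    # (T^pi f)(s,a) = r(s,a) + gamma * sum_{s'} p(s'|s,a) * f(s', pi(s')).
    # r: (S, A) expected rewards; p: (S, A, S) transitions; pi: (S,) deterministic policy.
    v = f[np.arange(len(pi)), pi]   # V(s') = f(s', pi(s')) for every s'
    return r + gamma * p @ v        # (S, A, S) @ (S,) -> (S, A)
```

Iterating this map converges to $Q^{\pi}$ at rate $\gamma$ per application, which is exactly the contraction property the proof exploits in the limit $k \to \infty$.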

For any $(s, a) \in \mathcal{S} \times \mathcal{A}$ and any policy $\pi_i$, the following holds:
$$\begin{aligned} T^{\pi} \tilde{Q}_{\max}(s, a) &= r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, \tilde{Q}_{\max}(s', \pi(s')) \\ &= r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_b \tilde{Q}_{\max}(s', b) \\ &\geq r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_b Q_{\max}(s', b) - \gamma \epsilon \\ &\geq r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, Q_{\max}(s', \pi_i(s')) - \gamma \epsilon \\ &\geq r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, Q^{\pi_i}(s', \pi_i(s')) - \gamma \epsilon \\ &= T^{\pi_i} Q^{\pi_i}(s, a) - \gamma \epsilon \\ &= Q^{\pi_i}(s, a) - \gamma \epsilon. \end{aligned}$$
Since $T^{\pi} \tilde{Q}_{\max}(s, a) \geq Q^{\pi_i}(s, a) - \gamma \epsilon$ holds for every policy $\pi_i$, we can take the maximum over $i$:
$$\begin{aligned} T^{\pi} \tilde{Q}_{\max}(s, a) &\geq \max_i Q^{\pi_i}(s, a) - \gamma \epsilon = Q_{\max}(s, a) - \gamma \epsilon \\ &\geq \tilde{Q}_{\max}(s, a) - \epsilon - \gamma \epsilon = \tilde{Q}_{\max}(s, a) - (1+\gamma)\epsilon. \end{aligned}$$
To reach the final conclusion, we also need the fact that $T^{\pi}$ shifts constants by a factor of $\gamma$: for any constant $c$,
$$\begin{aligned} T^{\pi}(f(s, a) + c) &= r(s, a) + \gamma\, \mathrm{E}_{s' \sim p(s' \mid s, a)}[f(s', \pi(s')) + c] \\ &= r(s, a) + \gamma\, \mathrm{E}_{s' \sim p(s' \mid s, a)}[f(s', \pi(s'))] + \gamma c \\ &= T^{\pi} f(s, a) + \gamma c. \end{aligned}$$
Applying this repeatedly, we get
$$\begin{aligned} T^{\pi} \tilde{Q}_{\max}(s, a) &\geq \tilde{Q}_{\max}(s, a) - (1+\gamma)\epsilon \\ (T^{\pi})^2\, \tilde{Q}_{\max}(s, a) &\geq T^{\pi} \tilde{Q}_{\max}(s, a) - \gamma(1+\gamma)\epsilon \\ &\;\;\vdots \\ (T^{\pi})^k\, \tilde{Q}_{\max}(s, a) &\geq (T^{\pi})^{k-1}\, \tilde{Q}_{\max}(s, a) - \gamma^{k-1}(1+\gamma)\epsilon. \end{aligned}$$
Chaining these inequalities (the $\epsilon$ terms form a geometric series, $\sum_{m \geq 0} \gamma^m = \frac{1}{1-\gamma}$) and letting $k \to \infty$, where $(T^{\pi})^k \tilde{Q}_{\max} \to Q^{\pi}$ by the contraction property, we obtain
$$\begin{aligned} Q^{\pi}(s, a) &= \lim_{k \rightarrow \infty}(T^{\pi})^k\, \tilde{Q}_{\max}(s, a) \\ &\geq \tilde{Q}_{\max}(s, a) - \frac{1+\gamma}{1-\gamma}\epsilon \\ &\geq Q_{\max}(s, a) - \epsilon - \frac{1+\gamma}{1-\gamma}\epsilon \\ &= \max_i Q^{\pi_i}(s, a) - \frac{2}{1-\gamma}\epsilon. \end{aligned}$$
∎

To prove the final transfer error bound, we also need the following lemma.

Lemma 1. Let $\delta_{ij} = \max_{s, a}\left|r_i(s, a) - r_j(s, a)\right|$. Then,
$$Q_i^{\pi_i^*}(s, a) - Q_i^{\pi_j^*}(s, a) \leq \frac{2 \delta_{ij}}{1-\gamma}.$$

Proof. To simplify notation, write $Q_i^j(s, a) \equiv Q_i^{\pi_j^*}(s, a)$. Then
$$\begin{aligned} Q_i^i(s, a) - Q_i^j(s, a) &= Q_i^i(s, a) - Q_j^j(s, a) + Q_j^j(s, a) - Q_i^j(s, a) \\ &\leq \left|Q_i^i(s, a) - Q_j^j(s, a)\right| + \left|Q_j^j(s, a) - Q_i^j(s, a)\right|. \end{aligned}$$
Define $\Delta_{ij} = \max_{s, a}\left|Q_i^i(s, a) - Q_j^j(s, a)\right|$. For the first term,
$$\begin{aligned} \left|Q_i^i(s, a) - Q_j^j(s, a)\right| &= \Big|r_i(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_b Q_i^i(s', b) - r_j(s, a) - \gamma \sum_{s'} p(s' \mid s, a) \max_b Q_j^j(s', b)\Big| \\ &= \Big|r_i(s, a) - r_j(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\big(\max_b Q_i^i(s', b) - \max_b Q_j^j(s', b)\big)\Big| \\ &\leq \left|r_i(s, a) - r_j(s, a)\right| + \gamma \sum_{s'} p(s' \mid s, a)\Big|\max_b Q_i^i(s', b) - \max_b Q_j^j(s', b)\Big| \\ &\leq \left|r_i(s, a) - r_j(s, a)\right| + \gamma \sum_{s'} p(s' \mid s, a) \max_b\left|Q_i^i(s', b) - Q_j^j(s', b)\right| \\ &\leq \delta_{ij} + \gamma \Delta_{ij}. \end{aligned}$$

Taking the maximum over $(s, a)$ on the left-hand side gives
$$\Delta_{ij} \leq \frac{1}{1-\gamma} \delta_{ij}.$$

Similarly, define $\Delta'_{ij} = \max_{s, a}\left|Q_j^j(s, a) - Q_i^j(s, a)\right|$. For the second term,
$$\begin{aligned} \left|Q_j^j(s, a) - Q_i^j(s, a)\right| &= \Big|r_j(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, Q_j^j(s', \pi_j^*(s')) - r_i(s, a) - \gamma \sum_{s'} p(s' \mid s, a)\, Q_i^j(s', \pi_j^*(s'))\Big| \\ &= \Big|r_j(s, a) - r_i(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\big(Q_j^j(s', \pi_j^*(s')) - Q_i^j(s', \pi_j^*(s'))\big)\Big| \\ &\leq \left|r_i(s, a) - r_j(s, a)\right| + \gamma \sum_{s'} p(s' \mid s, a)\left|Q_j^j(s', \pi_j^*(s')) - Q_i^j(s', \pi_j^*(s'))\right| \\ &\leq \delta_{ij} + \gamma \Delta'_{ij}. \end{aligned}$$

which likewise yields
$$\Delta'_{ij} \leq \frac{1}{1-\gamma} \delta_{ij}.$$

Combining the two bounds with the first inequality gives $Q_i^i(s, a) - Q_i^j(s, a) \leq \Delta_{ij} + \Delta'_{ij} \leq \frac{2\delta_{ij}}{1-\gamma}$. ∎

Theorem 2. Let $M_i \in \mathcal{M}^{\phi}$ and let $Q_i^{\pi_j^*}$ be the value function of an optimal policy of $M_j \in \mathcal{M}^{\phi}$ when executed in $M_i$. Given the set $\{\tilde{Q}_i^{\pi_1^*}, \tilde{Q}_i^{\pi_2^*}, \ldots, \tilde{Q}_i^{\pi_n^*}\}$ such that
$$\left|Q_i^{\pi_j^*}(s, a) - \tilde{Q}_i^{\pi_j^*}(s, a)\right| \leq \epsilon \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}, \text{ and } j \in \{1, 2, \ldots, n\},$$
let
$$\pi(s) \in \underset{a}{\operatorname{argmax}} \max_j \tilde{Q}_i^{\pi_j^*}(s, a).$$

Finally, let $\phi_{\max} = \max_{s, a}\|\boldsymbol{\phi}(s, a)\|$, where $\|\cdot\|$ is the norm induced by the inner product adopted. Then,
$$Q_i^*(s, a) - Q_i^{\pi}(s, a) \leq \frac{2}{1-\gamma}\left(\phi_{\max} \min_j\left\|\mathbf{w}_i - \mathbf{w}_j\right\| + \epsilon\right).$$

Proof. Fix any $j$. By Theorem 1 (GPI) and Lemma 1,
$$\begin{aligned} Q_i^*(s, a) - Q_i^{\pi}(s, a) &\leq Q_i^*(s, a) - Q_i^{\pi_j^*}(s, a) + \frac{2}{1-\gamma}\epsilon \\ &\leq \frac{2}{1-\gamma} \max_{s, a}\left|r_i(s, a) - r_j(s, a)\right| + \frac{2}{1-\gamma}\epsilon \\ &= \frac{2}{1-\gamma} \max_{s, a}\left|\boldsymbol{\phi}(s, a)^{\top} \mathbf{w}_i - \boldsymbol{\phi}(s, a)^{\top} \mathbf{w}_j\right| + \frac{2}{1-\gamma}\epsilon \\ &= \frac{2}{1-\gamma} \max_{s, a}\left|\boldsymbol{\phi}(s, a)^{\top}(\mathbf{w}_i - \mathbf{w}_j)\right| + \frac{2}{1-\gamma}\epsilon \\ &\leq \frac{2}{1-\gamma} \max_{s, a}\|\boldsymbol{\phi}(s, a)\|\left\|\mathbf{w}_i - \mathbf{w}_j\right\| + \frac{2}{1-\gamma}\epsilon \qquad \text{(Cauchy–Schwarz)} \\ &= \frac{2 \phi_{\max}}{1-\gamma}\left\|\mathbf{w}_i - \mathbf{w}_j\right\| + \frac{2}{1-\gamma}\epsilon. \end{aligned}$$
Since $j$ is arbitrary, minimizing over $j$ yields the stated bound. ∎
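For intuition, a quick worked instance of the bound with toy numbers of my own choosing: take $\gamma = 0.9$, $\phi_{\max} = 1$, $\epsilon = 0.05$, and suppose the closest source task satisfies $\min_j\|\mathbf{w}_i - \mathbf{w}_j\| = 0.2$. Then
$$Q_i^*(s, a) - Q_i^{\pi}(s, a) \leq \frac{2}{1 - 0.9}\left(1 \times 0.2 + 0.05\right) = 5,$$
so the guarantee tightens as the library of source tasks covers the new $\mathbf{w}_i$ more densely and as the value approximations improve.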
