Mirror Descent

Translated from "Bregman Divergence and Mirror Descent".

The convergence rate of subgradient descent usually depends on the dimension of the problem. Suppose we minimize a function $f$ over a set $C$; then subgradient descent iterates
$$
\begin{aligned}
x_{k+\frac{1}{2}} &= x_k - \alpha_k g_k, \quad g_k \in \partial f(x_k), \\
x_{k+1} &= \argmin_{x \in C} \frac{1}{2} \| x - x_{k+\frac{1}{2}} \|^2 = \argmin_{x \in C} \frac{1}{2} \| x - \left( x_k - \alpha_k g_k \right) \|^2.
\end{aligned} \tag{20}
$$
This can be interpreted as follows. Approximate $f$ by its first-order Taylor expansion around $x_k$:
$$
f(x) \approx f(x_k) + \left< g_k, x - x_k \right>. \tag{21}
$$
and penalize the remainder with the proximity term $\frac{1}{2 \alpha_k} \| x - x_k \|^2$. The update rule is therefore to minimize:
$$
x_{k+1} = \argmin_{x \in C} \left\{ f(x_k) + \left< g_k, x - x_k \right> + \frac{1}{2 \alpha_k} \| x - x_k \|^2 \right\}. \tag{22}
$$
Equation (22) is equivalent to (20). To generalize the method beyond the Euclidean distance, we can simply use a Bregman divergence as the proximity measure:
$$
\begin{aligned}
x_{k + 1} &= \argmin_{x \in C} \left\{ f(x_k) + \left< g_k, x - x_k \right> + \frac{1}{\alpha_k} \text{Div}_{\psi}(x, x_k) \right\} \\
&= \argmin_{x \in C} \left\{ \alpha_k f(x_k) + \alpha_k \left< g_k, x - x_k \right> + \text{Div}_{\psi}(x, x_k) \right\} \\
&= \argmin_{x \in C} \left\{ \left< \alpha_k g_k, x \right> + \text{Div}_{\psi}(x, x_k) \right\}.
\end{aligned} \tag{23}
$$

Interpretation of the Mirror Step

Suppose the constraint set $C$ is the whole space (i.e., the problem is unconstrained). Then we can set the gradient with respect to $x$ to zero to find the optimality condition:
$$
\begin{aligned}
& \frac{\partial}{\partial x} \left( \left< g_k, x \right> + \frac{1}{\alpha_k} \text{Div}_{\psi}(x, x_k) \right) \Big|_{x = x_{k+1}} = g_k + \frac{1}{\alpha_k} \left( \nabla \psi(x_{k+1}) - \nabla \psi(x_{k}) \right) = 0 \\
\Leftrightarrow \; & \nabla \psi(x_{k+1}) = \nabla \psi(x_{k}) - \alpha_k g_k \\
\Leftrightarrow \; & x_{k+1} = \left( \nabla \psi \right)^{-1} \left( \nabla \psi(x_{k}) - \alpha_k g_k \right) = \left( \nabla \psi^{*} \right) \left( \nabla \psi(x_{k}) - \alpha_k g_k \right).
\end{aligned} \tag{24}
$$
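
In code, the unconstrained update (24) is just a composition of three maps. Below is a minimal Python sketch; the helpers `psi_grad` and `psi_grad_inv` (standing for the mirror map $\nabla \psi$ and its inverse $\nabla \psi^{*}$) are assumed interfaces for illustration, not part of the original text.

```python
import numpy as np

def mirror_step(x, g, alpha, psi_grad, psi_grad_inv):
    """One unconstrained mirror descent step, cf. (24): map x to the
    dual space via the mirror map, take a gradient step there, map back."""
    return psi_grad_inv(psi_grad(x) - alpha * g)

# Euclidean case: psi(x) = 0.5*||x||^2, so the mirror map is the identity
# and the step reduces to plain (sub)gradient descent.
x_next = mirror_step(np.array([0.3, 0.7]), np.array([1.0, -1.0]),
                     alpha=0.1,
                     psi_grad=lambda z: z, psi_grad_inv=lambda z: z)
```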
For the KL divergence, $\nabla_{x_k(i)} \psi(x_{k}) = \log x_k(i) + 1$, and the update rule becomes:
$$
x_{k+1}(i) = x_{k}(i) \exp \left( - \alpha_k g_k(i) \right). \tag{25}
$$
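
On the simplex, solving the constrained problem (23) with the KL divergence gives the same multiplicative form followed by a normalization. A minimal sketch (the example loss vector is an assumption):

```python
import numpy as np

def kl_mirror_step(x, g, alpha):
    """Entropic mirror descent: the multiplicative update (25), then the
    normalization induced by the simplex constraint."""
    y = x * np.exp(-alpha * g)
    return y / y.sum()

x = np.full(4, 0.25)                               # a point on the simplex
x = kl_mirror_step(x, np.array([0.5, -0.2, 0.1, 0.0]), alpha=0.1)
```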

Convergence Rate

Recall the four steps in the analysis of unconstrained subgradient descent:

1. Bound the effect of a single update:
$$
\begin{aligned}
\| x_{k+1} - x^{*} \|_2^2 =& \| x_{k} - \alpha_k g_k - x^{*} \|_2^2 \\
=& \| x_{k} - x^{*} \|_2^2 - 2 \alpha_k \left< g_k, x_k - x^{*} \right> + \alpha_k^2 \| g_k \|_2^2 \\
\leq & \| x_{k} - x^{*} \|_2^2 - 2 \alpha_k \left( f(x_k) - f(x^{*}) \right) + \alpha_k^2 \| g_k \|_2^2.
\end{aligned} \tag{26}
$$
The last step uses convexity: $f(x^{*}) \geq f(x_k) + \left< g_k, x^{*} - x_k \right>$.

2. Telescope by summing over iterations:
$$
\| x_{T+1} - x^{*} \|_2^2 \leq \| x_{1} - x^{*} \|_2^2 - 2 \sum_{k=1}^{T} \alpha_k \left( f(x_k) - f(x^{*}) \right) + \sum_{k=1}^{T} \alpha_k^2 \| g_k \|_2^2. \tag{27}
$$

3. Using $\| x_{1} - x^{*} \|_2^2 \leq R^2$ and $\| g_k \|_2^2 \leq G^2$:
$$
2 \sum_{k=1}^{T} \alpha_k \left( f(x_k) - f(x^{*}) \right) \leq R^2 + G^2 \sum_{k=1}^{T} \alpha_k^2. \tag{28}
$$

4. Let $\epsilon_k = f(x_k) - f(x^{*})$; then:
$$
\min_{k \in \{ 1, \cdots, T \}} \epsilon_k \leq \frac{R^2 + G^2 \sum_{k=1}^{T} \alpha_k^2}{2 \sum_{k=1}^{T} \alpha_k}. \tag{29}
$$
Choosing the step size $\alpha_k = \frac{R}{G\sqrt{T}}$, the right-hand side becomes:
$$
\frac{R^2 + G^2 \sum_{k=1}^{T} \alpha_k^2}{2 \sum_{k=1}^{T} \alpha_k} = \frac{R^2 + G^2 \sum_{k=1}^{T} \frac{R^2}{G^2 T}}{2 \sum_{k=1}^{T} \frac{R}{G\sqrt{T}}} = \frac{2 R^2}{2 R \sqrt{T} / G} = \frac{RG}{\sqrt{T}}, \tag{30}
$$
that is:
$$
\min_{k \in \{ 1, \cdots, T \}} \epsilon_k \leq \frac{RG}{\sqrt{T}}. \tag{31}
$$
Suppose $C$ is the simplex; then $R \leq \sqrt{2}$. If every coordinate of each gradient is bounded by $M$, then $G$ can be as large as $M\sqrt{n}$, i.e., the bound depends on the dimension.
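
The bound (31) is easy to check numerically. A minimal sketch on an assumed toy problem, $f(x) = \| x \|_1$ with minimizer $x^{*} = 0$, so that $R = \| x_1 \|_2$ and every subgradient satisfies $\| g \|_2 \leq \sqrt{n} = G$:

```python
import numpy as np

n, T = 5, 10_000
x = np.ones(n)                        # x_1, so R = sqrt(n)
R, G = np.linalg.norm(x), np.sqrt(n)
alpha = R / (G * np.sqrt(T))          # constant step size, as in (30)

best = np.inf
for _ in range(T):
    g = np.sign(x)                    # a subgradient of ||x||_1
    best = min(best, np.sum(np.abs(x)))
    x = x - alpha * g

print(best, "<=", R * G / np.sqrt(T))  # bound (31) holds
```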

Steps 2 through 4 carry over if we replace $\| x_{k+1} - x^{*} \|_2^2$ with $\text{Div}_{\psi}(x^{*},x_{k+1})$. Step 1, however, requires Lemma 1.

Suppose $\psi$ is $\sigma$-strongly convex. Treating $\alpha_k f(x_k) + \alpha_k \left< g_k, x - x_k \right>$ in (23) as the $L$ in Lemma 1, we get:
$$
\begin{aligned}
& \alpha_k f(x_k) + \alpha_k \left< g_k, x^{*} - x_k \right> + \text{Div}_{\psi}(x^{*}, x_k) \\
\geq & \alpha_k f(x_k) + \alpha_k \left< g_k, x_{k+1} - x_k \right> + \text{Div}_{\psi}(x_{k+1}, x_{k}) + \text{Div}_{\psi}(x^{*}, x_{k+1}).
\end{aligned} \tag{32}
$$
Rearranging terms:
$$
\begin{aligned}
\text{Div}_{\psi}(x^{*}, x_{k+1}) \leq & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k+1} \right> - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\
= & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k} \right> + \alpha_k \left< g_k, x_{k} - x_{k+1} \right> - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\
\overset{\text{(4)}}{\leq} & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*})\right) + \alpha_k \left< g_k, x_{k} - x_{k+1} \right> - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2 \\
\leq & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*})\right) + \alpha_k \| g_k \|_{*} \| x_{k} - x_{k+1} \| - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2 \\
\leq & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*}) \right) + \frac{ \alpha_k^2 }{2 \sigma} \| g_{k} \|_{*}^2.
\end{aligned} \tag{33}
$$
Comparing with (26), $\| x_{k} - x^{*} \|_2^2$ has been replaced by $\text{Div}_{\psi}(x^{*}, x_{k})$. As before, assume $\text{Div}_{\psi}(x^{*}, x_{1})$ is bounded by $R^2$ and $\| g_k \|_{*}$ is bounded by $G$, where $\| \cdot \|_{*}$ is the dual norm.

To see the advantage of mirror descent, suppose $C$ is the $n$-dimensional simplex and we use the KL divergence, whose $\psi$ is 1-strongly convex with respect to the $\ell_1$ norm. The dual norm of the $\ell_1$ norm is the $\ell_\infty$ norm. With the KL divergence, $\text{Div}_{\psi}(x^{*}, x_{1})$ can be bounded by $\log n$, and $G$ is bounded by $M$. So in terms of the value of $RG$, mirror descent beats subgradient descent by a factor of $O(\sqrt{\frac{n}{\log n}})$.
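
A back-of-the-envelope comparison of the two $RG$ constants, under the assumptions above with $M = 1$ (the choice of $n$ is illustrative):

```python
import numpy as np

# Subgradient descent on the n-simplex: RG <= sqrt(2) * M * sqrt(n).
# Mirror descent with the KL divergence:  RG <= sqrt(log n) * M.
n, M = 10**6, 1.0
rg_subgrad = np.sqrt(2) * M * np.sqrt(n)
rg_mirror = np.sqrt(np.log(n)) * M
print(rg_subgrad / rg_mirror)   # ~ O(sqrt(n / log n)), about 380 here
```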

Speedup 1: f is strongly convex. We say $f$ is strongly convex with respect to another function $\psi$ with modulus $\lambda$ if:
$$
f(x) \geq f(y) + \left< g, x - y \right> + \lambda \text{Div}_{\psi}(x, y), \quad g \in \partial f(y). \tag{34}
$$

Note that $f$ is not required to be differentiable. With strong convexity, (33) can be sharpened:
$$
\begin{aligned}
\text{Div}_{\psi}(x^{*}, x_{k+1}) \leq & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k+1} \right> - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\
= & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k} \right> + \alpha_k \left< g_k, x_{k} - x_{k+1} \right> - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\
\overset{\text{(4), strong convexity}}{\leq} & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*}) + \lambda \text{Div}_{\psi}(x^{*}, x_{k}) \right) + \alpha_k \left< g_k, x_{k} - x_{k+1} \right> - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2 \\
\leq & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*}) + \lambda \text{Div}_{\psi}(x^{*}, x_{k}) \right) + \alpha_k \| g_k \|_{*} \| x_{k} - x_{k+1} \| - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2 \\
\leq & (1 - \lambda \alpha_k ) \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*}) \right) + \frac{ \alpha_k^2 }{2 \sigma} \| g_{k} \|_{*}^2.
\end{aligned} \tag{35}
$$

δ k = Div ψ ( x ∗ , x k ) \delta_{k} = \text{Div}_{\psi}(x^{*}, x_k) δk=Divψ(x,xk),令 α k = 1 λ k \alpha_k = \frac{1}{\lambda k} αk=λk1,那么上式有:
$$
\begin{aligned}
& \delta_{k+1} \leq \frac{k-1}{k} \delta_{k} - \frac{1}{\lambda k} \epsilon_{k} + \frac{G^2}{2 \sigma \lambda^2 k^2} \\
\Rightarrow \; & k \delta_{k+1} \leq (k-1) \delta_{k} - \frac{1}{\lambda} \epsilon_{k} + \frac{G^2}{2 \sigma \lambda^2 k}.
\end{aligned} \tag{36}
$$
Telescoping:
$$
\begin{aligned}
& T \delta_{T+1} \leq - \frac{1}{\lambda} \sum_{k=1}^{T} \epsilon_{k} + \frac{G^2}{2 \sigma \lambda^2} \sum_{k=1}^{T} \frac{1}{k} \\
\Rightarrow \; & \min_{k \in \{ 1,\cdots, T \}} \epsilon_{k} \leq \frac{G^2}{2 \sigma \lambda } \frac{1}{T} \sum_{k=1}^{T} \frac{1}{k} \leq \frac{G^2}{2 \sigma \lambda } \frac{O(\log T)}{T}.
\end{aligned} \tag{37}
$$
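
A minimal sketch of the $\alpha_k = \frac{1}{\lambda k}$ schedule used in (36), on an assumed toy objective $f(x) = |x| + \frac{\lambda}{2} x^2$, which is $\lambda$-strongly convex with respect to $\psi(x) = \frac{1}{2} x^2$ but not differentiable:

```python
import numpy as np

lmbda, T = 1.0, 1000
x, best = 5.0, np.inf
for k in range(1, T + 1):
    g = np.sign(x) + lmbda * x       # a subgradient of f at x
    best = min(best, abs(x) + 0.5 * lmbda * x**2)
    x -= g / (lmbda * k)             # alpha_k = 1/(lambda*k)
print(best)                          # the bound (37) predicts O(log T / T)
```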

Speedup 2: the gradient of f is Lipschitz continuous. If the gradient of $f$ is Lipschitz continuous, then there exists $L > 0$ such that:
$$
\| \nabla f(x) - \nabla f(y) \|_{*} \leq L \| x - y \|, \quad \forall x, y. \tag{38}
$$
We sometimes simply say $f$ is smooth. The above is equivalent to:
$$
f(x) \leq f(y) + \left< \nabla f(y), x - y \right> + \frac{L}{2} \| x - y \|^2. \tag{39}
$$
Now bound the term $\left< g_k, x^{*} - x_{k+1} \right>$ in (33), using convexity for the first inner product and smoothness (39) for the second:
$$
\begin{aligned}
\left< g_k, x^{*} - x_{k+1} \right> =& \left< g_k, x^{*} - x_{k} \right> + \left< g_k, x_{k} - x_{k+1} \right> \\
\leq & f(x^{*}) - f(x_k) + f(x_{k}) - f(x_{k+1}) + \frac{L}{2} \| x_{k} - x_{k+1} \|^2 \\
= & f(x^{*}) - f(x_{k+1}) + \frac{L}{2} \| x_{k} - x_{k+1} \|^2.
\end{aligned} \tag{40}
$$
Substituting into (33):
$$
\begin{aligned}
\text{Div}_{\psi}(x^{*}, x_{k+1}) \leq & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left( f(x^{*}) - f(x_{k+1}) + \frac{L}{2} \| x_{k} - x_{k+1} \|^2 \right) - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\
\overset{\text{(4)}}{\leq} & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left( f(x^{*}) - f(x_{k+1}) + \frac{L}{2} \| x_{k} - x_{k+1} \|^2 \right) - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2.
\end{aligned} \tag{41}
$$
α k = σ L \alpha_k = \frac{\sigma}{L} αk=Lσ,有:
$$
\text{Div}_{\psi}(x^{*}, x_{k+1}) \leq \text{Div}_{\psi}(x^{*}, x_k) - \frac{\sigma}{L} \left( f(x_{k+1}) - f(x^{*}) \right). \tag{42}
$$
Telescoping:
$$
\min_{k \in \{2,\cdots, T+1\}} f(x_k) - f(x^{*}) \leq \frac{L \, \text{Div}_{\psi}(x^{*}, x_1)}{\sigma T} \leq \frac{L R^2}{\sigma T}. \tag{43}
$$
The convergence rate is now $O(\frac{1}{T})$; with Nesterov-style techniques it can be improved to $O(\frac{1}{T^2})$. This can be proved using Lemma 1, and the resulting method is called the accelerated proximal gradient method (APGM).
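
For reference, here is a minimal sketch of Nesterov-style momentum in the Euclidean case (a standard variant, not the APGM derivation itself); the quadratic objective and its Lipschitz constant are illustrative assumptions:

```python
import numpy as np

A = np.diag([1.0, 10.0])                 # f(x) = 0.5 * x^T A x
L_smooth = 10.0                          # Lipschitz constant of grad f
grad = lambda x: A @ x

x = y = np.array([5.0, 5.0])
t = 1.0
for _ in range(200):
    x_next = y - grad(y) / L_smooth              # gradient step at y
    t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2     # momentum schedule
    y = x_next + (t - 1) / t_next * (x_next - x) # extrapolation
    x, t = x_next, t_next
print(0.5 * x @ A @ x)                   # converges at rate O(1/T^2)
```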


1 Composite Objective Functions

Suppose the objective is $h(x) = f(x) + r(x)$, where $f$ is smooth and $r(x)$ is simple, e.g., $\| x \|_1$. Applying the methods above directly yields a rate of $O(\frac{1}{\sqrt{T}})$, since $h$ is not smooth. We would like to recover the smooth rate $O(\frac{1}{T})$; this is possible because $r(x)$ is simple, and we only need to extend (23) as follows:
$$
\begin{aligned}
x_{k + 1} &= \argmin_{x \in C} \left\{ f(x_k) + \left< g_k, x - x_k \right> + r(x) + \frac{1}{\alpha_k} \text{Div}_{\psi}(x, x_k) \right\} \\
&= \argmin_{x \in C} \left\{ \alpha_k f(x_k) + \alpha_k \left< g_k, x - x_k \right> + \alpha_k r(x) + \text{Div}_{\psi}(x, x_k) \right\} \\
&= \argmin_{x \in C} \left\{ \left< \alpha_k g_k, x \right> + \alpha_k r(x) + \text{Div}_{\psi}(x, x_k) \right\}.
\end{aligned} \tag{44}
$$
Here we only take the first-order approximation of $f$ around $x_k$ and leave $r(x)$ intact. Assuming this subproblem can be computed efficiently, one can show that all of the rates above carry over; a concrete instance is sketched below.
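
In the Euclidean case with $r(x) = \mathrm{lam} \cdot \| x \|_1$, the argmin in (44) has a closed-form soft-thresholding solution. A minimal sketch; the least-squares $f$ and the constants are illustrative assumptions:

```python
import numpy as np

def composite_step(x, g, alpha, lam):
    """argmin_z <alpha*g, z> + alpha*lam*||z||_1 + 0.5*||z - x||^2,
    i.e., update (44) with the Euclidean Bregman divergence."""
    z = x - alpha * g                      # gradient step on f
    return np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)

# Usage: one ISTA-style step for f(x) = 0.5*||Ax - b||^2.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 1.0])
x = np.zeros(2)
g = A.T @ (A @ x - b)
x = composite_step(x, g, alpha=0.01, lam=0.1)
```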

Here we consider the more general case where $f$ may be nonsmooth or strongly convex; if the gradient of $f$ is Lipschitz continuous, a rate of $O(\frac{1}{T^2})$ can still be achieved.

α k f ( x k ) + α k < g k , x − x k > + r ( x ) \alpha_k f(x_k) + \alpha_k \left< g_k, x - x_k \right> + r(x) αkf(xk)+αkgk,xxk+r(x)视作引理1中的 L L L,那么有:
$$
\begin{aligned}
& \alpha_k f(x_k) + \alpha_k \left< g_k, x^{*} - x_k \right> + \alpha_k r(x^{*}) + \text{Div}_{\psi}(x^{*}, x_k) \\
\geq & \alpha_k f(x_k) + \alpha_k \left< g_k, x_{k+1} - x_k \right> + \alpha_k r(x_{k+1}) + \text{Div}_{\psi}(x_{k+1}, x_k) + \text{Div}_{\psi}(x^{*}, x_{k+1}).
\end{aligned} \tag{45}
$$
Then, as in (33):
$$
\begin{aligned}
\text{Div}_{\psi}(x^{*}, x_{k+1}) \leq & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k+1} \right> + \alpha_k \left( r(x^{*}) - r(x_{k+1}) \right) - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\
\leq & \cdots \\
\leq & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) + r(x_{k+1}) - f(x^{*}) - r(x^{*}) \right) + \frac{ \alpha_k^2 }{2 \sigma} \| g_{k} \|_{*}^2.
\end{aligned} \tag{46}
$$
δ k = Div ψ ( x ∗ , x k ) \delta_k = \text{Div}_{\psi}(x^{*}, x_k) δk=Divψ(x,xk),那么:
$$
f(x_{k}) + r(x_{k+1}) - f(x^{*}) - r(x^{*}) \leq \frac{1}{\alpha_k} ( \delta_k - \delta_{k+1} ) + \frac{ \alpha_k }{2 \sigma} \| g_{k} \|_{*}^2. \tag{47}
$$
Summing over $k$:
$$
\begin{aligned}
& r(x_{T+1}) - r(x_1) + \sum_{k=1}^{T} \left( h(x_k) - h(x^{*}) \right) \\
\leq & \frac{\delta_1}{\alpha_1} + \sum_{k=2}^{T} \delta_k \left( \frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}} \right) - \frac{\delta_{T+1}}{\alpha_T} + \frac{ G^2 }{2 \sigma} \sum_{k=1}^{T} \alpha_k \\
\leq & R^2 \left( \frac{1}{\alpha_1} + \sum_{k=2}^{T} \left( \frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}} \right) \right) + \frac{ G^2 }{2 \sigma} \sum_{k=1}^{T} \alpha_k \\
= & \frac{R^2}{\alpha_T} + \frac{ G^2 }{2 \sigma} \sum_{k=1}^{T} \alpha_k.
\end{aligned} \tag{48}
$$
If we choose $x_1 = \argmin_{x} r(x)$, then $r(x_{T+1}) - r(x_1) \geq 0$. Setting $\alpha_k = \frac{R}{G} \sqrt{\frac{\sigma}{k}}$:
$$
\sum_{k=1}^{T} \left( h(x_k) - h(x^{*}) \right) \leq \frac{RG}{\sqrt{\sigma}} \left( \sqrt{T} + \frac{1}{2} \sum_{k=1}^{T} \frac{1}{\sqrt{k}} \right) = \frac{RG}{\sqrt{\sigma}} O(\sqrt{T}). \tag{49}
$$
Hence $\min_{k=1,\cdots,T} \{ h(x_k) - h(x^{*}) \}$ converges at rate $O(\frac{RG}{\sqrt{\sigma T}})$.


2 Online Learning

[Figure: Algorithm 1 — the online learning protocol]

The figure above shows the online learning protocol (Algorithm 1). The player's goal is to minimize the regret, i.e., the cumulative loss compared with the best fixed $x$ in hindsight for $\sum_{k}f_k (x)$:
$$
\text{Regret} = \sum_{k=1}^{T} f_k (x_k) - \min_{x} \sum_{k=1}^{T} f_k (x). \tag{50}
$$

Note that no assumption is made on how the adversary chooses $f_k$; the choice may be adversarial. After receiving $f_k$ at iteration $k$, we apply a mirror descent update on $f_k$ to obtain $x_{k+1}$:
$$
x_{k+1} = \argmin_{x \in C} \left\{ f_k(x_k) + \left< g_k, x - x_k \right> + \frac{1}{\alpha_k} \text{Div}(x, x_k) \right\}, \quad g_k \in \partial f_k (x_k). \tag{51}
$$
One can show the regret is bounded. By (33):
$$
f_k(x_k) - f_k(x^{*}) \leq \frac{1}{\alpha_k} \left( \text{Div}(x^{*}, x_k) - \text{Div}(x^{*}, x_{k+1}) \right) + \frac{\alpha_k}{2 \sigma} \| g_k \|_{*}^{2}. \tag{52}
$$
k = 1 k=1 k=1开始累加,同式(48)和式(49):
$$
\sum_{k=1}^{T} \left( f_k(x_k) - f_k(x^{*}) \right) \leq \frac{RG}{\sqrt{\sigma}} O(\sqrt{T}). \tag{53}
$$
So the regret grows at rate $O(\sqrt{T})$.
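
A minimal sketch of this setting: online mirror descent on the simplex with the KL divergence (update (25) plus normalization) against randomly drawn linear losses. The loss distribution, horizon, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 1000
x = np.full(n, 1.0 / n)             # x_1: uniform over the simplex
alpha = np.sqrt(np.log(n) / T)      # step size suggested by the bound
total, cum_loss = 0.0, np.zeros(n)
for k in range(T):
    c = rng.uniform(0.0, 1.0, n)    # adversary's loss vector; g_k = c
    total += c @ x
    cum_loss += c
    x *= np.exp(-alpha * c)         # multiplicative update (25) ...
    x /= x.sum()                    # ... renormalized onto the simplex
print(total - cum_loss.min())       # regret (50), bounded by O(sqrt(T))
```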

f is strongly convex: replacing $f$ with $f_k$ in (35) yields a regret bound of $O(\log T)$.

The gradient of f is Lipschitz continuous: the result (43) does not extend to the online setting. Replacing $f$ with $f_k$ leaves $f_k(x_{k+1}) - f_k (x^{*})$ on the right-hand side, and telescoping no longer yields a regret bound. Hence Lipschitz continuity of the gradient does not guarantee an $O(\log T)$ regret bound.

Composite objective: in the online setting, both the player and the adversary have access to $r(x)$, and the adversary changes $f_k(x)$ at each iteration. The per-iteration objective is $h_k(x_k) = f_k(x_k) + r(x_k)$. The update rule is:
$$
x_{k+1} = \argmin_{x \in C} \left\{ f_k(x_k) + \left< g_k, x - x_k \right> + r(x) + \frac{1}{\alpha_k} \text{Div}(x, x_k) \right\}, \quad g_k \in \partial f_k (x_k). \tag{54}
$$
Then (47) becomes:
$$
f_k(x_{k}) + r(x_{k+1}) - f_k(x^{*}) - r(x^{*}) \leq \frac{1}{\alpha_k} ( \delta_k - \delta_{k+1} ) + \frac{ \alpha_k }{2 \sigma} \| g_{k} \|_{*}^2. \tag{55}
$$
Although $r(x_{k+1})$ appears here rather than $r(x_k)$, this causes no trouble, since $r$ does not change across iterations.
Choosing $x_1 = \argmin_{x} r(x)$, as in (48) and (49):
$$
\sum_{k=1}^{T} \left( h_k(x_k) - h_k(x^{*}) \right) \leq \frac{RG}{\sqrt{\sigma}} O(\sqrt{T}). \tag{56}
$$
Hence the regret is $O(\sqrt{T})$.

f k f_k fk为强凸时,我们可以得到组合情况下的 O ( log ⁡ T ) O(\log T) O(logT)regret。但是,同上面一样, f f f梯度的Lipschitz连续无法保证regret的界为 O ( log ⁡ T ) O(\log T) O(logT)


3 Stochastic Optimization

Consider optimizing a function given in expectation form:
$$
\min_{x} F(x) := \mathbb{E}_{w \sim p} [ f(x; w) ], \tag{57}
$$
where $p$ is the distribution of $w$. Many machine learning models take this form. For example, the SVM objective is:
$$
F(x) = \frac{1}{m} \sum_{i=1}^{m} \max \left\{ 0, 1 - c_i \left< a_i, x \right> \right\} + \frac{\lambda}{2} \| x \|^2. \tag{58}
$$
It can be cast as (57) with $w$ uniformly distributed over $\{ 1, 2, \cdots, m \}$, i.e., $p(w=i) = \frac{1}{m}$, and
$$
f(x; i) = \max \left\{ 0, 1 - c_i \left< a_i, x \right> \right\} + \frac{\lambda}{2} \| x \|^2. \tag{59}
$$
m m m较大时,计算 F F F及其子梯度的成本会很高。所以,一个简单的想法是基于一个随机选择的数据点进行更新。它可以认为是算法1中在线学习的一种特殊情况,步骤4中的对手现在随机选取 f k f_k fk f ( x ; w k ) f(x;w_k) f(x;wk) w k w_k wk p p p无关。理想情况下,我们希望通过使用mirror descent更新, x k x_k xk将逐渐接近 F ( x ) F(x) F(x)的最小化值。直观上这是非常合理的,通过使用 f k f_k fk,我们可以计算出 F ( x k ) F(x_k) F(xk)的无偏估计值和 F ( x k ) F(x_k) F(xk)的次梯度(因为 w k w_k wk是从 p p p中进行独立同分布采样得到的)。这是随机优化的一种特殊情况,我们在算法2中进行了总结。

[Figure: Algorithm 2 — stochastic optimization via online mirror descent]
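
A minimal sketch of Algorithm 2 in the Euclidean case, i.e., stochastic subgradient descent on the SVM objective (58)-(59); the synthetic data and the step size $\alpha_k = \frac{1}{\lambda k}$ (a reasonable choice here since (59) is $\lambda$-strongly convex) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam, T = 200, 5, 0.1, 5000
A = rng.normal(size=(m, d))
c = np.sign(A @ rng.normal(size=d))  # labels in {-1, +1}

x = np.zeros(d)
for k in range(1, T + 1):
    i = rng.integers(m)              # w_k ~ uniform over {1, ..., m}
    margin = c[i] * (A[i] @ x)
    g = lam * x                      # subgradient of f(x; i) in (59)
    if margin < 1:
        g -= c[i] * A[i]
    x -= g / (lam * k)               # alpha_k = 1/(lam*k)
print(x)
```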

In fact, the method works in a more general setting. For simplicity, we just say the adversary holds $w_k$ at iteration $k$. The online learning algorithm $\mathcal{A}$ is then simply a deterministic map from an ordered set $\{ w_1, \cdots, w_k \}$ to $x_{k+1}$. Denote the initial model $x_1$ by $\mathcal{A}(\emptyset)$. The following theorem is the key to the online-to-batch conversion.

Theorem 1. Suppose the online learning algorithm $\mathcal{A}$ has regret bounded by $R_k$ after $k$ iterations of Algorithm 1. Suppose $w_1, \cdots, w_{T+1}$ are drawn i.i.d. from $p$. Define $\hat{x}=\mathcal{A}(w_{j+1}, \cdots, w_{T})$, where $j$ is drawn uniformly at random from $\{ 0, \cdots, T \}$. Then:
$$
\mathbb{E}[F(\hat{x})] - \min_{x} F(x) \leq \frac{R_{T+1}}{T + 1}, \tag{60}
$$
where the expectation is over the randomness of $w_1, \cdots, w_{T}$ and $j$.

Similarly, a bound holds with high probability $1 - \sigma$:
$$
F(\hat{x}) - \min_{x} F(x) \leq \frac{R_{T+1}}{T + 1} \log \frac{1}{\sigma}, \tag{61}
$$
where the probability is over the randomness of $w_1, \cdots, w_{T}$ and $j$.

Proof.
$$
\begin{aligned}
\mathbb{E} [F(\hat{x})] =& \mathbb{E}_{j,w_1, \cdots, w_{T+1}} [f(\hat{x}; w_{T+1})] \\
=& \mathbb{E}_{j,w_1, \cdots, w_{T+1}} [f(\mathcal{A}_{w_{j+1}, \cdots, w_{T}}; w_{T+1})] \\
=& \mathbb{E}_{w_1, \cdots, w_{T+1}} \left[ \frac{1}{T + 1} \sum_{j=0}^{T} f(\mathcal{A}_{w_{j+1}, \cdots, w_{T}}; w_{T+1}) \right] \quad (j \text{ is drawn uniformly at random}) \\
=& \frac{1}{T + 1} \mathbb{E}_{w_1, \cdots, w_{T+1}} \left[ \sum_{j=0}^{T} f(\mathcal{A}_{w_{1}, \cdots, w_{T-j}}; w_{T+1-j}) \right] \quad (\text{shift indices, by i.i.d. symmetry}) \\
=& \frac{1}{T + 1} \mathbb{E}_{w_1, \cdots, w_{T+1}} \left[ \sum_{s=1}^{T+1} f(\mathcal{A}_{w_{1}, \cdots, w_{s-1}}; w_{s}) \right] \quad (\text{change of variable } s = T+1-j ) \\
\leq & \frac{1}{T + 1} \mathbb{E}_{w_1, \cdots, w_{T+1}} \left[ \min_{x} \sum_{s=1}^{T+1} f(x; w_{s}) + R_{T+1} \right] \quad (\text{apply the regret bound}) \\
\leq & \min_{x} \mathbb{E}_{w} [f(x;w)] + \frac{R_{T+1}}{T+1} \quad (\text{the expectation of a min is at most the min of expectations}) \\
= & \min_{x} F(x) + \frac{R_{T+1}}{T + 1}.
\end{aligned} \tag{62}
$$
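
A minimal sketch of the online-to-batch conversion in Theorem 1, under assumed ingredients: the learner $\mathcal{A}$ is gradient descent on $f(x; w) = \frac{1}{2}(x - w)^2$ and $p$ is the standard normal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
w = rng.normal(size=T)               # w_1, ..., w_T drawn i.i.d. from p

def learner(samples):
    """A(w_1, ..., w_k): a deterministic map from samples to a model
    (here: gradient steps on f(x; w) = 0.5*(x - w)^2)."""
    x = 0.0                          # the initial model x_1 = A(empty)
    for wk in samples:
        x -= (x - wk) / np.sqrt(T)   # gradient step: grad = x - wk
    return x

j = rng.integers(T + 1)              # j uniform on {0, ..., T}
x_hat = learner(w[j:])               # x_hat = A(w_{j+1}, ..., w_T)
print(x_hat)                         # E[F(x_hat)] - min F <= R_{T+1}/(T+1)
```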
