Mirror Descent
The convergence rate of subgradient descent typically depends on the problem dimension. Suppose we want to minimize a function $f$ over a set $C$; projected subgradient descent iterates
$$\begin{aligned} x_{k+\frac{1}{2}} &= x_k - \alpha_k g_k, \quad g_k \in \partial f(x_k) \\ x_{k+1} &= \argmin_{x \in C} \frac{1}{2} \| x - x_{k+\frac{1}{2}} \|^2 = \argmin_{x \in C} \frac{1}{2} \| x - \left( x_k - \alpha_k g_k \right) \|^2. \end{aligned} \tag{20}$$
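To make the update concrete, here is a minimal sketch of one step of (20) in Python (not from the original text; the projector `project` is an assumed problem-specific routine):

```python
import numpy as np

def projected_subgradient_step(x_k, g_k, alpha_k, project):
    """One step of (20): take a subgradient step, then project back onto C.

    `project` is a hypothetical Euclidean projector onto C; for a box
    constraint C = [0, 1]^n it is coordinate-wise clipping.
    """
    x_half = x_k - alpha_k * g_k   # x_{k+1/2}
    return project(x_half)         # x_{k+1}: closest point of C to x_{k+1/2}

# Example with C = [0, 1]^2:
x_next = projected_subgradient_step(
    np.array([0.3, 1.4]), np.array([1.0, -0.5]), 0.1,
    project=lambda z: np.clip(z, 0.0, 1.0))
```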
This update can be interpreted as follows. Approximate $f$ by its first-order Taylor expansion around $x_k$:
$$f(x) \approx f(x_k) + \left< g_k, x - x_k \right>. \tag{21}$$
and penalize the remainder with the proximal term $\frac{1}{2 \alpha_k} \| x - x_k \|^2$. The update rule is then to minimize:
$$x_{k+1} = \argmin_{x \in C} \left\{ f(x_k) + \left< g_k, x - x_k \right> + \frac{1}{2 \alpha_k} \| x - x_k \|^2 \right\}. \tag{22}$$
Equation (22) is equivalent to (20). To generalize the method beyond the Euclidean distance, we can measure the remainder with a Bregman divergence instead:
$$\begin{aligned} x_{k + 1} &= \argmin_{x \in C} \left\{ f(x_k) + \left< g_k, x - x_k \right> + \frac{1}{\alpha_k} \text{Div}_{\psi}(x, x_k) \right\} \\ &= \argmin_{x \in C} \left\{ \alpha_k f(x_k) + \alpha_k \left< g_k, x - x_k \right> + \text{Div}_{\psi}(x, x_k) \right\} \\ &= \argmin_{x \in C} \left\{ \left< \alpha_k g_k, x \right> + \text{Div}_{\psi}(x, x_k) \right\}. \end{aligned} \tag{23}$$
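As a sanity check (a standard special case, spelled out here for completeness): taking $\psi(x) = \frac{1}{2} \| x \|_2^2$ gives

$$\text{Div}_{\psi}(x, x_k) = \frac{1}{2} \| x \|_2^2 - \frac{1}{2} \| x_k \|_2^2 - \left< x_k, x - x_k \right> = \frac{1}{2} \| x - x_k \|_2^2,$$

so (23) reduces exactly to (22).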
Interpretation of mirror descent
Suppose the constraint set $C$ is the whole space (i.e., the problem is unconstrained). Then we can take the gradient with respect to $x$ and look for the optimality condition:
$$\begin{aligned} & \frac{\partial}{\partial x} \left( \left< g_k, x \right> + \frac{1}{\alpha_k} \text{Div}_{\psi}(x, x_k) \right) \Big|_{x = x_{k+1}} = g_k + \frac{1}{\alpha_k} \left( \nabla \psi(x_{k+1}) - \nabla \psi(x_{k}) \right) = 0 \\ \Leftrightarrow \ & \nabla \psi(x_{k+1}) = \nabla \psi(x_{k}) - \alpha_k g_k \\ \Leftrightarrow \ & x_{k+1} = \left( \nabla \psi \right)^{-1} \left( \nabla \psi(x_{k}) - \alpha_k g_k \right) = \left( \nabla \psi^{*} \right) \left( \nabla \psi(x_{k}) - \alpha_k g_k \right). \end{aligned} \tag{24}$$
If the divergence is the KL divergence, then $\nabla_{x_k(i)} \psi(x_{k}) = \log x_k(i) + 1$, and the update rule becomes:
$$x_{k+1}(i) = x_{k}(i) \exp \left( - \alpha_k g_k(i) \right). \tag{25}$$
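A minimal sketch of this update (not from the original text): (25) is the unconstrained form; when $C$ is the simplex, solving (23) with the KL divergence additionally renormalizes, giving the exponentiated-gradient update:

```python
import numpy as np

def entropic_md_step(x_k, g_k, alpha_k):
    """Mirror descent with the KL divergence (negative-entropy psi).

    Unconstrained form (25): x_{k+1}(i) = x_k(i) * exp(-alpha_k * g_k(i)).
    On the simplex, the argmin in (23) also renormalizes the weights
    (the exponentiated-gradient / multiplicative-weights update).
    """
    w = x_k * np.exp(-alpha_k * g_k)
    return w / w.sum()

x = np.full(4, 0.25)  # start at the uniform distribution
x = entropic_md_step(x, g_k=np.array([0.5, -0.2, 0.1, 0.0]), alpha_k=0.5)
```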
Convergence rate
Recall the four steps in the analysis of unconstrained subgradient descent:
1. Bound the effect of a single update:
$$\begin{aligned} \| x_{k+1} - x^{*} \|_2^2 =& \| x_{k} - \alpha_k g_k - x^{*} \|_2^2 \\ =& \| x_{k} - x^{*} \|_2^2 - 2 \alpha_k \left< g_k, x_k - x^{*} \right> + \alpha_k^2 \| g_k \|_2^2 \\ \leq & \| x_{k} - x^{*} \|_2^2 - 2 \alpha_k \left( f(x_k) - f(x^{*}) \right) + \alpha_k^2 \| g_k \|_2^2. \end{aligned} \tag{26}$$
The last step uses the convexity inequality $f(x^{*}) \geq f(x_k) + \left< g_k, x^{*} - x_k \right>$.
2. Telescope and sum:
$$\| x_{T+1} - x^{*} \|_2^2 \leq \| x_{1} - x^{*} \|_2^2 - 2 \sum_{k=1}^{T} \alpha_k \left( f(x_k) - f(x^{*}) \right) + \sum_{k=1}^{T} \alpha_k^2 \| g_k \|_2^2. \tag{27}$$
3. Use $\| x_{1} - x^{*} \|_2^2 \leq R^2$ and $\| g_k \|_2^2 \leq G^2$:
$$2 \sum_{k=1}^{T} \alpha_k \left( f(x_k) - f(x^{*}) \right) \leq R^2 + G^2 \sum_{k=1}^{T} \alpha_k^2. \tag{28}$$
4. Write $\epsilon_k = f(x_k) - f(x^{*})$; then:
$$\min_{k \in \{ 1, \cdots, T \}} \epsilon_k \leq \frac{R^2 + G^2 \sum_{k=1}^{T} \alpha_k^2}{2 \sum_{k=1}^{T} \alpha_k}. \tag{29}$$
Choosing the step size $\alpha_k = \frac{R}{G\sqrt{T}}$, the right-hand side becomes:
$$\frac{R^2 + G^2 \sum_{k=1}^{T} \alpha_k^2}{2 \sum_{k=1}^{T} \alpha_k} = \frac{R^2 + G^2 \sum_{k=1}^{T} \frac{R^2}{G^2 T}}{2 \sum_{k=1}^{T} \frac{R}{G\sqrt{T}}} = \frac{2 R^2}{2 R \sqrt{T} / G} = \frac{RG}{\sqrt{T}}, \tag{30}$$
that is:
$$\min_{k \in \{ 1, \cdots, T \}} \epsilon_k \leq \frac{RG}{\sqrt{T}}. \tag{31}$$
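The bound (31) is easy to check empirically. A minimal sketch (illustrative, not from the original text), running subgradient descent on $f(x) = \|x\|_1$ (so $x^{*} = 0$ and $f(x^{*}) = 0$) with the step size from (30):

```python
import numpy as np

# f(x) = ||x||_1 has subgradient sign(x), so ||g_k||_2 <= sqrt(n) = G.
rng = np.random.default_rng(0)
n, T = 50, 10_000
x = rng.uniform(-1, 1, size=n)           # x_1
R = np.linalg.norm(x)                    # ||x_1 - x*||_2
G = np.sqrt(n)
alpha = R / (G * np.sqrt(T))             # constant step size from (30)

best = np.inf
for _ in range(T):
    best = min(best, np.abs(x).sum())    # track min_k f(x_k)
    x -= alpha * np.sign(x)              # subgradient step

print(f"min_k f(x_k) = {best:.4f} <= R*G/sqrt(T) = {R * G / np.sqrt(T):.4f}")
```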
Suppose $C$ is the simplex; then $R \leq \sqrt{2}$. If every coordinate of each gradient $g_k$ is bounded by $M$, then $G$ can be as large as $M\sqrt{n}$, i.e., it depends on the dimension.
In steps 2 through 4 we can replace $\| x_{k+1} - x^{*} \|_2^2$ with $\text{Div}_{\psi}(x^{*},x_{k+1})$ directly. Step 1, however, requires Lemma 1.
Assume $\psi$ is $\sigma$-strongly convex. Treating $\alpha_k f(x_k) + \alpha_k \left< g_k, x - x_k \right>$ from (23) as the $L$ in Lemma 1 gives:
$$\begin{aligned} & \alpha_k f(x_k) + \alpha_k \left< g_k, x^{*} - x_k \right> + \text{Div}_{\psi}(x^{*}, x_k) \\ \geq & \alpha_k f(x_k) + \alpha_k \left< g_k, x_{k+1} - x_k \right> + \text{Div}_{\psi}(x_{k+1}, x_{k}) + \text{Div}_{\psi}(x^{*}, x_{k+1}). \end{aligned} \tag{32}$$
Rearranging gives:
$$\begin{aligned} \text{Div}_{\psi}(x^{*}, x_{k+1}) \leq & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k+1} \right> - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\ = & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k} \right> + \alpha_k \left< g_k, x_{k} - x_{k+1} \right> - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\ \overset{\text{(4)}}{\leq} & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*})\right) + \alpha_k \left< g_k, x_{k} - x_{k+1} \right> - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2 \\ \leq & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*})\right) + \alpha_k \| g_k \|_{*} \| x_{k} - x_{k+1} \| - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2 \\ \leq & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*}) \right) + \frac{ \alpha_k^2 }{2 \sigma} \| g_{k} \|_{*}^2. \end{aligned} \tag{33}$$
Comparing with (26), $\| x_{k} - x^{*} \|_2^2$ has been replaced by $\text{Div}_{\psi}(x^{*}, x_{k})$. As before, assume $\text{Div}_{\psi}(x^{*}, x_{1})$ is bounded by $R^2$ and $\| g_k \|_{*}$ is bounded by $G$, where $\| \cdot \|_{*}$ is the dual norm.
To see the advantage of mirror descent, suppose $C$ is the $n$-dimensional simplex and we use the KL divergence, whose generator $\psi$ is 1-strongly convex with respect to the $\ell_1$ norm. The dual of the $\ell_1$ norm is the $\ell_{\infty}$ norm. With the KL divergence, $\text{Div}_{\psi}(x^{*}, x_{1})$ is bounded by $\log n$, while $G$ is bounded by $M$. Hence, in terms of the value of $RG$, mirror descent beats subgradient descent by a factor of $O(\sqrt{\frac{n}{\log n}})$.
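Concretely (spelling out the factor quoted above):

$$\underbrace{RG \leq \sqrt{2} \cdot M\sqrt{n}}_{\text{subgradient descent}} \quad \text{vs.} \quad \underbrace{RG \leq \sqrt{\log n} \cdot M}_{\text{mirror descent}}, \qquad \frac{\sqrt{2}\, M \sqrt{n}}{M \sqrt{\log n}} = O\!\left( \sqrt{\frac{n}{\log n}} \right).$$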
Acceleration 1: $f$ is strongly convex. We say $f$ is strongly convex with respect to another function $\psi$ with modulus $\lambda$ if:
$$f(x) \geq f(y) + \left< g, x - y \right> + \lambda \text{Div}_{\psi}(x, y), \quad g \in \partial f(y). \tag{34}$$
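For example (a standard special case): with $\psi(x) = \frac{1}{2} \| x \|_2^2$, condition (34) reduces to the usual definition of $\lambda$-strong convexity,

$$f(x) \geq f(y) + \left< g, x - y \right> + \frac{\lambda}{2} \| x - y \|_2^2.$$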
Note that $f$ is not required to be differentiable. With the strong convexity condition added, (33) becomes:
$$\begin{aligned} \text{Div}_{\psi}(x^{*}, x_{k+1}) \leq & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k+1} \right> - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\ = & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k} \right> + \alpha_k \left< g_k, x_{k} - x_{k+1} \right> - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\ \overset{\text{(34), (4)}}{\leq} & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*}) + \lambda \text{Div}_{\psi}(x^{*}, x_{k}) \right) + \alpha_k \left< g_k, x_{k} - x_{k+1} \right> - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2 \\ \leq & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*}) + \lambda \text{Div}_{\psi}(x^{*}, x_{k}) \right) + \alpha_k \| g_k \|_{*} \| x_{k} - x_{k+1} \| - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2 \\ \leq & (1 - \lambda \alpha_k ) \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) - f(x^{*}) \right) + \frac{ \alpha_k^2 }{2 \sigma} \| g_{k} \|_{*}^2. \end{aligned} \tag{35}$$
Write $\delta_{k} = \text{Div}_{\psi}(x^{*}, x_k)$ and set $\alpha_k = \frac{1}{\lambda k}$; the above then gives:
$$\begin{aligned} & \delta_{k+1} \leq \frac{k-1}{k} \delta_{k} - \frac{1}{\lambda k} \epsilon_{k} + \frac{G^2}{2 \sigma \lambda^2 k^2} \\ \Rightarrow \ & k \delta_{k+1} \leq (k-1) \delta_{k} - \frac{1}{\lambda} \epsilon_{k} + \frac{G^2}{2 \sigma \lambda^2 k}. \end{aligned} \tag{36}$$
Telescoping and summing:
$$\begin{aligned} & T \delta_{T+1} \leq - \frac{1}{\lambda} \sum_{k=1}^{T} \epsilon_{k} + \frac{G^2}{2 \sigma \lambda^2} \sum_{k=1}^{T} \frac{1}{k} \\ \Rightarrow \ & \min_{k \in \{ 1,\cdots, T \}} \epsilon_{k} \leq \frac{G^2}{2 \sigma \lambda } \frac{1}{T} \sum_{k=1}^{T} \frac{1}{k} \leq \frac{G^2}{2 \sigma \lambda } \frac{O(\log T)}{T}. \end{aligned} \tag{37}$$
Acceleration 2: $f$ has a Lipschitz continuous gradient. If the gradient of $f$ is Lipschitz continuous, there exists $L > 0$ such that:
$$\| \nabla f(x) - \nabla f(y) \|_{*} \leq L \| x - y \|, \quad \forall x, y. \tag{38}$$
We sometimes simply say that $f$ is smooth. For convex $f$, the above is equivalent to:
$$f(x) \leq f(y) + \left< \nabla f(y), x - y \right> + \frac{L}{2} \| x - y \|^2. \tag{39}$$
Now consider bounding the term $\left< g_k, x^{*} - x_{k+1} \right>$ in (33):
$$\begin{aligned} \left< g_k, x^{*} - x_{k+1} \right> =& \left< g_k, x^{*} - x_{k} \right> + \left< g_k, x_{k} - x_{k+1} \right> \\ \leq & f(x^{*}) - f(x_k) + f(x_{k}) - f(x_{k+1}) + \frac{L}{2} \| x_{k} - x_{k+1} \|^2 \\ = & f(x^{*}) - f(x_{k+1}) + \frac{L}{2} \| x_{k} - x_{k+1} \|^2, \end{aligned} \tag{40}$$

where the first inner product is bounded by convexity and the second by smoothness (39).
Substituting this into (33) gives:
$$\begin{aligned} \text{Div}_{\psi}(x^{*}, x_{k+1}) \leq & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left( f(x^{*}) - f(x_{k+1}) + \frac{L}{2} \| x_{k} - x_{k+1} \|^2 \right) - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\ \overset{\psi \text{ strongly convex, (4)}}{\leq} & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left( f(x^{*}) - f(x_{k+1}) + \frac{L}{2} \| x_{k} - x_{k+1} \|^2 \right) - \frac{\sigma}{2} \| x_{k} - x_{k+1} \|^2. \end{aligned} \tag{41}$$
Setting $\alpha_k = \frac{\sigma}{L}$, we get:
$$\text{Div}_{\psi}(x^{*}, x_{k+1}) \leq \text{Div}_{\psi}(x^{*}, x_k) - \frac{\sigma}{L} \left( f(x_{k+1}) - f(x^{*}) \right). \tag{42}$$
Telescoping:
$$\min_{k \in \{2,\cdots, T+1\}} f(x_k) - f(x^{*}) \leq \frac{L \text{Div}_{\psi}(x^{*}, x_1)}{\sigma T} \leq \frac{L R^2}{\sigma T}. \tag{43}$$
The convergence rate is now $O(\frac{1}{T})$; with a Nesterov-style technique it can be improved to $O(\frac{1}{T^2})$. This can be proved using Lemma 1, and the resulting method is called the accelerated proximal gradient method (APGM).
1 Composite objective functions
Suppose the objective is $h(x) = f(x) + r(x)$, where $f$ is smooth and $r(x)$ is simple, e.g., $\| x \|_1$. Applying the scheme above directly yields a convergence rate of $O(\frac{1}{\sqrt{T}})$, because $h$ is not smooth. We would like to recover the smooth-case rate $O(\frac{1}{T})$, and this is possible because $r(x)$ is a simple function: we only need to extend (23) as follows:
$$\begin{aligned} x_{k + 1} &= \argmin_{x \in C} \left\{ f(x_k) + \left< g_k, x - x_k \right> + r(x) + \frac{1}{\alpha_k} \text{Div}_{\psi}(x, x_k) \right\} \\ &= \argmin_{x \in C} \left\{ \alpha_k f(x_k) + \alpha_k \left< g_k, x - x_k \right> + \alpha_k r(x) + \text{Div}_{\psi}(x, x_k) \right\} \\ &= \argmin_{x \in C} \left\{ \left< \alpha_k g_k, x \right> + \alpha_k r(x) + \text{Div}_{\psi}(x, x_k) \right\}. \end{aligned} \tag{44}$$
Here we only linearize $f$ around $x_k$, leaving $r(x)$ untouched. Assuming this minimization can be carried out efficiently, one can show that all of the rates above carry over.
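For instance (a minimal sketch, assuming $\psi(x) = \frac{1}{2}\|x\|_2^2$, $C = \mathbb{R}^n$, and $r(x) = \lambda \|x\|_1$, none of which is fixed by the text above): the argmin in (44) then has a closed-form soft-thresholding solution, which is exactly what makes the step cheap:

```python
import numpy as np

def composite_step(x_k, g_k, alpha_k, lam):
    """Solve (44) for psi = (1/2)||.||_2^2 and r(x) = lam * ||x||_1 on R^n.

    The minimizer is soft-thresholding applied to the plain gradient step
    (the ISTA / proximal-gradient update).
    """
    z = x_k - alpha_k * g_k   # gradient step on the linearized f only
    return np.sign(z) * np.maximum(np.abs(z) - alpha_k * lam, 0.0)

x_next = composite_step(np.array([0.8, -0.1]), np.array([1.0, 1.0]),
                        alpha_k=0.5, lam=0.2)
```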
Here we consider the more general situation where $f$ is nonsmooth or strongly convex; if moreover the gradient of $f$ is Lipschitz continuous, the $O(\frac{1}{T^2})$ rate is still attainable.
Treating $\alpha_k f(x_k) + \alpha_k \left< g_k, x - x_k \right> + \alpha_k r(x)$ as the $L$ in Lemma 1 gives:
$$\begin{aligned} & \alpha_k f(x_k) + \alpha_k \left< g_k, x^{*} - x_k \right> + \alpha_k r(x^{*}) + \text{Div}_{\psi}(x^{*}, x_k) \\ \geq & \alpha_k f(x_k) + \alpha_k \left< g_k, x_{k+1} - x_k \right> + \alpha_k r(x_{k+1}) + \text{Div}_{\psi}(x_{k+1}, x_k) + \text{Div}_{\psi}(x^{*}, x_{k+1}). \end{aligned} \tag{45}$$
Then, as in (33):
$$\begin{aligned} \text{Div}_{\psi}(x^{*}, x_{k+1}) \leq & \text{Div}_{\psi}(x^{*}, x_k) + \alpha_k \left< g_k, x^{*} - x_{k+1} \right> + \alpha_k \left( r(x^{*}) - r(x_{k+1}) \right) - \text{Div}_{\psi}(x_{k+1}, x_{k}) \\ \leq & \cdots \\ \leq & \text{Div}_{\psi}(x^{*}, x_k) - \alpha_k \left( f(x_{k}) + r(x_{k+1}) - f(x^{*}) - r(x^{*}) \right) + \frac{ \alpha_k^2 }{2 \sigma} \| g_{k} \|_{*}^2. \end{aligned} \tag{46}$$
Writing $\delta_k = \text{Div}_{\psi}(x^{*}, x_k)$, we get:
$$f(x_{k}) + r(x_{k+1}) - f(x^{*}) - r(x^{*}) \leq \frac{1}{\alpha_k} ( \delta_k - \delta_{k+1} ) + \frac{ \alpha_k }{2 \sigma} \| g_{k} \|_{*}^2. \tag{47}$$
Summing over $k$:
$$\begin{aligned} & r(x_{T+1}) - r(x_1) + \sum_{k=1}^{T} \left( h(x_k) - h(x^{*}) \right) \\ \leq & \frac{\delta_1}{\alpha_1} + \sum_{k=2}^{T} \delta_k \left( \frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}} \right) - \frac{\delta_{T+1}}{\alpha_T} + \frac{ G^2 }{2 \sigma} \sum_{k=1}^{T} \alpha_k \\ \leq & R^2 \left( \frac{1}{\alpha_1} + \sum_{k=2}^{T} \left( \frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}} \right) \right) + \frac{ G^2 }{2 \sigma} \sum_{k=1}^{T} \alpha_k \\ = & \frac{R^2}{\alpha_T} + \frac{ G^2 }{2 \sigma} \sum_{k=1}^{T} \alpha_k. \end{aligned} \tag{48}$$
Choosing $x_1 = \argmin_{x} r(x)$ ensures $r(x_{T+1}) - r(x_1) \geq 0$. Setting $\alpha_k = \frac{R}{G} \sqrt{\frac{\sigma}{k}}$, we obtain:
$$\sum_{k=1}^{T} \left( h(x_k) - h(x^{*}) \right) \leq \frac{RG}{\sqrt{\sigma}} \left( \sqrt{T} + \frac{1}{2} \sum_{k=1}^{T} \frac{1}{\sqrt{k}} \right) = \frac{RG}{\sqrt{\sigma}} O(\sqrt{T}). \tag{49}$$
Hence $\min_{k=1,\cdots,T} \{ h(x_k) - h(x^{*}) \}$ converges at rate $O(\frac{RG}{\sqrt{\sigma T}})$.
2 Online learning
The figure above shows the online learning protocol (Algorithm 1). The player's goal is to minimize the regret: the gap between the incurred loss and the smallest possible cumulative loss $\sum_{k} f_k(x)$ over all fixed $x$:
$$\text{Regret} = \sum_{k=1}^{T} f_k (x_k) - \min_{x} \sum_{k=1}^{T} f_k (x). \tag{50}$$
Note that no assumption is made on how the opponent chooses $f_k$; it may be adversarial. After receiving $f_k$ at iteration $k$, the player applies a mirror descent update on $f_k$ to obtain $x_{k+1}$:
$$x_{k+1} = \argmin_{x \in C} \left\{ f_k(x_k) + \left< g_k, x - x_k \right> + \frac{1}{\alpha_k} \text{Div}(x, x_k) \right\}, \quad g_k \in \partial f_k (x_k). \tag{51}$$
We can show the regret is bounded. From (33):
$$f_k(x_k) - f_k(x^{*}) \leq \frac{1}{\alpha_k} \left( \text{Div}(x^{*}, x_k) - \text{Div}(x^{*}, x_{k+1}) \right) + \frac{\alpha_k}{2 \sigma} \| g_k \|_{*}^{2}. \tag{52}$$
Summing from $k=1$, as in (48) and (49):
$$\sum_{k=1}^{T} \left( f_k(x_k) - f_k(x^{*}) \right) \leq \frac{RG}{\sqrt{\sigma}} O(\sqrt{T}). \tag{53}$$
So the regret grows at rate $O(\sqrt{T})$.
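As an illustration (a minimal sketch, not from the original text): online mirror descent with the KL divergence on the simplex and linear losses $f_k(x) = \left< \ell_k, x \right>$, i.e., the classic multiplicative-weights setting:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 10, 1000
x = np.full(n, 1.0 / n)              # x_1: uniform over the simplex
alpha = np.sqrt(np.log(n) / T)       # step tuned to the entropic geometry

cum_loss, cum_loss_each = 0.0, np.zeros(n)
for k in range(T):
    ell = rng.uniform(0.0, 1.0, size=n)  # adversary's loss vector; g_k = ell
    cum_loss += ell @ x
    cum_loss_each += ell
    w = x * np.exp(-alpha * ell)         # update (51) with the KL divergence
    x = w / w.sum()

# For linear losses the best fixed x is a vertex of the simplex.
regret = cum_loss - cum_loss_each.min()
print(f"regret = {regret:.2f}, sqrt(T) = {np.sqrt(T):.2f}")
```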
$f$ is strongly convex. Substituting $f_k$ for $f$ in (35) yields a regret bound of $O(\log T)$.
$f$ has a Lipschitz continuous gradient. The result (43) does not carry over to the online setting: substituting $f_k$ for $f$ leaves a term $f_k(x_{k+1}) - f_k(x^{*})$ on the right-hand side, and telescoping no longer yields a regret bound. Hence Lipschitz continuity of the gradient does not guarantee an $O(\log T)$ regret bound.
Composite objective functions. In the online setting, both the player and the opponent have $r(x)$, and the opponent changes $f_k(x)$ at each iteration. The per-iteration objective is $h_k(x_k) = f_k(x_k) + r(x_k)$. The update rule is:
$$x_{k+1} = \argmin_{x \in C} \left\{ f_k(x_k) + \left< g_k, x - x_k \right> + r(x) + \frac{1}{\alpha_k} \text{Div}(x, x_k) \right\}, \quad g_k \in \partial f_k (x_k). \tag{54}$$
Then (47) becomes:
$$f_k(x_{k}) + r(x_{k+1}) - f_k(x^{*}) - r(x^{*}) \leq \frac{1}{\alpha_k} ( \delta_k - \delta_{k+1} ) + \frac{ \alpha_k }{2 \sigma} \| g_{k} \|_{*}^2. \tag{55}$$
Although the bound involves $r(x_{k+1})$ rather than $r(x_k)$, this causes no trouble, because $r$ does not change across iterations.
Choosing $x_1 = \argmin_{x} r(x)$, and proceeding as in (48) and (49):
$$\sum_{k=1}^{T} \left( h_k(x_k) - h_k(x^{*}) \right) \leq \frac{RG}{\sqrt{\sigma}} O(\sqrt{T}). \tag{56}$$
Hence the regret is again $O(\sqrt{T})$.
When the $f_k$ are strongly convex, we again obtain $O(\log T)$ regret in the composite setting. As above, however, Lipschitz continuity of the gradient of $f$ does not guarantee an $O(\log T)$ regret bound.
3 Stochastic optimization
Consider minimizing a function given in expectation form:

$$\min_{x} F(x) := \mathbb{E}_{w \sim p} [ f(x; w) ], \tag{57}$$
where $p$ is the distribution of $w$. Many machine learning models take this form. For example, the SVM objective is:
$$F(x) = \frac{1}{m} \sum_{i=1}^{m} \max \left\{ 0, 1 - c_i \left< a_i, x \right> \right\} + \frac{\lambda}{2} \| x \|^2. \tag{58}$$
This fits the form (57) with $w$ uniformly distributed on $\{ 1, 2, \cdots, m \}$, i.e., $p(w=i) = \frac{1}{m}$, and:
$$f(x; i) = \max \left\{ 0, 1 - c_i \left< a_i, x \right> \right\} + \frac{\lambda}{2} \| x \|^2. \tag{59}$$
When $m$ is large, evaluating $F$ and its subgradient is expensive. A simple idea is therefore to update based on a single randomly chosen data point. This can be viewed as a special case of the online learning setting of Algorithm 1, where the opponent in step 4 now draws $f_k$ as $f(x; w_k)$, with $w_k$ sampled from $p$ independently of the past iterates. Ideally, the mirror descent updates should drive $x_k$ toward the minimizer of $F(x)$. Intuitively this is very reasonable: from $f_k$ we can compute unbiased estimates of $F(x_k)$ and of a subgradient of $F$ at $x_k$ (because the $w_k$ are i.i.d. samples from $p$). This is a special case of stochastic optimization, which we summarize in Algorithm 2.
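A minimal sketch of this special case (data and parameters are illustrative, not from the original text): stochastic subgradient mirror descent with Euclidean $\psi$ on the objective (58)-(59), using the step size $\alpha_k = \frac{1}{\lambda k}$ from the strongly convex case:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, lam, T = 200, 5, 0.1, 5000
A = rng.normal(size=(m, n))        # data points a_i
c = np.sign(rng.normal(size=m))    # labels c_i in {-1, +1}

x = np.zeros(n)
for k in range(1, T + 1):
    i = rng.integers(m)            # w_k uniform on {1, ..., m}
    g = lam * x                    # subgradient of f(x; i) in (59)
    if c[i] * (A[i] @ x) < 1:      # hinge term is active
        g -= c[i] * A[i]
    x -= (1.0 / (lam * k)) * g     # alpha_k = 1/(lambda*k): f is lam-strongly convex

print("trained weights:", x)
```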
In fact, the method works in a more general setting. For simplicity, we only say that the opponent holds $w_k$ at iteration $k$. An online learning algorithm $\mathcal{A}$ is then simply a deterministic map from an ordered set $\{ w_1, \cdots, w_k \}$ to $x_{k+1}$; denote the initial model $x_1$ by $\mathcal{A}(w_0)$. The following theorem is the key to the online-to-batch conversion.
Theorem 1. Suppose the online learning algorithm $\mathcal{A}$, run for $k$ iterations of Algorithm 1, has regret bounded by $R_k$. Suppose $w_1, \cdots, w_{T+1}$ are sampled i.i.d. from $p$. Define $\hat{x}=\mathcal{A}(w_{j+1}, \cdots, w_{T})$, where $j$ is drawn uniformly at random from $\{ 0, \cdots, T \}$. Then:
$$\mathbb{E}[F(\hat{x})] - \min_{x} F(x) \leq \frac{R_{T+1}}{T + 1}, \tag{60}$$
where the expectation is over the randomness in $w_1, \cdots, w_{T}$ and $j$.
Similarly, a bound holds with high probability $1 - \sigma$:
$$F(\hat{x}) - \min_{x} F(x) \leq \frac{R_{T+1}}{T + 1} \log \frac{1}{\sigma}, \tag{61}$$
where the probability is over the randomness in $w_1, \cdots, w_{T}$ and $j$.
Proof.
$$\begin{aligned} \mathbb{E} [F(\hat{x})] =& \mathbb{E}_{j,w_1, \cdots, w_{T+1}} [f(\hat{x}; w_{T+1})] \\ =& \mathbb{E}_{j,w_1, \cdots, w_{T+1}} [f(\mathcal{A}_{w_{j+1}, \cdots, w_{T}}; w_{T+1})] \\ =& \mathbb{E}_{w_1, \cdots, w_{T+1}} \left[ \frac{1}{T + 1} \sum_{j=0}^{T} f(\mathcal{A}_{w_{j+1}, \cdots, w_{T}}; w_{T+1}) \right] \quad (j \text{ drawn uniformly at random}) \\ =& \frac{1}{T + 1} \mathbb{E}_{w_1, \cdots, w_{T+1}} \left[ \sum_{j=0}^{T} f(\mathcal{A}_{w_{1}, \cdots, w_{T-j}}; w_{T+1-j}) \right] \quad (\text{shift indices; valid since the } w_i \text{ are i.i.d.}) \\ =& \frac{1}{T + 1} \mathbb{E}_{w_1, \cdots, w_{T+1}} \left[ \sum_{s=1}^{T+1} f(\mathcal{A}_{w_{1}, \cdots, w_{s-1}}; w_{s}) \right] \quad (\text{substitute } s = T+1-j ) \\ \leq & \frac{1}{T + 1} \mathbb{E}_{w_1, \cdots, w_{T+1}} \left[ \min_{x} \sum_{s=1}^{T+1} f(x; w_{s}) + R_{T+1} \right] \quad (\text{regret bound}) \\ \leq & \min_{x} \mathbb{E}_{w} [f(x;w)] + \frac{R_{T+1}}{T+1} \quad (\text{the expectation of a minimum is at most the minimum of the expectation}) \\ = & \min_{x} F(x) + \frac{R_{T+1}}{T + 1}. \end{aligned} \tag{62}$$