\section{Convergence Rate Analysis of Unconstrained Gradient Methods}
\subsection{strongly convex and smooth}
\[
\begin{array}{rl} \operatorname{minimize}_{\boldsymbol{x}} & f(\boldsymbol{x}) \\ \text{subject to} & \boldsymbol{x} \in \mathbb{R}^n \end{array}
\]
Throughout, we assume that the objective $f(\boldsymbol{x})$ is $L$-smooth, so the following holds.

\textbf{Lemma (descent lemma).} Let $f: \mathbb{E} \rightarrow(-\infty, \infty]$ be an $L$-smooth function $(L \geq 0)$ over a given convex set $D$. Then for any $\mathbf{x}, \mathbf{y} \in D$,
\[
f(\mathbf{y}) \leq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle+\frac{L}{2}\|\mathbf{x}-\mathbf{y}\|^2 .
\]
Taking $\mathbf{x}=\mathbf{x}^t$ in the lemma and letting $\mathbf{x}^{t+1}$ be the minimizer of the resulting quadratic upper bound, we get
\[
\begin{aligned} f\left(\mathbf{x}^{t+1}\right) & \leq \min _{\mathbf{y} \in \mathbb{R}^n}\left\{f\left(\mathbf{x}^t\right)+\left\langle\nabla f\left(\mathbf{x}^t\right), \mathbf{y}-\mathbf{x}^t\right\rangle+\frac{L}{2}\left\|\mathbf{y}-\mathbf{x}^t\right\|^2\right\} \\ & =f\left(\mathbf{x}^t\right)-\frac{1}{2 L}\left\|\nabla f\left(\mathbf{x}^t\right)\right\|_2^2 . \end{aligned} \tag{1}
\]
The minimum is attained at
\[
\mathbf{y}=\mathbf{x}^t-\frac{1}{L} \nabla f\left(\mathbf{x}^t\right),
\]
which is exactly the gradient step with step size $1/L$. Inequality (1) shows that the objective values $f(\mathbf{x}^t)$ form a non-increasing sequence.
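The sufficient-decrease inequality can be checked numerically. The sketch below is only an illustration under assumptions that are not in the original notes: the test function is a diagonal quadratic with hypothetical curvatures `A_DIAG`, and the helper names (`gd_step`, `decrease_holds`) are invented for this example.

```python
# Hypothetical test problem: f(x) = 0.5 * sum(a_i * x_i^2), which is L-smooth with L = max(a_i).
A_DIAG = [1.0, 4.0, 10.0]
L_SMOOTH = max(A_DIAG)


def f(x):
    """Objective value of the diagonal quadratic."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))


def grad(x):
    """Gradient of the diagonal quadratic."""
    return [a * xi for a, xi in zip(A_DIAG, x)]


def gd_step(x, L=L_SMOOTH):
    """One gradient step x - (1/L) * grad f(x), the minimizer of the quadratic upper bound."""
    return [xi - gi / L for xi, gi in zip(x, grad(x))]


def decrease_holds(x, L=L_SMOOTH, tol=1e-12):
    """Check f(x_next) <= f(x) - ||grad f(x)||^2 / (2L) at the point x (up to rounding)."""
    g_sq = sum(gi * gi for gi in grad(x))
    return f(gd_step(x, L)) <= f(x) - g_sq / (2.0 * L) + tol


# Verify the inequality along a short gradient descent trajectory.
x = [3.0, -2.0, 1.0]
for _ in range(20):
    assert decrease_holds(x)
    x = gd_step(x)
```

Any other $L$-smooth function could be substituted for `f` and `grad`, provided `L_SMOOTH` is a valid smoothness constant for it.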
When $f$ is in addition $\mu$-strongly convex, minimizing the strong-convexity lower bound over $\mathbf{y}$ gives
\[
\begin{aligned} f\left(\mathbf{x}^*\right) & \geq \min _{\mathbf{y} \in \mathbb{R}^n}\left\{f\left(\mathbf{x}^t\right)+\left\langle\nabla f\left(\mathbf{x}^t\right), \mathbf{y}-\mathbf{x}^t\right\rangle+\frac{\mu}{2}\left\|\mathbf{y}-\mathbf{x}^t\right\|^2\right\} \\ & =f\left(\mathbf{x}^t\right)-\frac{1}{2 \mu}\left\|\nabla f\left(\mathbf{x}^t\right)\right\|_2^2 . \end{aligned}
\]
Rearranging, we have
\[
2 \mu\left(f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)\right) \leq\left\|\nabla f\left(\mathbf{x}^t\right)\right\|_2^2 , \tag{2}
\]
which is known as the Polyak-\L{}ojasiewicz (PL) condition. Note that every strongly convex function satisfies the PL condition, whereas the converse does not hold in general.
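As a quick sanity check of the PL inequality, the sketch below samples random points and evaluates the gap $\|\nabla f(\mathbf{x})\|_2^2 - 2\mu(f(\mathbf{x}) - f(\mathbf{x}^*))$. The test function (a diagonal quadratic with hypothetical curvatures `A_DIAG`, minimized at the origin with $f^* = 0$) and the helper name `pl_gap` are assumptions made for this example.

```python
import random

# Hypothetical strongly convex quadratic: f(x) = 0.5 * sum(a_i * x_i^2), mu = min(a_i), f* = 0.
A_DIAG = [0.5, 2.0, 8.0]
MU = min(A_DIAG)


def f(x):
    """Objective value of the diagonal quadratic."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))


def grad_sq_norm(x):
    """Squared Euclidean norm of the gradient."""
    return sum((a * xi) ** 2 for a, xi in zip(A_DIAG, x))


def pl_gap(x, mu=MU):
    """Return ||grad f(x)||^2 - 2*mu*(f(x) - f*); nonnegative when the PL condition holds."""
    return grad_sq_norm(x) - 2.0 * mu * f(x)


rng = random.Random(0)  # fixed seed so the check is reproducible
samples = [[rng.uniform(-5.0, 5.0) for _ in A_DIAG] for _ in range(1000)]
worst_gap = min(pl_gap(x) for x in samples)
```

For this quadratic the gap equals $\sum_i a_i (a_i - \mu) x_i^2$, so `worst_gap` should be nonnegative up to rounding.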
Substituting the PL inequality (2) into the descent inequality (1), we have
\[
\begin{aligned} f\left(\mathbf{x}^{t+1}\right)-f\left(\mathbf{x}^*\right) & \leq f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)-\frac{1}{2 L}\left\|\nabla f\left(\mathbf{x}^t\right)\right\|_2^2 \\ & \leq f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)-\frac{\mu}{L}\left(f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)\right) \\ & =\left(1-\frac{\mu}{L}\right)\left(f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)\right) . \end{aligned}
\]
Applying this relation recursively, we obtain
\[
f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right) \leq\left(1-\frac{\mu}{L}\right)^t\left(f\left(\mathbf{x}^0\right)-f\left(\mathbf{x}^*\right)\right) .
\]
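The linear rate can be observed on a small example. This is a minimal sketch, assuming a diagonal quadratic with hypothetical curvatures `A_DIAG` (so $\mu = \min_i a_i$, $L = \max_i a_i$, $f^* = 0$); the function names are illustrative and not from the original notes.

```python
# Hypothetical strongly convex and smooth test problem: f(x) = 0.5 * sum(a_i * x_i^2).
A_DIAG = [1.0, 3.0, 10.0]
MU, L_SMOOTH = min(A_DIAG), max(A_DIAG)


def f(x):
    """Objective value; the minimum value f* is 0, attained at the origin."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))


def gd_step(x):
    """Gradient step with step size 1/L."""
    return [xi - (a * xi) / L_SMOOTH for a, xi in zip(A_DIAG, x)]


def suboptimality_trace(x0, num_steps):
    """Return [f(x^0) - f*, ..., f(x^T) - f*] along gradient descent."""
    x, trace = list(x0), []
    for _ in range(num_steps + 1):
        trace.append(f(x))
        x = gd_step(x)
    return trace


def rate_bound(gap0, t):
    """Right-hand side (1 - mu/L)^t * (f(x^0) - f*) of the linear-rate bound."""
    return (1.0 - MU / L_SMOOTH) ** t * gap0


trace = suboptimality_trace([1.0, 1.0, 1.0], 30)
bound_ok = all(trace[t] <= rate_bound(trace[0], t) + 1e-12 for t in range(len(trace)))
```

On this example the observed decay is faster than the bound, which is expected: the bound is a worst case over all $\mu$-strongly convex, $L$-smooth functions.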
\subsection{convex and smooth}
We now prove Claim 1. With step size $1/L$,
\[
\begin{aligned} & \left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2=\left\|\boldsymbol{x}^t-\boldsymbol{x}^*-\frac{1}{L} \nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\ & =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\underbrace{\frac{2}{L}\left\langle\boldsymbol{x}^t-\boldsymbol{x}^*, \nabla f\left(\boldsymbol{x}^t\right)-\nabla f\left(\boldsymbol{x}^*\right)\right\rangle}_{\geq \frac{2}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)-\nabla f\left(\boldsymbol{x}^*\right)\right\|_2^2 \ (\text{smooth}+\text{cvx})}+\frac{1}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\ & \leq\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\frac{2}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)-\nabla f\left(\boldsymbol{x}^*\right)\right\|_2^2+\frac{1}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)-\nabla f\left(\boldsymbol{x}^*\right)\right\|_2^2 \\ & =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\frac{1}{L^2}\|\nabla f\left(\boldsymbol{x}^t\right)-\underbrace{\nabla f\left(\boldsymbol{x}^*\right)}_{=0}\|_2^2 . \end{aligned}
\]
Here $\nabla f(\boldsymbol{x}^*)=\boldsymbol{0}$ is used to insert and remove $\nabla f(\boldsymbol{x}^*)$ freely, so the distance to $\boldsymbol{x}^*$ never increases.
Another, simpler approach uses a general step size $\eta$:
\[
\begin{aligned} & \left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2=\left\|\boldsymbol{x}^t-\boldsymbol{x}^*-\eta \nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\ & =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\underbrace{2 \eta\left\langle\boldsymbol{x}^t-\boldsymbol{x}^*, \nabla f\left(\boldsymbol{x}^t\right)\right\rangle}_{\geq 2 \eta\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) \ (\text{smooth}+\text{cvx})}+\eta^2\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\ & \leq\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-2 \eta\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right)+\eta^2 \underbrace{\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2}_{\leq 2 L\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right)} \\ & \leq\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\left(2 \eta-2 L \eta^2\right)\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) . \end{aligned}
\]
As a result, for $0<\eta<1/L$ (so that $2\eta-2L\eta^2>0$), we obtain
\[
f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right) \leq \frac{1}{2 \eta-2 L \eta^2}\left(\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2\right) .
\]
Let $\eta=\frac{1}{2 L}$, so that $\frac{1}{2\eta-2L\eta^2}=2L$. Summing from $t=0$ to $T-1$ and telescoping, we obtain
\[
\frac{1}{T} \sum_{t=0}^{T-1}\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) \leq \frac{2 L\left\|\boldsymbol{x}^0-\boldsymbol{x}^*\right\|_2^2}{T} .
\]
\textbf{A slightly tighter bound.}
\[
\begin{aligned} & \left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2=\left\|\boldsymbol{x}^t-\boldsymbol{x}^*-\eta \nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\ & =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\underbrace{2 \eta\left\langle\boldsymbol{x}^t-\boldsymbol{x}^*, \nabla f\left(\boldsymbol{x}^t\right)\right\rangle}_{\geq 2 \eta\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)+\frac{1}{2 L}\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2\right) \ (\text{smooth}+\text{cvx})}+\eta^2\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\ & \leq\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-2 \eta\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right)-\left(\frac{\eta}{L}-\eta^2\right)\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\ & \stackrel{\eta=\frac{1}{L}}{=}\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\frac{2}{L}\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) . \end{aligned}
\]
Therefore, we have
\[
f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right) \leq \frac{L}{2}\left(\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2\right) .
\]
Summing from $t=0$ to $T-1$, we obtain
\[
\frac{1}{T} \sum_{t=0}^{T-1}\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) \leq \frac{L\left\|\boldsymbol{x}^0-\boldsymbol{x}^*\right\|_2^2}{2 T} .
\]
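The $O(1/T)$ bound on the average suboptimality can be illustrated on a function that is convex and smooth but not strongly convex. The sketch below assumes the one-dimensional test function $f(x)=\sqrt{1+x^2}-1$, chosen for this example only: it satisfies $0 < f''(x) \leq 1$, so $L=1$, with $x^*=0$ and $f^*=0$. All names are hypothetical.

```python
import math

L_SMOOTH = 1.0  # f''(x) = (1 + x^2)^(-3/2) <= 1, so f is 1-smooth (assumed test function)


def f(x):
    """Convex, smooth, not strongly convex: f(x) = sqrt(1 + x^2) - 1, with f* = 0 at x* = 0."""
    return math.sqrt(1.0 + x * x) - 1.0


def grad(x):
    """Derivative of f."""
    return x / math.sqrt(1.0 + x * x)


def average_gap(x0, num_steps, L=L_SMOOTH):
    """Average of f(x^t) - f* over t = 0..T-1 for gradient descent with step size 1/L."""
    x, total = x0, 0.0
    for _ in range(num_steps):
        total += f(x)
        x -= grad(x) / L
    return total / num_steps


def theory_bound(x0, num_steps, L=L_SMOOTH):
    """Right-hand side L * |x^0 - x*|^2 / (2T) of the tighter bound, with x* = 0."""
    return L * x0 * x0 / (2.0 * num_steps)


checks = [average_gap(5.0, T) <= theory_bound(5.0, T) for T in (1, 5, 50)]
```

Since $f$ is convex, the same bound also controls $f$ at the averaged iterate via Jensen's inequality, although the code above only checks the average of function values.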
\subsection{smooth and nonconvex}
In the nonconvex setting we cannot expect to find a global optimum efficiently. Suppose instead that we are satisfied with any (approximate) stationary point. Our goal is then merely to find a point $\boldsymbol{x}$ satisfying
\[
\|\nabla f(\boldsymbol{x})\|_2 \leq \varepsilon \quad \text{(called an $\varepsilon$-approximate stationary point).}
\]
By the descent lemma, gradient descent with step size $1/L$ satisfies
\[
\frac{1}{2 L}\left\|\nabla f\left(\boldsymbol{x}^k\right)\right\|_2^2 \leq f\left(\boldsymbol{x}^k\right)-f\left(\boldsymbol{x}^{k+1}\right), \quad \forall k .
\]
Summing from $k=0$ to $k=t-1$:
\[
\begin{aligned} \frac{1}{2 L} \sum_{k=0}^{t-1}\left\|\nabla f\left(\boldsymbol{x}^k\right)\right\|_2^2 & \leq \sum_{k=0}^{t-1}\left(f\left(\boldsymbol{x}^k\right)-f\left(\boldsymbol{x}^{k+1}\right)\right)=f\left(\boldsymbol{x}^0\right)-f\left(\boldsymbol{x}^t\right) \\ & \leq f\left(\boldsymbol{x}^0\right)-f\left(\boldsymbol{x}^*\right) \\ \Longrightarrow \quad \min _{0 \leq k<t}\left\|\nabla f\left(\boldsymbol{x}^k\right)\right\|_2 & \leq \sqrt{\frac{2 L\left(f\left(\boldsymbol{x}^0\right)-f\left(\boldsymbol{x}^*\right)\right)}{t}} . \end{aligned}
\]
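The bound on the smallest gradient norm can be tried on a simple nonconvex function. This sketch assumes the one-dimensional test function $f(x)=x^2+3\sin^2 x$, picked for illustration: its second derivative $2+6\cos 2x$ ranges over $[-4, 8]$, so it is nonconvex and $8$-smooth, with global minimum $f^*=0$ at $x=0$. The function names are hypothetical.

```python
import math

L_SMOOTH = 8.0  # |f''(x)| = |2 + 6*cos(2x)| <= 8 for the assumed test function
F_STAR = 0.0    # global minimum value, attained at x = 0


def f(x):
    """Nonconvex smooth test function f(x) = x^2 + 3*sin(x)^2."""
    return x * x + 3.0 * math.sin(x) ** 2


def grad(x):
    """Derivative f'(x) = 2x + 3*sin(2x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def min_grad_norm(x0, num_steps, L=L_SMOOTH):
    """Smallest |f'(x^k)| over k = 0..t-1 along gradient descent with step size 1/L."""
    x, best = x0, float("inf")
    for _ in range(num_steps):
        g = grad(x)
        best = min(best, abs(g))
        x -= g / L
    return best


def stationarity_bound(x0, num_steps, L=L_SMOOTH):
    """Right-hand side sqrt(2L (f(x^0) - f*) / t) of the bound."""
    return math.sqrt(2.0 * L * (f(x0) - F_STAR) / num_steps)


checks = [min_grad_norm(4.0, t) <= stationarity_bound(4.0, t) for t in (1, 10, 100)]
```

The guarantee concerns the best iterate so far, not the last one, which is why the code tracks a running minimum.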
\subsection{Regularity Condition}
From another perspective,
\[
\begin{aligned} \left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2 & =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*-\frac{1}{L} \nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\ & =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2+\frac{1}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2-\frac{2}{L}\left\langle\boldsymbol{x}^t-\boldsymbol{x}^*, \nabla f\left(\boldsymbol{x}^t\right)\right\rangle \\ & \overset{(\mathrm{i})}{\leq}\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\frac{\mu}{L}\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2 \\ & =\left(1-\frac{\mu}{L}\right)\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2 . \end{aligned}
\]
This again yields a linear convergence rate. For the inequality step (i) in the derivation above to hold, we need
\[
\left\langle\nabla f(\boldsymbol{x}), \boldsymbol{x}-\boldsymbol{x}^*\right\rangle \geq \frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2+\frac{1}{2 L}\|\nabla f(\boldsymbol{x})\|_2^2, \quad \forall \boldsymbol{x} .
\]
This is called the Regularity Condition. Intuitively it looks like a combination of strong convexity and smoothness, which suggests that it is weaker than assuming both strong convexity and smoothness. Indeed, those two assumptions imply it. Writing $\boldsymbol{x}^{+}=\boldsymbol{x}-\frac{1}{L} \nabla f(\boldsymbol{x})$, we have
\[
\begin{aligned} 0 & \leq f\left(\boldsymbol{x}^{+}\right)-f\left(\boldsymbol{x}^*\right)=f\left(\boldsymbol{x}^{+}\right)-f(\boldsymbol{x})+f(\boldsymbol{x})-f\left(\boldsymbol{x}^*\right) \\ & \leq \underbrace{\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}^{+}-\boldsymbol{x}\right)+\frac{L}{2}\left\|\boldsymbol{x}^{+}-\boldsymbol{x}\right\|_2^2}_{\text{smoothness}}+\underbrace{\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}-\boldsymbol{x}^*\right)-\frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2}_{\text{strong convexity}} \\ & =\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}^{+}-\boldsymbol{x}^*\right)+\frac{1}{2 L}\|\nabla f(\boldsymbol{x})\|_2^2-\frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2 \\ & =\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}^{+}-\boldsymbol{x}+\boldsymbol{x}-\boldsymbol{x}^*\right)+\frac{1}{2 L}\|\nabla f(\boldsymbol{x})\|_2^2-\frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2 \\ & =\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}-\boldsymbol{x}^*\right)-\frac{1}{2 L}\|\nabla f(\boldsymbol{x})\|_2^2-\frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2 , \end{aligned}
\]
where the last step uses $\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}^{+}-\boldsymbol{x}\right)=-\frac{1}{L}\|\nabla f(\boldsymbol{x})\|_2^2$.
Rearranging the terms gives the regularity condition.
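The regularity condition can be spot-checked numerically. The following sketch assumes a diagonal quadratic with hypothetical curvatures `A_DIAG` (so $\boldsymbol{x}^*=\boldsymbol{0}$, $\mu=\min_i a_i$, $L=\max_i a_i$) and evaluates the slack of the inequality at random points; the helper name `rc_gap` is invented for this example.

```python
import random

# Hypothetical test problem: f(x) = 0.5 * sum(a_i * x_i^2), minimized at x* = 0.
A_DIAG = [0.5, 2.0, 8.0]
MU, L_SMOOTH = min(A_DIAG), max(A_DIAG)


def rc_gap(x):
    """Slack <grad f(x), x - x*> - (mu/2)||x - x*||^2 - (1/(2L))||grad f(x)||^2, with x* = 0."""
    inner = sum(a * xi * xi for a, xi in zip(A_DIAG, x))    # <grad f(x), x>
    dist_sq = sum(xi * xi for xi in x)                      # ||x - x*||^2
    g_sq = sum((a * xi) ** 2 for a, xi in zip(A_DIAG, x))   # ||grad f(x)||^2
    return inner - 0.5 * MU * dist_sq - g_sq / (2.0 * L_SMOOTH)


rng = random.Random(1)  # fixed seed so the check is reproducible
samples = [[rng.uniform(-5.0, 5.0) for _ in A_DIAG] for _ in range(1000)]
worst_rc_gap = min(rc_gap(x) for x in samples)
```

A nonnegative `worst_rc_gap` is consistent with the derivation above; it is of course only evidence on sampled points, not a proof.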
\subsection{Nonsmooth case}
In the nonsmooth case, we study the convergence of the subgradient method. Its iteration is
\[
x^{k+1}=x^k-\alpha_k g^k, \quad g^k \in \partial f\left(x^k\right) . \tag{1.1}
\]
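Nonsmoothness is extremely unfriendly to convergence, and a negative subgradient is not even guaranteed to be a descent direction. In the smooth case, as long as $\alpha_k$ is chosen small enough (depending on the smoothness constant $L_g$), each step is at least guaranteed to decrease the function value, which is usually enforced by a line search. Since a negative subgradient need not be a descent direction, this property is lost, so monotone decrease cannot be guaranteed. Next, I discuss convergence in two cases: 1. convex; 2. strongly convex.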
\subsubsection{Convex case}
\textbf{Theorem 1.1 (Nonsmooth + convex).} Suppose $f$ is convex and Lipschitz continuous with constant $L_f$. For the subgradient iteration (1.1), choose the step size
\[
\alpha_k=\frac{f\left(x^k\right)-f^*}{\left\|g^k\right\|^2} \quad \text{if } g^k \neq 0, \qquad \alpha_k=1 \quad \text{otherwise.}
\]
Then we have:
\begin{itemize}
\item $\left\|x^{k+1}-x^*\right\|^2 \leq\left\|x^k-x^*\right\|^2$;
\item $f^k \rightarrow f^*$ as $k \rightarrow \infty$;
\item $f_{\text{best}}^n-f^* \leq \dfrac{L_f\left\|x^0-x^*\right\|}{\sqrt{n+1}}$, where $f_{\text{best}}^n=\min _k\left\{f\left(x^k\right), k=0, \cdots, n\right\}$.
\end{itemize}
\textbf{Proof.} Following (0.6) and setting $x=x^*$, we get
\[
\begin{aligned} \left\|x^{k+1}-x^*\right\|^2 & =\left\|x^k-x^*\right\|^2-2 \alpha_k\left\langle g^k, x^k-x^*\right\rangle+\alpha_k^2\left\|g^k\right\|^2 \\ & \leq\left\|x^k-x^*\right\|^2-2 \alpha_k\left(f\left(x^k\right)-f^*\right)+\alpha_k^2\left\|g^k\right\|^2 \\ & =\left\|x^k-x^*\right\|^2-\frac{\left(f\left(x^k\right)-f^*\right)^2}{\left\|g^k\right\|^2} . \end{aligned}
\]
The inequality uses convexity, and the last equality substitutes the step size. This gives the first claim of the theorem. Since $f$ is $L_f$-Lipschitz, $\left\|g^k\right\| \leq L_f$; using this bound and summing over $k$, we obtain
\[
\begin{aligned} \frac{1}{L_f^2} \sum_{k=0}^n\left(f\left(x^k\right)-f^*\right)^2 & \leq \sum_{k=0}^n\left\{\left\|x^k-x^*\right\|^2-\left\|x^{k+1}-x^*\right\|^2\right\} \\ & =\left\|x^0-x^*\right\|^2-\left\|x^{n+1}-x^*\right\|^2 . \end{aligned}
\]
Dropping the nonpositive term and moving the constant to the right-hand side:
\[
\sum_{k=0}^n\left(f\left(x^k\right)-f^*\right)^2 \leq L_f^2\left\|x^0-x^*\right\|^2 .
\]
Because the right-hand side is bounded independently of $n$, the terms of the series must tend to zero, so $f^k \rightarrow f^*$ as $k \rightarrow \infty$. Finally,
\[
(n+1)\left(f_{\text{best}}^n-f^*\right)^2 \leq \sum_{k=0}^n\left(f\left(x^k\right)-f^*\right)^2 \leq L_f^2\left\|x^0-x^*\right\|^2 .
\]
Dividing by $n+1$ and taking square roots yields the third claim, which completes the proof.

\textbf{Remark.} This step size is called Polyak's step size. It involves the optimal value $f^*$, which is usually unavailable in practice, except for problems where the optimal value is known in advance and we only seek a minimizer $x^*$ with $f\left(x^*\right)=f^*$. Other step-size strategies exist; they are omitted here, and their proofs are similar.
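The subgradient method with Polyak's step size can be sketched as follows. This is an illustrative example under assumptions that are not in the original notes: the objective is the weighted $\ell_1$ function $f(x)=|x_1|+2|x_2|$ with known $f^*=0$ at $x^*=(0,0)$, its Lipschitz constant is $L_f=\sqrt{5}$, and `sign(0) = 0` is used to pick a subgradient at kinks. All names are hypothetical.

```python
import math

WEIGHTS = [1.0, 2.0]                                   # assumed: f(x) = |x_1| + 2*|x_2|
F_STAR = 0.0                                           # known optimal value, attained at x* = (0, 0)
LIPSCHITZ = math.sqrt(sum(w * w for w in WEIGHTS))     # L_f = sqrt(5) w.r.t. the Euclidean norm


def f(x):
    """Weighted l1 objective."""
    return sum(w * abs(xi) for w, xi in zip(WEIGHTS, x))


def subgrad(x):
    """One subgradient of f: w_i * sign(x_i), taking sign(0) = 0."""
    return [w * ((xi > 0) - (xi < 0)) for w, xi in zip(WEIGHTS, x)]


def polyak_subgradient(x0, num_steps):
    """Run x^{k+1} = x^k - alpha_k g^k with Polyak's step; return distances to x* and f values."""
    x, dists, values = list(x0), [], []
    for _ in range(num_steps + 1):
        dists.append(math.sqrt(sum(xi * xi for xi in x)))
        values.append(f(x))
        g = subgrad(x)
        g_sq = sum(gi * gi for gi in g)
        alpha = (f(x) - F_STAR) / g_sq if g_sq > 0 else 1.0  # Polyak's step size
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return dists, values


def best_gap_bound(dist0, n):
    """Right-hand side L_f * ||x^0 - x*|| / sqrt(n + 1) of the third claim."""
    return LIPSCHITZ * dist0 / math.sqrt(n + 1)


dists, values = polyak_subgradient([3.0, -1.5], 30)
```

Here `dists` should be non-increasing (first claim) and `min(values[:n + 1]) - F_STAR` should stay below `best_gap_bound(dists[0], n)` for every `n` (third claim); the function values themselves are not guaranteed to decrease monotonically in general.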