Convergence Analysis for Unconstrained Optimization Problems

\section{Convergence Rate Analysis of Unconstrained Gradient Methods}
\subsection{strongly convex and smooth}
\[
\begin{array}{rl}
\operatorname{minimize}_{\boldsymbol{x}} & f(\boldsymbol{x}) \\
\text{subject to} & \boldsymbol{x} \in \mathbb{R}^n
\end{array}
\]
Throughout, we assume that the objective function $f(\boldsymbol{x})$ is $L$-smooth, which gives the following bound.

Lemma (descent lemma): Let $f: \mathbb{E} \rightarrow(-\infty, \infty]$ be an $L$-smooth function ($L \geq 0$) over a given convex set $D$. Then for any $\mathbf{x}, \mathbf{y} \in D$,
\[
f(\mathbf{y}) \leq f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle+\frac{L}{2}\|\mathbf{x}-\mathbf{y}\|^2 .
\]
Applying the lemma at $\mathbf{x}=\mathbf{x}^t$ to one gradient step gives
\[
\begin{aligned}
f\left(\mathbf{x}^{t+1}\right) & \leq \min _{\mathbf{y} \in \mathbb{R}^n}\left\{f\left(\mathbf{x}^t\right)+\left\langle\nabla f\left(\mathbf{x}^t\right), \mathbf{y}-\mathbf{x}^t\right\rangle+\frac{L}{2}\left\|\mathbf{y}-\mathbf{x}^t\right\|^2\right\} \\
& =f\left(\mathbf{x}^t\right)-\frac{1}{2 L}\left\|\nabla f\left(\mathbf{x}^t\right)\right\|_2^2 .
\end{aligned}
\tag{1}
\]
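This shows that the objective values $f\left(\mathbf{x}^t\right)$ form a nonincreasing sequence. The minimum over $\mathbf{y}$ is attained at
\[
\mathbf{y}=\mathbf{x}^t-\frac{1}{L} \nabla f\left(\mathbf{x}^t\right),
\]
which is exactly the gradient descent update $\mathbf{x}^{t+1}$ with step size $1/L$.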
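As a short worked step (not part of the original derivation), completing the square shows why the quadratic upper bound is minimized at $\mathbf{y}=\mathbf{x}^t-\frac{1}{L}\nabla f(\mathbf{x}^t)$ and why the minimum value is $f(\mathbf{x}^t)-\frac{1}{2L}\|\nabla f(\mathbf{x}^t)\|_2^2$. Here $\mathbf{g}$ is shorthand, introduced only for this sketch, for $\nabla f(\mathbf{x}^t)$.

```latex
\[
\begin{aligned}
f\left(\mathbf{x}^t\right)+\left\langle \mathbf{g}, \mathbf{y}-\mathbf{x}^t\right\rangle+\frac{L}{2}\left\|\mathbf{y}-\mathbf{x}^t\right\|_2^2
&= f\left(\mathbf{x}^t\right)-\frac{1}{2L}\|\mathbf{g}\|_2^2
 +\frac{L}{2}\left\|\mathbf{y}-\mathbf{x}^t+\frac{1}{L}\mathbf{g}\right\|_2^2 \\
&\geq f\left(\mathbf{x}^t\right)-\frac{1}{2L}\|\mathbf{g}\|_2^2 ,
\end{aligned}
\]
% Equality holds exactly when y = x^t - (1/L) g.
```

The same computation with $L$ replaced by $\mu$ gives the minimum value $f(\mathbf{x}^t)-\frac{1}{2\mu}\|\nabla f(\mathbf{x}^t)\|_2^2$ of a quadratic with curvature $\mu$.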
Now suppose the objective function is strongly convex as well as smooth. When $f$ is additionally $\mu$-strongly convex, minimizing the strong-convexity lower bound over $\mathbf{y}$ gives
\[
\begin{aligned}
f\left(\mathbf{x}^*\right) & \geq \min _{\mathbf{y} \in \mathbb{R}^n}\left\{f\left(\mathbf{x}^t\right)+\left\langle\nabla f\left(\mathbf{x}^t\right), \mathbf{y}-\mathbf{x}^t\right\rangle+\frac{\mu}{2}\left\|\mathbf{y}-\mathbf{x}^t\right\|^2\right\} \\
& =f\left(\mathbf{x}^t\right)-\frac{1}{2 \mu}\left\|\nabla f\left(\mathbf{x}^t\right)\right\|_2^2 .
\end{aligned}
\]
Rearranging, we have
\[
2 \mu\left(f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)\right) \leq\left\|\nabla f\left(\mathbf{x}^t\right)\right\|_2^2 ,
\tag{2}
\]
which is also known as the Polyak-Łojasiewicz (PL) condition. Note that every strongly convex function satisfies the PL condition, but the converse does not hold in general.
Substituting the PL inequality (2) into the descent inequality (1), we have
\[
\begin{aligned}
f\left(\mathbf{x}^{t+1}\right)-f\left(\mathbf{x}^*\right) & \leq f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)-\frac{1}{2 L}\left\|\nabla f\left(\mathbf{x}^t\right)\right\|_2^2 \\
& \leq f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)-\frac{\mu}{L}\left(f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)\right) \\
& =\left(1-\frac{\mu}{L}\right)\left(f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right)\right) .
\end{aligned}
\]
Applying this relation recursively, we obtain
\[
f\left(\mathbf{x}^t\right)-f\left(\mathbf{x}^*\right) \leq\left(1-\frac{\mu}{L}\right)^t\left(f\left(\mathbf{x}^0\right)-f\left(\mathbf{x}^*\right)\right) .
\]
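The linear rate can be checked numerically. The following is a minimal sketch, not taken from the original text: it assumes the toy quadratic $f(\mathbf{x})=\frac{1}{2}(\mu x_1^2+L x_2^2)$ with $\mu=1$, $L=10$ (so $\mathbf{x}^*=\mathbf{0}$ and $f(\mathbf{x}^*)=0$), runs gradient descent with step size $1/L$, and compares the optimality gap with $(1-\mu/L)^t\,(f(\mathbf{x}^0)-f(\mathbf{x}^*))$. All function and variable names are illustrative.

```python
# Gradient descent with step 1/L on an assumed strongly convex quadratic.
MU, L = 1.0, 10.0  # strong convexity and smoothness constants of the toy problem


def f(x):
    """f(x) = 0.5 * (MU * x1^2 + L * x2^2); minimizer is the origin, f* = 0."""
    return 0.5 * (MU * x[0] ** 2 + L * x[1] ** 2)


def grad(x):
    """Gradient of the toy quadratic."""
    return [MU * x[0], L * x[1]]


def gradient_descent(x0, steps):
    """Return the iterates x^0, ..., x^steps of x^{t+1} = x^t - (1/L) grad f(x^t)."""
    x = list(x0)
    iterates = [list(x)]
    for _ in range(steps):
        g = grad(x)
        x = [xi - gi / L for xi, gi in zip(x, g)]
        iterates.append(list(x))
    return iterates


iterates = gradient_descent([1.0, 1.0], steps=50)
gaps = [f(x) for x in iterates]  # equals f(x^t) - f(x*) because f* = 0
bounds = [(1.0 - MU / L) ** t * gaps[0] for t in range(len(gaps))]

# The theory predicts gaps[t] <= bounds[t] for every t.
rate_holds = all(gap <= bound + 1e-12 for gap, bound in zip(gaps, bounds))
print("linear rate bound holds:", rate_holds)
```

On this example the observed gap shrinks faster than the bound, which is expected: $(1-\mu/L)^t$ is a worst-case guarantee, not an exact prediction.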

\subsection{convex and smooth}

We now prove Claim 1:
\[
\begin{aligned}
\left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2
& =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*-\frac{1}{L} \nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\
& =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\underbrace{\frac{2}{L}\left\langle\boldsymbol{x}^t-\boldsymbol{x}^*, \nabla f\left(\boldsymbol{x}^t\right)-\nabla f\left(\boldsymbol{x}^*\right)\right\rangle}_{\geq \frac{2}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)-\nabla f\left(\boldsymbol{x}^*\right)\right\|_2^2 \ (\text{smooth}+\text{cvx})}+\frac{1}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\
& \leq\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\frac{2}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)-\nabla f\left(\boldsymbol{x}^*\right)\right\|_2^2+\frac{1}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)-\nabla f\left(\boldsymbol{x}^*\right)\right\|_2^2 \\
& =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\frac{1}{L^2}\|\nabla f\left(\boldsymbol{x}^t\right)-\underbrace{\nabla f\left(\boldsymbol{x}^*\right)}_{=0}\|_2^2 .
\end{aligned}
\]
Here $\nabla f\left(\boldsymbol{x}^*\right)=0$ is used to insert $\nabla f\left(\boldsymbol{x}^*\right)$ into the inner product and into the last norm.
Another simple approach, with a general step size $\eta$:
\[
\begin{aligned}
\left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2
& =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*-\eta \nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\
& =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\underbrace{2 \eta\left\langle\boldsymbol{x}^t-\boldsymbol{x}^*, \nabla f\left(\boldsymbol{x}^t\right)\right\rangle}_{\geq 2 \eta\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) \ (\text{cvx})}+\eta^2\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\
& \leq\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-2 \eta\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right)+\eta^2 \underbrace{\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2}_{\leq 2 L\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) \ (\text{smooth})} \\
& \leq\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\left(2 \eta-2 L \eta^2\right)\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) .
\end{aligned}
\]

As a result, for any step size $0<\eta<\frac{1}{L}$, we obtain
\[
f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right) \leq \frac{1}{2 \eta-2 L \eta^2}\left(\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2\right) .
\]
Let $\eta=\frac{1}{2 L}$, so that $2\eta-2L\eta^2=\frac{1}{2L}$. Summing from $t=0$ to $T-1$, the right-hand side telescopes and we obtain
\[
\frac{1}{T} \sum_{t=0}^{T-1}\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) \leq \frac{2 L\left\|\boldsymbol{x}^0-\boldsymbol{x}^*\right\|_2^2}{T} .
\]

A slightly tighter bound uses a sharper lower bound on the inner product:
\[
\begin{aligned}
\left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2
& =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*-\eta \nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\
& =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\underbrace{2 \eta\left\langle\boldsymbol{x}^t-\boldsymbol{x}^*, \nabla f\left(\boldsymbol{x}^t\right)\right\rangle}_{\geq 2 \eta\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)+\frac{1}{2 L}\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2\right) \ (\text{smooth}+\text{cvx})}+\eta^2\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\
& \leq\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-2 \eta\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right)-\left(\frac{\eta}{L}-\eta^2\right)\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\
& \stackrel{\eta=\frac{1}{L}}{=}\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\frac{2}{L}\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) .
\end{aligned}
\]
Therefore, with $\eta=\frac{1}{L}$, we have
\[
f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right) \leq \frac{L}{2}\left(\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2\right) .
\]
Summing from $t=0$ to $T-1$, we obtain
\[
\frac{1}{T} \sum_{t=0}^{T-1}\left(f\left(\boldsymbol{x}^t\right)-f\left(\boldsymbol{x}^*\right)\right) \leq \frac{L\left\|\boldsymbol{x}^0-\boldsymbol{x}^*\right\|_2^2}{2 T} .
\]
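The $O(1/T)$ bound can be illustrated on a function that is convex and smooth but not strongly convex. The sketch below is an illustration under assumptions that are not in the original text: it uses the one-dimensional function $f(x)=\sqrt{1+x^2}-1$, for which $f''(x)=(1+x^2)^{-3/2}\le 1$, so $L=1$, $x^*=0$ and $f(x^*)=0$. It checks both the per-step inequality and the averaged bound $\frac{L\|x^0-x^*\|^2}{2T}$ for step size $1/L$.

```python
import math

L = 1.0                    # smoothness constant of the assumed test function
X_STAR, F_STAR = 0.0, 0.0  # its minimizer and optimal value


def f(x):
    """Convex, 1-smooth, not strongly convex: f(x) = sqrt(1 + x^2) - 1."""
    return math.sqrt(1.0 + x * x) - 1.0


def grad(x):
    """Derivative of f."""
    return x / math.sqrt(1.0 + x * x)


x0 = 5.0
T = 200
x = x0
gaps = []          # f(x^t) - f(x*) for t = 0, ..., T-1
per_step_ok = True
for _ in range(T):
    gap = f(x) - F_STAR
    x_next = x - grad(x) / L
    # Per-step inequality: f(x^t) - f* <= (L/2) (|x^t - x*|^2 - |x^{t+1} - x*|^2)
    rhs = 0.5 * L * ((x - X_STAR) ** 2 - (x_next - X_STAR) ** 2)
    per_step_ok = per_step_ok and (gap <= rhs + 1e-12)
    gaps.append(gap)
    x = x_next

average_gap = sum(gaps) / T
bound = L * (x0 - X_STAR) ** 2 / (2 * T)
print("per-step inequality holds:", per_step_ok)
print("average gap <= bound:", average_gap <= bound)
```

Because $f(\boldsymbol{x}^t)$ is nonincreasing, the last iterate is at least as good as the average, so the same bound also applies to $f(\boldsymbol{x}^{T})-f(\boldsymbol{x}^*)$.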
    \subsection{smooth and nonconvex}

In the nonconvex case we cannot expect to find a global optimum efficiently. Suppose instead that we are satisfied with any (approximate) stationary point. That is, our goal is merely to find a point $\boldsymbol{x}$ satisfying $\|\nabla f(\boldsymbol{x})\|_2 \leq \varepsilon$ (called an $\varepsilon$-approximate stationary point). The descent inequality, which relies only on smoothness, gives
\[
\frac{1}{2 L}\left\|\nabla f\left(\boldsymbol{x}^k\right)\right\|_2^2 \leq f\left(\boldsymbol{x}^k\right)-f\left(\boldsymbol{x}^{k+1}\right), \quad \forall k .
\]
Summing from $k=0$ to $k=t-1$:
\[
\begin{aligned}
\frac{1}{2 L} \sum_{k=0}^{t-1}\left\|\nabla f\left(\boldsymbol{x}^k\right)\right\|_2^2 & \leq \sum_{k=0}^{t-1}\left(f\left(\boldsymbol{x}^k\right)-f\left(\boldsymbol{x}^{k+1}\right)\right)=f\left(\boldsymbol{x}^0\right)-f\left(\boldsymbol{x}^t\right) \\
& \leq f\left(\boldsymbol{x}^0\right)-f\left(\boldsymbol{x}^*\right) \\
\Longrightarrow \quad \min _{0 \leq k<t}\left\|\nabla f\left(\boldsymbol{x}^k\right)\right\|_2 & \leq \sqrt{\frac{2 L\left(f\left(\boldsymbol{x}^0\right)-f\left(\boldsymbol{x}^*\right)\right)}{t}} .
\end{aligned}
\]
The last step uses that the smallest of the $t$ squared gradient norms is at most their average.
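A small numerical sketch of this stationarity bound follows. The test function is an assumption made for illustration only: $f(x)=x^2+3\sin^2(x)$ has $f''(x)=2+6\cos(2x)\in[-4,8]$, so it is nonconvex and $L$-smooth with $L=8$, and its global minimum value is $f^*=0$ at $x=0$. The code runs gradient descent with step size $1/L$ and compares the smallest gradient norm seen with $\sqrt{2L(f(x^0)-f^*)/t}$.

```python
import math

L = 8.0       # bound on |f''| for the assumed test function
F_STAR = 0.0  # global minimum value, attained at x = 0


def f(x):
    """Nonconvex but smooth: f(x) = x^2 + 3 sin^2(x)."""
    return x * x + 3.0 * math.sin(x) ** 2


def grad(x):
    """f'(x) = 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


x0 = 3.0
t = 100
x = x0
grad_norms = []    # |f'(x^k)| for k = 0, ..., t-1
descent_ok = True
for _ in range(t):
    g = grad(x)
    x_next = x - g / L
    # Per-step descent: (1 / 2L) |f'(x^k)|^2 <= f(x^k) - f(x^{k+1})
    descent_ok = descent_ok and (g * g / (2.0 * L) <= f(x) - f(x_next) + 1e-12)
    grad_norms.append(abs(g))
    x = x_next

min_grad_norm = min(grad_norms)
bound = math.sqrt(2.0 * L * (f(x0) - F_STAR) / t)
print("descent inequality holds:", descent_ok)
print("min gradient norm <= bound:", min_grad_norm <= bound)
```

The bound only controls the best gradient norm among the first $t$ iterates. It says nothing about which stationary point is approached or about the function value there.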
\subsection{Regularity Condition}
From another perspective, consider the distance to the optimum under step size $\frac{1}{L}$:
\[
\begin{aligned}
\left\|\boldsymbol{x}^{t+1}-\boldsymbol{x}^*\right\|_2^2 & =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*-\frac{1}{L} \nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2 \\
& =\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2+\frac{1}{L^2}\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|_2^2-\frac{2}{L}\left\langle\boldsymbol{x}^t-\boldsymbol{x}^*, \nabla f\left(\boldsymbol{x}^t\right)\right\rangle \\
& \overset{(\mathrm{i})}{\leq}\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2-\frac{\mu}{L}\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2 \\
& =\left(1-\frac{\mu}{L}\right)\left\|\boldsymbol{x}^t-\boldsymbol{x}^*\right\|_2^2 .
\end{aligned}
\]
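In this way we again obtain a linear convergence rate. For inequality (i), the only inequality in the chain above, to hold, we need
\[
\left\langle\nabla f(\boldsymbol{x}), \boldsymbol{x}-\boldsymbol{x}^*\right\rangle \geq \frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2+\frac{1}{2 L}\|\nabla f(\boldsymbol{x})\|_2^2, \quad \forall \boldsymbol{x} .
\]
This is called the regularity condition (RC). Intuitively, it looks like a combination of strong convexity and smoothness, which suggests that it is weaker than requiring strong convexity and smoothness together. The proof that strong convexity and smoothness imply it follows, where $\boldsymbol{x}^{+}=\boldsymbol{x}-\frac{1}{L} \nabla f(\boldsymbol{x})$: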
\[
\begin{aligned}
0 & \leq f\left(\boldsymbol{x}^{+}\right)-f\left(\boldsymbol{x}^*\right)=f\left(\boldsymbol{x}^{+}\right)-f(\boldsymbol{x})+f(\boldsymbol{x})-f\left(\boldsymbol{x}^*\right) \\
& \leq \underbrace{\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}^{+}-\boldsymbol{x}\right)+\frac{L}{2}\left\|\boldsymbol{x}^{+}-\boldsymbol{x}\right\|_2^2}_{\text{smoothness}}+\underbrace{\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}-\boldsymbol{x}^*\right)-\frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2}_{\text{strong convexity}} \\
& =\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}^{+}-\boldsymbol{x}^*\right)+\frac{1}{2 L}\|\nabla f(\boldsymbol{x})\|_2^2-\frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2 \\
& =\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}^{+}-\boldsymbol{x}+\boldsymbol{x}-\boldsymbol{x}^*\right)+\frac{1}{2 L}\|\nabla f(\boldsymbol{x})\|_2^2-\frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2 \\
& =\nabla f(\boldsymbol{x})^{\top}\left(\boldsymbol{x}-\boldsymbol{x}^*\right)-\frac{1}{2 L}\|\nabla f(\boldsymbol{x})\|_2^2-\frac{\mu}{2}\left\|\boldsymbol{x}-\boldsymbol{x}^*\right\|_2^2 .
\end{aligned}
\]
Rearranging terms yields the regularity condition (RC).
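As a sanity check, the regularity condition and the resulting contraction of $\|\boldsymbol{x}-\boldsymbol{x}^*\|_2^2$ can be evaluated on a grid of points. This is only a sketch under assumptions not made in the original text: the function is the toy quadratic $f(\boldsymbol{x})=\frac{1}{2}(\mu x_1^2+L x_2^2)$ with $\mu=1$, $L=10$ and $\boldsymbol{x}^*=\mathbf{0}$, and the helper names are hypothetical.

```python
MU, L = 1.0, 10.0  # constants of the assumed quadratic; its minimizer is the origin


def grad(x):
    """Gradient of f(x) = 0.5 * (MU * x1^2 + L * x2^2)."""
    return [MU * x[0], L * x[1]]


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(vi * vi for vi in v)


def rc_gap(x):
    """<grad f(x), x - x*> - (MU/2)|x - x*|^2 - (1/2L)|grad f(x)|^2, with x* = 0."""
    g = grad(x)
    inner = sum(gi * xi for gi, xi in zip(g, x))
    return inner - 0.5 * MU * sq_norm(x) - sq_norm(g) / (2.0 * L)


def gd_step(x):
    """One gradient step with step size 1/L."""
    g = grad(x)
    return [xi - gi / L for xi, gi in zip(x, g)]


# Evaluate both properties on a 13 x 13 grid over [-3, 3]^2.
grid = [[0.5 * i, 0.5 * j] for i in range(-6, 7) for j in range(-6, 7)]

rc_holds = all(rc_gap(x) >= -1e-9 for x in grid)
contraction_holds = all(
    sq_norm(gd_step(x)) <= (1.0 - MU / L) * sq_norm(x) + 1e-9 for x in grid
)
print("regularity condition holds on grid:", rc_holds)
print("contraction by (1 - mu/L) holds on grid:", contraction_holds)
```

A grid check is of course not a proof. It only confirms that the inequality and the contraction factor $1-\mu/L$ are consistent with each other on this example.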

\subsection{Nonsmooth case}

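In the nonsmooth case, we study the convergence of the subgradient method. The subgradient iteration is
\[
x^{k+1}=x^k-\alpha_k g^k, \quad g^k \in \partial f\left(x^k\right) .
\tag{1.1}
\]
Nonsmoothness is extremely unfriendly to convergence analysis, and a negative subgradient is not even guaranteed to be a descent direction. In the smooth case, as long as $\alpha_k$ is chosen small enough (depending on the smoothness constant $L_g$), the function value is at least guaranteed to decrease at every step, which is usually enforced by a line search. Since a subgradient step need not be a descent step, this property is lost, and monotone decrease cannot be guaranteed. Next, I discuss convergence in two cases: 1. convex; 2. strongly convex.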
1.1. Convex case

Theorem 1.1 (Nonsmooth + convex). Suppose $f$ is convex and Lipschitz continuous with constant $L_f$. For the iteration (1.1), choose the step size $\alpha_k=\frac{f\left(x^k\right)-f^*}{\left\|g^k\right\|^2}$ if $g^k \neq 0$, and $\alpha_k=1$ otherwise. Then we have:

  1. $\left\|x^{k+1}-x^*\right\|^2 \leq\left\|x^k-x^*\right\|^2$
  2. $f\left(x^k\right) \rightarrow f^*$ as $k \rightarrow \infty$
  3. $f_{\text{best}}^n-f^* \leq \frac{L_f\left\|x^0-x^*\right\|}{\sqrt{n+1}}$, where $f_{\text{best}}^n=\min _k\left\{f\left(x^k\right), k=0, \cdots, n\right\}$
Proof: By (0.6) with $x=x^*$, we obtain
\[
\begin{aligned}
\left\|x^{k+1}-x^*\right\|^2 & =\left\|x^k-x^*\right\|^2-2 \alpha_k\left\langle g^k, x^k-x^*\right\rangle+\alpha_k^2\left\|g^k\right\|^2 \\
& \leq\left\|x^k-x^*\right\|^2-2 \alpha_k\left(f\left(x^k\right)-f^*\right)+\alpha_k^2\left\|g^k\right\|^2 \\
& =\left\|x^k-x^*\right\|^2-\frac{\left(f\left(x^k\right)-f^*\right)^2}{\left\|g^k\right\|^2} \\
& \leq\left\|x^k-x^*\right\|^2-\frac{\left(f\left(x^k\right)-f^*\right)^2}{L_f^2} .
\end{aligned}
\]
The first inequality uses convexity, the second equality substitutes the step size, and the second inequality uses the Lipschitz continuity of $f$, which gives $\left\|g^k\right\| \leq L_f$. This chain establishes the first claim of the theorem. Summing it over $k=0, \cdots, n$ gives
\[
\begin{aligned}
\frac{1}{L_f^2} \sum_{k=0}^n\left(f\left(x^k\right)-f^*\right)^2 & \leq \sum_{k=0}^n\left\{\left\|x^k-x^*\right\|^2-\left\|x^{k+1}-x^*\right\|^2\right\} \\
& =\left\|x^0-x^*\right\|^2-\left\|x^{n+1}-x^*\right\|^2 .
\end{aligned}
\]
Dropping the nonpositive term and moving the coefficient to the right-hand side:
\[
\sum_{k=0}^n\left(f\left(x^k\right)-f^*\right)^2 \leq L_f^2\left\|x^0-x^*\right\|^2 .
\]
Since the right-hand side is bounded independently of $n$, the series converges, so its terms tend to zero and $f\left(x^k\right) \rightarrow f^*$ as $k \rightarrow \infty$. Finally,
\[
(n+1)\left(f_{\text{best}}^n-f^*\right)^2 \leq \sum_{k=0}^n\left(f\left(x^k\right)-f^*\right)^2 \leq L_f^2\left\|x^0-x^*\right\|^2 .
\]
Dividing by $n+1$ and taking square roots gives the third claim of the theorem. This completes the proof.
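Remark: This step size rule is called Polyak's step size. It involves the optimal value $f^*$, which is usually unavailable in practice. The rule is applicable only to problems where the optimal value is known in advance and the task is to find an optimal solution $x^*$ satisfying $f\left(x^*\right)=f^*$. Other step size rules exist as well; I omit them here because the proofs are similar.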
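The subgradient method with Polyak's step size can be sketched in a few lines. The example below is an illustration under assumptions that do not come from the original text: it uses $f(x)=|x_1|+2|x_2|$, which is convex and Lipschitz with $L_f=\sqrt{5}$, has $x^*=\mathbf{0}$ and a known optimal value $f^*=0$. It checks the first and third claims of Theorem 1.1, with $f_{\text{best}}^n$ taken over $k=0,\cdots,n$.

```python
import math

F_STAR = 0.0             # known optimal value of the assumed test function
L_F = math.sqrt(5.0)     # Lipschitz constant: largest possible subgradient norm


def f(x):
    """Nonsmooth convex test function f(x) = |x1| + 2 |x2|."""
    return abs(x[0]) + 2.0 * abs(x[1])


def sign(v):
    """Sign of v, with sign(0) = 0 (a valid subgradient choice for |v| at 0)."""
    return (v > 0) - (v < 0)


def subgrad(x):
    """One subgradient of f at x."""
    return [1.0 * sign(x[0]), 2.0 * sign(x[1])]


def polyak_subgradient(x0, n):
    """Run n steps of x^{k+1} = x^k - alpha_k g^k with Polyak's step size."""
    x = list(x0)
    history = [list(x)]
    for _ in range(n):
        g = subgrad(x)
        g_sq = g[0] ** 2 + g[1] ** 2
        # alpha_k = (f(x^k) - f*) / |g^k|^2 if g^k != 0, else 1 (as in the theorem)
        alpha = (f(x) - F_STAR) / g_sq if g_sq > 0 else 1.0
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        history.append(list(x))
    return history


history = polyak_subgradient([3.0, -1.0], n=30)
dists = [math.hypot(x[0], x[1]) for x in history]  # |x^k - x*| with x* = 0
values = [f(x) for x in history]

# Claim 1: the distance to x* never increases.
distances_nonincreasing = all(
    dists[k + 1] <= dists[k] + 1e-12 for k in range(len(dists) - 1)
)
# Claim 3: f_best^m - f* <= L_f |x^0 - x*| / sqrt(m + 1) for every m.
bound_holds = all(
    min(values[: m + 1]) - F_STAR <= L_F * dists[0] / math.sqrt(m + 1) + 1e-12
    for m in range(len(values))
)
print("distance nonincreasing:", distances_nonincreasing)
print("best-value bound holds:", bound_holds)
```

On this particular example the function values happen to decrease much faster than the $1/\sqrt{n+1}$ guarantee, which is a worst-case rate.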