First Order Methods in Optimization Ch3. Subgradients (Part II)

Chapter 3: Subgradients (Part II)

3. Directional Derivatives

3.1 Definition and Basic Properties

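Let $f:\mathbb{E}\to(-\infty,\infty]$ be a proper function and let $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ . The directional derivative of $f$ at $\mathbf{x}$ in a given direction $\mathbf{d}\in\mathbb{E}$ , if it exists, is defined by $f'(\mathbf{x};\mathbf{d})=\lim_{\alpha\to0^+}\frac{f(\mathbf{x}+\alpha\mathbf{d})-f(\mathbf{x})}{\alpha}.$ A convex function has a directional derivative in every direction at every point in the interior of its domain. This is Theorem 8 below.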

Theorem 8 Let $f:\mathbb{E}\to(-\infty,\infty]$ be a proper convex function and let $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ . Then for any $\mathbf{d}\in\mathbb{E}$ , the directional derivative $f'(\mathbf{x;d})$ exists.
Proof: Let $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ and $\mathbf{d}\ne\mathbf{0}$ (for $\mathbf{d}=\mathbf{0}$ the difference quotient is identically zero, so the limit trivially exists). The directional derivative, if it exists, is the limit $\lim_{t\to0^+}\frac{g(t)-g(0)}{t},$ where $g(t)=f(\mathbf{x}+t\mathbf{d})$ . Defining $h(t)\equiv\frac{g(t)-g(0)}{t}$ , this limit can be written equivalently as $\lim_{t\to0^+}h(t).$ Choose $\epsilon>0$ such that $\mathbf{x}+t\mathbf{d},\mathbf{x}-t\mathbf{d}\in\mathrm{int}(\mathrm{dom}(f))$ for all $t\in[0,\epsilon]$ . Now let $0<t_1<t_2\le\epsilon$ . Then $\mathbf{x}+t_1\mathbf{d}=\left(1-\frac{t_1}{t_2}\right)\mathbf{x}+\frac{t_1}{t_2}(\mathbf{x}+t_2\mathbf{d}),$ so by the convexity of $f$ we have $f(\mathbf{x}+t_1\mathbf{d})\le\left(1-\frac{t_1}{t_2}\right)f(\mathbf{x})+\frac{t_1}{t_2}f(\mathbf{x}+t_2\mathbf{d}).$ Rearranging, this inequality becomes $\frac{f(\mathbf{x}+t_1\mathbf{d})-f(\mathbf{x})}{t_1}\le\frac{f(\mathbf{x}+t_2\mathbf{d})-f(\mathbf{x})}{t_2},$ which is the same as $h(t_1)\le h(t_2)$ . Hence $h$ is nondecreasing on $(0,\epsilon]$ . Since a monotone function that is bounded below has a limit as $t\to0^+$ , it remains to show that $h$ is bounded below on $(0,\epsilon]$ . Indeed, take $0<t\le\epsilon$ and note that $\mathbf{x}=\frac{\epsilon}{\epsilon+t}(\mathbf{x}+t\mathbf{d})+\frac{t}{\epsilon+t}(\mathbf{x}-\epsilon\mathbf{d}).$ Using the convexity of $f$ again, we have $f(\mathbf{x})\le\frac{\epsilon}{\epsilon+t}f(\mathbf{x}+t\mathbf{d})+\frac{t}{\epsilon+t}f(\mathbf{x}-\epsilon\mathbf{d}),$ which after rearranging gives $h(t)=\frac{f(\mathbf{x}+t\mathbf{d})-f(\mathbf{x})}{t}\ge\frac{f(\mathbf{x})-f(\mathbf{x}-\epsilon\mathbf{d})}{\epsilon},$ showing that $h$ is bounded below on $(0,\epsilon]$ . This completes the proof.
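The key step of the proof, monotonicity of the difference quotient $h(t)$ , can be observed numerically. The following is a minimal sketch, assuming the $\ell_1$ norm on $\mathbb{R}^3$ as the convex test function; the function, the point `x` and the direction `d` are illustrative choices and not part of the original text.

```python
def f(x):
    """Convex test function: the l1 norm (an illustrative choice)."""
    return sum(abs(xi) for xi in x)

def quotient(f, x, d, t):
    """Difference quotient h(t) = (f(x + t*d) - f(x)) / t for t > 0."""
    shifted = [xi + t * di for xi, di in zip(x, d)]
    return (f(shifted) - f(x)) / t

x = [1.0, 0.0, -2.0]
d = [-3.0, 1.0, 1.0]

# Step sizes decreasing towards 0+: 1, 1/2, 1/4, ...
ts = [2.0 ** (-k) for k in range(12)]
hs = [quotient(f, x, d, t) for t in ts]

# h is nondecreasing in t, so it is nonincreasing along this sequence.
is_monotone = all(hs[i] >= hs[i + 1] - 1e-12 for i in range(len(hs) - 1))
print(hs[:4], is_monotone)
```

For this choice the quotients decrease as $t$ shrinks and then stay constant, which is consistent with the limit $f'(\mathbf{x};\mathbf{d})$ being the infimum of $h$ over $t>0$ .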

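We now discuss some basic properties of the function $\mathbf{d}\mapsto f'(\mathbf{x};\mathbf{d})$ . The following lemma shows that it is convex and positively homogeneous of degree one.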

Lemma 2 (convexity and homogeneity of $\mathbf{d}\mapsto f'(\mathbf{x;d})$ ) Let $f:\mathbb{E}\to(-\infty,\infty]$ be a proper convex function and let $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ . Then
(i) the function $\mathbf{d}\mapsto f'(\mathbf{x;d})$ is convex;
(ii) for any $\lambda\ge0$ and $\mathbf{d}\in\mathbb{E}$ , $f'(\mathbf{x;\lambda d})=\lambda f'(\mathbf{x;d})$ .

Proof: (i) To show that $g(\mathbf{d})\equiv f'(\mathbf{x;d})$ is convex, take $\mathbf{d}_1,\mathbf{d}_2\in\mathbb{E}$ and $\lambda\in[0,1]$ . Then $\begin{aligned}f'(\mathbf{x}&;\lambda\mathbf{d}_1+(1-\lambda)\mathbf{d}_2)\\ &=\lim_{\alpha\to0^+}\frac{f(\mathbf{x}+\alpha[\lambda\mathbf{d}_1+(1-\lambda)\mathbf{d}_2])-f(\mathbf{x})}{\alpha}\\ &=\lim_{\alpha\to0^+}\frac{f(\lambda(\mathbf{x}+\alpha\mathbf{d}_1)+(1-\lambda)(\mathbf{x}+\alpha\mathbf{d}_2))-f(\mathbf{x})}{\alpha}\\ &\le\lim_{\alpha\to0^+}\frac{\lambda f(\mathbf{x}+\alpha\mathbf{d}_1)+(1-\lambda)f(\mathbf{x}+\alpha\mathbf{d}_2)-f(\mathbf{x})}{\alpha}\\ &=\lambda\lim_{\alpha\to0^+}\frac{f(\mathbf{x}+\alpha\mathbf{d}_1)-f(\mathbf{x})}{\alpha}+(1-\lambda)\lim_{\alpha\to0^+}\frac{f(\mathbf{x}+\alpha\mathbf{d}_2)-f(\mathbf{x})}{\alpha}\\ &=\lambda f'(\mathbf{x;d}_1)+(1-\lambda)f'(\mathbf{x;d}_2),\end{aligned}$ where the inequality follows from the convexity of $f$ and all the limits exist by Theorem 8.
(ii) If $\lambda=0$ , the claim is trivial. Take $\lambda>0$ . Then $f'(\mathbf{x;\lambda d})=\lim_{\alpha\to0^+}\frac{f(\mathbf{x}+\alpha\lambda\mathbf{d})-f(\mathbf{x})}{\alpha}=\lambda\lim_{\alpha\to0^+}\frac{f(\mathbf{x}+\alpha\lambda\mathbf{d})-f(\mathbf{x})}{\alpha\lambda}=\lambda f'(\mathbf{x;d}).$

The next result relates directional derivatives to function values.

Lemma 3 Let $f:\mathbb{E}\to(-\infty,\infty]$ be a proper convex function and let $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ . Then $f(\mathbf{y})\ge f(\mathbf{x})+f'(\mathbf{x;y-x}),\quad\forall\mathbf{y}\in\mathrm{dom}(f).$
Proof: By the definition of the directional derivative, $\begin{aligned}f'(\mathbf{x;y-x})&=\lim_{\alpha\to0^+}\frac{f(\mathbf{x}+\alpha(\mathbf{y-x}))-f(\mathbf{x})}{\alpha}\\ &=\lim_{\alpha\to0^+}\frac{f((1-\alpha)\mathbf{x}+\alpha\mathbf{y})-f(\mathbf{x})}{\alpha}\\ &\le\lim_{\alpha\to0^+}\frac{-\alpha f(\mathbf{x})+\alpha f(\mathbf{y})}{\alpha}\\ &=f(\mathbf{y})-f(\mathbf{x}),\end{aligned}$ where the inequality follows from the convexity of $f$ , since $f((1-\alpha)\mathbf{x}+\alpha\mathbf{y})\le(1-\alpha)f(\mathbf{x})+\alpha f(\mathbf{y})$ for $\alpha\in(0,1]$ .

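The following formula is useful for computing the directional derivative of the pointwise maximum of finitely many functions. It requires no convexity assumption.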

Theorem 9 (directional derivative of the maximum of finitely many functions) Let $f(\mathbf{x})=\max\{f_1(\mathbf{x}),f_2(\mathbf{x}),\ldots,f_m(\mathbf{x})\}$ , where $f_1,f_2,\ldots,f_m:\mathbb{E}\to(-\infty,\infty]$ are proper functions. Let $\mathbf{x}\in\bigcap_{i=1}^m\mathrm{int}(\mathrm{dom}(f_i))$ and $\mathbf{d}\in\mathbb{E}$ . Assume that $f_i'(\mathbf{x;d})$ exists for every $i\in\{1,2,\ldots,m\}$ . Then $f'(\mathbf{x;d})=\max_{i\in I(\mathbf{x})}f_i'(\mathbf{x;d}),$ where $I(\mathbf{x})=\{i:f_i(\mathbf{x})=f(\mathbf{x})\}$ .

Proof: For any $i\in\{1,2,\ldots,m\}$ , $\lim_{t\to0^+}f_i(\mathbf{x}+t\mathbf{d})=\lim_{t\to0^+}\left[t\frac{f_i(\mathbf{x}+t\mathbf{d})-f_i(\mathbf{x})}{t}+f_i(\mathbf{x})\right]=0\cdot f_i'(\mathbf{x;d})+f_i(\mathbf{x})=f_i(\mathbf{x}).$ By the definition of $I(\mathbf{x})$ , $f_i(\mathbf{x})>f_j(\mathbf{x})$ for all $i\in I(\mathbf{x}),j\notin I(\mathbf{x})$ . Combined with the limit above, this implies that there exists $\epsilon>0$ such that $f_i(\mathbf{x}+t\mathbf{d})>f_j(\mathbf{x}+t\mathbf{d})$ for all $i\in I(\mathbf{x}),j\notin I(\mathbf{x}),t\in(0,\epsilon]$ . Hence for all $t\in(0,\epsilon]$ , $f(\mathbf{x}+t\mathbf{d})=\max_{i=1,2,\ldots,m}f_i(\mathbf{x}+t\mathbf{d})=\max_{i\in I(\mathbf{x})}f_i(\mathbf{x}+t\mathbf{d}).$ Consequently, for all $t\in(0,\epsilon]$ , $\frac{f(\mathbf{x}+t\mathbf{d})-f(\mathbf{x})}{t}=\frac{\max_{i\in I(\mathbf{x})}f_i(\mathbf{x}+t\mathbf{d})-f(\mathbf{x})}{t}=\max_{i\in I(\mathbf{x})}\frac{f_i(\mathbf{x}+t\mathbf{d})-f_i(\mathbf{x})}{t},$ where the last equality holds because $f(\mathbf{x})=f_i(\mathbf{x})$ for all $i\in I(\mathbf{x})$ . Taking the limit $t\to0^+$ , we finally obtain $\begin{aligned}f'(\mathbf{x;d})&=\lim_{t\to0^+}\frac{f(\mathbf{x}+t\mathbf{d})-f(\mathbf{x})}{t}\\ &=\lim_{t\to0^+}\max_{i\in I(\mathbf{x})}\frac{f_i(\mathbf{x}+t\mathbf{d})-f_i(\mathbf{x})}{t}\\ &=\max_{i\in I(\mathbf{x})}\lim_{t\to0^+}\frac{f_i(\mathbf{x}+t\mathbf{d})-f_i(\mathbf{x})}{t}\\ &=\max_{i\in I(\mathbf{x})}f_i'(\mathbf{x;d}),\end{aligned}$ where the limit and the maximum can be interchanged because the maximum of finitely many numbers depends continuously on them.
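Note that Theorem 9 requires all the directional derivatives $f_i'(\mathbf{x;d})$ to exist. This holds automatically when $f_1,f_2,\ldots,f_m$ are convex, which gives the following corollary.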

Corollary 3 (directional derivative of the maximum of finitely many functions: the convex case) Let $f(\mathbf{x})=\max\{f_1(\mathbf{x}),f_2(\mathbf{x}),\ldots,f_m(\mathbf{x})\}$ , where $f_1,f_2,\ldots,f_m:\mathbb{E}\to(-\infty,\infty]$ are proper convex functions. Let $\mathbf{x}\in\bigcap_{i=1}^m\mathrm{int}(\mathrm{dom}(f_i))$ and $\mathbf{d}\in\mathbb{E}$ . Then $f'(\mathbf{x;d})=\max_{i\in I(\mathbf{x})}f_i'(\mathbf{x;d}),$ where $I(\mathbf{x})=\{i:f_i(\mathbf{x})=f(\mathbf{x})\}$ .

3.2 The Max Formula

We now prove an extremely important and widely used result, the max formula. It is the bridge between subgradients and directional derivatives.

Theorem 10 (max formula) Let $f:\mathbb{E}\to(-\infty,\infty]$ be a proper convex function. Then for any $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ and $\mathbf{d}\in\mathbb{E}$ , $f'(\mathbf{x};\mathbf{d})=\max\{\langle\mathbf{g},\mathbf{d}\rangle:\mathbf{g}\in\partial f(\mathbf{x})\}.$
Proof: Let $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ and $\mathbf{d}\in\mathbb{E}$ . By the subgradient inequality, for any $\mathbf{g}\in\partial f(\mathbf{x})$ we have $f'(\mathbf{x;d})=\lim_{\alpha\to0^+}\frac{1}{\alpha}\left(f(\mathbf{x}+\alpha\mathbf{d})-f(\mathbf{x})\right)\ge\lim_{\alpha\to0^+}\langle\mathbf{g,d}\rangle=\langle\mathbf{g,d}\rangle,$ which gives one direction of the claim: $f'(\mathbf{x;d})\ge\max\{\langle\mathbf{g,d}\rangle:\mathbf{g}\in\partial f(\mathbf{x})\}.$ To prove the reverse inequality, define the function $h(\mathbf{w})=f'(\mathbf{x;w})$ . By Theorem 8 and part (i) of Lemma 2, $h$ is a real-valued convex function, so by Corollary 1 it is subdifferentiable over the whole space $\mathbb{E}$ . Take $\tilde{\mathbf{g}}\in\partial h(\mathbf{d})$ . Then for any $\mathbf{v}\in\mathbb{E}$ and $\alpha\ge0$ , the positive homogeneity of $h$ (part (ii) of Lemma 2) yields $\alpha f'(\mathbf{x;v})=f'(\mathbf{x;\alpha v})=h(\alpha\mathbf{v})\ge h(\mathbf{d})+\langle\tilde{\mathbf{g}},\alpha\mathbf{v-d}\rangle=f'(\mathbf{x;d})+\langle\tilde{\mathbf{g}},\alpha\mathbf{v-d}\rangle.$ Rearranging, $\alpha\left(f'(\mathbf{x;v})-\langle\tilde{\mathbf{g}},\mathbf{v}\rangle\right)\ge f'(\mathbf{x;d})-\langle\tilde{\mathbf{g}},\mathbf{d}\rangle.$ Since this inequality holds for all $\alpha\ge0$ , the coefficient of $\alpha$ on the left-hand side must be nonnegative (otherwise the left-hand side would tend to $-\infty$ as $\alpha\to\infty$ ): $f'(\mathbf{x;v})\ge\langle\tilde{\mathbf{g}},\mathbf{v}\rangle.$ By Lemma 3, for any $\mathbf{y}\in\mathrm{dom}(f)$ , $f(\mathbf{y})\ge f(\mathbf{x})+f'(\mathbf{x;y-x})\ge f(\mathbf{x})+\langle\tilde{\mathbf{g}},\mathbf{y-x}\rangle,$ so $\tilde{\mathbf{g}}\in\partial f(\mathbf{x})$ . Finally, taking $\alpha=0$ in the rearranged inequality gives the reverse inequality $f'(\mathbf{x;d})\le\langle\tilde{\mathbf{g}},\mathbf{d}\rangle\le\max\{\langle\mathbf{g,d}\rangle:\mathbf{g}\in\partial f(\mathbf{x})\}.$ This completes the proof.

Remark 2: The max formula can be written more compactly in terms of the support function: $f'(\mathbf{x;d})=\sigma_{\partial f(\mathbf{x})}(\mathbf{d}).$
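As an illustration of the max formula, the following sketch compares the support function of the subdifferential with a one-sided difference quotient. It assumes the $\ell_1$ norm on $\mathbb{R}^n$ , whose subdifferential is the standard set of vectors with $g_i=\mathrm{sgn}(x_i)$ when $x_i\ne0$ and $g_i\in[-1,1]$ when $x_i=0$ ; this example, the helper names and the test data are illustrative additions, not taken from the text above.

```python
def sign(v):
    """Sign of a real number as -1, 0 or 1."""
    return (v > 0) - (v < 0)

def l1(x):
    return sum(abs(xi) for xi in x)

def support_of_subdifferential(x, d):
    """sigma_{subdiff f(x)}(d) for f = l1 norm, maximizing <g, d> per coordinate."""
    total = 0.0
    for xi, di in zip(x, d):
        if xi != 0:
            total += sign(xi) * di   # g_i is forced to equal sgn(x_i)
        else:
            total += abs(di)         # g_i ranges over [-1, 1]
    return total

x = [2.0, 0.0, -1.0, 0.0]
d = [1.0, -3.0, 4.0, 0.5]

sigma = support_of_subdifferential(x, d)

# One-sided difference quotient approximating f'(x; d).
t = 2.0 ** -20
quotient = (l1([xi + t * di for xi, di in zip(x, d)]) - l1(x)) / t
print(sigma, quotient)
```

Because the $\ell_1$ norm is piecewise linear, the quotient already agrees with the support function for every sufficiently small $t$ , not only in the limit.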

3.3 Differentiability

Definition 4 (differentiability) Let $f:\mathbb{E}\to(-\infty,\infty]$ and $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ . The function $f$ is said to be differentiable at $\mathbf{x}$ if there exists $\mathbf{g}\in\mathbb{E}^*$ such that $\lim_{\mathbf{h\to0}}\frac{f(\mathbf{x+h})-f(\mathbf{x})-\langle\mathbf{g,h}\rangle}{\Vert\mathbf{h}\Vert}=0.$ The unique¹ vector $\mathbf{g}$ satisfying this limit is called the gradient² of $f$ at $\mathbf{x}$ and is denoted by $\nabla f(\mathbf{x})$ .

Remark 3: The notion defined above is in fact Fréchet differentiability.

If $f$ is differentiable at $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ , the directional derivative has a simple representation.

Theorem 11 (directional derivative at a point of differentiability) Let $f:\mathbb{E}\to(-\infty,\infty]$ be a proper function that is differentiable at $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ . Then for any $\mathbf{d}\in\mathbb{E}$ , $f'(\mathbf{x;d})=\langle\nabla f(\mathbf{x}),\mathbf{d}\rangle.$
Proof: The claim is trivial for $\mathbf{d}=\mathbf{0}$ , so assume $\mathbf{d}\neq\mathbf{0}$ . By the definition of the directional derivative, $\begin{aligned}f'(\mathbf{x;d})&=\lim_{\alpha\to0^+}\frac{f(\mathbf{x+\alpha d})-f(\mathbf{x})}{\alpha}\\&=\lim_{\alpha\to0^+}\left[\Vert\mathbf{d}\Vert\frac{f(\mathbf{x+\alpha d})-f(\mathbf{x})-\langle\nabla f(\mathbf{x}),\alpha\mathbf{d}\rangle}{\alpha\Vert\mathbf{d}\Vert}+\frac{\langle\nabla f(\mathbf{x}),\alpha\mathbf{d}\rangle}{\alpha}\right]\\&=\langle\nabla f(\mathbf{x}),\mathbf{d}\rangle.\end{aligned}$ The third equality follows from the differentiability of $f$ at $\mathbf{x}$ : since $\alpha\Vert\mathbf{d}\Vert=\Vert\alpha\mathbf{d}\Vert$ for $\alpha>0$ , the first term inside the brackets tends to zero. This completes the proof.

Example 8 (directional derivative of the maximum of finitely many differentiable functions) Consider the function $f(\mathbf{x})=\max_{i=1,2,\ldots,m}f_i(\mathbf{x})$ , where the $f_i:\mathbb{E}\to(-\infty,\infty]$ are proper functions. Assume that $f_1,f_2,\ldots,f_m$ are differentiable at a given point $\mathbf{x}\in\bigcap_{i=1}^m\mathrm{int}(\mathrm{dom}(f_i))$ . Then by Theorem 11, for any $\mathbf{d}\in\mathbb{E}$ , $f_i'(\mathbf{x;d})=\langle\nabla f_i(\mathbf{x}),\mathbf{d}\rangle$ . By Theorem 9, $f'(\mathbf{x;d})=\max_{i\in I(\mathbf{x})}f_i'(\mathbf{x;d})=\max_{i\in I(\mathbf{x})}\langle\nabla f_i(\mathbf{x}),\mathbf{d}\rangle,$ where $I(\mathbf{x})=\{i:f_i(\mathbf{x})=f(\mathbf{x})\}$ .
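The formula of Example 8 can be sketched in code for the special case of affine pieces $f_i(\mathbf{x})=\langle\mathbf{a}_i,\mathbf{x}\rangle+b_i$ with the dot product, so that $\nabla f_i(\mathbf{x})=\mathbf{a}_i$ . The data `A`, `b`, the point `x`, the tolerance and the helper names below are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical data: f_i(x) = <a_i, x> + b_i, hence grad f_i(x) = a_i.
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, -5.0]

def f(x):
    """Pointwise maximum of the affine pieces."""
    return max(dot(a, x) + bi for a, bi in zip(A, b))

def active_set(x, tol=1e-12):
    """Indices i with f_i(x) = f(x), up to a small tolerance."""
    fx = f(x)
    return [i for i, (a, bi) in enumerate(zip(A, b))
            if abs(dot(a, x) + bi - fx) <= tol]

def dir_deriv_max(x, d):
    """f'(x; d) = max over active i of <grad f_i(x), d>."""
    return max(dot(A[i], d) for i in active_set(x))

x = [1.0, 1.0]    # the first two pieces are active here, the third is not
d = [-1.0, 2.0]

formula = dir_deriv_max(x, d)

# One-sided difference quotient as an independent check.
t = 1e-6
finite_diff = (f([xi + t * di for xi, di in zip(x, d)]) - f(x)) / t
print(active_set(x), formula, finite_diff)
```

Restricting the maximum to the active set matters: the inactive piece has $\langle\mathbf{a}_3,\mathbf{d}\rangle=-1$ here, but in general an inactive piece with a large slope would give a wrong answer if it were included.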

Example 9 (gradient of $\frac{1}{2}d_C^2(\cdot)$ ) Let $\mathbb{E}$ be a Euclidean space and let $C\subset\mathbb{E}$ be a nonempty closed convex set. Consider the function $\varphi_C:\mathbb{E}\to\mathbb{R}$ defined by $\varphi_C(\mathbf{x})=\frac{1}{2}d_C^2(\mathbf{x})=\frac{1}{2}\Vert\mathbf{x}-P_C(\mathbf{x})\Vert^2$ , where $P_C$ is the orthogonal projection operator³ onto $C$ , defined by $P_C(\mathbf{x})=\arg\min_{\mathbf{y}\in C}\Vert\mathbf{y-x}\Vert.$ We show that for any $\mathbf{x}\in\mathbb{E}$ , $\boxed{\nabla\varphi_C(\mathbf{x})=\mathbf{x}-P_C(\mathbf{x}).}$ To this end, fix $\mathbf{x}\in\mathbb{E}$ and define the function $g_{\mathbf{x}}$ by $g_{\mathbf{x}}(\mathbf{d})\equiv\varphi_C(\mathbf{x+d})-\varphi_C(\mathbf{x})-\langle\mathbf{d,z_x}\rangle,$ where $\mathbf{z_x}=\mathbf{x}-P_C(\mathbf{x})$ . By the definition of the gradient, it suffices to show that $\frac{g_{\mathbf{x}}(\mathbf{d})}{\Vert\mathbf{d}\Vert}\to0$ as $\mathbf{d}\to\mathbf{0}$ . By the definition of the orthogonal projection, for any $\mathbf{d}\in\mathbb{E}$ we have $\Vert\mathbf{x+d}-P_C(\mathbf{x+d})\Vert^2\le\Vert\mathbf{x+d}-P_C(\mathbf{x})\Vert^2,$ and therefore $\begin{aligned}g_{\mathbf{x}}(\mathbf{d})&=\frac{1}{2}\Vert\mathbf{x+d}-P_C(\mathbf{x+d})\Vert^2-\frac{1}{2}\Vert\mathbf{x}-P_C(\mathbf{x})\Vert^2-\langle\mathbf{d,z_x}\rangle\\&\le\frac{1}{2}\Vert\mathbf{x+d}-P_C(\mathbf{x})\Vert^2-\frac{1}{2}\Vert\mathbf{x}-P_C(\mathbf{x})\Vert^2-\langle\mathbf{d,z_x}\rangle\\&=\frac{1}{2}\Vert\mathbf{d}\Vert^2.\end{aligned}$ In particular, $g_{\mathbf{x}}(-\mathbf{d})\le\frac{1}{2}\Vert\mathbf{d}\Vert^2.$ One can show that $d_C$ is a convex function⁴. Since $d_C$ is also nonnegative, $\varphi_C=\frac{1}{2}d_C^2$ is convex, and hence $g_{\mathbf{x}}$ , which differs from $\mathbf{d}\mapsto\varphi_C(\mathbf{x+d})$ by an affine function, is convex as well. By Jensen's inequality, $0=g_{\mathbf{x}}(\mathbf{0})=g_{\mathbf{x}}\left(\frac{\mathbf{d+(-d)}}{2}\right)\le\frac{1}{2}g_{\mathbf{x}}(\mathbf{d})+\frac{1}{2}g_{\mathbf{x}}(-\mathbf{d}).$ Rearranging gives $g_{\mathbf{x}}(\mathbf{d})\ge-g_{\mathbf{x}}(-\mathbf{d})\ge-\frac{1}{2}\Vert\mathbf{d}\Vert^2.$ Combining this with the earlier upper bound, we conclude that $\left|g_{\mathbf{x}}(\mathbf{d})\right|\le\frac{1}{2}\Vert\mathbf{d}\Vert^2\Rightarrow\frac{g_{\mathbf{x}}(\mathbf{d})}{\Vert\mathbf{d}\Vert}\to0.$ This completes the proof.
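The boxed formula can be checked numerically. The sketch below assumes $C$ is the box $[-1,1]^n$ in $\mathbb{R}^n$ with the dot product, for which the projection is coordinate-wise clipping; the choice of set, the test point and the step size `eps` are illustrative assumptions.

```python
def project_box(x, lo=-1.0, hi=1.0):
    """Orthogonal projection onto the box [lo, hi]^n (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]

def phi(x):
    """phi_C(x) = 0.5 * d_C(x)^2 for the box C = [-1, 1]^n."""
    p = project_box(x)
    return 0.5 * sum((xi - pi) ** 2 for xi, pi in zip(x, p))

def grad_phi(x):
    """Gradient from Example 9: x - P_C(x)."""
    p = project_box(x)
    return [xi - pi for xi, pi in zip(x, p)]

x = [2.0, 0.3, -1.5]
g = grad_phi(x)

# Central finite differences along each coordinate as an independent check.
eps = 1e-6
fd = []
for i in range(len(x)):
    xp = list(x)
    xm = list(x)
    xp[i] += eps
    xm[i] -= eps
    fd.append((phi(xp) - phi(xm)) / (2 * eps))

max_err = max(abs(gi - fi) for gi, fi in zip(g, fd))
print(g, fd, max_err)
```

The second coordinate already lies inside the box, so it contributes nothing to the distance and the corresponding gradient entry is zero.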

Remark 4: The gradient of Definition 4 depends on the inner product chosen on the space, which distinguishes it from the "classical" gradient. Below we give the form of the gradient under two different inner products. Let $\mathbb{E}=\mathbb{R}^n$ .
(i) The inner product is the dot product. Then $(\nabla f(\mathbf{x}))_i=\langle\nabla f(\mathbf{x}),\mathbf{e}_i\rangle=f'(\mathbf{x};\mathbf{e}_i).$ That is, the $i$ -th component of $\nabla f(\mathbf{x})$ is $\frac{\partial f}{\partial x_i}(\mathbf{x})=f'(\mathbf{x;e}_i)$ . In this case $\nabla f(\mathbf{x})=D_f(\mathbf{x}),\quad D_f(\mathbf{x})\equiv\begin{pmatrix}\frac{\partial f}{\partial x_1}(\mathbf{x}) & \frac{\partial f}{\partial x_2}(\mathbf{x}) & \cdots & \frac{\partial f}{\partial x_n}(\mathbf{x})\end{pmatrix}^T.$ Note that the definition of the directional derivative does not depend on the choice of inner product, so applying Theorem 11 with the dot product gives $f'(\mathbf{x;d})=D_f(\mathbf{x})^T\mathbf{d}=\sum_{i=1}^n\frac{\partial f}{\partial x_i}(\mathbf{x})d_i.$
(ii) The inner product is defined by $\langle\mathbf{x},\mathbf{y}\rangle=\mathbf{x}^T\mathbf{Hy},$ where $\mathbf{H}$ is a given $n\times n$ symmetric positive definite matrix. Consider again the $i$ -th component of $\nabla f(\mathbf{x})$ : $\begin{aligned}(\nabla f(\mathbf{x}))_i&=\nabla f(\mathbf{x})^T\mathbf{e}_i\\&=\nabla f(\mathbf{x})^T\mathbf{H}(\mathbf{H}^{-1}\mathbf{e}_i)\\&=\langle\nabla f(\mathbf{x}),\mathbf{H}^{-1}\mathbf{e}_i\rangle\\&=f'(\mathbf{x};\mathbf{H}^{-1}\mathbf{e}_i)\\&=D_f(\mathbf{x})^T\mathbf{H}^{-1}\mathbf{e}_i.\end{aligned}$ The gradient is now a weighted version of the classical gradient: $\nabla f(\mathbf{x})=\mathbf{H}^{-1}D_f(\mathbf{x}).$
Analogous conclusions hold for $\mathbb{E}=\mathbb{R}^{m\times n}$ :
(i) The inner product is the dot product: $\langle\mathbf{x,y}\rangle=\mathrm{Tr}(\mathbf{x}^T\mathbf{y}),\quad\forall\mathbf{x,y}\in\mathbb{R}^{m\times n}.$ Given a proper function $f:\mathbb{R}^{m\times n}\to(-\infty,\infty]$ and $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ , if $f$ is differentiable at $\mathbf{x}$ , then its gradient at $\mathbf{x}$ is $\nabla f(\mathbf{x})=D_f(\mathbf{x})$ , where $D_f(\mathbf{x})$ is the $m\times n$ matrix $D_f(\mathbf{x})=\left(\frac{\partial f}{\partial x_{ij}}(\mathbf{x})\right)_{i,j}.$
(ii) The inner product is defined by $\langle\mathbf{x,y}\rangle=\mathrm{Tr}(\mathbf{x}^T\mathbf{Hy}),$ where $\mathbf{H}$ is an $m\times m$ symmetric positive definite matrix. Then $\nabla f(\mathbf{x})=\mathbf{H}^{-1}D_f(\mathbf{x}).$
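The relation $\nabla f(\mathbf{x})=\mathbf{H}^{-1}D_f(\mathbf{x})$ for $\mathbb{E}=\mathbb{R}^n$ can be sanity-checked on a small case. The following sketch assumes $n=2$ , a hypothetical matrix `H`, and the test function $f(\mathbf{x})=x_1^2+3x_1x_2$ ; none of these come from the text. It verifies that $\langle\nabla f(\mathbf{x}),\mathbf{d}\rangle=\nabla f(\mathbf{x})^T\mathbf{H}\mathbf{d}$ agrees with $D_f(\mathbf{x})^T\mathbf{d}$ .

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def inverse_2x2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical symmetric positive definite matrix defining <x, y> = x^T H y.
H = [[2.0, 1.0], [1.0, 2.0]]
H_inv = inverse_2x2(H)

def partials(x):
    """Vector of partial derivatives D_f(x) for f(x) = x1^2 + 3*x1*x2."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]

x = [1.0, 2.0]
d = [0.5, -1.0]

Df = partials(x)               # classical gradient (dot-product case)
grad_H = matvec(H_inv, Df)     # gradient with respect to the H-inner product

lhs = dot(grad_H, matvec(H, d))   # <grad_H, d> in the H-inner product
rhs = dot(Df, d)                  # directional derivative D_f(x)^T d
print(Df, grad_H, lhs, rhs)
```

The two vectors `Df` and `grad_H` differ, yet both represent the same directional derivative once each is paired with its own inner product.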
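Differentiability and the subdifferential are closely related.
Theorem 12 (subdifferential at a point of differentiability) Let $f:\mathbb{E}\to(-\infty,\infty]$ be a proper convex function and let $\mathbf{x}\in\mathrm{int}(\mathrm{dom}(f))$ . Then $f$ is differentiable at $\mathbf{x}$ if and only if $\partial f(\mathbf{x})$ is a singleton, in which case $\partial f(\mathbf{x})=\{\nabla f(\mathbf{x})\}$ .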

Proof: (Necessity) Suppose that $f$ is differentiable at $\mathbf{x}$ . Then by Theorem 11, for any $\mathbf{d}\in\mathbb{E}$ , $f'(\mathbf{x;d})=\langle\nabla f(\mathbf{x}),\mathbf{d}\rangle.$ Take any $\mathbf{g}\in\partial f(\mathbf{x})$ . We show that $\mathbf{g}=\nabla f(\mathbf{x})$ . By the max formula (Theorem 10), $\langle\nabla f(\mathbf{x}),\mathbf{d}\rangle=f'(\mathbf{x;d})\ge\langle\mathbf{g},\mathbf{d}\rangle,$ so that $\langle\mathbf{g}-\nabla f(\mathbf{x}),\mathbf{d}\rangle\le0,\quad\forall\mathbf{d}\in\mathbb{E}.$ In particular this holds for all $\mathbf{d}$ with $\Vert\mathbf{d}\Vert\le1$ , and taking the maximum over such $\mathbf{d}$ gives $\Vert\mathbf{g}-\nabla f(\mathbf{x})\Vert_*\le0$ . Hence $\mathbf{g}=\nabla f(\mathbf{x})$ . Moreover, by Theorem 3, $\partial f(\mathbf{x})$ is nonempty. Therefore $\partial f(\mathbf{x})=\{\nabla f(\mathbf{x})\}$ .

(Sufficiency) Suppose that the subdifferential of $f$ at $\mathbf{x}$ is a singleton, and let $\mathbf{g}$ be the unique subgradient of $f$ at $\mathbf{x}$ . Define the auxiliary function $h(\mathbf{u})=f(\mathbf{x+u})-f(\mathbf{x})-\langle\mathbf{g,u}\rangle.$ It suffices to show that $\frac{h(\mathbf{u})}{\Vert\mathbf{u}\Vert}\to0$ as $\mathbf{u}\to\mathbf{0}$ . It is easy to verify that $\mathbf{0}$ is the unique subgradient of $h$ at $\mathbf{0}$ . Indeed, $\mathbf{0}\in\partial h(\mathbf{0})$ by the subgradient inequality for $\mathbf{g}$ , and for any $\mathbf{y}\in\partial h(\mathbf{0})$ we have $f(\mathbf{x+z})-f(\mathbf{x})-\langle\mathbf{g},\mathbf{z}\rangle=h(\mathbf{z})\ge h(\mathbf{0})+\langle\mathbf{y,z}\rangle=\langle\mathbf{y,z}\rangle,\quad\forall\mathbf{z}\in\mathbb{E}.$ Hence $f(\mathbf{x+z})\ge f(\mathbf{x})+\langle\mathbf{g+y,z}\rangle,\quad\forall\mathbf{z}\in\mathbb{E}.$ This shows that $\mathbf{g+y}\in\partial f(\mathbf{x})$ . But $\partial f(\mathbf{x})=\{\mathbf{g}\}$ , so $\mathbf{y}=\mathbf{0}$ .
Since $\mathbf{0}\in\mathrm{int}(\mathrm{dom}(h))$ , the max formula (Theorem 10) gives, for any $\mathbf{d}\in\mathbb{E}$ , $h'(\mathbf{0;d})=\sigma_{\partial h(\mathbf{0})}(\mathbf{d})=0.$ Hence for any $\mathbf{d}\in\mathbb{E}$ , $0=h'(\mathbf{0;d})=\lim_{\alpha\to0^+}\frac{h(\alpha\mathbf{d})-h(\mathbf{0})}{\alpha}=\lim_{\alpha\to0^+}\frac{h(\alpha\mathbf{d})}{\alpha}.$ Note that this is not yet the desired limit: here $\mathbf{d}$ is fixed, which corresponds to approaching the origin along a single fixed ray. To prove the desired limit, we need to use the fact that $\mathbf{0}$ is an interior point of the domain of $h$ .
Let $\{\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_k\}$ be an orthonormal basis of $\mathbb{E}$ . Since $\mathbf{0}\in\mathrm{int}(\mathrm{dom}(h))$ , there exists $\epsilon\in(0,1)$ such that $\epsilon\mathbf{v}_i,-\epsilon\mathbf{v}_i\in\mathrm{dom}(h),\,i=1,2,\ldots,k$ . Because $\mathrm{dom}(h)$ is convex, the convex hull $D=\mathrm{conv}\left(\{\pm\epsilon\mathbf{v}_i\}_{i=1}^k\right)$ satisfies $D\subset\mathrm{dom}(h)$ . Let $\Vert\cdot\Vert$ be the Euclidean norm on $\mathbb{E}$ . We claim that $B_{\Vert\cdot\Vert}[\mathbf{0},\gamma]\subset D$ , where $\gamma=\frac{\epsilon}{k}$ . Indeed, take any $\mathbf{w}\in B_{\Vert\cdot\Vert}[\mathbf{0},\gamma]$ . Then $\mathbf{w}=\sum_{i=1}^k\langle\mathbf{w},\mathbf{v}_i\rangle\mathbf{v}_i,\quad\Vert\mathbf{w}\Vert^2=\sum_{i=1}^k\langle\mathbf{w},\mathbf{v}_i\rangle^2.$ Since $\Vert\mathbf{w}\Vert\le\gamma$ , Parseval's identity gives $|\langle\mathbf{w},\mathbf{v}_i\rangle|\le\gamma$ for every $i$ . Therefore $\mathbf{w}=\sum_{i=1}^k\langle\mathbf{w,v}_i\rangle\mathbf{v}_i=\sum_{i=1}^k\frac{|\langle\mathbf{w,v}_i\rangle|}{\epsilon}[\mathrm{sgn}(\langle\mathbf{w,v}_i\rangle)\epsilon\mathbf{v}_i]+\left(1-\sum_{i=1}^k\frac{|\langle\mathbf{w,v}_i\rangle|}{\epsilon}\right)\cdot\mathbf{0}\in D.$ Here $1-\sum_{i=1}^k\frac{|\langle\mathbf{w,v}_i\rangle|}{\epsilon}\ge1-\frac{k\gamma}{\epsilon}=0$ , so $\mathbf{w}$ is a convex combination of points of $D$ . This proves $B_{\Vert\cdot\Vert}[\mathbf{0},\gamma]\subset D$ . Denote the $2k$ vectors $\{\pm\epsilon\mathbf{v}_i\}_{i=1}^k$ by $\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_{2k}$ . Take any $\mathbf{0}\ne\mathbf{u}\in B_{\Vert\cdot\Vert}[\mathbf{0},\gamma^2]$ .
We have $\gamma\frac{\mathbf{u}}{\Vert\mathbf{u}\Vert}\in B_{\Vert\cdot\Vert}[\mathbf{0},\gamma]\subset D$ , so there exists $\bm{\lambda}\in\Delta_{2k}$ such that $\gamma\frac{\mathbf{u}}{\Vert\mathbf{u}\Vert}=\sum_{i=1}^{2k}\lambda_i\mathbf{z}_i.$ Since $\frac{\Vert\mathbf{u}\Vert}{\gamma}\le\gamma<1$ , each point $\Vert\mathbf{u}\Vert\frac{\mathbf{z}_i}{\gamma}$ lies on the segment between $\mathbf{0}$ and $\mathbf{z}_i$ and hence in $\mathrm{dom}(h)$ . By the convexity of $h$ , $\begin{aligned}\frac{h(\mathbf{u})}{\Vert\mathbf{u}\Vert}&=\frac{h\left(\frac{\Vert\mathbf{u}\Vert}{\gamma}\gamma\frac{\mathbf{u}}{\Vert\mathbf{u}\Vert}\right)}{\Vert\mathbf{u}\Vert}=\frac{h\left(\sum_{i=1}^{2k}\lambda_i\frac{\Vert\mathbf{u}\Vert}{\gamma}\mathbf{z}_i\right)}{\Vert\mathbf{u}\Vert}\\&\le\sum_{i=1}^{2k}\lambda_i\frac{h\left(\Vert\mathbf{u}\Vert\frac{\mathbf{z}_i}{\gamma}\right)}{\Vert\mathbf{u}\Vert}\\&\le\max_{i=1,2,\ldots,2k}\left\{\frac{h\left(\Vert\mathbf{u}\Vert\frac{\mathbf{z}_i}{\gamma}\right)}{\Vert\mathbf{u}\Vert}\right\}.\end{aligned}$ By the limit along rays established above, $\lim_{\mathbf{u}\to\mathbf{0}}\frac{h\left(\Vert\mathbf{u}\Vert\frac{\mathbf{z}_i}{\gamma}\right)}{\Vert\mathbf{u}\Vert}=\lim_{\Vert\mathbf{u}\Vert\to0}\frac{h\left(\Vert\mathbf{u}\Vert\frac{\mathbf{z}_i}{\gamma}\right)}{\Vert\mathbf{u}\Vert}=\lim_{\alpha\to0^+}\frac{h\left(\alpha\frac{\mathbf{z}_i}{\gamma}\right)}{\alpha}=0,$ so the upper bound tends to zero. On the other hand, since $\mathbf{0}\in\partial h(\mathbf{0})$ , we have $h(\mathbf{u})\ge h(\mathbf{0})=0$ for all $\mathbf{u}$ . Combining the two bounds, $\frac{h(\mathbf{u})}{\Vert\mathbf{u}\Vert}\to0$ as $\mathbf{u}\to\mathbf{0}$ . This completes the proof.

Example 10 (subdifferential of the $\ell_2$ -norm) Let $f:\mathbb{R}^n\to\mathbb{R}$ be defined by $f(\mathbf{x})=\Vert\mathbf{x}\Vert_2$ . The subdifferential of $f$ at $\mathbf{0}$ was already discussed in Example 1. When $\mathbf{x\ne0}$ , $f$ is differentiable at $\mathbf{x}$ with gradient $\frac{\mathbf{x}}{\Vert\mathbf{x}\Vert_2}$ .⁵ Using Theorem 12, we can write down a complete description of the subdifferential of $f$ : $\boxed{\partial f(\mathbf{x})=\left\{\begin{array}{ll}\left\{\frac{\mathbf{x}}{\Vert\mathbf{x}\Vert_2}\right\}, & \mathbf{x\ne0},\\B_{\Vert\cdot\Vert_2}[\mathbf{0},1], & \mathbf{x=0}.\end{array}\right.}$ In particular, for $n=1$ we obtain the subdifferential of the one-dimensional function $g(x)=|x|$ : $\partial g(x)=\left\{\begin{array}{ll}\{\mathrm{sgn}(x)\}, & x\ne0,\\ [-1,1], & x=0.\end{array}\right.$
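The description of $\partial f$ in Example 10 translates directly into a routine that returns one subgradient of the $\ell_2$ norm. The sketch below is an illustration under assumptions: the helper names are hypothetical, and returning the zero vector at the origin is just one arbitrary choice from the unit ball. The subgradient inequality is checked on a few sample points only, which illustrates the inequality but does not prove it.

```python
import math

def norm2(x):
    return math.sqrt(sum(xi * xi for xi in x))

def subgradient_l2(x):
    """Return one element of the subdifferential of the l2 norm at x."""
    n = norm2(x)
    if n == 0.0:
        # Any point of the closed unit ball is valid; the zero vector is one choice.
        return [0.0] * len(x)
    return [xi / n for xi in x]

def satisfies_subgradient_inequality(x, g, ys, tol=1e-12):
    """Check f(y) >= f(x) + <g, y - x> on the sample points ys."""
    fx = norm2(x)
    return all(
        norm2(y) >= fx + sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x)) - tol
        for y in ys
    )

ys = [[1.0, 0.0], [-2.0, 3.0], [0.0, 0.0], [0.5, -0.5]]

x1 = [3.0, 4.0]
ok_nonzero = satisfies_subgradient_inequality(x1, subgradient_l2(x1), ys)

x0 = [0.0, 0.0]
ok_zero = satisfies_subgradient_inequality(x0, subgradient_l2(x0), ys)

# A vector outside the unit ball is not a subgradient at the origin.
ok_outside_ball = satisfies_subgradient_inequality(x0, [2.0, 0.0], ys)
print(ok_nonzero, ok_zero, ok_outside_ball)
```

The last check fails at the sample point $(1,0)$ , consistent with $\partial f(\mathbf{0})$ being exactly the closed unit ball.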

¹ Uniqueness can be seen as follows. Suppose that $\mathbf{g}_1,\mathbf{g}_2$ both satisfy the limit. Subtracting the two limits gives $\lim_{\mathbf{h\to0}}\frac{\langle\mathbf{g}_1-\mathbf{g}_2,\mathbf{h}\rangle}{\Vert\mathbf{h}\Vert}=0$ . Taking $\mathbf{h}=\epsilon\mathbf{e}_i$ with $\epsilon\to0^+$ yields $\left(\mathbf{g}_1-\mathbf{g}_2\right)_i=0$ . Letting $i$ range over all indices gives $\mathbf{g}_1=\mathbf{g}_2$ .
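² The "gradient" here is not quite the same as the "gradient" learned in mathematical analysis, although the two agree in some cases. The fundamental reason for the difference is that differentiability is defined differently. This is discussed in detail later in the text.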
³ When the set $C$ is nonempty, closed and convex, it is easy to verify that $P_C$ is well defined.
⁴ Indeed, for any $\mathbf{x}_1,\mathbf{x}_2\in\mathbb{E}$ and $\lambda\in[0,1]$ , the point $\lambda P_C(\mathbf{x}_1)+(1-\lambda)P_C(\mathbf{x}_2)$ belongs to $C$ because $C$ is convex, so $\begin{aligned}\min_{\mathbf{y}\in C}\Vert\mathbf{y}-[\lambda\mathbf{x}_1+(1-\lambda)\mathbf{x}_2]\Vert&\le\Vert\lambda P_C(\mathbf{x}_1)+(1-\lambda)P_C(\mathbf{x}_2)-[\lambda\mathbf{x}_1+(1-\lambda)\mathbf{x}_2]\Vert\\&\le\lambda\Vert P_C(\mathbf{x}_1)-\mathbf{x}_1\Vert+(1-\lambda)\Vert P_C(\mathbf{x}_2)-\mathbf{x}_2\Vert,\end{aligned}$ where the second inequality is the triangle inequality. Hence $\begin{aligned}d_C(\lambda\mathbf{x}_1+(1-\lambda)\mathbf{x}_2)&=\Vert\lambda\mathbf{x}_1+(1-\lambda)\mathbf{x}_2-P_C(\lambda\mathbf{x}_1+(1-\lambda)\mathbf{x}_2)\Vert\\&\le\lambda\Vert\mathbf{x}_1-P_C(\mathbf{x}_1)\Vert+(1-\lambda)\Vert\mathbf{x}_2-P_C(\mathbf{x}_2)\Vert\\&=\lambda d_C(\mathbf{x}_1)+(1-\lambda)d_C(\mathbf{x}_2).\end{aligned}$
⁵ The space is assumed to be Euclidean by default.