First Order Methods in Optimization Ch9. Mirror Descent

Chapter 9: The Mirror Descent Method


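This chapter discusses the mirror descent method (MDM) and its variants. Mirror descent is in fact a generalization of the projected subgradient method (Proj-SGM) to the non-Euclidean setting, so the discussion in this chapter is no longer restricted to Euclidean spaces.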

1. From the Projected Subgradient Method to Mirror Descent

Consider the optimization problem $(\mathrm{P})\quad\min\{f(\mathbf{x}):\mathbf{x}\in C\}.$ We make the following assumptions on it:

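Assumption 1
(i) $f:\mathbb{E}\to(-\infty,\infty]$ is a proper closed convex function;
(ii) $C\subset\mathbb{E}$ is a nonempty closed convex set;
(iii) $C\subset\mathrm{int}(\mathrm{dom}(f))$ ;
(iv) the optimal set of problem $(\mathrm{P})$ is nonempty and denoted by $X^*$ ; the optimal value is denoted by $f_{\mathrm{opt}}$ .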

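The Proj-SGM for solving problem $(\mathrm{P})$ was discussed in Chapter 8. A basic assumption running through all of Chapter 8 is that the underlying space is Euclidean, i.e., $\Vert\cdot\Vert=\sqrt{\langle\cdot,\cdot\rangle}$ . Where does the Euclidean assumption enter? Consider the general Proj-SGM iteration $\mathbf{x}^{k+1}=P_C(\mathbf{x}^k-t_kf'(\mathbf{x}^k)),\quad f'(\mathbf{x}^k)\in\partial f(\mathbf{x}^k),$ where $t_k$ is the stepsize. When the space is non-Euclidean, this iteration has a logical flaw: $\mathbf{x}^k$ lies in $\mathbb{E}$ , whereas $f'(\mathbf{x}^k)$ lies in $\mathbb{E}^*$ . Of course, at the level of elements we may identify $\mathbb{E}$ with $\mathbb{E}^*$ , but as soon as norms are involved this identification runs into trouble. This is one of the motivations for extending Proj-SGM to non-Euclidean spaces.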

To better explain the role of the Euclidean norm in Proj-SGM, we rewrite the Proj-SGM iteration in the following equivalent form: $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in C}\left\{f(\mathbf{x}^k)+\langle f'(\mathbf{x}^k),\mathbf{x}-\mathbf{x}^k\rangle+\frac{1}{2t_k}\Vert\mathbf{x}-\mathbf{x}^k\Vert^2\right\}.$ The equivalence holds because $f(\mathbf{x}^k)+\langle f'(\mathbf{x}^k),\mathbf{x}-\mathbf{x}^k\rangle+\frac{1}{2t_k}\Vert\mathbf{x}-\mathbf{x}^k\Vert^2=\frac{1}{2t_k}\Vert\mathbf{x}-[\mathbf{x}^k-t_kf'(\mathbf{x}^k)]\Vert^2+D,$ where $D$ is a constant independent of $\mathbf{x}$ . The equivalent form shows that each Proj-SGM step minimizes a linear approximation of the objective at the current iterate $\mathbf{x}^k$ plus a quadratic proximity term.

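When the norm is not induced by the inner product (that is, in the non-Euclidean setting), the above equivalence no longer holds. Nevertheless, the equivalent form suggests replacing the Euclidean distance term $\frac{1}{2}\Vert\mathbf{x-y}\Vert^2$ by some other function that is compatible with the inner product and still measures proximity. The non-Euclidean "distance" we use here is the so-called Bregman distance.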

Definition 1 (Bregman distance) Let $\omega:\mathbb{E}\to(-\infty,\infty]$ be a proper closed convex function that is differentiable over $\mathrm{dom}(\partial\omega)$ . The Bregman distance associated with $\omega$ is the function $B_{\omega}:\mathrm{dom}(\omega)\times\mathrm{dom}(\partial\omega)\to\mathbb{R}$ defined by $B_{\omega}(\mathbf{x,y})=\omega(\mathbf{x})-\omega(\mathbf{y})-\langle\nabla\omega(\mathbf{y}),\mathbf{x-y}\rangle.$
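The definition can be checked numerically with a few lines of Python. The sketch below is only an illustration: the helper names `bregman`, `half_sq_norm`, and `identity_grad` are made up for this example, vectors are plain lists, and the inner product is assumed to be the dot product. It confirms that the Euclidean choice $\omega(\mathbf{x})=\frac{1}{2}\Vert\mathbf{x}\Vert_2^2$ recovers $\frac{1}{2}\Vert\mathbf{x-y}\Vert_2^2$ .

```python
def bregman(omega, grad_omega, x, y):
    """B_omega(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>."""
    g = grad_omega(y)
    inner = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
    return omega(x) - omega(y) - inner

# Euclidean choice: omega(x) = 0.5 * ||x||_2^2, whose gradient is x itself.
def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def identity_grad(x):
    return list(x)

x = [1.0, 2.0, -1.0]
y = [0.5, 0.0, 3.0]
d = bregman(half_sq_norm, identity_grad, x, y)
half_sq_dist = 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
print(d, half_sq_dist)  # the two values agree
```

Passing a different pair `omega`, `grad_omega` (for instance the negative entropy) gives the corresponding non-Euclidean Bregman distance without changing `bregman` itself.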

For a given set $C$ , we make the following assumptions on $\omega$ .

Assumption 2 (properties of $\omega$ )
(i) $\omega$ is a proper closed convex function;
(ii) $\omega$ is differentiable over $\mathrm{dom}(\partial\omega)$ ;
(iii) $C\subset\mathrm{dom}(\omega)$ ;
(iv) $\omega+\delta_C$ is $\sigma$ -strongly convex ( $\sigma>0$ ).

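It should be pointed out that the Bregman distance is not a distance in the metric sense. It is nonnegative, and it vanishes only when its two arguments coincide; beyond that, however, it is in general neither symmetric nor does it satisfy the triangle inequality. The properties of the Bregman distance are summarized in Lemma 1.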

Lemma 1 (basic properties of the Bregman distance) Let $C\subset\mathbb{E}$ be a nonempty closed convex set, and let $\omega$ satisfy Assumption 2. Let $B_{\omega}$ be the Bregman distance associated with $\omega$ . Then
(i) $B_{\omega}(\mathbf{x,y})\ge\frac{\sigma}{2}\Vert\mathbf{x-y}\Vert^2,\,\forall\mathbf{x}\in C,\,\mathbf{y}\in C\cap\mathrm{dom}(\partial\omega)$ ;
(ii) Let $\mathbf{x}\in C,\,\mathbf{y}\in C\cap\mathrm{dom}(\partial\omega)$ . Then

$B_{\omega}(\mathbf{x,y})\ge0$ ;
$B_{\omega}(\mathbf{x,y})=0\Leftrightarrow\mathbf{x=y}$ .

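Proof: (i) follows directly from the first-order characterization of strong convexity (Theorem 6(ii) of Chapter 5). (ii) is an immediate consequence of (i).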

Suppose that $\mathbf{x}^k\in C\cap\mathrm{dom}(\partial\omega)$ . Replacing $\frac{1}{2}\Vert\mathbf{x}-\mathbf{x}^k\Vert^2$ in the equivalent form of the Proj-SGM iteration by the Bregman distance $B_{\omega}(\mathbf{x},\mathbf{x}^k)$ gives $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in C}\left\{f(\mathbf{x}^k)+\langle f'(\mathbf{x}^k),\mathbf{x}-\mathbf{x}^k\rangle+\frac{1}{t_k}B_{\omega}(\mathbf{x},\mathbf{x}^k)\right\}.$ Dropping the constant terms yields $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in C}\left\{\langle f'(\mathbf{x}^k),\mathbf{x}\rangle+\frac{1}{t_k}B_{\omega}(\mathbf{x},\mathbf{x}^k)\right\}.$ Furthermore, note that $\begin{aligned}&\langle f'(\mathbf{x}^k),\mathbf{x}\rangle+\frac{1}{t_k}B_{\omega}(\mathbf{x},\mathbf{x}^k)\\&=\frac{1}{t_k}\left[\langle t_kf'(\mathbf{x}^k)-\nabla\omega(\mathbf{x}^k),\mathbf{x}\rangle+\omega(\mathbf{x})\right]\underbrace{-\frac{1}{t_k}\omega(\mathbf{x}^k)+\frac{1}{t_k}\langle\nabla\omega(\mathbf{x}^k),\mathbf{x}^k\rangle}_{\text{constant}}.\end{aligned}$ Hence the iteration¹ simplifies to $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in C}\{\langle t_kf'(\mathbf{x}^k)-\nabla\omega(\mathbf{x}^k),\mathbf{x}\rangle+\omega(\mathbf{x})\}.$ This gives the MDM:

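[Algorithm box: Mirror Descent Method (MDM)] Initialization: pick $\mathbf{x}^0\in C\cap\mathrm{dom}(\partial\omega)$ . General step: for $k=0,1,2,\ldots$ , (a) pick a stepsize $t_k>0$ and a subgradient $f'(\mathbf{x}^k)\in\partial f(\mathbf{x}^k)$ ; (b) set $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in C}\{\langle t_kf'(\mathbf{x}^k)-\nabla\omega(\mathbf{x}^k),\mathbf{x}\rangle+\omega(\mathbf{x})\}.$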
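The MDM iteration requires solving, for some $\mathbf{a}\in\mathbb{E}^*$ , a subproblem of the form $\min_{\mathbf{x}\in C}\{\langle\mathbf{a,x}\rangle+\omega(\mathbf{x})\}.$ To show that the MDM iteration is well defined, we need to prove that this subproblem has a unique minimizer and that the minimizer lies in $C\cap\mathrm{dom}(\partial\omega)$ . To this end, we give a more general lemma.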

Lemma 2 Suppose that

$\omega:\mathbb{E}\to(-\infty,\infty]$ is a proper closed convex function that is differentiable over $\mathrm{dom}(\partial\omega)$ ;
$\psi:\mathbb{E}\to(-\infty,\infty]$ is a proper closed convex function with $\mathrm{dom}(\psi)\subset\mathrm{dom}(\omega)$ ;
$\omega+\delta_{\mathrm{dom}(\psi)}$ is $\sigma$ -strongly convex $(\sigma>0)$ .

Then the problem $\min_{\mathbf{x}\in\mathbb{E}}\{\psi(\mathbf{x})+\omega(\mathbf{x})\}$ has a unique minimizer, and it lies in $\mathrm{dom}(\psi)\cap\mathrm{dom}(\partial\omega)$ .

Proof: The problem can be written as $\min_{\mathbf{x}\in\mathbb{E}}\varphi(\mathbf{x}),$ where $\varphi=\psi+\omega$ . It is easy to see that $\varphi$ is proper and closed. Since $\omega+\delta_{\mathrm{dom}(\psi)}$ is $\sigma$ -strongly convex and $\psi$ is convex, the function $\psi+\omega+\delta_{\mathrm{dom}(\psi)}=\psi+\omega=\varphi$ is $\sigma$ -strongly convex. By Theorem 7(i) of Chapter 5, the problem has a unique minimizer $\mathbf{x}^*\in\mathrm{dom}(\varphi)=\mathrm{dom}(\psi)$ . To show that $\mathbf{x}^*\in\mathrm{dom}(\partial\omega)$ , note that by Fermat's optimality condition, $\mathbf{0}\in\partial\varphi(\mathbf{x}^*)\Rightarrow\partial\varphi(\mathbf{x}^*)\ne\emptyset$ . By the sum rule for subdifferentials (Theorem 15 of Chapter 3), $\partial\varphi(\mathbf{x}^*)=\partial\psi(\mathbf{x}^*)+\partial\omega(\mathbf{x}^*)$ . Therefore we must have $\partial\omega(\mathbf{x}^*)\ne\emptyset\Rightarrow\mathbf{x}^*\in\mathrm{dom}(\partial\omega)$ .

Theorem 1 (MDM is well defined) Suppose that Assumptions 1 and 2 hold, and let $\mathbf{a}\in\mathbb{E}^*$ . Then the problem $\min_{\mathbf{x}\in C}\{\langle\mathbf{a,x}\rangle+\omega(\mathbf{x})\}$ has a unique minimizer, and it lies in $C\cap\mathrm{dom}(\partial\omega)$ .

Proof: Apply Lemma 2 with $\psi(\mathbf{x})\equiv\langle\mathbf{a,x}\rangle+\delta_C(\mathbf{x})$ .

We list two common choices of the strongly convex function $\omega$ .

Example 1 (squared Euclidean norm) Suppose that Assumption 1 holds and that $\mathbb{E}$ is Euclidean. Define $\omega(\mathbf{x})=\frac{1}{2}\Vert\mathbf{x}\Vert^2.$ Then $\omega$ clearly satisfies the conditions of Assumption 2, and it is $1$ -strongly convex. Since $\nabla\omega(\mathbf{x})=\mathbf{x}$ , the MDM iteration becomes $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in C}\left\{\langle t_kf'(\mathbf{x}^k)-\mathbf{x}^k,\mathbf{x}\rangle+\frac{1}{2}\Vert\mathbf{x}\Vert^2\right\}.$ Completing the square shows that this is exactly the Proj-SGM iteration $\mathbf{x}^{k+1}=P_C(\mathbf{x}^k-t_kf'(\mathbf{x}^k))$ . This shows once more that MDM generalizes Proj-SGM.

Example 2 (negative entropy over the unit simplex) Suppose that Assumption 1 holds, $\mathbb{E}=\mathbb{R}^n$ is endowed with the $\ell_1$ -norm, and $C=\Delta_n$ . We take $\omega$ to be the negative entropy over the nonnegative orthant: $\omega(\mathbf{x})=\left\{\begin{array}{ll}\sum_{i=1}^nx_i\log x_i, & \mathbf{x}\in\mathbb{R}_+^n,\\\infty, & \text{otherwise}.\end{array}\right.$ By Example 10 of Chapter 5, $\omega+\delta_{\Delta_n}$ is $1$ -strongly convex with respect to the $\ell_1$ -norm. Here $\mathrm{dom}(\partial\omega)=\mathbb{R}_{++}^n$ , and $\omega$ is in fact differentiable at every point where it is subdifferentiable. Hence Assumption 2 holds. For all $\mathbf{x}\in\Delta_n,\,\mathbf{y}\in\Delta_n^+\equiv\{\mathbf{x}\in\mathbb{R}_{++}^n:\mathbf{e}^T\mathbf{x}=1\}$ , the Bregman distance associated with $\omega$ is $\begin{aligned}B_{\omega}(\mathbf{x,y})&=\sum_{i=1}^nx_i\log x_i-\sum_{i=1}^ny_i\log y_i-\sum_{i=1}^n(\log(y_i)+1)(x_i-y_i)\\&=\sum_{i=1}^nx_i\log(x_i/y_i)+\sum_{i=1}^ny_i-\sum_{i=1}^nx_i\\&=\sum_{i=1}^nx_i\log(x_i/y_i),\end{aligned}$ which is the so-called Kullback-Leibler (KL) divergence. The MDM iteration thus becomes $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in\Delta_n}\left\{\sum_{i=1}^n(t_kf_i'(\mathbf{x}^k)-1-\log(x_i^k))x_i+\sum_{i=1}^nx_i\log x_i\right\},$ where $f_i'(\mathbf{x}^k)$ is the $i$ th component of $f'(\mathbf{x}^k)$ . By Example 26 of Chapter 3, the optimal solution of this subproblem is $x_i^{k+1}=\frac{e^{\log(x_i^k)+1-t_kf_i'(\mathbf{x}^k)}}{\sum_{j=1}^ne^{\log(x_j^k)+1-t_kf_j'(\mathbf{x}^k)}},\quad i=1,2,\ldots,n,$ which simplifies to $x_i^{k+1}=\frac{x_i^ke^{-t_kf_i'(\mathbf{x}^k)}}{\sum_{j=1}^nx_j^ke^{-t_kf_j'(\mathbf{x}^k)}},\quad i=1,2,\ldots,n.$
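The closed-form update of Example 2 is short enough to sketch directly. The code below is a minimal illustration using only the standard library; the function names and the sample subgradient `g` are hypothetical. Subtracting `min(subgrad)` inside the exponent is an implementation choice not made in the text: it cancels in the ratio and only guards against overflow.

```python
import math

def kl_divergence(x, y):
    """Bregman distance of the negative entropy on the simplex (0*log 0 := 0)."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0.0)

def entropic_md_step(x, subgrad, t):
    """One MDM step: x_i <- x_i * exp(-t*g_i) / sum_j x_j * exp(-t*g_j)."""
    shift = min(subgrad)  # constant shift cancels after normalization
    w = [xi * math.exp(-t * (gi - shift)) for xi, gi in zip(x, subgrad)]
    total = sum(w)
    return [wi / total for wi in w]

x0 = [0.25, 0.25, 0.25, 0.25]   # a point of the open simplex
g = [1.0, 0.0, -1.0, 2.0]       # hypothetical subgradient f'(x0)
x1 = entropic_md_step(x0, g, t=0.5)
print(x1, kl_divergence(x1, x0))
```

The new iterate stays in the open simplex automatically, so no explicit projection is needed; mass moves toward the coordinates with the smallest subgradient entries.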

The remaining question is how to choose the stepsizes. The convergence analysis of the next section provides some guidance.

2. Convergence Analysis

2.1 Tools for the Analysis

Lemma 3 (three-points lemma) Let $\omega:\mathbb{E}\to(-\infty,\infty]$ be a proper closed convex function that is differentiable over $\mathrm{dom}(\partial\omega)$ . Let $\mathbf{a,b}\in\mathrm{dom}(\partial\omega),\,\mathbf{c}\in\mathrm{dom}(\omega)$ . Then the following identity holds: $\langle\nabla\omega(\mathbf{b})-\nabla\omega(\mathbf{a}),\mathbf{c-a}\rangle=B_{\omega}(\mathbf{c,a})+B_{\omega}(\mathbf{a,b})-B_{\omega}(\mathbf{c,b}).$

Proof: By the definition of $B_{\omega}$ , $\begin{aligned}B_{\omega}(\mathbf{c,a})&=\omega(\mathbf{c})-\omega(\mathbf{a})-\langle\nabla\omega(\mathbf{a}),\mathbf{c-a}\rangle,\\B_{\omega}(\mathbf{a,b})&=\omega(\mathbf{a})-\omega(\mathbf{b})-\langle\nabla\omega(\mathbf{b}),\mathbf{a-b}\rangle,\\B_{\omega}(\mathbf{c,b})&=\omega(\mathbf{c})-\omega(\mathbf{b})-\langle\nabla\omega(\mathbf{b}),\mathbf{c-b}\rangle.\end{aligned}$ Therefore $\begin{aligned}B_{\omega}(\mathbf{c,a})+B_{\omega}(\mathbf{a,b})-B_{\omega}(\mathbf{c,b})&=-\langle\nabla\omega(\mathbf{a}),\mathbf{c-a}\rangle-\langle\nabla\omega(\mathbf{b}),\mathbf{a-b}\rangle+\langle\nabla\omega(\mathbf{b}),\mathbf{c-b}\rangle\\&=\langle\nabla\omega(\mathbf{b})-\nabla\omega(\mathbf{a}),\mathbf{c-a}\rangle.\end{aligned}$
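As a sanity check, the three-points identity can be verified numerically. The sketch below assumes the negative entropy as $\omega$ and three arbitrarily chosen strictly positive vectors; all names are illustrative only.

```python
import math

def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def grad_neg_entropy(x):
    return [math.log(xi) + 1.0 for xi in x]

def bregman(x, y):
    """Bregman distance of the negative entropy, straight from the definition."""
    g = grad_neg_entropy(y)
    inner = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
    return neg_entropy(x) - neg_entropy(y) - inner

# Three arbitrary points with strictly positive entries.
a = [0.2, 0.3, 0.5]
b = [0.6, 0.1, 0.3]
c = [0.1, 0.8, 0.1]

ga, gb = grad_neg_entropy(a), grad_neg_entropy(b)
lhs = sum((gbi - gai) * (ci - ai) for gbi, gai, ci, ai in zip(gb, ga, c, a))
rhs = bregman(c, a) + bregman(a, b) - bregman(c, b)
print(lhs, rhs)  # equal up to rounding
```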

Theorem 2 below is the non-Euclidean counterpart of the second prox theorem.

Theorem 2 (non-Euclidean second prox theorem) Suppose that

$\omega:\mathbb{E}\to(-\infty,\infty]$ is a proper closed convex function that is differentiable over $\mathrm{dom}(\partial\omega)$ ;
$\psi:\mathbb{E}\to(-\infty,\infty]$ is a proper closed convex function satisfying $\mathrm{dom}(\psi)\subset\mathrm{dom}(\omega)$ ;
$\omega+\delta_{\mathrm{dom}(\psi)}$ is $\sigma$ -strongly convex ( $\sigma>0$ ).

Let $\mathbf{b}\in\mathrm{dom}(\partial\omega)$ , and let $\mathbf{a}$ be defined by $\mathbf{a}=\arg\min_{\mathbf{x}\in\mathbb{E}}\{\psi(\mathbf{x})+B_{\omega}(\mathbf{x,b})\}.$ Then $\mathbf{a}\in\mathrm{dom}(\partial\omega)$ , and for all $\mathbf{u}\in\mathrm{dom}(\psi)$ , $\langle\nabla\omega(\mathbf{b})-\nabla\omega(\mathbf{a}),\mathbf{u-a}\rangle\le\psi(\mathbf{u})-\psi(\mathbf{a}).$

Proof: By the definition of $B_{\omega}$ , the definition of $\mathbf{a}$ can be written as² $\mathbf{a}=\arg\min_{\mathbf{x}\in\mathbb{E}}\{\psi(\mathbf{x})-\langle\nabla\omega(\mathbf{b}),\mathbf{x}\rangle+\omega(\mathbf{x})\}.$ Applying Lemma 2 with $\psi(\mathbf{x})$ replaced by $\psi(\mathbf{x})-\langle\nabla\omega(\mathbf{b}),\mathbf{x}\rangle$ shows that $\mathbf{a}\in\mathrm{dom}(\partial\omega)$ . By Fermat's optimality condition, there exists $\psi'(\mathbf{a})\in\partial\psi(\mathbf{a})$ such that $\psi'(\mathbf{a})+\nabla\omega(\mathbf{a})-\nabla\omega(\mathbf{b})=\mathbf{0}.$ Hence, by the subgradient inequality, for all $\mathbf{u}\in\mathrm{dom}(\psi)$ , $\langle\nabla\omega(\mathbf{b})-\nabla\omega(\mathbf{a}),\mathbf{u-a}\rangle=\langle\psi'(\mathbf{a}),\mathbf{u-a}\rangle\le\psi(\mathbf{u})-\psi(\mathbf{a}).$

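Using the non-Euclidean second prox theorem together with the three-points lemma, we can prove a result analogous to the fundamental inequality for Proj-SGM from Chapter 8.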

Lemma 4 (fundamental inequality for MDM³) Suppose that Assumptions 1 and 2 hold. Let $\{\mathbf{x}^k\}_{k\ge0}$ be the sequence generated by MDM with positive stepsizes $\{t_k\}_{k\ge0}$ . Then for all $\mathbf{x}^*\in X^*,\,k\ge0$ , $t_k(f(\mathbf{x}^k)-f_{\mathrm{opt}})\le B_{\omega}(\mathbf{x}^*,\mathbf{x}^k)-B_{\omega}(\mathbf{x}^*,\mathbf{x}^{k+1})+\frac{t_k^2}{2\sigma}\Vert f'(\mathbf{x}^k)\Vert_*^2.$

Proof: By the MDM update formula and the non-Euclidean second prox theorem (with $\mathbf{b}=\mathbf{x}^k,\,\psi(\mathbf{x})\equiv t_k\langle f'(\mathbf{x}^k),\mathbf{x}\rangle+\delta_C(\mathbf{x})$ , so that $\mathbf{a}=\mathbf{x}^{k+1}$ ), we have for all $\mathbf{u}\in C$ , $\langle\nabla\omega(\mathbf{x}^k)-\nabla\omega(\mathbf{x}^{k+1}),\mathbf{u}-\mathbf{x}^{k+1}\rangle\le t_k\langle f'(\mathbf{x}^k),\mathbf{u}-\mathbf{x}^{k+1}\rangle.$ By the three-points lemma (with $\mathbf{a}=\mathbf{x}^{k+1},\,\mathbf{b}=\mathbf{x}^k,\,\mathbf{c=u}$ ), $\langle\nabla\omega(\mathbf{x}^k)-\nabla\omega(\mathbf{x}^{k+1}),\mathbf{u}-\mathbf{x}^{k+1}\rangle=B_{\omega}(\mathbf{u},\mathbf{x}^{k+1})+B_{\omega}(\mathbf{x}^{k+1},\mathbf{x}^k)-B_{\omega}(\mathbf{u},\mathbf{x}^k).$ Combining the two relations gives $B_{\omega}(\mathbf{u},\mathbf{x}^{k+1})+B_{\omega}(\mathbf{x}^{k+1},\mathbf{x}^k)-B_{\omega}(\mathbf{u},\mathbf{x}^k)\le t_k\langle f'(\mathbf{x}^k),\mathbf{u}-\mathbf{x}^{k+1}\rangle.$ Therefore, $\begin{aligned}&t_k\langle f'(\mathbf{x}^k),\mathbf{x}^k-\mathbf{u}\rangle\\&\le B_{\omega}(\mathbf{u},\mathbf{x}^k)-B_{\omega}(\mathbf{u},\mathbf{x}^{k+1})-B_{\omega}(\mathbf{x}^{k+1},\mathbf{x}^k)+t_k\langle f'(\mathbf{x}^k),\mathbf{x}^k-\mathbf{x}^{k+1}\rangle\\&\le B_{\omega}(\mathbf{u},\mathbf{x}^k)-B_{\omega}(\mathbf{u},\mathbf{x}^{k+1})-\frac{\sigma}{2}\Vert\mathbf{x}^{k+1}-\mathbf{x}^k\Vert^2+t_k\langle f'(\mathbf{x}^k),\mathbf{x}^k-\mathbf{x}^{k+1}\rangle\quad\text{(Lemma 1(i))}\\&=B_{\omega}(\mathbf{u},\mathbf{x}^k)-B_{\omega}(\mathbf{u},\mathbf{x}^{k+1})-\frac{\sigma}{2}\Vert\mathbf{x}^{k+1}-\mathbf{x}^k\Vert^2+\left\langle\frac{t_k}{\sqrt{\sigma}}f'(\mathbf{x}^k),\sqrt{\sigma}(\mathbf{x}^k-\mathbf{x}^{k+1})\right\rangle\\&\overset{(*)}{\le} B_{\omega}(\mathbf{u},\mathbf{x}^k)-B_{\omega}(\mathbf{u},\mathbf{x}^{k+1})-\frac{\sigma}{2}\Vert\mathbf{x}^{k+1}-\mathbf{x}^k\Vert^2+\frac{t_k^2}{2\sigma}\Vert f'(\mathbf{x}^k)\Vert_*^2+\frac{\sigma}{2}\Vert\mathbf{x}^{k+1}-\mathbf{x}^k\Vert^2\\&=B_{\omega}(\mathbf{u},\mathbf{x}^k)-B_{\omega}(\mathbf{u},\mathbf{x}^{k+1})+\frac{t_k^2}{2\sigma}\Vert f'(\mathbf{x}^k)\Vert^2_*,\end{aligned}$ where $(*)$ uses Fenchel's inequality (Theorem 3 of Chapter 4) applied to $\frac{1}{2}\Vert\mathbf{x}\Vert^2$ (Section 4.15 of Chapter 4). Substituting $\mathbf{u}=\mathbf{x}^*$ and using the subgradient inequality $f(\mathbf{x}^k)-f_{\mathrm{opt}}\le\langle f'(\mathbf{x}^k),\mathbf{x}^k-\mathbf{x}^*\rangle$ , we obtain $t_k(f(\mathbf{x}^k)-f_{\mathrm{opt}})\le B_{\omega}(\mathbf{x}^*,\mathbf{x}^k)-B_{\omega}(\mathbf{x}^*,\mathbf{x}^{k+1})+\frac{t_k^2}{2\sigma}\Vert f'(\mathbf{x}^k)\Vert_*^2.$

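Exactly as in the proof of Theorem 6 of Chapter 8, we can derive an upper bound on the gap between the best achieved function values $\{f_{\mathrm{best}}^k\}_{k\ge0}$ of MDM and $f_{\mathrm{opt}}$ , which in turn suggests how to choose the stepsizes.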

Lemma 5 Suppose that Assumptions 1 and 2 hold and that there exists $L_f>0$ such that $\Vert f'(\mathbf{x})\Vert_*\le L_f,\,\forall\mathbf{x}\in C$ . Let $\{\mathbf{x}^k\}_{k\ge0}$ be the sequence generated by MDM with positive stepsizes $\{t_k\}_{k\ge0}$ . Then for all $N\ge0$ and $\mathbf{x}^*\in X^*$ , $f_{\mathrm{best}}^N-f_{\mathrm{opt}}\le\frac{B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{L_f^2}{2\sigma}\sum_{k=0}^Nt_k^2}{\sum_{k=0}^Nt_k}.$

Proof: Take $\mathbf{x}^*\in X^*$ . By the fundamental inequality for MDM, for all $k\ge0$ , $t_k(f(\mathbf{x}^k)-f_{\mathrm{opt}})\le B_{\omega}(\mathbf{x}^*,\mathbf{x}^k)-B_{\omega}(\mathbf{x}^*,\mathbf{x}^{k+1})+\frac{t_k^2}{2\sigma}\Vert f'(\mathbf{x}^k)\Vert_*^2.$ Summing this inequality over $k=0,1,2,\ldots,N$ gives $\begin{aligned}\sum_{k=0}^Nt_k(f(\mathbf{x}^k)-f_{\mathrm{opt}})&\le B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)-B_{\omega}(\mathbf{x}^*,\mathbf{x}^{N+1})+\sum_{k=0}^N\frac{t_k^2}{2\sigma}\Vert f'(\mathbf{x}^k)\Vert_*^2\\&\le B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{L_f^2}{2\sigma}\sum_{k=0}^Nt_k^2\quad\text{(Lemma 1(ii))}.\end{aligned}$ Since $\left(\sum_{k=0}^Nt_k\right)(f_{\mathrm{best}}^N-f_{\mathrm{opt}})\le\sum_{k=0}^Nt_k(f(\mathbf{x}^k)-f_{\mathrm{opt}}),$ it follows that $f_{\mathrm{best}}^N-f_{\mathrm{opt}}\le\frac{B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{L_f^2}{2\sigma}\sum_{k=0}^Nt_k^2}{\sum_{k=0}^Nt_k}.$

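If, in addition, $B_{\omega}(\mathbf{x},\mathbf{x}^0)$ is bounded over $C$ , i.e., there exists $\Theta(\mathbf{x}^0)$ satisfying $\Theta(\mathbf{x}^0)\ge\max_{\mathbf{x}\in C}B_{\omega}(\mathbf{x},\mathbf{x}^0),$ then substituting into Lemma 5 gives $f_{\mathrm{best}}^N-f_{\mathrm{opt}}\le\frac{\Theta(\mathbf{x}^0)+\frac{L_f^2}{2\sigma}\sum_{k=0}^Nt_k^2}{\sum_{k=0}^Nt_k}.$ In either case, the ratio $\frac{\sum_{k=0}^Nt_k^2}{\sum_{k=0}^Nt_k}$ appears once again. This suggests stepsize rules similar to those discussed for Proj-SGM with varying stepsizes in Chapter 8. Before doing so, we first discuss a stepsize rule for a fixed number of iterations.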

2.2 Stepsize Rule for a Fixed Number of Iterations

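We now fix the number of iterations to $N$ and derive the "optimal" stepsize rule, where optimal means minimizing the right-hand side of the bound in Lemma 5. To this end, we first prove Lemmas 6 and 7 below.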

Lemma 6 Let $\mathbf{A}\in\mathbb{R}^{m\times n},\,\mathbf{b}\in\mathbb{R}^m,\,\mathbf{c}\in\mathbb{R}^n,\,d\in\mathbb{R}$ , where $\mathbf{c\ne0}$ . Then the function $g(\mathbf{x})=\frac{\Vert\mathbf{Ax+b}\Vert^2}{\mathbf{c}^T\mathbf{x}+d}$ is convex over $D=\{\mathbf{x}\in\mathbb{R}^n:\mathbf{c}^T\mathbf{x}+d>0\}$ .

Proof: Since convexity is preserved under an affine change of variables, it suffices to show that the function $h(\mathbf{y},t)=\frac{\Vert\mathbf{y}\Vert^2}{t}$ is convex over the convex set $C\equiv\left\{\begin{pmatrix}\mathbf{y}\\t\end{pmatrix}\in\mathbb{R}^{m+1}:\mathbf{y}\in\mathbb{R}^m,\,t>0\right\}.$ The function $h$ can be written as $h=\sum_{i=1}^mh_i$ , where $h_i(\mathbf{y},t)=\frac{y_i^2}{t}.$ A direct computation gives $\nabla^2h_i(y_i,t)=2\begin{pmatrix}\frac{1}{t} & -\frac{y_i}{t^2}\\-\frac{y_i}{t^2} & \frac{y_i^2}{t^3}\end{pmatrix}.$ Since $\begin{aligned}\mathrm{Tr}[\nabla^2h_i(y_i,t)]&=2\left[\frac{1}{t}+\frac{y_i^2}{t^3}\right]>0,\\\det[\nabla^2h_i(y_i,t)]&=4\left[\frac{1}{t}\cdot\frac{y_i^2}{t^3}-\left(\frac{y_i}{t^2}\right)^2\right]=0,\end{aligned}$ the matrix $\nabla^2h_i$ is positive semidefinite, so each $h_i$ is convex, and hence so is $h$ .

Lemma 7 Let $\alpha,\beta>0$ . An optimal solution of the problem $\min_{t_1,\ldots,t_m>0}\frac{\alpha+\beta\sum_{k=1}^mt_k^2}{\sum_{k=1}^mt_k}$ is $t_k=\sqrt{\frac{\alpha}{\beta m}},\,k=1,2,\ldots,m$ , and the optimal value is $2\sqrt{\frac{\alpha\beta}{m}}$ .

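Proof: Denote the objective function by $\phi(\mathbf{t})\equiv\frac{\alpha+\beta\sum_{k=1}^mt_k^2}{\sum_{k=1}^mt_k}.$ Note that $\phi$ is permutation symmetric, i.e., $\phi(\mathbf{t})=\phi(\mathbf{Pt}),\,\forall\mathbf{P}\in\Lambda_m$ , where $\Lambda_m$ denotes the set of $m\times m$ permutation matrices. We claim that if the problem has an optimal solution, then it has an optimal solution whose components are all equal. Indeed, take any optimal solution $\mathbf{t}^*$ and any permutation matrix $\mathbf{P}\in\Lambda_m$ . Since $\phi(\mathbf{Pt}^*)=\phi(\mathbf{t}^*)$ , the vector $\mathbf{Pt}^*$ is also optimal. By Lemma 6, $\phi$ is convex over the positive orthant, so the average $\frac{1}{m!}\sum_{\mathbf{P}\in\Lambda_m}\mathbf{Pt}^*=\frac{1}{m}\begin{pmatrix}\mathbf{e}^T\mathbf{t}^*\\\vdots\\\mathbf{e}^T\mathbf{t}^*\end{pmatrix}$ is also optimal. This shows that there is an optimal solution with all components equal. Setting $t_1=t_2=\cdots=t_m=t$ therefore yields the reduced problem $\min_{t>0}\frac{\alpha+\beta mt^2}{mt},$ whose optimal solution is easily seen to be $t=\sqrt{\frac{\alpha}{\beta m}}$ . Hence an optimal solution of the original problem is $t_k=\sqrt{\frac{\alpha}{\beta m}},\,k=1,2,\ldots,m$ . Substituting this into $\phi$ gives the optimal value $2\sqrt{\frac{\alpha\beta}{m}}$ .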
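A quick numerical experiment is consistent with Lemma 7. The sketch below uses arbitrarily chosen values of `alpha`, `beta`, `m` and random positive trial vectors (all assumptions of this illustration), and compares the value at the constant vector with the closed form and with the random trials. It is a spot check, not a proof.

```python
import math
import random

def phi(t, alpha, beta):
    """Objective of Lemma 7: (alpha + beta * sum t_k^2) / sum t_k."""
    return (alpha + beta * sum(tk * tk for tk in t)) / sum(t)

alpha, beta, m = 3.0, 0.5, 8          # arbitrary positive test values
t_star = math.sqrt(alpha / (beta * m))
best = phi([t_star] * m, alpha, beta)
closed_form = 2.0 * math.sqrt(alpha * beta / m)

# Random positive stepsize vectors should never beat the constant vector.
rng = random.Random(0)
samples = [[rng.uniform(0.01, 3.0) for _ in range(m)] for _ in range(2000)]
smallest_gap = min(phi(t, alpha, beta) - best for t in samples)
print(best, closed_form, smallest_gap)
```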

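Taking $\alpha=\Theta(\mathbf{x}^0),\,\beta=\frac{L_f^2}{2\sigma},\,m=N+1$ in Lemma 7, we see that a minimizer of the right-hand side of the bound in Lemma 5 (with $B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)$ replaced by $\Theta(\mathbf{x}^0)$ ) is $t_k=\frac{\sqrt{2\Theta(\mathbf{x}^0)\sigma}}{L_f\sqrt{N+1}}$ .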

Theorem 3 ( $O(1/\sqrt{N})$ rate of convergence of MDM with a fixed number of iterations) Suppose that Assumptions 1 and 2 hold and that there exists $L_f>0$ such that $\Vert f'(\mathbf{x})\Vert_*\le L_f,\,\forall\mathbf{x}\in C$ . Suppose that $B_{\omega}(\mathbf{x},\mathbf{x}^0)$ is bounded over $C$ : there exists $\Theta(\mathbf{x}^0)$ satisfying $\Theta(\mathbf{x}^0)\ge\max_{\mathbf{x}\in C}B_{\omega}(\mathbf{x},\mathbf{x}^0).$ Let $N$ be a positive integer, and let $\{\mathbf{x}^k\}_{k\ge0}$ be the sequence generated by MDM with the stepsizes $t_k=\frac{\sqrt{2\Theta(\mathbf{x}^0)\sigma}}{L_f\sqrt{N+1}},\quad k=0,1,\ldots,N.$ Then $f_{\mathrm{best}}^N-f_{\mathrm{opt}}\le\frac{\sqrt{2\Theta(\mathbf{x}^0)}L_f}{\sqrt{\sigma}\sqrt{N+1}}.$

Proof: By Lemma 5, $f_{\mathrm{best}}^N-f_{\mathrm{opt}}\le\frac{\Theta(\mathbf{x}^0)+\frac{L_f^2}{2\sigma}\sum_{k=0}^Nt_k^2}{\sum_{k=0}^Nt_k}.$ Substituting the expression for $t_k$ gives the result.
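The substitution in the proof can be reproduced numerically. In the sketch below the function names and the sample constants `theta`, `sigma`, `lip`, `N` are illustrative assumptions; `lemma5_bound` evaluates the right-hand side of the bound with $\Theta(\mathbf{x}^0)$ in place of the Bregman term, and it agrees with the closed form of Theorem 3 for the constant stepsize.

```python
import math

def constant_stepsize(theta, sigma, lip, n_iters):
    """t_k = sqrt(2*Theta*sigma) / (L_f * sqrt(N + 1)) for k = 0, ..., N."""
    return math.sqrt(2.0 * theta * sigma) / (lip * math.sqrt(n_iters + 1))

def lemma5_bound(theta, sigma, lip, steps):
    """(Theta + L_f^2/(2 sigma) * sum t_k^2) / sum t_k."""
    return (theta + lip ** 2 / (2.0 * sigma) * sum(t * t for t in steps)) / sum(steps)

def theorem3_bound(theta, sigma, lip, n_iters):
    """sqrt(2*Theta) * L_f / (sqrt(sigma) * sqrt(N + 1))."""
    return math.sqrt(2.0 * theta) * lip / (math.sqrt(sigma) * math.sqrt(n_iters + 1))

theta, sigma, lip, N = math.log(10), 1.0, 2.0, 99   # sample values only
t = constant_stepsize(theta, sigma, lip, N)
from_lemma5 = lemma5_bound(theta, sigma, lip, [t] * (N + 1))
from_theorem3 = theorem3_bound(theta, sigma, lip, N)
print(t, from_lemma5, from_theorem3)
```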

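Example 3 (optimization over the unit simplex) Consider the problem $\min\{f(\mathbf{x}):\mathbf{x}\in\Delta_n\},$ where $f:\mathbb{R}^n\to(-\infty,\infty]$ is a proper closed convex function and $\Delta_n\subset\mathrm{int}(\mathrm{dom}(f))$ . Consider the following two algorithms: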

Euclidean setting: We assume that the norm on $\mathbb{R}^n$ is the $\ell_2$ -norm and $\omega(\mathbf{x})=\frac{1}{2}\Vert\mathbf{x}\Vert_2^2$ . Clearly $\omega$ is $1$ -strongly convex with respect to the $\ell_2$ -norm. In this case MDM is exactly Proj-SGM: $\mathbf{x}^{k+1}=P_{\Delta_n}(\mathbf{x}^k-t_kf'(\mathbf{x}^k)).$ Suppose the algorithm starts from $\mathbf{x}^0=\frac{1}{n}\mathbf{e}$ . Then $\max_{\mathbf{x}\in\Delta_n}B_{\omega}(\mathbf{x},\mathbf{x}^0)=\max_{\mathbf{x}\in\Delta_n}\frac{1}{2}\left\Vert\mathbf{x}-\frac{1}{n}\mathbf{e}\right\Vert^2_2=\frac{1}{2}\left(1-\frac{1}{n}\right),$ so we may take the upper bound $\Theta(\mathbf{x}^0)=1$ . By Theorem 3, for a given positive integer $N$ and a suitably chosen stepsize, $f_{\mathrm{best}}^N-f_{\mathrm{opt}}\le\underbrace{\frac{\sqrt{2}L_{f,2}}{\sqrt{N+1}}}_{C_{\text{e}}^f},$ where $L_{f,2}=\max_{\mathbf{x}\in\Delta_n}\Vert f'(\mathbf{x})\Vert_2$ .
Non-Euclidean setting: Suppose the norm on $\mathbb{R}^n$ is the $\ell_1$ -norm and $\omega$ is the negative entropy $\omega(\mathbf{x})=\left\{\begin{array}{ll}\sum_{i=1}^nx_i\log(x_i), & \mathbf{x}\in\mathbb{R}_{+}^n,\\\infty, & \text{otherwise}.\end{array}\right.$ By Example 2, $\omega+\delta_{\Delta_n}$ is $1$ -strongly convex with respect to the $\ell_1$ -norm, and the MDM update is $x_i^{k+1}=\frac{x_i^ke^{-t_kf_i'(\mathbf{x}^k)}}{\sum_{j=1}^nx_j^ke^{-t_kf_j'(\mathbf{x}^k)}},\quad i=1,2,\ldots,n.$ Again we start from $\mathbf{x}^0=\frac{1}{n}\mathbf{e}$ . The Bregman distance is now the KL divergence, so $\begin{aligned}\max_{\mathbf{x}\in\Delta_n}B_{\omega}\left(\mathbf{x},\frac{1}{n}\mathbf{e}\right)&=\max_{\mathbf{x}\in\Delta_n}\sum_{i=1}^nx_i\log(nx_i)=\log(n)+\max_{\mathbf{x}\in\Delta_n}\sum_{i=1}^nx_i\log x_i\\&=\log(n).\end{aligned}$ Hence we may take $\Theta(\mathbf{x}^0)=\log(n)$ . By Theorem 3, for a suitably chosen stepsize, $f_{\mathrm{best}}^N-f_{\mathrm{opt}}\le\underbrace{\frac{\sqrt{2\log(n)}L_{f,\infty}}{\sqrt{N+1}}}_{C_{\text{ne}}^f},$ where $L_{f,\infty}=\max_{\mathbf{x}\in\Delta_n}\Vert f'(\mathbf{x})\Vert_{\infty}$ .

Denote the ratio of the upper bounds $C_{\text{ne}}^f$ and $C_{\text{e}}^f$ by $\rho^f=\frac{C_{\text{ne}}^f}{C_{\text{e}}^f}=\sqrt{\log(n)}\frac{L_{f,\infty}}{L_{f,2}}.$ Whether $\rho^f$ is greater than 1 (the Euclidean bound is better) or less than 1 (the non-Euclidean bound is better) depends on the properties of $f$ . Indeed, for all $\mathbf{y}\in\mathbb{R}^n$ we have $\Vert\mathbf{y}\Vert_{\infty}\le\Vert\mathbf{y}\Vert_2\le\sqrt{n}\Vert\mathbf{y}\Vert_{\infty}$ . Therefore $\frac{1}{\sqrt{n}}\le\frac{L_{f,\infty}}{L_{f,2}}\le1,$ and hence $\frac{\sqrt{\log(n)}}{\sqrt{n}}\le\rho^f\le\sqrt{\log(n)}.$
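Both ends of this range are attained by simple examples. The sketch below assumes a linear objective $f(\mathbf{x})=\langle\mathbf{g},\mathbf{x}\rangle$ , so that the subgradient is the constant vector $\mathbf{g}$ and $L_{f,\infty},L_{f,2}$ are just its norms; the helper `rho_linear` and the two test vectors are made up for this illustration.

```python
import math

def rho_linear(g):
    """rho^f = sqrt(log n) * ||g||_inf / ||g||_2 for f(x) = <g, x>."""
    n = len(g)
    l_inf = max(abs(gi) for gi in g)
    l_2 = math.sqrt(sum(gi * gi for gi in g))
    return math.sqrt(math.log(n)) * l_inf / l_2

n = 1000
dense = rho_linear([1.0] * n)                 # all entries equal: lower end
sparse = rho_linear([1.0] + [0.0] * (n - 1))  # single nonzero entry: upper end
print(dense, sparse)
```

With dense subgradients the non-Euclidean bound is smaller by a factor of order $\sqrt{\log(n)/n}$ , while with very sparse subgradients the Euclidean bound is smaller, though only by the factor $\sqrt{\log(n)}$ .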

2.3 Varying Stepsize Rules

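Section 2.2 discussed an "optimal" choice of stepsizes when the number of iterations is fixed; the resulting stepsize is itself constant. In practice, however, we usually do not fix the number of iterations in advance but rely on other stopping criteria, which is why varying stepsize rules are important. As in the discussion of Proj-SGM in Chapter 8, we can use the fundamental inequality for MDM to establish convergence of MDM under varying stepsize rules.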

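Theorem 4 (convergence of MDM with varying stepsizes) Suppose that Assumptions 1 and 2 hold and that there exists $L_f>0$ such that $\Vert f'(\mathbf{x})\Vert_*\le L_f,\,\forall\mathbf{x}\in C$ . Let $\{\mathbf{x}^k\}_{k\ge0}$ be the sequence generated by MDM with positive stepsizes $\{t_k\}_{k\ge0}$ , and let $\{f_{\mathrm{best}}^k\}_{k\ge0}$ be the sequence of best achieved function values.
(i) If $\frac{\sum_{n=0}^kt_n^2}{\sum_{n=0}^kt_n}\to0$ , then $f_{\mathrm{best}}^k\to f_{\mathrm{opt}}$ ;
(ii) If $t_k$ is chosen as either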

the predefined diminishing stepsize: $t_k=\frac{\sqrt{2\sigma}}{L_f\sqrt{k+1}}$ ; or
the adaptive stepsize: $t_k=\left\{\begin{array}{ll}\frac{\sqrt{2\sigma}}{\Vert f'(\mathbf{x}^k)\Vert_*\sqrt{k+1}}, & f'(\mathbf{x}^k)\ne\mathbf{0},\\\frac{\sqrt{2\sigma}}{L_f\sqrt{k+1}}, & f'(\mathbf{x}^k)=\mathbf{0},\end{array}\right.$

then for all $k\ge1$ and $\mathbf{x}^*\in X^*$ , $f_{\mathrm{best}}^k-f_{\mathrm{opt}}\le\frac{L_f}{\sqrt{2\sigma}}\frac{B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+1+\log(k+1)}{\sqrt{k+1}}.$

Proof: By the fundamental inequality for MDM, for all $n\ge0$ , $t_n(f(\mathbf{x}^n)-f_{\mathrm{opt}})\le B_{\omega}(\mathbf{x}^*,\mathbf{x}^n)-B_{\omega}(\mathbf{x}^*,\mathbf{x}^{n+1})+\frac{t_n^2}{2\sigma}\Vert f'(\mathbf{x}^n)\Vert_*^2.$ Summing this inequality over $n=0,1,\ldots,k$ gives $\sum_{n=0}^kt_n(f(\mathbf{x}^n)-f_{\mathrm{opt}})\le B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)-B_{\omega}(\mathbf{x}^*,\mathbf{x}^{k+1})+\frac{1}{2\sigma}\sum_{n=0}^kt_n^2\Vert f'(\mathbf{x}^n)\Vert_*^2.$ Since $B_{\omega}(\mathbf{x}^*,\mathbf{x}^{k+1})\ge0$ and $f(\mathbf{x}^n)\ge f_{\mathrm{best}}^k$ for $n\le k$ , we have $f_{\mathrm{best}}^k-f_{\mathrm{opt}}\le\frac{B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{1}{2\sigma}\sum_{n=0}^kt_n^2\Vert f'(\mathbf{x}^n)\Vert_*^2}{\sum_{n=0}^kt_n}.$ Since $\Vert f'(\mathbf{x}^n)\Vert_*\le L_f$ , it follows that $f_{\mathrm{best}}^k-f_{\mathrm{opt}}\le\frac{B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{L_f^2}{2\sigma}\sum_{n=0}^kt_n^2}{\sum_{n=0}^kt_n}.$ Therefore, if $\frac{\sum_{n=0}^kt_n^2}{\sum_{n=0}^kt_n}\to0$ , then $f_{\mathrm{best}}^k\to f_{\mathrm{opt}}$ . This proves (i).

We now prove (ii). Note that both stepsize rules satisfy $t_n^2\Vert f'(\mathbf{x}^n)\Vert_*^2\le\frac{2\sigma}{n+1},\,t_n\ge\frac{\sqrt{2\sigma}}{L_f\sqrt{n+1}}$ . Hence $f_{\mathrm{best}}^k-f_{\mathrm{opt}}\le\frac{L_f}{\sqrt{2\sigma}}\frac{B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\sum_{n=0}^k\frac{1}{n+1}}{\sum_{n=0}^k\frac{1}{\sqrt{n+1}}},$ and the result follows from Lemma 9(i) of Chapter 8.
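The two stepsize rules of Theorem 4 translate into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and `dual_norm_of_subgrad` stands for $\Vert f'(\mathbf{x}^k)\Vert_*$ , which the caller is assumed to compute in whichever dual norm is in use.

```python
import math

def predefined_step(k, sigma, lip):
    """Predefined diminishing stepsize sqrt(2 sigma) / (L_f sqrt(k+1))."""
    return math.sqrt(2.0 * sigma) / (lip * math.sqrt(k + 1))

def adaptive_step(k, sigma, lip, dual_norm_of_subgrad):
    """Adaptive stepsize; falls back to L_f when the subgradient is zero."""
    scale = dual_norm_of_subgrad if dual_norm_of_subgrad > 0.0 else lip
    return math.sqrt(2.0 * sigma) / (scale * math.sqrt(k + 1))

def theorem4_bound(k, sigma, lip, breg0):
    """Right-hand side of Theorem 4(ii); breg0 stands for B_omega(x*, x^0)."""
    return lip / math.sqrt(2.0 * sigma) * (breg0 + 1.0 + math.log(k + 1)) / math.sqrt(k + 1)

k, sigma, lip, norm = 3, 1.0, 2.0, 0.5   # sample values with norm <= lip
t_pre = predefined_step(k, sigma, lip)
t_ada = adaptive_step(k, sigma, lip, norm)
print(t_pre, t_ada, theorem4_bound(k, sigma, lip, breg0=1.0))
```

Because the adaptive rule divides by the actual subgradient norm rather than by the worst case $L_f$ , it never takes a shorter step than the predefined rule.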

Example 4 (MDM vs. Proj-SGM: a numerical example) Consider the problem $\min\{\Vert\mathbf{Ax-b}\Vert_1:\mathbf{x}\in\Delta_n\},$ where $\mathbf{A}\in\mathbb{R}^{n\times n},\,\mathbf{b}\in\mathbb{R}^n$ . Following Example 3, we consider two algorithms.

Proj-SGM. Suppose the norm on $\mathbb{R}^n$ is the $\ell_2$ -norm. The update formula is $\mathbf{x}^{k+1}=P_{\Delta_n}(\mathbf{x}^k-t_kf'(\mathbf{x}^k)),$ where we take $f'(\mathbf{x}^k)=\mathbf{A}^T\mathrm{sgn}(\mathbf{A}\mathbf{x}^k-\mathbf{b})$ and the adaptive stepsize $t_k=\frac{\sqrt{2}}{\Vert f'(\mathbf{x}^k)\Vert_2\sqrt{k+1}}.$
MDM. Suppose the norm on $\mathbb{R}^n$ is the $\ell_1$ -norm and $\omega$ is the negative entropy. The update formula is then $x_i^{k+1}=\frac{x_i^ke^{-t_kf_i'(\mathbf{x}^k)}}{\sum_{j=1}^nx_j^ke^{-t_kf_j'(\mathbf{x}^k)}},\quad i=1,2,\ldots,n,$ with the stepsize $t_k=\frac{\sqrt{2}}{\Vert f'(\mathbf{x}^k)\Vert_{\infty}\sqrt{k+1}}.$

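We take $n = 100$ and generate the entries of $\mathbf{A,b}$ independently from the standard normal distribution. The figure below shows the evolution of $f(\mathbf{x}^k)-f_{\mathrm{opt}}$ and $f_{\mathrm{best}}^k-f_{\mathrm{opt}}$ for the two methods.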

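[Figure: $f(\mathbf{x}^k)-f_{\mathrm{opt}}$ and $f_{\mathrm{best}}^k-f_{\mathrm{opt}}$ versus the iteration count $k$ for Proj-SGM and MDM.]
In this example, MDM clearly outperforms Proj-SGM.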
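The experiment of Example 4 can be imitated on a small scale with the standard library alone. The sketch below is not the code behind the figure: it assumes $n=10$ instead of $100$ , a fixed random seed, 200 iterations, and a sort-based routine for the projection onto the simplex (a standard technique, not taken from the text). Since $f_{\mathrm{opt}}$ is not known here, it only records $f_{\mathrm{best}}^k$ .

```python
import math
import random

def residual(A, b, x):
    return [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]

def objective(A, b, x):
    return sum(abs(r) for r in residual(A, b, x))

def subgradient(A, b, x):
    # A^T sgn(Ax - b), one valid subgradient of ||Ax - b||_1
    s = [(r > 0) - (r < 0) for r in residual(A, b, x)]
    return [sum(A[i][j] * s[i] for i in range(len(A))) for j in range(len(x))]

def project_simplex(v):
    """Euclidean projection onto the unit simplex (sort-based method)."""
    tau, running = 0.0, 0.0
    for j, uj in enumerate(sorted(v, reverse=True), start=1):
        running += uj
        cand = (running - 1.0) / j
        if uj - cand > 0.0:
            tau = cand  # keep the threshold of the last index that passes
    return [max(vi - tau, 0.0) for vi in v]

def run(A, b, iters, method):
    n = len(A[0])
    x = [1.0 / n] * n
    best = objective(A, b, x)
    history = [best]
    for k in range(iters):
        g = subgradient(A, b, x)
        if method == "proj_sgm":
            norm = math.sqrt(sum(gi * gi for gi in g))   # l2 norm
        else:
            norm = max(abs(gi) for gi in g)              # l_inf (dual of l1)
        if norm == 0.0:
            break  # zero subgradient: the current point is optimal
        t = math.sqrt(2.0) / (norm * math.sqrt(k + 1))
        if method == "proj_sgm":
            x = project_simplex([xi - t * gi for xi, gi in zip(x, g)])
        else:
            shift = min(g)  # cancels after normalization
            w = [xi * math.exp(-t * (gi - shift)) for xi, gi in zip(x, g)]
            total = sum(w)
            x = [wi / total for wi in w]
        best = min(best, objective(A, b, x))
        history.append(best)
    return x, history

rng = random.Random(0)
n = 10
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_proj, hist_proj = run(A, b, 200, "proj_sgm")
x_md, hist_md = run(A, b, 200, "mdm")
print(hist_proj[-1], hist_md[-1])
```

Which method ends up with the smaller value depends on the instance and on $n$ ; the sketch makes no claim beyond showing how the two updates differ.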

3. Mirror Descent for the Composite Model: the Mirror-C Method

In this section we consider the more general model $\min_{\mathbf{x}\in\mathbb{E}}\{F(\mathbf{x})\equiv f(\mathbf{x})+g(\mathbf{x})\}.$ We make the following assumptions on $f, g$ :

Assumption 3 (properties of $f, g$ )
(i) $f,g:\mathbb{E}\to(-\infty,\infty]$ are proper closed convex functions;
(ii) $\mathrm{dom}(g)\subset\mathrm{int}(\mathrm{dom}(f))$ ;
(iii) $\exists L_f>0: \Vert f'(\mathbf{x})\Vert_*\le L_f,\,\forall\mathbf{x}\in\mathrm{dom}(g)$ ;
(iv) the optimal set of the composite model is nonempty and denoted by $X^*$ ; the optimal value is denoted by $F_{\mathrm{opt}}$ .

We again introduce a function $\omega$ and make the following assumptions on it⁴:

Assumption 4 (properties of $\omega$ )
(i) $\omega$ is a proper closed convex function;
(ii) $\omega$ is differentiable over $\mathrm{dom}(\partial\omega)$ ;
(iii) $\mathrm{dom}(g)\subset\mathrm{dom}(\omega)$ ;
(iv) $\omega+\delta_{\mathrm{dom}(g)}$ is $\sigma$ -strongly convex $(\sigma>0)$ .

Clearly, we could ignore the composite structure of the model and apply MDM directly to $F = f + g$ , with $C$ replaced by $\mathrm{dom}(g)$ : $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in C}\left\{\langle f'(\mathbf{x}^k)+g'(\mathbf{x}^k),\mathbf{x}\rangle+\frac{1}{t_k}B_{\omega}(\mathbf{x},\mathbf{x}^k)\right\}.$ Such a direct application, however, raises several issues:

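We have not assumed that $C=\mathrm{dom}(g)$ is closed, so $\mathbf{x}^{k+1}$ may be undefined;
Even if $\mathbf{x}^{k+1}$ is well defined, we have not assumed that $g$ is Lipschitz over $C$ , which is essential in the convergence analysis of MDM;
Even if $g$ is Lipschitz over $C$ , the Lipschitz constant of the sum $F = f + g$ may be much larger than that of $f$ . We would like to design an algorithm that depends only on the Lipschitz constant of $f$ over $\mathrm{dom}(g)$ .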

A natural remedy is to linearize only $f$ . This leads to the scheme $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}}\left\{\langle f'(\mathbf{x}^k),\mathbf{x}\rangle+g(\mathbf{x})+\frac{1}{t_k}B_{\omega}(\mathbf{x},\mathbf{x}^k)\right\}.$ Substituting the definition of $B_{\omega}$ gives $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}}\{\langle t_kf'(\mathbf{x}^k)-\nabla\omega(\mathbf{x}^k),\mathbf{x}\rangle+t_kg(\mathbf{x})+\omega(\mathbf{x})\}.$ We call the algorithm based on this update the mirror-C method (MCM)⁵.

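[Algorithm box: Mirror-C Method (MCM)] Initialization: pick $\mathbf{x}^0\in\mathrm{dom}(g)\cap\mathrm{dom}(\partial\omega)$ . General step: for $k=0,1,2,\ldots$ , (a) pick a stepsize $t_k>0$ and a subgradient $f'(\mathbf{x}^k)\in\partial f(\mathbf{x}^k)$ ; (b) set $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}}\{\langle t_kf'(\mathbf{x}^k)-\nabla\omega(\mathbf{x}^k),\mathbf{x}\rangle+t_kg(\mathbf{x})+\omega(\mathbf{x})\}.$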
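It is easy to see that MCM reduces to MDM when $g=\delta_C$ . To analyze MCM, we first show that it is well defined, i.e., that each new iterate lies in $\mathrm{dom}(g)\cap\mathrm{dom}(\partial\omega)$ . The proof is a direct application of Lemma 2.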

Theorem 5 (MCM is well defined) Suppose that Assumptions 3 and 4 hold, and let $\mathbf{a}\in\mathbb{E}^*$ . Then the problem $\min_{\mathbf{x}\in\mathbb{E}}\{\langle\mathbf{a,x}\rangle+g(\mathbf{x})+\omega(\mathbf{x})\}$ has a unique minimizer, and it lies in $\mathrm{dom}(g)\cap\mathrm{dom}(\partial\omega)$ .

Proof: Apply Lemma 2 with $\psi(\mathbf{x})\equiv\langle\mathbf{a,x}\rangle+g(\mathbf{x})$ .

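The analysis of MCM parallels that of MDM in Section 2. We first prove the fundamental inequality for MCM. Note that here we additionally assume that $g$ is nonnegative and that the stepsize sequence is nonincreasing.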

Lemma 8 (fundamental inequality for MCM) Suppose that Assumptions 3 and 4 hold and that $g$ is nonnegative. Let $\{\mathbf{x}^k\}_{k\ge0}$ be the sequence generated by MCM with positive nonincreasing stepsizes $\{t_k\}_{k\ge0}$ . Then for all $\mathbf{x}^*\in X^*,\,k\ge0$ , $\min_{n=0,1,\ldots,k}F(\mathbf{x}^n)-F_{\mathrm{opt}}\le\frac{t_0g(\mathbf{x}^0)+B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{1}{2\sigma}\sum_{n=0}^kt_n^2\Vert f'(\mathbf{x}^n)\Vert_*^2}{\sum_{n=0}^kt_n}.$

Proof: By the update formula and the non-Euclidean second prox theorem with $\mathbf{b}=\mathbf{x}^n,\,\mathbf{a}=\mathbf{x}^{n+1},\,\psi(\mathbf{x})\equiv t_n\langle f'(\mathbf{x}^n),\mathbf{x}\rangle+t_ng(\mathbf{x})$ , we have for all $\mathbf{u}\in\mathrm{dom}(g)$ , $\langle\nabla\omega(\mathbf{x}^n)-\nabla\omega(\mathbf{x}^{n+1}),\mathbf{u}-\mathbf{x}^{n+1}\rangle\le t_n\langle f'(\mathbf{x}^n),\mathbf{u}-\mathbf{x}^{n+1}\rangle+t_ng(\mathbf{u})-t_ng(\mathbf{x}^{n+1}).$ By the three-points lemma with $\mathbf{a}=\mathbf{x}^{n+1},\,\mathbf{b}=\mathbf{x}^n,\,\mathbf{c=u}$ , $\langle\nabla\omega(\mathbf{x}^n)-\nabla\omega(\mathbf{x}^{n+1}),\mathbf{u}-\mathbf{x}^{n+1}\rangle=B_{\omega}(\mathbf{u},\mathbf{x}^{n+1})+B_{\omega}(\mathbf{x}^{n+1},\mathbf{x}^n)-B_{\omega}(\mathbf{u},\mathbf{x}^n).$ Combining the two relations gives $B_{\omega}(\mathbf{u},\mathbf{x}^{n+1})+B_{\omega}(\mathbf{x}^{n+1},\mathbf{x}^n)-B_{\omega}(\mathbf{u},\mathbf{x}^n)\le t_n\langle f'(\mathbf{x}^n),\mathbf{u}-\mathbf{x}^{n+1}\rangle+t_ng(\mathbf{u})-t_ng(\mathbf{x}^{n+1}).$ Therefore, $\begin{aligned}&t_n\langle f'(\mathbf{x}^n),\mathbf{x}^n-\mathbf{u}\rangle+t_ng(\mathbf{x}^{n+1})-t_ng(\mathbf{u})\\&\le B_{\omega}(\mathbf{u},\mathbf{x}^n)-B_{\omega}(\mathbf{u},\mathbf{x}^{n+1})-B_{\omega}(\mathbf{x}^{n+1},\mathbf{x}^n)+t_n\langle f'(\mathbf{x}^n),\mathbf{x}^n-\mathbf{x}^{n+1}\rangle\\&\le B_{\omega}(\mathbf{u},\mathbf{x}^n)-B_{\omega}(\mathbf{u},\mathbf{x}^{n+1})-\frac{\sigma}{2}\Vert\mathbf{x}^{n+1}-\mathbf{x}^n\Vert^2+t_n\langle f'(\mathbf{x}^n),\mathbf{x}^n-\mathbf{x}^{n+1}\rangle\\&=B_{\omega}(\mathbf{u},\mathbf{x}^n)-B_{\omega}(\mathbf{u},\mathbf{x}^{n+1})-\frac{\sigma}{2}\Vert\mathbf{x}^{n+1}-\mathbf{x}^n\Vert^2+\left\langle\frac{t_n}{\sqrt{\sigma}}f'(\mathbf{x}^n),\sqrt{\sigma}(\mathbf{x}^n-\mathbf{x}^{n+1})\right\rangle\\&\le B_{\omega}(\mathbf{u},\mathbf{x}^n)-B_{\omega}(\mathbf{u},\mathbf{x}^{n+1})-\frac{\sigma}{2}\Vert\mathbf{x}^{n+1}-\mathbf{x}^n\Vert^2+\frac{t_n^2}{2\sigma}\Vert f'(\mathbf{x}^n)\Vert_*^2+\frac{\sigma}{2}\Vert\mathbf{x}^{n+1}-\mathbf{x}^n\Vert^2\\&=B_{\omega}(\mathbf{u},\mathbf{x}^n)-B_{\omega}(\mathbf{u},\mathbf{x}^{n+1})+\frac{t_n^2}{2\sigma}\Vert f'(\mathbf{x}^n)\Vert_*^2.\end{aligned}$ Setting $\mathbf{u}=\mathbf{x}^*$ and using the subgradient inequality, we obtain $t_n[f(\mathbf{x}^n)+g(\mathbf{x}^{n+1})-F_{\mathrm{opt}}]\le B_{\omega}(\mathbf{x}^*,\mathbf{x}^n)-B_{\omega}(\mathbf{x}^*,\mathbf{x}^{n+1})+\frac{t_n^2}{2\sigma}\Vert f'(\mathbf{x}^n)\Vert_*^2.$ Summing this inequality over $n=0,1,\ldots,k$ gives $\sum_{n=0}^kt_n[f(\mathbf{x}^n)+g(\mathbf{x}^{n+1})-F_{\mathrm{opt}}]\le B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)-B_{\omega}(\mathbf{x}^*,\mathbf{x}^{k+1})+\frac{1}{2\sigma}\sum_{n=0}^kt_n^2\Vert f'(\mathbf{x}^n)\Vert_*^2.$ Adding $t_0g(\mathbf{x}^0)-t_kg(\mathbf{x}^{k+1})$ to both sides and using the nonnegativity of the Bregman distance, we get $\begin{aligned}&t_0(F(\mathbf{x}^0)-F_{\mathrm{opt}})+\sum_{n=1}^k[t_nf(\mathbf{x}^n)+t_{n-1}g(\mathbf{x}^n)-t_nF_{\mathrm{opt}}]\\&\le t_0g(\mathbf{x}^0)-t_kg(\mathbf{x}^{k+1})+B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{1}{2\sigma}\sum_{n=0}^kt_n^2\Vert f'(\mathbf{x}^n)\Vert_*^2.\end{aligned}$ Since $t_n\le t_{n-1}$ and $g(\mathbf{x}^n)\ge0$ , it follows that $\left(\sum_{n=0}^kt_n\right)\left(\min_{n=0,1,\ldots,k}F(\mathbf{x}^n)-F_{\mathrm{opt}}\right)\le\sum_{n=0}^kt_n[F(\mathbf{x}^n)-F_{\mathrm{opt}}]\le t_0g(\mathbf{x}^0)+B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{1}{2\sigma}\sum_{n=0}^kt_n^2\Vert f'(\mathbf{x}^n)\Vert_*^2.$ Dividing by $\sum_{n=0}^kt_n$ gives the result.

With the fundamental inequality for MCM in hand, we can analyze the convergence of MCM.

Theorem 6 ( $O(1/\sqrt{N})$ rate of convergence of MCM with a fixed number of iterations) Suppose that Assumptions 3 and 4 hold and that $g$ is nonnegative. Suppose that $B_{\omega}(\mathbf{x},\mathbf{x}^0)$ is bounded over $\mathrm{dom}(g)$ : there exists $\Theta(\mathbf{x}^0)$ such that $\Theta(\mathbf{x}^0)\ge\max_{\mathbf{x}\in\mathrm{dom}(g)}B_{\omega}(\mathbf{x},\mathbf{x}^0).$ Suppose that $g(\mathbf{x}^0)=0$ , and let $N$ be a positive integer. Let $\{\mathbf{x}^k\}_{k\ge0}$ be the sequence generated by MCM with the constant stepsize $t_k=\frac{\sqrt{2\Theta(\mathbf{x}^0)\sigma}}{L_f\sqrt{N}}$ . Then $\min_{n=0,1,\ldots,N-1}F(\mathbf{x}^n)-F_{\mathrm{opt}}\le\frac{\sqrt{2\Theta(\mathbf{x}^0)}L_f}{\sqrt{\sigma}\sqrt{N}}.$

Proof: By the fundamental inequality for MCM together with $g(\mathbf{x}^0)=0,\,\Vert f'(\mathbf{x}^n)\Vert_*\le L_f,\,B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)\le\Theta(\mathbf{x}^0)$ , we have $\min_{n=0,1,\ldots,N-1}F(\mathbf{x}^n)-F_{\mathrm{opt}}\le\frac{\Theta(\mathbf{x}^0)+\frac{L_f^2}{2\sigma}\sum_{n=0}^{N-1}t_n^2}{\sum_{n=0}^{N-1}t_n}.$ Substituting the expression for $t_n$ gives the result.

Theorem 7 ( $O(\log k/\sqrt{k})$ rate of convergence of MCM with varying stepsizes) Suppose that Assumptions 3 and 4 hold and that $g$ is nonnegative. Let $\{\mathbf{x}^k\}_{k\ge0}$ be the sequence generated by MCM with the stepsize rule $t_k=\frac{\sqrt{2\sigma}}{L_f\sqrt{k+1}}$ . Then for all $k\ge1$ and $\mathbf{x}^*\in X^*$ , $\min_{n=0,1,\ldots,k}F(\mathbf{x}^n)-F_{\mathrm{opt}}\le\frac{L_f}{\sqrt{2\sigma}}\frac{B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{\sqrt{2\sigma}}{L_f}g(\mathbf{x}^0)+1+\log(k+1)}{\sqrt{k+1}}.$

Proof: By the fundamental inequality for MCM and $t_0=\frac{\sqrt{2\sigma}}{L_f}$ , $\min_{n=0,1,\ldots,k}F(\mathbf{x}^n)-F_{\mathrm{opt}}\le\frac{\frac{\sqrt{2\sigma}}{L_f}g(\mathbf{x}^0)+B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{1}{2\sigma}\sum_{n=0}^kt_n^2\Vert f'(\mathbf{x}^n)\Vert_*^2}{\sum_{n=0}^kt_n}.$ Combining this with $t_n^2\Vert f'(\mathbf{x}^n)\Vert_*^2\le\frac{2\sigma}{n+1},\,t_n=\frac{\sqrt{2\sigma}}{L_f\sqrt{n+1}}$ , we get $\min_{n=0,1,\ldots,k}F(\mathbf{x}^n)-F_{\mathrm{opt}}\le\frac{L_f}{\sqrt{2\sigma}}\frac{B_{\omega}(\mathbf{x}^*,\mathbf{x}^0)+\frac{\sqrt{2\sigma}}{L_f}g(\mathbf{x}^0)+\sum_{n=0}^k\frac{1}{n+1}}{\sum_{n=0}^k\frac{1}{\sqrt{n+1}}}.$ The result then follows from Lemma 9(i) of Chapter 8.

Example 5 Suppose the norm on $\mathbb{R}^n$ is the $\ell_2$ -norm. Let $f:\mathbb{R}^n\to\mathbb{R}$ be a convex function that is $L_f$ -Lipschitz continuous over $\mathbb{R}^n$ , i.e., $\Vert f'(\mathbf{x})\Vert_2\le L_f,\,\forall\mathbf{x}\in\mathbb{R}^n$ . Consider the problem $\min_{\mathbf{x}\in\mathbb{R}_{++}^n}\left\{F(\mathbf{x})\equiv f(\mathbf{x})+\sum_{i=1}^n\frac{1}{x_i}\right\},$ with $\omega(\mathbf{x})=\frac{1}{2}\Vert\mathbf{x}\Vert^2$ . There are two options:

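Proj-SGM. The feasible set $C$ is not clearly specified. If we take $C=\mathbb{R}_{++}^n$ , then $C$ is not closed, so the projection onto $C$ need not exist. Moreover, $F$ is clearly not Lipschitz continuous. Hence convergence cannot be guaranteed.
PSGM. Take $g(\mathbf{x})\equiv\sum_{i=1}^n\frac{1}{x_i}+\delta_{\mathbb{R}_{++}^n}$ . Then Assumptions 3 and 4 both hold, and $g$ is nonnegative. The iteration is $\mathbf{x}^{k+1}=\mathrm{prox}_{t_kg}(\mathbf{x}^k-t_kf'(\mathbf{x}^k)).$ One can verify that computing the prox at each step requires solving $n$ scalar cubic equations.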
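To make the last remark concrete: for a single coordinate, minimizing $\frac{t}{x}+\frac{1}{2}(x-y)^2$ over $x>0$ leads to the stationarity condition $x-y-\frac{t}{x^2}=0$ , i.e., the cubic $x^3-yx^2-t=0$ , which has exactly one positive root because $x\mapsto x-y-t/x^2$ is strictly increasing on $(0,\infty)$ . The sketch below solves it by bisection; the function names, the bracket, and the choice of bisection (rather than a closed-form cubic formula) are assumptions of this illustration.

```python
def prox_reciprocal_scalar(y, t, iters=200):
    """Positive root of x^3 - y*x^2 - t = 0, i.e. argmin_{x>0} t/x + (x-y)^2/2."""
    def h(x):
        return x - y - t / (x * x)   # strictly increasing on (0, inf)
    lo = 0.0                          # h -> -inf as x -> 0+
    hi = max(y, 0.0) + t ** (1.0 / 3.0)   # h(hi) >= 0 for this choice
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def prox_reciprocal(y_vec, t):
    """prox of t * sum_i 1/x_i over the positive orthant, one cubic per coordinate."""
    return [prox_reciprocal_scalar(yi, t) for yi in y_vec]

point = [2.0, -1.0, 0.0]
p = prox_reciprocal(point, t=1.0)
print(p)
```

Each output coordinate is strictly positive even when the input coordinate is negative or zero, so the PSGM iterates stay inside $\mathbb{R}_{++}^n$ without any projection.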

Example 6 (Proj-SGM vs. PSGM: a numerical example) Suppose the norm on $\mathbb{R}^n$ is the $\ell_2$ -norm. Consider the problem $\min_{\mathbf{x}\in\mathbb{R}^n}\{F(\mathbf{x})\equiv\Vert\mathbf{Ax-b}\Vert_1+\lambda\Vert\mathbf{x}\Vert_1\},$ where $\mathbf{A}\in\mathbb{R}^{m\times n},\,\mathbf{b}\in\mathbb{R}^m,\,\lambda>0$ . We discuss two algorithms for solving it:

Proj-SGM. Take $C=\mathbb{R}^n$ and use $\mathrm{sgn}(\mathbf{y})\in\partial (\Vert\mathbf{y}\Vert_1)$ . The iteration is $\mathbf{x}^{k+1}=\mathbf{x}^k-t_k(\mathbf{A}^T\mathrm{sgn}(\mathbf{Ax}^k-\mathbf{b})+\lambda\mathrm{sgn}(\mathbf{x}^k)),$ with the stepsize $t_k=\frac{1}{\Vert F'(\mathbf{x}^k)\Vert_2\sqrt{k+1}}$ .
PSGM. Let $f(\mathbf{x})=\Vert\mathbf{Ax-b}\Vert_1,\,g(\mathbf{x})=\lambda\Vert\mathbf{x}\Vert_1$ , so that $F = f + g$ . The iteration is $\mathbf{x}^{k+1}=\mathrm{prox}_{s_kg}(\mathbf{x}^k-s_k\mathbf{A}^T\mathrm{sgn}(\mathbf{Ax}^k-\mathbf{b})).$ Since $g(\mathbf{x})=\lambda\Vert\mathbf{x}\Vert_1$ , the operator $\mathrm{prox}_{s_kg}$ is the soft-thresholding operator $\mathcal{T}_{\lambda s_k}$ (Example 2 of Chapter 6). Hence $\mathbf{x}^{k+1}=\mathcal{T}_{\lambda s_k}(\mathbf{x}^k-s_k\mathbf{A}^T\mathrm{sgn}(\mathbf{Ax}^k-\mathbf{b})).$ The stepsize is $s_k=\frac{1}{\Vert f'(\mathbf{x}^k)\Vert_2\sqrt{k+1}}$ .

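Clearly, the guarantee for Proj-SGM depends on the Lipschitz constant $L_F$ of $F$ , whereas that for PSGM depends only on the Lipschitz constant $L_f$ of $f$ . In theory, PSGM should therefore perform better than Proj-SGM. We generated the entries of $\mathbf{A,b}$ independently from the standard normal distribution. The evolution of $F(\mathbf{x}^k)-F_{\mathrm{opt}}$ for the two algorithms is shown in the figure below.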

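[Figure: $F(\mathbf{x}^k)-F_{\mathrm{opt}}$ versus the iteration count $k$ for Proj-SGM and PSGM.]
The figure shows that PSGM significantly outperforms Proj-SGM in this example.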
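A single PSGM step for Example 6 can be sketched as follows. This is an illustration rather than the code used for the figure: the function names and the tiny test instance are hypothetical, and returning the current point when the subgradient is zero is an assumption of this sketch (the stepsize formula is undefined there).

```python
import math

def soft_threshold(v, tau):
    """Componentwise soft thresholding T_tau(v) = sgn(v) * max(|v| - tau, 0)."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]

def psgm_step(A, b, x, lam, k):
    """x <- T_{lam*s}(x - s * A^T sgn(Ax - b)) with s = 1/(||f'(x)||_2 sqrt(k+1))."""
    r = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
    s_vec = [(ri > 0) - (ri < 0) for ri in r]
    g = [sum(A[i][j] * s_vec[i] for i in range(len(A))) for j in range(len(x))]
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(x)  # assumption of this sketch: stay put on a zero subgradient
    s = 1.0 / (norm * math.sqrt(k + 1))
    return soft_threshold([xi - s * gi for xi, gi in zip(x, g)], lam * s)

# Tiny hypothetical instance: A = I, b = (1, -1), starting from the origin.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
x_next = psgm_step(A, b, [0.0, 0.0], lam=0.5, k=0)
print(x_next)
```

Only the subgradient of $f$ enters the stepsize; the term $\lambda\Vert\mathbf{x}\Vert_1$ is handled exactly by the thresholding, which is what makes the method depend on $L_f$ rather than $L_F$ .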

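¹ Define $\tilde\omega=\omega+\delta_C$ and note that $\nabla\omega(\mathbf{x}^k)\in\partial\tilde\omega(\mathbf{x}^k)$ , so that $\nabla\omega(\mathbf{x}^k)$ can be written as $\tilde\omega'(\mathbf{x}^k)$ . The MDM iteration can then be written as $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in\mathbb{E}}\{\langle t_kf'(\mathbf{x}^k)-\tilde\omega'(\mathbf{x}^k),\mathbf{x}\rangle+\tilde\omega(\mathbf{x})\}.$ By the conjugate correspondence theorem (Theorem 8(ii) of Chapter 5), since $\tilde\omega$ is proper, closed, and strongly convex, $\tilde\omega^*$ is differentiable over $\mathbb{E}^*$ . By the second form of the conjugate subgradient theorem (Corollary 2 of Chapter 4), we then obtain the following form of the MDM iteration: $\mathbf{x}^{k+1}=\nabla\tilde\omega^*(\tilde\omega'(\mathbf{x}^k)-t_kf'(\mathbf{x}^k)).$
² Writing an equality here is justified because $\omega+\delta_{\mathrm{dom}(\psi)}+\psi=\omega+\psi$ is strongly convex, so $\mathbf{a}$ is uniquely determined.
³ The fundamental inequality for MDM has exactly the same form as the one for Proj-SGM, except that the Euclidean distance is replaced by the Bregman distance and the norm of the subgradient is replaced by the dual norm.
⁴ This differs slightly from Assumption 2 above.
⁵ If $\mathbb{E}$ is Euclidean and $\omega(\mathbf{x})=\frac{1}{2}\Vert\mathbf{x}\Vert^2$ , the update formula becomes $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}}\left\{\langle t_kf'(\mathbf{x}^k),\mathbf{x}\rangle+t_kg(\mathbf{x})+\frac{1}{2}\Vert\mathbf{x}-\mathbf{x}^k\Vert^2\right\},$ which after rearranging gives $\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}}\left\{t_kg(\mathbf{x})+\frac{1}{2}\Vert\mathbf{x}-[\mathbf{x}^k-t_kf'(\mathbf{x}^k)]\Vert^2\right\}.$ By the definition of the proximal operator, $\mathbf{x}^{k+1}=\mathrm{prox}_{t_kg}(\mathbf{x}^k-t_kf'(\mathbf{x}^k)).$ The resulting algorithm is called the proximal subgradient method (PSGM). It is easy to see that it reduces to the Proj-SGM of Chapter 8 when $g=\delta_C$ . We will discuss it in detail in Chapter 10, where an additional differentiability assumption on $f$ leads to better convergence properties.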