CVX notes
Preliminaries
1. PSD
\(M\) is a positive semidefinite matrix \(\iff\) every principal submatrix \(P\) of \(M\) is PSD
Note: This follows by considering the quadratic form \(x^T Mx\) and restricting \(x\) to be supported on the index set that defines the principal submatrix. The converse is trivially true, since \(M\) is a principal submatrix of itself.
\(M\) is PSD \(\iff\) all principal minors are non-negative
Write \(M\) as a quadratic form:
\[ x^T M x = \sum_{i,j}M_{ij}x_ix_j \]
Taking \(x\) to be a standard basis vector \(e_i\) gives \(M_{ii} \ge 0\), hence \(\mathbf{tr}(M) \ge 0\). Next take \(x\) supported on the two positions \(i, j\), say \(x_i = t\), \(x_j = 1\); then \(x^T M x = M_{ii}t^2 + 2M_{ij}t + M_{jj} \ge 0\) for all \(t\), so the discriminant is non-positive:
\[ \begin{gathered} M_{ii}M_{jj} - M_{ij}^2 \ge 0 ~~(\text{PSD}) \\ \implies |M_{ij}| \le \sqrt{M_{ii}M_{jj}} \le \frac{M_{ii} + M_{jj}}{2} \end{gathered} \]
where the last step is the AM-GM inequality.
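These inequalities are easy to check numerically; a minimal numpy sketch (the matrix \(BB^T\) is PSD by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
M = B @ B.T  # B B^T is PSD by construction

# diagonal entries are non-negative, hence tr(M) >= 0
assert np.all(np.diag(M) >= 0)

# entrywise bound derived above: |M_ij| <= sqrt(M_ii * M_jj)
d = np.diag(M)
assert np.all(np.abs(M) <= np.sqrt(np.outer(d, d)) + 1e-9)
```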
2. Matrix norm
General definition of a norm: a function \(\|\cdot\|\) satisfying (i) \(\|x\| \ge 0\), with equality iff \(x = 0\), (ii) absolute homogeneity \(\|\alpha x\| = |\alpha|\|x\|\), and (iii) the triangle inequality \(\|x+y\| \le \|x\| + \|y\|\).
Matrix norm:
- Frobenius norm: \(\|A\|_F := \sqrt{\langle A,A\rangle_F} = \sqrt{\mathbf{tr}(A^*A)}\)
- Induced norm: \(\|A\|_p := \sup\limits_{\|x\|_p = 1} \|Ax\|_p\)
- Nuclear norm: \(\|A\|_{nuclear} := \sum \sigma_i(A)\) (sum of singular values)
- Spectral norm: \(\|A\|_{spectral} := \sigma_{\max}(A)\) (largest singular value; equal to the induced 2-norm)
Spectral radius: \(\rho(A) := \max_i |\lambda_i(A)|\); for any induced norm, \(\rho(A) \le \|A\|\).
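A small numpy sketch computing these quantities (`np.linalg.norm(A, 2)` and `np.linalg.norm(A, 'nuc')` also return the spectral and nuclear norms directly):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
s = np.linalg.svd(A, compute_uv=False)          # singular values of A

fro      = np.linalg.norm(A, 'fro')             # sqrt(tr(A^T A)) = sqrt(sum sigma_i^2)
nuclear  = s.sum()                              # nuclear norm: sum of singular values
spectral = s.max()                              # spectral norm = induced 2-norm = sigma_max
rho      = np.abs(np.linalg.eigvals(A)).max()   # spectral radius

assert np.isclose(fro, np.sqrt((s ** 2).sum()))
assert rho <= spectral + 1e-12                  # rho(A) <= ||A|| for any induced norm
```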
3. Duality
Two ==equivalent== ways to represent a convex set:
- standard representation: The family of points in the set
- dual representation: The set of halfspaces containing the set (the set is recovered as the intersection of these halfspaces)
A closed convex set \(S\) is the intersection of all closed halfspaces \(H\) containing it.
Polar
Let \(S \subseteq \mathbb{R}^n\) be a convex set containing the origin. The polar of \(S\) is defined as follows
\[ S^{\circ} := \{y ~|~ y^Tx \le 1, ~\forall x \in S\} \]
Note
- the polar is one way of representing the set of all halfspaces containing a convex set
- every halfspace \(a^Tx \le b\) with \(b > 0\) can be written as a “normalized” inequality \(y^T x \le 1\), by dividing by \(b\)
- \(S^{\circ}\) can be thought of as the normalized representations of halfspaces containing \(S\)
Properties of the polar:
- \(S^{\circ\circ} = S\) (provided \(S\) is closed)
- \(S^{\circ}\) is a closed convex set containing the origin
- When 0 is in the interior of \(S\), then \(S^{\circ}\) is bounded
- When \(S\) is non-convex, \(S^{\circ} = (\mathbf{conv}(S))^{\circ}\), and \(S^{\circ\circ} = \mathbf{conv}(S)\)
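A standard worked example: the polar of the \(\ell_1\) unit ball is the \(\ell_\infty\) unit ball, since \(\sup_{\|x\|_1 \le 1} y^Tx = \|y\|_\infty\):
\[ \left(\{x ~|~ \|x\|_1 \le 1\}\right)^{\circ} = \{y ~|~ \|y\|_\infty \le 1\} \]
More generally, the polar of the \(\ell_p\) unit ball is the \(\ell_q\) unit ball with \(1/p + 1/q = 1\).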
Polar duality of convex cones
For a cone \(K\), homogeneity makes the bound \(1\) in the polar redundant, so the polar cone is
\[ K^{\circ} := \{y ~|~ y^Tx \le 0, ~\forall x \in K\} \]
Notes
- \(K^{\circ\circ} = K\) (for closed convex cones)
- \(K^{\circ}\) is closed and convex
Conjugation of convex functions
Let \(f: \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}\) be a convex function. The ==conjugate== of \(f\) is
\[ f^*(y) := \sup\limits_{x}(y^Tx - f(x)) \]
Properties of the conjugate
- \(f^{**} = f\) when \(f\) is proper, convex, and closed (Fenchel–Moreau theorem)
- \(f^*\) is convex (supremum of affine functions of \(y\))
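A worked example: for the quadratic \(f(x) = \frac{1}{2}x^TQx\) with \(Q \succ 0\), the supremum in the definition is attained at \(x = Q^{-1}y\), giving
\[ f^*(y) = \sup_x \left(y^Tx - \tfrac{1}{2}x^TQx\right) = \tfrac{1}{2}y^TQ^{-1}y \]
In particular, \(f(x) = \frac{1}{2}\|x\|_2^2\) is its own conjugate.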
Convex sets
Convex functions
affine is convex: \(f(x) = a^T x+b\)
affine functions are both convex and concave
any _norm_ is convex
Proof: let \(\pi(x)\) be a norm; then for \(\lambda \in [0,1]\), \(\pi(\lambda x + (1-\lambda)y) \le \pi(\lambda x) + \pi((1-\lambda)y) = \lambda \pi(x) + (1-\lambda)\pi(y)\), by the triangle inequality and homogeneity.
\(f\) is convex \(\iff\) epi(\(f\)) is convex
1. Closed convex
A convex function \(f\) is called closed if its epigraph is a closed set.
- \(f\) which is convex and continuous on a closed domain is a closed function (e.g., norms)
- all differentiable convex functions are closed with dom\(f = \mathbb{R}^n\)
- by convention, a convex function takes the value \(+\infty\) outside \(\mathbf{dom}f\)
- Jensen's inequality: \(f\left(\sum_i \alpha_i x_i\right) \le \sum_i \alpha_i f(x_i)\) for \(\alpha_i \ge 0\), \(\sum_i \alpha_i = 1\)
Corollary: if \(x\) is a convex combination of \(x_1,\dots,x_k\), then \(f(x) \le \max\limits_i f(x_i)\)
pf: \(f(x) = f(\sum\alpha_i x_i) \le \sum \alpha_i f(x_i) \le \max\limits_{i} f(x_i)\)
2. Level sets
If \(f\) is convex, then every level set \(\{x ~|~ f(x) \le \alpha\}\) is convex.
Note: the convexity of level sets does not characterize convex functions, but quasiconvex functions.
- convex \(f\) is closed \(\implies\) all its level sets are closed
Some convex sets
- norm ball (\(\{x\in \mathbb{R}^n | \|x\| \le 1\}\)) is convex and closed
- Ellipsoid (\(\{x ~|~ (x-a)^T Q (x-a) \le r^2\}\), with \(Q\) positive definite) is convex and closed
pf: \(\langle x, y \rangle := x^TQy\) satisfies the three defining properties of an inner product
- bilinearity
- symmetry
- positivity
these three properties hold \(\iff\) \(Q\) is symmetric positive definite
- \(\epsilon\)-neighborhood of a convex set \(M\): \(\{x ~|~ \mathbf{dist}(x, M) \le \epsilon\}\) is convex and closed
3. Operations preserving convexity of functions
- stability under taking weighted sums: \(f,g \mapsto \lambda f + \mu g, \; \lambda,\mu \ge 0\)
- stability under affine substitutions of the argument: \(x \mapsto Ax+b\) or \(f(x) \mapsto \phi(x) = f(Ax+b)\)
- stability under taking pointwise sup: \(\{f_i\}_{i \in \mathcal{I}} \mapsto g(x) := \sup\limits_{i \in \mathcal{I}}f_i(x)\); the pointwise supremum of a family of convex functions is convex
- stability under partial minimization: if \(f(x,y)\) is jointly convex in \((x,y)\), then \(g(x) = \inf\limits_{y} f(x,y)\) is convex (provided \(g\) is proper, i.e., \(> -\infty\) everywhere and finite at least at one point)
- stability under perspective: \(f(x) \mapsto g(x,t) = tf(x/t)\), \(\mathbf{dom}\,g = \{(x,t) ~|~ x/t \in \mathbf{dom}f, ~t > 0\}\)
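These composition rules are exactly what disciplined convex programming (DCP) checkers implement; a minimal sketch using cvxpy (assuming it is installed), where `is_convex()` applies the ruleset:

```python
import cvxpy as cp
import numpy as np

x = cp.Variable(3)
A, b = np.ones((2, 3)), np.ones(2)

f = cp.norm(x, 2)                     # a norm: convex
g = cp.norm(A @ x + b, 2)             # affine substitution preserves convexity
h = 2 * f + 3 * g                     # non-negative weighted sum: convex
s = cp.maximum(cp.sum_squares(x), g)  # pointwise max of convex functions: convex

assert all(e.is_convex() for e in (f, g, h, s))
```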
4. Detect convexity
Necessary and sufficient convexity conditions for smooth functions:
- a differentiable \(f\) is convex \(\iff\) its gradient is monotone: \(\langle \nabla f(x) - \nabla f(y), x-y \rangle \ge 0\) (in one dimension: \(f'\) is non-decreasing)
- a twice-differentiable \(f\) is convex \(\iff\) \(\nabla^2 f(x) \succeq 0\) (in one dimension: \(f''(x) \ge 0\))
The subgradient property is characteristic of convex functions: \(f\) is convex \(\iff\) at every \(x \in \mathbf{dom}f\) it admits an affine minorant, i.e., \(\exists g: f(y) \ge f(x) + g^T(y-x)~\forall y\).
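For example, convexity of log-sum-exp follows from the second-order condition; a small numpy check of the Hessian (the formula \(\mathrm{diag}(p) - pp^T\) with \(p = \mathrm{softmax}(x)\) is the standard one):

```python
import numpy as np

def lse_hessian(x):
    # Hessian of f(x) = log(sum_i exp(x_i)) is diag(p) - p p^T, p = softmax(x)
    p = np.exp(x - x.max())
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)

H = lse_hessian(np.array([0.3, -1.0, 2.0]))
assert np.all(np.linalg.eigvalsh(H) >= -1e-12)  # PSD, so log-sum-exp is convex
```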
5. Subgradient
\(g\) is a ==subgradient== of \(f\) at \(x\) if \(f(y) \ge f(x) + g^T(y-x)\) for all \(y\); the set of all subgradients at \(x\) is the subdifferential \(\partial f(x)\).
Examples
- \(f(x) = |x|\): \(\partial f(x) = \{\mathrm{sign}(x)\}\) for \(x \neq 0\), and \(\partial f(0) = [-1,1]\)
- if \(f\) is differentiable at \(x\), then \(\partial f(x) = \{\nabla f(x)\}\)
6. Optimality conditions
For a convex function, every local minimum is also a global minimum.
First-order optimality condition (necessary and sufficient for convex \(f\)):
\(x^* \in \mathbf{dom}f\) is a minimizer \(\iff\) \(0 \in \partial f(x^*)\)
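Combining this with the subgradient example above: for \(f(x) = |x|\), we have \(0 \in \partial f(0) = [-1,1]\), so \(x^* = 0\) is the global minimizer even though \(f\) is not differentiable there.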
7. Strong convexity
A differentiable function \(f\) is \(\mu\)-strongly convex if
\[ f(y) \ge f(x) + \nabla f(x)^T(y-x) + \frac{\mu}{2} \|y-x\|^2 \]
Note
- \(f\) is not necessarily differentiable (see the equivalent definitions below)
- if \(f\) is non-smooth, replace gradients with subgradients
- strong convexity \(\implies\) strict convexity
Note: Intuitively speaking, strong convexity means that there exists a quadratic lower bound on the growth of the function.
Equivalent definition
\[ \begin{align} &(i)~f(y)\ge f(x)+\nabla f(x)^T(y-x)+\frac{\mu}{2}\lVert y-x \rVert^2,~\forall x, y.\\ &(ii)~g(x) = f(x)-\frac{\mu}{2}\lVert x \rVert^2~\text{is convex},~\forall x.\\ &(iii)~\langle \nabla f(x) - \nabla f(y),x-y \rangle \ge \mu \lVert x-y\rVert^2,~\forall x, y.\\ &(iv)~f(\alpha x+ (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y) - \frac{\alpha (1-\alpha)\mu}{2}\Vert x-y\rVert^2,~\alpha \in [0,1].\\ &(v)~\nabla^2 f(x) \succeq \mu \boldsymbol{I} \end{align} \]
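A quick numerical check of characterization (iii) for the quadratic \(f(x) = \frac{1}{2}x^TAx\), where \(A \succeq \mu I\) holds by construction (a minimal numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
mu = 2.0
A = B @ B.T + mu * np.eye(4)    # A >= mu*I, so f(x) = 0.5 x^T A x is mu-strongly convex

grad = lambda x: A @ x          # gradient of f
x, y = rng.standard_normal(4), rng.standard_normal(4)

# (iii): <grad f(x) - grad f(y), x - y> >= mu * ||x - y||^2
assert (grad(x) - grad(y)) @ (x - y) >= mu * np.sum((x - y) ** 2) - 1e-9
```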
Lagrange Duality
Consider an optimization problem in standard form (not necessarily convex)
\[ \begin{array}{ll} \underset{x}{\text{minimize}} & f_0(x) \\ \text{subject to} & f_i(x) \le 0, ~i=1,\cdots,m \\ ~ & h_i(x) = 0, ~i=1,\cdots,p \end{array} \]
The Lagrangian is
\[ L(x,\boldsymbol{\lambda},\boldsymbol{\mu}) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \mu_i h_i(x) \]
The Lagrange dual function is defined as
\[ g(\lambda, \mu) = \inf_{x} L(x,\lambda,\mu) \]
Lagrange dual problem
\[ \begin{array}{ll} \underset{\lambda, \mu}{\text{maximize}} & g(\lambda, \mu) \\ \text{subject to} & \boldsymbol{\lambda} \succeq \mathbf{0} \end{array} \]
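A classic worked example: for the minimum-norm problem \(\min_x x^Tx\) s.t. \(Ax = b\), minimizing the Lagrangian \(L(x,\mu) = x^Tx + \mu^T(Ax - b)\) over \(x\) gives \(x = -\frac{1}{2}A^T\mu\), so
\[ g(\mu) = -\tfrac{1}{4}\mu^TAA^T\mu - b^T\mu \]
and the dual problem is an unconstrained concave quadratic maximization.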
Weak duality
\[ d^* \le p^* \]
- \(d^*\): optimal value of the dual problem
- \(p^*\): optimal value of the primal problem
- duality gap: \(p^* - d^*\)
- weak duality always holds, even for non-convex problems
Strong duality
\[ d^* = p^* \]
- constraint qualifications \(\implies\) strong duality
- Slater’s Constraint Qualification: the convex problem is strictly feasible, i.e., \(\exists ~x \in \mathbf{int}\mathcal{D}\) with \(f_i(x) < 0\) for all \(i\) and \(h_i(x) = 0\)
Complementary slackness
If strong duality holds and \((x^*, \lambda^*, \mu^*)\) are primal/dual optimal, then \(\lambda_i^* f_i(x^*) = 0\) for all \(i\): \(\lambda_i^* > 0 \implies f_i(x^*) = 0\), and \(f_i(x^*) < 0 \implies \lambda_i^* = 0\).
KKT conditions
For differentiable \(f_i, h_i\): primal feasibility, dual feasibility (\(\boldsymbol{\lambda} \succeq \mathbf{0}\)), complementary slackness, and stationarity \(\nabla f_0(x) + \sum_i \lambda_i \nabla f_i(x) + \sum_i \mu_i \nabla h_i(x) = 0\). They are necessary under strong duality; for convex problems they are also sufficient.
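A small cvxpy sketch (assuming cvxpy is installed) observing complementary slackness and stationarity on a strictly feasible problem; `dual_value` is cvxpy's accessor for the optimal multiplier of a constraint:

```python
import cvxpy as cp
import numpy as np

x = cp.Variable(2)
ball = cp.sum_squares(x) <= 1                  # strictly feasible at x = 0 (Slater holds)
prob = cp.Problem(cp.Minimize(x[0] + x[1]), [ball])
prob.solve()

lam = ball.dual_value                          # optimal Lagrange multiplier
x_star = x.value
# constraint is tight and lam > 0: complementary slackness lam * f_1(x*) = 0
assert lam > 0 and np.isclose(np.sum(x_star ** 2), 1.0, atol=1e-6)
# stationarity: [1, 1] + 2 * lam * x* = 0
assert np.allclose(np.ones(2) + 2 * lam * x_star, 0, atol=1e-5)
```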
Cones
Tangent cone
Let \(M\) be a (nonempty) convex set and \(x^* \in M\); the tangent cone of \(M\) at \(x^*\) is the cone
\[ \begin{split} T_M(x^*) &= \{h \in \mathbb{R}^n ~|~ \exists t > 0: x^* + th \in M\} \\ &= \{t(y - x^*) ~|~ y \in M, ~t \ge 0\} \end{split} \]
Note:
- Geometrically, this is the set of all directions leading from \(x^*\) inside \(M\)
- convex but not necessarily closed
- fact: if \(x^*\) is a minimizer, then \(h^T \nabla f(x^*) \ge 0\) for all \(h \in T_M(x^*)\) (every direction in the tangent cone leads to feasible points, so none can be a descent direction)
- \(T_M(x^*) = \mathbb{R}^n \iff x^* \in \mathbf{int}M\)
e.g., a polyhedron
\[ M = \{x | Ax \le b\} = \{x | a_i^Tx \le b_i, \; i = 1,\dots,m\} \]
the tangent cone at \(x^*\) is
\[ T_M(x^*) = \{h~|~a_i^T h \le 0, ~\forall i, ~a_i^T x^* = b_i\} \]
Normal cone: the polar cone of the tangent cone
\[ N_M(x^*) = \{g \in \mathbb{R}^n ~|~ \langle g, y-x^*\rangle \le 0, ~\forall y \in M\} \]
Note:
- the normal cone is the polar of the tangent cone, i.e.,
\[ N_M(x^*) = (T_M(x^*))^{\circ} = \{g \in \mathbb{R}^n ~|~ \langle g, h \rangle \le 0, ~\forall h \in T_M(x^*)\} \]
- fact: if \(x^*\) is a minimizer, then \(-\nabla f(x^*) \in N_M(x^*)\).
Algorithm convergence
Method / Assumptions | Stepsize Rule | Convergence Rate | Iteration Complexity
---|---|---|---
**Gradient descent** | | |
strongly convex & smooth | \(\eta_t = \frac{2}{\mu + L}\) | \(O\left(\left(\frac{\kappa -1}{\kappa +1}\right)^t\right)\) | \(O\left(\frac{\log\frac{1}{\epsilon}}{\log\frac{\kappa+1}{\kappa-1}}\right)\)
convex & smooth | \(\eta_t = \frac{1}{L}\) | \(O(\frac{1}{t})\) | \(O(\frac{1}{\epsilon})\)
**Frank-Wolfe** | | |
(strongly) convex & smooth | \(\eta_t = \frac{1}{t}\) | \(O(\frac{1}{t})\) | \(O(\frac{1}{\epsilon})\)
**Projected GD** | | |
convex & smooth | \(\eta_t = \frac{1}{L}\) | \(O(\frac{1}{t})\) | \(O(\frac{1}{\epsilon})\)
strongly convex & smooth | \(\eta_t = \frac{1}{L}\) | \(O\left((1-\frac{1}{\kappa})^t\right)\) | \(O(\kappa\log\frac{1}{\epsilon})\)
**Subgradient method** | | |
convex & Lipschitz | \(\eta_t = \frac{1}{\sqrt{t}}\) | \(O(\frac{1}{\sqrt{t}})\) | \(O(\frac{1}{\epsilon^2})\)
strongly convex & Lipschitz | \(\eta_t = \frac{1}{t}\) | \(O\left(\frac{1}{t}\right)\) | \(O(\frac{1}{\epsilon})\)
**Proximal GD** | | |
convex & smooth (w.r.t. \(f\)) | \(\eta_t = \frac{1}{L}\) | \(O(\frac{1}{t})\) | \(O(\frac{1}{\epsilon})\)
strongly convex & smooth (w.r.t. \(f\)) | \(\eta_t = \frac{1}{L}\) | \(O\left((1-\frac{1}{\kappa})^t\right)\) | \(O(\kappa\log\frac{1}{\epsilon})\)
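As a numerical sanity check of the strongly convex gradient-descent row, a minimal numpy sketch on a quadratic, where the contraction factor \(\frac{\kappa-1}{\kappa+1}\) per step is exact:

```python
import numpy as np

# f(x) = 0.5 x^T A x with spectrum in [mu, L]; grad f(x) = A x
mu, L = 1.0, 10.0
A = np.diag([mu, L])                 # kappa = L / mu = 10
eta = 2.0 / (mu + L)                 # stepsize from the table
rate = (L / mu - 1) / (L / mu + 1)   # contraction factor (kappa-1)/(kappa+1)

x0 = np.array([1.0, 1.0])
x = x0.copy()
for _ in range(50):
    x = x - eta * (A @ x)            # gradient step

assert np.linalg.norm(x) <= rate ** 50 * np.linalg.norm(x0) + 1e-12
```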