陆吾生 (Prof. W.-S. Lu) Lecture: Mathematical Foundations of Optimization Problems

Ⅰ. Taylor expansion

A smooth function can be expanded in a Taylor series.
Any differentiable function, magnified locally, looks like a sum of polynomial terms of increasing order; the required fitting accuracy determines how many orders are needed.

One-variable case

$$f(x+\delta)=f(x)+f'(x)\delta+\frac{1}{2}f''(x)\delta^2+\cdots$$

Multi-variable case

def. Hessian Matrix: $\nabla^2 f(x)$

$$f(x+\delta)=f(x)+\nabla^T f(x)\,\delta+\frac{1}{2}\delta^T\nabla^2 f(x)\,\delta+\cdots$$

Linear approximation of f(x) at x

As $\delta\rightarrow 0$, $f(x+\delta)$ becomes a linear function of $\delta$:

$$f(x+\delta)\approx f(x)+\nabla^T f(x)\,\delta$$

Quadratic approximation of f(x) at x

As $\delta\rightarrow 0$, $f(x+\delta)$ becomes a quadratic function of $\delta$:

$$f(x+\delta)\approx f(x)+\nabla^T f(x)\,\delta+\frac{1}{2}\delta^T\nabla^2 f(x)\,\delta$$
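As a quick numerical check (an illustrative sketch, not from the lecture; the test function and the expansion point are arbitrary choices), the snippet below compares the linear and quadratic approximations against the true value:

```python
import numpy as np

# Illustrative test function (an assumption): f(x) = x1^2 + 3*x1*x2 + 5*x2^2
f = lambda x: x[0]**2 + 3*x[0]*x[1] + 5*x[1]**2
grad = lambda x: np.array([2*x[0] + 3*x[1], 3*x[0] + 10*x[1]])
hess = lambda x: np.array([[2.0, 3.0],
                           [3.0, 10.0]])   # constant, since f is quadratic

x = np.array([1.0, 2.0])
delta = np.array([0.1, -0.05])

linear = f(x) + grad(x) @ delta
quadratic = linear + 0.5 * delta @ hess(x) @ delta

print(f"true:      {f(x + delta):.6f}")
print(f"linear:    {linear:.6f}")     # off by the O(||delta||^2) term
print(f"quadratic: {quadratic:.6f}")  # exact here, since f is quadratic
```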


Ⅱ. Optimization

$$\min_{x\in \mathcal{R}^{n\times 1}}f(x)$$

  • The direct approach
    Setting the gradient to zero yields a system of equations (two, in the two-variable case below), but for a complicated function, solving the equations obtained after differentiation is extremely hard.

    $$\nabla f(x)=\begin{bmatrix} \frac{\partial f(x)}{\partial x_1}\\[4pt] \frac{\partial f(x)}{\partial x_2} \end{bmatrix}=0$$

  • A. Cauchy Method
    Find the minimizer without solving the equations directly.
    Pick an arbitrary $x_k$, then move it to obtain a smaller function value, iteratively, until $\nabla f(x)\rightarrow 0$ (a runnable sketch of this method and Newton's method follows this list).
    Take $\delta=-\nabla f(x_k)$, i.e. $x_{k+1}=x_k-\nabla f(x_k)$.

    $$f(x_k+\delta)-f(x_k)\approx\nabla^T f(x_k)\,\delta=-\|\nabla f(x_k)\|^2<0$$
    so the new value $f(x_k+\delta)$ is guaranteed to be smaller than the previous $f(x_k)$.
    On top of this, one can also introduce a step size $\alpha$:
    $$F(\alpha)=f(x_k-\alpha\nabla f(x_k))$$
    This is a nonlinear function of $\alpha$; at $\alpha_{opt}$ it attains the smallest function value along this direction.
    [Figure: F(α) as a function of α, with its minimum at α_opt]
    For how to choose $\alpha$, see line search methods (a backtracking variant appears in the sketch after this list).
    Iterating in this way produces
    $$x_{k+1}=x_k-\alpha_k\nabla f(x_k)$$
    $$x_{k+2}=x_{k+1}-\alpha_{k+1}\nabla f(x_{k+1})$$
    $$\cdots$$
    until $\nabla f(x)\rightarrow 0$.

  • Newton Method
    The $\alpha$ above is troublesome to compute and hard to estimate. Since $f$ is also, locally, a second-order polynomial in $\delta$:
    $$f(x+\delta)\approx f(x)+\nabla^T f(x)\,\delta+\frac{1}{2}\delta^T\nabla^2 f(x)\,\delta$$
    the minimum over $\delta$ can be found by differentiating with respect to $\delta$:
    $$\nabla_{\delta}\Big(f(x_k)+\nabla^T f(x_k)\,\delta+\frac{1}{2}\delta^T\nabla^2 f(x_k)\,\delta\Big)=0$$
    Using the identities
    $$\nabla(c^Tx)=c,\qquad \nabla(x^TAx)=2Ax\ \text{ for symmetric }A,$$
    we obtain
    $$\nabla f(x_k)+\nabla^2 f(x_k)\,\delta=0$$
    hence
    $$\delta=-(\nabla^2 f(x_k))^{-1}\nabla f(x_k)$$
    which determines the step:
    $$x_{k+1}=x_k-(\nabla^2 f(x_k))^{-1}\nabla f(x_k)$$
    In the figure the true function is the black curve. Start from an arbitrary $x_k$; the Taylor expansion there gives the red curve as an approximation of the function. The minimizer of the red curve is $x_{k+1}$, which yields the green approximation; the minimizer of the green curve is $x_{k+2}$. Iterating repeatedly keeps approaching the desired minimizer.
    [Figure: Newton iterations — the true function (black) and its successive quadratic approximations (red, green)]
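Below is a minimal, self-contained sketch of both methods in Python (illustrative, not from the lecture: the test function, the backtracking constants, and the tolerance are all assumptions):

```python
import numpy as np

# Illustrative strictly convex test function (an assumption, not from the lecture)
f    = lambda x: (x[0] - 1)**2 + x[0]*x[1] + x[1]**2 + np.exp(x[1])
grad = lambda x: np.array([2*(x[0] - 1) + x[1],
                           x[0] + 2*x[1] + np.exp(x[1])])
hess = lambda x: np.array([[2.0, 1.0],
                           [1.0, 2.0 + np.exp(x[1])]])

def cauchy(x, tol=1e-8, max_iter=10_000):
    """Steepest descent: x_{k+1} = x_k - alpha_k * grad f(x_k)."""
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # stop when grad f -> 0
            return x, k
        alpha = 1.0                      # backtracking line search for alpha
        while f(x - alpha*g) > f(x) - 1e-4*alpha*(g @ g):
            alpha *= 0.5
        x = x - alpha*g
    return x, max_iter

def newton(x, tol=1e-8, max_iter=100):
    """Newton: x_{k+1} = x_k - (hess f(x_k))^{-1} grad f(x_k)."""
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        x = x - np.linalg.solve(hess(x), g)   # solve, don't invert explicitly
    return x, max_iter

x0 = np.array([3.0, 2.0])
for name, method in [("Cauchy", cauchy), ("Newton", newton)]:
    x_star, iters = method(x0.copy())
    print(f"{name:6s}: x* = {x_star}, iterations = {iters}")
```

Newton converges in far fewer iterations, at the cost of forming and solving with the Hessian at every step, which matches the comparison table below.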

Comparing the Cauchy and Newton methods

To compare the quality of the two methods, look at $\nabla f(x^*)$: the closer it is to 0, the better. Faster convergence is also better.

| Method | Cauchy | Newton |
| --- | --- | --- |
| Convergence speed | slow | fast |
| Preprocessing (computing derivatives, etc.) | fast | slow |
| Memory footprint | small | large |


Ⅲ. Quadratic Form

Quadratic approximation of f(x) at x

$$f(x+\delta)\approx f(x)+\nabla^T f(x)\,\delta+\frac{1}{2}\delta^T\nabla^2 f(x)\,\delta$$
Here the Hessian Matrix $H=\nabla^2 f(x)$ appears, and the last term is a quadratic form.

The sign of the quadratic form is determined by the eigenvalues of $H$: $\mathrm{eig}(H)$ gives $\lambda_1,\lambda_2,\dots,\lambda_n$, which are real-valued since $H$ is symmetric.

$$\text{def.}\begin{cases} x^THx>0 & \text{positive definite (P.D.)} & \text{iff} &\lambda_i>0 \\ x^THx\geq0 & \text{positive semidefinite (P.S.D.)} & \text{iff} & \lambda_i\geq0\\ x^THx<0 & \text{negative definite (N.D.)} & \text{iff} &\lambda_i<0\\ x^THx\leq0 & \text{negative semidefinite (N.S.D.)} & \text{iff} &\lambda_i\leq0\\ x^THx\gtrless0 & \text{indefinite}\\ \end{cases}$$
(the strict inequalities are required for all $x\neq0$)
For example,
$$H=\begin{bmatrix} 1&2.5\\ 2.5&4 \end{bmatrix}$$
$$f(x)=x^THx=x_1^2+5x_1x_2+4x_2^2$$
$$\det(\lambda I-H)= \begin{vmatrix} \lambda-1&-2.5\\ -2.5&\lambda-4 \end{vmatrix} =(\lambda-1)(\lambda-4)-6.25=0$$
Professor Lu could see at a glance that it is indefinite, using leading principal minors to shortcut the computation:
$$\det(H)=1\times4-6.25<0$$
Since $\det(H)=\lambda_1\lambda_2$, a negative determinant means the two eigenvalues have opposite signs, so $H$ is indefinite.
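A quick numerical confirmation (an illustrative sketch; `classify` is a hypothetical helper, not part of the lecture):

```python
import numpy as np

def classify(H, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix are real
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam < -tol):
        return "negative definite"
    if np.all(lam >= -tol):
        return "positive semidefinite"
    if np.all(lam <= tol):
        return "negative semidefinite"
    return "indefinite"

H = np.array([[1.0, 2.5],
              [2.5, 4.0]])
print(np.linalg.eigvalsh(H))  # approx [-0.415, 5.415]: opposite signs
print(classify(H))            # -> indefinite
```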

Convex Function

For a convex function opening upward, the graph has the following property: the tangent line at any point lies below the graph.
[Figure: a convex function whose tangent at x lies below the graph; x₁, h, and p as used below]
For the tangent at a point $x$, $\tan\theta=f'(x)$, where $\theta$ is the angle between the tangent and the x-axis. As in the figure,

$$h=\tan\theta\,(x_1-x)=f'(x)(x_1-x)$$
and, since the tangent lies below the graph, the gap $p\geq0$, so
$$f(x_1)=f(x)+h+p$$
Meanwhile, the Taylor expansion
$$f(x+\delta)\approx f(x)+\nabla^T f(x)\,\delta+\frac{1}{2}\delta^T\nabla^2 f(x)\,\delta$$
substituted into the convex function (with $\delta=x_1-x$) gives
$$f(x_1)\approx f(x)+\nabla^T f(x)(x_1-x)+\frac{1}{2}(x_1-x)^T\nabla^2 f(x)(x_1-x)$$
and therefore
$$f(x_1)-f(x)-\nabla^T f(x)(x_1-x)\approx \frac{1}{2}(x_1-x)^T\nabla^2 f(x)(x_1-x)\geq0$$
where the left-hand side equals $p\geq0$; that is, the quadratic form is positive semidefinite.
Therefore, computing $\mathrm{eig}(H)$ tells us whether the function is convex.
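As a sanity check on the tangent-line property (an illustrative sketch; the convex test function is an assumption):

```python
import numpy as np

# Illustrative convex function (an assumption): f(x) = x^2 + e^x, f'' > 0 everywhere
f  = lambda x: x**2 + np.exp(x)
df = lambda x: 2*x + np.exp(x)

x  = 0.5                                  # tangent point
x1 = np.linspace(-2.0, 3.0, 11)           # test points
p  = f(x1) - (f(x) + df(x)*(x1 - x))      # gap between curve and tangent
print(np.all(p >= 0))                     # True: the tangent lies below the graph
```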

Even for something as complicated as the second-order gradient in Logistic regression:

$$\nabla^2 f(\theta)=\frac{1}{N}\sum_{i=1}^N\frac{(1-2l_i)^2e^{(1-2l_i)\theta^T\hat x_i}\,\hat x_i\hat x_i^T}{\big(1+e^{(1-2l_i)\theta^T\hat x_i}\big)^2}$$
one can still conclude that the original function is convex: each summand is a nonnegative scalar multiplying the rank-one P.S.D. matrix $\hat x_i\hat x_i^T$, so the Hessian is P.S.D. everywhere.
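To see this numerically, the sketch below assembles the Hessian above from random data (illustrative only; labels $l_i\in\{0,1\}$, the data, and $\theta$ are assumptions) and checks its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))      # rows are the hat{x}_i
l = rng.integers(0, 2, size=N)   # labels l_i in {0, 1} (an assumption)
theta = rng.normal(size=d)

# Hessian from the formula: (1/N) * sum of s^2 e^{s t} x x^T / (1 + e^{s t})^2
H = np.zeros((d, d))
for xi, li in zip(X, l):
    s = 1 - 2*li                 # +1 or -1
    e = np.exp(s * (theta @ xi))
    H += (s**2) * e / (1 + e)**2 * np.outer(xi, xi)
H /= N

print(np.linalg.eigvalsh(H))     # all eigenvalues >= 0: the Hessian is P.S.D.
```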


Thanks, Professor Lu~
Autumn at ECNU
