Support Vector Machines
Author: little-xu
Date: 2021/1/20
Margin and Support Vectors
The maximum-margin hyperplane
We have a training set $\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}$ with $y_i \in \{-1,+1\}$; let $+1$ label the positive examples and $-1$ the negative ones. In the input space we look for a hyperplane that separates the positive examples from the negative ones.
Moving from two dimensions up into a higher-dimensional space, the set $wx+b=0$ that correctly separates $D_1$ and $D_2$ becomes a hyperplane.
So how do we pick a robust hyperplane?
The positive examples have a boundary line, and so do the negative examples: connecting the outermost points of each class yields a convex hull, and these boundary lines are the ones tangent to the margin planes of our hyperplane. We want a hyperplane such that:
- the two classes of samples lie on opposite sides of the hyperplane;
- the sum of the distances from the hyperplane to the positive boundary line and to the negative boundary line is as large as possible.
Read "distance from the hyperplane to a boundary line" carefully here: it is the minimum, over all positive (resp. negative) sample points $(x_i,y_i)$ in the training set $T$, of the geometric margin with respect to the hyperplane $(w,b)$.
Geometric margin

$$\gamma_i = y_i\left(\frac{wx_i+b}{\|w\|}\right)$$
Question: is the $y_i$ here just making trouble?
Answer:

$$\begin{cases} w^Tx_i+b \geq +1 \ (\text{positive boundary}), & y_i=+1 \\ w^Tx_i+b \leq -1 \ (\text{negative boundary}), & y_i=-1 \end{cases}$$

And what is this, then?
$wx+b=1$ is the tangent plane at the positive boundary. Substituting any positive sample $(x_+,y_+)$ into $y=wx+b-1$ gives $y \geq 0$, i.e. $wx+b-1 \geq 0$; the negative class works the same way.
As we all learned in grade school, a distance is never negative. For a positive example, $\gamma_i = \frac{wx_i+b}{\|w\|}$; for a negative example, $\gamma_i = -\frac{wx_i+b}{\|w\|}$. So $y_i$ is there precisely to cancel the sign introduced by the $\pm 1$ labels.
A note on norms
The distance from a data point $(x_0,y_0)$ to the plane $w_1x+w_2y+b=0$ is

$$d = \frac{|w_1x_0+w_2y_0+b|}{\sqrt{w_1^2+w_2^2}} = \frac{|w_1x_0+w_2y_0+b|}{\|w\|}$$
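As a quick sanity check, the distance formula above can be evaluated numerically (an illustrative sketch; the plane coefficients and the point below are made-up values):

```python
import numpy as np

def distance_to_hyperplane(w, b, x0):
    """Distance from point x0 to the hyperplane w·x + b = 0."""
    return abs(np.dot(w, x0) + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # hypothetical normal vector, so ‖w‖ = 5
b = -5.0
x0 = np.array([0.0, 0.0])

print(distance_to_hyperplane(w, b, x0))  # |3·0 + 4·0 − 5| / 5 → 1.0
```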
The optimization objective
Again, read "distance from the hyperplane to a boundary line" carefully: the geometric margin of every sample point $(x_i,y_i)$ in the training set $T$ with respect to the hyperplane $(w,b)$ is at least $\gamma$.

$$\begin{aligned} & \max_{w,b} \quad \gamma_+ + \gamma_- \\ & \text{s.t.} \quad y_i\left(\frac{wx_i+b}{\|w\|}\right) \geq \gamma, \quad i=1,2,\cdots,m \end{aligned}$$
Using the boundary tangent planes $wx+b=\pm 1$ from the answer above, this becomes

$$\begin{aligned} & \max_{w,b} \quad \frac{1}{\|w\|}+\frac{1}{\|w\|} = \frac{2}{\|w\|} \\ & \text{s.t.} \quad y_i(w^Tx_i+b) \geq 1, \quad i=1,2,\cdots,m \end{aligned}$$

where the samples closest to the hyperplane attain $y_i(w^Tx_i+b)=1$. Maximizing $\frac{2}{\|w\|}$ under these constraints is equivalent to minimizing $\|w\|$, i.e.

$$\begin{aligned} & \min_{w,b} \quad \frac{1}{2}\|w\|^2 \\ & \text{s.t.} \quad y_i(w^Tx_i+b) \geq 1, \quad i=1,2,\cdots,m \end{aligned}$$
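This hard-margin program is easy to sanity-check numerically: given a toy separable set and a hand-picked candidate $(w,b)$ (both made up for illustration), we can verify the constraints $y_i(w^Tx_i+b) \geq 1$ and compute the margin width $\frac{2}{\|w\|}$:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0]])  # made-up samples
y = np.array([1.0, 1.0, -1.0])
w = np.array([0.5, 0.5])   # hypothetical candidate direction
b = -1.0

margins = y * (X @ w + b)        # functional margins y_i(wᵀx_i + b)
print(margins)                   # each entry must be ≥ 1 for feasibility
print(2 / np.linalg.norm(w))     # margin width 2/‖w‖ = 2√2 ≈ 2.828
```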
The dual problem
Strong duality

$$\begin{aligned} & \min_{w,b} \quad f(w,b) = \frac{1}{2}\|w\|^2 \\ & \text{s.t.} \quad y_i(w^Tx_i+b) \geq 1, \quad i=1,2,\cdots,m \end{aligned}$$
How shall I introduce such an exquisite result to you?
We attach a Lagrange multiplier $\lambda_i$ to each constraint.
Let us define $g_i(w,b) = 1-y_i(w^Tx_i+b)$. And thus Mr. $L(w,b,\lambda)$ was born:

$$L(w,b,\lambda) = f(w,b) + \sum_{i=1}^{m}\lambda_i g_i(w,b)$$
Mr. $L(w,b,\lambda)$ started out as an ordinary fellow, $f(w,b)$. A broken home in his childhood left him with a split personality, and he took on an extra part of himself: $\sum_{i=1}^{m}\lambda_i g_i(w,b)$ 😄😆
Because of this family background, $L(w,b,\lambda)$ grew up different from the other kids. He went out to work early, and with career and life both at a low ebb, he turned into another persona, $\theta(\lambda)$:

$$\theta(\lambda) = \min_{w,b} L(w,b,\lambda)$$

$\theta(\lambda)$ is an expression in $\lambda$ obtained by minimizing over $w,b$. Later he met a girl who made his heart race, $f(w^*,b^*)$. When you like someone, the first feeling is often inferiority, as if you could never deserve them.
If $(w^*,b^*)$ solves the primal problem and $\lambda^*$ solves the dual problem, then $f(w^*,b^*) \geq \theta(\lambda^*)$.
Proof

$$\begin{aligned} \theta(\lambda^*) & = \min_{w,b} L(w,b,\lambda^*) \\ & \leq L(w^*,b^*,\lambda^*) \\ & = f(w^*,b^*) + \sum_{i=1}^{m}\lambda_i^* g_i(w^*,b^*) \\ & \leq f(w^*,b^*) \end{aligned}$$

where the last step holds because $\lambda_i^* \geq 0$ and $g_i(w^*,b^*) \leq 0$ at the feasible point $(w^*,b^*)$.
KKT conditions
Cupid KKT

$$\begin{cases} \lambda_i \geq 0 \\ 1-y_i(w^Tx_i+b) \leq 0 \\ \lambda_i\left(1-y_i(w^Tx_i+b)\right) = 0 \\ \nabla_w L(w,b,\lambda) = 0 \\ \nabla_b L(w,b,\lambda) = 0 \\ \nabla_\lambda L(w,b,\lambda) = 0 \end{cases}$$
And so our $\theta(\lambda^*)$ pulled himself together, going from a man who moped all day to a sunny, upbeat one: the $\leq$ became $=$, and he gradually returned to his original form $L(w,b,\lambda)$. He began sending little gifts to Miss $f(w^*,b^*)$, but $f(w^*,b^*) + \sum_{i=1}^{m}\lambda_i^* g_i(w^*,b^*)$ was still not enough: the split personality flared up from time to time, and she would not have him. In the end, $L(w,b,\lambda)$'s devotion moved Cupid KKT 😆. Cupid KKT decided to wipe away his dark history by setting $\sum_{i=1}^{m}\lambda_i^* g_i(w^*,b^*)=0$, curing his illness, and to send him to the top by requiring $\nabla_w L(w,b,\lambda) = 0$ and $\nabla_b L(w,b,\lambda) = 0$, so that $L(w,b,\lambda)$ attains its extremum.
And so, with Cupid KKT's help, Mr. $\theta(\lambda^*)$ finally won Miss $f(w^*,b^*)$ over:

$$\begin{cases} \lambda_i \geq 0 \\ 1-y_i(w^Tx_i+b) \leq 0 \\ \lambda_i\left(1-y_i(w^Tx_i+b)\right) = 0 \end{cases}$$

Moral: if no Cupid comes to help, be as shamelessly persistent as Mr. $\theta(\lambda^*)$.
Optimization under inequality constraints
Mr. $L(w,b,\lambda)$ is gradually growing into a capable, responsible young man:

$$\begin{aligned} L(w,b,\lambda) & = \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \lambda_i\left(1-y_i(w^Tx_i+b)\right) \\ & = \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \left(\lambda_i - \lambda_i y_i w^Tx_i - \lambda_i y_i b\right) \\ & = \frac{1}{2}w^Tw + \sum_{i=1}^m \lambda_i - \sum_{i=1}^m \lambda_i y_i w^Tx_i - \sum_{i=1}^m \lambda_i y_i b \end{aligned}$$
Later, things between $L(w,b,\lambda)$ and Miss $f(w^*,b^*)$ took a subtle turn. Take the partial derivatives with respect to $w$ and $b$ and set them to zero:

$$\begin{aligned} & \frac{\partial L}{\partial w} = w - \sum_{i=1}^{m}\lambda_i y_i x_i = 0 \ \Longrightarrow\ w = \sum_{i=1}^{m}\lambda_i y_i x_i \\ & \frac{\partial L}{\partial b} = -\sum_{i=1}^{m}\lambda_i y_i = 0 \ \Longrightarrow\ \sum_{i=1}^{m}\lambda_i y_i = 0 \end{aligned}$$
Then their child was born: $\sum_{i=1}^{m}\lambda_i y_i = 0$. This gives us a new constraint, and we substitute the stationarity conditions for $w$ and $b$ back into the Lagrangian of

$$\begin{aligned} & \min_{w,b} \quad f(w,b) = \frac{1}{2}\|w\|^2 \\ & \text{s.t.} \quad y_i(w^Tx_i+b) \geq 1, \quad i=1,2,\cdots,m \end{aligned}$$

Finally a new family is formed:

$$\begin{aligned} & \max_{\lambda} \ \sum_{i=1}^m \lambda_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \lambda_i\lambda_j y_i y_j x_i^T x_j \\ & \text{s.t.} \ \begin{cases} \lambda_i \geq 0 \\ \sum\limits_{i=1}^{m}\lambda_i y_i = 0 \end{cases} \end{aligned}$$
The classification decision function
From Cupid KKT we know:
In $\lambda_i\left(1-y_i(w^Tx_i+b)\right) = 0$: when $\lambda_i \neq 0$ we must have $1-y_i(w^Tx_i+b)=0$, and when $1-y_i(w^Tx_i+b) \neq 0$ we must have $\lambda_i = 0$. The two are complementarily slack, so these vanishing terms have no essential effect on solving $L(w,b,\lambda) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \lambda_i\left(1-y_i(w^Tx_i+b)\right)$.
Having solved $L(w,b,\lambda)$ for the $\lambda_i$, substitute back:

$$\begin{aligned} w^* & = \sum_{i=1}^{N}\lambda_i^* y_i x_i \\ b^* & = y_j - \sum_{i=1}^{N}\lambda_i^* y_i (x_i \cdot x_j) \end{aligned}$$

The separating hyperplane is

$$\sum_{i=1}^{N}\lambda_i^* y_i (x \cdot x_i) + b^* = 0$$

and the classification decision function is

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N}\lambda_i^* y_i (x \cdot x_i) + b^*\right)$$
From the KKT conditions we can analyze the possible values of $\lambda_i$ and obtain an important conclusion about support vector machines: once training is complete, most of the training samples need not be kept, and the final model depends only on the support vectors (the boundary samples).
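A minimal numpy sketch of the decision function $f(x)=\operatorname{sign}\left(\sum_i \lambda_i^* y_i (x_i \cdot x) + b^*\right)$. The multipliers below are hypothetical, not the output of an actual solver, but they illustrate the conclusion above: samples with $\lambda_i = 0$ drop out of the sum, so only the support vectors matter.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = np.array([0.25, 0.0, 0.25, 0.0])  # hypothetical multipliers: two support vectors
b = 0.0

def decide(x):
    # only samples with lam[i] > 0 contribute to the sum
    return np.sign(np.sum(lam * y * (X @ x)) + b)

print(decide(np.array([2.0, 2.0])))    # → 1.0
print(decide(np.array([-3.0, -1.0])))  # → -1.0
```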
SMO optimization
To be added.
Kernel functions
Lifting the dimension
Some training samples are not linearly separable, but the samples can become separable in a higher-dimensional feature space.
In short, we raise the dimension of $X$. Let $\phi(x)$ denote the feature vector that maps $X$ from the low-dimensional space into the high-dimensional one.

$$f(x) = w^T\phi(x) + b$$
The earlier model then becomes

$$\begin{aligned} & \max_{\lambda} \ \sum_{i=1}^m \lambda_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \lambda_i\lambda_j y_i y_j \,\phi(x_i)\cdot\phi(x_j) \\ & \text{s.t.} \ \begin{cases} \lambda_i \geq 0 \\ \sum\limits_{i=1}^{m}\lambda_i y_i = 0 \end{cases} \end{aligned}$$

Since this involves computing inner products in the feature space, we replace them with a kernel function $K(x_i,x_j) = \phi(x_i)\cdot\phi(x_j)$.
The dual objective becomes

$$W(\lambda) = \sum_{i=1}^m \lambda_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \lambda_i\lambda_j y_i y_j K(x_i,x_j)$$

and after solving, the classification decision function becomes

$$\begin{aligned} f(x) & = \operatorname{sign}\left(\sum_{i=1}^{N}\lambda_i^* y_i \,\phi(x)\cdot\phi(x_i) + b^*\right) \\ & = \operatorname{sign}\left(\sum_{i=1}^{N}\lambda_i^* y_i K(x_i,x) + b^*\right) \end{aligned}$$
Common kernel functions

| Name | Expression | Parameters |
| --- | --- | --- |
| Linear kernel | $k(x_i,x_j) = x_i^T x_j$ | |
| Polynomial kernel | $k(x_i,x_j) = (x_i^T x_j)^d$ | $d \geq 1$ is the degree of the polynomial |
| Gaussian kernel | $k(x_i,x_j) = \exp\left(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\right)$ | $\sigma > 0$ is the bandwidth of the Gaussian kernel |
| Laplacian kernel | $k(x_i,x_j) = \exp\left(-\frac{\|x_i-x_j\|}{\sigma}\right)$ | $\sigma > 0$ |
| Sigmoid kernel | $k(x_i,x_j) = \tanh(\beta x_i^T x_j + \theta)$ | $\tanh$ is the hyperbolic tangent; $\beta > 0,\ \theta < 0$ |
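The kernels in the table can be written down directly in numpy (a sketch; the parameter values $d$, $\sigma$, $\beta$, $\theta$ used below are arbitrary examples, not recommended defaults):

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, d=2):
    return (xi @ xj) ** d

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def laplacian(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) / sigma)

def sigmoid(xi, xj, beta=1.0, theta=-1.0):
    return np.tanh(beta * (xi @ xj) + theta)

xi, xj = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear(xi, xj), polynomial(xi, xj), gaussian(xi, xj))  # gaussian ≈ exp(−1)
```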
Worked example
Suppose the input space is $\mathbb{R}^2$ and the kernel is $K(x,z) = (x \cdot z)^2$. Find a corresponding feature space $\mathcal{H}$ and a map $\phi: \mathbb{R}^2 \rightarrow \mathcal{H}$.
Solution: take the feature space $\mathcal{H} = \mathbb{R}^3$ and write $x=(x_1,x_2)^T$, $z=(z_1,z_2)^T$. Since $(x \cdot z)^2 = (x_1z_1+x_2z_2)^2 = (x_1z_1)^2 + 2x_1x_2z_1z_2 + (x_2z_2)^2$,
we can take the map

$$\phi(x) = (x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)^T$$

and verify that $\phi(x)\cdot\phi(z) = (x \cdot z)^2 = K(x,z)$.
Another valid map is

$$\phi(x) = \frac{1}{\sqrt{2}}(x_1^2-x_2^2,\ 2x_1x_2,\ x_1^2+x_2^2)^T$$

which can likewise be verified to satisfy $\phi(x)\cdot\phi(z) = (x \cdot z)^2 = K(x,z)$.
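Both claimed identities are easy to verify numerically for a handful of random points:

```python
import numpy as np

def phi1(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def phi2(x):
    x1, x2 = x
    return np.array([x1**2 - x2**2, 2 * x1 * x2, x1**2 + x2**2]) / np.sqrt(2)

rng = np.random.default_rng(0)
for _ in range(5):
    x, z = rng.normal(size=2), rng.normal(size=2)
    K = (x @ z) ** 2
    assert np.isclose(phi1(x) @ phi1(z), K)  # first map matches the kernel
    assert np.isclose(phi2(x) @ phi2(z), K)  # so does the second
print("both maps reproduce K(x, z) = (x·z)^2")
```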
Soft margin and regularization
Loss functions
Real-world data is rarely so ideal: it contains noise, and it is hard to find a kernel function that makes the training samples exactly linearly separable in the feature space.
So we change the optimization objective:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 + \text{loss}$$

On the loss term:
If there is noise, the slope of the $w$ we solve for gets pulled off course; adding a penalty term shrinks that deviation, playing much the same role as regularization.
We can take the loss to be the 0/1 loss $L_{0/1}$:

$$\begin{aligned} & \min_{w,b} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} L_{0/1}\left(y_i(w^Tx_i+b) - 1\right) \\ & L_{0/1}(z) = \begin{cases} 1, & \text{if } z < 0; \\ 0, & \text{otherwise.} \end{cases} \end{aligned}$$

Here $z = y_i(w^Tx_i+b) - 1$, which matches the form in the watermelon book: $L_{0/1}(z)$ fires exactly when $z < 0$, i.e. when a (possibly noisy) sample violates the margin constraint, so each such sample contributes a penalty of $C$ to the objective we minimize. Think of it like regularization.
Here $C$ is the penalty parameter: it trades off making $\frac{1}{2}\|w\|^2$ small, i.e. making the margin large, against keeping the number of misclassified points small.
The 0/1 loss is discontinuous and inconvenient to solve, with poor analytic properties, so we discard it and look for a better surrogate.
Concepts
Common loss functions

$$\begin{aligned} & \text{hinge loss:} \quad L_{hinge}(z) = \max(0,\,1-z) \\ & \text{exponential loss:} \quad L_{exp}(z) = \exp(-z) \\ & \text{logistic loss:} \quad L_{log}(z) = \log\left(1+\exp(-z)\right) \end{aligned}$$
Replacing the discontinuous $L_{0/1}$ with the hinge loss, which plays the same role in a cleaner, continuous form, gives

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\max\left(0,\,1-y_i(w^Tx_i+b)\right)$$
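Evaluating this hinge-loss objective for a candidate $(w,b)$ is straightforward (a sketch; the data and the value of $C$ below are made up, with the middle point placed inside the margin on purpose):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))  # per-sample hinge losses
    return 0.5 * w @ w + C * hinge.sum()

X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])  # middle point violates the margin
y = np.array([1.0, 1.0, -1.0])
w, b, C = np.array([1.0, 0.0]), 0.0, 1.0

print(soft_margin_objective(w, b, X, y, C))  # ½·1 + 1·(0 + 0.5 + 0) → 1.0
```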
Soft constraints
Because of noise, we can introduce a slack variable $\xi_i \geq 0$ for each sample and require the functional margin plus the slack $\xi_i$ to be at least $1$,
i.e. the boundary lines are allowed to move toward the hyperplane:

$$y_i(wx_i+b) \geq 1 - \xi_i$$
- When a sample satisfies the constraint ($y_if(x_i) \geq 1$), $\xi_i = 0$ (and its hinge loss is also 0).
- When a sample violates the constraint ($y_if(x_i) < 1$), $\xi_i > 0$ (and its hinge loss $1-y_if(x_i)$ is likewise greater than 0).
The primal problem becomes

$$\begin{aligned} & \min_{w,b,\xi} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \\ & \text{s.t.} \quad \begin{cases} \xi_i \geq 0, & i=1,2,\cdots,N \\ y_i(wx_i+b) \geq 1-\xi_i, & i=1,2,\cdots,N \end{cases} \end{aligned}$$
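The link between the slack view and the hinge-loss view can be checked numerically: $\xi_i = \max(0,\,1-y_i(w^Tx_i+b))$ is the smallest feasible slack for each sample, and it equals that sample's hinge loss (a sketch with made-up data and $(w,b)$):

```python
import numpy as np

X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0

margins = y * (X @ w + b)
xi = np.maximum(0.0, 1.0 - margins)  # smallest feasible slack = hinge loss

print(xi)                            # only the margin violator gets ξ > 0
print(np.all(margins >= 1.0 - xi))   # all constraints satisfied → True
```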
Constructing the dual in the same fashion, the generalized Lagrange multiplier method gives

$$L(w,b,\xi,\alpha,\mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left(y_i(w \cdot x_i + b) - 1 + \xi_i\right) - \sum_{i=1}^{N}\mu_i\xi_i$$
Setting the partial derivatives of $L(w,b,\xi,\alpha,\mu)$ with respect to $w$, $b$, and $\xi_i$ to zero:

$$\begin{aligned} \nabla_w L(w,b,\xi,\alpha,\mu) & = w - \sum_{i=1}^{N}\alpha_i y_i x_i = 0 \\ \nabla_b L(w,b,\xi,\alpha,\mu) & = -\sum_{i=1}^{N}\alpha_i y_i = 0 \\ \nabla_{\xi_i} L(w,b,\xi,\alpha,\mu) & = C - \alpha_i - \mu_i = 0 \end{aligned}$$

Substituting back yields

$$\min_{w,b,\xi} \ L(w,b,\xi,\alpha,\mu) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_iy_j (x_i \cdot x_j) + \sum_{i=1}^{N}\alpha_i$$
Maximizing $\min\limits_{w,b,\xi} L(w,b,\xi,\alpha,\mu)$ over $\alpha$ then gives the dual problem:

$$\begin{aligned} & \max_{\alpha} \ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j y_iy_j (x_i \cdot x_j) \\ & \text{s.t.} \ \begin{cases} \alpha_i \geq 0 \\ \sum\limits_{i=1}^{N}\alpha_i y_i = 0 \\ \mu_i \geq 0 \\ C - \alpha_i - \mu_i = 0 \end{cases} \end{aligned}$$

Eliminating $\mu_i$, the constraints on the multipliers reduce to $0 \leq \alpha_i \leq C$.
KKT conditions

$$\begin{cases} \alpha_i \geq 0, \quad \mu_i \geq 0, \\ y_if(x_i) - 1 + \xi_i \geq 0, \\ \alpha_i\left(y_if(x_i) - 1 + \xi_i\right) = 0, \\ \xi_i \geq 0, \quad \mu_i\xi_i = 0 \end{cases}$$