Coursera Machine Learning Week 7 Notes

Latest recommended article published on 2022-02-01 17:49:01

LVB10101111

Category: Machine Learning Coursera By Andrew Ng. Tags: machine learning, svm

Article link: https://blog.csdn.net/u013515273/article/details/77482350

The programming assignments are on GitHub: coursera_machine_learning

Support Vector Machine

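The goal of a support vector machine (SVM) is to separate a linearly separable dataset with a hyperplane while keeping the points of both classes as far from that hyperplane as possible.

Infinitely many hyperplanes can separate a linearly separable dataset, and a Perceptron is enough to find one of them. Requiring in addition that the points be as far from the hyperplane as possible makes the solution unique, and SVM is the method that finds it.

The following sections introduce SVM from three angles: the cost function, the intuition, and kernels.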

1. Optimization Objective

This section introduces SVM by analogy with Logistic Regression.

First, recall the classification decision function $h(x)$ and the loss function $J(\theta)$ of Logistic Regression:

$h(x)=\frac{1}{1+e^{-\theta^Tx}}$

$J(\theta)=-y\log{h(x)}-(1-y)\log{(1-h(x))}$

The figure below shows how the loss changes with $z=\theta^Tx$.

When $y=1$, $J(\theta)=-\log{h(x)}=-\log{\frac{1}{1+e^{-\theta^Tx}}}$

When $y=0$, $J(\theta)=-\log{(1-h(x))}=-\log{(1-\frac{1}{1+e^{-\theta^Tx}})}$
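To make the shape of these two curves concrete, here is a minimal Python sketch that evaluates the Logistic Regression loss as a function of $z=\theta^Tx$ for both labels. The function names `sigmoid` and `logistic_loss` are chosen for illustration only and do not come from the course material.

```python
import math

def sigmoid(z):
    # h(x) written as a function of z = theta^T x
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(z, y):
    # J = -y*log(h) - (1-y)*log(1-h), for a single example with label y in {0, 1}
    h = sigmoid(z)
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

# Tabulate the loss for a few values of z: the y=1 loss shrinks as z grows,
# the y=0 loss shrinks as z becomes more negative.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, round(logistic_loss(z, 1), 4), round(logistic_loss(z, 0), 4))
```

The two curves are mirror images of each other around $z=0$, which is why the SVM versions of the cost are also defined as a symmetric pair.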

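Now modify the $J(\theta)$ above slightly.

When $y=1$, $J(\theta)$ becomes the purple line in the figure below:

When $y=0$, $J(\theta)$ becomes the purple line in the figure below:

The figures show what this loss function wants: every negative example ($y=0$) should have $\theta^Tx < -1$, and every positive example ($y=1$) should have $\theta^Tx > 1$.

$\theta^Tx$ is proportional to the signed distance from a sample point to the hyperplane: it equals that distance multiplied by $\left \| \theta \right \|$. In other words, the loss function wants $\theta^Tx$ to exceed 1 in magnitude for every sample, and the sign gives the side of the hyperplane (above or below) on which the sample lies. Why the value is exactly "1" is explained in detail below.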

In SVM, the classification decision function $h(x)$ is defined as:

$h(x)=\left\{\begin{matrix} 1 \quad \text{if} \ \theta^Tx>0 \\ 0 \quad \text{otherwise} \end{matrix}\right.$

The purpose of $h(x)$ is simple: use a hyperplane to separate the positive and negative examples.

The loss function $J(\theta)$ is defined as:

$J(\theta)=C\sum^{m}_{i=1}[y^{(i)} \times cost_1(\theta^Tx^{(i)}) + (1-y^{(i)}) \times cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum^{n}_{j=1}\theta^2_j$

$cost_1(\theta^Tx)=\left\{\begin{matrix} 0 & \text{if} \ y=1, \ \theta^Tx>1 \\ \text{some value} & y=1, \ \text{otherwise} \end{matrix}\right.$

$cost_0(\theta^Tx)=\left\{\begin{matrix} 0 & \text{if} \ y=0, \ \theta^Tx<-1 \\ \text{some value} & y=0, \ \text{otherwise} \end{matrix}\right.$

The purpose of $J(\theta)$ is to separate the positive and negative examples as well as possible while keeping them some distance away from the hyperplane. It wants $\theta^Tx>1$ when $y=1$ and $\theta^Tx<-1$ when $y=0$.
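The loss above can be sketched in a few lines of Python. The notes leave the non-zero branch of $cost_1$ and $cost_0$ as "some value", so this sketch assumes the common hinge shape, $\max(0, 1-z)$ and $\max(0, 1+z)$. It also assumes that `theta[0]` is a bias term excluded from the regularization sum. Both choices, and all function names, are illustrative assumptions rather than something stated in the notes.

```python
def cost_1(z):
    # Assumed hinge shape: zero once z >= 1, growing linearly below that.
    return max(0.0, 1.0 - z)

def cost_0(z):
    # Mirror image: zero once z <= -1, growing linearly above that.
    return max(0.0, 1.0 + z)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def svm_cost(theta, X, y, C):
    """J(theta) with labels in {0, 1}; theta[0] is treated as the bias
    (an assumption) and left out of the regularization sum over j = 1..n."""
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = dot(theta, x_i)
        data_term += y_i * cost_1(z) + (1 - y_i) * cost_0(z)
    reg_term = 0.5 * sum(t * t for t in theta[1:])
    return C * data_term + reg_term

# Toy data: the first component of each x is the constant 1 for the bias.
X = [[1.0, 2.0], [1.0, -2.0], [1.0, 0.5]]
y = [1, 0, 1]
theta = [0.0, 1.0]
print(svm_cost(theta, X, y, C=1.0))
```

Only the third sample contributes to the data term here, because it is a positive example with $\theta^Tx = 0.5$, which is on the correct side but inside the margin.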

2. Large Margin Intuition 1

SVM has the following optimization objective:

$\min_{\theta}C\sum^{m}_{i=1}[y^{(i)} \times cost_1(\theta^Tx^{(i)}) + (1-y^{(i)}) \times cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum^{n}_{j=1}\theta^2_j$

Here $C$ is a trade-off factor similar to $\lambda$; it can be understood as $C=\frac{1}{\lambda}$.

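Now consider the case where the dataset is linearly separable, and look at how SVM picks a hyperplane in order to build the intuition behind it.

When the dataset is linearly separable, the hyperplane chosen by SVM must satisfy:

$\theta^Tx>1$ for every positive example and $\theta^Tx<-1$ for every negative example.

That is, every sample keeps some distance from the hyperplane, as shown by the blue line in the figure below:

As a result, every ${cost}_1=0$ and every ${cost}_0=0$, and the optimization objective becomes:

$\min_{\theta}\frac{1}{2}\sum^{n}_{j=1}\theta^2_j$

$s.t. \ \theta^Tx>1 \quad \ \ if \ y=1$

$\quad \ \ \theta^Tx<-1 \quad if \ y=0$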

Next, look at which hyperplane SVM will choose. $\theta$ is the normal vector of the hyperplane.

The signed distance $p$ from a sample point $x$ to the hyperplane is:

$p=\frac{\theta^Tx}{\left \| \theta \right \|}$

so

$\theta^Tx=p\left \| \theta \right \|$
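As a quick numeric check of this relation, the following Python sketch computes the signed distance $p$ for a hyperplane through the origin with normal vector $\theta$. The helper names and the sample numbers are made up for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def signed_distance(theta, x):
    # p = theta^T x / ||theta||, positive on the side theta points to
    return dot(theta, x) / norm(theta)

theta = [3.0, 4.0]      # ||theta|| = 5
x_above = [2.0, 1.0]    # theta^T x = 10
x_below = [-1.0, -3.0]  # theta^T x = -15

for x in (x_above, x_below):
    p = signed_distance(theta, x)
    # theta^T x and p * ||theta|| agree, as in the identity above
    print(p, dot(theta, x), p * norm(theta))
```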

The optimization objective can then be rewritten as:

$\min_{\theta}\frac{1}{2}\sum^{n}_{j=1}\theta^2_j$

$s.t. \ p \left \| \theta \right \|>1 \quad \ \ if \ y=1$

$\quad \ \ \ p\left \| \theta \right \|<-1 \quad if \ y=0$

Case 1, when the margin is very small:

When $0<p^{(1)}\ll 1$, $\left \| \theta \right \|$ has to be very large to make $p^{(1)} \left \| \theta \right \|>1$;

When $-1\ll p^{(2)}<0$, $\left \| \theta \right \|$ has to be very large to make $p^{(2)} \left \| \theta \right \|<-1$;

This contradicts ${\min}_{\theta}\frac{1}{2}\sum^{n}_{j=1}\theta^2_j$, so SVM will not choose such a hyperplane.

Case 2, when the margin is fairly large:

When $p^{(1)}>1$, $\left \| \theta \right \|$ does not need to be large to make $p^{(1)} \left \| \theta \right \|>1$;

When $p^{(2)}<-1$, $\left \| \theta \right \|$ does not need to be large to make $p^{(2)} \left \| \theta \right \|<-1$;

This case is consistent with ${\min}_{\theta}\frac{1}{2}\sum^{n}_{j=1}\theta^2_j$, so SVM will choose such a hyperplane.
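The two cases can be compared numerically. The sketch below assumes the direction of $\theta$ is fixed, so the signed distances $p^{(i)}$ are fixed and only $\left \| \theta \right \|$ can vary. It then reports the bound that $\left \| \theta \right \|$ must exceed for every constraint $|p^{(i)}| \left \| \theta \right \| > 1$ to hold. The function name and the sample distances are illustrative assumptions.

```python
def smallest_norm_needed(projections):
    """Lower bound on ||theta||: every |p| * ||theta|| > 1 requires
    ||theta|| > 1 / min|p|, so the closest sample decides the bound."""
    return 1.0 / min(abs(p) for p in projections)

small_margin = [0.05, -0.1]  # samples very close to the hyperplane
large_margin = [2.0, -4.0]   # samples far from the hyperplane

print(smallest_norm_needed(small_margin))
print(smallest_norm_needed(large_margin))
```

A hyperplane whose closest sample is 40 times nearer needs a norm bound 40 times larger, which is exactly what minimizing $\frac{1}{2}\sum^{n}_{j=1}\theta^2_j$ penalizes.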

3. Large Margin Intuition 2

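Ng's intuition above does not explain why the condition is $\theta^Tx>1$, that is, where the 1 comes from. The following explanation draws on Li Hang's treatment.

First, define a few terms. Note that from here on the labels are taken as $y^{(i)}\in\{+1,-1\}$ rather than $\{0,1\}$, so that multiplying by $y^{(i)}$ makes the margin of every correctly classified sample positive.

Definition 1 (geometric margin of a sample point): the signed distance from the sample point to the hyperplane, multiplied by its label, i.e.:

$p^{(i)}=y^{(i)}\frac{\theta^Tx^{(i)}}{\left \| \theta \right \|}$

Definition 2 (geometric margin of the training set): the minimum of the geometric margins of all sample points in the training set, i.e.:

$p=\min_{i}p^{(i)}$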
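Both definitions translate directly into code. This is a minimal sketch assuming labels in $\{+1,-1\}$ and a hyperplane through the origin; the function names and the toy dataset are illustrative.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def geometric_margin(theta, x, y):
    # Definition 1: label times signed distance to the hyperplane
    return y * dot(theta, x) / math.sqrt(dot(theta, theta))

def dataset_margin(theta, X, Y):
    # Definition 2: the smallest geometric margin over the training set
    return min(geometric_margin(theta, x, y) for x, y in zip(X, Y))

theta = [1.0, 1.0]
X = [[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -3.0]]
Y = [1, 1, -1, -1]

print([round(geometric_margin(theta, x, y), 4) for x, y in zip(X, Y)])
print(dataset_margin(theta, X, Y))
```

The training-set margin is decided by the sample closest to the hyperplane, here the point $(-1,-1)$.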

The goal of SVM is to find the hyperplane that maximizes $p$. Its optimization objective can therefore also be written as:

$\max_{\theta}p$

$s.t. \ y^{(i)}\frac{\theta^Tx^{(i)}}{\left \| \theta \right \|} \geqslant p$

Clearly, rescaling $\theta$ by a positive constant changes neither the hyperplane nor the geometric margin $p$, so the product $p\left \| \theta \right \|$ can be set to any positive value without affecting whether the inequality holds.

So we may as well set $p=\frac{1}{\left \| \theta \right \|}$, which turns the optimization objective into:

$\max_{\theta}\frac{1}{\left \| \theta \right \|}$

$s.t. \ y^{(i)}(\theta^Tx^{(i)}) \geqslant 1$

Since $\max_{\theta}\frac{1}{\left \| \theta \right \|}$ is equivalent to $\min_{\theta}\frac{1}{2}\left \| \theta \right \|^2$, the final optimization objective is:

$\min_{\theta}\frac{1}{2}\left \| \theta \right \|^2$

$s.t. \ y^{(i)}(\theta^Tx^{(i)}) \geqslant 1$

This agrees with Ng's optimization objective above.
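The rescaling step in this derivation can be checked numerically. The sketch below takes a separating $\theta$, rescales it so that the smallest value of $y^{(i)}\theta^Tx^{(i)}$ equals 1, and confirms that $\frac{1}{\left \| \theta \right \|}$ of the rescaled vector equals the geometric margin of the original one. Labels are assumed to be in $\{+1,-1\}$; the helper `rescale_to_unit_margin` and the data are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def rescale_to_unit_margin(theta, X, Y):
    # Divide theta by the smallest y * theta^T x so that this minimum becomes 1.
    smallest = min(y * dot(theta, x) for x, y in zip(X, Y))
    if smallest <= 0:
        raise ValueError("theta does not separate the data")
    return [t / smallest for t in theta]

theta = [1.0, 1.0]
X = [[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -3.0]]
Y = [1, 1, -1, -1]

scaled = rescale_to_unit_margin(theta, X, Y)
margin_before = min(y * dot(theta, x) for x, y in zip(X, Y)) / norm(theta)

print(scaled)                # same direction as theta, different length
print(min(y * dot(scaled, x) for x, y in zip(X, Y)))
print(margin_before, 1.0 / norm(scaled))
```

Because the direction of $\theta$ is unchanged, the hyperplane is the same before and after rescaling; only the bookkeeping changes, which is what allows the constraint to be written with the constant 1.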

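Note that, as stated in Intuition 1, this optimization objective comes with constraints: $y^{(i)}(\theta^Tx^{(i)}) \geqslant 1$ must hold for every sample. In other words, there must exist a hyperplane that separates the dataset perfectly, i.e. the dataset must be linearly separable.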

To summarize, for a linearly separable dataset:

The classification decision function $h(x)$ is defined as:

$h(x)=\left\{\begin{matrix} 1 \quad \text{if} \ \theta^Tx>0 \\ 0 \quad \text{otherwise} \end{matrix}\right.$

The optimization objective / loss function is:

$\min_{\theta}\frac{1}{2}\left \| \theta \right \|^2$

$s.t. \quad y^{(i)}(\theta^Tx^{(i)}) \geqslant 1$

4. Outlier - Datasets with Outliers

Everything above assumes a linearly separable dataset. In practice such datasets are rare: real datasets are almost always disturbed by some outliers, and only after removing these outliers is the data formed by the remaining majority of samples linearly separable.

So how should SVM handle the case where a few samples do not satisfy $y^{(i)}(\theta^Tx^{(i)}) \geqslant 1$?

The solution is to give each sample point $x^{(i)}$ a slack variable $\varepsilon^{(i)} \geqslant 0$, so that the constraint becomes:

$y^{(i)}(\theta^Tx^{(i)}) \geqslant 1 - \varepsilon^{(i)}$

Each slack variable comes at a cost, so the optimization objective becomes:

$\min_{\theta}\frac{1}{2}\left \| \theta \right \|^2+C \sum^m_{i=1}\varepsilon^{(i)}$
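For a fixed $\theta$, the smallest non-negative slack that satisfies the relaxed constraint is $\max(0, 1 - y^{(i)}\theta^Tx^{(i)})$. The Python sketch below evaluates the objective under that assumption, with labels in $\{+1,-1\}$. The function names and the data are illustrative and not taken from the course.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def slack(theta, x, y):
    # Smallest eps >= 0 with y * theta^T x >= 1 - eps
    return max(0.0, 1.0 - y * dot(theta, x))

def soft_margin_objective(theta, X, Y, C):
    # 0.5 * ||theta||^2 + C * sum of slacks
    total_slack = sum(slack(theta, x, y) for x, y in zip(X, Y))
    return 0.5 * dot(theta, theta) + C * total_slack

theta = [1.0, 0.0]
X = [[2.0, 0.0], [0.5, 1.0], [-3.0, 1.0], [0.5, -1.0]]
Y = [1, 1, -1, -1]

print([slack(theta, x, y) for x, y in zip(X, Y)])
print(soft_margin_objective(theta, X, Y, C=1.0))
print(soft_margin_objective(theta, X, Y, C=10.0))
```

The last sample is an outlier on the wrong side of the hyperplane and needs a slack above 1, while the second sample is correctly classified but inside the margin and needs a slack between 0 and 1. Raising $C$ makes both more expensive.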

Now look back at the loss function $J(\theta)$ that Ng gave at the very beginning:

$J(\theta)=C\sum^{m}_{i=1}[y^{(i)} \times cost_1(\theta^Tx^{(i)}) + (1-y^{(i)}) \times cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum^{n}_{j=1}\theta^2_j$

One can regard:

$\varepsilon^{(i)}=y^{(i)} \times {cost}_1(\theta^Tx^{(i)}) + (1-y^{(i)}) \times {cost}_0(\theta^Tx^{(i)})$

In practice, however, this substitution is not needed: the values of ${cost}_1$ and ${cost}_0$ when $y^{(i)}(\theta^Tx^{(i)})<1$ are not specified, so it is simpler to treat the whole term as an unknown and find the optimum with some other optimization algorithm.

To summarize, for a dataset with outliers:

The classification decision function $h(x)$ is defined as:

$h(x)=\left\{\begin{matrix} 1 \quad \text{if} \ \theta^Tx>0 \\ 0 \quad \text{otherwise} \end{matrix}\right.$

The optimization objective / loss function is:

$\min_{\theta}\frac{1}{2}\left \| \theta \right \|^2+C \sum^m_{i=1}\varepsilon^{(i)}$

$s.t. \quad y^{(i)}(\theta^Tx^{(i)}) \geqslant 1 - \varepsilon^{(i)}$
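The notes do not say which algorithm solves this problem. As one possible illustration, the sketch below runs plain subgradient descent on $\frac{1}{2}\left \| \theta \right \|^2 + C\sum_i \max(0, 1 - y^{(i)}\theta^Tx^{(i)})$, that is, with each slack replaced by its smallest feasible value. It assumes labels in $\{+1,-1\}$, no bias term, and a fixed learning rate. The function names, hyperparameters, and data are all hypothetical, and this is not the solver used in the course assignments.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_soft_margin_svm(X, Y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on 0.5*||theta||^2 + C*sum(max(0, 1 - y*theta^T x)).
    Labels are assumed to be in {+1, -1}."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = list(theta)  # gradient of the 0.5*||theta||^2 term
        for x, y in zip(X, Y):
            if y * dot(theta, x) < 1:  # sample currently needs slack
                for j, xj in enumerate(x):
                    grad[j] -= C * y * xj
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def predict(theta, x):
    # Same rule as h(x), but returning -1 instead of 0 for the negative class
    return 1 if dot(theta, x) > 0 else -1

X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
Y = [1, 1, -1, -1]

theta = train_soft_margin_svm(X, Y)
print(theta)
print([predict(theta, x) for x in X])
```

With a fixed step size the iterate keeps oscillating slightly around the optimum instead of converging exactly, so a decaying learning rate or a dedicated solver would be the more careful choice for real data.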