Stanford ML - Lecture 8 - Support Vector Machines

Quebradawill

Categories: ML-Stanford-Andrew Ng, Machine Learning

Article link: https://blog.csdn.net/qiudw/article/details/8685170

1. Optimization Objective

logistic regression

$\min_{\theta} \frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \left( - \log h_{\theta} (x^{(i)}) \right ) + (1 - y^{(i)}) \left( - \log (1 - h_{\theta} (x^{(i)})) \right ) \right ] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$

where

$h_{\theta}(x) = \frac{1}{1+e^{-\theta^T x}}$

support vector machine

$\textrm{if } \ y = 1 \ (\textrm{want } \ \theta^T x \gg 0 ), \textrm{if } \ y = 0 \ (\textrm{want } \ \theta^T x \ll 0 ),$

$\textrm{with} \ z = \theta^T x: \quad cost_1(z) \ \textrm{replaces} \ - \log \frac{1}{1 + e^{-z}}, \quad cost_0(z) \ \textrm{replaces} \ - \log \left(1 - \frac{1}{1 + e^{-z}}\right )$

$\min_{\theta} C \sum_{i=1}^m \left[ y^{(i)} cost_1(\theta^T x^{(i)}) + (1 - y^{(i)}) cost_0(\theta^T x^{(i)}) \right ] + \frac{1}{2} \sum_{j=1}^n \theta_j^2$
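
The objective above can be sketched in a few lines of Python. The notes do not give closed forms for $cost_1$ and $cost_0$, so the hinge-style definitions below (zero once $z \geqslant 1$ or $z \leqslant -1$, linear otherwise) are an assumption chosen for illustration; the function names are hypothetical too.

```python
def cost1(z):
    # Assumed surrogate for -log(sigmoid(z)): zero once z >= 1.
    return max(0.0, 1.0 - z)


def cost0(z):
    # Assumed surrogate for -log(1 - sigmoid(z)): zero once z <= -1.
    return max(0.0, 1.0 + z)


def svm_objective(theta, X, y, C):
    """C * sum of per-example costs + 1/2 * sum_{j>=1} theta_j^2.

    Each row of X starts with x_0 = 1, so theta[0] is the intercept
    and is left out of the regularization term.
    """
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * x for t, x in zip(theta, x_i))
        data_term += y_i * cost1(z) + (1 - y_i) * cost0(z)
    reg_term = 0.5 * sum(t * t for t in theta[1:])
    return C * data_term + reg_term


X = [[1.0, 2.0], [1.0, -2.0]]   # two examples, leading 1 is x_0
y = [1, 0]
value = svm_objective([0.0, 1.0], X, y, C=1.0)
```

With `theta = [0, 1]` both examples satisfy their margin, so only the regularization term contributes; shrinking `theta` lowers that term but starts to incur cost on the data term.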

2. Large Margin Intuition

3. The mathematics behind large margin classification (optional)

vector inner product

$||u|| = \ \textrm{length of vector} \ u = \sqrt{u_1^2 + u_2^2} \in \mathbb{R}$

$u^T v = ||u|| \cdot ||v|| \cos \varphi = p \cdot ||u|| = u_1 v_1 + u_2 v_2$

$p = ||v|| \cos \varphi = \ \textrm{signed length of projection of} \ v \ \textrm{onto} \ u$
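
As a quick numeric check of these identities, the sketch below computes the signed projection $p$ of $v$ onto $u$ and confirms that $p \cdot ||u||$ equals $u^T v$. The helper names and the example vectors are made up for illustration.

```python
import math


def norm(u):
    # Euclidean length ||u||
    return math.sqrt(sum(c * c for c in u))


def inner(u, v):
    # Inner product u^T v
    return sum(a * b for a, b in zip(u, v))


def signed_projection(v, u):
    # p = ||v|| cos(phi) = u^T v / ||u||; negative when the angle exceeds 90 degrees
    return inner(u, v) / norm(u)


u = [4.0, 2.0]
v = [-2.0, 1.0]
p = signed_projection(v, u)
same = math.isclose(p * norm(u), inner(u, v))
```

Here $u^T v = -6$, so $p$ is negative, which is the case that matters for negative examples in the SVM constraints.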

SVM decision boundary

$\begin{align*} \min_{\theta} \frac{1}{2} \sum_{j=1}^n \theta_j^2 = \frac{1}{2} (\theta_1^2 + \theta_2^2) = \frac{1}{2} \left( \sqrt{\theta_1^2 + \theta_2^2} \right )^2 = \frac{1}{2}||\theta||^2 \\ \textrm{s.t.} \quad \theta^T x^{(i)} \geqslant 1 \qquad \qquad \textrm{if} \ y^{(i)} = 1 \\ \theta^T x^{(i)} \leqslant -1 \qquad \qquad \textrm{if} \ y^{(i)} = 0 \end{align*}$
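
A short worked step connects this to the large margin. Write $p^{(i)}$ for the signed projection of $x^{(i)}$ onto $\theta$ (this symbol is introduced here for illustration). The inner product identity then gives

$\theta^T x^{(i)} = p^{(i)} \cdot ||\theta||$

so the constraints read

$\begin{align*} p^{(i)} \cdot ||\theta|| \geqslant 1 \qquad \textrm{for positive examples} \\ p^{(i)} \cdot ||\theta|| \leqslant -1 \qquad \textrm{for negative examples} \end{align*}$

Since the objective makes $||\theta||$ small, the constraints can only hold if every $|p^{(i)}|$ is large. The decision boundary is perpendicular to $\theta$, so large projections mean the examples lie far from the boundary, i.e. a large margin.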

4. Kernels I

Given $x$, compute new features based on its proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$:

$f_1 = \textrm{similarity}(x, l^{(1)}) = \exp \left( - \frac{|| x - l^{(1)} ||^2}{2 \sigma^2} \right )$

$f_2 = \textrm{similarity}(x, l^{(2)}) = \exp \left( - \frac{|| x - l^{(2)} ||^2}{2 \sigma^2} \right )$

$f_3 = \textrm{similarity}(x, l^{(3)}) = \exp \left( - \frac{|| x - l^{(3)} ||^2}{2 \sigma^2} \right )$

$\textrm{The function similarity}(x, l) \ \textrm{is called a kernel;} \ \exp \left( - \frac{|| x - l||^2}{2 \sigma^2} \right ) \ \textrm{is the Gaussian kernel.}$
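
A minimal sketch of this kernel in plain Python follows. The function name `gaussian_kernel` and the sample points are illustrative choices, not part of the lecture.

```python
import math


def gaussian_kernel(x, l, sigma2):
    # exp(-||x - l||^2 / (2 * sigma^2)), with sigma2 standing for sigma^2
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma2))


at_landmark = gaussian_kernel([1.0, 2.0], [1.0, 2.0], sigma2=1.0)    # x == l
nearby = gaussian_kernel([0.0, 0.0], [1.0, 1.0], sigma2=1.0)         # ||x - l||^2 = 2
far_away = gaussian_kernel([0.0, 0.0], [10.0, 10.0], sigma2=1.0)     # ||x - l||^2 = 200
```

The feature equals 1 when $x$ sits on the landmark and decays toward 0 as $x$ moves away from it.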

5. Kernels II

SVM with kernels
- How do we choose the landmarks $l$?
- How do we compute $\theta$?
- Reference: http://blog.csdn.net/abcjennifer/article/details/7849812

$\begin{align*} \textrm{Given} \left( x^{(1)}, y^{(1)} \right ), \left( x^{(2)}, y^{(2)} \right ), \cdots, \left( x^{(m)}, y^{(m)} \right ) \\ \textrm{choose} \ l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \cdots, l^{(m)} = x^{(m)} \ \ \end{align*}$
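
With one landmark per training example, every example is mapped to $m$ kernel features. The sketch below builds that $m \times m$ feature matrix for a toy data set; the helper names and the data are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, l, sigma2):
    # exp(-||x - l||^2 / (2 * sigma^2)), with sigma2 standing for sigma^2
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma2))


def kernel_features(X, landmarks, sigma2):
    # Row i holds f_1 .. f_m for example x^(i): one similarity per landmark.
    return [[gaussian_kernel(x, l, sigma2) for l in landmarks] for x in X]


X = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]   # toy training inputs
F = kernel_features(X, X, sigma2=1.0)       # landmarks l^(i) = x^(i)
```

Because each landmark coincides with a training example, the diagonal of `F` is all ones, and `F` is symmetric since the kernel depends only on the distance between the two points.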

SVM parameters

$C$ (plays a role similar to $\frac{1}{\lambda}$)

- Large $C$: lower bias, higher variance
- Small $C$: higher bias, lower variance

$\sigma^2$

- Large $\sigma^2$: features $f_i$ vary more smoothly; higher bias, lower variance.
- Small $\sigma^2$: features $f_i$ vary less smoothly; lower bias, higher variance.
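
To see the effect of $\sigma^2$ numerically, the sketch below evaluates the Gaussian feature at a fixed squared distance for a small and a large $\sigma^2$. The specific values (4.0, 0.5, 8.0) and the helper name are arbitrary, chosen only for illustration.

```python
import math


def gaussian_feature(sq_dist, sigma2):
    # Gaussian kernel value as a function of the squared distance ||x - l||^2
    return math.exp(-sq_dist / (2.0 * sigma2))


sq_dist = 4.0
narrow = gaussian_feature(sq_dist, sigma2=0.5)   # exp(-4): feature has almost vanished
wide = gaussian_feature(sq_dist, sigma2=8.0)     # exp(-0.25): feature is still large
```

A small $\sigma^2$ makes each feature drop off sharply around its landmark, so the model can fit fine detail; a large $\sigma^2$ spreads each feature out, which smooths the decision boundary.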

6. Using an SVM

Use an SVM software package (e.g. liblinear, libsvm, ...) to solve for the parameters $\theta$. You need to specify:
- the choice of parameter $C$
- the choice of kernel (similarity function)
  - no kernel ("linear kernel"): predict $y = 1$ if $\theta^T x \geqslant 0$
  - Gaussian kernel: $f_i = \exp \left( - \frac{||x - l^{(i)}||^2}{2 \sigma^2} \right )$, where $l^{(i)} = x^{(i)}$; you also need to choose $\sigma^2$

Many off-the-shelf kernels are available:
- polynomial kernel
- more esoteric: string kernel, chi-square kernel, histogram intersection kernel, ...

Multi-class classification
- many SVM packages already have built-in multi-class classification functionality
- otherwise, use the one-vs.-all method
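
The one-vs.-all prediction step can be sketched as follows, assuming $K$ parameter vectors have already been trained, one per class, each separating its class from the rest. The function name, the hand-picked parameters, and the convention of picking the class with the largest $\theta^T x$ are assumptions for illustration.

```python
def predict_one_vs_all(thetas, x):
    # Score x against every class's parameter vector and return the index
    # of the class with the largest theta^T x.
    scores = [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)


# Three hypothetical classifiers over inputs [x_0 = 1, x_1, x_2].
thetas = [
    [0.0, 1.0, 0.0],    # class 0 fires for large positive x_1
    [0.0, -1.0, 0.0],   # class 1 fires for large negative x_1
    [0.0, 0.0, 1.0],    # class 2 fires for large positive x_2
]
labels = [predict_one_vs_all(thetas, x)
          for x in ([1.0, 2.0, 0.0], [1.0, -3.0, 0.0], [1.0, 0.0, 5.0])]
```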

Logistic regression vs. SVM
- $n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples
  - If $n$ is large (relative to $m$), use logistic regression, or an SVM without a kernel ("linear kernel").
  - If $n$ is small and $m$ is intermediate, use an SVM with a Gaussian kernel.
  - If $n$ is small and $m$ is large, create/add more features, then use logistic regression or an SVM without a kernel.
  - A neural network is likely to work well in most of these settings, but may be slower to train.

Reference: http://blog.csdn.net/abcjennifer/article/details/7849812
