SVM (CS229 Part V) Study Notes

  1. margins and the idea of separating data with a large “gap”
  2. optimal margin classifier
  3. kernels, which give a way to apply SVMs in very high dimensional feature spaces
  4. SMO algorithm, an implementation of SVMs

Notation

Use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels.
Use parameters $w, b$, and write the classifier as $h_{w,b}(x) = g(w^T x + b)$, where $g(z) = 1$ if $z \geq 0$, and $g(z) = -1$ otherwise.
Drop the convention of letting $x_0 = 1$ be an extra coordinate in the input feature vector. Thus, $b$ takes the role of $\theta_0$, and $w$ is $[\theta_1 \cdots \theta_n]^T$.
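As a quick illustration, here is a minimal Python sketch of this classifier (the names `predict`, `w`, and `b` are illustrative, not from the notes):

```python
import numpy as np

def predict(w, b, x):
    """h_{w,b}(x) = g(w^T x + b), with g(z) = 1 if z >= 0 and -1 otherwise."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Example: a 2D linear classifier.
w = np.array([2.0, -1.0])
b = 0.5
print(predict(w, b, np.array([1.0, 1.0])))   # w^T x + b = 1.5 >= 0  -> 1
print(predict(w, b, np.array([-1.0, 3.0])))  # w^T x + b = -4.5 < 0  -> -1
```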

Functional and geometric margins

Given a training example $(x^{(i)}, y^{(i)})$, we define the functional margin of $(w, b)$ with respect to the training example as $\hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)} + b)$.
A large functional margin means a confident and correct prediction.
Given a training example $(x^{(i)}, y^{(i)})$, we define the geometric margin of $(w, b)$ with respect to the training example as $\gamma^{(i)} = y^{(i)}\left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)$.
The geometric margin is invariant to rescaling of the parameters; i.e., if we replace w with 2w and b with 2b, then the geometric margin does not change. Thus, when trying to fit w and b to training data, we can impose an arbitrary scaling constraint on w without changing anything important.
Finally, given a training set $S = \{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}$, we define the geometric margin of $(w, b)$ with respect to $S$ to be the smallest of the geometric margins on the individual training examples: $\gamma = \min_i \gamma^{(i)}$.
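To make the two definitions concrete, here is a small Python sketch (function names are illustrative) that computes both margins for a fixed $(w, b)$ and checks the scale invariance of the geometric margin:

```python
import numpy as np

def functional_margins(w, b, X, y):
    # gamma_hat^{(i)} = y^{(i)} (w^T x^{(i)} + b)
    return y * (X @ w + b)

def geometric_margins(w, b, X, y):
    # gamma^{(i)} = y^{(i)} ((w / ||w||)^T x^{(i)} + b / ||w||)
    return functional_margins(w, b, X, y) / np.linalg.norm(w)

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -0.5

print(geometric_margins(w, b, X, y).min())          # geometric margin w.r.t. the set S
print(geometric_margins(2 * w, 2 * b, X, y).min())  # same value: invariant to rescaling
```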

Optimal margin classifier

Goal: find a decision boundary that maximizes the geometric margin, since this would reflect a very confident set of predictions on the training set and a good “fit” to the training data.
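In the notes this goal is first posed as maximizing $\gamma$ subject to every example having margin at least $\gamma$; using the scale invariance above to fix the functional margin to 1, it can be rewritten (following the standard CS229 derivation) as a convex quadratic program:

```latex
\min_{w,\, b}\ \frac{1}{2}\|w\|^2
\quad \text{s.t.}\quad y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1,\quad i = 1, \ldots, m
```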

Lagrange duality

This leads us to the dual form of the optimization problem, a standard technique for solving constrained optimization problems.
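As a sketch following the standard derivation in the notes, applying Lagrange duality to the optimal margin problem gives the dual below; note that the data appear only through inner products, which is what later makes kernels possible:

```latex
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \left\langle x^{(i)}, x^{(j)} \right\rangle
\quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
```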

Kernels

A kernel is a function that measures the similarity between two examples.
Gaussian kernel: $K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$
If $K$ is a valid kernel (also called a Mercer kernel), i.e., if it corresponds to some feature mapping $\phi$, then the corresponding kernel matrix $K$ is symmetric positive semidefinite.
Theorem (Mercer): Let $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ be given. Then for $K$ to be a valid (Mercer) kernel, it is necessary and sufficient that for any $\{x^{(1)}, \ldots, x^{(m)}\}$ ($m < \infty$), the corresponding kernel matrix is symmetric positive semidefinite.
If you have any learning algorithm that you can write in terms of only inner products ⟨x, z⟩ between input attribute vectors, then by replacing this with K(x, z) where K is a kernel, you can “magically” allow your algorithm to work efficiently in the high dimensional feature space corresponding to K.
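A short Python sketch of these two points: build the Gaussian kernel matrix for a small sample of points and check that it is symmetric positive semidefinite (function names are illustrative):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def kernel_matrix(X, kernel):
    m = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

X = np.random.randn(5, 3)                      # 5 examples in R^3
K = kernel_matrix(X, gaussian_kernel)

print(np.allclose(K, K.T))                     # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10)) # eigenvalues >= 0 (PSD up to rounding)
```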

Regularization and non-separable case

To make the algorithm work on datasets that are not linearly separable, and to make it less sensitive to outliers, we add regularization, which also reduces overfitting.
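Concretely, the notes do this with $\ell_1$ regularization: slack variables $\xi_i$ let examples violate the margin, and the parameter $C$ controls how heavily violations are penalized:

```latex
\min_{w,\, b,\, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{s.t.}\quad y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, m
```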

SMO algorithm

Sequential Minimal Optimization, an efficient way of solving the dual problem arising from the derivation of the SVM.
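The sketch below shows only the heart of SMO in Python: holding all other $\alpha_k$ fixed, jointly re-optimize one pair $(\alpha_i, \alpha_j)$ subject to the dual's box and equality constraints. The pair-selection heuristics and the outer convergence loop are omitted, and the function and variable names (`smo_pair_update`, `kernel`) are illustrative:

```python
import numpy as np

def smo_pair_update(i, j, alpha, b, X, y, C, kernel, tol=1e-5):
    """One SMO step: re-optimize (alpha[i], alpha[j]) with all other alphas fixed."""
    if i == j:
        return alpha, b, False

    m = len(y)
    K = lambda a, c: kernel(X[a], X[c])
    # f(x) = sum_k alpha_k y_k K(x_k, x) + b, and E = f(x) - y
    f = lambda k: sum(alpha[t] * y[t] * K(t, k) for t in range(m)) + b
    E_i, E_j = f(i) - y[i], f(j) - y[j]

    # Feasible interval [L, H] for the new alpha[j], implied by
    # sum_k alpha_k y_k = 0 and 0 <= alpha_k <= C.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    if L >= H:
        return alpha, b, False

    # eta is the second derivative of the dual objective along the constraint line.
    eta = 2.0 * K(i, j) - K(i, i) - K(j, j)
    if eta >= 0:
        return alpha, b, False

    # Unconstrained optimum for alpha[j], then clip to [L, H].
    aj_new = float(np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H))
    if abs(aj_new - alpha[j]) < tol:
        return alpha, b, False
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)

    # Recompute the threshold b so the KKT conditions hold on the updated pair.
    b1 = b - E_i - y[i] * (ai_new - alpha[i]) * K(i, i) - y[j] * (aj_new - alpha[j]) * K(i, j)
    b2 = b - E_j - y[i] * (ai_new - alpha[i]) * K(i, j) - y[j] * (aj_new - alpha[j]) * K(j, j)
    if 0 < ai_new < C:
        b = b1
    elif 0 < aj_new < C:
        b = b2
    else:
        b = (b1 + b2) / 2.0

    alpha[i], alpha[j] = ai_new, aj_new
    return alpha, b, True
```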
