Open Notes: Machine Learning Fundamentals (4: SVM, KP, KSVM)

by Max Z. C. Li (843995168@qq.com)

based on the lecture notes of Prof. Kai-Wei Chang, UCLA Winter 2018 CM146 Intro. to M.L., with my marks and comments (//, ==>, words, etc.)

all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by CSDN.

original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."

SL,LC: Support Vector Machine (SVM)

Basic Idea:

try to maximize the margin of the separating hyperplane

the margin-perceptron variant lets us test out a chosen margin, but we want a method that finds the best margin automatically.

the benefits of a large margin:

also: a data-dependent VC dimension:

 

The Margin

recall that for classification we only care about the sign of the score, not its magnitude, so w can be rescaled freely:
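As a sketch (assuming labels y_i ∈ {−1, +1} and a hyperplane wᵀx + b = 0; the slides may fold the bias b into w), the geometric margin being maximized is

$$\gamma(w, b) = \min_i \frac{y_i\,(w^\top x_i + b)}{\|w\|}$$

Because only the sign of the score matters, w and b can be rescaled so that min_i y_i(wᵀx_i + b) = 1, which turns maximizing the margin into minimizing ‖w‖.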

 

Hard SVM

This is a constrained optimization problem. If the data is not linearly separable, there is no w that satisfies all the constraints.

No training error can be made, and all support vectors lie exactly on the margin boundary.
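For reference, a sketch of the standard hard-SVM formulation over m training examples (bias b written explicitly here):

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1,\quad i = 1, \dots, m$$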

 

Soft SVM

to allow such "break-ins", simply introduce slack variables ξ_i:

obviously, when the ξ_i are all 0 the method reduces to hard SVM (HSVM) again.

the new optimization problem is now:
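A sketch of the soft-SVM problem with slack variables ξ_i, where C > 0 controls how strongly violations are penalized:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0$$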

ξ_i directly measures how far "off" a point is ==> ξ_i = 0 means the point is on the correct side of the margin, 0 < ξ_i ≤ 1 means it is inside the margin but still correctly classified, and ξ_i > 1 means it is on the wrong side (misclassified).

we can eliminate the ξ_i and rewrite the condition as:
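At the optimum each slack variable takes its smallest feasible value, so it can be written in closed form (again with the bias shown explicitly):

$$\xi_i = \max\bigl(0,\ 1 - y_i\,(w^\top x_i + b)\bigr)$$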

// penalty for points inside the margin or misclassified ==> add C times the violation

this per-example penalty is the

Hinge Loss:

compare it to the 0/1 (binary) loss:

we see that the hinge loss indeed favors large-margin solutions: unlike the 0/1 loss, it also penalizes points that are classified correctly but fall inside the margin.
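For reference, the two losses on a single example, written in terms of the signed score z = y·wᵀx:

$$\ell_{\text{hinge}}(z) = \max(0,\ 1 - z), \qquad \ell_{0/1}(z) = \mathbf{1}[\,z \le 0\,]$$

The hinge loss upper-bounds the 0/1 loss and keeps penalizing a point until it is beyond the margin (z ≥ 1), which is exactly the large-margin preference.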

Soft SVM (SSVM) illustrates the principle of:

Regularized Risk Minimization

  • Define a regularization function that penalizes overly complex hypotheses. 
    • Capacity control gives better generalization.
  • Define a notion of “loss” over the training data as a function of a hypothesis.
  • Learning = find the hypothesis with the lowest [regularizer + loss on the training data] (sketched below).
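A generic sketch of this template, with a regularizer R, a loss L, and a trade-off constant C (the symbols here are illustrative):

$$\min_{w}\ R(w) + C\sum_{i=1}^{m} L\bigl(y_i,\ f(x_i; w)\bigr)$$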

 

SVM Objective Function

// obviously, as C → ∞ violations become infinitely costly and we recover hard SVM; a small C tolerates more violations
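Putting the regularizer and the hinge loss together gives the unconstrained soft-SVM objective (bias folded into w for brevity):

$$J(w) = \frac{1}{2}\,w^\top w + C\sum_{i=1}^{m}\max\bigl(0,\ 1 - y_i\,w^\top x_i\bigr)$$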

 

Training (by SVM Optimization)

Check Convexity

The objective function is CONVEX ==> all known convex optimization methods apply, but they are not necessarily practical or optimal here

 Gradient Descent vs. Stochastic GD

recall SGD:

while GD simply uses the entire training set {(x_i, y_i)} to calculate the gradient for each update.

[SGD has] Many more updates than gradient descent, but each individual update is less computationally expensive
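To make the contrast concrete, here is a minimal Python sketch; `grad(w, x, y)` is a hypothetical per-example gradient function (the names are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(w, X, Y, grad, lr=0.1, epochs=100):
    # GD: one update per pass, using the average gradient over the whole training set
    for _ in range(epochs):
        g = np.mean([grad(w, x, y) for x, y in zip(X, Y)], axis=0)
        w = w - lr * g
    return w

def stochastic_gradient_descent(w, X, Y, grad, lr=0.1, epochs=100):
    # SGD: one update per example -- many cheap, noisy steps per pass
    for _ in range(epochs):
        for x, y in zip(X, Y):
            w = w - lr * grad(w, x, y)
    return w
```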

Check Closed-Form by Derivative/Sub-derivative 

the hinge loss function is not differentiable (w.r.t. w) at the hinge point! ==> no gradient everywhere

==> check the sub-derivative (sub-gradient) instead

Interlude: Sub-Gradient

Generalization of gradients to non-differentiable functions.

  • Recall that for a convex function, every tangent line lies below the function
    • Informally, a sub-tangent at a point is any line that touches the function at that point and lies below it everywhere.
  • A sub-gradient is the slope of such a line
  • Formally, g is a sub-gradient of f at x if f(z) ≥ f(x) + gᵀ(z − x) for all z
  • e.g., for f(x) = |x|, every g ∈ [−1, 1] is a sub-gradient at x = 0

for the objective function:
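Assuming the objective J(w) written above, one valid choice of sub-gradient is

$$g(w) = w + C\sum_{i=1}^{m} g_i, \qquad g_i = \begin{cases} -\,y_i\, x_i & \text{if } y_i\, w^\top x_i < 1 \\ 0 & \text{otherwise} \end{cases}$$

At the kink y_i wᵀx_i = 1, both branches (and anything in between) are valid sub-gradients.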

Now we can do SGD with the sub-gradient (Stochastic Sub-gradient Descent):
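A minimal Python sketch of stochastic sub-gradient descent for the soft-SVM objective sketched above; the learning rate, the absence of a bias term, and the way the regularizer is split across examples are my implementation choices, not prescribed by the slides:

```python
import numpy as np

def svm_ssgd(X, Y, C=1.0, lr=0.01, epochs=50, seed=0):
    """Train a linear SVM (no bias term) by stochastic sub-gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x, y = X[i], Y[i]
            # per-example sub-gradient of (1/n) * 0.5*||w||^2 + C * max(0, 1 - y*w.x)
            if y * w.dot(x) < 1:
                g = w / n - C * y * x   # hinge term is active: include its sub-gradient
            else:
                g = w / n               # only the regularizer contributes
            w -= lr * g
    return w

# toy usage: two roughly separable 2-D blobs with labels in {-1, +1}
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
    Y = np.hstack([np.ones(50), -np.ones(50)])
    w = svm_ssgd(X, Y)
    print("training accuracy:", np.mean(np.sign(X @ w) == Y))
```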

 

Analysis (SVM vs. Perceptron)

recall that the SVM regularizer encodes a preference for the maximum margin; the margin perceptron only enforces some fixed margin, which is not necessarily the best one.

 

Dual Form of SVM

Let w be the minimizer of the SVM problem for some dataset with m examples: {(x_i, y_i)}

The weight vector is completely determined by the training examples whose 𝛼_i are nonzero; these examples are called the support vectors.
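Sketching with the bias folded into w, the minimizer can be expanded over the training examples as

$$w = \sum_{i=1}^{m} \alpha_i\, y_i\, x_i, \qquad \alpha_i \ge 0$$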

 

SL: Kernel SVM (12)

Basic Idea

what about non-linearly separable data ==> map the data to a higher dimension ==> what about mapping the data to an infinite-dimensional space?

===> Represent the model by the training samples ===> We don’t need to represent 𝑤 explicitly, but only need a way to compute 𝑤 ⋅ 𝑥 ==> the kernel method

instead of writing w explicitly, use the dual representation, where a_i is incremented only for the examples on which the perceptron makes a mistake:

using Kernel representation:
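With K(x_i, x) = φ(x_i)ᵀφ(x), the prediction never needs w or φ(x) explicitly:

$$\hat{y}(x) = \operatorname{sign}\bigl(w^\top \phi(x)\bigr) = \operatorname{sign}\Bigl(\sum_{j} \alpha_j\, y_j\, K(x_j, x)\Bigr)$$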

 

Apply the Dual Form of SVM

from the dual problem of the SVM:

In the optimum, the solutions of the primal and dual problem have the following relation:
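Sketching the same dual expansion as before, now in feature space:

$$w = \sum_{i=1}^{m} \alpha_i\, y_i\, \phi(x_i)$$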

If 𝛼_i = 0 ⇒ the training sample doesn’t affect the prediction;

𝛼_i > 0 ⇒ the sample is a support vector

 

Common Choice of Kernels

==> the kernel K(x_i, x_j) = phi(x_i)ᵀ phi(x_j) implicitly determines which mapping function phi of x_i you are using with that choice of kernel function.

polynomial examples:
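Two standard examples (the exact constants and degrees used in the slides may differ); the quadratic kernel makes the implied mapping explicit for x ∈ ℝ²:

$$K(x, z) = (x^\top z)^2 \ \Longleftrightarrow\ \phi(x) = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\bigr)$$

$$K(x, z) = (x^\top z + 1)^d \ \text{(polynomial)}, \qquad K(x, z) = \exp\!\bigl(-\|x - z\|^2 / 2\sigma^2\bigr) \ \text{(RBF / Gaussian)}$$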

changes to the algorithm:

  • ==> expand w with the dual form and compute the kernel instead of an explicit dot product;
  • ==> instead of updating w, update alpha_j (a minimal sketch follows below).
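A minimal Python sketch of the kernelized (dual) perceptron described by the two points above; the kernel choice and function names are illustrative, not from the slides:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Gaussian (RBF) kernel between two single examples
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_perceptron(X, Y, kernel=rbf_kernel, epochs=10):
    """Dual (kernelized) perceptron: keep per-example counters alpha instead of an explicit w."""
    n = X.shape[0]
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # score = sum_j alpha_j * y_j * K(x_j, x_i), i.e. w . phi(x_i) in dual form
            score = sum(alpha[j] * Y[j] * kernel(X[j], X[i]) for j in range(n))
            if Y[i] * score <= 0:      # mistake (or zero score):
                alpha[i] += 1          # update alpha_i instead of w
    return alpha

def kernel_predict(x, X, Y, alpha, kernel=rbf_kernel):
    score = sum(alpha[j] * Y[j] * kernel(X[j], x) for j in range(len(alpha)))
    return np.sign(score)
```

The same idea carries over to kernel SVM: training and prediction only ever touch the data through K(x_i, x_j), so the feature map phi never has to be computed explicitly.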