by Max Z. C. Li (843995168@qq.com)
based on the lecture notes of Prof. Kai-Wei Chang, UCLA, Winter 2018, CM 146 Intro. to Machine Learning, with my own marks and comments (//, ==>, words, etc.)
all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by csdn.
original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."
SL,LC: Support Vector Machine (SVM)
Basic Idea:
try to optimize the margin of the hyperplane
the margin-perceptron variant lets us try out different margins by hand, but we want a method that finds the best margin automatically.
the benefit of large margin:
also : Data dependent VC dimension:
The Margin
recall that we only care about the sign, not the magnitude:
Hard SVM
This is a constrained optimization problem. If the data is not separable, there is no w that satisfies all the constraints.
No training error is allowed; all support vectors lie exactly on the margin boundary.
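the optimization problem itself appears as an image in the slides; in standard form (with a bias term b) it reads:

```latex
\min_{w,\,b}\;\; \frac{1}{2}\|w\|^2
\qquad \text{s.t.}\qquad y_i\,(w^\top x_i + b) \,\ge\, 1,\quad i = 1,\dots,m
```

every constraint must hold exactly ==> hence "hard".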
Soft SVM
to allow such "break-ins" (margin violations), simply introduce slack variables ξ_i:
obviously, when the ξ_i are all 0 the method reduces to HSVM again.
the new optimization problem is now:
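reconstructed in standard form (the slide shows it as an image; b is the bias term):

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^2 \;+\; C\sum_{i=1}^{m}\xi_i
\qquad \text{s.t.}\qquad y_i\,(w^\top x_i + b) \,\ge\, 1 - \xi_i,\quad \xi_i \ge 0
```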
ξ_i directly measures how "off" a point is ==> ξ_i = 0 means the point is on the correct side of the margin, 0 < ξ_i <= 1 means it is inside the margin but still correctly classified, ξ_i > 1 means it is on the wrong side (misclassified).
we can eliminate the ξ_i and rewrite the condition as:
// penalty for points inside the margin or misclassified ==> add C times the violation ξ_i to the objective
this is a
Hinge Loss:
compare to binary loss:
we see that indeed the loss function favors the best margin.
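the comparison can be sketched in a few lines of numpy (function names are mine, not from the slides):

```python
import numpy as np

def hinge_loss(y, score):
    # zero only when the point is on the correct side WITH margin >= 1
    return np.maximum(0.0, 1.0 - y * score)

def binary_loss(y, score):
    # 0/1 loss: only the sign of the score matters
    return float(np.sign(score) != y)

# y = +1; scores at margin >= 1, inside the margin, and misclassified
for s in [2.0, 0.5, -0.5]:
    print(s, binary_loss(1.0, s), hinge_loss(1.0, s))
```

binary loss is blind to how confident a correct prediction is; hinge keeps pushing until the margin reaches 1, which is why it favors the max-margin separator.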
SSVM presents the principle of:
Regularized Risk Minimization
- Define a regularization function that penalizes over-complex hypothesis.
- Capacity control gives better generalization
- Define the notion of “loss” over the training data as a function of a hypothesis
- Learning = find the hypothesis that has lowest [Regularizer + loss on the training data]
SVM Objective Function
//as C → ∞, violations are penalized infinitely heavily ==> recovers hard SVM (C = 0 would ignore the training data entirely)
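a tiny sketch of evaluating this objective on made-up data (the points, w, and names are all mine, for illustration only):

```python
import numpy as np

def svm_objective(w, X, y, C):
    # J(w) = 0.5 * ||w||^2  +  C * sum_i max(0, 1 - y_i * (w . x_i))
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[1.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.5, 0.5])
# both points have margin >= 1 here, so only the regularizer contributes
print(svm_objective(w, X, y, C=1.0))  # 0.25
```

larger C weights the data term more heavily; in the limit C → ∞ any violation is forbidden and we are back to hard SVM.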
Training (by SVM Optimization)
Check Convexity
The objective function is CONVEX ==> all known convex optimization methods apply, but not necessarily practical or optimal here
Gradient Descent vs. Stochastic GD
recall SGD:
while GD simply uses the entire {(x,y)} to calculate the gradient and update.
[SGD has] Many more updates than gradient descent, but each individual update is less computationally expensive
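to make the contrast concrete, a least-squares toy (not the SVM loss; chosen only to show full-gradient vs. single-example updates):

```python
import numpy as np

# toy linear-regression data (my own made-up setup)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([1.0, -2.0])
y = X @ true_w

# GD: one update per pass, gradient averaged over ALL examples
w = np.zeros(2)
lr = 0.1
for _ in range(100):
    grad = X.T @ (X @ w - y) / len(X)
    w -= lr * grad

# SGD: 100 cheap updates, each from a single random example
w_sgd = np.zeros(2)
for _ in range(100):
    i = rng.integers(len(X))
    w_sgd -= lr * (X[i] @ w_sgd - y[i]) * X[i]
```

with the same number of gradient evaluations, SGD gets 100x more updates in; each one is noisier but far cheaper.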
Check Closed-Form by Derivative/Sub-derivative
the hinge loss function is not differentiable (w.r.t. w) at the kink where the margin equals 1! ==> no gradient there
==> check sub-derivative
Interlude: Sub-Gradient
Generalization of gradients to non-differentiable functions.
- Recall that every tangent lies below the function for convex functions
- Informally, a sub-tangent at a point is any line that lies below the function at that point.
- A sub-gradient is the slope of that line
- Formally, g is a subgradient of f at x if: f(z) >= f(x) + g^T (z - x) for all z
- e.g.
for the objective function:
Now we can do SGD with the sub-gradient (Stochastic Sub-gradient Descent):
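putting it together, a runnable sketch (learning rate, epochs, and the toy data are my own choices, not from the lecture):

```python
import numpy as np

def ssgd_svm(X, y, C=1.0, lr=0.01, epochs=50, seed=0):
    """Stochastic sub-gradient descent on
    J(w) = 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i))."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            # sub-gradient of the per-example piece of J
            # (the regularizer is split evenly over the m examples)
            g = w / m
            if y[i] * (X[i] @ w) < 1:   # hinge active at this example
                g = g - C * y[i] * X[i]
            w -= lr * g
    return w

# toy separable data, no bias term (hypothetical example)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = ssgd_svm(X, y)
```

where the hinge is inactive, the sub-gradient reduces to the regularizer term alone, which is exactly the sub-gradient picture above.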
Analysis (SVM vs. Perceptron)
recall that SVM's regularizer encodes a preference for the maximum margin; the margin perceptron only enforces some fixed margin, not necessarily the best one.
Dual Form of SVM
Let w be the minimizer of the SVM problem for some dataset with m examples: {(x_i, y_i)}
The weight vector is completely defined by training examples whose 𝛼_i are not zero; These examples are called the support vectors
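the dual problem itself is shown as an image in the slides; reconstructed in a standard bias-free form (so no Σ 𝛼_i y_i = 0 constraint; for hard SVM the upper bound C disappears):

```latex
\max_{\alpha}\;\; \sum_{i=1}^{m}\alpha_i
  \;-\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i\alpha_j\, y_i y_j\, x_i^\top x_j
\qquad \text{s.t.}\quad 0 \le \alpha_i \le C
```

with w = Σ_i 𝛼_i y_i x_i at the optimum.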
SL: Kernel SVM (12)
Basic Idea
what about non-linearly separable data ==> map the data to a higher-dimensional space ==> what about mapping data to an infinite-dimensional space?
===> represent the model by the training samples ===> we don't need to represent 𝑤 explicitly, but only need a way to compute 𝑤 * 𝑥 ==> the kernel method
instead of writing w explicitly, use the dual representation, where 𝛼_i is incremented (from 0) only on the examples that cause mistakes:
using Kernel representation:
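the kernel trick in one line (standard identity, restated since the slide uses a picture): with w = Σ_i 𝛼_i y_i 𝜙(x_i),

```latex
w^\top \phi(x) \;=\; \sum_i \alpha_i\, y_i\, \phi(x_i)^\top \phi(x)
\;=\; \sum_i \alpha_i\, y_i\, K(x_i, x)
```

so prediction only ever needs K, never 𝜙 itself.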
Apply the Dual Form of SVM
from
In the optimum, the solutions of the primal and dual problem have the following relation:
If 𝛼_i = 0 ⇒ the training sample doesn’t affect the prediction;
𝛼_i > 0 ⇒ support vectors
Common Choice of Kernels
==> K(x_i, x_j) = 𝜙(x_i) * 𝜙(x_j): the choice of kernel function implicitly determines which mapping 𝜙 of x_i you are using.
polynomial examples:
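for the degree-2 polynomial kernel K(x, z) = (x * z + 1)^2 in 2D, the implicit mapping can be written out and checked (the layout of 𝜙 is one standard choice; names are mine):

```python
import numpy as np

def poly_kernel(x, z, degree=2, c=1.0):
    # (x . z + c)^d : inner product in the expanded feature space,
    # computed without ever building phi explicitly
    return (np.dot(x, z) + c) ** degree

def phi(x):
    # explicit degree-2 feature map in 2D for c = 1
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z), phi(x) @ phi(z))  # equal up to float rounding
```

the kernel costs O(d) to evaluate; the explicit map already needs 6 features in 2D, and blows up combinatorially in higher dimensions.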
change to the algorithm:
- ==> expand w with dual form and calculate the kernel
- ==> instead of updating w, update alpha_j;
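those two changes give the kernel perceptron; a minimal sketch (toy XOR data and all names are mine, using the quadratic kernel as an example):

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    # dual perceptron: never touch w; on a mistake at example j,
    # just increment alpha_j (alpha_j counts mistakes on x_j)
    m = len(X)
    alpha = np.zeros(m)
    G = np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])
    for _ in range(epochs):
        for j in range(m):
            score = np.sum(alpha * y * G[:, j])  # w . phi(x_j), via K only
            if y[j] * score <= 0:                # mistake ==> update alpha_j
                alpha[j] += 1.0
    return alpha

K = lambda a, b: (np.dot(a, b) + 1.0) ** 2  # degree-2 polynomial kernel

# XOR-style data: not linearly separable in the original 2D space
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = kernel_perceptron(X, y, K)
preds = np.sign([np.sum(alpha * y * np.array([K(xi, x) for xi in X]))
                 for x in X])
```

the quadratic kernel separates XOR because its implicit feature space contains the product x1*x2, which takes opposite signs on the two classes.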