by Max Z. C. Li (843995168@qq.com)
based on the lecture notes of Prof. Kai-Wei Chang, UCLA, Winter 2018, CM 146 Intro. to Machine Learning, with my own marks and comments (//, ==>, words, etc.)
all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by csdn.
original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."
SL,LC: Support Vector Machine (SVM)
Basic Idea:
try to optimize the margin of the hyperplane
the margin-perceptron variant lets us try out different margins by hand, but we want a method that finds the best margin automatically.
the benefit of large margin:
also : Data dependent VC dimension:
The Margin
recall that we only care about the sign, not the magnitude:
Hard SVM
This is a constrained optimization problem. If the data is not separable, there is no w that satisfies all the constraints.
No training error is allowed; all support vectors lie exactly on the margin boundary.
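the optimization problem itself appears as an image in the slides; in standard form (with a bias term b) it reads:

```latex
\min_{w,\,b}\;\; \frac{1}{2}\|w\|^2
\qquad \text{s.t.}\qquad y_i\,(w^\top x_i + b) \,\ge\, 1,\quad i = 1,\dots,m
```

every constraint must hold exactly ==> hence "hard".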
Soft SVM
to allow such "break-ins" (margin violations), simply introduce slack variables ξ_i:
obviously, when the ξ_i are all 0 the method reduces to HSVM again.
the new optimization problem is now:
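reconstructed in standard form (the slide shows it as an image; b is the bias term):

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^2 \;+\; C\sum_{i=1}^{m}\xi_i
\qquad \text{s.t.}\qquad y_i\,(w^\top x_i + b) \,\ge\, 1 - \xi_i,\quad \xi_i \ge 0
```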
ξ_i directly measures how "off" a point is ==> ξ_i = 0 means the point is on the correct side of the margin, 0 < ξ_i <= 1 means it is inside the margin but still correctly classified, ξ_i > 1 means it is on the wrong side (misclassified).
we can eliminate the ξ_i and rewrite the condition as:
// penalty for points inside the margin or misclassified ==> add C times the violation ξ_i to the objective
this is a
Hinge Loss:
compare to binary loss:
we see that indeed the loss function favors the best margin.
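the comparison can be sketched in a few lines of numpy (function names are mine, not from the slides):

```python
import numpy as np

def hinge_loss(y, score):
    # zero only when the point is on the correct side WITH margin >= 1
    return np.maximum(0.0, 1.0 - y * score)

def binary_loss(y, score):
    # 0/1 loss: only the sign of the score matters
    return float(np.sign(score) != y)

# y = +1; scores at margin >= 1, inside the margin, and misclassified
for s in [2.0, 0.5, -0.5]:
    print(s, binary_loss(1.0, s), hinge_loss(1.0, s))
```

binary loss is blind to how confident a correct prediction is; hinge keeps pushing until the margin reaches 1, which is why it favors the max-margin separator.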
SSVM presents the principle of:
Regularized Risk Minimization
- Define a regularization function that penalizes over-complex hypothesis.
- Capacity control gives better generalization
- Define the notion of “loss” over the training data as a function of a hypothesis
- Learning = find the hypothesis that has lowest [Regularizer + loss on the training data]
SVM Objective Function
//as C → ∞, violations are penalized infinitely heavily ==> recovers hard SVM (C = 0 would ignore the training data entirely)
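a tiny sketch of evaluating this objective on made-up data (the points, w, and names are all mine, for illustration only):

```python
import numpy as np

def svm_objective(w, X, y, C):
    # J(w) = 0.5 * ||w||^2  +  C * sum_i max(0, 1 - y_i * (w . x_i))
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[1.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.5, 0.5])
# both points have margin >= 1 here, so only the regularizer contributes
print(svm_objective(w, X, y, C=1.0))  # 0.25
```

larger C weights the data term more heavily; in the limit C → ∞ any violation is forbidden and we are back to hard SVM.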
Training (by SVM Optimization)
Check Convexity
The objective function is CONVEX ==> all known convex optimization methods apply, but not necessarily practical or optimal here
Gradient Descent vs. Stochastic GD
recall SGD:
while GD simply uses the entire {(x,y)} to calculate the gradient and update.
[SGD has] Many more updates than gradient descent, but each individual update is less computationally expensive
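to make the contrast concrete, a least-squares toy (not the SVM loss; chosen only to show full-gradient vs. single-example updates):

```python
import numpy as np

# toy linear-regression data (my own made-up setup)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([1.0, -2.0])
y = X @ true_w

# GD: one update per pass, gradient averaged over ALL examples
w = np.zeros(2)
lr = 0.1
for _ in range(100):
    grad = X.T @ (X @ w - y) / len(X)
    w -= lr * grad

# SGD: 100 cheap updates, each from a single random example
w_sgd = np.zeros(2)
for _ in range(100):
    i = rng.integers(len(X))
    w_sgd -= lr * (X[i] @ w_sgd - y[i]) * X[i]
```

with the same number of gradient evaluations, SGD gets 100x more updates in; each one is noisier but far cheaper.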
Check Closed-Form by Derivative/Sub-derivative
the hinge loss function is not differentiable (w.r.t. w) at the kink where the margin equals 1! ==> no gradient there
==> check sub-derivative
Interlude: Sub-Gradient
Generalization of gradients to non-differentiable functions.
- Recall that every tangent lies below the function for convex functions
- Informally, a sub-tangent at a point is any line that lies below the function at that point.
- A sub-gradient is the slope of that line
- Formally, g is a subgradient of f at x if: f(z) >= f(x) + g^T (z - x) for all z
- e.g.
for the objective function:
Now we can do SGD with the sub-gradient (Stochastic Sub-gradient Descent):
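putting it together, a runnable sketch (learning rate, epochs, and the toy data are my own choices, not from the lecture):

```python
import numpy as np

def ssgd_svm(X, y, C=1.0, lr=0.01, epochs=50, seed=0):
    """Stochastic sub-gradient descent on
    J(w) = 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i))."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            # sub-gradient of the per-example piece of J
            # (the regularizer is split evenly over the m examples)
            g = w / m
            if y[i] * (X[i] @ w) < 1:   # hinge active at this example
                g = g - C * y[i] * X[i]
            w -= lr * g
    return w

# toy separable data, no bias term (hypothetical example)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = ssgd_svm(X, y)
```

where the hinge is inactive, the sub-gradient reduces to the regularizer term alone, which is exactly the sub-gradient picture above.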
Analysis (SVM vs. Perceptron)
recall that SVM's regularizer encodes a preference for the maximum margin; the margin perceptron only enforces some fixed margin, not necessarily the best one.
Dual Form of SVM
Let w be the minimizer of the SVM problem for some dataset with m examples: {(x_i, y_i)}
The weight vector is completely defined by training examples whose 𝛼_i are not zero; These examples are called the support vectors
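the dual problem itself is shown as an image in the slides; reconstructed in a standard bias-free form (so no Σ 𝛼_i y_i = 0 constraint; for hard SVM the upper bound C disappears):

```latex
\max_{\alpha}\;\; \sum_{i=1}^{m}\alpha_i
  \;-\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i\alpha_j\, y_i y_j\, x_i^\top x_j
\qquad \text{s.t.}\quad 0 \le \alpha_i \le C
```

with w = Σ_i 𝛼_i y_i x_i at the optimum.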
SL: Kernel SVM (12)
Basic Idea
what about non-linearly separable data ==> map the data to a higher-dimensional space ==> what about mapping data to an infinite-dimensional space?
===> represent the model by the training samples ===> we don't need to represent 𝑤 explicitly, but only need a way to compute 𝑤 * 𝑥 ==> the kernel method
instead of writing w explicitly, use the dual representation, where 𝛼_i is incremented (from 0) only on the examples that cause mistakes:
using Kernel representation:
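the kernel trick in one line (standard identity, restated since the slide uses a picture): with w = Σ_i 𝛼_i y_i 𝜙(x_i),

```latex
w^\top \phi(x) \;=\; \sum_i \alpha_i\, y_i\, \phi(x_i)^\top \phi(x)
\;=\; \sum_i \alpha_i\, y_i\, K(x_i, x)
```

so prediction only ever needs K, never 𝜙 itself.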
Apply the Dual Form of SVM
from
In the optimum, the solutions of the primal and dual problem have the following relation:
If 𝛼_i = 0 ⇒ the training sample doesn’t affect the prediction;
𝛼_i > 0 ⇒ support vectors
Common Choice of Kernels
==> K(x_i, x_j) = 𝜙(x_i) * 𝜙(x_j): the choice of kernel function implicitly determines which mapping 𝜙 of x_i you are using.
polynomial examples:
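for the degree-2 polynomial kernel K(x, z) = (x * z + 1)^2 in 2D, the implicit mapping can be written out and checked (the layout of 𝜙 is one standard choice; names are mine):

```python
import numpy as np

def poly_kernel(x, z, degree=2, c=1.0):
    # (x . z + c)^d : inner product in the expanded feature space,
    # computed without ever building phi explicitly
    return (np.dot(x, z) + c) ** degree

def phi(x):
    # explicit degree-2 feature map in 2D for c = 1
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z), phi(x) @ phi(z))  # equal up to float rounding
```

the kernel costs O(d) to evaluate; the explicit map already needs 6 features in 2D, and blows up combinatorially in higher dimensions.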
change to the algorithm:
- ==> expand w with dual form and calculate the kernel
- ==> instead of updating w, update alpha_j;
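those two changes give the kernel perceptron; a minimal sketch (toy XOR data and all names are mine, using the quadratic kernel as an example):

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    # dual perceptron: never touch w; on a mistake at example j,
    # just increment alpha_j (alpha_j counts mistakes on x_j)
    m = len(X)
    alpha = np.zeros(m)
    G = np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])
    for _ in range(epochs):
        for j in range(m):
            score = np.sum(alpha * y * G[:, j])  # w . phi(x_j), via K only
            if y[j] * score <= 0:                # mistake ==> update alpha_j
                alpha[j] += 1.0
    return alpha

K = lambda a, b: (np.dot(a, b) + 1.0) ** 2  # degree-2 polynomial kernel

# XOR-style data: not linearly separable in the original 2D space
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = kernel_perceptron(X, y, K)
preds = np.sign([np.sum(alpha * y * np.array([K(xi, x) for xi in X]))
                 for x in X])
```

the quadratic kernel separates XOR because its implicit feature space contains the product x1*x2, which takes opposite signs on the two classes.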