14:32 2014-10-09 Thursday
start Caltech machine learning course, video 14
Support Vector Machines
14:33 2014-10-09
SVM is arguably the most successful classification method
in machine learning
14:42 2014-10-09
it's a very, very neat piece of work in machine learning
14:43 2014-10-09
outline:
* maximizing the margin
* the solution
* nonlinear transforms
14:43 2014-10-09
it will be a constrained optimization method
14:44 2014-10-09
linear separation
14:44 2014-10-09
linearly separable data
14:44 2014-10-09
I can get different lines. Is there an advantage to
choosing one of the lines over the others?
14:45 2014-10-09
that's the margin you have
14:46 2014-10-09
margin of error
14:46 2014-10-09
if you look at this line, it does seem to have
a better margin.
14:46 2014-10-09
let me get the biggest possible margin
14:47 2014-10-09
which is the best line for classification?
14:47 2014-10-09
you will choose the fat margin
14:48 2014-10-09
let us ask 2 questions:
* why is a big margin better?
* can I solve for a w that maximizes the margin?
14:48 2014-10-09
all possible dichotomies
14:50 2014-10-09
growth function
14:50 2014-10-09
we know the growth function being big
is bad news for generalization
14:50 2014-10-09
dichotomies with fat margin
14:51 2014-10-09
I require that the margin be at least something
14:53 2014-10-09
effectively, by requiring the margin to be at least something,
I'm putting a restriction on the growth function
14:54 2014-10-09
fat margins imply fewer dichotomies
14:54 2014-10-09
fat dichotomies have a smaller growth function, a smaller
VC dimension than if I did not restrict them at all
14:55 2014-10-09
we will find out indeed when you have a bigger margin,
you'll achieve better out-of-sample performance
14:57 2014-10-09
fat margins are good
14:57 2014-10-09
find the w that not only classifies the points correctly,
but does so with the biggest possible margin.
14:57 2014-10-09
the margin is just the distance from the plane to the nearest point
14:58 2014-10-09
hyperplane that separates the points
14:58 2014-10-09
if I give you w & x, can you get a formula to plug into,
to give me the distance between the plane described by w
& the point xn?
15:00 2014-10-09
normalize w
15:01 2014-10-09
I have the plane, where the signal == 0; it doesn't touch
any points, since the points are linearly separable. When the
signal becomes positive, you're moving in one direction: you hit
the closest point, then more points, the interior points so to
speak. And when you go in the other direction, the signal is
negative, and you hit the other points
15:05 2014-10-09
pull out w0
15:07 2014-10-09
the plane is now:
w'x + b = 0 (no x0)
15:08 2014-10-09
computing the distance
15:08 2014-10-09
this is the geometry: I have a plane, and I have a point,
and I want to compute the distance.
15:09 2014-10-09
the vector w is orthogonal to the plane // normal vector
15:10 2014-10-09
now you can see the good old b dropped out
15:10 2014-10-09
x' - x'' // this must be orthogonal to the vector w
15:11 2014-10-09
this could be any 2 points on the plane, and the conclusion is that:
w is orthogonal to any vector on the plane
15:12 2014-10-09
now w has an interpretation // w is always orthogonal to the plane
15:13 2014-10-09
if you project this fellow onto this direction
15:14 2014-10-09
project vector a onto vector b
15:14 2014-10-09
unit vector
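to make that concrete, here is a minimal numpy sketch (my own
illustration, not code from the lecture): the distance from a point
xn to the plane w'x + b = 0 is |w'xn + b| / ||w||

    import numpy as np

    def distance_to_plane(w, b, xn):
        # project (xn - x) onto the unit normal w / ||w||, where x is
        # any point on the plane; b absorbs the w'x term since w'x = -b
        return abs(w @ xn + b) / np.linalg.norm(w)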
15:14 2014-10-09
we're going to find an equivalent problem that is
more friendly
15:17 2014-10-09
the optimization problem: maximize ... subject to ...
15:17 2014-10-09
does anybody see quadratic programming coming on
the horizon?
15:19 2014-10-09
when you solve it, you'll get the separating plane
with the best margin.
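for reference, the optimization problem from the lecture, in its
final friendly form (the margin comes out to 1/||w||):

    minimize    (1/2) w'w
    subject to  yn (w'xn + b) >= 1,   for n = 1, ..., N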
15:23 2014-10-09
constrained optimization // Lagrange multipliers
15:24 2014-10-09
but here the constraints are "inequality constraints"
instead of "equality constraints"
15:25 2014-10-09
inequality constraints => KKT
15:26 2014-10-09
it is good to look at that picture because
it will put the analysis here in perspective
15:27 2014-10-09
what are we optimizing? and what is the constraint?
15:29 2014-10-09
objective function, constraints
15:30 2014-10-09
pass it to a quadratic programming package
to get the solution
15:30 2014-10-09
and they become part of the objective function // constraints
15:32 2014-10-09
we're minimizing this with respect to what?
with respect to w & b
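written out, the Lagrangian from the lecture is:

    L(w, b, α) = (1/2) w'w - Σn αn [ yn (w'xn + b) - 1 ],   αn >= 0

    ∂L/∂w = 0  =>  w = Σn αn yn xn
    ∂L/∂b = 0  =>  Σn αn yn = 0

substituting these back in gives the dual problem, which is in
terms of the αs only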
15:34 2014-10-09
classification, segmentation
15:38 2014-10-09
this is a "KKT condition"
15:42 2014-10-09
the solution - quadratic programming
15:43 2014-10-09
convex optimization
15:48 2014-10-09
convex function
15:49 2014-10-09
just a word of warning before we go ...
15:49 2014-10-09
it's a practical consideration, but it's an
important consideration.
15:50 2014-10-09
flirting with danger
15:51 2014-10-09
QP hands us α // QP == Quadratic Programming
15:52 2014-10-09
so the solution is a vector of αs
15:53 2014-10-09
you get the αs, you plug them in, so you get the w
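a minimal sketch of that step (my own illustration: it uses scipy's
general-purpose SLSQP optimizer in place of a real QP package, and
X, y are placeholder data, assumed linearly separable):

    import numpy as np
    from scipy.optimize import minimize

    def svm_dual_alphas(X, y):
        # dual: maximize Σ αn - (1/2) α'Qα with Q[n,m] = yn ym xn'xm,
        # subject to αn >= 0 and Σn αn yn == 0; we minimize the negative
        N = len(y)
        Q = (y[:, None] * X) @ (y[:, None] * X).T
        objective = lambda a: 0.5 * a @ Q @ a - a.sum()
        constraint = {"type": "eq", "fun": lambda a: a @ y}
        result = minimize(objective, np.zeros(N),
                          bounds=[(0, None)] * N, constraints=[constraint])
        return result.x  # the vector of αs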
15:53 2014-10-09
so I would like to tell you a condition which is
very important, and it will be the key to finding the
support vectors: the KKT condition
15:55 2014-10-09
when you look at the vector, you just see a whole bunch
of αs that are 0
15:55 2014-10-09
for all the interior points, you're guaranteed that the
multiplier is zero
15:59 2014-10-09
we end up with no need for regularization
15:59 2014-10-09
αn > 0 => xn is a support vector
16:00 2014-10-09
now we come to the interesting definition: the αs are largely
zeros (the interior points); the most important points in the game
are the points that actually define the plane & the margin,
and these are the ones for which α is positive
16:03 2014-10-09
these are called support vectors. so I have N points,
I classify them & I get the maximum margin; because
it's a maximum margin, it touches on some of the +1 &
some of the -1 points; those points support the plane,
so to speak, and are called the "support vectors", and
the other guys are interior points
16:06 2014-10-09
α greater than zero identifies a support vector
16:07 2014-10-09
where are the support vectors in this picture?
they're the ones closest to the plane
16:08 2014-10-09
all of these guys here & here contribute nothing,
α == 0 in this case
16:08 2014-10-09
and the support vectors achieve the margin exactly;
they're the critical points
16:09 2014-10-09
this is how we get the αs back; they're the currency
we get back from the quadratic programming, and we plug
them in to get the w
16:11 2014-10-09
now I notice that many of the αs are zero, and α is only
positive for the support vectors,
16:12 2014-10-09
then I can say that I can sum this up over only the support
vectors (SV)
16:13 2014-10-09
seems only a minor technicality, but there is a very important
point in here
16:13 2014-10-09
the αs are the parameters of your model; when they're 0,
they don't count.
16:14 2014-10-09
w // weight vector
16:14 2014-10-09
now you realize why there might be a
generalization dividend here, because I end up
with fewer parameters
16:16 2014-10-09
solve for b using any SV (support vector)
16:16 2014-10-09
now you have w & b, and you're ready with your
classification line or hyperplane
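continuing the sketch above (the tolerance is my own choice, to
absorb numerical noise in the αs):

    def recover_w_b(X, y, alpha, tol=1e-6):
        sv = alpha > tol                 # support vectors: αn > 0
        w = (alpha[sv] * y[sv]) @ X[sv]  # w = Σ over SVs of αn yn xn
        b = y[sv][0] - w @ X[sv][0]      # any SV satisfies yn (w'xn + b) = 1
        return w, b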
16:17 2014-10-09
nonlinear transformation
16:17 2014-10-09
now let me close with the nonlinear transformation,
which will be a very short presentation that has
an enormous impact. so far we've been talking about a linear
boundary and about the linearly separable case
16:19 2014-10-09
instead of working in the x space, we work in the z space
16:19 2014-10-09
z instead of x
16:20 2014-10-09
you don't just get a generic separator, you are getting
the best separator according to SVM
16:23 2014-10-09
I'm getting the separating line or plane in the z space
16:24 2014-10-09
If I have a nonlinear transformation, do I still have SVs?
sure, you have "support vectors" in the z space:
you get the plane there, you get the margin, the margin
will touch some points, and these are your support vectors
by definition.
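as a toy illustration of "z instead of x" (my own example, not the
lecture's transform): run exactly the same machinery on z = Φ(x),
e.g. a 2nd-order polynomial transform in 2 dimensions, reusing the
earlier sketches:

    def phi(X):
        # (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

    # same solver as before, now in the z space:
    # alpha = svm_dual_alphas(phi(X), y)
    # w, b = recover_w_b(phi(X), y, alpha)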
16:28 2014-10-09
you can identify them even without looking geometrically
at the z space, because what are the support vectors?
I look at the αs I get, and the ones that are positive
correspond to support vectors
16:31 2014-10-09
this is what the boundaries look like
16:32 2014-10-09
now we have alarm bells, overfitting, overfitting....
16:32 2014-10-09
so if I look at what the support vectors are in the z space,
16:33 2014-10-09
"pre-images" of support vectors
16:33 2014-10-09
but you need to be careful, because the formal definition
is in the z space
16:34 2014-10-09
now the interesting aspect here is that, if this is true,
16:34 2014-10-09
that's remarkable, because I just went to a million-dimensional
space, w is a million-dimensional vector, and when I get the
solution, if I only have 4 support vectors,
the generalization behavior will go with 4 parameters
16:36 2014-10-09
so this looks like a sophisticated surface, but it's a
sophisticated surface in disguise: it's very carefully chosen.
there are lots of snake-like surfaces that go around and mess up
the generalization; this one will be the best of them, and you have
a handle on how good the generalization is just by counting the
number of support vectors
16:38 2014-10-09
the distances between the support vectors & the surface here
are not the margin; the margins are in the linear z space etc.
16:39 2014-10-09
these guys are likely to be close to the surface,
but the distances wouldn't be the same, and there are
perhaps other vectors that look like they should be support
vectors, but they're not! what makes them support vectors is
that they achieve the margin in the z space;
this is just an illustrative version of it.
16:40 2014-10-09
now we come to the generalization result that
makes this fly!
16:41 2014-10-09
generalization result: Eout <= #of SV's / (N-1)
16:42 2014-10-09
that will give you an upper bound on Eout
16:42 2014-10-09
the result is very close to this.
16:42 2014-10-09
in order to get the exact result, you take expected values
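the exact statement (the lecture's version, as I recall) takes
expected values on both sides, a leave-one-out style bound:

    E[Eout] <= E[# of SV's] / (N - 1)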
16:44 2014-10-09
how many support vectors do you get?
16:44 2014-10-09
you ask me after you're done with this...,
how many support vectors did you get? if you have
a thousand data points, and you get 10 support vectors,
you're in pretty good shape, regardless of the dimensionality
of the z space you visited
16:46 2014-10-09
now can I go to a space of any dimension, and it will be fine?
16:46 2014-10-09
but this is the main theoretical result that makes
people use support vectors
16:47 2014-10-09
kernel methods, which are an extension of SVM
--------------------------------------------------------
16:53 2014-10-09
nonlinear transformation => kernel method ???
16:53 2014-10-09
in almost all the cases, the vast majority of terms
will be zero
16:54 2014-10-09
this is not an accurate statement, but it's a reasonable
statement
16:55 2014-10-09
the number of nonzero parameters, which corresponds to the
VC dimension, happens to be the # of SVs by definition
16:56 2014-10-09
so the number of SVs gives you a bound on Eout
16:56 2014-10-09
some of the aggregation methods, like boosting, also have a
margin, and then you can compare them
16:59 2014-10-09
basically the support vectors are there to "support
the separating plane"
16:59 2014-10-09
they're the ones that dictate the margin