start review Caltech machine learning,
video 09, the Linear Model II
9:08 2014-09-28
generalization analysis
9:16 2014-09-28
linear classification:
perceptron algorithm, pocket algorithm
9:17 2014-09-28
linear surface => quadratic surface
9:20 2014-09-28
think of the VC inequality as a promise
of providing you with a warranty; in order
for the warranty to be valid, you cannot
look at the data before you choose the model,
because that will forfeit the warranty.
9:43 2014-09-28
if I do the analysis correctly, I'm going to charge you
not the VC dimension of the final guy you got; I'm going
to charge you the VC dimension of the entire hypothesis space
that you explored in your mind in order to get there.
9:44 2014-09-28
you have acted as a learning algorithm unknowingly.
9:45 2014-09-28
now you look at the data, and you realize that some of
the coefficients are zero:
I don't need this, I don't need this, ...
you did it very quickly in your mind.
9:46 2014-09-28
so the hypothesis you learned is really the result of hierarchical learning.
9:46 2014-09-28
first you learned, then you passed it to the algorithm
to learn.
9:47 2014-09-28
the entire hypothesis set is what you start with.
9:47 2014-09-28
Lesson learned:
looking at the data before choosing the model is dangerous.
9:48 2014-09-28
this can be hazardous, not to your health,
but to your "generalization health"
9:48 2014-09-28
if you look at the data, we say that you did the learning.
9:49 2014-09-28
this is a manifestation of the biggest trap that
practitioners fall into.
9:50 2014-09-28
when you do machine learning, I want you to learn from
the data, and choosing the model is very tricky.
9:51 2014-09-28
let me look at the data, and just pick something suitable.
9:52 2014-09-28
you're allowed to do that, I'm not saying that this is
against the law. you can do it, just charge accordingly.
9:52 2014-09-28
remember, if you do this and end up with a small hypothesis
set with a small VC dimension, you have already forfeited
the warranty given to you by the VC inequality.
9:54 2014-09-28
you snoop into the data
9:54 2014-09-28
data snooping
9:54 2014-09-28
you look at the data before you choose the model.
9:54 2014-09-28
but there are other forms that are so subtle that
even a smart person may fall into them.
9:55 2014-09-28
I'm not dismissing choosing a model.
there will be ways to choose the model; when I talk
about validation, model selection will be the order
of the day.
9:56 2014-09-28
it's a model selection that does not contaminate
the data
9:57 2014-09-28
the data here is used for choosing the model, and therefore
it's contaminated; it's no longer trusted to reflect the
real performance, because you already used it in learning.
9:58 2014-09-28
the linear model is an economy car;
the nonlinear transformation gives you a truck.
you see the truck is very strong: I can go to a high-dimensional
space, I can have a very sophisticated surface. then I warned
you: be careful when you drive the truck.
9:59 2014-09-28
logistic regression: outline
* the model
* error measure
* learning algorithm
10:00 2014-09-28
this will be very representative of what
machine learning is at large.
10:02 2014-09-28
the learning algorithm we use here will
be the same learning algorithm we'll use in
neural networks next time.
10:02 2014-09-28
A third linear model, all based on the linear signal s = w'x:
* linear classification: h(x) = sign(s)
* linear regression: h(x) = s
* logistic regression: h(x) = θ(s)
10:04 2014-09-28
so let's put it into a picture: here are your
inputs x0, x1, ..., xd; x1~xd are your genuine inputs,
and x0 takes care of the threshold.
10:06 2014-09-28
weights go with these guys, and then they're summed
to give me s; then one linear model or the
other will do different things to s. the 1st model
takes s and passes it through a hard threshold, in order to get plus
or minus one // +1 or -1
10:07 2014-09-28
what did we do to the signal in the case of linear regression?
10:08 2014-09-28
now when you go to the 3rd guy, which is called logistic
regression: h(x) = θ(s)
10:09 2014-09-28
take s and apply a nonlinearity to it.
10:10 2014-09-28
it's not as harsh as that nonlinearity (the hard threshold);
it's somewhere between that and leaving it alone (identity).
10:11 2014-09-28
and it looks like this.
10:11 2014-09-28
this (0) is the least I can report, this (1) is the
most I can report; it's bounded like this.
10:12 2014-09-28
much like the hard threshold, except for the softening of it.
10:12 2014-09-28
but it's real-valued: I can return any real value
between these two (0 and 1).
10:13 2014-09-28
so it has something of linear regression in it,
10:14 2014-09-28
and the main utility of logistic regression is
that the output is going to be interpreted as a probability
10:14 2014-09-28
and that will cover a lot of problems where we
want to estimate the probability of something.
10:15 2014-09-28
so let's be specific; let's look at the logistic function θ
10:16 2014-09-28
θ(s) = e^s / (1 + e^s) // s is the linear sum, the signal
10:16 2014-09-28
it can serve as a probability, because it goes
from 0 to 1.
10:16 2014-09-28
and if you look at the signal: if the signal is very,
very negative, you get close to probability 0.
10:17 2014-09-28
if the signal is very very positive, you get close to 1.
10:17 2014-09-28
and at signal zero, the probability is one half.
10:17 2014-09-28
so the signal corresponds to the level of certainty
about something.
10:18 2014-09-28
if I have a huge signal, I'm pretty sure that
it will eventually happen.
10:19 2014-09-28
now there are many formulas I could use to give
you this shape; this shape is what I'm interested in.
10:19 2014-09-28
and I'm going to choose a particular formula.
10:20 2014-09-28
it will be a very friendly formula.
10:21 2014-09-28
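
as a quick sanity check of this friendly formula, a minimal sketch (my own plain Python, not from the lecture):

    import math

    def theta(s):
        # logistic function: theta(s) = e^s / (1 + e^s) = 1 / (1 + e^-s)
        return 1.0 / (1.0 + math.exp(-s))

    print(theta(-10.0))  # very negative signal -> close to 0
    print(theta(0.0))    # zero signal -> exactly 0.5
    print(theta(10.0))   # very positive signal -> close to 1
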
so this thing is called a soft threshold, for obvious
reasons; the hard version would decide this or that.
10:22 2014-09-28
so this softens it, and gives you the reliability of the
decision.
10:22 2014-09-28
so think about the credit card application:
instead of deciding whether the customer is good or bad,
which is binary classification,
10:23 2014-09-28
what is the probability that this customer will
be good or bad?
what is the probability of default?
10:25 2014-09-28
let the bank decide what to do according to this probability.
10:25 2014-09-28
the soft threshold reflects uncertainty.
seldom do we know the binary classification with certainty,
and it may be more informative to give you the uncertainty
as part of the deal,
reflected in this soft threshold.
10:27 2014-09-28
it's also called a sigmoid, for a simple reason:
it looks like a flattened-out 's'
10:28 2014-09-28
sigmoid function, or softened threshold
10:28 2014-09-28
when we get to neural networks, there will be other
closely related formulas; you can invent other formulas
if you will.
10:28 2014-09-28
so this is the logistic function, and here is the model,
so we know what the model does.
10:29 2014-09-28
the main idea is the probability interpretation.
10:29 2014-09-28
so we have the model: h(x) = θ(s)
the model is: you take the linear signal s,
pass it through the logistic function, and that
will be the value of your hypothesis at the point
x that gave rise to this signal.
10:32 2014-09-28
so we think there is a probability sitting
out there generating the examples; say, a probability
of default based on credit information.
10:33 2014-09-28
example: the unfortunate prediction of heart attacks
10:34 2014-09-28
predicting the occurrence of heart attacks based on a number of factors
10:34 2014-09-28
the kind of input you'll have is:
input x: cholesterol level, age, weight, etc.
10:35 2014-09-28
probability of heart attack
10:40 2014-09-28
what is the probability that you will get a
heart attack within the next 5 months?
10:40 2014-09-28
the signal s = w'x,
it's a linear sum of these guys
10:41 2014-09-28
2 things to observe:
* this remains linear
* you can think of this as a "risk score" // credit score
10:42 2014-09-28
you just give each input an importance weight, and sum
them up.
10:42 2014-09-28
although it gets translated into a probability to make
it meaningful.
10:53 2014-09-28
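
to make the risk score concrete, a tiny illustration with made-up weights and inputs (my own numbers, purely hypothetical):

    import math

    w = [-8.0, 0.02, 0.05, 0.01]   # w[0] multiplies the constant x0 = 1 (threshold term)
    x = [1.0, 190.0, 55.0, 80.0]   # x0, cholesterol level, age, weight

    s = sum(wi * xi for wi, xi in zip(w, x))   # linear signal s = w'x
    p = 1.0 / (1.0 + math.exp(-s))             # theta(s) translates the score into a probability
    print(p)
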
I'd like to make the point that this is a genuine probability.
10:54 2014-09-28
you have a hypothesis whose output goes from zero to one,
and I'm interpreting it as a probability.
but you could just think of it as a function between 0 & 1
10:55 2014-09-28
the main point here is that the output of logistic regression
is treated genuinely as a probability, even during learning.
10:56 2014-09-28
this is nontrivial, because the data given to you does not
tell you the probability.
10:57 2014-09-28
Data (x, y) with binary y
10:58 2014-09-28
I don't get to say: here is the 1st patient, here is
the data, and here is the probability; this is supervised
learning, and I have to give you a label.
10:59 2014-09-28
"the probability of getting a heart attack within 12
months is 25 percent": how the hell could I know
that?
11:00 2014-09-28
I can only observe that someone got a heart attack or
didn't get a heart attack. while that is affected by
the probability, you don't get access to the
probability itself.
11:02 2014-09-28
I give you a binary output which is affected by
the probability,
11:03 2014-09-28
so this is a noisy case;
these examples are generated by a noisy target. let's write
down the noisy target in order to understand where these examples
come from.
11:05 2014-09-28
Data(x, y) with binary y, generated by a noisy target.
11:05 2014-09-28
P(y|x) = f(x) for y = +1; 1 - f(x) for y = -1
11:05 2014-09-28
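
to see where such examples come from, a tiny sketch (my own, pretending we knew the true f(x)) of sampling binary labels from the noisy target:

    import random

    def sample_label(f_x):
        # noisy target: P(y = +1 | x) = f(x), P(y = -1 | x) = 1 - f(x)
        return 1 if random.random() < f_x else -1

    # e.g. a patient whose true heart-attack probability is 0.25
    print([sample_label(0.25) for _ in range(10)])  # mostly -1, sometimes +1;
                                                    # the probability itself is never observed
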
this is generated by the target that I want to learn.
11:06 2014-09-28
you want to learn a final hypothesis, which
is called g(x), and which happens to have the form of
logistic regression:
g(x) = θ(w'x)
the claim you're going to end up with is that
g(x) ≈ f(x)
11:07 2014-09-28
you're trying to make it as true as possible
according to some error measure we have.
11:11 2014-09-28
what is under your control is the parameters: w
// weight
11:11 2014-09-28
the question now becomes:
how do I choose the weights
such that the logistic regression hypothesis reflects
the target function, knowing that the target function
is the way the examples were generated?
11:13 2014-09-28
so let's talk about the error measure,
11:25 2014-09-28
Error measure
11:25 2014-09-28
it's a very popular error measure
11:26 2014-09-28
we have the following plausible error measure,
which is based on likelihood,
11:27 2014-09-28
likelihood is a very established notion in statistics
not without controversy, but widely applied.
11:28 2014-09-28
I'm going to grade different hypotheses according to
the likelihood that they're actually the target
that generated the data.
11:29 2014-09-28
so I can use this to build a comparative statement: this
hypothesis is more plausible than that one,
because the data becomes more likely under the scenario of
this hypothesis, rather than that hypothesis, being the real
target function.
11:30 2014-09-28
so this is the idea: you ask, how likely is it to get y from x
if h == f?
11:31 2014-09-28
what you'd like to ask is: what is the most probable
hypothesis given the data?
but here you ask: what is the most probable data given the
hypothesis, which is backwards.
11:33 2014-09-28
this is never a completely clean thing,
but we will sort of swallow that because it looks
rather reasonable.
11:35 2014-09-28
under the assumption that h == f, how likely
to get y from x?
11:36 2014-09-28
so let's use this to derive a full-fledged
version of the error measure.
11:37 2014-09-28
it's already crying for a simplification,
and the simplification is this:
P(y|x) = θ(yw'x) // using θ(-s) = 1 - θ(s)
11:45 2014-09-28
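
filling in the derivation the notes jump over (the standard logistic regression argument, reconstructed in LaTeX):

    % likelihood of an i.i.d. data set (x_1, y_1), ..., (x_N, y_N) if h = f:
    \prod_{n=1}^{N} P(y_n \mid x_n) = \prod_{n=1}^{N} \theta(y_n w^\top x_n)
    % maximizing this = minimizing -(1/N) times its log;
    % with \theta(s) = e^s / (1 + e^s), \ln(1/\theta(s)) = \ln(1 + e^{-s}), giving:
    E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^\top x_n}\right)
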
Maximizing the likelihood can be
transformed into minimizing an error measure.
11:47 2014-09-28
we're maximizing the likelihood of the hypothesis
given the data set that we're given.
11:48 2014-09-28
what is the probability of the data set, under the
assumption that the hypothesis is indeed the target?
11:49 2014-09-28
maximizing with respect to what? the parameters // weights
11:50 2014-09-28
one final thing, can I do this?
11:51 2014-09-28
all you do is instead of maximizing, you minimize
11:52 2014-09-28
Ok, we're cool, so this is the problem.
11:52 2014-09-28
a very sophisticated problem, and we end up with something
which is rather suspiciously familiar.
11:55 2014-09-28
something that involves the value of the example (xn, yn) &
the parameters I'm trying to learn. // weight
11:55 2014-09-28
I'd like to reduce this further
11:55 2014-09-28
SGD == Stochastic Gradient Descent
11:56 2014-09-28
I'm going to officially declare it the
in-sample error of logistic regression.
// Ein(w)
11:57 2014-09-28
so I minimize it, it's legitimate
11:57 2014-09-28
Ein(w) = (1/N) Σn ln(1 + e^(-yn w'xn)) // in-sample error, the error measure
11:58 2014-09-28
// e(h(xn), yn)
I'm going to call it the error measure between
my hypothesis (which depends on w) applied to xn, and
the value you give me as the label for that example,
which is yn;
that is how we define the error measure on individual points.
12:00 2014-09-28
label
12:00 2014-09-28
and under that, maximizing the likelihood is like
minimizing the in-sample error.
12:02 2014-09-28
there is an interesting interpretation here,
w'xn, this is what we call a risk score,
12:03 2014-09-28
let's see agreement or disagreement, and how they
affect the error measure
12:03 2014-09-28
now if the signal is very positive // w'xn
and this guy (yn) is plus one (+1) // unfortunately, you got a heart attack
agreement => the contribution to the error measure is small
12:06 2014-09-28
disagreement => error is huge
12:06 2014-09-28
this will be an error measure that we're trying to minimize
12:06 2014-09-28
it's called "cross-entropy" error
12:07 2014-09-28
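
a minimal sketch of that cross-entropy error (my own plain Python, following the formula above):

    import math

    def cross_entropy_error(w, X, y):
        # Ein(w) = (1/N) * sum over n of ln(1 + exp(-yn * w'xn))
        N = len(X)
        total = 0.0
        for xn, yn in zip(X, y):
            s = sum(wi * xi for wi, xi in zip(w, xn))   # signal w'xn
            total += math.log(1.0 + math.exp(-yn * s))  # small on agreement, huge on disagreement
        return total / N
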
now we have defined the model and we have defined
the error measure; the remaining order of business is the
learning algorithm
12:08 2014-09-28
remember, in linear regression we also had an error function
12:09 2014-09-28
to minimize the linear regression error
=> pseudo-inverse // normal equation
// projection onto the column space C(A)
12:10 2014-09-28
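
as a reminder, a sketch of that one-step solution (my own toy example, using numpy's pinv):

    import numpy as np

    # one-step learning for linear regression: w = pseudo-inverse(X) y
    X = np.array([[1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 5.0]])   # rows are examples; first column is x0 = 1
    y = np.array([1.0, 2.0, 4.0])

    w = np.linalg.pinv(X) @ y    # equals (X'X)^-1 X'y when X'X is invertible
    print(w)
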
but here we're out of luck: you cannot find
a closed-form solution
12:11 2014-09-28
in the absence of a closed-form solution, we
usually go for an iterative solution.
12:12 2014-09-28
we just improve, improve, ...
until finally we get a good solution.
12:12 2014-09-28
this is not a foreign concept to us, this is
what we do in perceptrons.
12:12 2014-09-28
what we're going to do here is based on calculus;
the minimization method we're going to use
can be applied to any error measure, even nonlinear ones,
with just a little smoothness assumed.
12:14 2014-09-28
iterative method: gradient descent
12:14 2014-09-28
a function that goes like this (a single valley) is called convex,
and it goes with "convex optimization"
12:14 2014-09-28
very simple, because wherever you start, you'll
get to the valley.
12:15 2014-09-28
imagine the most sophisticated nonlinear surface;
then, depending on where you start, you slide
down,
12:16 2014-09-28
error measure for neural networks
12:16 2014-09-28
statistical inference
12:16 2014-09-28
so what do you do in gradient descent?
* a general method for nonlinear optimization
you start at a point w(0),
then you take a step, trying to make an improvement
with that step,
12:17 2014-09-28
and the step is: take a step along the steepest slope
12:18 2014-09-28
the steepest slope is not an easy notion to see in a
2-dimensional space: it's not just left or right, there are
too many directions.
12:18 2014-09-28
let's do the following: say I'm in 3D space,
in this room, and I have a very nonlinear surface going
around, up & down, up & down ...
12:19 2014-09-28
I'm going to assume one thing: that the surface is twice differentiable;
that is what you need to invoke gradient descent.
12:21 2014-09-28
you don't have a bird's-eye view, you only have local information
around you. so the best thing to imagine is that you're sitting
on the surface, you close your eyes, and all you do
is feel around you and decide that this is a more promising
direction than that. that's all you do in one step; then you go
to the new point, and repeat, repeat ...
12:23 2014-09-28
until you get to the minimum
12:23 2014-09-28
these are the iterative methods you're going to use.
12:23 2014-09-28
we look at a fixed step size
12:24 2014-09-28
I'm going to do local approximations based on calculus
(Taylor series), and I know this approximation will be good
if the step size is not that big.
if I move far, the higher-order terms kick in, and I'm not sure
the conclusions I draw will still apply
12:24 2014-09-28
I'm moving along a unit vector v hat
12:26 2014-09-28
and I'm going to modulate the amount of the move by
a step size which I'm going to call η
12:27 2014-09-28
so OK, this is the amount of the move; I've already
decided on the size, but I don't know which direction
to go.
12:28 2014-09-28
w(1) = w(0) + ηv̂
so under this condition, you're trying to derive:
what is v hat?
12:29 2014-09-28
so let's actually try to solve for it.
12:29 2014-09-28
so we're really talking about the change in the error
produced by the step.
12:30 2014-09-28
ΔEin // change in in-sample error, one step
12:30 2014-09-28
what I want is for this guy (ΔEin) to be
negative, and as negative as possible.
12:31 2014-09-28
ΔEin = Ein(w(1)) - Ein(w(0)) // by the proper choice of w(1)
12:32 2014-09-28
ΔEin = Ein(w(0) + ηv) - Ein(w(0))
12:32 2014-09-28
using the Taylor series expansion to first order.
12:33 2014-09-28
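
the step the slide derives, reconstructed in LaTeX (the standard first-order argument):

    % first-order Taylor expansion of the change in error:
    \Delta E_{\text{in}} = \eta \, \nabla E_{\text{in}}(w(0))^\top \hat{v} + O(\eta^2)
                        \ge -\eta \, \| \nabla E_{\text{in}}(w(0)) \|
    % with equality when \hat{v} points opposite the gradient:
    \hat{v} = - \frac{\nabla E_{\text{in}}(w(0))}{\| \nabla E_{\text{in}}(w(0)) \|}
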
conjugate gradient??
12:33 2014-09-28
so now you can see why it's called gradient descent:
you descend along the gradient of your error surface
12:36 2014-09-28
with a tiny η, it will take me forever to get there.
12:36 2014-09-28
but with a large η, the linear approximation may not apply,
so there is a compromise
12:37 2014-09-28
so if you look at it, the best compromise is: initially have a
large η, and just be more careful when you get close to the minimum
12:39 2014-09-28
it's not a mathematical formula, it's an observation about the surface.
12:40 2014-09-28
the idea: have η increase with the slope
12:40 2014-09-28
easy implementation:
instead of taking a fixed-size step in the chosen direction,
I now make η proportional to the size of the
gradient: the step is bigger when the slope is bigger.
12:41 2014-09-28
now it's not a fixed step size anymore; it's a fixed
learning rate // the update becomes w(t+1) = w(t) - η∇Ein(w(t))
12:42 2014-09-28
learning rate
12:43 2014-09-28
summary of the logistic regression algorithm:
just use "gradient descent" to update the weights w
12:44 2014-09-28
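
putting the pieces together, a compact sketch (my own numpy code, not the lecture's) of batch gradient descent for logistic regression; the gradient of the cross-entropy error works out to grad Ein(w) = -(1/N) Σn yn xn / (1 + e^(yn w'xn)):

    import numpy as np

    def train_logistic(X, y, eta=0.1, iters=1000):
        # X: N x d inputs (first column x0 = 1); y: N labels in {-1, +1}
        N, d = X.shape
        w = np.zeros(d)                    # start at w(0) = 0
        for _ in range(iters):
            s = X @ w                      # all signals w'xn at once
            # grad Ein = -(1/N) * sum of yn*xn / (1 + exp(yn*sn))
            grad = -(y[:, None] * X / (1.0 + np.exp(y * s))[:, None]).mean(axis=0)
            w -= eta * grad                # fixed-learning-rate step
        return w

    # toy usage with labels in {-1, +1}
    X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
    y = np.array([-1.0, -1.0, 1.0])
    print(train_logistic(X, y))
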
summary of linear model:
* perceptron // linear classification // accept or deny
* linear regression // credit line
* logistic regression // probability of default
12:45 2014-09-28
if you apply each of these to "credit analysis", what type
of thing do you implement?
12:45 2014-09-28
if you use logistic regression, you just determine
the probability of default, then let the bank decide
what to do.
12:47 2014-09-28
so that was from the application domain; now let's look
from the tools' point of view
12:48 2014-09-28
they have different error measures,
perceptron: binary classification error // PLA, Pocket
linear regression: squared error // Pseudo-inverse
logistic regression: cross-entropy error // Gradient descent
12:49 2014-09-28
consider linear regression; that is the easiest:
you have the pseudo-inverse, and you have one-step learning
------------------------------------------------------------------