start review Caltech machine learning,
video 09, the Linear Model II
9:08 2014-09-28
generalization analysis
9:16 2014-09-28
linear classification:
perceptron algorithm, pocket algorithm
9:17 2014-09-28
linear surface => quadratic surface
9:20 2014-09-28
think of the VC inequality as a promise
of providing you with a warranty; in order
for the warranty to be valid, you cannot
look at the data before you choose the model,
because that will forfeit the warranty.
9:43 2014-09-28
if I do the analysis correctly, I'm going to charge you
not the VC dimension of the final guy you got; I'm going
to charge you the VC dimension of the entire hypothesis space
that you explored in your mind in order to get there.
9:44 2014-09-28
you have acted as a learning algorithm unknowingly.
9:45 2014-09-28
now you look at the data, and you realize that some of
the coefficients are zero:
I don't need this, I don't need this, ...
you did it very quickly in your mind.
9:46 2014-09-28
so the hypothesis you learned is really the result of hierarchical learning.
9:46 2014-09-28
first you learned, then you passed it to the algorithm
to learn.
9:47 2014-09-28
the entire hypothesis set is what you start with.
9:47 2014-09-28
Lesson learned:
looking at the data before choosing the model is dangerous.
9:48 2014-09-28
this can be hazardous, not to your health,
but to your "generalization health"
9:48 2014-09-28
if you look at the data, we say that you did the learning.
9:49 2014-09-28
this is a manifestation of the biggest trap that
practitioners fall into.
9:50 2014-09-28
when you do machine learning, I want you to learn from
the data, and choosing the model is very tricky.
9:51 2014-09-28
let me look at the data, and just pick something suitable.
9:52 2014-09-28
you're allowed to do that, I'm not saying that this is
against the law. you can do it, just charge accordingly.
9:52 2014-09-28
remember, if you do this and end up with a small hypothesis
set with a small VC dimension, you have already forfeited
the warranty given to you by the VC inequality.
9:54 2014-09-28
you snoop into the data
9:54 2014-09-28
data snooping
9:54 2014-09-28
you look at the data before you choose the model.
9:54 2014-09-28
but there are other forms that are so subtle that
even a smart person may fall into them.
9:55 2014-09-28
I'm not dismissing choosing a model.
there will be ways to choose the model; when I talk
about validation, model selection will be the order
of the day.
9:56 2014-09-28
it's a model selection that does not contaminate
the data
9:57 2014-09-28
the data here is used for choosing the model, and therefore
it's contaminated; it's no longer trusted to reflect the
real performance, because you already used it in learning.
9:58 2014-09-28
the linear model is an economy car;
the nonlinear transformation gives you a truck.
you see the truck is very strong: I can go to a high-dimensional
space, I can have a very sophisticated surface. then I warned
you: be careful when you drive the truck.
9:59 2014-09-28
logistic regression: outline
* the model
* error measure
* learning algorithm
10:00 2014-09-28
this will be very representative of what
machine learning is at large.
10:02 2014-09-28
the learning algorithm we use here will
be the same learning algorithm we'll use in
neural networks next time.
10:02 2014-09-28
A third linear model, all based on the linear signal s = w'x:
* linear classification: h(x) = sign(s)
* linear regression: h(x) = s
* logistic regression: h(x) = θ(s)
10:04 2014-09-28
so let's put it into a picture: here are your
inputs x0, x1, ..., xd; x1~xd are your genuine inputs,
and x0 takes care of the threshold.
10:06 2014-09-28
weights go with these guys, and then they're summed
to give me s; then one linear model or the
other will do different things to s. the 1st model
takes s and passes it through a hard threshold, in order to get plus
or minus one // +1 or -1
10:07 2014-09-28
what did we do to the signal in the case of linear regression?
10:08 2014-09-28
now when you go to the 3rd guy, which is called logistic
regression: h(x) = θ(s)
10:09 2014-09-28
take s and apply a nonlinearity to it.
10:10 2014-09-28
it's not as harsh as that nonlinearity (the hard threshold);
it's somewhere between that and leaving it alone (identity).
10:11 2014-09-28
and it looks like this.
10:11 2014-09-28
this (0) is the least I can report, this (1) is the
most I can report; it's bounded like this.
10:12 2014-09-28
much like the hard threshold, except for the softening of it.
10:12 2014-09-28
but it's real-valued: I can return any real value
between these two (0 and 1).
10:13 2014-09-28
so it has something of linear regression in it,
10:14 2014-09-28
and the main utility of logistic regression is
that the output is going to be interpreted as a probability
10:14 2014-09-28
and that will cover a lot of problems where we
want to estimate the probability of something.
10:15 2014-09-28
so let's be specific; let's look at the logistic function θ
10:16 2014-09-28
θ(s) = e^s / (1 + e^s) // s is the linear sum, the signal
10:16 2014-09-28
it can serve as a probability, because it goes
from 0 to 1.
10:16 2014-09-28
and if you look at the signal: if the signal is very,
very negative, you get close to probability 0.
10:17 2014-09-28
if the signal is very very positive, you get close to 1.
10:17 2014-09-28
and at signal zero, the probability is one half.
10:17 2014-09-28
so the signal corresponds to the level of certainty
about something.
10:18 2014-09-28
if I have a huge signal, I'm pretty sure that
it will eventually happen.
10:19 2014-09-28
now there are many formulas I could use to give
you this shape; this shape is what I'm interested in.
10:19 2014-09-28
and I'm going to choose a particular formula.
10:20 2014-09-28
it will be a very friendly formula.
10:21 2014-09-28
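
as a quick sanity check of this friendly formula, a minimal sketch (my own plain Python, not from the lecture):

    import math

    def theta(s):
        # logistic function: theta(s) = e^s / (1 + e^s) = 1 / (1 + e^-s)
        return 1.0 / (1.0 + math.exp(-s))

    print(theta(-10.0))  # very negative signal -> close to 0
    print(theta(0.0))    # zero signal -> exactly 0.5
    print(theta(10.0))   # very positive signal -> close to 1
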
so this thing is called a soft threshold, for obvious
reasons; the hard version would decide this or that.
10:22 2014-09-28
so this softens it, and gives you the reliability of the
decision.
10:22 2014-09-28
so think about the credit card application:
instead of deciding whether the customer is good or bad,
which is binary classification,
10:23 2014-09-28
what is the probability that this customer will
be good or bad?
what is the probability of default?
10:25 2014-09-28
let the bank decide what to do according to this probability.
10:25 2014-09-28
the soft threshold reflects uncertainty.
seldom do we know the binary classification with certainty,
and it may be more informative to give you the uncertainty
as part of the deal,
reflected in this soft threshold.
10:27 2014-09-28
it's also called a sigmoid, for a simple reason:
it looks like a flattened-out 's'
10:28 2014-09-28
sigmoid function, or softened threshold
10:28 2014-09-28
when we get to neural networks, there will be other
closely related formulas; you can invent other formulas
if you will.
10:28 2014-09-28
so this is the logistic function, and here is the model,
so we know what the model does.
10:29 2014-09-28
the main idea is the probability interpretation.
10:29 2014-09-28
so we have the model: h(x) = θ(s)
the model is: you take the linear signal s,
pass it through the logistic function, and that
will be the value of your hypothesis at the point
x that gave rise to this signal.
10:32 2014-09-28
so we think there is a probability sitting
out there generating the examples; say, a probability
of default based on credit information.
10:33 2014-09-28
example: the unfortunate prediction of heart attacks
10:34 2014-09-28
predicting the occurrence of heart attacks based on a number of factors
10:34 2014-09-28
the kind of input you'll have is:
input x: cholesterol level, age, weight, etc.
10:35 2014-09-28
probability of heart attack
10:40 2014-09-28
what is the probability that you will get a
heart attack within the next 5 months?
10:40 2014-09-28
the signal s = w'x,
it's a linear sum of these guys
10:41 2014-09-28
2 things to observe:
* this remains linear
* you can think of this as a "risk score" // credit score
10:42 2014-09-28
you just give each input an importance weight, and sum
them up.
10:42 2014-09-28
although it gets translated into a probability to make
it meaningful.
10:53 2014-09-28
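
to make the risk score concrete, a tiny illustration with made-up weights and inputs (my own numbers, purely hypothetical):

    import math

    w = [-8.0, 0.02, 0.05, 0.01]   # w[0] multiplies the constant x0 = 1 (threshold term)
    x = [1.0, 190.0, 55.0, 80.0]   # x0, cholesterol level, age, weight

    s = sum(wi * xi for wi, xi in zip(w, x))   # linear signal s = w'x
    p = 1.0 / (1.0 + math.exp(-s))             # theta(s) translates the score into a probability
    print(p)
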
I'd like to make the point that this is a genuine probability.
10:54 2014-09-28
you have a hypothesis whose output goes from zero to one,
and I'm interpreting it as a probability.
but you could just think of it as a function between 0 & 1
10:55 2014-09-28
the main point here is that the output of logistic regression
is treated genuinely as a probability, even during learning.
10:56 2014-09-28
this is nontrivial, because the data given to you does not
tell you the probability.
10:57 2014-09-28
Data (x, y) with binary y
10:58 2014-09-28
I don't get to say: here is the 1st patient, here is
the data, and here is the probability; this is supervised
learning, and I have to give you a label.
10:59 2014-09-28
"the probability of getting a heart attack within 12
months is 25 percent": how the hell could I know
that?
11:00 2014-09-28
I can only observe that someone got a heart attack or
didn't get a heart attack. while that is affected by
the probability, you don't get access to the
probability itself.
11:02 2014-09-28
I give you a binary output which is affected by
the probability,
11:03 2014-09-28
so this is a noisy case;
these examples are generated by a noisy target. let's write
down the noisy target in order to understand where these examples
come from.
11:05 2014-09-28
Data(x, y) with binary y, generated by a noisy target.
11:05 2014-09-28
P(y|x) = f(x) for y = +1; 1 - f(x) for y = -1
11:05 2014-09-28
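
to see where such examples come from, a tiny sketch (my own, pretending we knew the true f(x)) of sampling binary labels from the noisy target:

    import random

    def sample_label(f_x):
        # noisy target: P(y = +1 | x) = f(x), P(y = -1 | x) = 1 - f(x)
        return 1 if random.random() < f_x else -1

    # e.g. a patient whose true heart-attack probability is 0.25
    print([sample_label(0.25) for _ in range(10)])  # mostly -1, sometimes +1;
                                                    # the probability itself is never observed
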
this is generated by the target that I want to learn.
11:06 2014-09-28
you want to learn a final hypothesis, which
is called g(x), and which happens to have the form of
logistic regression:
g(x) = θ(w'x)
the claim you're going to end up with is that
g(x) ≈ f(x)
11:07 2014-09-28
you're trying to make it as true as possible
according to some error measure we have.
11:11 2014-09-28
what is under your control is the parameters: w
// weight
11:11 2014-09-28
the question now becomes:
how do I choose the weights
such that the logistic regression hypothesis reflects
the target function, knowing that the target function
is the way the examples were generated?
11:13 2014-09-28
so let's talk about the error measure,
11:25 2014-09-28
Error measure
11:25 2014-09-28
it's a very popular error measure
11:26 2014-09-28
we have the following plausible error measure,
which is based on likelihood,
11:27 2014-09-28
likelihood is a very established notion in statistics
not without controversy, but widely applied.
11:28 2014-09-28
I'm going to grade different hypotheses according to
the likelihood that they're actually the target
that generated the data.
11:29 2014-09-28
so I can use this to build a comparative statement: this
hypothesis is more plausible than that one,
because the data becomes more likely under the scenario of
this hypothesis, rather than that hypothesis, being the real
target function.
11:30 2014-09-28
so this is the idea: you ask, how likely is it to get y from x
if h == f?
11:31 2014-09-28
what you'd like to ask is: what is the most probable
hypothesis given the data?
but here you ask: what is the most probable data given the
hypothesis, which is backwards.
11:33 2014-09-28
this is never a completely clean thing,
but we will sort of swallow that because it looks
rather reasonable.
11:35 2014-09-28
under the assumption that h == f, how likely
to get y from x?
11:36 2014-09-28
so let's use this to derive a full-fledged
version of the error measure.
11:37 2014-09-28
it's already crying for a simplification,
and the simplification is this:
P(y|x) = θ(yw'x) // using θ(-s) = 1 - θ(s)
11:45 2014-09-28
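
filling in the derivation the notes jump over (the standard logistic regression argument, reconstructed in LaTeX):

    % likelihood of an i.i.d. data set (x_1, y_1), ..., (x_N, y_N) if h = f:
    \prod_{n=1}^{N} P(y_n \mid x_n) = \prod_{n=1}^{N} \theta(y_n w^\top x_n)
    % maximizing this = minimizing -(1/N) times its log;
    % with \theta(s) = e^s / (1 + e^s), \ln(1/\theta(s)) = \ln(1 + e^{-s}), giving:
    E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^\top x_n}\right)
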
Maximizing the likelihood can be
transformed into minimizing an error measure.
11:47 2014-09-28
we're maximizing the likelihood of the hypothesis
given the data set that we're given.
11:48 2014-09-28
what is the probability of the data set, under the
assumption that the hypothesis is indeed the target?
11:49 2014-09-28
maximizing with respect to what? the parameters // weights
11:50 2014-09-28
one final thing, can I do this?
11:51 2014-09-28
all you do is instead of maximizing, you minimize
11:52 2014-09-28
Ok, we're cool, so this is the problem.
11:52 2014-09-28
a very sophisticated problem, and we end up with something
which is rather suspiciously familiar.
11:55 2014-09-28
something that involves the value of the example (xn, yn) &
the parameters I'm trying to learn. // weight
11:55 2014-09-28
I'd like to reduce this further
11:55 2014-09-28
SGD == Stochastic Gradient Descent
11:56 2014-09-28
I'm going to officially declare it the
in-sample error of logistic regression.
// Ein(w)
11:57 2014-09-28
so I minimize it, it's legitimate
11:57 2014-09-28
Ein(w) = (1/N) Σn ln(1 + e^(-yn w'xn)) // in-sample error, the error measure
11:58 2014-09-28
// e(h(xn), yn)
I'm going to call it the error measure between
my hypothesis (which depends on w) applied to xn, and
the value you give me as the label for that example,
which is yn;
that is how we define the error measure on individual points.
12:00 2014-09-28
label
12:00 2014-09-28
and under that, maximizing the likelihood is like
minimizing the in-sample error.
12:02 2014-09-28
there is an interesting interpretation here,
w'xn, this is what we call a risk score,
12:03 2014-09-28
let's see agreement or disagreement, and how they
affect the error measure
12:03 2014-09-28
now if the signal is very positive // w'xn
and this guy (yn) is plus one (+1) // unfortunately, you got a heart attack
agreement => the contribution to the error measure is small
12:06 2014-09-28
disagreement => error is huge
12:06 2014-09-28
this will be an error measure that we're trying to minimize
12:06 2014-09-28
it's called "cross-entropy" error
12:07 2014-09-28
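
a minimal sketch of that cross-entropy error (my own plain Python, following the formula above):

    import math

    def cross_entropy_error(w, X, y):
        # Ein(w) = (1/N) * sum over n of ln(1 + exp(-yn * w'xn))
        N = len(X)
        total = 0.0
        for xn, yn in zip(X, y):
            s = sum(wi * xi for wi, xi in zip(w, xn))   # signal w'xn
            total += math.log(1.0 + math.exp(-yn * s))  # small on agreement, huge on disagreement
        return total / N
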
now we have defined the model and we have defined
the error measure; the remaining order of business is the
learning algorithm
12:08 2014-09-28
remember, in linear regression we also had an error function
12:09 2014-09-28
to minimize the linear regression error
=> pseudo-inverse // normal equation
// projection onto the column space C(A)
12:10 2014-09-28
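
as a reminder, a sketch of that one-step solution (my own toy example, using numpy's pinv):

    import numpy as np

    # one-step learning for linear regression: w = pseudo-inverse(X) y
    X = np.array([[1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 5.0]])   # rows are examples; first column is x0 = 1
    y = np.array([1.0, 2.0, 4.0])

    w = np.linalg.pinv(X) @ y    # equals (X'X)^-1 X'y when X'X is invertible
    print(w)
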
but here we're out of luck: you cannot find
a closed-form solution
12:11 2014-09-28
in the absence of a closed-form solution, we
usually go for an iterative solution.
12:12 2014-09-28
we just improve, improve, ...
until finally we get a good solution.
12:12 2014-09-28
this is not a foreign concept to us, this is
what we do in perceptrons.
12:12 2014-09-28
what we're going to do here is based on calculus;
the minimization method we're going to use
can be applied to any error measure, even nonlinear ones,
with just a little smoothness assumed.
12:14 2014-09-28
iterative method: gradient descent
12:14 2014-09-28
a function that goes like this (a single valley) is called convex,
and it goes with "convex optimization"
12:14 2014-09-28
very simple, because wherever you start, you'll
get to the valley.
12:15 2014-09-28
imagine the most sophisticated nonlinear surface;
then, depending on where you start, you slide
down,
12:16 2014-09-28
error measure for neural networks
12:16 2014-09-28
statistical inference
12:16 2014-09-28
so what do you do in gradient descent?
* a general method for nonlinear optimization
you start at a point w(0),
then you take a step, trying to make an improvement
with that step,
12:17 2014-09-28
and the step is: take a step along the steepest slope
12:18 2014-09-28
the steepest slope is not an easy notion to see in a
2-dimensional space: it's not just left or right, there are
too many directions.
12:18 2014-09-28
let's do the following: say I'm in 3D space,
in this room, and I have a very nonlinear surface going
around, up & down, up & down ...
12:19 2014-09-28
I'm going to assume one thing: that the surface is twice differentiable;
that is what you need to invoke gradient descent.
12:21 2014-09-28
you don't have a bird's-eye view, you only have local information
around you. so the best thing to imagine is that you're sitting
on the surface, you close your eyes, and all you do
is feel around you and decide that this is a more promising
direction than that. that's all you do in one step; then you go
to the new point, and repeat, repeat ...
12:23 2014-09-28
until you get to the minimum
12:23 2014-09-28
these are the iterative methods you're going to use.
12:23 2014-09-28
we look at a fixed step size
12:24 2014-09-28
I'm going to do local approximations based on calculus
(Taylor series), and I know this approximation will be good
if the step size is not that big.
if I move far, the higher-order terms kick in, and I'm not sure
the conclusions I draw will still apply
12:24 2014-09-28
I'm moving along a unit vector v hat
12:26 2014-09-28
and I'm going to modulate the amount of the move by
a step size which I'm going to call η
12:27 2014-09-28
so OK, this is the amount of the move; I've already
decided on the size, but I don't know which direction
to go.
12:28 2014-09-28
w(1) = w(0) + ηv̂
so under this condition, you're trying to derive:
what is v hat?
12:29 2014-09-28
so let's actually try to solve for it.
12:29 2014-09-28
so we're really talking about the change in the error
produced by the step.
12:30 2014-09-28
ΔEin // change in in-sample error, one step
12:30 2014-09-28
what I want is for this guy (ΔEin) to be
negative, and as negative as possible.
12:31 2014-09-28
ΔEin = Ein(w(1)) - Ein(w(0)) // by the proper choice of w(1)
12:32 2014-09-28
ΔEin = Ein(w(0) + ηv) - Ein(w(0))
12:32 2014-09-28
using the Taylor series expansion to first order.
12:33 2014-09-28
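
the step the slide derives, reconstructed in LaTeX (the standard first-order argument):

    % first-order Taylor expansion of the change in error:
    \Delta E_{\text{in}} = \eta \, \nabla E_{\text{in}}(w(0))^\top \hat{v} + O(\eta^2)
                        \ge -\eta \, \| \nabla E_{\text{in}}(w(0)) \|
    % with equality when \hat{v} points opposite the gradient:
    \hat{v} = - \frac{\nabla E_{\text{in}}(w(0))}{\| \nabla E_{\text{in}}(w(0)) \|}
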
conjugate gradient??
12:33 2014-09-28
so now you can see why it's called gradient descent:
you descend along the gradient of your error surface
12:36 2014-09-28
with a tiny η, it will take me forever to get there.
12:36 2014-09-28
but with a large η, the linear approximation may not apply,
so there is a compromise
12:37 2014-09-28
so if you look at it, the best compromise is: initially have a
large η, and just be more careful when you get close to the minimum
12:39 2014-09-28
it's not a mathematical formula, it's an observation about the surface.
12:40 2014-09-28
the idea: have η increase with the slope
12:40 2014-09-28
easy implementation:
instead of taking a fixed-size step in the chosen direction,
I now make η proportional to the size of the
gradient: the step is bigger when the slope is bigger.
12:41 2014-09-28
now it's not a fixed step size anymore; it's a fixed
learning rate // the update becomes w(t+1) = w(t) - η∇Ein(w(t))
12:42 2014-09-28
learning rate
12:43 2014-09-28
summary of the logistic regression algorithm:
just use "gradient descent" to update the weights w
12:44 2014-09-28
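
putting the pieces together, a compact sketch (my own numpy code, not the lecture's) of batch gradient descent for logistic regression; the gradient of the cross-entropy error works out to grad Ein(w) = -(1/N) Σn yn xn / (1 + e^(yn w'xn)):

    import numpy as np

    def train_logistic(X, y, eta=0.1, iters=1000):
        # X: N x d inputs (first column x0 = 1); y: N labels in {-1, +1}
        N, d = X.shape
        w = np.zeros(d)                    # start at w(0) = 0
        for _ in range(iters):
            s = X @ w                      # all signals w'xn at once
            # grad Ein = -(1/N) * sum of yn*xn / (1 + exp(yn*sn))
            grad = -(y[:, None] * X / (1.0 + np.exp(y * s))[:, None]).mean(axis=0)
            w -= eta * grad                # fixed-learning-rate step
        return w

    # toy usage with labels in {-1, +1}
    X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
    y = np.array([-1.0, -1.0, 1.0])
    print(train_logistic(X, y))
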
summary of linear model:
* perceptron // linear classification // accept or deny
* linear regression // credit line
* logistic regression // probability of default
12:45 2014-09-28
if you apply each of these to "credit analysis", what type
of thing do you implement?
12:45 2014-09-28
if you use logistic regression, you just determine
the probability of default, then let the bank decide
what to do.
12:47 2014-09-28
so that was from the application domain; now let's look
from the tools' point of view
12:48 2014-09-28
they have different error measures,
perceptron: binary classification error // PLA, Pocket
linear regression: squared error // Pseudo-inverse
logistic regression: cross-entropy error // Gradient descent
12:49 2014-09-28
consider linear regression; that is the easiest:
you have the pseudo-inverse, and you have one-step learning
------------------------------------------------------------------