8:39 2014-09-27
start CalTech machine learning,
video 9, the Linear Model II
8:40 2014-09-27
Bias-Variance decomposition of the
out-of-sample error
8:41 2014-09-27
* linear classification
* linear regression
* logistic regression
8:54 2014-09-27
the tradeoff between approximation & generalization
8:55 2014-09-27
the generalization ability of linear classification
8:55 2014-09-27
nonlinear transformation
8:57 2014-09-27
feature space
8:57 2014-09-27
linear surface => quadratic surface
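// my own sketch (not from the lecture slides): the quadratic feature transform,
// so a hyperplane in z-space becomes a quadratic surface in x-space
import numpy as np

def phi(x):
    # z = (1, x1, x2, x1^2, x1*x2, x2^2): quadratic feature space
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

# a linear classifier in z-space, e.g. the circle x1^2 + x2^2 = 1 in x-space
w_tilde = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
print(np.sign(w_tilde @ phi((0.5, 0.5))))   # inside the circle  -> -1.0
print(np.sign(w_tilde @ phi((1.5, 0.0))))   # outside the circle -> +1.0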
8:59 2014-09-27
almost separable:
this guy is erroneously classified.
9:06 2014-09-27
the lesson learned from this:
looking at the data before choosing the model
can be hazardous to your Eout health, not your physical
health but your generalization health.
9:15 2014-09-27
once you look at the data, we say that you have already done learning
9:17 2014-09-27
VC dimension of the hypotheses set
9:17 2014-09-27
this is the manifestation of the biggest
trap that practitioners fall into.
9:18 2014-09-27
when you go into machine learning, learning
from the data, choosing the model is very tricky
9:19 2014-09-27
it's very tempting, let me just look at the data,
and pick something suitable
9:20 2014-09-27
it's not against the law, you can do it,
but just charge accordingly.
9:20 2014-09-27
if you look at the data before choosing
your model, you have already forfeited the warranty
that is given by the VC inequality.
9:22 2014-09-27
this is the manifestation of basically snooping,
you snoop into the data in a way that is not allowed.
9:22 2014-09-27
data snooping
9:22 2014-09-27
when you do this, bad things happen.
9:23 2014-09-27
validation, model selection
9:24 2014-09-27
it will be a legitimate way of selecting a model,
a model selection that does not contaminate
the data.
9:25 2014-09-27
it's no longer trusted to reflect the real performance
because you have already used it in learning
9:26 2014-09-27
the linear model is an economy car; the nonlinear model
gives you a truck.
9:28 2014-09-27
logistic regression
9:28 2014-09-27
the model: what is the hypothesis set?
9:28 2014-09-27
soft threshold
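// a minimal sketch of the logistic hypothesis h(x) = theta(w.x), where theta
// is the soft threshold (the logistic function); the weights below are made up
import numpy as np

def theta(s):
    # soft threshold: e^s / (1 + e^s)
    return 1.0 / (1.0 + np.exp(-s))

def h(w, x):
    # output in (0, 1), interpreted as a probability
    return theta(np.dot(w, x))

w = np.array([0.1, -0.4, 0.8])   # hypothetical weights
x = np.array([1.0, 2.0, 0.5])    # x0 = 1 is the bias coordinate
print(h(w, x))                   # ~0.43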
9:36 2014-09-27
there is a probability sitting there generating
examples.
9:37 2014-09-27
credit score, risk score
9:41 2014-09-27
this is supervised learning, I have to give you tags.
9:44 2014-09-27
error measure based on likelihood
9:51 2014-09-27
the data is generated by this target function
9:52 2014-09-27
if that probability is very small, then your
assumption must be poor.
9:52 2014-09-27
and if that probability is high, then your assumption
has more plausibility.
9:52 2014-09-27
so I can use this as a comparative way to say
that this hypothesis is more plausible than that one
9:53 2014-09-27
what is the probability of generating this data
if your assumption is true?
// result => causal ???
9:54 2014-09-27
what is the most probable hypothesis given the data?
what is the probability of the data given the hypothesis?
9:57 2014-09-27
prior
9:57 2014-09-27
if I choose a hypothesis under which having the
data is very plausible, it looks like this hypothesis
is very likely, hence the name likelihood
9:59 2014-09-27
what is the likelihood of this whole data set?
10:06 2014-09-27
maximizing the likelihood => minimizing the error measure
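// filling in the step as I recall it: with h(x) = theta(w'x) and
// theta(-s) = 1 - theta(s), the likelihood of one example is
// P(y_n | x_n) = theta(y_n w'x_n), so
\max_{w} \prod_{n=1}^{N} \theta\left(y_n\, w^{\mathsf{T}} x_n\right)
\;\Longleftrightarrow\;
\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n\, w^{\mathsf{T}} x_n}\right)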
10:08 2014-09-27
we're maximizing the likelihood of this hypothesis
under this data set.
10:12 2014-09-27
cross-entropy error
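// a hedged sketch of the resulting in-sample error (names are mine):
// E_in(w) = (1/N) sum_n ln(1 + exp(-y_n w.x_n))
import numpy as np

def cross_entropy_error(w, X, y):
    # X: N x d data matrix (first column all ones), y: labels in {-1, +1}
    return np.mean(np.log(1.0 + np.exp(-y * (X @ w))))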
10:19 2014-09-27
learning algorithm
10:19 2014-09-27
How to minimize Ein?
10:20 2014-09-27
linear regression => logistic regression
10:20 2014-09-27
iterative solution, closed-form solution
10:21 2014-09-27
iterative method: gradient descent
10:22 2014-09-27
convex optimization
10:24 2014-09-27
you're sitting on the surface, then you close
your eyes, and all you do is feel around you,
and then decide that this direction is more promising
than that one; that's all you do in one step.
10:28 2014-09-27
when you get to the new point, repeat, repeat, ...
10:29 2014-09-27
until you get to the minimum.
10:29 2014-09-27
that's all there is to the iterative method you're going to use.
10:29 2014-09-27
fixed-step size
10:30 2014-09-27
Iterative method: gradient descent
General method for nonlinear optimization,
start at w(0); take a step along steepest slope
fixed step size.
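// a generic sketch under these assumptions (grad_E is a placeholder for
// whatever gradient you can compute; eta is the fixed step size)
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, max_iters=1000):
    w = np.array(w0, dtype=float)
    for _ in range(max_iters):
        w = w - eta * grad_E(w)   # step along the negative gradient
    return w

# toy usage: minimize ||w||^2 (gradient 2w); the minimum is the origin
print(gradient_descent(lambda w: 2 * w, [3.0, -2.0]))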
10:30 2014-09-27
under this setup, you're going to derive
what v hat is
10:34 2014-09-27
gradient descent
10:34 2014-09-27
how do I choose the direction in order to
make this as negative as possible?
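// the answer derived in the lecture (standard steepest descent):
// move opposite the gradient, normalized to unit length
\hat{v} \;=\; -\,\frac{\nabla E_{\text{in}}\!\left(w(0)\right)}{\left\lVert \nabla E_{\text{in}}\!\left(w(0)\right) \right\rVert}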
10:38 2014-09-27
Fixed-size step?
10:44 2014-09-27
logistic regression algorithm
// using gradient descent
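// putting it together, a hedged sketch of the logistic regression algorithm;
// the gradient of the cross-entropy error is
//   grad E_in(w) = -(1/N) sum_n y_n x_n / (1 + exp(y_n w.x_n))
import numpy as np

def logistic_regression(X, y, eta=0.1, max_iters=1000):
    # X: N x d (first column all ones), y: labels in {-1, +1}
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iters):
        grad = -(X.T @ (y / (1.0 + np.exp(y * (X @ w))))) / N
        w = w - eta * grad   # fixed-step gradient descent on the cross-entropy error
    return w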
10:50 2014-09-27
summary of linear model:
* perceptron // linear classification
* linear regression
* logistic regression
10:52 2014-09-27
Apply to credit analysis
* perceptron => Approve or Deny => binary classification error
(PLA, Pocket)
* linear regression => Amount of Credit => squared error
(Pseudo-inverse)
* logistic regression => Probability of Default => cross-entropy error
(Gradient descent)
10:53 2014-09-27
I will stop here, and then we'll start after a short break.
10:57 2014-09-27
let's start the Q & A
10:57 2014-09-27
there is the question of "learning rate"
10:58 2014-09-27
there are other questions about "initialization"
10:58 2014-09-27
so let's set up a target error: if I don't
get to the target error, I won't stop.
11:00 2014-09-27
local minimum, global minimum
11:01 2014-09-27
termination is tricky, a combination of criteria
is the best way.
11:05 2014-09-27
in many situations you just do gradient descent
in a simple way & get a very good result.
11:08 2014-09-27
you're applying the algorithm faithfully, and ...
11:09 2014-09-27
from a practical point of view, start from different
initialization points; each of them will go to its own
local minimum.
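// a sketch of that practical recipe (assumes some error E and gradient grad_E
// you can evaluate): run from several random initializations and keep the
// weights with the lowest error
import numpy as np

def best_of_restarts(E, grad_E, dim, restarts=10, eta=0.1, max_iters=1000):
    rng = np.random.default_rng(0)
    best_w, best_err = None, np.inf
    for _ in range(restarts):
        w = rng.normal(size=dim)          # random initialization point
        for _ in range(max_iters):
            w = w - eta * grad_E(w)       # plain gradient descent
        if E(w) < best_err:
            best_w, best_err = w, E(w)
    return best_w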
11:10 2014-09-27
ordinarily it will give you a good local minimum,
but getting the global minimum is NP-hard.
11:12 2014-09-27
for entropy, you get a function based on the probability.
11:15 2014-09-27
because you will be charged for that.
11:36 2014-09-27
because it uses CPU cycles but does not improve much
11:36 2014-09-27
Neural Networks & hidden layers