review CalTech machine learning, video 10 notes (Neural Networks)

8:32 2014-09-29 Monday
start review CalTech machine learning, 


video 10, Neural Networks


8:32 2014-09-29
gradient descent => stochastic gradient descent


8:53 2014-09-29
SGD == Stochastic Gradient Descent


8:54 2014-09-29
SVM == Support Vector Machine


8:54 2014-09-29
outline:


* Stochastic gradient descent


* Neural network model


* Backpropagation algorithm


8:55 2014-09-29
gradient descent // batch gradient descent


9:00 2014-09-29
SGD // Stochastic Gradient Descent


9:00 2014-09-29
one example at a time


9:03 2014-09-29
average direction


9:03 2014-09-29
randomized GD // SGD == Stochastic Gradient Descent


9:04 2014-09-29
Benefits of SGD:


* cheaper computation


* randomization


* simple


9:10 2014-09-29
we're going for cheap computation, and we get this one
for free (randomization to escape local minima)


9:10 2014-09-29
randomization helps


9:14 2014-09-29
learning rate:


tells us how far we go.
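
a minimal sketch of that update, in Python (my own illustration, not from the slides; grad_e, eta, X, Y are placeholder names):

    import numpy as np

    # a per-example gradient step; grad_e(w, x, y) is assumed to return the
    # gradient of the single-example error e(h(x), y) with respect to w
    def sgd(w, X, Y, grad_e, eta=0.1, n_steps=10000, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(n_steps):
            n = rng.integers(len(X))               # pick one example at a time
            w = w - eta * grad_e(w, X[n], Y[n])    # move along the negative gradient; eta sets how far we go
        return w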


9:14 2014-09-29
SGD in action


9:15 2014-09-29
SGD to solve "movie rating"


9:17 2014-09-29
we're going to match the taste of the user


to the content of the movie
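
a rough sketch of the movie-rating idea as a low-rank factor model trained with SGD; the factor dimension K, the learning rate, and the function name are illustrative choices of mine, not the lecture's:

    import numpy as np

    # a rating r is modeled as the dot product of user "taste" factors U[i]
    # and movie "content" factors V[j]; SGD updates one observed rating at a time
    def factorize(ratings, n_users, n_movies, K=10, eta=0.01, epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((n_users, K))
        V = 0.1 * rng.standard_normal((n_movies, K))
        for _ in range(epochs):
            for i, j, r in ratings:                # list of (user, movie, rating)
                err = r - U[i] @ V[j]              # error on this single rating
                U[i], V[j] = U[i] + eta * err * V[j], V[j] + eta * err * U[i]
        return U, V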


9:18 2014-09-29
batch gradient descent => stochastic gradient descent


9:21 2014-09-29
biological inspiration of neural networks


9:22 2014-09-29
biological function => biological structure


9:24 2014-09-29
maybe if we put a bunch of perceptrons together
in a network, we may be able to achieve the intelligence
in learning that a biological system does.


9:26 2014-09-29
combination of perceptrons rather than a single one.


9:31 2014-09-29
combining these very simple units does achieve something


9:32 2014-09-29
the famous problem where perceptrons failed


9:32 2014-09-29
can we do this with more than one perceptron
combined in the right way?


9:33 2014-09-29
that creates the full multilayer perceptron


9:38 2014-09-29
this is the original input space


10:12 2014-09-29
so this multilayer perceptron implements the function
that a single perceptron failed to implement.
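
the famous function is XOR; here is a small sketch of one way to build it from perceptrons in the ±1 convention (the weights are hand-picked for illustration, not necessarily the slide's exact numbers):

    import numpy as np

    def perceptron(x, w):                  # x includes the constant 1 as x[0]
        return 1 if w @ x >= 0 else -1

    def xor_net(x1, x2):
        # hidden layer: two perceptrons
        h1 = perceptron(np.array([1, x1, x2]), np.array([-1.5,  1, -1]))  # x1 AND (NOT x2)
        h2 = perceptron(np.array([1, x1, x2]), np.array([-1.5, -1,  1]))  # (NOT x1) AND x2
        # output layer: OR of the two hidden units
        return perceptron(np.array([1, h1, h2]), np.array([1.5, 1, 1]))

    for a in (-1, 1):
        for b in (-1, 1):
            print(a, b, xor_net(a, b))     # +1 exactly when a != b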


10:14 2014-09-29
"feedforward"


10:15 2014-09-29
I can get a very sophisticated surface under
the constraints of this hierarchical thing.


10:15 2014-09-29
powerful model


10:16 2014-09-29
for us, being powerful is good, but with


2 red flags:


* generalization


* optimization


10:17 2014-09-29
what is the combination of weights that
matches the function?


10:23 2014-09-29
so let's look at the neural networks


10:25 2014-09-29
the neural network looks like this.


10:25 2014-09-29
each layer has a nonlinearity


10:29 2014-09-29
θ is used in logistic regression as
the logistic function


10:30 2014-09-29
it's used here generically for any
nonlinearity you want.


10:30 2014-09-29
I could have a label for this, depending on
where this happens.


10:31 2014-09-29
follow the rules of derivation from one layer to another


10:34 2014-09-29
the intermediate layers, we're going to call them
"hidden layers", because the user doesn't see them.


10:35 2014-09-29
it's a soft threshold, I'm going to use the tanh


(hyperbolic tangent), 


10:38 2014-09-29
it is the combination of hard threshold & linearity


10:39 2014-09-29
if your signal is very small, it's as if you're linear


if your signal is extremely large, it's as if you're hard threshold


10:39 2014-09-29
and you get the benefit of one function that is analytic 


& very well behaved for good optimization.
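
for reference, standard facts about tanh (not a quote from the lecture): θ(s) = tanh(s) = (e^s - e^(-s)) / (e^s + e^(-s)); for small |s|, tanh(s) ≈ s (essentially linear); for large |s|, tanh(s) ≈ ±1 (essentially a hard threshold); and its derivative is θ'(s) = 1 - tanh^2(s), which keeps the gradient computations clean later on.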


10:41 2014-09-29
the notation will be more elaborate than the perceptron.


10:41 2014-09-29
although it's only a notational view graph, it's an
important view graph to follow, because if you decide to implement
a neural network, you just print this view graph, code it up,
and you have your neural network.


10:43 2014-09-29
the parameters of the neural network are called w;
the weights are indexed by layer, by the neuron they
come from, and by the neuron they go to.


10:44 2014-09-29
you keep repeating until you get the final output


11:03 2014-09-29
you apply x to the input terminal


11:03 2014-09-29
d^(l) is the dimensionality (number of units) of layer l
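
a minimal sketch of that forward pass, x^(l)_j = θ(s^(l)_j) with s^(l)_j = Σ_i w^(l)_ij x^(l-1)_i and x^(l)_0 = 1; representing the weights as a Python list of matrices is my own assumption:

    import numpy as np

    def forward(x, weights):
        # weights holds w^(1), ..., w^(L); each w^(l) has shape (d^(l-1) + 1, d^(l)),
        # and row 0 multiplies the constant coordinate x0 = 1
        xs = [np.concatenate(([1.0], x))]        # x^(0), with the bias coordinate
        for l, W in enumerate(weights):
            s = W.T @ xs[-1]                     # signal s^(l) = sum_i w^(l)_ij * x^(l-1)_i
            x_l = np.tanh(s)                     # x^(l) = θ(s^(l))
            if l < len(weights) - 1:             # hidden layers get a bias unit prepended
                x_l = np.concatenate(([1.0], x_l))
            xs.append(x_l)
        return xs                                # xs[-1] is the network output h(x)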


11:03 2014-09-29
backpropagation algorithm


11:08 2014-09-29
Applying SGD:


all the weights determine h // hypothesis


11:11 2014-09-29
target label: yn


11:11 2014-09-29
get error on example (xn, yn)


e(h(xn), yn) = e(w)


11:12 2014-09-29
to implement SGD, we need the gradient


∇e(w)


11:13 2014-09-29
the idea here is just doing it efficiently


11:14 2014-09-29
backpropagation algorithm: 


take one example at a time, apply it to the network,


then adjust all the weights of the network in the 


direction of the negative gradient according to that


single example, that is what makes it stochastic.


11:17 2014-09-29
the parameters are all the weights.


11:17 2014-09-29
you have different neurons in different layers,


so this is just a funny array.


11:17 2014-09-29
by definition, I have some error measure,


e(h(xn), yn), and this happens to be a function
of the weights of the network.


h is determined by w, because w is the active quantity
when we're learning


11:18 2014-09-29
to implement SGD, all you need to implement is 


the gradient of this quantity:


∇e(w)  // all the partial derivatives ∂e(w)/∂w_ij^(l), for every i, j, l


11:21 2014-09-29
so all you need to do is compute this partial 


derivative for every i, j, l


11:22 2014-09-29
the gradient vector is a huge vector,


each partial derivative is a component.


11:23 2014-09-29
then you take this entire vector of stuff,


then you move in the space along the negative


of that gradient. // gradient descent


11:23 2014-09-29
there is a big difference when you find
an efficient algorithm to do something.


11:24 2014-09-29
let's take part of the network,


11:26 2014-09-29
feeding through some weight into this guy.


11:26 2014-09-29
that error will change if you change w;
that is what ∂e/∂w tells us. // partial derivative
// rate of change


11:28 2014-09-29
output is a function of the previous layer


of the previous layer of the previous layer...


until I arrive here


11:30 2014-09-29
I have the network that has tons of weights.


11:30 2014-09-29
a trick for efficient computation


11:35 2014-09-29
but this one is as good as the original one


11:35 2014-09-29
this quantity can be computed recursively.


11:36 2014-09-29
the more troublesome one, I'm going to call it δ


11:37 2014-09-29
the interesting thing is the derivative
of this error with respect to the weight


11:38 2014-09-29
so let's get δ for the final layer


11:38 2014-09-29
the change in the w will be proportional to these 2 guys:
one of them is x here, and one of them is δ here


11:40 2014-09-29
so we will change the weight according to the two quantities
that the weight is sandwiched between; that is a pretty
attractive form.
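
written out (my reading of the slide's notation): ∂e/∂w^(l)_ij = x^(l-1)_i · δ^(l)_j, where δ^(l)_j = ∂e/∂s^(l)_j; the weight w^(l)_ij really is sandwiched between the x on its input side and the δ on its output side.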


11:44 2014-09-29
let's get δ for the final layer


11:44 2014-09-29
when we compute this, we get x for the 1st layer by
putting in the input, then we propagate forward and get the output.


11:44 2014-09-29
if you have δ for a later layer, you can get δ for an earlier layer.


11:45 2014-09-29
so this propagates backward, hence the name
backpropagation


11:46 2014-09-29
so we're going to start with the δ at the output.


11:46 2014-09-29
so when you look at the output layer,


11:47 2014-09-29
what is e(w)?

e(w) is the error measure between the value of your hypothesis
and the target output. the value of the hypothesis is the value
of the neural network in its current state, with the weights frozen:
you apply xn, you go forward until you get the output, which is h(xn).
you compare that to the target output, which is the label
of the example, yn.
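
for that setup (squared error, tanh output unit), the final-layer δ works out to: e(w) = (x^(L)_1 - yn)^2, so δ^(L)_1 = ∂e/∂s^(L)_1 = 2 (x^(L)_1 - yn)(1 - (x^(L)_1)^2), using θ'(s) = 1 - θ(s)^2.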


11:56 2014-09-29
so we have δ for the final layer


11:59 2014-09-29
the next step is to backpropagate to the other δs;

this is the essence of the algorithm


12:00 2014-09-29
I'm going backward.


12:02 2014-09-29
so now I need to take into consideration all
the ways this affects the output; I'm drawing all
the relevant parts of the network.


12:04 2014-09-29
so I'm going to apply the chain rule again.


12:04 2014-09-29
it affects all of these guys


12:05 2014-09-29
we're done, we just have to keep doing this,


and we get all the δs.
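
the resulting recursion (my transcription of the slide): δ^(l-1)_i = (1 - (x^(l-1)_i)^2) · Σ_j w^(l)_ij δ^(l)_j, i.e. the δs of layer l are pushed back through the same weights w^(l) and scaled by the tanh derivative at layer l-1.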


12:10 2014-09-29
the form for the δ is interesting.


12:11 2014-09-29
this looks exactly like the forward path.


12:12 2014-09-29
you see the reverse, now we're going down.


12:13 2014-09-29
that's what we do in the backward propagation.


12:14 2014-09-29
backpropagation algorithm:


you pick an example, you compute the x's forward,
you get the error, you compute the δs backward,
and the δ & the x determine the update for the weight in between.


12:16 2014-09-29
update the weights


12:16 2014-09-29
if you start with all the weights at exactly zero, you may be perfectly at the top of the hill


12:18 2014-09-29
but as long as you're there, you're not moving


12:19 2014-09-29
we just want to break the symmetry, 


we introduce randomness, we shake things a little bit,


12:22 2014-09-29
one final remark: the hidden layer


12:23 2014-09-29
the hidden layers are just a means for us


to get more sophisticated dependency.


12:23 2014-09-29
if you think what the hidden layers do,


they just do a nonlinear transform.


12:24 2014-09-29
I can look at these guys and consider them
features; those ones will be features of features...


12:25 2014-09-29
these are learned features 


12:25 2014-09-29
don't look at the data before you choose the transform.


12:26 2014-09-29
the network is looking at the data all it wants.
it's actually adjusting the weights to get the proper
transform that fits the data.


12:27 2014-09-29
this does not bother me, because I already charge
the network for the proper VC dimension


12:28 2014-09-29
the weights here that constitute that guy contribute


to the VC dimension


12:28 2014-09-29
the VC dimension is more or less the number of weights;
that's the rule of thumb here.
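
a quick way to apply that rule of thumb (my own helper; the architecture is illustrative):

    # count the weights of a network with layer sizes [d, d1, ..., 1];
    # the "+ 1" accounts for the constant/bias unit feeding each layer
    def num_weights(layers):
        return sum((a + 1) * b for a, b in zip(layers, layers[1:]))

    print(num_weights([10, 5, 3, 1]))   # (10+1)*5 + (5+1)*3 + (3+1)*1 = 77

with the course's earlier rule of thumb of roughly 10 examples per unit of VC dimension, such a network would want on the order of 770 examples.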


12:29 2014-09-29
looking at the data without accounting for it is bad


12:29 2014-09-29
it's a nonlinear transformation chosen with a view to
matching the very specific dependency that I'm after.

that's the source of efficiency there.


12:30 2014-09-29
can I interpret what the hidden layers are doing?


12:31 2014-09-29
can you please tell what the hidden layers are doing?


12:31 2014-09-29
factor number 7 is important in your case.


12:33 2014-09-29
very common in machine learning: when the learning
algorithm learns, it tries to find the right
hypothesis; it doesn't try to explain to you what the
right hypothesis is. finding it was the goal.


----------------------------------------------------------
12:37 2014-09-29
batch gradient descent => stochastic gradient descent(SGD)


12:38 2014-09-29
conjugate gradient


12:38 2014-09-29
convex optimization


12:38 2014-09-29
initially, neural networks were going to solve
the problems of the universe.


12:39 2014-09-29
because of the simplicity of the network & the simplicity
of the algorithm, people used them in many applications.
it became the standard tool.


12:41 2014-09-29
they have some serious competitors, for example SVM (Support Vector
Machine) and lots of other models, but they are still in use.


12:42 2014-09-29
not the top choice nowadays, but every now and then, someone
will publish something where they use a neural network and
get good results.


12:44 2014-09-29
how to choose the number of layers?


OK, this is "model selection": the neural network
is really a class of hypothesis sets.

how many layers, how many units per layer?


12:47 2014-09-29
that's not an approximation question, we're
talking about a learning question.


12:48 2014-09-29
the real question is: how many weights can I afford?
because that reflects directly on the VC dimension &
the number of examples we need.


12:48 2014-09-29
the question of how you organize them is less severe.


12:50 2014-09-29
given a particular architecture, it tries to kill


some weights in order to reduce the number of parameters


as a method for regularization.


12:51 2014-09-29
basically this is a model selection question:

validation, regularization


12:52 2014-09-29
I want the analytic properties for differentiation.


12:53 2014-09-29
but the whole idea is that you're going to arrive 


at a minimum.


12:54 2014-09-29
you just start from a different starting point,
with a different randomization for the presentation.


12:55 2014-09-29
we're dealing with a sophisticated model:


12:57 2014-09-29
2 red flags: generalization & optimization


12:58 2014-09-29
when you have a powerful model, the question of generalization
comes in, since it means you can express a lot of things.

12:59 2014-09-29
the VC dimension summarizes all generalization considerations


13:00 2014-09-29
but at least we're under control, because we


have the numbers that describe it.


13:02 2014-09-29
I'm given the data set: (input, output),
and I have a multilayer perceptron, where each layer
is computing a perceptron function of a perceptron function of...

13:03 2014-09-29
the VC dimension is roughly the number of parameters;
that rule has stood the test of practice.


13:05 2014-09-29
the minimum of the error function will happen
at a particular combination of the weights.


13:14 2014-09-29
a neural network is a model, a genetic algorithm is
an optimization technique; there is no inherent relationship
between them.


13:17 2014-09-29
the learning algorithm looks at the data,
but we have already chosen the hypothesis set.


13:19 2014-09-29
I recommend using a package for neural networks


13:20 2014-09-29
neural networks have been studied in great detail


13:21 2014-09-29
if you're very keen on the interpretation aspect.


13:21 2014-09-29
when you're combining perceptrons, you're going


to implement more interesting functions.


13:22 2014-09-29
the multilayer structure is an interesting model
to study; from then on, it becomes a learning question


13:24 2014-09-29
we have a neural network; we no longer look at the
target function and try to choose the neurons by hand.
we put it up as a model, and we let the learning algorithm
choose the weights. that is backpropagation in this case.


13:25 2014-09-29
could you please describe early stopping?


it's basically a way to prevent overfitting,


// validation, regularization (model selection)


13:27 2014-09-29
what are the tools to deal with overfitting?


validation & regularization


then early stopping will be very easily explained.


13:28 2014-09-29
instead of choosing points at random, you choose a
random permutation of 1 to N, and then go through
that in order; for the next epoch, you do another
permutation, ...
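
a minimal sketch of that per-epoch shuffling (same placeholder names as the SGD sketch earlier):

    import numpy as np

    def sgd_epochs(w, X, Y, grad_e, eta=0.1, n_epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(n_epochs):
            for n in rng.permutation(len(X)):        # a fresh permutation of 1..N each epoch
                w = w - eta * grad_e(w, X[n], Y[n])  # every example is used exactly once per epoch
        return w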


13:31 2014-09-29
If you pick points purely at random instead, eventually every
example will contribute about the same, but an epoch will be
difficult to define.


13:32 2014-09-29
Does having layers and no loops limit the power of the
neural network?


13:34 2014-09-29
VC dimension roughly depends on the number of parameters(weights)