review CalTech machine learning, video 10 notes (Neural Networks)

8:32 2014-09-29 Monday
start review CalTech machine learning, 


video 10, Neural Networks


8:32 2014-09-29
gradient descent => stochastic gradient descent


8:53 2014-09-29
SGD == Stochastic Gradient Descent


8:54 2014-09-29
SVM == Support Vector Machine


8:54 2014-09-29
outline:


* Stochastic gradient descent


* Neural network model


* Backpropagation algorithm


8:55 2014-09-29
gradient descent // batch gradient descent


9:00 2014-09-29
SGD // Stochastic Gradient Descent


9:00 2014-09-29
one example at a time


9:03 2014-09-29
average direction


9:03 2014-09-29
randomized GD // SGD == Stochastic Gradient Descent


9:04 2014-09-29
Benefits of SGD:


* cheaper computation


* randomization


* simple


9:10 2014-09-29
we're going for cheap computation, and we get this one
for free (randomization to escape local minima)


9:10 2014-09-29
randomization helps


9:14 2014-09-29
learning rate:


tells us how far we go.
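
a minimal sketch of that update, in Python (my own illustration, not from the slides; grad_e, eta, X, Y are placeholder names):

    import numpy as np

    # a per-example gradient step; grad_e(w, x, y) is assumed to return the
    # gradient of the single-example error e(h(x), y) with respect to w
    def sgd(w, X, Y, grad_e, eta=0.1, n_steps=10000, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(n_steps):
            n = rng.integers(len(X))               # pick one example at a time
            w = w - eta * grad_e(w, X[n], Y[n])    # move along the negative gradient; eta sets how far we go
        return w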


9:14 2014-09-29
SGD in action


9:15 2014-09-29
SGD to solve "movie rating"


9:17 2014-09-29
we're going to match the taste of the user


to the content of the movie
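
a rough sketch of the movie-rating idea as a low-rank factor model trained with SGD; the factor dimension K, the learning rate, and the function name are illustrative choices of mine, not the lecture's:

    import numpy as np

    # a rating r is modeled as the dot product of user "taste" factors U[i]
    # and movie "content" factors V[j]; SGD updates one observed rating at a time
    def factorize(ratings, n_users, n_movies, K=10, eta=0.01, epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((n_users, K))
        V = 0.1 * rng.standard_normal((n_movies, K))
        for _ in range(epochs):
            for i, j, r in ratings:                # list of (user, movie, rating)
                err = r - U[i] @ V[j]              # error on this single rating
                U[i], V[j] = U[i] + eta * err * V[j], V[j] + eta * err * U[i]
        return U, V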


9:18 2014-09-29
batch gradient descent => stochastic gradient descent


9:21 2014-09-29
biological inspiration of neural networks


9:22 2014-09-29
biological function => biological structure


9:24 2014-09-29
maybe if we put a bunch of perceptrons together
in a network, we may be able to achieve the intelligence
in learning that a biological system does.


9:26 2014-09-29
combination of perceptrons rather than a single one.


9:31 2014-09-29
combining these very simple units does achieve something


9:32 2014-09-29
the famous problem where perceptrons failed


9:32 2014-09-29
can we do this with more than one perceptron
combined in the right way?


9:33 2014-09-29
that creates the full multilayer perceptron


9:38 2014-09-29
this is the original input space


10:12 2014-09-29
so this multilayer perceptron implements the function
that a single perceptron failed to implement.
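
the famous function is XOR; here is a small sketch of one way to build it from perceptrons in the ±1 convention (the weights are hand-picked for illustration, not necessarily the slide's exact numbers):

    import numpy as np

    def perceptron(x, w):                  # x includes the constant 1 as x[0]
        return 1 if w @ x >= 0 else -1

    def xor_net(x1, x2):
        # hidden layer: two perceptrons
        h1 = perceptron(np.array([1, x1, x2]), np.array([-1.5,  1, -1]))  # x1 AND (NOT x2)
        h2 = perceptron(np.array([1, x1, x2]), np.array([-1.5, -1,  1]))  # (NOT x1) AND x2
        # output layer: OR of the two hidden units
        return perceptron(np.array([1, h1, h2]), np.array([1.5, 1, 1]))

    for a in (-1, 1):
        for b in (-1, 1):
            print(a, b, xor_net(a, b))     # +1 exactly when a != b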


10:14 2014-09-29
"feedforward"


10:15 2014-09-29
I can get a very sophisticated surface under
the constraints of this hierarchical thing.


10:15 2014-09-29
powerful model


10:16 2014-09-29
for us, being powerful is good, but with


2 red flags:


* generalization


* optimization


10:17 2014-09-29
what is the combination of weights that
matches the function?


10:23 2014-09-29
so let's look at the neural networks


10:25 2014-09-29
the neural network looks like this.


10:25 2014-09-29
each layer has a nonlinearity


10:29 2014-09-29
θ is used in logistic regression as
the logistic function


10:30 2014-09-29
it's used here generically for any
nonlinearity you want.


10:30 2014-09-29
I could have a label for this, depending on
where this happens.


10:31 2014-09-29
follow the rules of derivation from one layer to another


10:34 2014-09-29
the intermediate layers, we're going to call them
"hidden layers", because the user doesn't see them.


10:35 2014-09-29
it's a soft threshold, I'm going to use the tanh


(hyperbolic tangent), 


10:38 2014-09-29
it is the combination of hard threshold & linearity


10:39 2014-09-29
if your signal is very small, it's as if you're linear


if your signal is extremely large, it's as if you're hard threshold


10:39 2014-09-29
and you get the benefit of one function that is analytic 


& very well behaved for good optimization.
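
for reference, standard facts about tanh (not a quote from the lecture): θ(s) = tanh(s) = (e^s - e^(-s)) / (e^s + e^(-s)); for small |s|, tanh(s) ≈ s (essentially linear); for large |s|, tanh(s) ≈ ±1 (essentially a hard threshold); and its derivative is θ'(s) = 1 - tanh^2(s), which keeps the gradient computations clean later on.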


10:41 2014-09-29
the notation will be more elaborate than the perceptron.


10:41 2014-09-29
although it's only a notational view graph, it's an
important view graph to follow, because if you decide to implement
a neural network, you just print this view graph, code it up,
and you have your neural network.


10:43 2014-09-29
the parameters of the neural network are called w;
the weights are indexed by layer, by the neuron they
come from, and by the neuron they go to.


10:44 2014-09-29
you keep repeating until you get the final output


11:03 2014-09-29
you apply x to the input terminal


11:03 2014-09-29
d^(l) is the dimensionality (number of units) of layer l
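
a minimal sketch of that forward pass, x^(l)_j = θ(s^(l)_j) with s^(l)_j = Σ_i w^(l)_ij x^(l-1)_i and x^(l)_0 = 1; representing the weights as a Python list of matrices is my own assumption:

    import numpy as np

    def forward(x, weights):
        # weights holds w^(1), ..., w^(L); each w^(l) has shape (d^(l-1) + 1, d^(l)),
        # and row 0 multiplies the constant coordinate x0 = 1
        xs = [np.concatenate(([1.0], x))]        # x^(0), with the bias coordinate
        for l, W in enumerate(weights):
            s = W.T @ xs[-1]                     # signal s^(l) = sum_i w^(l)_ij * x^(l-1)_i
            x_l = np.tanh(s)                     # x^(l) = θ(s^(l))
            if l < len(weights) - 1:             # hidden layers get a bias unit prepended
                x_l = np.concatenate(([1.0], x_l))
            xs.append(x_l)
        return xs                                # xs[-1] is the network output h(x)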


11:03 2014-09-29
backpropagation algorithm


11:08 2014-09-29
Applying SGD:


all the weights determine h // hypothesis


11:11 2014-09-29
target label: yn


11:11 2014-09-29
get error on example (xn, yn)


e(h(xn), yn) = e(w)


11:12 2014-09-29
to implement SGD, we need the gradient


∇e(w)


11:13 2014-09-29
the idea here is just doing it efficiently


11:14 2014-09-29
backpropagation algorithm: 


take one example at a time, apply it to the network,


then adjust all the weights of the network in the 


direction of the negative gradient according to that


single example, that is what makes it stochastic.


11:17 2014-09-29
the parameters are all the weights.


11:17 2014-09-29
you have different neurons in different layers,


so this is just a funny array.


11:17 2014-09-29
by definition, I have some error measure,


e(h(xn), yn), and this happens to be a function
of the weights of the network.


h is determined by w, because w is the active quantity
when we're learning


11:18 2014-09-29
to implement SGD, all you need to implement is 


the gradient of this quantity:


∇e(w)  // all the partial derivatives ∂e(w)/∂w_ij^(l), for every i, j, l


11:21 2014-09-29
so all you need to do is compute this partial 


derivative for every i, j, l


11:22 2014-09-29
the gradient vector is a huge vector,


each partial derivative is a component.


11:23 2014-09-29
then you take this entire vector of stuff,


then you move in the space along the negative


of that gradient. // gradient descent


11:23 2014-09-29
there is a big difference when you find
an efficient algorithm to do something.


11:24 2014-09-29
let's take part of the network,


11:26 2014-09-29
feeding through some weight into this guy.


11:26 2014-09-29
that error will change if you change w;
that is what ∂e/∂w tells us. // partial derivative
// rate of change


11:28 2014-09-29
output is a function of the previous layer


of the previous layer of the previous layer...


until I arrive here


11:30 2014-09-29
I have the network that has tons of weights.


11:30 2014-09-29
a trick for efficient computation


11:35 2014-09-29
but this one is as good as the original one


11:35 2014-09-29
this quantity can be computed recursively.


11:36 2014-09-29
the more troublesome one, I'm going to call it δ


11:37 2014-09-29
the interesting thing is the derivative
of this error with respect to the weight


11:38 2014-09-29
so let's get δ for the final layer


11:38 2014-09-29
the change in the w will be proportional to these 2 guys:
one of them is x here, and one of them is δ here


11:40 2014-09-29
so we will change the weight according to the two quantities
that the weight is sandwiched between; that is a pretty
attractive form.
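
written out (my reading of the slide's notation): ∂e/∂w^(l)_ij = x^(l-1)_i · δ^(l)_j, where δ^(l)_j = ∂e/∂s^(l)_j; the weight w^(l)_ij really is sandwiched between the x on its input side and the δ on its output side.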


11:44 2014-09-29
let's get δ for the final layer


11:44 2014-09-29
when we compute this, we get x for the 1st layer by
putting in the input, then we propagate forward and get the output.


11:44 2014-09-29
if you have δ for a later layer, you can get δ for an earlier layer.


11:45 2014-09-29
so this propagates backward, hence the name
backpropagation


11:46 2014-09-29
so we're going to start with the δ at the output.


11:46 2014-09-29
so when you look at the output layer,


11:47 2014-09-29
what is e(w)?

e(w) is the error measure between the value of your hypothesis
and the target output. the value of the hypothesis is the value
of the neural network in its current state, with the weights frozen:
you apply xn, you go forward until you get the output, which is h(xn).
you compare that to the target output, which is the label
of the example, yn.
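
for that setup (squared error, tanh output unit), the final-layer δ works out to: e(w) = (x^(L)_1 - yn)^2, so δ^(L)_1 = ∂e/∂s^(L)_1 = 2 (x^(L)_1 - yn)(1 - (x^(L)_1)^2), using θ'(s) = 1 - θ(s)^2.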


11:56 2014-09-29
so we have δ for the final layer


11:59 2014-09-29
the next step is to backpropagate to the other δs;

this is the essence of the algorithm


12:00 2014-09-29
I'm going backward.


12:02 2014-09-29
so now I need to take into consideration all
the ways this affects the output; I'm drawing all
the relevant parts of the network.


12:04 2014-09-29
so I'm going to apply the chain rule again.


12:04 2014-09-29
it affects all of these guys


12:05 2014-09-29
we're done, we just have to keep doing this,


and we get all the δs.
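
the resulting recursion (my transcription of the slide): δ^(l-1)_i = (1 - (x^(l-1)_i)^2) · Σ_j w^(l)_ij δ^(l)_j, i.e. the δs of layer l are pushed back through the same weights w^(l) and scaled by the tanh derivative at layer l-1.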


12:10 2014-09-29
the form for the δ is interesting.


12:11 2014-09-29
this looks exactly like the forward path.


12:12 2014-09-29
you see the reverse, now we're going down.


12:13 2014-09-29
that's what we do in the backward propagation.


12:14 2014-09-29
backpropagation algorithm:


you pick an example, you compute the x's forward,
you get the error, you compute the δs backward,
and the δ & the x determine the update for the weight in between.


12:16 2014-09-29
update the weights


12:16 2014-09-29
if you start with all the weights at exactly zero, you may be perfectly at the top of the hill


12:18 2014-09-29
but as long as you're there, you're not moving


12:19 2014-09-29
we just want to break the symmetry, 


we introduce randomness, we shake things a little bit,


12:22 2014-09-29
one final remark: the hidden layer


12:23 2014-09-29
the hidden layers are just a means for us


to get more sophisticated dependency.


12:23 2014-09-29
if you think what the hidden layers do,


they just do a nonlinear transform.


12:24 2014-09-29
I can look at these guys and consider them
features; those ones will be features of features...


12:25 2014-09-29
these are learned features 


12:25 2014-09-29
don't look at the data before you choose the transform.


12:26 2014-09-29
the network is looking at the data all it wants.
it's actually adjusting the weights to get the proper
transform that fits the data.


12:27 2014-09-29
this does not bother me, because I already charge
the network for the proper VC dimension


12:28 2014-09-29
the weights here that constitute that guy contribute


to the VC dimension


12:28 2014-09-29
the VC dimension is more or less the number of weights;
that's the rule of thumb here.
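
a quick way to apply that rule of thumb (my own helper; the architecture is illustrative):

    # count the weights of a network with layer sizes [d, d1, ..., 1];
    # the "+ 1" accounts for the constant/bias unit feeding each layer
    def num_weights(layers):
        return sum((a + 1) * b for a, b in zip(layers, layers[1:]))

    print(num_weights([10, 5, 3, 1]))   # (10+1)*5 + (5+1)*3 + (3+1)*1 = 77

with the course's earlier rule of thumb of roughly 10 examples per unit of VC dimension, such a network would want on the order of 770 examples.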


12:29 2014-09-29
looking at the data without accounting for it is bad


12:29 2014-09-29
it's a nonlinear transformation chosen with a view to
matching the very specific dependency that I'm after.

that's the source of efficiency there.


12:30 2014-09-29
can I interpret what the hidden layers are doing?


12:31 2014-09-29
can you please tell what the hidden layers are doing?


12:31 2014-09-29
factor number 7 is important in your case.


12:33 2014-09-29
very common in machine learning: when the learning
algorithm learns, it tries to find the right
hypothesis; it doesn't try to explain to you what the
right hypothesis is. finding it was the goal.


----------------------------------------------------------
12:37 2014-09-29
batch gradient descent => stochastic gradient descent(SGD)


12:38 2014-09-29
conjugate gradient


12:38 2014-09-29
convex optimization


12:38 2014-09-29
initially, neural networks were going to solve
the problems of the universe.


12:39 2014-09-29
because of the simplicity of the network & the simplicity
of the algorithm, people used them in many applications.
it became the standard tool.


12:41 2014-09-29
they have some serious competitors, for example SVM (Support Vector
Machine) and lots of other models, but they are still in use.


12:42 2014-09-29
not the top choice nowadays, but every now and then, someone
will publish something where they use a neural network and
get good results.


12:44 2014-09-29
how to choose the number of layers?


OK, this is "model selection": the neural network
is really a class of hypothesis sets.

how many layers, how many units per layer?


12:47 2014-09-29
that's not an approximation question, we're
talking about a learning question.


12:48 2014-09-29
the real question is: how many weights can I afford?
because that reflects directly on the VC dimension &
the number of examples we need.


12:48 2014-09-29
the question of how you organize them is less severe.


12:50 2014-09-29
given a particular architecture, it tries to kill


some weights in order to reduce the number of parameters


as a method for regularization.


12:51 2014-09-29
basically this is a model selection question:

validation, regularization


12:52 2014-09-29
I want the analytic properties for differentiation.


12:53 2014-09-29
but the whole idea is that you're going to arrive 


at a minimum.


12:54 2014-09-29
you just start from a different starting point,
with a different randomization for the presentation.


12:55 2014-09-29
we're dealing with a sophisticated model:


12:57 2014-09-29
2 red flags: generalization & optimization


12:58 2014-09-29
when you have a powerful model, the question of generalization
comes in, since it means you can express a lot of things.

12:59 2014-09-29
the VC dimension summarizes all generalization considerations


13:00 2014-09-29
but at least we're under control, because we


have the numbers that describe it.


13:02 2014-09-29
I'm given the data set: (input, output),
and I have a multilayer perceptron, where each layer
is computing a perceptron function of a perceptron function of...

13:03 2014-09-29
the VC dimension is roughly the number of parameters;
that rule has stood the test of practice.


13:05 2014-09-29
the minimum of the error function will happen
at a particular combination of the weights.


13:14 2014-09-29
a neural network is a model, a genetic algorithm is
an optimization technique; there is no inherent relationship
between them.


13:17 2014-09-29
the learning algorithm looks at the data,
but we have already chosen the hypothesis set.


13:19 2014-09-29
I recommend using a package for neural networks


13:20 2014-09-29
neural networks have been studied in great detail


13:21 2014-09-29
if you're very keen on the interpretation aspect.


13:21 2014-09-29
when you're combining perceptrons, you're going


to implement more interesting functions.


13:22 2014-09-29
the multilayer structure is an interesting model
to study; from then on, it becomes a learning question


13:24 2014-09-29
we have a neural network; we no longer look at the
target function and try to choose the neurons by hand.
we put it up as a model, and we let the learning algorithm
choose the weights. that is backpropagation in this case.


13:25 2014-09-29
could you please describe early stopping?


it's basically a way to prevent overfitting,


// validation, regularization (model selection)


13:27 2014-09-29
what are the tools to deal with overfitting?


validation & regularization


then early stopping will be very easily explained.


13:28 2014-09-29
instead of choosing points at random, you choose a
random permutation of 1 to N, and then go through
that in order; for the next epoch, you do another
permutation, ...
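
a minimal sketch of that per-epoch shuffling (same placeholder names as the SGD sketch earlier):

    import numpy as np

    def sgd_epochs(w, X, Y, grad_e, eta=0.1, n_epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(n_epochs):
            for n in rng.permutation(len(X)):        # a fresh permutation of 1..N each epoch
                w = w - eta * grad_e(w, X[n], Y[n])  # every example is used exactly once per epoch
        return w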


13:31 2014-09-29
If you pick points purely at random instead, eventually every
example will contribute about the same, but an epoch will be
difficult to define.


13:32 2014-09-29
Does having layers and no loops limit the power of the
neural network?


13:34 2014-09-29
VC dimension roughly depends on the number of parameters(weights)