4.1 - deep L-layer neural network
We have seen forward propagation and backward propagation in the context of a neural network with a single hidden layer, as well as for logistic regression, and we have learned about vectorization and why it is important to initialize the parameters randomly. By now we have seen most of the ideas we need to implement a deep neural network. What we will do now is take those ideas and put them together so that we can implement our own deep neural network.
Shallow versus deep is a matter of degree. When we count the layers of a neural network, we don't count the input layer; we count only the hidden layers and the output layer. There are functions that a very deep neural network can learn but that shallow models cannot. However, for any given problem it may be hard to predict in advance how deep a network should be, so it is reasonable to try logistic regression, then one and two hidden layers, and treat the number of hidden layers as another hyperparameter.
Let's now go through the notation we use to describe a deep neural network.
- L = #(layers), n[l] = #(units in layer l)
- a[l] = activations in layer l, where a[l] = g[l](z[l]), a[0] = x, and a[L] = ŷ
- W[l], b[l] denote the weights and bias for computing the value z[l] in layer l
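To make this notation concrete, here is a minimal initialization sketch (the function name and the 0.01 scaling factor are our own illustrative choices, not fixed by these notes): W[l] gets shape (n[l], n[l−1]) and b[l] gets shape (n[l], 1).

```python
import numpy as np

def initialize_parameters(layer_dims):
    """Randomly initialize W[l], b[l] for l = 1..L from a list of layer sizes."""
    np.random.seed(0)            # for reproducibility in this sketch
    parameters = {}
    L = len(layer_dims) - 1      # number of layers; the input layer is not counted
    for l in range(1, L + 1):
        # W[l] has shape (n[l], n[l-1]); small random values break symmetry
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        # b[l] has shape (n[l], 1); zeros are fine for biases
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
```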
4.2 - forward propagation in a deep network
Now we will discuss how we can perform forward propagation in a deep network.
Let's first go over what forward propagation looks like for a single training example, and later we will look at the vectorized version that carries out forward propagation on the entire training set at once.
For one training example, the general rule for the forward propagation equations is:
z[l] = W[l]a[l−1] + b[l]
a[l] = g[l](z[l])
How about doing this in a vectorized way for the whole training set at the same time? Bear in mind that X is just the training examples stacked in different columns. Similarly, we take the vectors z[l](i) or a[l](i), stack them up as columns, and call the results Z[l] or A[l].
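The stacked computation can be sketched as a short loop (the function name, and the choice of tanh for hidden layers with a sigmoid output, are assumptions for illustration):

```python
import numpy as np

def deep_forward(X, parameters, L):
    """Vectorized forward propagation through L layers.

    X has shape (n[0], m): one training example per column."""
    A = X  # A[0] = X
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]
        b = parameters["b" + str(l)]
        Z = W @ A + b  # Z[l] = W[l] A[l-1] + b[l]; b broadcasts over the m columns
        A = np.tanh(Z) if l < L else 1 / (1 + np.exp(-Z))  # hidden: tanh, output: sigmoid
    return A  # A[L] = ŷ, shape (n[L], m)
```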
4.3 - getting your matrix dimensions right
For the vectorized version, the dimensions of W[l] and b[l] stay the same, but instead of being (n[l], 1), the dimension of Z[l] (and of A[l]) becomes (n[l], m), where m is the number of training examples. As a check: W[l] is (n[l], n[l−1]), and b[l] is (n[l], 1), broadcast across the m columns.
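A quick way to sanity-check these dimensions is to assert them in code (the layer sizes below are arbitrary, and ReLU is just an illustrative activation):

```python
import numpy as np

# a hypothetical 3-layer network: n[0]=2, n[1]=4, n[2]=3, n[3]=1, with m=5 examples
layer_dims = [2, 4, 3, 1]
m = 5
A = np.random.randn(layer_dims[0], m)  # A[0] = X, shape (n[0], m)
for l in range(1, len(layer_dims)):
    W = np.random.randn(layer_dims[l], layer_dims[l - 1])  # W[l] is (n[l], n[l-1])
    b = np.zeros((layer_dims[l], 1))                       # b[l] is (n[l], 1)
    Z = W @ A + b                            # b broadcasts over the m columns
    assert Z.shape == (layer_dims[l], m)     # Z[l] is (n[l], m)
    A = np.maximum(0, Z)                     # ReLU keeps the shape
```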
4.4 - why deep representations
What is a deep network computing? If you are building a face recognition or face detection system, here is what the deep network could be doing. Intuitively, you can think of the earlier layers of the neural network as detecting simple features, such as edges, and the later layers as composing them together to learn more complex functions. So a deep neural network with multiple hidden layers might have the earlier layers learn these low-level, simple features, and then have the deeper layers put together the simple things they detect in order to recognize more complex things.
4.5 - building block of deep neural network
We have seen the basic building blocks of forward propagation and backward propagation, the key components we need to implement a deep neural network. Now let's see how to put them together to build a deep net.
Let's pick one layer and focus on the computation of just that layer for now. For layer l, the forward step is:
- parameters: W[l], b[l]
- input: a[l−1]
- output: a[l]
- z[l] = W[l]a[l−1] + b[l]
- a[l] = g[l](z[l])
- cache z[l]: it will be useful for the backward propagation step later
For layer l, the backward step is:
- input: da[l], cache: z[l]
- output: da[l−1], dW[l], db[l]
This is the basic structure of how we implement the forward and backward propagation steps. Now we have seen one of the basic building blocks for implementing a deep neural network: in each layer there is a forward propagation step and a corresponding backward propagation step, and a cache passes information from one to the other.
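The forward/backward pairing and the cache can be sketched as follows for a single example (function names are ours, and ReLU is assumed as g[l] for illustration):

```python
import numpy as np

def layer_forward(a_prev, W, b):
    """Forward step for one layer: returns a[l] and a cache for the backward step."""
    z = W @ a_prev + b
    a = np.maximum(0, z)       # g[l] = ReLU, an assumption for this sketch
    cache = (a_prev, W, z)     # everything the backward step will need
    return a, cache

def layer_backward(da, cache):
    """Backward step for the same layer: consumes the cache deposited going forward."""
    a_prev, W, z = cache
    dz = da * (z > 0)          # da[l] * g[l]'(z[l]) for ReLU
    dW = dz @ a_prev.T         # dW[l] = dz[l] . a[l-1]^T
    db = dz                    # db[l] = dz[l] (single example)
    da_prev = W.T @ dz         # da[l-1] = W[l]^T . dz[l]
    return da_prev, dW, db
```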
4.6 - forward and backward propagation
forward propagation:
- input: a[l−1]
- output: a[l], cache: w[l], b[l], z[l], a[l−1]
for one single example:
z[l] = W[l]a[l−1] + b[l]
a[l] = g[l](z[l])
for the entire training set (vectorized version):
Z[l] = W[l]A[l−1] + b[l]
A[l] = g[l](Z[l])
backward propagation:
- input: da[l]
- output: da[l−1], dW[l], db[l]
for one single example:
dz[l] = da[l] ∗ g[l]′(z[l])
dw[l] = dz[l] ⋅ a[l−1]T
db[l] = dz[l]
da[l−1] = w[l]T ⋅ dz[l]
As a reminder, in the neural network with just one hidden layer these combine into:
dz[l] = w[l+1]T ⋅ dz[l+1] ∗ g[l]′(z[l])
for the entire training set (vectorized version):
dZ[l] = dA[l] ∗ g[l]′(Z[l])
dW[l] = (1/m) dZ[l] ⋅ A[l−1]T
db[l] = (1/m) np.sum(dZ[l], axis=1, keepdims=True)
dA[l−1] = W[l]T ⋅ dZ[l]
According to the backward propagation equations for any layer l, we need to cache z[l] (along with a[l−1] and W[l]) from forward propagation.
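The vectorized backward equations translate almost line for line into NumPy (the function name is ours, and ReLU is assumed as the activation for illustration):

```python
import numpy as np

def layer_backward_vec(dA, cache):
    """Vectorized backward step for one ReLU layer.

    cache = (A_prev, W, Z) saved during forward propagation."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]                   # number of training examples
    dZ = dA * (Z > 0)                     # dZ[l] = dA[l] * g[l]'(Z[l]) for ReLU
    dW = (1 / m) * dZ @ A_prev.T          # dW[l] = (1/m) dZ[l] . A[l-1]^T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ                    # dA[l-1] = W[l]^T . dZ[l]
    return dA_prev, dW, db
```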
4.7 - parameter vs hyperparameter
parameters: W[1], b[1], W[2], b[2], ⋯
hyperparameters:
- learning rate
- #iterations
- #hidden layers
- #units in each hidden layer
- choice of activation function
These are parameters that control W and b, so we call them hyperparameters.
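To see the distinction in code, here is a minimal gradient-descent sketch for logistic regression (the function name and default values are illustrative): the learning rate and number of iterations are hyperparameters we choose, while W and b are the parameters the algorithm learns.

```python
import numpy as np

def train(X, Y, learning_rate=0.01, num_iterations=100):
    """Logistic regression trained with gradient descent.

    learning_rate and num_iterations are hyperparameters; W and b are parameters."""
    n, m = X.shape
    W = np.zeros((1, n))
    b = 0.0
    for _ in range(num_iterations):
        A = 1 / (1 + np.exp(-(W @ X + b)))  # forward: sigmoid activation
        dZ = A - Y                          # backward step for the logistic loss
        dW = (1 / m) * dZ @ X.T
        db = (1 / m) * np.sum(dZ)
        W -= learning_rate * dW             # the hyperparameter controls the update size
        b -= learning_rate * db
    return W, b
```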
4.8 - summary