Contents
Classic Networks
LeNet-5
LeCun et al, 1998. Gradient-based learning applied to document recognition
AlexNet
Krizhevsky et al, 2012. ImageNet classification with deep convolutional neural networks
VGG-16
Simonyan & Zisserman 2015. Very deep convolutional networks for large-scale image recognition.
Residual Networks
The Problem of Deep Neural Networks
The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow. More specifically, during gradient descent, as you backprop from the final layer back to the first layer, you are multiplying by the weight matrix on each step, and thus the gradient can decrease exponentially quickly to zero (or, in rare cases, grow exponentially quickly and “explode” to take very large values).
During training, you might therefore see the magnitude (or norm) of the gradient for the earlier layers decrease to zero very rapidly as training proceeds:
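This shrinking can be sketched numerically. The snippet below (a stand-in for backprop through many layers, not code from the course) repeatedly multiplies a gradient vector by small random weight matrices and watches its norm collapse toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 50

# Backprop through `depth` layers multiplies the gradient by a weight
# matrix at each step; with small weights the norm shrinks exponentially.
grad = rng.normal(size=n)
for layer in range(depth):
    W = rng.normal(scale=0.05, size=(n, n))  # small random weights
    grad = W.T @ grad
    if layer % 10 == 9:
        print(f"after {layer + 1} layers: ||grad|| = {np.linalg.norm(grad):.3e}")
```

With `scale=0.05` and `n=64`, each multiplication shrinks the norm by roughly 0.4, so after 50 layers the gradient is numerically negligible; scaling the weights up instead makes it explode.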
Architecture
It turns out that if we use a standard optimization algorithm, such as gradient descent or one of the fancier optimizers, to train a plain network (one without the extra residual shortcuts or skip connections), then empirically, as we increase the number of layers, the training error tends to decrease for a while but then goes back up.
He et al, 2015. Deep residual learning for image recognition
Why Do ResNets Work?
In theory, as we make a neural network deeper, it should only do better and better on the training set; a deeper network should only help. But in practice, with a plain network that is very deep, the optimization algorithm has a much harder time training, so the training error gets worse if you pick a network that is too deep. With ResNets, however, even as the number of layers gets deeper, the training error can keep going down, even when training a network with over a hundred layers.
As in the image above, consider a big neural network whose output at some layer is $a^{[l]}$, and add two layers at the end of the pipeline, so the output becomes $a^{[l+2]}$. Make these two layers a residual block, with the extra shortcut. For the sake of argument, say we use the ReLU activation function throughout the network, so all the activations are greater than or equal to zero, with the possible exception of the input $X$. Then we get:
$$a^{[l+2]} = g(W^{[l+2]}a^{[l+1]} + b^{[l+2]} + a^{[l]})$$
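As a sanity check, here is a minimal NumPy sketch of this forward pass. The function and variable names are made up for illustration, and the dimensions are assumed to match so the shortcut can simply be added:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a_l, W1, b1, W2, b2):
    """Forward pass of a two-layer residual block (hypothetical helper;
    dimensions are assumed to match so the shortcut adds directly)."""
    a_l1 = relu(W1 @ a_l + b1)         # a^{[l+1]} = g(W^{[l+1]} a^{[l]} + b^{[l+1]})
    return relu(W2 @ a_l1 + b2 + a_l)  # a^{[l+2]} = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})

rng = np.random.default_rng(1)
n = 4
a_l = relu(rng.normal(size=n))  # activations are non-negative after ReLU
out = residual_block(a_l, rng.normal(size=(n, n)), np.zeros(n),
                     rng.normal(size=(n, n)), np.zeros(n))
print(out.shape)  # (4,)
```

Note that setting `W2` and `b2` to zero makes the block output exactly `a_l`, which is the identity argument developed next.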
Now notice something: if we use L2 regularization (weight decay), it will tend to shrink the value of $W^{[l+2]}$. If $W^{[l+2]} = 0$, and say for the sake of argument that $b^{[l+2]} = 0$ as well, then
$$a^{[l+2]} = g(a^{[l]})$$

and, because we assumed the ReLU activation function, so that $a^{[l]} \geq 0$,

$$a^{[l+2]} = g(a^{[l]}) = a^{[l]}.$$
So, what this shows is that the identity function is easy for a residual block to learn: it is easy to get $a^{[l+2]} = a^{[l]}$ because of the skip connection. That means adding these two layers doesn't really hurt the network's ability to do as well as the simpler network without them, because it is quite easy to learn the identity function despite the extra layers. This is why adding a residual block somewhere in the middle or at the end of a big neural network doesn't hurt performance.
But of course our goal is not just to avoid hurting performance but to help it, and you can imagine that if those hidden units actually learn something useful, they may do even better than the identity function. What goes wrong in very deep plain networks, without the residual skip connections, is that as the network gets deeper and deeper it becomes very difficult to choose parameters that learn even the identity function, which is why many extra layers end up making the result worse rather than better. The main reason residual networks work is that it is so easy for the extra layers to learn the identity function that you are almost guaranteed not to hurt performance, and often you get lucky and they help. Starting from a decent baseline of not hurting performance, training can only improve the solution from there.
One more detail in the residual network worth discussing: in this addition we are assuming that $z^{[l+2]}$ and $a^{[l]}$ have the same dimensions, which is why ResNets use a lot of "same" convolutions. In case the input and output have different dimensions, you add an extra matrix, call it $W_s$, so the shortcut becomes $W_s a^{[l]}$. $W_s$ could be a matrix of parameters learned during training, or it could be a fixed matrix that just implements zero padding, taking $a^{[l]}$ and zero-padding it to the same dimensions as $z^{[l+2]}$.
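A minimal sketch of the fixed zero-padding choice for $W_s$ (the helper name is hypothetical, not from the paper):

```python
import numpy as np

def shortcut(a_l, out_dim):
    """A fixed W_s that zero-pads a^{[l]} up to the dimension of
    z^{[l+2]} (illustrative helper, not from the paper's code)."""
    in_dim = a_l.shape[0]
    Ws = np.zeros((out_dim, in_dim))
    Ws[:in_dim, :in_dim] = np.eye(in_dim)  # copy the input, pad the rest with zeros
    return Ws @ a_l

a_l = np.array([1.0, 2.0, 3.0])
print(shortcut(a_l, 5))  # [1. 2. 3. 0. 0.]
```

Making `Ws` a trainable parameter matrix instead corresponds to the learned-projection variant.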
Networks in network and 1x1 convolutions
One way to think about the one by one convolution is that it is basically a fully connected neural network applied at each of the different positions. By doing this at each of the 36 positions, each of the six by six positions, you end up with an output that is six by six by the number of filters. This can carry out a pretty non-trivial computation on your input volume. This idea is often called a one by one convolution, but it is sometimes also called Network in Network.
With one by one convolutions we can increase or decrease the number of channels without changing the width and height of the volume. This is very useful in Inception networks for reducing computation and changing the channel number.
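The "fully connected network at each position" view can be checked directly in NumPy, assuming for illustration a 6×6×192 input volume reduced to 32 channels:

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, C_in, C_out = 6, 6, 192, 32

x = rng.normal(size=(H, W, C_in))         # 6 x 6 x 192 input volume
filters = rng.normal(size=(C_in, C_out))  # 32 filters, each of shape 1 x 1 x 192

# A 1x1 convolution is a fully connected layer applied at each of the
# 36 spatial positions: it mixes channels but leaves height/width alone.
y = np.einsum('hwc,cf->hwf', x, filters)
print(y.shape)  # (6, 6, 32)
```

At every position `(h, w)`, the output is just `x[h, w] @ filters`, which is exactly a dense layer on the channel vector.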
Inception Network
Architecture
When designing a layer for a ConvNet, you might have to pick: do you want a 1 by 1 filter, or 3 by 3, or 5 by 5, or do you want a pooling layer? What the inception network says is: why not do them all?
Szegedy et al, 2014. Going deeper with convolutions
The basic idea is that instead of needing to pick one of these filter sizes or pooling you want and committing to that, we can do them all and just concatenate all the outputs, and let the network learn whatever parameters it wants to use, whatever the combinations of these filter sizes it wants.
Now it turns out that there is a problem with the inception layer as we have described it here, which is computational cost. Looking at the two images below, we can see the difference, and how to use a one by one convolution to reduce computation.
It turns out that so long as you implement this bottleneck layer within reason, you can shrink down the representation size significantly, and it doesn't seem to hurt the performance, but it saves you a lot of computation.
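The arithmetic behind that claim can be spelled out for the widely used example of mapping a 28×28×192 volume to 28×28×32 (counting multiplications; the specific numbers follow that example, not a measurement):

```python
# Direct 5x5 convolution vs. a 1x1 bottleneck down to 16 channels
# followed by the 5x5 convolution, on a 28x28x192 -> 28x28x32 mapping.
H, W, C_in, C_out, k, C_mid = 28, 28, 192, 32, 5, 16

# every output value needs a k*k*C_in dot product
direct = H * W * C_out * k * k * C_in
# 1x1 stage: H*W*C_mid outputs, each a C_in dot product;
# 5x5 stage: H*W*C_out outputs, each a k*k*C_mid dot product
bottleneck = H * W * C_mid * C_in + H * W * C_out * k * k * C_mid

print(f"direct:     {direct:,}")      # ~120 million multiplications
print(f"bottleneck: {bottleneck:,}")  # ~12.4 million multiplications
```

The bottleneck cuts the multiplication count by roughly a factor of ten.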
Inception Module
Then, putting them all together, we get the inception module:
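A shape-level sketch of one module: the branch outputs below are random stand-ins, and the filter counts follow the GoogLeNet inception (3a) example, 64 + 128 + 32 + 32 = 256 output channels:

```python
import numpy as np

rng = np.random.default_rng(3)
h, w = 28, 28

# Stand-in branch outputs for one inception module on a 28x28x192 input.
# With 'same' padding and stride 1, every branch keeps the 28x28 spatial
# size, so the outputs can be concatenated along the channel axis.
conv1x1 = rng.normal(size=(h, w, 64))
conv3x3 = rng.normal(size=(h, w, 128))  # preceded by a 1x1 bottleneck
conv5x5 = rng.normal(size=(h, w, 32))   # preceded by a 1x1 bottleneck
pool    = rng.normal(size=(h, w, 32))   # max pool followed by a 1x1 conv

module_out = np.concatenate([conv1x1, conv3x3, conv5x5, pool], axis=-1)
print(module_out.shape)  # (28, 28, 256)
```

The network then learns, through the filter weights, how much to rely on each branch.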
Transfer Learning
If you're building a computer vision application, rather than training the weights from scratch, from random initialization, you often make much faster progress if you download weights that someone else has already trained on a network architecture, and use that as pre-training to transfer to the new task you are interested in.
One pattern is that the more data you have, the smaller the number of layers you freeze and the greater the number of layers you train on top. The idea is that if you have enough data, you can train not just a single softmax unit but a somewhat larger network comprising the last few layers of the final network you end up using. Finally, if you have a lot of data, you might take the open-source network and its weights, use the whole thing just as initialization, and train the entire network.
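This freeze-more-or-less pattern can be sketched abstractly (the layer names are made up; in a real framework you would mark the corresponding weight tensors as non-trainable):

```python
# Toy stand-in for a pretrained network as an ordered list of layer names.
layers = ["conv1", "conv2", "conv3", "fc1", "softmax"]

def freeze_split(layers, n_frozen):
    """The transfer-learning pattern: freeze the first n_frozen layers and
    retrain the rest. With more data, choose a smaller n_frozen."""
    return layers[:n_frozen], layers[n_frozen:]

# Little data: freeze almost everything, retrain only the softmax unit.
frozen, trainable = freeze_split(layers, n_frozen=4)
print(trainable)  # ['softmax']

# More data: freeze fewer layers and train a bigger head on top.
frozen, trainable = freeze_split(layers, n_frozen=2)
print(trainable)  # ['conv3', 'fc1', 'softmax']
```

With a very large dataset, `n_frozen=0` recovers the "use the weights only as initialization" case.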
Data Augmentation
Most computer vision tasks could use more data, so data augmentation is one of the techniques often used to improve the performance of computer vision systems. There are several ways to increase your data set:
- Mirroring
- Random Cropping
- Rotation
- Shearing
- Local warping
- Color shifting
- …
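A few of these augmentations are one-liners in NumPy; the image size below is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
img = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)

# Mirroring: flip the image horizontally.
mirrored = img[:, ::-1, :]

# Random cropping: take a random 24x24 patch of the 32x32 image.
top = rng.integers(0, 32 - 24 + 1)
left = rng.integers(0, 32 - 24 + 1)
crop = img[top:top + 24, left:left + 24, :]

# Color shifting: add a random offset to each RGB channel.
shift = rng.uniform(-20, 20, size=(1, 1, 3))
shifted = np.clip(img + shift, 0, 255)

print(mirrored.shape, crop.shape, shifted.shape)
```

Each transform yields a new training example with the same label, which is what makes augmentation effectively free extra data.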