Class 4 - Week 2: Case Studies

Classic Networks

LeNet-5

[Figure: LeNet-5 architecture]

LeCun et al., 1998. Gradient-based learning applied to document recognition.

AlexNet

[Figure: AlexNet architecture]

Krizhevsky et al., 2012. ImageNet classification with deep convolutional neural networks.

VGG-16

[Figure: VGG-16 architecture]

Simonyan & Zisserman, 2015. Very deep convolutional networks for large-scale image recognition.
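As a concrete illustration of the VGG-16 layer pattern (3x3 "same" convolutions and 2x2 max pooling, with channel counts growing from 64 to 512, followed by two 4096-unit fully connected layers and a softmax), here is a minimal Keras-style sketch. It sketches the pattern from the paper, not the released reference model.

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg16(num_classes=1000):
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(224, 224, 3)))
    # five blocks of 3x3 "same" convolutions, each followed by 2x2 max pooling
    for filters, reps in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
        for _ in range(reps):
            model.add(layers.Conv2D(filters, 3, padding='same', activation='relu'))
        model.add(layers.MaxPooling2D(2, strides=2))
    model.add(layers.Flatten())
    model.add(layers.Dense(4096, activation='relu'))
    model.add(layers.Dense(4096, activation='relu'))
    model.add(layers.Dense(num_classes, activation='softmax'))
    return model

vgg16().summary()   # roughly 138 million parameters
```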


Residual Networks

The Problem of Deep Neural Networks

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow. More specifically, during gradient descent, as you backprop from the final layer back to the first layer, you are multiplying by the weight matrix on each step, and thus the gradient can decrease exponentially quickly to zero (or, in rare cases, grow exponentially quickly and “explode” to take very large values).

During training, you might therefore see the magnitude (or norm) of the gradient for the earlier layers decrease to zero very rapidly as training proceeds:
[Figure: the norm of the gradient for the earlier layers shrinking rapidly as training proceeds]
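As a rough illustration of why this happens, here is a small numpy sketch (the depth, width, and weight scale are illustrative assumptions, not values from the course): repeatedly multiplying the gradient by a weight matrix at each backprop step can shrink its norm exponentially with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64
grad = rng.standard_normal(width)                    # gradient arriving at the final layer

for step in range(1, depth + 1):
    W = 0.1 * rng.standard_normal((width, width))    # illustrative small-scale weights
    grad = W.T @ grad                                 # one backprop step through a linear layer
    if step % 10 == 0:
        print(f"after {step} layers: ||grad|| = {np.linalg.norm(grad):.3e}")
```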

Architecture

[Figure: a residual block with a skip connection]

It turns out that if you use a standard optimization algorithm such as gradient descent, or one of the fancier optimization algorithms, to train a plain network, without the extra residual shortcuts or skip connections, then empirically you will find that as you increase the number of layers, the training error tends to decrease for a while but then tends to go back up.

[Figure: training error vs. number of layers for a plain network and a ResNet]

He et al., 2015. Deep residual learning for image recognition.

Why Do ResNets Work?

In theory, as we make a neural network deeper, it should only do better and better on the training set, so having a deeper network should only help. But in practice, having a plain network that is very deep means the optimization algorithm has a much harder time training it, and so in reality the training error gets worse if you pick a network that's too deep. With ResNets, however, even as the number of layers gets deeper, the training error can keep going down, even if we train a network with over a hundred layers.

[Figure: a big neural network with output a^{[l]}, followed by two extra layers and a skip connection]

As in the figure above, suppose a big network produces the output $a^{[l]}$, and let's add two layers at the end of the pipeline, so that the output becomes $a^{[l+2]}$. Let's make these two layers a residual block, with the extra shortcut. For the sake of argument, say that throughout this network we're using the ReLU activation function, so all the activations are greater than or equal to zero, with the possible exception of the input $X$. Then we get:
$$a^{[l+2]} = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$
Now notice something: if you are using L2 regularisation or weight decay, that will tend to shrink the value of $W^{[l+2]}$. If $W^{[l+2]} = 0$, and say for the sake of argument that $b^{[l+2]} = 0$ as well, then $a^{[l+2]} = g(a^{[l]})$. Because we're using the ReLU activation function and $a^{[l]}$ is already non-negative, this gives $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$.
So what this shows is that the identity function is easy for a residual block to learn: it's easy to get $a^{[l+2]}$ equal to $a^{[l]}$ because of the skip connection. What that means is that adding these two layers to your neural network doesn't hurt its ability to do as well as the simpler network without them, because it's quite easy for the block to learn the identity function despite the addition of the two layers. This is why adding a residual block somewhere in the middle or at the end of a big neural network doesn't hurt performance.
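As a tiny numeric check of this argument (the vector sizes are arbitrary, chosen just for illustration): setting $W^{[l+2]}$ and $b^{[l+2]}$ to zero in the formula above really does return $a^{[l]}$ unchanged, because ReLU is the identity on non-negative inputs.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

rng = np.random.default_rng(1)
a_l = relu(rng.standard_normal(4))              # a[l] >= 0, as produced by a ReLU layer
a_l1 = relu(rng.standard_normal((3, 4)) @ a_l)  # a[l+1]; the 3x4 shape is arbitrary

W_l2 = np.zeros((4, 3))                         # W[l+2] shrunk all the way to zero
b_l2 = np.zeros(4)                              # b[l+2] = 0 for the sake of argument

a_l2 = relu(W_l2 @ a_l1 + b_l2 + a_l)           # a[l+2] = g(W[l+2] a[l+1] + b[l+2] + a[l])
print(np.allclose(a_l2, a_l))                   # True: the block computes the identity
```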

But of course our goal is not just to not hurt performance, it's to help performance. So you can imagine that if these hidden units actually learn something useful, then maybe you can do even better than learning the identity function. What goes wrong in very deep plain networks, without the residual skip connections, is that when you make the network deeper and deeper, it's actually very difficult for it to choose parameters that learn even the identity function, which is why a lot of layers end up making the result worse rather than better. The main reason the residual network works is that it's so easy for these extra layers to learn the identity function that you're pretty much guaranteed they don't hurt performance, and a lot of the time you get lucky and they even help. At the least, it's easier to start from a decent baseline of not hurting performance, and gradient descent can only improve the solution from there.

One more detail in the residual network worth discussing: in this addition we're assuming that $z^{[l+2]}$ and $a^{[l]}$ have the same dimensions, which is why you see a lot of use of "same" convolutions in ResNets. In case the input and output have different dimensions, what you do is add an extra matrix, call it $W_s$, in front of $a^{[l]}$. $W_s$ could be a matrix of parameters that is learned, or it could be a fixed matrix that just implements zero padding, taking $a^{[l]}$ and zero-padding it to the same dimensions as $z^{[l+2]}$.
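Putting the pieces together, here is a minimal Keras-style sketch of a residual block (the filter counts and input shape are illustrative assumptions, and it omits the batch normalization used in the paper). When the input and output shapes differ, a 1x1 convolution on the shortcut plays the role of the $W_s$ matrix described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, downsample=False):
    shortcut = x
    stride = 2 if downsample else 1

    # main path: two 3x3 "same" convolutions
    out = layers.Conv2D(filters, 3, strides=stride, padding='same', activation='relu')(x)
    out = layers.Conv2D(filters, 3, strides=1, padding='same')(out)

    # if the shapes differ, project the shortcut with a 1x1 convolution (the W_s term)
    if downsample or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding='same')(x)

    # add the skip connection before the final ReLU, matching a[l+2] = g(z[l+2] + a[l])
    out = layers.Add()([out, shortcut])
    return layers.ReLU()(out)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(residual_block(inputs, 64), 128, downsample=True)
tf.keras.Model(inputs, outputs).summary()
```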


Network in Network and 1x1 Convolutions

[Figure: a 1x1 convolution]

One way to think about the 1x1 convolution is that it is basically a fully connected neural network applied separately at each of the different positions. By doing this at each of the 36 positions, that is, each of the six by six positions, you end up with an output that is six by six by the number of filters. This can carry out a pretty non-trivial computation on your input volume. This idea is often called a 1x1 convolution, but it's sometimes also called Network in Network.
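A small numpy sketch of this view (the 6x6x32 input and 16 filters are illustrative choices): a 1x1 convolution with $n_f$ filters is just an $n_c \times n_f$ matrix applied to the channel vector at every spatial position.

```python
import numpy as np

H, W, C_in, C_out = 6, 6, 32, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, C_in))        # input volume, 6 x 6 x 32
kernel = rng.standard_normal((C_in, C_out))  # a 1x1 convolution is just a C_in x C_out matrix

# apply the same channel-mixing matrix at every (h, w) position
y = np.einsum('hwc,cf->hwf', x, kernel)
print(y.shape)                               # (6, 6, 16): width and height are unchanged

# the same thing, spelled out as a fully connected layer per position
y_fc = np.array([[x[h, w] @ kernel for w in range(W)] for h in range(H)])
print(np.allclose(y, y_fc))                  # True
```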

With 1x1 convolutions we can increase or decrease the number of channels without changing the width and height of the volume. This is very useful in the Inception network for reducing computation and changing the number of channels.


Inception Network

Architecture

When designing a layer for a ConvNet, you might have to pick: do you want a 1x1 filter, or 3x3, or 5x5, or do you want a pooling layer? What the Inception network does is say: why not do them all?
[Figure: an inception layer applying 1x1, 3x3, and 5x5 convolutions and max pooling in parallel and stacking the outputs]

Szegedy et al., 2014. Going deeper with convolutions.

The basic idea is that instead of needing to pick one of these filter sizes or a pooling layer and committing to it, we can do them all, concatenate all the outputs, and let the network learn whatever parameters and whatever combinations of these filter sizes it wants to use.

Now it turns out that there is a problem with the inception layer as we've described it here, which is computational cost. Comparing the two images below, we can see the difference, and how a 1x1 convolution is used to reduce the computation.
[Figure: computational cost of a 5x5 convolution applied directly]
It turns out that so long as you implement this bottleneck layer within reason, you can shrink down the representation size significantly, and it doesn't seem to hurt performance, while saving you a lot of computation.
[Figure: computational cost of the same 5x5 convolution with a 1x1 bottleneck layer]
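As a back-of-the-envelope count (the 28x28x192 input volume, the 32 output channels, and the 16-channel bottleneck are assumed example dimensions, chosen to mirror the kind of numbers used to illustrate this point), the 1x1 bottleneck cuts the number of multiplications by roughly a factor of ten:

```python
H, W, C_in = 28, 28, 192
C_out, bottleneck = 32, 16

# direct 5x5 "same" convolution: every output value needs 5*5*C_in multiplications
direct = H * W * C_out * 5 * 5 * C_in

# 1x1 bottleneck down to 16 channels, then the 5x5 convolution on the smaller volume
reduce_cost = H * W * bottleneck * 1 * 1 * C_in
conv_cost = H * W * C_out * 5 * 5 * bottleneck
with_bottleneck = reduce_cost + conv_cost

print(f"direct:          {direct:,} multiplications")   # ~120 million
print(f"with bottleneck: {with_bottleneck:,}")           # ~12.4 million, roughly 10x fewer
```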

Inception Module

Putting it all together, we get the inception module:
[Figure: an inception module]
[Figure: the inception network built by stacking inception modules]
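Here is a minimal Keras-style sketch of one such module (the filter counts are illustrative choices, not necessarily the published GoogLeNet configuration): four parallel branches, each with a 1x1 bottleneck where it saves computation, concatenated along the channel axis.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1x1=64, f3x3_reduce=96, f3x3=128, f5x5_reduce=16, f5x5=32, fpool=32):
    branch1 = layers.Conv2D(f1x1, 1, padding='same', activation='relu')(x)

    branch2 = layers.Conv2D(f3x3_reduce, 1, padding='same', activation='relu')(x)
    branch2 = layers.Conv2D(f3x3, 3, padding='same', activation='relu')(branch2)

    branch3 = layers.Conv2D(f5x5_reduce, 1, padding='same', activation='relu')(x)
    branch3 = layers.Conv2D(f5x5, 5, padding='same', activation='relu')(branch3)

    branch4 = layers.MaxPooling2D(3, strides=1, padding='same')(x)
    branch4 = layers.Conv2D(fpool, 1, padding='same', activation='relu')(branch4)

    # "do them all" and let the network learn which branches to rely on
    return layers.Concatenate(axis=-1)([branch1, branch2, branch3, branch4])

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)
tf.keras.Model(inputs, outputs).summary()   # output is 28 x 28 x (64 + 128 + 32 + 32)
```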


Transfer Learning

If you’re building a computer vision application rather than training the ways from scratch, from random initialization, you often make much faster progress if you download weights that someone else has already trained on the network architecture and use that as pre-training and transfer that to a new task that you might be interested in.

[Figure: transfer learning, with earlier layers frozen and later layers retrained]
One pattern is that if you have more data, the number of layers you freeze could be smaller and the number of layers you train on top could be greater. The idea is that if your data set is big enough, you can train not just a single softmax unit but a small neural network comprising the last few layers of the final network you end up using. Finally, if you have a lot of data, one thing you might do is take the open-source network and its weights, use the whole thing just as initialization, and train the whole network.
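A minimal Keras sketch of this pattern (assuming an ImageNet-pretrained VGG16 as the downloaded network and a hypothetical 5-class target task):

```python
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.VGG16(weights='imagenet', include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                      # small dataset: freeze all pretrained layers

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation='softmax'),  # new softmax head for the new task
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# with more data, unfreeze some (or eventually all) of the later layers instead:
# for layer in base.layers[-4:]:
#     layer.trainable = True
```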


Data Augmentation

Most computer vision tasks could use more data, so data augmentation is one of the techniques often used to improve the performance of computer vision systems. There are several ways to increase your data set:

  • Mirroring
  • Random Cropping
  • Rotation
  • Shearing
  • Local warping
  • Color shifting

[Figure: data augmentation examples]
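A small tf.image sketch of mirroring, random cropping, and color shifting applied to a single image tensor (the crop size and distortion ranges are illustrative assumptions):

```python
import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)            # mirroring
    image = tf.image.random_crop(image, size=(224, 224, 3))   # random cropping
    image = tf.image.random_brightness(image, max_delta=0.2)  # color shifting
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    image = tf.image.random_hue(image, max_delta=0.05)
    return tf.clip_by_value(image, 0.0, 1.0)

example = tf.random.uniform((256, 256, 3))   # stand-in for a decoded image in [0, 1]
print(augment(example).shape)                # (224, 224, 3)
```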
