Course4-week2-case studies

case studies

1 - why look at case studies?

How do we put together the basic building blocks, such as CONV layers, POOL layers, and FC layers, to form an effective convolutional neural network?

outline:

  • classic networks:
    • LeNet-5
    • AlexNet
    • VGG
  • ResNet (Residual Network, up to 152 layers)
  • Inception

2 - classic networks


LeNet-5 (1998)

The goal of LeNet-5 was to recognize handwritten digits.


[figure]

  • no padding: always uses valid convolutions
  • average pooling
  • $\hat{y}$ takes on 10 possible values; the original network did not use a softmax layer
  • 60,000 parameters
  • used sigmoid and tanh, not ReLU
  • from left to right, the width/height go down, whereas the number of channels tends to increase
  • the common pattern: one or more CONV layers followed by a POOL layer, then one or more CONV layers followed by a POOL layer, then some FC layers, and then the output (see the sketch below)
  • different filters look at different channels of the input volume
  • non-linearity after pooling
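As a concrete illustration of this CONV → POOL → CONV → POOL → FC → FC → output pattern, here is a minimal Keras-style sketch of a LeNet-5-like network. It is not the original 1998 model: the tanh activations, the exact filter counts, and the softmax output here are simplifications chosen for illustration.

```python
from tensorflow.keras import layers, models

# A LeNet-5-style sketch: CONV -> POOL -> CONV -> POOL -> FC -> FC -> output.
# Valid (no-padding) convolutions and average pooling follow the classic design;
# the tanh activations and softmax output are modern simplifications.
model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),              # 32x32 grayscale digit image
    layers.Conv2D(6, (5, 5), activation='tanh'),  # -> 28x28x6 (valid convolution)
    layers.AveragePooling2D((2, 2), strides=2),   # -> 14x14x6
    layers.Conv2D(16, (5, 5), activation='tanh'), # -> 10x10x16
    layers.AveragePooling2D((2, 2), strides=2),   # -> 5x5x16
    layers.Flatten(),
    layers.Dense(120, activation='tanh'),         # fully connected layers
    layers.Dense(84, activation='tanh'),
    layers.Dense(10, activation='softmax'),       # 10 digit classes
])
model.summary()
```

Notice how the height/width go down (32 → 28 → 14 → 10 → 5) while the number of channels goes up (1 → 6 → 16), matching the pattern described above.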

AlexNet (Alex Krizhevsky)

goal: classify which one of 1000 classes the object belongs to.


[figure]

  • much bigger than LeNet-5
  • 60 million parameters
  • ReLU activation function, which makes it work much better than LeNet-5

After AlexNet, much of the computer vision community took a serious look at deep learning and became convinced that deep learning really works in computer vision. However, AlexNet has a relatively complicated architecture, with a lot of hyperparameters.


VGG16

CONV = 3 × 3, s = 1, padding = “SAME”; MAX POOL = 2 × 2, s = 2

Instead of having so many hyperparameters, VGG really simplifies the neural network architecture.


[figure]

  • the 16 refers to the fact that the network has 16 layers with weights
  • 138 million parameters
  • the architecture is really quite uniform
  • doubling the number of filters in every stack of CONV layers is another simple principle used to design the architecture (see the sketch below)
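A minimal sketch of the uniform VGG-16 pattern, assuming the Keras functional API (the helper name vgg_block is illustrative): stacks of 3 × 3 same convolutions with stride 1, each stack followed by 2 × 2 max pooling with stride 2, with the filter count doubling from one stack to the next until it caps at 512.

```python
from tensorflow.keras import layers, models

def vgg_block(x, num_convs, num_filters):
    """A VGG stack: num_convs 3x3 same convolutions (stride 1), then 2x2 max pooling (stride 2)."""
    for _ in range(num_convs):
        x = layers.Conv2D(num_filters, (3, 3), padding='same', activation='relu')(x)
    return layers.MaxPooling2D((2, 2), strides=2)(x)

inputs = layers.Input(shape=(224, 224, 3))
x = vgg_block(inputs, 2, 64)    # -> 112x112x64
x = vgg_block(x, 2, 128)        # -> 56x56x128   (filters double each stack)
x = vgg_block(x, 3, 256)        # -> 28x28x256
x = vgg_block(x, 3, 512)        # -> 14x14x512
x = vgg_block(x, 3, 512)        # -> 7x7x512     (capped at 512)
x = layers.Flatten()(x)
x = layers.Dense(4096, activation='relu')(x)
x = layers.Dense(4096, activation='relu')(x)
outputs = layers.Dense(1000, activation='softmax')(x)
model = models.Model(inputs, outputs)   # 13 CONV + 3 FC = 16 layers with weights
```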

3 - residual networks

Very, very deep neural networks are difficult to train because of vanishing and exploding gradient problems.

Residual networks are built out of something called residual blocks.


[figure: residual block]


[figure]

The information flowing from $a^{[l]}$ to $a^{[l+2]}$ needs to go through the following steps:

$$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}, \qquad a^{[l+1]} = g(z^{[l+1]})$$
$$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}, \qquad a^{[l+2]} = g(z^{[l+2]})$$

We call the computation governed by the four formulas above the main path. In a residual network we make a change: rather than needing to follow the main path, the information from $a^{[l]}$ can follow a shortcut, skip over these layers, and go much deeper into the network.

What this means is that the last equation above goes away and is replaced with:

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$$

This makes a residual block. Note that $a^{[l]}$ is injected after the linear part but before the activation. This is also called a skip connection, which refers to $a^{[l]}$ skipping over layers in order to pass information deeper into the network.

Using residual blocks allows you to train much deeper neural networks, and the way you build a ResNet is by taking many of these residual blocks and stacking them together to form a deep network (a minimal sketch of one block is below).
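Here is a minimal sketch of a single residual block, assuming the Keras functional API (the function name residual_block is just for illustration). Note that $a^{[l]}$ is added after the second linear step but before the final ReLU, matching the formula above.

```python
from tensorflow.keras import layers

def residual_block(a_l, num_filters):
    """Identity residual block: returns a[l+2] = g(z[l+2] + a[l]).
    num_filters must match the channel count of a_l so the addition is valid."""
    # main path: two CONV layers; same padding keeps height/width equal to a[l]'s
    z = layers.Conv2D(num_filters, (3, 3), padding='same')(a_l)   # z[l+1]
    a = layers.Activation('relu')(z)                              # a[l+1] = g(z[l+1])
    z = layers.Conv2D(num_filters, (3, 3), padding='same')(a)     # z[l+2]
    # shortcut: inject a[l] after the linear part but before the activation
    return layers.Activation('relu')(layers.Add()([z, a_l]))      # a[l+2] = g(z[l+2] + a[l])
```

Calling this function several times back to back gives the stacked residual blocks described next.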

Before adding the extra shortcuts, we call the network a plain network. To turn it into a ResNet, we add all of those skip connections. The picture below shows 5 residual blocks stacked together.


[figure]

  • It turns out that if we use any optimization algorithm to train a plain network, as we increase the number of layers the training error tends to decrease for a while and then tends to go back up. In theory, having a deeper network should only help it do better and better on the training set, but in practice, a deeper plain network is much harder for the optimization algorithm to train.
  • Things change with a residual network: even as the number of layers gets deeper, the training error can keep going down. The skip connections really help with the vanishing and exploding gradient problems and allow us to train much deeper neural networks.


[figure]

4 - why ResNets work?

Let’s go through one example that illustrates why ResNets work well, and how we can make a neural network deeper and deeper without hurting its ability to do well, at least on the training set. Doing well on the training set is usually a prerequisite for doing well on the dev set or test set.

If you make a plain neural network deeper, it can hurt its ability to do well on the training set; that's why sometimes we don't want the network to be too deep. But this is much less true when we are training a ResNet.


[figure]

Starting from the big NN shown at the top of the picture, we add two extra layers at the end and make them a residual block with an extra shortcut. Let's say we are using the ReLU activation function throughout the network, so all the activations are greater than or equal to 0.

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) \tag{1}$$

If we are using L2 regularization, that will shrink the value of $W^{[l+2]}$, and maybe $b^{[l+2]}$ as well. So if $W^{[l+2]} = b^{[l+2]} = 0$, equation (1) becomes:

$$a^{[l+2]} = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) = g(a^{[l]}) = a^{[l]}$$

What this shows is that the identity function is easy for a residual block to learn: it's easy to get $a^{[l+2]} = a^{[l]}$ because of the skip connection. That means adding these two layers to the network doesn't hurt its ability to do as well as the network without them, because it's quite easy for the block to learn the identity function. This is why adding the two extra layers doesn't hurt performance. And if the neurons in the extra two layers actually learn something useful, the network can do even better than just learning the identity function. For a plain network, in contrast, as the network gets deeper and deeper it becomes very difficult to choose parameters that learn even the identity function, which is why a lot of extra layers can end up making the results worse rather than better.

We assume $z^{[l+2]}$ and $a^{[l]}$ have the same dimension, which is why same convolutions are used a lot in ResNets. In case $z^{[l+2]}$ and $a^{[l]}$ have different dimensions, what we do is add a matrix $W_s$:

$$a^{[l+2]} = g(z^{[l+2]} + W_s a^{[l]})$$

where the dimension of $W_s$ is $(n^{[l+2]}, n^{[l]})$, and $W_s$ can either be a matrix of learned parameters or a fixed matrix that just implements zero padding. A sketch of the learned-projection version is below.
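A minimal sketch of the mismatched-dimension case, again assuming the Keras functional API, with $W_s$ implemented as a learned 1 × 1 convolution on the shortcut path (using fixed zero padding instead would also be consistent with the note above):

```python
from tensorflow.keras import layers

def projection_residual_block(a_l, num_filters, strides=2):
    """Residual block for mismatched shapes: a[l+2] = g(z[l+2] + Ws a[l])."""
    z = layers.Conv2D(num_filters, (3, 3), strides=strides, padding='same')(a_l)
    a = layers.Activation('relu')(z)
    z = layers.Conv2D(num_filters, (3, 3), padding='same')(a)            # z[l+2]
    # Ws as a learned 1x1 convolution that reshapes the shortcut to match z[l+2]
    shortcut = layers.Conv2D(num_filters, (1, 1), strides=strides)(a_l)  # Ws a[l]
    return layers.Activation('relu')(layers.Add()([z, shortcut]))
```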


[figure]

There are a lot of 3 by 3 same convolutions, so the dimensions are preserved and the sum $z^{[l+2]} + a^{[l]}$ makes sense.

5 - 1 by 1 convolutions


[figure]

What the 1 by 1 convolution does is look at each of the 36 positions, take the elementwise product between the 32 numbers in the input at that position and the 32 numbers in the filter, sum them up, and apply a non-linearity after that.


[figure]

If we want to shrink the height/width, we can use a pooling layer; but if what we want to shrink is the number of channels, say down to a 28 by 28 by 32 volume, what we can do is use 32 filters that are 1 by 1 by 192. We will see later how this idea of 1 by 1 convolutions lets you shrink the number of channels and save on computation. Alternatively, if you just want to keep the number of channels at 192, the effect of the 1 by 1 convolution is to add nonlinearity, which allows the network to learn a more complex function.
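A minimal Keras-style sketch of these two uses of a 1 by 1 convolution on the 28 × 28 × 192 volume from the example (the variable names are illustrative):

```python
from tensorflow.keras import layers

inputs = layers.Input(shape=(28, 28, 192))
# 32 filters of shape 1x1x192: each output position is a ReLU applied to a weighted
# combination of the 192 input channels at that position -> shrinks to 28x28x32
shrunk = layers.Conv2D(32, (1, 1), activation='relu')(inputs)
# 192 filters of shape 1x1x192: keeps the channel count at 192 and just adds nonlinearity
same_channels = layers.Conv2D(192, (1, 1), activation='relu')(inputs)
```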

A 1 by 1 convolution is actually doing something pretty non-trivial:

  • adds nonlinearity to the network
  • can decrease, keep the same, or increase the number of channels

6 - inception network motivation

When designing a layer for a convolutional network, you might have to pick: do you want a 1 by 1 filter, a 3 by 3, a 5 by 5, or a pooling layer? What the inception network does is use all of them. This makes the network architecture more complicated, but it also works remarkably well.

What an inception layer says is: instead of choosing which filter size you want in a CONV layer, or whether you want a CONV layer or a POOL layer, let's do them all.

Use 1 by 1 filters to output a first volume of shape 28 by 28 by 64; use 3 by 3 filters to output a second volume of shape 28 by 28 by 128; use 5 by 5 filters to output a third volume of shape 28 by 28 by 32; use max-pooling to output a fourth volume of shape 28 by 28 by 32; then stack all the volumes together to form one big output volume of shape 28 by 28 by 256. Note that in order to keep the height and width the same, we always use same convolutions and use padding (with stride 1) in the max-pooling.


[figure]

Now we have one inception module whose input is 28 by 28 by 192 and whose output is 28 by 28 by 256. This is the heart of the inception network. The basic idea is that instead of needing to pick one filter size or pooling, you can just do them all, concatenate all the outputs, and let the algorithm learn whatever parameters it wants to use (a minimal sketch of such a module is below). The main problem with the inception layer is computational cost.
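A minimal Keras-style sketch of such an inception module. The branch filter counts follow the 64/128/32/32 example above; the 1 × 1 convolution after the pooling branch, which brings it down to 32 channels, is the trick described in section 7.

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(28, 28, 192))
# four parallel branches; same padding keeps height and width at 28x28
branch_1x1  = layers.Conv2D(64,  (1, 1), padding='same', activation='relu')(inputs)
branch_3x3  = layers.Conv2D(128, (3, 3), padding='same', activation='relu')(inputs)
branch_5x5  = layers.Conv2D(32,  (5, 5), padding='same', activation='relu')(inputs)
branch_pool = layers.MaxPooling2D((3, 3), strides=1, padding='same')(inputs)
branch_pool = layers.Conv2D(32, (1, 1), activation='relu')(branch_pool)  # shrink pool branch to 32 channels
# channel concatenation: 64 + 128 + 32 + 32 = 256 channels
outputs = layers.Concatenate(axis=-1)([branch_1x1, branch_3x3, branch_5x5, branch_pool])
module = models.Model(inputs, outputs)   # 28x28x192 in, 28x28x256 out
```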

computational cost:


[figure]

the total number of multiplications:

$$28 \times 28 \times 32 \times 5 \times 5 \times 192 \approx 120 \text{ million}$$

Next we will see how using the idea of 1 by 1 convolution can reduce the computational costs by about a factor of ten.


[figure]

Notice that the input and output dimensions are still the same; what we have done is take the huge volume on the left and shrink it to a much smaller intermediate volume. This is sometimes called a bottleneck layer.

the total number of multiplications:

$$28 \times 28 \times 16 \times 1 \times 1 \times 192 + 28 \times 28 \times 32 \times 5 \times 5 \times 16 \approx 2.4 \text{ million} + 10.0 \text{ million} = 12.4 \text{ million}$$

So by using a 1 by 1 convolution you can create a bottleneck layer, thereby reducing the computational cost significantly.
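A quick check of both multiplication counts, just multiplying out the numbers from the two figures (output positions × number of filters × multiplications per filter application):

```python
# multiplications = output positions x number of filters x multiplications per filter application
direct_5x5 = 28 * 28 * 32 * (5 * 5 * 192)        # ~120 million, the direct 5x5 convolution
bottleneck = 28 * 28 * 16 * (1 * 1 * 192) \
           + 28 * 28 * 32 * (5 * 5 * 16)         # ~2.4 million + ~10.0 million
print(direct_5x5, bottleneck)                    # 120422400 vs 12443648 (~10x cheaper)
```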

7 - inception network

Putting inception modules together builds the inception network.


[figure]

The output of the max-pooling with padding is 28 by 28 by 192, which is a lot of channels, so what we do is add a 1 by 1 convolution layer to shrink the number of channels down to 28 by 28 by 32, so that the pooling branch doesn't take up most of the channels in the final output. Finally, we take all the output volumes, do channel concatenation, and get a 28 by 28 by 256 output.


[figure]

What the inception network does is put a lot of these inception modules together.

  • there are some extra max-pooling layers to change the width and height
  • there are some additional side branches: they take a hidden layer and try to use it to make a prediction. This helps ensure that the features computed even in the intermediate layers are not too bad for predicting the output class of an image, and it appears to have a regularizing effect on the inception network, helping prevent overfitting.

GoogLeNet

8 - using open-source implementations

If you are developing a computer vision application, a very common workflow is to pick an architecture that you like, look for an open-source implementation, download it from GitHub, and start building from there. One advantage of doing so is that these networks can take a long time to train, and someone else may already have used multiple GPUs and a very large dataset to pre-train the network, which allows you to do transfer learning with these networks.

9 - transfer learning

We can download open-source weights that took someone else many weeks or months to figure out, use them as a very good initialization for our own neural network, and use transfer learning to transfer knowledge from a very large public dataset to our own problem.

Let's say you are building a cat detector to recognize your own pet cats, named Tigger and Misty. You have a classification problem with three classes: Tigger, Misty, or neither. You don't have a lot of pictures of Tigger and Misty, so the training set will be small. What can we do?

  • a pretty small training dataset:

    What we can do is go online and download some open-source implementation of a neural network. There are a lot of networks you can download that have been trained on a large dataset such as the ImageNet dataset, which has 1000 classes, so the network might have a softmax unit that outputs 1 of 1000 possible classes. What we do is get rid of that softmax layer and create our own softmax unit that outputs Tigger, Misty, or neither. We treat all the earlier layers as frozen, so we freeze all the parameters in the earlier layers and just train the parameters associated with our new softmax layer. By using someone else's pre-trained weights, we can get pretty good performance even with a small dataset (see the sketch after this list).

    depending on the framework, freezing is set with a flag such as trainableParameter = 0 or freeze = 1


    [figure]

    One trick that can speed up training: since the frozen layers never change, we can pre-compute the activations of the layer just before our softmax for every training example, save them to disk, and then just train a shallow softmax model on those saved features.

  • larger training set:

    In this case, one thing we could do is freeze fewer of the earlier layers and train the remaining last few layers, using those layers' weights as initialization and running the optimization algorithm from there. Or we could blow away the last few layers entirely and create our own hidden layers and our own final softmax output.


[figure]

one pattern: the more data you have, the fewer layers you freeze and the more layers you train.

  • a lot of data:

    Take the open-source network architecture, use its whole set of weights as initialization in place of random initialization, and train the whole network on your own dataset.


[figure]
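As a concrete sketch of the small-dataset case from the first bullet, assuming the Keras API with an ImageNet-pretrained VGG16 as the downloaded network (any other pre-trained architecture would work the same way); only the new three-class Tigger/Misty/neither softmax head is trained:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# download a network pre-trained on ImageNet and drop its 1000-way softmax
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                        # freeze all of the earlier layers

# our own small head: Tigger / Misty / neither
x = layers.Flatten()(base.output)
outputs = layers.Dense(3, activation='softmax')(x)
model = models.Model(base.input, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# with a larger training set, leave base.trainable = True and freeze
# only the earlier layers instead, e.g.:
#   for layer in base.layers[:-4]:
#       layer.trainable = False
```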

For a lot of computer vision applications, you do much better if you download someone else's open-source weights and use them as the initialization for your problem. Computer vision is a field where transfer learning is something you should almost always do, unless you have a very large dataset and a very large computational budget.

10 - data augmentation

Data augmentation is one of the techniques often used to improve the performance of computer vision systems.

  • mirroring on the vertical axis

    [figure]
  • random cropping

    [figure]

The second type of data augmentation that is commonly used is color shifting: you add different offsets to the R, G, and B values to distort the color channels. The motivation is that in practice the sunlight might have been a bit yellower, and distorting the colors during training makes the learning algorithm more robust to changes in the colors of the images (a sketch of these distortions is below).
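A minimal NumPy sketch of these distortions: mirroring, random cropping, and simple additive color shifting (the crop size and the ±20 shift range are arbitrary illustration values, not recommended settings):

```python
import numpy as np

def augment(image, crop_size=(224, 224)):
    """image: HxWx3 array with values in [0, 255], at least crop_size in each dimension.
    Returns one randomly mirrored, cropped, and color-shifted copy."""
    h, w, _ = image.shape
    # mirroring on the vertical axis
    if np.random.rand() < 0.5:
        image = image[:, ::-1, :]
    # random cropping
    top = np.random.randint(0, h - crop_size[0] + 1)
    left = np.random.randint(0, w - crop_size[1] + 1)
    image = image[top:top + crop_size[0], left:left + crop_size[1], :]
    # color shifting: add a different random offset to each of the R, G, B channels
    shift = np.random.uniform(-20, 20, size=(1, 1, 3))
    return np.clip(image + shift, 0, 255)
```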


[figure]

PCA color augmentation

implement distortions during training


[figure]

A pretty common way of implementing data augmentation is to have one thread that is responsible for loading the data and applying the distortions, and then passing the result to some other thread that does the training.

11 - the state of computer vision

When you have a lot of data, you tend to find people using simpler algorithms and less hand-engineering: with a lot of data, even a giant neural network with a simple architecture can just learn whatever it wants to learn. Whereas when you don't have much data, people engage in more hand-engineering, and this is often the best way to get good performance.


[figure]

A learning algorithm has two sources of knowledge:

  • labeled data (x, y)
  • hand-engineered features, network architectures, and other components

So when you don't have much labeled data, you have to count more on hand-engineering. This is why the field of computer vision has developed such complex network architectures: in the absence of more data, the way to get good performance is to spend more time on architecture. In fact, because object detection datasets are usually smaller than image recognition datasets, when we talk about object detection you will see the algorithms become even more complex and have more specialized components.

One thing that can help a lot when you have little data is transfer learning. Beyond that, here are a few tips for doing well on benchmarks or winning competitions:

  • Ensembling:
    train several neural networks independently and average their outputs

  • Multi-crop at test time:
    multi-crop is a form of applying data augmentation to your test image; for 10-crop, run the ten crops through your classifier and average the results (see the sketch after this list).


[figure: 10-crop]

  • use architectures of networks published in the literature
  • use open-source implementations
  • use pre-trained models and fine-tune on your dataset.
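A minimal NumPy sketch of the 10-crop idea from the figure above: the four corners and the centre of both the image and its mirror image, whose ten predictions are then averaged (the crop size and function name are illustrative):

```python
import numpy as np

def ten_crops(image, crop_size=(224, 224)):
    """10-crop: the four corners and the centre of the image and of its mirror image."""
    h, w, _ = image.shape
    ch, cw = crop_size
    offsets = [(0, 0), (0, w - cw), (h - ch, 0), (h - ch, w - cw),
               ((h - ch) // 2, (w - cw) // 2)]
    crops = []
    for img in (image, image[:, ::-1, :]):                      # original and mirrored
        crops += [img[t:t + ch, l:l + cw, :] for t, l in offsets]
    return np.stack(crops)                                      # shape (10, ch, cw, 3)

# at test time: average the classifier's predictions over the 10 crops
# probs = model.predict(ten_crops(test_image)).mean(axis=0)
```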