Udacity DL CNN: Basic Idea

Study outline:

Differences between MLPs and CNNs

Pooling

  1. Max pooling is better at picking out the most important details, such as edges and other salient features in an image.
  2. Average pooling is used for smoothing an image.

 

Kernel / Filter / Patch (convolution kernel)

Feature map / Convolutional layer (the output after the input has been processed by multiple kernels)

 

Capsule Networks (not finished)

  1. Hierarchical structure.

https://cezannec.github.io/Capsule_Networks/ 

https://github.com/cezannec/capsule_net_pytorch/blob/master/Capsule_Network.ipynb 

 

CNN in full detail

https://cs231n.github.io/convolutional-networks/#conv 

 

Detailed Summary

1. Three properties of CNNs

A fully connected network recognizes global patterns in an image, but in practice the objects in an image are built up layer by layer from smaller patterns. A CNN can therefore use local connections to recognize local patterns, which gives convolution Properties 1 and 2. Local connections reduce the amount of computation, and, as point 2 below explains, every neuron under the same filter is forced to share parameters, so there are even fewer of them. When there are few trainable parameters (a small model), we can use the training samples more efficiently and train faster. A parameter-count comparison is sketched below.
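As a rough illustration, here is a minimal sketch (the 28x28 input and the channel counts are arbitrary assumptions) comparing the parameter count of a fully connected layer against a convolutional layer that produces an output of the same shape:

import torch.nn as nn

# Fully connected: every output value gets its own weights for all 784 inputs.
fc = nn.Linear(28 * 28, 16 * 28 * 28)
# Convolutional: 16 shared 3x3 filters slide over the image.
conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 9,847,040 parameters
print(count(conv))  # 160 parameters (16 * 3*3*1 weights + 16 biases)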

 

2. Local connectivity

Saying that each filter corresponds to one neuron in a fully connected network is not quite right; rather, each convolution operation a filter performs is one neuron, so a group of neurons share the weights of the same filter. (In a fully connected network, every neuron has its own independent weights.) The sketch below makes this explicit.
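A minimal sketch (shapes are arbitrary assumptions) that unrolls a convolution into its individual "neurons": every sliding-window position applies the same shared weight vector.

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5, 5)   # one 5x5 single-channel input
w = torch.randn(1, 1, 3, 3)   # one shared 3x3 filter

patches = F.unfold(x, kernel_size=3)              # (1, 9, 9): nine flattened 3x3 patches
out = (w.view(1, -1) @ patches).view(1, 1, 3, 3)  # same weights applied to every patch

# identical to a real convolution (each patch position = one "neuron")
assert torch.allclose(out, F.conv2d(x, w), atol=1e-6)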

 

 

3. Limitations

If the size of some pattern A in an input image differs from the size of the filter that learned to detect pattern A during training, the pattern will not be recognized. DeepMind has an algorithm that can rotate and scale such patterns in an image (though in practice this situation rarely arises).

 

 

4. Feature map (convolutional layer)

The output obtained when the original image is convolved with the filters. The number of feature maps equals the number of filters: each feature map is the result of convolving with one filter.

 

5. Depths of the input, the output, and the filters (a filter "collapses" depth after convolution)

A filter's depth is determined by the depth of the data entering that convolutional layer, while the output depth is determined by the number of filters in the layer.

So if the input has depth 3 and we process it with ten filters, the output has depth 10. Each filter also has depth 3, but the value each filter produces has depth 1 (the results from the three channels are summed together), which you can picture as a "collapse". The output depth therefore depends only on the number of filters, not on the depth of the previous layer (the input). (For example, an n*n*3 input processed by k 3*3*3 filters yields a k*m*m*1 feature map, not k*m*m*3, and the filter depth of the next convolutional layer is then determined by this output.)

When using PyTorch's conv function, the first argument is the depth of the input to that layer and the second is the depth of the output. Looking deeper, the first depth is the depth of the input image, while the second is really the number of filters used for the convolution; each filter's depth matches the input depth, but however many channels a filter has, its output is still a single channel of values. A quick check is sketched below.
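A quick check of this "collapse" (the 32x32 input size is an arbitrary assumption):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)
x = torch.randn(1, 3, 32, 32)  # a batch of one 32x32 RGB image

print(conv.weight.shape)  # torch.Size([10, 3, 3, 3]): ten filters, each 3x3x3
print(conv(x).shape)      # torch.Size([1, 10, 30, 30]): depth 10, one channel per filter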

How to compute the height and width of a feature map

As shown in the figure (not reproduced here), each convolutional layer's size is the size of its feature map, i.e. the output size after the input passes through the filters, and the sizes can be worked out from the formula above (a presumed reconstruction follows). For example, the second layer's filter size is 3*3*16, with 4 filters in total (since the output depth is 4).
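The referenced formula is not reproduced in the source; presumably it is the standard output-size relation:

$$W_{\text{out}} = \frac{W_{\text{in}} - F + 2P}{S} + 1$$

where $W_{\text{in}}$ is the input width (or height), $F$ the filter size, $P$ the padding, and $S$ the stride.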

6. Analyzing what each layer learns

  • The first convolutional layer

The filters of the first convolutional layer are easy to interpret: their input is the image itself, so the filter weights directly reveal which features they detect.

 

  • The n-th convolutional layer

The filters of the second and later convolutional layers are harder to interpret, because their input is no longer an image but data produced by earlier convolution and pooling layers. However, we can work backwards to find the pattern each of these filters is interested in (detects) by finding the input that maximizes its activated value (degree of activation):

 

For example, take the k-th filter of the trained second layer, with its weights as shown (figure not reproduced here). Think of it as a neuron; its activation is the sum below (just as in a fully connected network), called the degree of activation. Using gradients we can find the input (image) that maximizes this activation; the resulting image is the pattern that filter is interested in.

(Here the input x is treated as the parameter and optimized by gradient ascent; when originally training the CNN, the input was fixed and the weights w were the parameters optimized by gradient descent.)

The images obtained this way (one per filter, not reproduced here) show the inputs that maximize each filter's degree of activation. They contain repeating patterns, and each repeating pattern is what that filter is interested in (each filter responds to line segments at a different angle). A minimal sketch of this procedure follows.
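A minimal activation-maximization sketch (the two-layer network, filter index, and input size are arbitrary assumptions, and pooling is omitted for brevity):

import torch
import torch.nn.functional as F

conv1 = torch.nn.Conv2d(1, 8, 3, padding=1)
conv2 = torch.nn.Conv2d(8, 16, 3, padding=1)
k = 0                                               # filter to visualize

x = torch.randn(1, 1, 28, 28, requires_grad=True)   # the image is the parameter now
opt = torch.optim.Adam([x], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    activation = conv2(F.relu(conv1(x)))[0, k].sum()  # degree of activation
    (-activation).backward()                          # ascend by descending the negative
    opt.step()
# x.detach() now approximates the pattern filter k responds to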

 

  • The flatten layer

Apply the same max-activation method to see what the layers after flattening have learned. Neurons in the fully connected layers no longer look at a single pattern; they form nonlinear transformations of linear combinations of the patterns, so each of these neurons looks at regularities across the whole image. (A filter, by contrast, only considers patterns in a small part of the input.)

 

  • The output layer

The image that maximizes an output neuron's activation looks nothing like the corresponding label, because the images produced by the method above are only optimized to maximize activation. To make the output match the label, we need to place constraints on the input x (which is what is being trained), similar to regularization. For example, in the referenced figure (not reproduced here) the white areas are where there is ink, so we can penalize the amount of white (the + in the original formula should be a -); a reconstruction is given below.
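The referenced formula is not reproduced in the source; with the sign corrected, it is presumably an L1-style penalty on the total amount of ink:

$$x^{*} = \arg\max_{x}\Big( y^{i} - \sum_{i,j} \lvert x_{ij} \rvert \Big)$$

where $y^{i}$ is the activation of output neuron $i$ and $x_{ij}$ are the pixel values.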

 

(See the paper "Deep Neural Networks are Easily Fooled".)

 

7. Deep Style

In deep style, the CNN is used to extract features; during training we do not train the CNN's parameters but instead train an image, which differs from the traditional use of CNNs for classification. First define a loss function that compares how similar the content representation of the target image is to that of the content image. During training we adjust the target image's content to reduce this loss, so that the target image's content ends up as close as possible to the content image's. (Traditional CNN training instead adjusts the weights to reduce a classification loss.) In short, one trains the input x, the other trains the weights w. A minimal sketch follows.
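A minimal sketch of "training the image" (the feature extractor here is a trivial stand-in for a layer of a fixed, pre-trained CNN):

import torch
import torch.nn.functional as F

def content_features(img):
    # stand-in for the content representation from a frozen CNN layer
    return img.mean(dim=0)

content = torch.rand(3, 64, 64)                      # the content image
target = torch.rand(3, 64, 64, requires_grad=True)   # the image we optimize
opt = torch.optim.Adam([target], lr=0.01)

for _ in range(100):
    opt.zero_grad()
    loss = F.mse_loss(content_features(target), content_features(content))
    loss.backward()
    opt.step()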

To represent the style of an image, a feature space designed to capture texture and color information is used. This space essentially looks at spatial correlations within the layers of the network. A correlation is a measure of the relationship between two or more variables.

For example, within a convolutional layer, for each feature map we can measure how strongly its detected features relate to the other feature maps in that layer: is a certain color detected in one map similar to a color in another map? What about the differences between detected corners and edges? In other words, which shapes and colors in a set of feature maps are related, and which are not?

Say we find that many feature maps in the first convolutional layer detect similar pink colors and features; the colors and shapes common to those feature maps can be thought of as part of that image's style. So the similarities and differences between feature maps in a layer give us information about the texture and color found in the image, while leaving out information about the actual arrangement (placement) and identity of the different objects in it. This is how content and style can be separated from an image; a sketch of these map-to-map correlations follows.
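A minimal sketch of the map-to-map correlation measure (commonly computed as a Gram matrix; the layer shape is an arbitrary assumption):

import torch

def gram_matrix(feature_maps):
    # feature_maps: (channels, height, width) output of one conv layer
    c, h, w = feature_maps.shape
    flat = feature_maps.view(c, h * w)   # one row per feature map
    return flat @ flat.t()               # (c, c): correlation of map i with map j

layer_output = torch.randn(16, 32, 32)   # stand-in for a conv layer's output
print(gram_matrix(layer_output).shape)   # torch.Size([16, 16])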

Using the trained CNN, style transfer finds the style of one image and the content of another, then tries to merge them into one new image. In the new image, the objects and shapes come from the content image, while the colors and textures are taken from the style image.

  • Deep Dream

Add to an input image the things the CNN itself "sees" in it.

Take a trained CNN and "exaggerate" the weights of some hidden layer (a fully connected layer or a filter both work), then feed the image into this "exaggerated" network and train in reverse: change the image's values so as to maximize the activation of the "exaggerated" layer, producing an exaggerated picture. A sketch follows.
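A minimal Deep Dream sketch (the single conv layer stands in for a hidden layer of a trained network; the step size and iteration count are arbitrary assumptions):

import torch
import torch.nn.functional as F

conv = torch.nn.Conv2d(3, 8, 3, padding=1)   # stand-in for a trained hidden layer
img = torch.rand(1, 3, 64, 64, requires_grad=True)

for _ in range(50):
    act = F.relu(conv(img)).norm()           # how strongly the layer responds
    act.backward()                           # gradient ascent on the image
    with torch.no_grad():
        img += 0.01 * img.grad / (img.grad.norm() + 1e-8)
        img.grad.zero_()
# img now exaggerates whatever the layer already "saw" in it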

 

 

 

  • Deep style

Use one CNN to produce the content representation; use another CNN on a style image, keeping not its output layer but only the correlations between its filters. Then train a third image/network so that it is compatible with both.

 

 

 

8. Applying CNNs outside the image domain

When applying a CNN to a different domain, its structure should be adapted to the characteristics of that domain.

  • Playing Go

The first two CNN properties are well suited to detecting local patterns, and board positions in Go also have local structure. But the third property (pooling) would throw away information about the game state, so AlphaGo's network has no pooling layers, only convolutional layers.

  • Speech

Sound can be represented in the frequency domain, where darker regions indicate higher energy at the corresponding frequency. A stretch of audio can be represented as a spectrogram, and by analyzing the spectrogram we can determine what was said.

A spectrogram can therefore be treated as an image and processed with a CNN. However, because sounds are independent along the horizontal (time) axis, speech models usually only consider filters that move along the frequency axis. For the same word, male and female voices produce spectrogram shapes that are the same in frequency, just shifted. A sketch follows.
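A minimal sketch of a filter restricted to moving along the frequency axis (all sizes are arbitrary assumptions): the kernel spans the full time axis, so it can only slide in frequency.

import torch
import torch.nn as nn

f_bins, t_frames = 40, 100
spectrogram = torch.randn(1, 1, f_bins, t_frames)   # (batch, channel, freq, time)
conv = nn.Conv2d(1, 8, kernel_size=(3, t_frames))   # 3 frequency bins x all time frames

print(conv(spectrogram).shape)  # torch.Size([1, 8, 38, 1]): moved only in frequency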

 

  • Text Sentences

Text processing: arrange each sentence as a matrix of word vectors, one word per column. The entries within a column are related, but the rows are independent, so the filter moves only horizontally (across words). A sketch follows.
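A minimal sketch (embedding size and sentence length are arbitrary assumptions): the kernel spans the full embedding dimension, so it slides only across words.

import torch
import torch.nn as nn

emb_dim, seq_len = 50, 20
sentence = torch.randn(1, 1, emb_dim, seq_len)      # (batch, channel, embedding, words)
conv = nn.Conv2d(1, 8, kernel_size=(emb_dim, 3))    # full embedding x 3 words

print(conv(sentence).shape)  # torch.Size([1, 8, 1, 18]): slides only across words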

 

9. Application to autoencoders (compress, then decompress)

Feed the encoder's output into a transpose convolution, which performs the reverse of the convolution operation. (In the figure, in order: input feature map; filter, with crosses marking the weights; output feature map.)

Where the filter applications overlap, the values are summed, which can cause artifacts; setting the stride to 2 (matching the kernel size) avoids any overlap.

For example, to define an autoencoder with the following structure, the code is shown below.

Note that when defining the transpose convolutions below, because the kernel side length is 2, a stride of 2 ensures the output has no overlapping regions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvAutoencoder(nn.Module):
    def __init__(self):
        super(ConvAutoencoder, self).__init__()
        ## encoder layers ##
        # conv layer (depth from 1 --> 16), 3x3 kernels
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)  
        # conv layer (depth from 16 --> 4), 3x3 kernels
        self.conv2 = nn.Conv2d(16, 4, 3, padding=1)
        # pooling layer to reduce x-y dims by two; kernel and stride of 2
        self.pool = nn.MaxPool2d(2, 2)
        
        ## decoder layers ##
        ## a kernel of 2 and a stride of 2 will increase the spatial dims by 2
        self.t_conv1 = nn.ConvTranspose2d(4, 16, 2, stride=2)
        self.t_conv2 = nn.ConvTranspose2d(16, 1, 2, stride=2)


    def forward(self, x):
        ## encode ##
        # add hidden layers with relu activation function
        # and maxpooling after
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        # add second hidden layer
        x = F.relu(self.conv2(x))
        x = self.pool(x)  # compressed representation
        
        ## decode ##
        # add transpose conv layers, with relu activation function
        x = F.relu(self.t_conv1(x))
        # output layer (with sigmoid for scaling from 0 to 1)
        x = torch.sigmoid(self.t_conv2(x))
                
        return x


An alternative approach: up-sampling (sketched below)
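A minimal sketch of the up-sampling alternative (a hedged assumption about what the note refers to): nearest-neighbor upsampling followed by a regular convolution, instead of a transpose convolution.

import torch
import torch.nn as nn

up = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),  # double the spatial dims
    nn.Conv2d(4, 16, 3, padding=1),               # then smooth with a normal conv
)
x = torch.randn(1, 4, 7, 7)
print(up(x).shape)  # torch.Size([1, 16, 14, 14])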

 

10. Denoising

 

During training, the target is the noise-free image. A training-step sketch follows.
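A minimal sketch of one denoising training step, reusing the ConvAutoencoder defined above (batch size and noise level are arbitrary assumptions):

import torch

model = ConvAutoencoder()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(8, 1, 28, 28)               # stand-in for clean images
noisy = clean + 0.5 * torch.randn_like(clean)  # corrupted input

optimizer.zero_grad()
loss = criterion(model(noisy), clean)          # target is the noise-free image
loss.backward()
optimizer.step()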
