CNN Study Notes (1)

Convolutional Neural Networks are very similar to ordinary Neural Networks: they are made up of neurons that have learnable weights and biases.

(1) General Overview

Example Architecture: Overview. We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail:

  • INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.
  • CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32x32x12] if we decided to use 12 filters.
  • RELU layer will apply an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
  • POOL layer will perform a downsampling【1】 operation along the spatial dimensions (width, height), resulting in volume such as [16x16x12].
  • FC (i.e. fully-connected) layer will compute the class scores, resulting in volume of size [1x1x10], where each of the 10 numbers correspond to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.
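As a concrete (and hedged) illustration of this exact stack, here is a minimal PyTorch sketch; the 3x3 filter size with padding is my own choice, since the overview above does not fix it:

```python
import torch
import torch.nn as nn

tiny_convnet = nn.Sequential(
    # CONV: 12 filters over the 3 input channels; padding=1 keeps the 32x32 size
    nn.Conv2d(in_channels=3, out_channels=12, kernel_size=3, padding=1),
    # RELU: elementwise max(0, x); volume stays [32x32x12]
    nn.ReLU(),
    # POOL: 2x2 max pooling with stride 2 downsamples to [16x16x12]
    nn.MaxPool2d(kernel_size=2, stride=2),
    # FC: flatten and map to the 10 CIFAR-10 class scores, volume [1x1x10]
    nn.Flatten(),
    nn.Linear(12 * 16 * 16, 10),
)

x = torch.randn(1, 3, 32, 32)    # one fake 32x32 RGB image (NCHW layout)
print(tiny_convnet(x).shape)     # torch.Size([1, 10]): one score per class
```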

【1】Downsampling: shrinking an image (also called subsampling or downsampling) has two main purposes: 1. making the image fit the display area; 2. generating a thumbnail of the image. Enlarging an image (called upsampling or interpolating) aims to increase its resolution so it can be shown on a higher-resolution display. Scaling an image does not by itself add information about the image, so image quality inevitably suffers; however, some scaling methods can add information, so the scaled result may look better than the original. For specific algorithms see: wiki/Decimation

In summary

  • A convolutional neural network is built from a simple list of layers that transform the image into an output (for example, a vector of class probabilities).
  • There are a few distinct types of layers (e.g. the convolutional layer, the fully-connected layer and the pooling layer, which are by far the most common).
  • Each layer accepts an input 3D "volume" and transforms it into an output 3D "volume" with its own function.
  • Some layers have parameters¹ (CONV and FC layers do; RELU and POOL layers do not).
  • Some layers have hyperparameters² (CONV, FC and POOL layers do; RELU layers do not).

(Figure: CNN model diagram)

(2) Algorithms and Intuition for Each Layer

(1) Convolutional Layer

1. How the kernels connect to the input

The parameters of a convolutional layer consist of a set of learnable filters (convolution kernels). For an image input, each filter is simply a 2D matrix extended with a depth of 3, making it a small 3D volume; the depth of 3 matches the image's three color channels.
We slide each filter across the input volume and compute dot products as we go, which produces a 2D activation map. Different filters produce different maps, and the depth of the final output volume equals the number of filters.
The depth of the receptive field must always equal the depth of the input volume, and all the filters at a given spatial position compute their dot products over the same input region!

[The operation of each filter is essentially the same as in an ordinary neural network: the "convolution" is just a dot product taken over a local spatial region.]
In the computational model of a neuron, the signals that travel along the axons (e.g. x0) interact multiplicatively (e.g. w0x0) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. w0). The idea is that the synaptic strengths (the weights w) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another. In the basic model, the dendrites carry the signal to the cell body where they all get summed.
If the final sum is above a certain threshold, the neuron can fire, sending a spike along its axon. In the computational model, we assume that the precise timings of the spikes do not matter, and that only the frequency of the firing communicates information.
Based on this rate code interpretation, we model the firing rate of the neuron with an activation function f, which represents the frequency of the spikes along the axon. Historically, a common choice of activation function is the sigmoid function σ, since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1.

(Figure: neuron connections)

2. Spatial arrangement of the neurons

(1) Three hyperparameters control the size of the output volume: the depth, the stride, and the zero-padding.

  1. The depth is a hyperparameter of the output volume: it equals the number of filters we use, and we would like each filter to respond to something different, such as variously oriented edges or blobs of color.
  2. The stride also has to be chosen from experience. With a stride of 1 the filter moves one pixel at a time; with a stride of 2 it jumps two pixels, which produces a spatially smaller output volume.
  3. Sometimes it is convenient to pad the border of the input volume with zeros (zero-padding). The amount of padding is a hyperparameter. Zero-padding lets us control the spatial size of the output; most of the time it is used to keep the output the same size as the input.

The spatial size of the conv layer's output volume is determined by four quantities: the input size (W), the receptive field size (F), the stride (S) and the amount of zero-padding (P).
The number of neurons along one spatial dimension is then [(W−F+2P)/S]+1.
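The formula is easy to sanity-check in code; below is a small Python helper (the function name and examples are my own, not from the notes):

```python
# A small helper for the formula above.
def conv_output_size(W, F, S, P):
    """Neurons along one spatial dimension: (W - F + 2P)/S + 1."""
    size = (W - F + 2 * P) / S + 1
    assert size.is_integer(), "these hyperparameters do not tile the input evenly"
    return int(size)

print(conv_output_size(W=32, F=3, S=1, P=1))    # 32: P=(F-1)/2 with S=1 keeps the size
print(conv_output_size(W=227, F=11, S=4, P=0))  # 55: the AlexNet example further below
```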

Why zero-padding is used:

  1. In general, setting zero padding to be P=(F−1)/2 when the stride is S=1 ensures that the input volume and output volume will have the same size spatially. (Why: with S=1 the output size is W−F+2P+1, so choosing P=(F−1)/2 makes it exactly W, preserving the spatial size.)

  2. Without zero-padding the stride is constrained: with a stride of S=2, for example, the formula for the number of neurons may not come out to an integer. Zero-padding gives us a way to regulate the size of the ConvNet and keep the neuron layout clean and symmetric.

Neurons, filter volumes, conv-layer output volumes… it all gets confusing, so let's go through it once more:
Start from the raw input image ——> decide on the filter (kernel) size ——> slide the filter across the input, taking the dot product with its weights at every position ——> obtain an n×n map of neurons ——> repeat for K filters ——> output an n × n × K volume. (A minimal sketch of this pipeline is given below.)
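To make the pipeline concrete, here is a tiny NumPy sketch of it, assuming stride 1, no zero-padding and a square input; all names are mine:

```python
import numpy as np

def naive_conv(image, filters):
    """image: (H, W, D) input volume; filters: (K, F, F, D).
    Slide each filter over the image and take dot products (stride 1, no padding)."""
    H, W, D = image.shape
    K, F, _, _ = filters.shape
    n = H - F + 1                                      # (W - F + 2*0)/1 + 1
    out = np.zeros((n, n, K))
    for k in range(K):                                 # repeat K times -> depth K
        for i in range(n):
            for j in range(n):
                patch = image[i:i+F, j:j+F, :]         # receptive field
                out[i, j, k] = np.sum(patch * filters[k])  # dot product
    return out

out = naive_conv(np.random.rand(32, 32, 3), np.random.rand(12, 5, 5, 3))
print(out.shape)   # (28, 28, 12)
```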

Example:

The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.

3. Parameter sharing

The parameter-sharing scheme is used to control the number of parameters.
Within each depth slice (note: if a volume has shape [a×a×n], it has n depth slices of size [a×a]; the depth here is the number of filters), every neuron uses the same weights. When the weights are updated by gradient descent, the gradients from all the neurons in a slice are accumulated into that single shared set of weights. And if all neurons in a depth slice use the same weights, then the forward pass of the conv layer can, in each depth slice, be computed as a convolution of the neuron's weights with the input volume. (Put differently: because one slice reuses a single filter everywhere, computing that slice is literally a convolution of the filter with the input, which is where the layer gets its name.)
Note that parameter sharing does not always give good results, in particular when the images fed into the ConvNet have a specific centered structure.

4.Summary
  • Accepts a volume of size W1×H1×D1
  • Requires four hyperparameters:
    • Number of filters K,
    • their spatial extent F,
    • the stride S,
    • the amount of zero padding P
  • Produces a volume of size W2×H2×D2 where:
    • W2=(W1−F+2P)/S+1
    • H2=(H1−F+2P)/S+1 (i.e. width and height are computed equally by symmetry)
    • D2=K
  • With parameter sharing, it introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases.
  • In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias.

From cs231n.
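A quick check of the weight counts from the summary, using the AlexNet-style first layer from the example above:

```python
# Parameter count for one conv layer with sharing: (F*F*D1) weights per filter, plus K biases.
F, D1, K = 11, 3, 96
weights_per_filter = F * F * D1           # 363
total_weights = weights_per_filter * K    # 34848
total_biases = K                          # 96
print(weights_per_filter, total_weights, total_biases)
```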

5. Implementation as a matrix multiplication
  1. The local regions in the input image are stretched out into columns in an operation commonly called im2col. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11*11*3 = 363. Iterating this process in the input at stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix X_col of im2col of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.
  2. The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11x11x3] this would give a matrix W_row of size [96 x 363].
  3. The result of a convolution is now equivalent to performing one large matrix multiply np.dot(W_row, X_col), which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96 x 3025], giving the output of the dot product of each filter at each location.
  4. The result must finally be reshaped back to its proper output dimension [55x55x96].
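Here is a compact NumPy sketch of those four steps (the variable names X_col and W_row follow the text above; everything else is my own choice):

```python
import numpy as np

def im2col_conv(x, w, stride):
    """x: (H, W, D) input volume; w: (K, F, F, D) filters. No zero-padding."""
    H, W, D = x.shape
    K, F, _, _ = w.shape
    n = (H - F) // stride + 1                     # output locations per side
    # 1. stretch every receptive field into a column of X_col
    X_col = np.zeros((F * F * D, n * n))
    col = 0
    for i in range(0, H - F + 1, stride):
        for j in range(0, W - F + 1, stride):
            X_col[:, col] = x[i:i+F, j:j+F, :].ravel()
            col += 1
    # 2. stretch every filter into a row of W_row
    W_row = w.reshape(K, F * F * D)
    # 3. one big matrix multiply evaluates every filter at every location
    out = np.dot(W_row, X_col)                    # shape (K, n*n)
    # 4. reshape back to the proper output volume
    return out.reshape(K, n, n).transpose(1, 2, 0)

out = im2col_conv(np.random.rand(227, 227, 3), np.random.rand(96, 11, 11, 3), stride=4)
print(out.shape)   # (55, 55, 96)
```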

The downside of this approach is that it uses a lot of memory, because some values in the input volume are replicated many times in X_col. The benefit is that we can rely on the many highly optimized matrix-multiplication implementations that are available (most commonly the BLAS API).

6. Later developments

Convolutional networks have since spawned several variants of the convolution operation between the data and the weights:

  1. 1x1 convolution (paper: Network in Network)
  2. Dilated convolutions (paper: Multi-Scale Context Aggregation by Dilated Convolutions)
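As a quick illustration of what these two variants look like in practice, here is a hedged PyTorch snippet (the channel counts and sizes are made up for the example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)   # a made-up feature volume with 64 channels

# 1x1 convolution: a per-pixel dot product across the 64 channels (Network in Network)
conv_1x1 = nn.Conv2d(64, 16, kernel_size=1)
print(conv_1x1(x).shape)         # torch.Size([1, 16, 32, 32])

# Dilated convolution: a 3x3 kernel applied with gaps, so it covers a 5x5 area
conv_dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
print(conv_dilated(x).shape)     # torch.Size([1, 64, 32, 32])
```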

(2) Pooling Layer

It is very common to periodically insert a pooling layer between successive convolutional layers in a ConvNet architecture. Its role is to drastically reduce the spatial size of the representation, which cuts down the number of parameters and the amount of computation and therefore also helps control overfitting. The pooling layer operates independently on every depth slice of the input and resizes it spatially, most commonly with the MAX operation.

1. How it works

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers (little 2x2 region in some depth slice). The depth dimension remains unchanged. More generally, the pooling layer:

  • Accepts a volume of size W1×H1×D1
  • Requires two hyperparameters:
    • their spatial extent F,
    • the stride S
  • Produces a volume of size W2×H2×D2 where:
    • W2=(W1−F)/S+1
    • H2=(H1−F)/S+1
    • D2=D1
  • Introduces zero parameters since it computes a fixed function of the input
  • For Pooling layers, it is not common to pad the input using zero-padding.

It is worth noting that only two variants are commonly seen in practice: F = 3, S = 2, and the even more common F = 2, S = 2. Pooling with a larger receptive field is too destructive and loses too much of the image.
(Figure: max pooling)
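The common 2x2, stride-2 case is easy to write down in NumPy (a minimal sketch; assumes the spatial dimensions are even):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, D) volume."""
    H, W, D = x.shape
    # group every 2x2 spatial block together, then take its maximum
    return x.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

x = np.random.rand(32, 32, 12)
print(max_pool_2x2(x).shape)   # (16, 16, 12): depth unchanged, 75% of activations discarded
```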

2. Pooling, past and present

Other pooling operations used to be common, such as average pooling and L2-norm pooling. Average pooling was widely used historically, but it has fallen out of favor because max pooling has been shown to work better in practice.

There is also a view that better models can be trained without pooling layers at all, for example variational autoencoders (VAEs) and generative adversarial networks (GANs); in future architectures pooling layers may appear less and less and gradually disappear.

(3) Normalization Layer

In practice these normalization layers contribute very little to the model, so they are used less and less. For the various kinds of normalization, see the discussion in the cuda-convnet library API.

(4) Fully-Connected Layer (still studying…)

Neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.
In effect, the FC layer takes the output of the last convolutional or pooling layer, weights every number in it, and produces the classification result, which may be raw (unnormalized) scores or a probability for each class. (A small sketch is given below.)
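A minimal PyTorch sketch of that last step, continuing the CIFAR-10 example from the overview (the shapes are my own choice):

```python
import torch
import torch.nn as nn

pooled = torch.randn(1, 12, 16, 16)        # output volume of the last POOL layer
fc = nn.Linear(12 * 16 * 16, 10)           # every neuron sees every input number

scores = fc(pooled.flatten(start_dim=1))   # raw class scores, shape [1, 10]
probs = torch.softmax(scores, dim=1)       # optional: turn scores into probabilities
print(scores.shape)                        # torch.Size([1, 10])
```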

(5) Converting Between Fully-Connected and Convolutional Layers

(3) Summary

The most common ConvNet architecture follows the pattern:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3)
When N = 2, this is generally good for building a larger and deeper network, because stacking conv layers before each pooling layer means the pooling does not destroy too much of the input information. (A sketch of one instantiation of the pattern is given below.)
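For concreteness, here is one possible instantiation of the pattern with N = 2, M = 2 and K = 1, written as a hedged PyTorch sketch (the channel counts and the 32x32 input size are my own choices):

```python
import torch.nn as nn

# INPUT -> [[CONV -> RELU]*2 -> POOL]*2 -> [FC -> RELU]*1 -> FC
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, 10),                           # final FC: 10 class scores
)
```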


  1. Parameters: variables learned automatically from the data (the learned weights and biases). ↩︎

  2. Hyperparameters: variables set from experience (learning rate, number of iterations, number of layers, number of neurons, etc.). ↩︎
