Convolutional Neural Networks


守望者tt


Article link: https://blog.csdn.net/u012756814/article/details/79995032


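This article attempts to summarize convolutional neural networks (CNNs). It covers the following topics:

1. Motivation and core ideas of convolutional neural networks

2. How convolutional neural networks work

3. Parameters and data dimensions in convolutional neural networks

4. Implementing a convolutional neural network in Python

5. Classic convolutional neural networks

6. Applications of convolutional neural networks in NLP

7. A worked convolutional neural network example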

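Motivation and Core Ideas of Convolutional Neural Networks

CNNs were originally designed for image recognition. If an image is flattened into a one-dimensional vector and fed to a fully connected network, the number of parameters becomes very large. According to VC theory, it is then hard to guarantee that the training error and the generalization error stay close, so the model overfits easily.

We therefore need to build problem-specific preferences into the algorithm (that is, set a prior) to reduce model complexity and improve generalization.

Compared with other kinds of data, images have the following properties:

1. An image may contain thousands or even millions of pixels, but detecting a small, meaningful feature does not require the whole image. To detect a bird's beak, for example, we only need the small region inside the red box in the figure. (Local connectivity)
[Figure]

2. The same feature can appear in different regions of an image. As the figure below shows, a beak may appear in different places, so the same parameters can be used to detect it everywhere. (Parameter sharing)

[Figure]

3. Downsampling an image usually does not affect our understanding of it.
[Figure]

The convolutional layer is motivated by the first two properties. The pooling layer is motivated by the third, with the aim of improving efficiency and reducing the number of parameters. To perform classification or another task, fully connected layers are usually needed to connect to the output. A typical CNN therefore has three components:

Convolutional layer: implements local connectivity and parameter sharing

Pooling layer: implements downsampling

Fully connected layer: performs the final classification or other task

The convolutional layers extract features; the fully connected layers that follow combine and match those features to perform classification or another task. Because a CNN learns feature extraction jointly with the task, the extracted features are more effective and complex manual feature engineering is avoided. The next part explains how the convolution operation relates to feature extraction, before the CNN itself is introduced.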

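Convolution and feature extraction:

To explain the relationship between convolution and feature extraction, we first introduce the convolution integral and the Fourier transform.

Fourier transform: $F(\omega) = \mathcal{F}[f(t)] = \int_{-\infty}^{\infty} f(t)e^{-j\omega t}\,dt$

Convolution integral: $f_1(t)*f_2(t) = \int_{-\infty}^{\infty} f_1(\tau)f_2(t-\tau)\,d\tau$

Then $\mathcal{F}[f_1(t)*f_2(t)] = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} f_1(\tau)f_2(t-\tau)\,d\tau\right]e^{-j\omega t}\,dt = \int_{-\infty}^{\infty} f_1(\tau)\left[\int_{-\infty}^{\infty} f_2(t-\tau)e^{-j\omega t}\,dt\right]d\tau = \int_{-\infty}^{\infty} f_1(\tau)e^{-j\omega \tau}F_2(\omega)\,d\tau = F_1(\omega)F_2(\omega)$

This is the time-domain convolution theorem (it holds for two-dimensional signals as well): convolution of two signals in the time domain corresponds to multiplication in the frequency domain.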
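The theorem can be checked numerically in its discrete form. The sketch below is illustrative only: it assumes NumPy, uses random length-8 signals, and relies on the fact that the discrete Fourier transform pairs with circular (wrap-around) convolution rather than the infinite integral above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
f1 = rng.standard_normal(n)
f2 = rng.standard_normal(n)

# Circular convolution computed directly from the definition:
# (f1 * f2)[t] = sum_tau f1[tau] * f2[(t - tau) mod n]
direct = np.array([
    sum(f1[tau] * f2[(t - tau) % n] for tau in range(n))
    for t in range(n)
])

# Same result via the frequency domain: multiply the spectra, then invert.
via_fft = np.fft.ifft(np.fft.fft(f1) * np.fft.fft(f2)).real

max_abs_diff = np.max(np.abs(direct - via_fft))
print(max_abs_diff)  # should be at floating-point rounding level
```

The two routes agree up to rounding error, which is the discrete counterpart of $F_1(\omega)F_2(\omega)$.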

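We can therefore understand convolution-based feature extraction in the frequency domain. A higher frequency means the original signal changes faster, and a lower frequency means it is smoother, so frequency reflects how quickly the signal varies. High-frequency components describe abrupt changes in the signal, while low-frequency components determine its overall shape. For images, high-frequency components often correspond to edges, and low-frequency components to the coarse outline. To perform edge detection, we can design a high-pass filter and convolve the input with it; by the convolution theorem derived above, the result retains mainly the high-frequency components, which yields the edges. We can also design more elaborate filters for other features, for example using Fourier descriptors to extract shape features, or using Fourier coefficients directly as texture features.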
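As a small illustration of a high-pass filter acting as an edge detector, the sketch below applies a 3x3 Laplacian kernel to a synthetic image with one vertical edge. The helper `correlate2d_valid`, the test image, and the choice of the Laplacian are assumptions made for this example, not something taken from the text above.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 6x6 image: dark left half (0), bright right half (1), so one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Laplacian kernel: its entries sum to zero, so flat regions map to 0.
# It is symmetric, so convolution and cross-correlation coincide here.
laplacian = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])

response = correlate2d_valid(image, laplacian)
print(response)  # every row is [0, 1, -1, 0]
```

The response is zero wherever the image is constant and nonzero only in the two columns next to the edge.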

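Some image texture features look quite different in the spatial domain (left) and in the frequency domain (right); convolving the input with operators of this kind performs the feature extraction.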

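Within one network, convolutional and pooling layers can be repeated many times, and each convolutional layer can have several different kernels, each detecting a different feature in the image. Images are built from basic features (points and edges), so with enough kernels a layer can extract edges in all orientations and points of all shapes, which lets the convolutional layers build rich and effective high-level features. The image produced by each kernel is a map of one type of feature.

Regardless of image size, thanks to local connectivity and parameter sharing, the number of trainable weights depends only on the kernel size and the number of kernels, so images of any size can be processed with very few parameters. Each kernel is small and extracts only a simple feature, but deeper layers combine these into higher-order features. A CNN with many levels of abstraction is more expressive and more efficient; compared with extracting high-order features with only a few hidden layers, it saves parameters and reduces the risk of overfitting.

Downsampling by pooling greatly reduces storage and computation. Because pooling keeps the most salient features, it also improves the model's tolerance to distortion and helps generalization.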
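To make the parameter argument concrete, here is a rough count for one layer. The sizes (64 output units or kernels, 3x3 kernels, RGB input) are hypothetical values chosen only for illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(f, c_in, c_out):
    """Weights plus biases of a conv layer with c_out kernels of size f x f x c_in."""
    return f * f * c_in * c_out + c_out

for side in (32, 224):
    n_in = side * side * 3  # flattened RGB image
    print(side, dense_params(n_in, 64), conv_params(3, 3, 64))
# 32  -> 196672 dense parameters vs 1792 conv parameters
# 224 -> 9633856 dense parameters vs 1792 conv parameters
```

The dense count grows with the image area, while the convolutional count does not depend on the image size at all.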

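LeCun's view: convolutional layers with trainable parameters are an effective way to extract similar features at multiple locations in an image using few parameters. They exploit the spatial correlation in images, which is lost when raw pixels are used directly.

From input to output, a convolutional network should make the spatial size progressively smaller and the number of output channels progressively larger, simplifying the spatial structure and converting spatial information into high-level abstract features.

The success of CNNs can be understood as a close match between their prior assumptions and image-processing tasks. Keep in mind, however, that although pooling has many benefits, it keeps only the most salient features and may discard less prominent ones that are nevertheless important.

CNNs were originally designed for image recognition, but they can now also be applied to sequential signals such as audio or text.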

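How Convolutional Neural Networks Work

Two-dimensional convolution: $S(i,j) = (I*K)(i,j) = \sum_{m}\sum_{n} I(i-m, j-n)K(m,n)$. In convolution the kernel is flipped relative to the input (as the kernel index increases, the input index decreases). The only purpose of flipping the kernel is to make the operation commutative, but commutativity is not an important property in neural networks. For simplicity, CNNs actually implement the

cross-correlation function: $S(i,j) = (I \star K)(i,j) = \sum_{m}\sum_{n} I(i+m, j+n)K(m,n)$.

Cross-correlation and feature extraction: cross-correlation is a concept from signal analysis that measures how similar two sequences are. The more similar they are, the larger the output, which corresponds to the neuron being activated. If we regard the kernel as a feature detector, then whenever the input contains a pattern similar to the kernel, the output of the convolution at that position is large. Conversely, a neuron that is activated after the convolution indicates that the feature has been detected.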
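The difference between the two formulas can be seen on a tiny example. The sketch below assumes NumPy and restricts both operations to the region where the kernel fits entirely inside the input; the function names are illustrative.

```python
import numpy as np

def cross_correlate(I, K):
    """S(i,j) = sum_m sum_n I(i+m, j+n) K(m,n), valid region only."""
    kh, kw = K.shape
    out = np.zeros((I.shape[0] - kh + 1, I.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

def convolve(I, K):
    """True convolution: cross-correlation with the kernel flipped on both axes."""
    return cross_correlate(I, K[::-1, ::-1])

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 2.0],
              [3.0, 4.0]])

xcorr = cross_correlate(I, K)
conv = convolve(I, K)
print(xcorr[0, 0], conv[0, 0])  # 34.0 and 16.0

# With a kernel that is symmetric under the flip, the two operations agree.
K_sym = np.ones((2, 2))
same_for_symmetric = np.allclose(cross_correlate(I, K_sym), convolve(I, K_sym))
```

Since the kernel entries are learned, a network trained with cross-correlation simply learns the flipped version of the kernel that a true convolution would learn.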

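Another way to understand cross-correlation and feature extraction is the correlation theorem. The previous section used the convolution theorem to relate convolution to feature extraction, and cross-correlation takes the same form. The correlation theorem states that the Fourier transform of the cross-correlation of two signals equals the Fourier transform of one signal multiplied by the complex conjugate of the Fourier transform of the other. The conjugate makes no difference to our understanding of convolution as feature extraction, because the kernel parameters are learned.

The figure below illustrates the convolution operation. The filter slides across the image; at each position it is multiplied element-wise with the region it covers, the products are summed, and a bias is added (not shown in the figure). The right-hand side shows the relationship between a convolutional network and a fully connected network. A CNN can be understood as a fully connected network under a very strong prior:

1. A neuron in a hidden layer is connected only to neurons within a local spatial region (its receptive field) of the previous layer; weights outside the receptive field are zero. (Local connectivity)

2. All neurons within the same feature map of a hidden layer share the same parameters. (Parameter sharing)

Note that although connections in a convolutional layer are confined to a local region, neurons in deeper layers are indirectly connected to most or all of the input.
[Figure]

Real images usually have multiple channels (RGB), and applying several kernels also produces a multi-channel output. The kernel used in a convolution must therefore have the same number of channels as its input.
[Figure]

To extract different features from the input, several kernels are needed. Each kernel, convolved with the input, produces one separate feature map in the output, so the number of kernels equals the number of output channels.
[Figure]

The figure below illustrates the pooling layer. Within each window the maximum value is taken as the output. There are no parameters to learn. Max pooling, which keeps the most salient feature, is generally used; average pooling tends to blur.
[Figure]

Pooling handles multi-channel inputs much like the convolutional layer, except that each channel is pooled independently, so the details are not repeated here.

After several convolutional and pooling layers, the data is flattened and fed into a fully connected network for classification or another task.
[Figure]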
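Parameters and Dimensions in Convolution

Main parameters:

Kernel size: f*f

Stride: s

Padding: valid, same

As the figure below shows, the kernel is the yellow filter of size 3*3, and stride is the step by which the kernel moves: stride=1 moves it one cell at a time, stride=2 moves it two cells at a time.
[Figure]

As the figure below shows, if convolution is applied directly (valid), the image shrinks after every convolution, and the larger the kernel, the faster it shrinks, which makes very deep networks impossible to build. To preserve the image size, the border is usually padded (same); padding also ensures that information at the borders propagates to deeper layers.
[Figure]

Suppose the image before convolution has size $n^{l-1}*n^{l-1}$ with $c^{l-1}$ channels, the stride is 1, and the kernel size is $f^{l}*f^{l}$. For the convolution to be defined, each kernel must have $c^{l-1}$ channels, so each kernel has dimensions $f^l*f^l*c^{l-1}$. Let the number of kernels be $c^l$.

With valid padding, the output has dimensions $(n^{l-1}-f^l+1)*(n^{l-1}-f^l+1)*c^l$.

With same padding, the output has dimensions $n^{l-1}*n^{l-1}*c^l$. To preserve the size, the amount of padding on each side must be $p^l = \frac{f^l-1}{2}$, which is why $f^l$ is usually odd.

If the stride is not 1, the input and output sizes can no longer be kept equal, and the output has dimensions $\lfloor \frac {n^{l-1}+2p^l-f^l}{s^l} +1 \rfloor * \lfloor \frac {n^{l-1}+2p^l-f^l}{s^l} +1 \rfloor *c^l$, where $\lfloor \cdot \rfloor$ denotes rounding down.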
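The size formulas above reduce to one small function. This is a minimal sketch; the function names are illustrative, and `same_padding` assumes stride 1 and an odd kernel size.

```python
def conv_output_size(n, f, s=1, p=0):
    """floor((n + 2p - f) / s) + 1 along one spatial axis."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding per side that preserves the size at stride 1; assumes f is odd."""
    return (f - 1) // 2

valid = conv_output_size(6, 3)                       # 6 - 3 + 1 = 4
same = conv_output_size(6, 3, p=same_padding(3))     # stays 6
strided = conv_output_size(7, 3, s=2)                # floor(4 / 2) + 1 = 3
print(valid, same, strided)
```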

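As in a fully connected network, a bias is added to the convolution output, which is then passed through a nonlinear activation.

The figure below shows a simple convolution example: $n^{l-1} = 6$, $c^{l-1} = 3$, $f^l=3$, $c^l=2$, stride=1, and valid padding, so the output has dimensions $(6-3+1)*(6-3+1)*2 = 4*4*2$.

[Figure]

The figure below illustrates the parameter dimensions:
[Figure]

A pooling layer has no learnable parameters and usually uses no padding. Its output dimensions are $\lfloor \frac {n_H-f}{s} +1 \rfloor *\lfloor \frac {n_W-f}{s} +1 \rfloor *c$.

The figure below shows an example CNN. The input has dimensions $32*32*3$; after two convolution-plus-pooling stages the output is $5*5*16$ (the volume goes from wide and shallow to narrow and deep). It is then flattened into a vector and connected to a fully connected network.
[Figure]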
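The text gives only the input shape ($32*32*3$) and the final shape ($5*5*16$) of this example. One set of hyperparameters that reproduces them is the LeNet-5-style configuration assumed below (5x5 valid convolutions with 6 and 16 kernels, each followed by 2x2 pooling with stride 2); the figure may use different values.

```python
def out_size(n, f, s=1, p=0):
    return (n + 2 * p - f) // s + 1

# Assumed hyperparameters: (name, kernel size f, stride, output channels).
layers = [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, None),   # pooling keeps the channel count
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, None),
]

n, c = 32, 3
shapes = [(n, n, c)]
for name, f, s, c_out in layers:
    n = out_size(n, f, s)
    c = c if c_out is None else c_out
    shapes.append((n, n, c))

flattened = n * n * c
print(shapes)      # 32x32x3 -> 28x28x6 -> 14x14x6 -> 10x10x16 -> 5x5x16
print(flattened)   # 400 values feed the fully connected layers
```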

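CNNs are trained with gradient descent.

Implementing a Convolutional Neural Network in Python

The overall framework is as follows:
[Figure]

To implement a convolutional network we need to build the convolutional layer and the pooling layer.

The convolutional layer consists of the following modules:

1. zero padding

2. convolve window (multiplies one kernel with a small region of the image, sums the result, and adds the bias)

3. convolution forward

4. convolution backward

The pooling layer consists of the following modules:

1. pooling forward

2. create mask

3. distribute value

(Modules 2 and 3 are used to propagate gradients in the backward pass)

4. pooling backward

First, the convolutional layer:
[Figure]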

import numpy as np

def zero_pad(X, pad):
    """
    Pad with zeros all images of the dataset X. The padding is applied to the height and width of an image.

    Argument:
    X -- python numpy array of shape (m, n_H, n_W, n_C) representing a batch of m images
    pad -- integer, amount of padding around each image on vertical and horizontal dimensions

    Returns:
    X_pad -- padded image of shape (m, n_H + 2*pad, n_W + 2*pad, n_C)
    """
    # np.pad takes one (before, after) pair per axis; mode 'constant' fills with 0 by default.
    # Only the n_H and n_W axes are padded; the batch and channel axes are left unchanged.
    X_pad = np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)), 'constant')

    return X_pad

[Figure]

def conv_single_step(a_slice_prev, W, b):
    """
    Apply one filter defined by parameters W on a single slice (a_slice_prev) of the output activation
    of the previous layer.

    Arguments:
    a_slice_prev -- slice of input data of shape (f, f, n_C_prev)
    W -- Weight parameters contained in a window - matrix of shape (f, f, n_C_prev)
    b -- Bias parameters contained in a window - matrix of shape (1, 1, 1)

    Returns:
    Z -- a scalar value, result of convolving the sliding window (W, b) on a slice x of the input data
    """
    # Element-wise product between a_slice_prev and W.
    s = np.multiply(a_slice_prev, W)

    # Sum over all entries of the volume s.
    Z = np.sum(s)

    # Add bias b to Z; np.squeeze drops the singleton dimensions so that Z is a scalar.
    Z = np.squeeze(np.add(Z, b))

    return Z

[Figure]

$n_H = \lfloor \frac{n_{H_{prev}} - f + 2 * pad}{stride} \rfloor + 1$

$n_W = \lfloor \frac{n_{W_{prev}} - f + 2 * pad}{stride} \rfloor + 1$

$n_C$ = number of filters

def conv_forward(A_prev, W, b, hparameters):
    """
    Implements the forward propagation for a convolution function

    Arguments:
    A_prev -- output activations of the previous layer, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    W -- Weights, numpy array of shape (f, f, n_C_prev, n_C)
    b -- Biases, numpy array of shape (1, 1, 1, n_C)
    hparameters -- python dictionary containing "stride" and "pad"

    Returns:
    Z -- conv output, numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache of values needed for the conv_backward() function
    """
    # Retrieve dimensions from A_prev's shape
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape

    # Retrieve dimensions from W's shape
    (f, f, n_C_prev, n_C) = W.shape

    # Retrieve information from "hparameters"
    stride = hparameters['stride']
    pad = hparameters['pad']

    # Compute the dimensions of the CONV output volume using the formula given above (floor division).
    n_H = (n_H_prev - f + 2 * pad) // stride + 1
    n_W = (n_W_prev - f + 2 * pad) // stride + 1

    # Initialize the output volume Z with zeros.
    Z = np.zeros((m, n_H, n_W, n_C))

    # Create A_prev_pad by padding A_prev
    A_prev_pad = zero_pad(A_prev, pad)

    for i in range(m):                      # loop over the batch of training examples
        a_prev_pad = A_prev_pad[i]          # select the ith training example's padded activation
        for h in range(n_H):                # loop over vertical axis of the output volume
            for w in range(n_W):            # loop over horizontal axis of the output volume
                for c in range(n_C):        # loop over channels (= #filters) of the output volume

                    # Find the corners of the current "slice"
                    vert_start = h * stride
                    vert_end = h * stride + f
                    horiz_start = w * stride
                    horiz_end = w * stride + f

                    # Use the corners to define the (3D) slice of a_prev_pad
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]

                    # Convolve the (3D) slice with the correct filter W and bias b, to get back one output neuron
                    Z[i, h, w, c] = conv_single_step(a_slice_prev, W[:, :, :, c], b[:, :, :, c])

    # Making sure your output shape is correct
    assert(Z.shape == (m, n_H, n_W, n_C))

    # Save information in "cache" for the backprop
    cache = (A_prev, W, b, hparameters)

    return Z, cache

Next, the pooling layer:
[Figure]

[Figure]

def pool_forward(A_prev, hparameters, mode = "max"):
    """
    Implements the forward pass of the pooling layer

    Arguments:
    A_prev -- Input data, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    hparameters -- python dictionary containing "f" and "stride"
    mode -- the pooling mode you would like to use, defined as a string ("max" or "average")

    Returns:
    A -- output of the pool layer, a numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache used in the backward pass of the pooling layer, contains the input and hparameters
    """
    # Retrieve dimensions from the input shape
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape

    # Retrieve hyperparameters from "hparameters"
    f = hparameters["f"]
    stride = hparameters["stride"]

    # Define the dimensions of the output
    n_H = (n_H_prev - f) // stride + 1
    n_W = (n_W_prev - f) // stride + 1
    n_C = n_C_prev

    # Initialize output matrix A
    A = np.zeros((m, n_H, n_W, n_C))

    for i in range(m):                      # loop over the training examples
        for h in range(n_H):                # loop on the vertical axis of the output volume
            for w in range(n_W):            # loop on the horizontal axis of the output volume
                for c in range(n_C):        # loop over the channels of the output volume

                    # Find the corners of the current "slice"
                    vert_start = h * stride
                    vert_end = h * stride + f
                    horiz_start = w * stride
                    horiz_end = w * stride + f

                    # Use the corners to define the current slice on the ith training example of A_prev, channel c
                    a_prev_slice = A_prev[i, vert_start:vert_end, horiz_start:horiz_end, c]

                    # Compute the pooling operation on the slice, depending on the mode
                    if mode == "max":
                        A[i, h, w, c] = np.max(a_prev_slice)
                    elif mode == "average":
                        A[i, h, w, c] = np.mean(a_prev_slice)

    # Store the input and hparameters in "cache" for pool_backward()
    cache = (A_prev, hparameters)

    # Making sure your output shape is correct
    assert(A.shape == (m, n_H, n_W, n_C))

    return A, cache

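Backpropagation through the convolutional layer:

Forward pass, where $A_{prepad}$ is the zero-padded input (A_prev_pad in the code), the product is element-wise, and the sum runs over all entries of the window:

$Z[i,h,w,c] = \sum \left( A_{prepad}[i,h * stride:h * stride + f,w* stride:w * stride + f,:] * W[:,:,:,c] \right) + b[:,:,:,c]$

In the backward pass, each output position $(h,w)$ therefore contributes the following to its window, and the contributions of overlapping windows are added together:

$dA_{prepad}[i,h * stride:h * stride + f,w * stride:w * stride + f,:] \mathrel{+}= \sum_{c=1}^{n_C}dZ[i,h,w,c] * W[:,:,:,c]$

$dW[:,:,:,c] = \sum_{i=1}^{m}\sum_{h=1}^{n_H}\sum_{w=1}^{n_W} dZ[i,h,w,c] *A_{prepad}[i,h * stride:h * stride + f,w * stride:w * stride + f,:]$

$db[:,:,:,c] = \sum_{i=1}^{m}\sum_{h=1}^{n_H}\sum_{w=1}^{n_W} dZ[i,h,w,c]$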
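A quick way to gain confidence in these formulas is to compare them with finite differences. The sketch below is a simplified, self-contained check rather than the implementation from this article: it assumes no padding, a bias of shape `(n_C,)`, and the loss `sum(Z)`, so that `dZ` is all ones. The helper names are illustrative.

```python
import numpy as np

def conv_forward_simple(A, W, b, stride):
    """Z[i,h,w,c] = sum(window * W[..., c]) + b[c]; no padding."""
    m, n_H_prev, n_W_prev, _ = A.shape
    f, _, _, n_C = W.shape
    n_H = (n_H_prev - f) // stride + 1
    n_W = (n_W_prev - f) // stride + 1
    Z = np.zeros((m, n_H, n_W, n_C))
    for i in range(m):
        for h in range(n_H):
            for w in range(n_W):
                window = A[i, h*stride:h*stride+f, w*stride:w*stride+f, :]
                for c in range(n_C):
                    Z[i, h, w, c] = np.sum(window * W[:, :, :, c]) + b[c]
    return Z

def conv_backward_simple(dZ, A, W, stride):
    """Accumulate dA, dW, db window by window, following the formulas above."""
    f = W.shape[0]
    dA, dW = np.zeros_like(A), np.zeros_like(W)
    db = np.zeros(W.shape[3])
    m, n_H, n_W, n_C = dZ.shape
    for i in range(m):
        for h in range(n_H):
            for w in range(n_W):
                vs, hs = h * stride, w * stride
                for c in range(n_C):
                    dA[i, vs:vs+f, hs:hs+f, :] += W[:, :, :, c] * dZ[i, h, w, c]
                    dW[:, :, :, c] += A[i, vs:vs+f, hs:hs+f, :] * dZ[i, h, w, c]
                    db[c] += dZ[i, h, w, c]
    return dA, dW, db

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 5, 5, 3))
W = rng.standard_normal((3, 3, 3, 2))
b = rng.standard_normal(2)
stride = 2

Z = conv_forward_simple(A, W, b, stride)
dA, dW, db = conv_backward_simple(np.ones_like(Z), A, W, stride)

def loss(A_, W_):
    return conv_forward_simple(A_, W_, b, stride).sum()

# Central finite differences on one weight and one input entry.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2, 0, 1] += eps
W_minus[1, 2, 0, 1] -= eps
num_dW = (loss(A, W_plus) - loss(A, W_minus)) / (2 * eps)

A_plus, A_minus = A.copy(), A.copy()
A_plus[0, 2, 2, 1] += eps      # entry (2, 2) lies in all four overlapping windows
A_minus[0, 2, 2, 1] -= eps
num_dA = (loss(A_plus, W) - loss(A_minus, W)) / (2 * eps)

print(abs(num_dW - dW[1, 2, 0, 1]), abs(num_dA - dA[0, 2, 2, 1]))
```

The input entry at position (2, 2) is covered by every window when the stride is 2, so the check also exercises the accumulation over overlapping windows.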

def conv_backward(dZ, cache):
    """
    Implement the backward propagation for a convolution function

    Arguments:
    dZ -- gradient of the cost with respect to the output of the conv layer (Z), numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache of values needed for the conv_backward(), output of conv_forward()

    Returns:
    dA_prev -- gradient of the cost with respect to the input of the conv layer (A_prev),
               numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    dW -- gradient of the cost with respect to the weights of the conv layer (W)
          numpy array of shape (f, f, n_C_prev, n_C)
    db -- gradient of the cost with respect to the biases of the conv layer (b)
          numpy array of shape (1, 1, 1, n_C)
    """
    # Retrieve information from "cache"
    (A_prev, W, b, hparameters) = cache

    # Retrieve dimensions from A_prev's shape
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape

    # Retrieve dimensions from W's shape
    (f, f, n_C_prev, n_C) = W.shape

    # Retrieve information from "hparameters"
    stride = hparameters['stride']
    pad = hparameters['pad']

    # Retrieve dimensions from dZ's shape
    (m, n_H, n_W, n_C) = dZ.shape

    # Initialize dA_prev, dW, db with the correct shapes
    dA_prev = np.zeros_like(A_prev)
    dW = np.zeros_like(W)
    db = np.zeros_like(b)

    # Pad A_prev and dA_prev with the same amount used in the forward pass (pad, not stride)
    A_prev_pad = zero_pad(A_prev, pad)
    dA_prev_pad = zero_pad(dA_prev, pad)

    for i in range(m):                      # loop over the training examples

        # select ith training example from A_prev_pad and dA_prev_pad
        a_prev_pad = A_prev_pad[i, :, :, :]
        da_prev_pad = dA_prev_pad[i, :, :, :]

        for h in range(n_H):                # loop over vertical axis of the output volume
            for w in range(n_W):            # loop over horizontal axis of the output volume
                for c in range(n_C):        # loop over the channels of the output volume

                    # Find the corners of the current "slice"
                    vert_start = h * stride
                    vert_end = h * stride + f
                    horiz_start = w * stride
                    horiz_end = w * stride + f

                    # Use the corners to define the slice from a_prev_pad
                    a_slice = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]

                    # Update gradients for the window and the filter's parameters using the formulas given above
                    da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += W[:, :, :, c] * dZ[i, h, w, c]
                    dW[:, :, :, c] += a_slice * dZ[i, h, w, c]
                    db[:, :, :, c] += dZ[i, h, w, c]

        # Strip the padding so that dA_prev matches the shape of A_prev.
        # Explicit end indices are used because the slice pad:-pad is empty when pad == 0.
        dA_prev[i, :, :, :] = da_prev_pad[pad:pad + n_H_prev, pad:pad + n_W_prev, :]

    # Making sure your output shape is correct
    assert(dA_prev.shape == (m, n_H_prev, n_W_prev, n_C_prev))

    return dA_prev, dW, db

Backward pass for pooling

Max pooling backward:

$A[i,h,w,c] = \max\left(A_{prev}[i,\,h*stride:h*stride+f,\,w*stride:w*stride+f,\,c]\right)$

Write the window as $A_{prev}[i,\,vs:ve,\,hs:he,\,c]$ with $vs = h*stride$, $ve = vs+f$, $hs = w*stride$, $he = hs+f$ (vert_start, vert_end, horiz_start, and horiz_end in the code). $A[i,h,w,c]$ depends only on the maximum of this window,

so the gradient is passed only to the maximum entry.

Build a mask: the window has dimensions $f*f$, and so does the mask,

with the entry at the position of the window's maximum set to 1 and all other entries set to 0.

Backward pass: $dA_{prev}[i,\,vs:ve,\,hs:he,\,c] = mask * dA[i,h,w,c]$

Average pooling backward:

$A[i,h,w,c]$ is the mean of the window $A_{prev}[i,\,vs:ve,\,hs:he,\,c]$, so

$dA_{prev}[i,\,vs:ve,\,hs:he,\,c] = dA[i,h,w,c]/(f*f)$

When f and stride differ, windows overlap and an entry of $dA_{prev}$ can receive contributions from several $dA[i,h,w,c]$, so all contributions must be added together.
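The accumulation over overlapping windows is easy to see on a single-channel example. The sketch below is illustrative only (the helper name and the 3x3 input are assumptions): with f=2 and stride=1, all four windows share the same maximum entry, so that entry collects all four upstream gradients.

```python
import numpy as np

def max_pool_backward_1ch(a_prev, dA, f, stride):
    """Route each upstream gradient to the max entry of its window (single channel)."""
    dA_prev = np.zeros_like(a_prev)
    for h in range(dA.shape[0]):
        for w in range(dA.shape[1]):
            vs, hs = h * stride, w * stride
            window = a_prev[vs:vs + f, hs:hs + f]
            mask = (window == window.max())
            # += so that windows sharing an entry accumulate their gradients
            dA_prev[vs:vs + f, hs:hs + f] += mask * dA[h, w]
    return dA_prev

a_prev = np.array([[1.0, 2.0, 3.0],
                   [4.0, 9.0, 5.0],
                   [6.0, 7.0, 8.0]])
dA = np.ones((2, 2))      # f=2, stride=1 gives a 2x2 output
dA_prev = max_pool_backward_1ch(a_prev, dA, f=2, stride=1)
print(dA_prev)            # 4.0 at the centre, 0 elsewhere
```

Using plain assignment instead of `+=` would leave 1.0 at the centre and silently drop three of the four contributions.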

def create_mask_from_window(x):
    """
    Creates a mask from an input matrix x, to identify the max entry of x.

    Arguments:
    x -- Array of shape (f, f)

    Returns:
    mask -- Array of the same shape as window, contains a True at the position corresponding to the max entry of x.
    """
    mask = (x == np.max(x))

    return mask


def distribute_value(dz, shape):
    """
    Distributes the input value in the matrix of dimension shape

    Arguments:
    dz -- input scalar
    shape -- the shape (n_H, n_W) of the output matrix for which we want to distribute the value of dz

    Returns:
    a -- Array of size (n_H, n_W) for which we distributed the value of dz
    """
    # Retrieve dimensions from shape
    (n_H, n_W) = shape

    # Compute the value to distribute on the matrix
    average = 1.0 / (n_H * n_W)

    # Create a matrix where every entry is dz * average
    a = dz * np.full((n_H, n_W), average)

    return a


def pool_backward(dA, cache, mode = "max"):
    """
    Implements the backward pass of the pooling layer

    Arguments:
    dA -- gradient of cost with respect to the output of the pooling layer, same shape as A
    cache -- cache output from the forward pass of the pooling layer, contains the layer's input and hparameters
    mode -- the pooling mode you would like to use, defined as a string ("max" or "average")

    Returns:
    dA_prev -- gradient of cost with respect to the input of the pooling layer, same shape as A_prev
    """
    # Retrieve information from cache
    (A_prev, hparameters) = cache

    # Retrieve hyperparameters from "hparameters"
    stride = hparameters['stride']
    f = hparameters['f']

    # Retrieve dimensions from A_prev's shape and dA's shape
    m, n_H_prev, n_W_prev, n_C_prev = A_prev.shape
    m, n_H, n_W, n_C = dA.shape

    # Initialize dA_prev with zeros
    dA_prev = np.zeros_like(A_prev)

    for i in range(m):                      # loop over the training examples

        # select training example from A_prev
        a_prev = A_prev[i, :, :, :]

        for h in range(n_H):                # loop on the vertical axis
            for w in range(n_W):            # loop on the horizontal axis
                for c in range(n_C):        # loop over the channels (depth)

                    # Find the corners of the current "slice"
                    vert_start = h * stride
                    vert_end = h * stride + f
                    horiz_start = w * stride
                    horiz_end = w * stride + f

                    # Compute the backward propagation in both modes.
                    if mode == "max":

                        # Use the corners and "c" to define the current slice from a_prev
                        a_prev_slice = a_prev[vert_start:vert_end, horiz_start:horiz_end, c]
                        # Create the mask from a_prev_slice
                        mask = create_mask_from_window(a_prev_slice)
                        # Add (the mask multiplied by the correct entry of dA) to dA_prev
                        dA_prev[i, vert_start:vert_end, horiz_start:horiz_end, c] += np.multiply(dA[i, h, w, c], mask)

                    elif mode == "average":

                        # Get the value da from dA
                        da = dA[i, h, w, c]
                        # Define the shape of the filter as f x f
                        shape = (f, f)
                        # Add the distributed value of da to the correct slice of dA_prev
                        dA_prev[i, vert_start:vert_end, horiz_start:horiz_end, c] += distribute_value(da, shape)

    # Making sure your output shape is correct
    assert(dA_prev.shape == A_prev.shape)

    return dA_prev

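Classic Convolutional Neural Networks

The previous sections introduced the basic building blocks of a CNN, but how to combine them into an effective network remains a very open question. Studying classic networks that have been shown to work, and understanding the ideas behind them, is a very effective aid when designing your own. Understanding these networks and transferring them directly to your own application is also a very effective approach.

The rest of this article therefore describes several classic CNNs, covering their architecture and design ideas, and attempts to implement them with TensorFlow and Keras. The networks covered are:

AlexNet

VGGNet

ResNet

Google Inception Net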

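AlexNet:

AlexNet is a CNN proposed by Alex Krizhevsky, a student of Hinton.
It contains 5 convolutional layers and 3 fully connected layers.

It achieved a top-5 error rate of 16.4% on ILSVRC and has about 60 million parameters.

AlexNet established the dominance of deep learning in computer vision and also pushed deep learning into speech recognition, natural language processing, reinforcement learning, and other fields, opening the deep learning era.

Key features:

(1) Successful use of ReLU as the CNN activation function, which alleviates the vanishing-gradient problem of sigmoid in deeper networks
(2) Dropout during training, which randomly ignores some neurons to avoid overfitting; it is used mainly in the fully connected layers
(3) Overlapping max pooling, which avoids the blurring effect of average pooling; the stride is smaller than the pooling window, so windows overlap and the features become richer
(4) The LRN layer, which creates competition among neighboring neurons so that large responses become relatively larger, improving generalization (LRN is rarely used today)
(5) Data augmentation by cropping, flipping, and so on, which greatly increases the amount of data and reduces overfitting; PCA is applied to the RGB values and Gaussian noise with standard deviation 0.1 is added along the principal components, further lowering the error rate

(6) CUDA-accelerated training, which exploits the parallel computing power of GPUs

The network architecture is shown in the figure below:

Input image size: $227*227*3$

First convolutional layer: 96 kernels of size $11*11$, stride 4

LRN layer (rarely used today)

$3*3$ max-pooling layer, stride 2

$5*5$ convolutional layer, stride 1, padding=same, 256 kernels, followed by an LRN layer and a $3*3$ max-pooling layer with stride 2

Three $3*3$ convolutional layers, stride 1, padding=same, with 384, 384, and 256 kernels

$3*3$ max-pooling layer, stride 2

Flatten to 9216 dimensions ($6*6*256$), feed into the fully connected layers, and finally output through softmax
[Figure]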
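The 9216-dimensional flattened size can be reproduced from the layer hyperparameters. The trace below is a sketch that assumes the standard AlexNet layout, with a $3*3$ stride-2 max-pooling layer after the first, second, and fifth convolutional layers, and 'same' padding of 2 and 1 for the $5*5$ and $3*3$ kernels.

```python
def out_size(n, f, s=1, p=0):
    return (n + 2 * p - f) // s + 1

n = 227
trace = []
# (name, kernel size, stride, padding per side)
for name, f, s, p in [
    ("conv1 11x11/4", 11, 4, 0),
    ("pool1 3x3/2", 3, 2, 0),
    ("conv2 5x5 same", 5, 1, 2),
    ("pool2 3x3/2", 3, 2, 0),
    ("conv3 3x3 same", 3, 1, 1),
    ("conv4 3x3 same", 3, 1, 1),
    ("conv5 3x3 same", 3, 1, 1),
    ("pool5 3x3/2", 3, 2, 0),
]:
    n = out_size(n, f, s, p)
    trace.append((name, n))

flattened = n * n * 256   # conv5 has 256 kernels
print(trace)              # spatial size: 55, 27, 27, 13, 13, 13, 13, 6
print(flattened)          # 9216
```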

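VGGNet:

VGGNet is a deep CNN developed by the Visual Geometry Group at the University of Oxford together with researchers at Google DeepMind. It explores the relationship between the depth of a CNN and its performance. By repeatedly stacking small $3*3$ kernels and $2*2$ max-pooling layers, VGG builds networks of 16 to 19 layers. It extends easily and generalizes very well, and it is still often used to extract image features or as a source of pretrained weights for initialization.

CONV = 3*3 filters, s=1, padding=same

MAX-POOL = 2*2, s=2

In ILSVRC 2014 it reached a top-5 error rate of about 7.0%.

VGGNet has five convolutional stages, each with 2 to 3 convolutional layers and a max-pooling layer at the end to reduce the spatial size. All layers within a stage have the same number of kernels, and the number grows in later stages: 64-128-256-512-512. Several identical $3*3$ convolutional layers are stacked together: two stacked $3*3$ layers have the same receptive field as one $5*5$ layer, and three stacked $3*3$ layers have the same receptive field as one $7*7$ layer. With $C$ input and output channels, three $3*3$ layers use $27C^2$ weights against $49C^2$ for one $7*7$ layer, a saving of nearly half. Three stacked $3*3$ layers also apply more nonlinear transformations than one $7*7$ layer (three activation functions instead of one), which strengthens the feature-learning ability of the CNN.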
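The receptive-field and parameter claims can be checked with a few lines of arithmetic. This sketch assumes stride-1 convolutions, ignores biases, and uses an illustrative channel count of 256 for both input and output.

```python
def stacked_conv_params(f, n_layers, channels):
    """Weights (biases ignored) of n_layers f x f convs with `channels` in and out."""
    return n_layers * f * f * channels * channels

def receptive_field(f, n_layers):
    """Receptive field of n_layers stacked f x f convs with stride 1."""
    return 1 + n_layers * (f - 1)

C = 256  # illustrative channel count
three_3x3 = stacked_conv_params(3, 3, C)
one_7x7 = stacked_conv_params(7, 1, C)
ratio = three_3x3 / one_7x7

print(receptive_field(3, 2), receptive_field(3, 3))  # 5 and 7
print(three_3x3, one_7x7, round(ratio, 3))           # 1769472 3211264 0.551
```

The ratio 27/49 is independent of the channel count, which is where the saving of nearly half comes from.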

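VGGNet has a very large number of parameters, about 138 million, but its structure is highly uniform.

Training also uses multi-scale data augmentation: the original image is rescaled to different sizes S, and 224*224 patches are then randomly cropped from it to increase the amount of data.

Key observations:

(1) The LRN layer contributes little
(2) Deeper networks perform better
(3) Larger kernels can learn larger spatial features

The spatial size shrinks and the number of channels grows in a regular pattern.

Although VGGNet has more parameters than AlexNet, it needs fewer iterations to converge.
[Figure]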

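Implementing VGGNet with TensorFlow:

Define the convolution, fully connected, and max-pooling operations:

import tensorflow as tf   # written against the TensorFlow 1.x API (tf.get_variable, tf.contrib)

def conv_op(input_op, name, kh, kw, n_out, dh, dw, p):
    # kh, kw: kernel height and width; dh, dw: stride along height and width
    # input_op: input tensor; n_out: number of kernels (output channels); p: list collecting the parameters
    n_in = input_op.get_shape()[-1].value
    with tf.name_scope(name) as scope:
        kernel = tf.get_variable(name=scope + 'w',
                                 shape=[kh, kw, n_in, n_out],
                                 dtype=tf.float32,
                                 initializer=tf.contrib.layers.xavier_initializer_conv2d())
        conv = tf.nn.conv2d(input_op, kernel, strides=[1, dh, dw, 1], padding='SAME')
        bias_init_val = tf.constant(0.0, shape=[n_out], dtype=tf.float32)
        biases = tf.Variable(bias_init_val, trainable=True, name='b')
        activation = tf.nn.relu(tf.nn.bias_add(conv, biases))
        p += [kernel, biases]
        return activation

def fc_op(input_op, name, n_out, p):
    n_in = input_op.get_shape()[-1].value
    with tf.name_scope(name) as scope:
        kernel = tf.get_variable(name=scope + 'w',
                                 shape=[n_in, n_out],
                                 dtype=tf.float32,
                                 initializer=tf.contrib.layers.xavier_initializer())
        biases = tf.Variable(tf.constant(0.1, shape=[n_out], dtype=tf.float32), name='b')
        activation = tf.nn.relu_layer(input_op, kernel, biases, name=scope)
        p += [kernel, biases]
        return activation

def mpool_op(input_op, name, kh, kw, dh, dw):
    # Max-pooling helper called by inference_op. It is missing from the original listing;
    # this definition is an assumed reconstruction (SAME padding is an assumption).
    return tf.nn.max_pool(input_op,
                          ksize=[1, kh, kw, 1],
                          strides=[1, dh, dw, 1],
                          padding='SAME',
                          name=name)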

Define the network:

def inference_op(input_op, keep_prob):
    p = []

    # Stage 1
    conv1_1 = conv_op(input_op, name='conv1_1', kh=3, kw=3, n_out=64, dh=1, dw=1, p=p)
    conv1_2 = conv_op(conv1_1, name='conv1_2', kh=3, kw=3, n_out=64, dh=1, dw=1, p=p)
    pool1 = mpool_op(conv1_2, name='pool1', kh=2, kw=2, dh=2, dw=2)

    # Stage 2
    conv2_1 = conv_op(pool1, name='conv2_1', kh=3, kw=3, n_out=128, dh=1, dw=1, p=p)
    conv2_2 = conv_op(conv2_1, name='conv2_2', kh=3, kw=3, n_out=128, dh=1, dw=1, p=p)
    pool2 = mpool_op(conv2_2, name='pool2', kh=2, kw=2, dh=2, dw=2)

    # Stage 3
    conv3_1 = conv_op(pool2, name='conv3_1', kh=3, kw=3, n_out=256, dh=1, dw=1, p=p)
    conv3_2 = conv_op(conv3_1, name='conv3_2', kh=3, kw=3, n_out=256, dh=1, dw=1, p=p)
    conv3_3 = conv_op(conv3_2, name='conv3_3', kh=3, kw=3, n_out=256, dh=1, dw=1, p=p)
    pool3 = mpool_op(conv3_3, name='pool3', kh=2, kw=2, dh=2, dw=2)

    # Stage 4
    conv4_1 = conv_op(pool3, name='conv4_1', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    conv4_2 = conv_op(conv4_1, name='conv4_2', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    conv4_3 = conv_op(conv4_2, name='conv4_3', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    pool4 = mpool_op(conv4_3, name='pool4', kh=2, kw=2, dh=2, dw=2)

    # Stage 5
    conv5_1 = conv_op(pool4, name='conv5_1', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    conv5_2 = conv_op(conv5_1, name='conv5_2', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    conv5_3 = conv_op(conv5_2, name='conv5_3', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    pool5 = mpool_op(conv5_3, name='pool5', kh=2, kw=2, dh=2, dw=2)

    # Flatten the last pooling output into one vector per example
    shp = pool5.get_shape()
    flattened_shape = shp[1].value * shp[2].value * shp[3].value
    resh1 = tf.reshape(pool5, [-1, flattened_shape], name='resh1')

    # Fully connected layers with dropout
    fc6 = fc_op(resh1, name='fc6', n_out=4096, p=p)
    fc6_drop = tf.nn.dropout(fc6, keep_prob, name='fc6_drop')

    fc7 = fc_op(fc6_drop, name='fc7', n_out=4096, p=p)
    fc7_drop = tf.nn.dropout(fc7, keep_prob, name='fc7_drop')

    # 1000-way output followed by softmax
    fc8 = fc_op(fc7_drop, name='fc8', n_out=1000, p=p)
    softmax = tf.nn.softmax(fc8)
    predictions = tf.argmax(softmax, 1)
    return predictions, softmax, fc8, p

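ResNet

ResNet (Residual Neural Network) was proposed by Kaiming He and three colleagues at Microsoft Research. Using the residual unit, they successfully trained a 152-layer network and achieved a top-5 error rate of 3.57%. It has fewer parameters than VGGNet, greatly speeds up the training of very deep networks, and substantially improves accuracy.

ResNet is a milestone in deep learning: it made the training of extremely deep networks practical.

Increasing depth lets a network represent more complex models and abstract features at more levels, but very deep networks suffer from a serious long-range dependency problem. In the forward pass, information is lost or degraded along the way, so it is hard to pass it on intact. In the backward pass, vanishing gradients make it hard for the gradient to reach the early layers, and optimization becomes very difficult. This leads to the degradation problem: as depth increases, the training error first decreases, and then increases again as depth grows further.

ResNet adopts the idea of the skip connection, which routes the input directly around a block to its output. In the forward pass, later layers receive the input directly, which preserves the information. In the backward pass, the gradient flows directly from deep layers to shallow ones, which alleviates vanishing gradients and makes optimization easier.

In short: ResNet lets information flow forward and lets gradients flow back.

Andrew Ng's explanation in his course focuses on training difficulty and efficiency. Suppose a skip connection is added to a network.

With ReLU as the activation, $a^{l+2} = acti(z^{l+2}+a^l) = acti(W^{l+2}a^{l+1}+b^{l+2} +a^l)$. If $W^{l+2}$ and $b^{l+2}$ are 0, then $a^{l+2} = acti(a^l)$, and because $a^l$ is itself the output of a ReLU it is non-negative, so

$a^{l+2} =a^l$. The identity mapping is therefore easy to learn, and adding depth through skip connections does not make training harder. Performance is guaranteed not to suffer, which addresses the degradation problem, and if the added layers learn something useful the network does better than the identity.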
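The identity argument can be verified numerically. The sketch below uses dense layers instead of convolutions for brevity and assumes NumPy; `residual_block` and the layer size are illustrative. The input is passed through a ReLU first so that it is non-negative, as the argument requires.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(a_l, W1, b1, W2, b2):
    """a^{l+2} = relu(W2 @ relu(W1 @ a_l + b1) + b2 + a_l)."""
    a1 = relu(W1 @ a_l + b1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_l)

rng = np.random.default_rng(0)
n = 4
a_l = relu(rng.standard_normal(n))            # output of an earlier ReLU, so non-negative
W1, b1 = rng.standard_normal((n, n)), rng.standard_normal(n)
W2, b2 = np.zeros((n, n)), np.zeros(n)        # second layer's weights and bias set to zero

a_out = residual_block(a_l, W1, b1, W2, b2)
print(np.allclose(a_out, a_l))                # True: the block reduces to the identity
```

Whatever the first layer computes, zeroing the second layer's weights and bias makes the block pass its input through unchanged.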

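In practice, with the ResNet structure the increase in training error caused by adding more layers disappears, and test performance also improves as depth increases.
[Figure]

Skip connection:
[Figure]

$z^{l+1} = W^{l+1}a^l +b^{l+1}$

$a^{l+1} = acti(z^{l+1})$

$z^{l+2} = W^{l+2}a^{l+1} +b^{l+2}$

$a^{l+2} = acti(z^{l+2}+a^l)$

A skip connection can span different numbers of layers, and different residual structures are possible.
[Figure]

The last equation above adds $z^{l+2}$ and $a^l$ element-wise, so the two terms must have the same dimensions. If they do, they are added directly. If they do not, $a^l$ must first be transformed, that is:

$a^{l+2} = acti(z^{l+2}+W_sa^l)$. The corresponding network structure is shown in the figure below.
[Figure]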

Building ResNet with Keras; the network structure is shown in the figure below:
[Figure]

The details of this ResNet-50 model are:

Zero-padding pads the input with a pad of (3,3)
Stage 1:
- The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is "conv1".
- BatchNorm is applied to the channels axis of the input.
- MaxPooling uses a (3,3) window and a (2,2) stride.
Stage 2:
- The convolutional block uses three sets of filters of size [64,64,256], "f" is 3, "s" is 1 and the block is "a".
- The 2 identity blocks use three sets of filters of size [64,64,256], "f" is 3 and the blocks are "b" and "c".
Stage 3:
- The convolutional block uses three sets of filters of size [128,128,512], "f" is 3, "s" is 2 and the block is "a".
- The 3 identity blocks use three sets of filters of size [128,128,512], "f" is 3 and the blocks are "b", "c" and "d".
Stage 4:
- The convolutional block uses three sets of filters of size [256, 256, 1024], "f" is 3, "s" is 2 and the block is "a".
- The 5 identity blocks use three sets of filters of size [256, 256, 1024], "f" is 3 and the blocks are "b", "c", "d", "e" and "f".
Stage 5:
- The convolutional block uses three sets of filters of size [512, 512, 2048], "f" is 3, "s" is 2 and the block is "a".
- The 2 identity blocks use three sets of filters of size [512, 512, 2048], "f" is 3 and the blocks are "b" and "c".
The 2D Average Pooling uses a window of shape (2,2) and its name is "avg_pool".
The flatten doesn't have any hyperparameters or name.
The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be 'fc' + str(classes).

In the figure, the identity blocks are the identity skip connections over 3 layers described above, and the convolutional block has the structure with a transformed shortcut ($W_sa^l$).

First define the identity block (a 3-layer skip connection), that is, $a^{l+3} = acti(Z^{l+3}+a^l)$:

The first and third convolutions are $1*1$ convolutions with stride 1, which do not change the spatial dimensions. The second convolution has stride 1 and same padding, so it does not change them either. With the third convolution producing as many channels as the input has, $Z^{l+3}$ and $a^l$ have the same dimensions.

from keras import layers
from keras.layers import Input, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D
from keras.models import Model
from keras.initializers import glorot_uniform

def identity_block(X, f, filters, stage, block):
    """
    Implementation of the identity block

    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network

    Returns:
    X -- output of the identity block, tensor of shape (n_H, n_W, n_C)
    """
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    # Retrieve Filters
    F1, F2, F3 = filters

    # Save the input value. You'll need this later to add back to the main path.
    X_shortcut = X

    # First component of main path
    X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    # Second component of main path
    X = Conv2D(filters = F2, kernel_size = (f, f), strides = (1,1), padding = 'same', name = conv_name_base + '2b', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2b')(X)
    X = Activation('relu')(X)

    # Third component of main path
    X = Conv2D(filters = F3, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2c', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2c')(X)

    # Final step: Add shortcut value to main path, and pass it through a RELU activation
    X = layers.add([X, X_shortcut])
    X = Activation('relu')(X)

    return X

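Next define the convolutional block, that is, $a^{l+3} = acti(Z^{l+3}+W_sa^l)$:

The first convolution on the main path and the convolution on the shortcut path use the same stride, so they change the spatial dimensions in the same way. The second and third convolutions on the main path do not change the spatial dimensions, and the third convolution and the shortcut convolution both produce F3 channels. This guarantees that $Z^{l+3}$ and $W_sa^l$ have the same dimensions.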

def convolutional_block(X, f, filters, stage, block, s = 2):
    """
    Implementation of the convolutional block

    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    s -- Integer, specifying the stride to be used

    Returns:
    X -- output of the convolutional block, tensor of shape (n_H, n_W, n_C)
    """
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    # Retrieve Filters
    F1, F2, F3 = filters

    # Save the input value
    X_shortcut = X

    ##### MAIN PATH #####
    # First component of main path
    X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a', padding='valid', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    # Second component of main path
    X = Conv2D(F2, (f, f), strides = (1, 1), name = conv_name_base + '2b', padding='same', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2b')(X)
    X = Activation('relu')(X)

    # Third component of main path
    X = Conv2D(F3, (1, 1), strides = (1, 1), name = conv_name_base + '2c', padding='valid', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2c')(X)

    ##### SHORTCUT PATH #####
    X_shortcut = Conv2D(F3, (1, 1), strides = (s, s), name = conv_name_base + '1', padding='valid', kernel_initializer = glorot_uniform(seed=0))(X_shortcut)
    X_shortcut = BatchNormalization(axis = 3, name = bn_name_base + '1')(X_shortcut)

    # Final step: Add shortcut value to main path, and pass it through a RELU activation
    X = layers.add([X, X_shortcut])
    X = Activation('relu')(X)

    return X

Next, we implement the complete 50-layer ResNet.

def ResNet50(input_shape = (64, 64, 3), classes = 6):
    """
    Implementation of the popular ResNet50 with the following architecture:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> CONVBLOCK -> IDBLOCK*5 -> CONVBLOCK -> IDBLOCK*2 -> AVGPOOL -> TOPLAYER

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """
    # Define the input as a tensor with shape input_shape
    X_input = Input(shape=input_shape)

    # Zero-Padding
    X = ZeroPadding2D(padding=(3, 3))(X_input)

    # Stage 1
    X = Conv2D(filters=64, kernel_size=(7, 7), strides=(2, 2), name='conv1')(X)
    X = BatchNormalization(axis=3, name='bn1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block(X, f=3, filters=[64, 64, 256], stage=2, block='a', s=1)
    X = identity_block(X, f=3, filters=[64, 64, 256], stage=2, block='b')
    X = identity_block(X, f=3, filters=[64, 64, 256], stage=2, block='c')

    ### START CODE HERE ###
    # Stage 3 (≈4 lines)
    # The convolutional block uses three sets of filters of size [128,128,512], "f" is 3, "s" is 2 and the block is "a".
    # The 3 identity blocks use three sets of filters of size [128,128,512], "f" is 3 and the blocks are "b", "c" and "d".
    X = convolutional_block(X, f=3, filters=[128, 128, 512], stage=3, block='a', s=2)
    X = identity_block(X, f=3, filters=[128, 128, 512], stage=3, block='b')
    X = identity_block(X, f=3, filters=[128, 128, 512], stage=3, block='c')
    X = identity_block(X, f=3, filters=[128, 128, 512], stage=3, block='d')

    # Stage 4 (≈6 lines)
    # The convolutional block uses three sets of filters of size [256, 256, 1024], "f" is 3, "s" is 2 and the block is "a".
    # The 5 identity blocks use three sets of filters of size [256, 256, 1024], "f" is 3 and the blocks are "b", "c", "d", "e" and "f".
    X = convolutional_block(X, f=3, filters=[256, 256, 1024], stage=4, block='a', s=2)
    X = identity_block(X, f=3, filters=[256, 256, 1024], stage=4, block='b')
    X = identity_block(X, f=3, filters=[256, 256, 1024], stage=4, block='c')
    X = identity_block(X, f=3, filters=[256, 256, 1024], stage=4, block='d')
    X = identity_block(X, f=3, filters=[256, 256, 1024], stage=4, block='e')
    X = identity_block(X, f=3, filters=[256, 256, 1024], stage=4, block='f')

    # Stage 5 (≈3 lines)
    # The convolutional block uses three sets of filters of size [512, 512, 2048], "f" is 3, "s" is 2 and the block is "a".
    # The 2 identity blocks use three sets of filters of size [512, 512, 2048], "f" is 3 and the blocks are "b" and "c".
    # Note: the assignment text lists [256, 256, 2048] for the identity blocks, but that fails grading;
    # [512, 512, 2048] passes and keeps the bottleneck width consistent with the convolutional block.
    X = convolutional_block(X, f=3, filters=[512, 512, 2048], stage=5, block='a', s=2)
    X = identity_block(X, f=3, filters=[512, 512, 2048], stage=5, block='b')
    X = identity_block(X, f=3, filters=[512, 512, 2048], stage=5, block='c')

    # AVGPOOL (≈1 line). Use "X = AveragePooling2D(...)(X)"
    # The 2D Average Pooling uses a window of shape (2,2) and its name is "avg_pool".
    X = AveragePooling2D(pool_size=(2, 2), name='avg_pool')(X)

    # Output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes))(X)
    ### END CODE HERE ###

    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model
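
As a quick check on the name, the weighted layers along the main path of the function above can be counted by hand. This is only a sketch: it assumes every convolutional or identity block contributes three convolutions on its main path (as in the block definitions used here) and does not count the shortcut convolutions. The variable names are illustrative.

```python
# Blocks per stage in ResNet50 above: one convolutional block plus N identity blocks.
blocks_per_stage = {2: 1 + 2, 3: 1 + 3, 4: 1 + 5, 5: 1 + 2}

CONVS_PER_BLOCK = 3  # 1x1 -> fxf -> 1x1 on the main path of every block

stem_layers = 1      # the 7x7 'conv1' layer in stage 1
head_layers = 1      # the final Dense (fully connected) layer

total_blocks = sum(blocks_per_stage.values())
weighted_layers = stem_layers + total_blocks * CONVS_PER_BLOCK + head_layers

print(total_blocks)     # 16
print(weighted_layers)  # 50
```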

Google Inception Net

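Google Inception Net first appeared in the ILSVRC 2014 competition, where it took first place by a wide margin with a top-5 error rate of 6.67%. Inception V1 has 22 layers, yet it needs only about 1.5 billion floating-point operations and about 5 million parameters.

Inception can be very deep with only about 5 million parameters for two main reasons:

1. The fully connected layers are removed and replaced with an average pooling layer (this reduces overfitting and speeds up training).

2. The Inception Net architecture is carefully designed so that its parameters are used more efficiently.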

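Connections between neurons in the human brain are sparse, so researchers argued that the connections in large neural networks should be sparse as well. A sparse structure reduces the parameter count and mitigates overfitting (a convolutional network is itself a sparse version of a fully connected network). The main goal of Inception Net is to find the optimal sparse building block, the Inception Module.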

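A "good" sparse structure should connect clusters of highly correlated neurons. On a generic dataset this may require clustering the neurons, but in image data neighboring regions are naturally the most correlated. In the extreme case, features at the same spatial position but in different channels are the most correlated, and a $1*1$ convolution is an effective tool for combining features across channels at one spatial position. Features within a $3*3$ or $5*5$ region across channels are also strongly correlated, so they can be connected as well (with $3*3$ and $5*5$ convolutions), which adds diversity and improves generalization. In practice, applying a $3*3$ or $5*5$ convolution directly can require a very large number of parameters. Compressing the number of channels with a $1*1$ convolution first and then applying the $3*3$ or $5*5$ convolution reduces the parameter count dramatically. We therefore introduce the $1*1$ convolution first.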

$1*1$ convolution, also known as Network in Network

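Applying a $1*1$ convolution to an image means taking, at each spatial position (the same height and width position), the values across all channels, multiplying them element-wise with the $1*1$ kernel and summing, then adding a bias and passing the result through a nonlinear activation. As the figure below shows, a $1*1$ convolution is equivalent to a fully connected network over the channel features at each position, with all spatial positions sharing the same parameters, which is why it is also called Network in Network.
[Figure: a $1*1$ convolution applied across the channels at each spatial position]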

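Each channel of a convolutional network represents one type of feature. A $1*1$ convolution combines features across channels, which increases the network's expressive power.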
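
The "fully connected layer at every position" view of the $1*1$ convolution can be checked numerically. The NumPy sketch below is illustrative only: the tensor sizes and the ReLU activation are assumptions chosen for the example, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C_in, C_out = 4, 5, 8, 3            # illustrative sizes
x = rng.standard_normal((H, W, C_in))      # one feature map, channels last
w = rng.standard_normal((C_in, C_out))     # 1x1 kernels: one weight per (input, output) channel pair
b = rng.standard_normal(C_out)

# 1x1 convolution written as an explicit loop over spatial positions:
# every position applies the same small fully connected layer to its channel vector.
conv_loop = np.zeros((H, W, C_out))
for i in range(H):
    for j in range(W):
        conv_loop[i, j] = np.maximum(x[i, j] @ w + b, 0.0)   # linear map + bias + ReLU

# The same computation as one matrix product over the flattened positions.
conv_matmul = np.maximum(x.reshape(-1, C_in) @ w + b, 0.0).reshape(H, W, C_out)

print(np.allclose(conv_loop, conv_matmul))  # True
```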

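In addition, a $1*1$ convolution can change the number of output channels. Reducing the channel count first and then applying a larger convolution significantly reduces the number of parameters.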

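Suppose the goal is to take a $28*28*192$ input and produce a $28*28*32$ output using a $5*5$ convolution.

Using a $5*5$ convolution directly,

the number of parameters is $5*5*192*32 = 153600$,

and the forward pass (the same holds for the backward pass) requires $(5*5*192)*(28*28*32) \approx 1.2e8$ multiplications (each output value needs $5*5*192$ multiplications, and there are $28*28*32$ output values).

Using a $1*1$ convolution to reduce the channels first (to 16 in this example), followed by the $5*5$ convolution, also yields a $28*28*32$ output.

The number of parameters is $1*1*192*16 + 5*5*16*32 = 15872$.

The forward pass (the same holds for the backward pass) requires
$1*1*192*28*28*16 + 5*5*16*28*28*32 \approx 1.2e7$ multiplications.

The comparison shows that both the parameter count and the computation drop by roughly a factor of 10, which helps suppress overfitting and improves computational efficiency.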
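
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that ignores bias terms and counts only multiplications; the helper name `conv_cost` is made up for this example.

```python
def conv_cost(h_out, w_out, k, c_in, c_out):
    """Return (parameters, multiplications) of a k x k convolution, ignoring biases."""
    params = k * k * c_in * c_out
    mults = params * h_out * w_out      # k*k*c_in multiplications per output value
    return params, mults

# Direct 5x5 convolution: 28x28x192 -> 28x28x32
direct_params, direct_mults = conv_cost(28, 28, 5, 192, 32)

# 1x1 reduction to 16 channels, then 5x5 convolution to 32 channels
p1, m1 = conv_cost(28, 28, 1, 192, 16)
p2, m2 = conv_cost(28, 28, 5, 16, 32)
reduced_params, reduced_mults = p1 + p2, m1 + m2

print(direct_params, direct_mults)     # 153600 120422400
print(reduced_params, reduced_mults)   # 15872 12443648
print(round(direct_params / reduced_params, 1))  # 9.7
```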

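A $1*1$ convolution combines features from different channels and applies a nonlinearity, which makes the parameters more expressive. So although the parameter count drops considerably, the model's expressive power does not decrease noticeably as long as the number of kernels is not too small.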

Inception Module

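The previous section described the $1*1$ convolution. We now use it to build the Inception Module (Inception Net is simply multiple Inception Modules connected together). The basic structure of the Inception Module is shown below:
[Figure: basic structure of the Inception Module]

As shown above, the Inception Module has four branches. The first branch is a single $1*1$ convolution. The second applies a $1*1$ convolution for channel reduction, followed by a $3*3$ convolution. The third applies a $1*1$ convolution for channel reduction, followed by a $5*5$ convolution. The last branch applies max pooling and then a $1*1$ convolution that reduces the channel count so the output is compatible with the other branches. Finally, the outputs of the four branches are concatenated along the channel dimension.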
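
The four branches can be sketched in plain NumPy to make the shapes concrete. This is an illustration under several assumptions: stride 1, "same" padding, ReLU after every convolution, no biases, random weights, and channel counts (64, 96→128, 16→32, 32) picked as an example for a $28*28*192$ input. The helper names are hypothetical.

```python
import numpy as np

def conv_same(x, w):
    """Stride-1 'same' convolution followed by ReLU.
    x: (H, W, C_in); w: (k, k, C_in, C_out) with odd k."""
    k = w.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))          # zero padding
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]  # contribution of kernel tap (i, j)
    return np.maximum(out, 0.0)

def maxpool3_same(x):
    """3x3 max pooling with stride 1 and 'same' padding."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    windows = [xp[i:i + H, j:j + W, :] for i in range(3) for j in range(3)]
    return np.max(windows, axis=0)

def inception_module(x, p):
    b1 = conv_same(x, p["1x1"])                                   # branch 1: 1x1
    b2 = conv_same(conv_same(x, p["3x3_reduce"]), p["3x3"])       # branch 2: 1x1 -> 3x3
    b3 = conv_same(conv_same(x, p["5x5_reduce"]), p["5x5"])       # branch 3: 1x1 -> 5x5
    b4 = conv_same(maxpool3_same(x), p["pool_proj"])              # branch 4: max pool -> 1x1
    return np.concatenate([b1, b2, b3, b4], axis=-1)              # merge along channels

rng = np.random.default_rng(0)

def kernel(k, c_in, c_out):
    return rng.standard_normal((k, k, c_in, c_out)) * 0.01

params = {
    "1x1": kernel(1, 192, 64),
    "3x3_reduce": kernel(1, 192, 96), "3x3": kernel(3, 96, 128),
    "5x5_reduce": kernel(1, 192, 16), "5x5": kernel(5, 16, 32),
    "pool_proj": kernel(1, 192, 32),
}
x = rng.standard_normal((28, 28, 192))
y = inception_module(x, params)
print(y.shape)  # (28, 28, 256)
```

The output has 64 + 128 + 32 + 32 = 256 channels, while the spatial size is unchanged because every branch uses stride 1 with "same" padding.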

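The idea of the Inception Module is to let the network capture features at different scales by itself. In other words, the network learns which convolutions it needs and how to combine them, instead of having the kernel types designed by hand. This makes the network much more adaptable to features of different scales, so it can capture features of different sizes.
[Figure: Inception Net architecture built from stacked Inception Modules]

As shown above, Inception Net stacks multiple Inception Modules. Later modules are expected to capture higher-level abstract features and therefore need to cover larger spatial areas, so the share of $3*3$ and $5*5$ convolutions should increase in later modules.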

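Inception Net has 22 layers. Besides the output of the final layer, its intermediate nodes also classify well, so Inception Net attaches classification outputs to some intermediate layers and adds them, with a relatively small weight, to the final classification result. This acts like a model ensemble and also gives the network extra gradient signal during backpropagation, which improves generalization and helps training.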

Google Inception Net is a large family that includes Inception V1, Inception V2, Inception V3, and Inception V4.

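Inception V2 borrows from VGG: it replaces one $5*5$ convolution with two consecutive $3*3$ convolutions, which keeps the same receptive field while reducing the parameter count and adding nonlinearity. It also introduced Batch Normalization (covered in detail in the article <深度学习的优化> (Optimization in Deep Learning), so it is not repeated here).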
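
The claim that two stacked $3*3$ convolutions match one $5*5$ convolution in receptive field while using fewer weights can be verified with a short calculation. The sketch assumes stride 1, no biases, and the same number of input and output channels for every layer. The channel count of 64 and the function names are illustrative.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def stack_params(kernel_sizes, channels):
    """Weights in a stack of k x k convolutions with `channels` inputs and outputs each."""
    return sum(k * k * channels * channels for k in kernel_sizes)

C = 64  # illustrative channel count
print(receptive_field([5]), receptive_field([3, 3]))   # 5 5
print(stack_params([5], C), stack_params([3, 3], C))   # 102400 73728
```

With these assumptions the stacked version uses 18/25 = 72% of the weights of the single $5*5$ layer.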

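Inception V3 makes two main changes:

1. It introduces the idea of factorization into small convolutions, splitting a larger two-dimensional convolution into two smaller one-dimensional convolutions, for example a $7*7$ convolution into a $1*7$ and a $7*1$ convolution. This saves a large number of parameters and reduces overfitting, and the extra layer of nonlinearity increases the model's expressive power. The paper notes that this asymmetric factorization works better than a symmetric split into several identical small kernels, and that it can handle more and richer spatial features and increase feature diversity.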

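2. Inception V3 also uses branches within branches, which extracts high-level features at different levels of abstraction and enriches the network's expressive power.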
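
The factorization in point 1 can be illustrated on a single-channel image. The sketch below shows that a $1*7$ convolution followed by a $7*1$ convolution equals one $7*7$ convolution whose kernel is the outer product of the two, using 14 weights instead of 49. Two caveats: the equivalence holds only without a nonlinearity between the two steps, and only for $7*7$ kernels that can be written as such an outer product, whereas the real network inserts a nonlinearity. The helper `correlate_valid` is written just for this example.

```python
import numpy as np

def correlate_valid(img, kernel):
    """Plain 2-D cross-correlation without padding."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * img[i:i + oh, j:j + ow]
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((20, 20))
row = rng.standard_normal((1, 7))   # 1x7 kernel
col = rng.standard_normal((7, 1))   # 7x1 kernel

two_step = correlate_valid(correlate_valid(img, row), col)  # 1x7, then 7x1
one_step = correlate_valid(img, col @ row)                  # single 7x7 kernel (outer product)

print(np.allclose(two_step, one_step))        # True
print(row.size + col.size, (col @ row).size)  # 14 49
```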

Inception V4 mainly combines Inception with ResNet.

Convolutional neural networks in NLP

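Many NLP tasks involve sequence data. Recurrent neural networks, especially Attention + Bidirectional LSTM, can handle most NLP tasks, but recurrent networks also have some problems.

1. An RNN cannot capture a phrase on its own, without its prefix context.

2. The last words have a larger influence on the final vector.

3. A softmax needs to be computed.

One way to address these problems is the recursive neural network, but it requires a parse tree to be provided in advance. Convolutional neural networks are therefore worth considering.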

Next we describe how to use a convolutional neural network for text classification.

CNN for classification

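Applying a convolutional neural network to text classification requires solving two main problems:

1. How to handle text sequences of different lengths

2. How to capture phrases of different lengths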

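The core idea of the CNN: compute a representation for every group of adjacent words (every n-gram), regardless of whether it is actually a real phrase.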
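
To make "every group of adjacent words" concrete, the following sketch lists the windows a kernel of height n slides over. The example sentence and the function name `ngrams` are chosen only for illustration.

```python
def ngrams(tokens, n):
    """All windows of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the country of my birth".split()
print(ngrams(sentence, 2))
# [('the', 'country'), ('country', 'of'), ('of', 'my'), ('my', 'birth')]
```

A sentence of 5 words has 4 bigrams and 3 trigrams; the CNN computes a feature for each window whether or not it forms a meaningful phrase.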

Model architecture:
[Figure: CNN text classification model architecture]

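Each word is represented by a k-dimensional vector (which can be initialized with parameters trained by word2vec or GloVe). With a sequence length of n, the sequence is represented as an n*k matrix.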

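To cover each word completely, the kernel width must equal k. To capture phrases of different lengths, the kernel height h takes several values (h=1 captures unigrams, h=2 captures bigrams, h=3 captures trigrams, ...), and any number of kernels can be used. To avoid missing important phrases, the stride is 1. Because we are not building a deep CNN, the sequence may or may not be zero-padded (for simplicity, assume here that zeros are added only at the end). Convolving an $h*k$ kernel with a text sequence of length $n$ yields a vector of length $n-h+1$: $c =[c_1,c_2...c_{n-h+1}]$. The same kernel applied to text sequences of different lengths therefore produces vectors of different dimensions. The key to solving this is to add a pooling layer after the convolution, which reduces each vector to a scalar $\hat c = max(c)$. Max pooling follows the same idea as in the image domain: it keeps the most salient feature. All pooled scalars are then concatenated, so the output dimension depends only on the number of kernels, regardless of the input length. The concatenated vector is fed to a fully connected layer for classification, as follows: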

$z = [\hat c_1, \hat c_2, \ldots, \hat c_m]$

$y = softmax(W^sz+b)$

To improve generalization, dropout can be applied: $y = softmax(W^s(z \circ r)+b)$, where r is the dropout mask and $\circ$ denotes element-wise multiplication.

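The pooling operation effectively solves the problem of varying text sequence lengths.

Choosing kernels of different sizes effectively captures phrase vectors of different lengths.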
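
The pipeline described above (kernels of several heights, max pooling over positions, concatenation, dropout mask, softmax) can be sketched in NumPy. Everything concrete here is an assumption made for illustration: the dimensions, the random weights, the omission of kernel biases and activation, and the names `text_cnn_forward` and `keep_prob`.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8                      # word-vector dimension (illustrative)
heights = [1, 2, 3]        # kernel heights: unigram, bigram, trigram
kernels_per_height = 4     # number of kernels for each height
n_classes = 2

# One kernel bank of shape (num_kernels, h, k) per height, plus the softmax layer.
kernels = {h: rng.standard_normal((kernels_per_height, h, k)) for h in heights}
m = len(heights) * kernels_per_height
W_s = rng.standard_normal((n_classes, m))
b = np.zeros(n_classes)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def text_cnn_forward(sentence, keep_prob=1.0):
    """sentence: (n, k) matrix of word vectors with n >= max(heights)."""
    n = sentence.shape[0]
    pooled = []
    for h in heights:
        # The n-h+1 windows of h adjacent words, each of shape (h, k)
        windows = np.stack([sentence[i:i + h] for i in range(n - h + 1)])
        # c[f, i] = sum of (kernel f * window i): one vector c of length n-h+1 per kernel
        c = np.einsum('whk,fhk->fw', windows, kernels[h])
        pooled.append(c.max(axis=1))           # max over positions: one scalar per kernel
    z = np.concatenate(pooled)                 # length m, independent of n
    r = rng.random(m) < keep_prob              # dropout mask r (all ones when keep_prob=1)
    return z, softmax(W_s @ (z * r) + b)

z_short, y_short = text_cnn_forward(rng.standard_normal((5, k)))
z_long, y_long = text_cnn_forward(rng.standard_normal((30, k)))
print(z_short.shape, z_long.shape)   # (12,) (12,)
```

Sentences of 5 and 30 words both produce a 12-dimensional feature vector, because the pooled dimension equals the total number of kernels.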

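There is also a trick concerning the word vectors. If gradients are allowed to flow freely into the word vectors, the vectors move according to the classification objective and lose their general semantic similarity. The remedy is to use two copies of the same word vectors, called two channels: one channel is trainable and the other is fixed. The convolution results of both channels are fed into the max-pooling step (the two inputs in the model architecture represent the two channels). This captures two kinds of information at once: 1. the semantic and syntactic similarity learned by word2vec, and 2. the objective of the task at hand.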

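CNN extensions:

One can vary the kernel lengths, the number of kernels, the padding scheme, the pooling method, and so on, or stack multiple CNN layers. In short, many extensions are possible, and they are not discussed further here.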

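CNN application: machine translation

Probably the first neural machine translation model, with a CNN as the encoder and an RNN as the decoder:

[Figure: CNN encoder with RNN decoder for machine translation]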

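A convolutional network example

The following example comes from the week 1 assignment of the Convolutional Neural Networks course in Andrew Ng's deep learning series. It implements a classification task with a convolutional neural network in TensorFlow.

Problem description: the inputs are images of hand signs showing the digits 0 to 5, and the labels are the digits 0 to 5.

1. Load the dataset and normalize the images (this step is often necessary in image processing, since we do not want the feature values to be too large), then convert the labels to one-hot vectors.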

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()
X_train = X_train_orig/255.
X_test = X_test_orig/255.
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T
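
The helper `convert_to_one_hot` comes from the course utilities and its body is not shown here. The sketch below is a guess at what such a helper might look like, assuming the raw labels arrive as integers with shape (1, m). It is deliberately given a different name to make clear that it is not the course's code.

```python
import numpy as np

def convert_to_one_hot_sketch(Y, C):
    """Map integer labels of shape (1, m) to a one-hot matrix of shape (C, m)."""
    return np.eye(C)[Y.reshape(-1)].T

Y_orig = np.array([[0, 5, 2, 1]])
Y = convert_to_one_hot_sketch(Y_orig, 6).T   # transpose to (m, C), as in the code above
print(Y.shape)  # (4, 6)
print(Y[1])     # [0. 0. 0. 0. 0. 1.]
```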

2. Create placeholders

def create_placeholders(n_H0, n_W0, n_C0, n_y):
    """
    Creates the placeholders for the tensorflow session.

    Arguments:
    n_H0 -- scalar, height of an input image
    n_W0 -- scalar, width of an input image
    n_C0 -- scalar, number of channels of the input
    n_y -- scalar, number of classes

    Returns:
    X -- placeholder for the data input, of shape [None, n_H0, n_W0, n_C0] and dtype "float"
    Y -- placeholder for the input labels, of shape [None, n_y] and dtype "float"
    """

    ### START CODE HERE ### (≈2 lines)
    X = tf.placeholder(dtype=tf.float32,shape=[None,n_H0,n_W0,n_C0],name='X')
    Y = tf.placeholder(dtype=tf.float32,shape=[None,n_y],name='Y')
    ### END CODE HERE ###

    return X, Y

3. Initialize the parameters

def initialize_parameters():
    """
    Initializes weight parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [4, 4, 3, 8]
                        W2 : [2, 2, 8, 16]
    Returns:
    parameters -- a dictionary of tensors containing W1, W2
    """

    tf.set_random_seed(1)                              # so that your "random" numbers match ours

    ### START CODE HERE ### (approx. 2 lines of code)
    W1 = tf.get_variable(name='W1',shape=[4,4,3,8],dtype=tf.float32,initializer=tf.contrib.layers.xavier_initializer(seed=0))
    W2 = tf.get_variable(name='W2',shape=[2,2,8,16],dtype=tf.float32,initializer=tf.contrib.layers.xavier_initializer(seed=0))
    ### END CODE HERE ###

    parameters = {"W1": W1,
                  "W2": W2}

    return parameters

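4. Define forward propagation (in a deep learning framework, only the forward pass needs to be defined)

The network architecture is as follows: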

 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use an 8 by 8 filter size and an 8 by 8 stride, padding is "SAME"
 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use a 4 by 4 filter size and a 4 by 4 stride, padding is "SAME"
 - Flatten the previous output.
 - FULLYCONNECTED (FC) layer
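
It helps to trace the tensor shapes through this architecture before writing the code. The sketch below assumes 64 x 64 x 3 input images and uses the standard rule for "SAME" padding, under which the spatial output size is ceil(input size / stride). The helper name `same_out` is made up for this example.

```python
import math

def same_out(size, stride):
    """Spatial output size under 'SAME' padding: ceil(size / stride)."""
    return math.ceil(size / stride)

layers = [
    ("conv1", 1, 8),     # W1 has shape [4, 4, 3, 8]  -> 8 output channels
    ("pool1", 8, None),  # pooling keeps the channel count
    ("conv2", 1, 16),    # W2 has shape [2, 2, 8, 16] -> 16 output channels
    ("pool2", 4, None),
]

h, w, c = 64, 64, 3
for name, stride, out_channels in layers:
    h, w = same_out(h, stride), same_out(w, stride)
    c = out_channels or c
    print(name, (h, w, c))

flat = h * w * c
print("flatten", flat)   # flatten 64
```

The fully connected layer therefore maps 64 flattened features to the 6 output classes.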

def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED

    Arguments:
    X -- input dataset placeholder, of shape (None, n_H0, n_W0, n_C0)
    parameters -- python dictionary containing your parameters "W1", "W2"
                  the shapes are given in initialize_parameters

    Returns:
    Z3 -- the output of the last LINEAR unit
    """

    # Retrieve the parameters from the dictionary "parameters" 
    W1 = parameters['W1']
    W2 = parameters['W2']

    ### START CODE HERE ###
    # CONV2D: stride of 1, padding 'SAME'
    Z1 = tf.nn.conv2d(X,W1, strides = (1,1,1,1), padding = 'SAME')
    # RELU
    A1 = tf.nn.relu(Z1)
    # MAXPOOL: window 8x8, stride 8, padding 'SAME'
    P1 = tf.nn.max_pool(A1, ksize = (1,8,8,1), strides = (1,8,8,1),padding='SAME')
    # CONV2D: filters W2, stride 1, padding 'SAME'
    Z2 = tf.nn.conv2d(P1,W2,strides=(1,1,1,1),padding='SAME')
    # RELU
    A2 = tf.nn.relu(Z2)
    # MAXPOOL: window 4x4, stride 4, padding 'SAME'
    P2 = tf.nn.max_pool(A2, ksize = (1,4,4,1), strides = (1,4,4,1),padding='SAME')
    # FLATTEN
    P2 = tf.contrib.layers.flatten(P2)
    # FULLY-CONNECTED without non-linear activation function (do not call softmax).
    # 6 neurons in output layer. Hint: one of the arguments should be "activation_fn=None" 
    Z3 = tf.contrib.layers.fully_connected(P2, 6,activation_fn=None)
    ### END CODE HERE ###  

    return Z3

5. Define the loss function; classification uses cross_entropy

def compute_cost(Z3, Y):
    """
    Computes the cost

    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (number of examples, 6)
    Y -- "true" labels vector placeholder, same shape as Z3

    Returns:
    cost - Tensor of the cost function
    """

    ### START CODE HERE ### (1 line of code)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=Z3,labels=Y))
    ### END CODE HERE ###

    return cost
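
To see what the cost computes, here is a NumPy sketch of the mean softmax cross-entropy over a batch. It is not TensorFlow's implementation, only the same formula with the usual max-subtraction trick for numerical stability; the function name is illustrative.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy; logits and one-hot labels both have shape (m, n_classes)."""
    shifted = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(labels * log_probs).sum(axis=1).mean())

logits = np.zeros((2, 6))                 # uniform prediction over 6 classes
labels = np.eye(6)[[0, 3]]
print(round(softmax_cross_entropy(logits, labels), 4))   # 1.7918
```

With all-zero logits every class gets probability 1/6, so the loss is ln 6 ≈ 1.7918, a useful reference for the cost of an untrained 6-class model.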

6. Build the model, which mainly consists of the following steps:

create placeholders
initialize parameters
forward propagate
compute the cost

create an optimizer

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.009,
          num_epochs = 100, minibatch_size = 64, print_cost = True):
    """
    Implements a three-layer ConvNet in Tensorflow:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED

    Arguments:
    X_train -- training set, of shape (None, 64, 64, 3)
    Y_train -- training labels, of shape (None, n_y = 6)
    X_test -- test set, of shape (None, 64, 64, 3)
    Y_test -- test labels, of shape (None, n_y = 6)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 5 epochs

    Returns:
    train_accuracy -- real number, accuracy on the train set (X_train)
    test_accuracy -- real number, testing accuracy on the test set (X_test)
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

    ops.reset_default_graph()                         # to be able to rerun the model without overwriting tf variables
    tf.set_random_seed(1)                             # to keep results consistent (tensorflow seed)
    seed = 3                                          # to keep results consistent (numpy seed)
    (m, n_H0, n_W0, n_C0) = X_train.shape
    n_y = Y_train.shape[1]
    costs = []                                        # To keep track of the cost

    # Create Placeholders of the correct shape
    ### START CODE HERE ### (1 line)
    X, Y = create_placeholders(n_H0, n_W0, n_C0, n_y)
    ### END CODE HERE ###

    # Initialize parameters
    ### START CODE HERE ### (1 line)
    parameters = initialize_parameters()
    ### END CODE HERE ###

    # Forward propagation: Build the forward propagation in the tensorflow graph
    ### START CODE HERE ### (1 line)
    Z3 = forward_propagation(X, parameters)
    ### END CODE HERE ###

    # Cost function: Add cost function to tensorflow graph
    ### START CODE HERE ### (1 line)
    cost = compute_cost(Z3, Y)
    ### END CODE HERE ###

    # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
    ### START CODE HERE ### (1 line)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    ### END CODE HERE ###

    # Initialize all the variables globally
    init = tf.global_variables_initializer()

    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:

        # Run the initialization
        sess.run(init)

        # Do the training loop
        for epoch in range(num_epochs):

            minibatch_cost = 0.
            num_minibatches = int(m / minibatch_size) # number of minibatches of size minibatch_size in the train set
            seed = seed + 1
            minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)

            for minibatch in minibatches:

                # Select a minibatch
                (minibatch_X, minibatch_Y) = minibatch
                # IMPORTANT: The line that runs the graph on a minibatch.
                # Run the session to execute the optimizer and the cost, the feed_dict should contain a minibatch for (X,Y).
                ### START CODE HERE ### (1 line)
                _ , temp_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
                ### END CODE HERE ###

                minibatch_cost += temp_cost / num_minibatches

            # Print the cost every 5 epochs and record it every epoch
            if print_cost == True and epoch % 5 == 0:
                print ("Cost after epoch %i: %f" % (epoch, minibatch_cost))
            if print_cost == True and epoch % 1 == 0:
                costs.append(minibatch_cost)

        # plot the cost (one recorded value per epoch)
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('epochs')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        # Calculate the correct predictions
        predict_op = tf.argmax(Z3, 1)
        correct_prediction = tf.equal(predict_op, tf.argmax(Y, 1))

        # Calculate accuracy on the train and test sets
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
        print(accuracy)
        train_accuracy = accuracy.eval({X: X_train, Y: Y_train})
        test_accuracy = accuracy.eval({X: X_test, Y: Y_test})
        print("Train Accuracy:", train_accuracy)
        print("Test Accuracy:", test_accuracy)

        return train_accuracy, test_accuracy, parameters
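
The training loop relies on `random_mini_batches`, a course helper whose body is not shown in this article. The following is a hypothetical sketch of such a helper, assuming examples are stored along the first axis. The name `random_mini_batches_sketch` and its details are assumptions, not the course's code.

```python
import numpy as np

def random_mini_batches_sketch(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y) together along the first axis and split into mini-batches."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                 # same permutation for inputs and labels
    X_shuffled, Y_shuffled = X[perm], Y[perm]
    batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size         # the last batch may be smaller
        batches.append((X_shuffled[start:end], Y_shuffled[start:end]))
    return batches

X_demo = np.zeros((150, 64, 64, 3))
Y_demo = np.zeros((150, 6))
batches = random_mini_batches_sketch(X_demo, Y_demo, 64, seed=3)
print([bx.shape[0] for bx, _ in batches])   # [64, 64, 22]
```

Changing the seed every epoch, as the training loop does with `seed = seed + 1`, gives a different shuffle in each epoch.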

7. Train the model

_, _, parameters = model(X_train, Y_train, X_test, Y_test)