NNDL 作业7：第五章课后题（1×1 卷积核 | CNN BP）

最新推荐文章于 2023-10-06 21:45:23 发布

萐茀37

最新推荐文章于 2023-10-06 21:45:23 发布

阅读量372

点赞数 1

文章标签：深度学习 cnn 神经网络

本文链接：https://blog.csdn.net/qq_56829032/article/details/127600125

版权

习题5-2 证明宽卷积具有交换性,即公式(5.13)

习题5-3 分析卷积神经网络中用1×1的卷积核的作用

习题5-4 对于一个输入为100×100×256的特征映射组，使用3×3的卷积核，输出为100×100×256的特征映射组的卷积层，求其时间和空间复杂度。如果引入一个1×1的卷积核，先得到100×100×64的特征映射，再进行3×3的卷积，得到100×100×256的特征映射组，求其时间和空间复杂度

习题5-7 忽略激活函数，分析卷积网络中卷积层的前向计算和反向传播是一种转置关系

推导CNN反向传播算法（选做）

设计简易CNN模型，分别用Numpy、Pytorch实现卷积层和池化层的反向传播算子，并代入数值测试.(选做)

习题5-2 证明宽卷积具有交换性,即公式(5.13)

根据卷积，有： $y_{ij}=\sum^U_{u=1}\sum^V_{v=1}w_{uv}\cdot x_{i+u-1}$

变换坐标： $k=i-u+1,d=j-v+1$ ，于是有： $u=k-i+1,v=d-j+1$

因此原式变为：

$y_{ij}=\sum_{k=i-u+1}^{U+i-1}\sum_{d=j-v+1}^{V+j-1}x_{kd} \cdot w_{k-i+1,d-j+1}$

根据宽卷积的性质，宽卷积仅是进行了X的零填充，可发现等式仍成立，故宽卷积也符合可交换性。

习题5-3 分析卷积神经网络中用1×1的卷积核的作用

1*1卷积过滤器和正常的过滤器一样，唯一不同的是它的大小是1*1，没有考虑在前一层局部信息之间的关系。最早出现在 Network In Network的论文中，使用1*1卷积是想加深加宽网络结构，在Inception网络（ Going Deeper with Convolutions ）中用来降维，如下图：

这里写图片描述

由于3*3卷积或者5*5卷积在几百个filter的卷积层上做卷积操作时相当耗时，所以1*1卷积在3*3卷积或者5*5卷积计算之前先降低维度。
那么，1*1卷积的主要作用有以下几点：

降维（ dimension reductionality ）

比如，一张500 * 500*100 的图片用20个1*1*100的filter做卷积，那么结果的大小为500*500*20。

升维（用最少的参数拓宽网络channal）

例子：64的卷积核的channel是64，只需添加一个1*1，256的卷积核，只用64*256个参数就能把网络channel从64拓宽四倍到256。

加入非线性

想在不增加感受野的情况下，让网络加深，为的就是引入更多的非线性。而 1x1 卷积核，恰巧可以办到。卷积后生成图片的尺寸受卷积核的大小和卷积核个数影响，但如果卷积核是 1x1 ，个数也是 1，那么生成后的图像长宽不变，通道数为1。但通常一个卷积层是包含激活和池化的。也就是多了激活函数，比如 Sigmoid 和 Relu。所以，在输入不发生尺寸的变化下，加入卷积层的同时引入了更多的非线性，这将增强神经网络的表达能力。

习题5-4 对于一个输入为100×100×256的特征映射组，使用3×3的卷积核，输出为100×100×256的特征映射组的卷积层，求其时间和空间复杂度。如果引入一个1×1的卷积核，先得到100×100×64的特征映射，再进行3×3的卷积，得到100×100×256的特征映射组，求其时间和空间复杂度

情况一：=100;=3;=256 ;=256

时间复杂度一：100×100×3×3×256×256 = 5898240000

空间复杂度一：100×100×256 = 2560000

情况二：=100;=1;=3;=256;=64;=64;=256

时间复杂度二：100×100×1×1×256×64 + 100×100×3×3×64×256 = 1638400000

空间复杂度二：100×100×64 + 100×100×256 = 3200000

习题5-7 忽略激活函数，分析卷积网络中卷积层的前向计算和反向传播是一种转置关系

以一个3×3的卷积核为例，输入为X输出为Y
在这里插入图片描述 $X=\begin{pmatrix} x_1&x_2&x_3&x_4\ x_5&x_6&x_7&x_8\ x_9&x_{10}&x_{11}&x_{12}\ x_{13}&x_{14}&x_{15}&x_{16}\ \end{pmatrix}W=\begin{pmatrix} w_{00}&w_{01}&w_{02}\ w_{10}&w_{11}&w_{12}\ w_{20}&w_{21}&w_{22}\ \end{pmatrix}Y=\begin{pmatrix} y_1&y_2\ y_3&y_4\ \end{pmatrix}$

将4×4的输入特征展开为16×1的矩阵，y展开为4×1的矩阵，将卷积计算转化为矩阵相乘
在这里插入图片描述

由
而
即
所以

再看一下上面的Y=CX可以发现忽略激活函数时卷积网络中卷积层的前向计算和反向传播是一种转置关系。

推导CNN反向传播算法（选做）

CNN反向传播的不同之处：
首先要注意的是，一般神经网络中每一层输入输出a,z都只是一个向量，而CNN中的a,z是一个三维张量，即由若干个输入的子矩阵组成。其次：

池化层没有激活函数。这个问题倒比较好解决，我们可以令池化层的激活函数为σ(z)=z，即激活后就是自己本身。这样池化层激活函数的导数为1。
池化层在前向传播的时候，对输入进行了压缩，那么我们向前反向推导上一层的误差时，需要做upsample处理。
卷积层是通过张量卷积，或者说若干个矩阵卷积求和而得到当前层的输出，这和一般的网络直接进行矩阵乘法得到当前层的输出不同。这样在卷积层反向传播的时候，上一层误差的递推计算方法肯定有所不同。
对于卷积层，由于W使用的运算是卷积，那么由该层误差推导出该层的所有卷积核的W,b的方式也不同。由于卷积层可以有多个卷积核，各个卷积核的处理方法是完全相同且独立的。

1、已知池化层的误差，反向推导上一隐藏层的误差

在前向传播时，池化层我们会用MAX或者Average对输入进行池化，池化的区域大小已知。现在我们反过来，要从缩小后区域的误差，还原前一层较大区域的误差。这个过程叫做upsample。假设我们的池化区域大小是2x2。第l层误差的第k个子矩阵为:

在这里插入图片描述

如果池化区域表示为a*a大小，那么我们把上述矩阵上下左右各扩展a-1行和列进行还原：

在这里插入图片描述

如果是MAX，假设我们之前在前向传播时记录的最大值位置分别是左上，右下，右上，左下，则转换后的矩阵为：

在这里插入图片描述

如果是Average，则进行平均，转换后的矩阵为：

在这里插入图片描述

上边这个矩阵就是误差矩阵经过upsample之后的矩阵，那么，由后一层误差推导出前一层误差的公式为：

在这里插入图片描述

上式和普通网络的反向推导误差很类似：

可以看到，只有第一项不同。

2、已知卷积层的误差，反向推导上一隐藏层的误差

公式如下：

在这里插入图片描述

普通网络的反向推导误差的公式：

在这里插入图片描述

可以看到区别在于，下一层的权重w的转置操作，变成了旋转180度的操作，也就是上下翻转一次，左右再翻转一次，这其实就是“卷积”一词的意义（我们可简单理解为数学上的trick），可参考下图，Q是下一层的误差，周围补0方便计算，W是180度翻转后的卷积核，P是W和Q做卷积的结果：

在这里插入图片描述

3、已知卷积层的误差，推导该层的W,b的梯度

经过以上各步骤，我们已经算出每一层的误差了，那么：

对于全连接层，可以按照普通网络的反向传播算法求该层W,b的梯度。
对于池化层，它并没有W,b,也不用求W,b的梯度。
只有卷积层的W,b需要求出，先看w：

在这里插入图片描述
再对比一下普通网络的求w梯度的公式，发现区别在于，对前一层的输出做翻转180度的操作：

在这里插入图片描述

而对于b,则稍微有些特殊，因为在CNN中，误差δ是三维张量，而b只是一个向量，不能像普通网络中那样直接和误差δ相等。通常的做法是将误差δ的各个子矩阵的项分别求和，得到一个误差向量，即为b的梯度：

在这里插入图片描述

设计简易CNN模型，分别用Numpy、Pytorch实现卷积层和池化层的反向传播算子，并代入数值测试.(选做)

卷积层的反向传播实现：

from typing import Dict, Tuple
import numpy as np
import pytest
import torch
 
def conv2d_forward(input: np.ndarray, weight: np.ndarray, bias: np.ndarray,
                   stride: int, padding: int) -> Dict[str, np.ndarray]:
    """2D Convolution Forward Implemented with NumPy
    Args:
        input (np.ndarray): The input NumPy array of shape (H, W, C).
        weight (np.ndarray): The weight NumPy array of shape
            (C', F, F, C).
        bias (np.ndarray | None): The bias NumPy array of shape (C').
            Default: None.
        stride (int): Stride for convolution.
        padding (int): The count of zeros to pad on both sides.
    Outputs:
        Dict[str, np.ndarray]: Cached data for backward prop.
    """
    h_i, w_i, c_i = input.shape
    c_o, f, f_2, c_k = weight.shape
 
    assert (f == f_2)
    assert (c_i == c_k)
    assert (bias.shape[0] == c_o)
    input_pad = np.pad(input, [(padding, padding), (padding, padding), (0, 0)])
 
    def cal_new_sidelngth(sl, s, f, p):
        return (sl + 2 * p - f) // s + 1
 
    h_o = cal_new_sidelngth(h_i, stride, f, padding)
    w_o = cal_new_sidelngth(w_i, stride, f, padding)
    output = np.empty((h_o, w_o, c_o), dtype=input.dtype)
 
    for i_h in range(h_o):
        for i_w in range(w_o):
            for i_c in range(c_o):
                h_lower = i_h * stride
                h_upper = i_h * stride + f
                w_lower = i_w * stride
                w_upper = i_w * stride + f
                input_slice = input_pad[h_lower:h_upper, w_lower:w_upper, :]
                kernel_slice = weight[i_c]
                output[i_h, i_w, i_c] = np.sum(input_slice * kernel_slice)
                output[i_h, i_w, i_c] += bias[i_c]
 
    cache = dict()
    cache['Z'] = output
    cache['W'] = weight
    cache['b'] = bias
    cache['A_prev'] = input
    return cache
 
def conv2d_backward(dZ: np.ndarray, cache: Dict[str, np.ndarray], stride: int,
                    padding: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """2D Convolution Backward Implemented with NumPy
    Args:
        dZ: (np.ndarray): The derivative of the output of conv.
        cache (Dict[str, np.ndarray]): Record output 'Z', weight 'W', bias 'b'
            and input 'A_prev' of forward function.
        stride (int): Stride for convolution.
        padding (int): The count of zeros to pad on both sides.
    Outputs:
        Tuple[np.ndarray, np.ndarray, np.ndarray]: The derivative of W, b,
            A_prev.
    """
    W = cache['W']
    b = cache['b']
    A_prev = cache['A_prev']
    dW = np.zeros(W.shape)
    db = np.zeros(b.shape)
    dA_prev = np.zeros(A_prev.shape)
 
    _, _, c_i = A_prev.shape
    c_o, f, f_2, c_k = W.shape
    h_o, w_o, c_o_2 = dZ.shape
 
    assert (f == f_2)
    assert (c_i == c_k)
    assert (c_o == c_o_2)
 
    A_prev_pad = np.pad(A_prev, [(padding, padding), (padding, padding),
                                 (0, 0)])
    dA_prev_pad = np.pad(dA_prev, [(padding, padding), (padding, padding),
                                   (0, 0)])
    for i_h in range(h_o):
        for i_w in range(w_o):
            for i_c in range(c_o):
                h_lower = i_h * stride
                h_upper = i_h * stride + f
                w_lower = i_w * stride
                w_upper = i_w * stride + f
 
                input_slice = A_prev_pad[h_lower:h_upper, w_lower:w_upper, :]
                # forward
                # kernel_slice = W[i_c]
                # Z[i_h, i_w, i_c] = np.sum(input_slice * kernel_slice)
                # Z[i_h, i_w, i_c] += b[i_c]
 
                # backward
                dW[i_c] += input_slice * dZ[i_h, i_w, i_c]
                dA_prev_pad[h_lower:h_upper,
                            w_lower:w_upper, :] += W[i_c] * dZ[i_h, i_w, i_c]
                db[i_c] += dZ[i_h, i_w, i_c]
 
    if padding > 0:
        dA_prev = dA_prev_pad[padding:-padding, padding:-padding, :]
    else:
        dA_prev = dA_prev_pad
    return dW, db, dA_prev
 
@pytest.mark.parametrize('c_i, c_o', [(3, 6), (2, 2)])
@pytest.mark.parametrize('kernel_size', [3, 5])
@pytest.mark.parametrize('stride', [1, 2])
@pytest.mark.parametrize('padding', [0, 1])
def test_conv(c_i: int, c_o: int, kernel_size: int, stride: int, padding: str):
 
    # Preprocess
    input = np.random.randn(20, 20, c_i)
    weight = np.random.randn(c_o, kernel_size, kernel_size, c_i)
    bias = np.random.randn(c_o)
 
    torch_input = torch.from_numpy(np.transpose(
        input, (2, 0, 1))).unsqueeze(0).requires_grad_()
    torch_weight = torch.from_numpy(np.transpose(
        weight, (0, 3, 1, 2))).requires_grad_()
    torch_bias = torch.from_numpy(bias).requires_grad_()
 
    # forward
    torch_output_tensor = torch.conv2d(torch_input, torch_weight, torch_bias,
                                       stride, padding)
    torch_output = np.transpose(
        torch_output_tensor.detach().numpy().squeeze(0), (1, 2, 0))
 
    cache = conv2d_forward(input, weight, bias, stride, padding)
    numpy_output = cache['Z']
    assert np.allclose(torch_output, numpy_output)
 
    # backward
    torch_sum = torch.sum(torch_output_tensor)
    torch_sum.backward()
    torch_dW = np.transpose(torch_weight.grad.numpy(), (0, 2, 3, 1))
    torch_db = torch_bias.grad.numpy()
    torch_dA_prev = np.transpose(torch_input.grad.numpy().squeeze(0),
                                 (1, 2, 0))
 
    dZ = np.ones(numpy_output.shape)
    dW, db, dA_prev = conv2d_backward(dZ, cache, stride, padding)
 
    assert np.allclose(dW, torch_dW)
    assert np.allclose(db, torch_db)
    assert np.allclose(dA_prev, torch_dA_prev)

池化层的反向传播实现：

import numpy as np
from module import Layers 
 
class Pooling(Layers):
    def __init__(self, name, ksize, stride, type):
        super(Pooling).__init__(name)
        self.type = type
        self.ksize = ksize
        self.stride = stride 
 
    def forward(self, x):
        b, c, h, w = x.shape
        out = np.zeros([b, c, h//self.stride, w//self.stride]) 
        self.index = np.zeros_like(x)
        for b in range(b):
            for d in range(c):
                for i in range(h//self.stride):
                    for j in range(w//self.stride):
                        _x = i *self.stride
                        _y = j *self.stride
                        if self.type =="max":
                            out[b, d, i, j] = np.max(x[b, d, _x:_x+self.ksize, _y:_y+self.ksize])
                            index = np.argmax(x[b, d, _x:_x+self.ksize, _y:_y+self.ksize])
                            self.index[b, d, _x +index//self.ksize, _y +index%self.ksize ] = 1
                        elif self.type == "aveg":
                            out[b, d, i, j] = np.mean((x[b, d, _x:_x+self.ksize, _y:_y+self.ksize]))
        return out 
 
    def backward(self, grad_out):
        if self.type =="max":
            return np.repeat(np.repeat(grad_out, self.stride, axis=2),self.stride, axis=3)* self.index 
        elif self.type =="aveg":
            return np.repeat(np.repeat(grad_out, self.stride, axis=2), self.stride, axis=3)/(self.ksize * self.ksize)