NNDL Assignment 7: Chapter 5 Exercises (1×1 Convolution Kernels | CNN BP)

Latest recommended article published on 2023-09-17 21:01:58

辰希

Article tags: deep learning, neural networks, cnn

Article link: https://blog.csdn.net/weixin_51626106/article/details/127600107


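Exercise 5-2: Prove that wide convolution is commutative, i.e. formula (5.13)

Exercise 5-3: Analyze the role of 1×1 convolution kernels in convolutional neural networks

Exercise 5-4: For a convolution layer whose input is a 100×100×256 group of feature maps, which uses 3×3 kernels and outputs a 100×100×256 group of feature maps, find the time and space complexity. If a 1×1 convolution is introduced first to obtain 100×100×64 feature maps, followed by a 3×3 convolution that produces a 100×100×256 group of feature maps, find the time and space complexity.

Exercise 5-7: Ignoring the activation function, show that the forward computation and the backpropagation of a convolution layer are transposes of each other

Deriving the CNN backpropagation algorithm (optional)

Design a simple CNN model, implement the backpropagation operators of the convolution and pooling layers with NumPy and PyTorch, and test them with numerical values (optional)

Summary

References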

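Exercise 5-2: Prove that wide convolution is commutative, i.e. formula (5.13)

We start from

$y_{ij}=\sum^m_{u=1}\sum^n_{v=1}w_{uv}\cdot x_{i+u-1,j+v-1}$

By the definition of wide convolution, the sums extend over the zero-padded range (entries outside the original index range are treated as zero):

$y_{ij}=\sum^m_{u=1-(m-1)}\sum^n_{v=1-(n-1)}w_{uv}\cdot x_{i+u-1,j+v-1}$

To swap the subscript forms of x and w, we change variables.

Let

$s=i+u-1,\ t=j+v-1$,

so that

$u=s-i+1,\ v=t-j+1$.

Then

$y_{ij}=\sum^{i-1+m}_{s=i+1-m}\sum^{j-1+n}_{t=j+1-n}x_{st}\cdot w_{s-i+1,t-j+1}$

We know that

$i \in [1,M],\ j \in [1,N]$

Therefore, in

$y_{ij}=\sum^{i-1+m}_{s=i+1-m}\sum^{j-1+n}_{t=j+1-n}x_{st}w_{s-i+1,t-j+1}$

the ranges of s and t are admissible because of the zero padding of wide convolution: every index that falls outside the original data refers to a padded zero, so x and w can exchange roles.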
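As a numerical sanity check of the claim (not a proof), here is a minimal sketch. It assumes that formula (5.13) has the form rot180(W) ⊗ X = rot180(X) ⊗ W, where ⊗ is the wide (fully zero-padded) cross-correlation; the helper names `wide_xcorr` and `rot180` are made up for this illustration.

```python
import numpy as np

def wide_xcorr(w, x):
    """Wide cross-correlation: zero-pad x by (kernel size - 1) on every side."""
    m, n = w.shape
    xp = np.pad(x, ((m - 1, m - 1), (n - 1, n - 1)))  # zero padding
    out_h = xp.shape[0] - m + 1
    out_w = xp.shape[1] - n + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = np.sum(w * xp[i:i + m, j:j + n])
    return y

def rot180(a):
    """Rotate a 2-D array by 180 degrees."""
    return a[::-1, ::-1]

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 6))   # arbitrary input size (assumption)
w = rng.standard_normal((3, 2))   # arbitrary kernel size (assumption)

lhs = wide_xcorr(rot180(w), x)    # kernel slides over the padded input
rhs = wide_xcorr(rot180(x), w)    # roles of x and w exchanged
commutes = np.allclose(lhs, rhs)
```

Both sides have shape (M+m-1)×(N+n-1), and `commutes` should be true for any choice of sizes.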

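Exercise 5-3: Analyze the role of 1×1 convolution kernels in convolutional neural networks

Feature dimensionality reduction, which saves computation
Increased nonlinear expressive power of the model

A 1×1 convolution filter works like any other filter; the only difference is its 1×1 size, so it does not take into account the relationships between neighboring positions in the previous layer. It first appeared in the Network In Network paper, where 1×1 convolutions were used to make the network deeper and wider; in the Inception network (Going Deeper with Convolutions) they are used for dimensionality reduction.

Because 3×3 or 5×5 convolutions are quite expensive on layers with several hundred filters, a 1×1 convolution is applied first to reduce the dimensionality before the 3×3 or 5×5 convolution.
The main roles of the 1×1 convolution are therefore:
1. Dimensionality reduction. For example, convolving a 500×500×100 image with 20 filters of size 1×1×100 gives a result of size 500×500×20.
2. Adding nonlinearity. Since the convolution layer is followed by an activation layer, a 1×1 convolution adds a non-linear activation on top of the representation learned by the previous layer, which increases the expressive power of the network.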

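Using a 1×1 kernel on a single-channel image

Each pixel of the input image is merely multiplied by a coefficient, which has no direct effect.

Using a 1×1 kernel on a multi-channel image

If the input is a 6×6×32 image, convolving it with 1×1×32 kernels gives an output of size 6×6×(number of kernels used). This changes the channel count of 32, which amounts to reducing or increasing the dimensionality of the input.

Note: the spatial size of the output is still computed with the original output-size formula, i.e. the value of Q.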
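The channel-mixing behavior can be sketched in NumPy. This is only an illustration: the choice of 8 output kernels is an arbitrary assumption, and the bias and activation are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 32))   # H x W x C_in, as in the text
w = rng.standard_normal((8, 32))      # 8 kernels of size 1x1x32 (8 is an assumption)

# A 1x1 convolution applies the same linear map across channels at every pixel.
y = np.einsum('hwc,oc->hwo', x, w)

# Equivalent view: flatten the pixels and do one matrix multiplication.
y_ref = (x.reshape(-1, 32) @ w.T).reshape(6, 6, 8)
same = np.allclose(y, y_ref)
```

The spatial size 6×6 is unchanged; only the channel dimension goes from 32 to the number of kernels.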

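Exercise 5-4: For a convolution layer whose input is a 100×100×256 group of feature maps, which uses 3×3 kernels and outputs a 100×100×256 group of feature maps, find the time and space complexity. If a 1×1 convolution is introduced first to obtain 100×100×64 feature maps, followed by a 3×3 convolution that produces a 100×100×256 group of feature maps, find the time and space complexity.

Time complexity: the number of operations the model performs.

Time complexity of a single convolution layer: Time~O(M^2 * K^2 * Cin * Cout)

M: side length of the output feature map.
K: side length of the convolution kernel.
Cin: number of input channels.
Cout: number of output channels.
Notes:

To keep the number of variables small, the input and the kernel are assumed to be square; if they are not, replace M^2 with the product of the feature map's height and width;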
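Each convolution layer also has a bias parameter, which is ignored here. Since one bias is added per output element, including it gives a time complexity of O(M^2 * K^2 * Cin * Cout + M^2 * Cout).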
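Space complexity: the number of parameters of the model plus the storage for the output feature maps.

Space complexity of a single convolution layer: Space~O(K^2 * Cin * Cout + M^2 * Cout)

Note: the parameter term K^2 * Cin * Cout depends only on the kernel size K and the channel counts C, not on the input image size; the term M^2 * Cout is the size of the output feature maps. When pruning a model, the kernel size is usually already small, and the depth of the network is closely tied to its capacity and should not be cut too much, so the channel count is usually the first thing to reduce.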

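Solution:

Direct 3×3 convolution:

Time complexity = 100×100×3×3×256×256 = 5898240000

Space complexity = 3×3×256×256 + 100×100×256 = 3149824

With the 1×1 convolution first:

Time complexity = 100×100×1×1×256×64 + 100×100×3×3×64×256 = 1638400000

Space complexity = 1×1×256×64 + 100×100×64 + 3×3×64×256 + 100×100×256 = 3363840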
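The arithmetic above can be reproduced with a few lines of Python. This sketch uses the formulas Time = M²·K²·Cin·Cout and Space = K²·Cin·Cout + M²·Cout with bias ignored; the function name `conv_cost` is made up for this illustration.

```python
def conv_cost(m: int, k: int, c_in: int, c_out: int):
    """Return (time, space) of one conv layer, bias ignored."""
    time = m * m * k * k * c_in * c_out
    space = k * k * c_in * c_out + m * m * c_out   # parameters + output feature maps
    return time, space

# Direct 3x3 convolution: 256 -> 256 channels
direct_time, direct_space = conv_cost(100, 3, 256, 256)

# Bottleneck: 1x1 (256 -> 64) followed by 3x3 (64 -> 256)
t1, s1 = conv_cost(100, 1, 256, 64)
t2, s2 = conv_cost(100, 3, 64, 256)
bottleneck_time, bottleneck_space = t1 + t2, s1 + s2

print(direct_time, direct_space)          # 5898240000 3149824
print(bottleneck_time, bottleneck_space)  # 1638400000 3363840
```

The bottleneck cuts the operation count to roughly 28% of the direct version, at the cost of slightly more storage for the intermediate 100×100×64 feature maps.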

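Exercise 5-7: Ignoring the activation function, show that the forward computation and the backpropagation of a convolution layer are transposes of each other

Convolution layers should be familiar by now. For convenience, assume the following:
- two-dimensional discrete convolution (N=2)
- square input features (i1=i2=i)
- square kernel (k1=k2=k)
- the same stride in every dimension (s1=s2=s)
- the same padding in every dimension (p1=p2=p)

For a convolution with parameters (i=5,k=3,s=2,p=1), the computation gives an output feature size of (o1=o2=o=3).

For a convolution with parameters (i=6,k=3,s=2,p=1), the computation also gives an output feature size of (o1=o2=o=3).

From these two examples, the relation between the input size, the output size and the convolution parameters is:

$o=\left\lfloor\frac{i+2p-k}{s}\right\rfloor+1.$
where ⌊x⌋ denotes rounding x down.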
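The output-size relation is easy to check against the two examples. A minimal sketch, where the function name `conv_output_size` is chosen only for this illustration:

```python
def conv_output_size(i: int, k: int, s: int, p: int) -> int:
    """Output side length of a convolution: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

# The two examples from the text: both give an output side length of 3.
sizes = [conv_output_size(5, 3, 2, 1), conv_output_size(6, 3, 2, 1)]
print(sizes)  # [3, 3]
```

The floor explains why i=5 and i=6 give the same output size: with stride 2, the last row and column of the padded 6×6 input are never covered by a window.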

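Deconvolution (transposed convolution) layer
Before introducing deconvolution, let us look at the relation between convolution and matrix multiplication.

Convolution as matrix multiplication
Consider a simple convolution with parameters (i=4,k=3,s=1,p=0), which gives an output size o=2.

For this convolution, the 3×3 kernel can be unrolled into a sparse [4,16] matrix C, whose nonzero element $w_{i,j}$ is the kernel entry in row i and column j.

Unrolling the 4×4 input feature into a [16,1] matrix X, the product Y=CX is a [4,1] output matrix; rearranging it into a 2×2 output feature gives the final result. This shows that the computation of a convolution layer can be turned into a matrix multiplication. Note that some open-source deep learning frameworks do not compute convolutions with this conversion, because it involves many useless multiplications by zero; for how Caffe implements convolution, see Implementing convolution as a matrix multiplication.

From this analysis, the forward pass of a convolution layer can be written as multiplication by the matrix C. Since Y=CX, the gradient with respect to the input is ∂L/∂X = C^T ∂L/∂Y, so the backward pass of the convolution layer is multiplication by the transpose of C.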
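The transpose relation can be checked numerically for the (i=4,k=3,s=1,p=0) case. The following is a sketch only: the helper `conv_matrix` and the random test values are assumptions made for this illustration, and a single channel without bias is used.

```python
import numpy as np

def conv_matrix(w, i):
    """Unroll a k x k kernel into the sparse (o*o, i*i) matrix C (stride 1, no padding)."""
    k = w.shape[0]
    o = i - k + 1
    C = np.zeros((o * o, i * i))
    for r in range(o):
        for c in range(o):
            for u in range(k):
                for v in range(k):
                    # output (r, c) reads input (r + u, c + v) with weight w[u, v]
                    C[r * o + c, (r + u) * i + (c + v)] = w[u, v]
    return C

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))
x = rng.standard_normal((4, 4))
C = conv_matrix(w, 4)                      # shape (4, 16)

# Forward: Y = C X, compared with a direct sliding-window computation.
y = (C @ x.reshape(-1)).reshape(2, 2)
y_direct = np.array([[np.sum(w * x[r:r + 3, c:c + 3]) for c in range(2)]
                     for r in range(2)])

# Backward: dX = C^T dY, compared with scattering each dY entry over its window.
dy = rng.standard_normal((2, 2))
dx = (C.T @ dy.reshape(-1)).reshape(4, 4)
dx_direct = np.zeros((4, 4))
for r in range(2):
    for c in range(2):
        dx_direct[r:r + 3, c:c + 3] += w * dy[r, c]
```

Both `y` and `dx` should agree with their direct counterparts, which is the sense in which forward and backward are transposes of each other.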

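Deriving the CNN backpropagation algorithm (optional)

The error propagated backward can be viewed as the sensitivity of each neuron's bias (the sensitivity says how much the error changes when the bias b changes, i.e. the rate of change of the error with respect to the bias, which is a derivative). It is defined as:

$\frac{\partial E}{\partial b}=\frac{\partial E}{\partial u}\frac{\partial u}{\partial b}=\delta$

Because ∂u/∂b=1, we have ∂E/∂b=∂E/∂u=δ; that is, the bias sensitivity ∂E/∂b=δ is equal to the derivative ∂E/∂u of the error E with respect to the total input u of a node. This derivative is what lets the error of the higher layers propagate back to the lower layers. Backpropagation uses the following relation, which gives the sensitivity of layer l:

Formula (1)

Here "◦" denotes element-wise multiplication. The sensitivity of the output-layer neurons is different:

Finally, the delta (δ) rule is applied to each neuron to update its weights. Concretely, for a given neuron, take its input and scale it by the neuron's delta. In vector form, for layer l, the derivative of the error with respect to the layer's weights (arranged as a matrix) is the outer product of the layer's input (the output of the previous layer) and the layer's sensitivity (the δ of every neuron in the layer, arranged as a vector). Multiplying this partial derivative by a negative learning rate gives the weight update for the layer:

Formula (2)

The update for the bias is similar. In fact, each weight (W)ij can have its own learning rate ηij.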
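For reference, here is one common way to write the relations referred to above as formula (1), the output-layer sensitivity, and formula (2). This is a sketch under assumptions that the text does not state: a fully connected layer $u^l = W^l x^{l-1} + b^l$ with $x^l = f(u^l)$, a squared-error loss $E=\frac{1}{2}\|x^L-t\|^2$ with target $t$ and output layer $L$, and a single learning rate $\eta$.

$$\delta^l = (W^{l+1})^T\delta^{l+1}\circ f'(u^l)$$

$$\delta^L = f'(u^L)\circ(x^L - t)$$

$$\frac{\partial E}{\partial W^l} = \delta^l\,(x^{l-1})^T,\qquad \Delta W^l = -\eta\,\frac{\partial E}{\partial W^l}$$

With $W^l$ stored as an (outputs × inputs) matrix, the outer product $\delta^l (x^{l-1})^T$ has the same shape as $W^l$; other texts store the transpose and write the product in the opposite order.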

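Design a simple CNN model, implement the backpropagation operators of the convolution and pooling layers with NumPy and PyTorch, and test them with numerical values (optional)

Backpropagation of the convolution layer: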

from typing import Dict, Tuple
import numpy as np
import pytest
import torch
 
def conv2d_forward(input: np.ndarray, weight: np.ndarray, bias: np.ndarray,
                   stride: int, padding: int) -> Dict[str, np.ndarray]:
    """2D Convolution Forward Implemented with NumPy
    Args:
        input (np.ndarray): The input NumPy array of shape (H, W, C).
        weight (np.ndarray): The weight NumPy array of shape
            (C', F, F, C).
        bias (np.ndarray): The bias NumPy array of shape (C').
        stride (int): Stride for convolution.
        padding (int): The count of zeros to pad on both sides.
    Outputs:
        Dict[str, np.ndarray]: Cached data for backward prop.
    """
    h_i, w_i, c_i = input.shape
    c_o, f, f_2, c_k = weight.shape
 
    assert (f == f_2)
    assert (c_i == c_k)
    assert (bias.shape[0] == c_o)
    input_pad = np.pad(input, [(padding, padding), (padding, padding), (0, 0)])
 
    def cal_new_sidelngth(sl, s, f, p):
        return (sl + 2 * p - f) // s + 1
 
    h_o = cal_new_sidelngth(h_i, stride, f, padding)
    w_o = cal_new_sidelngth(w_i, stride, f, padding)
    output = np.empty((h_o, w_o, c_o), dtype=input.dtype)
 
    for i_h in range(h_o):
        for i_w in range(w_o):
            for i_c in range(c_o):
                h_lower = i_h * stride
                h_upper = i_h * stride + f
                w_lower = i_w * stride
                w_upper = i_w * stride + f
                input_slice = input_pad[h_lower:h_upper, w_lower:w_upper, :]
                kernel_slice = weight[i_c]
                output[i_h, i_w, i_c] = np.sum(input_slice * kernel_slice)
                output[i_h, i_w, i_c] += bias[i_c]
 
    cache = dict()
    cache['Z'] = output
    cache['W'] = weight
    cache['b'] = bias
    cache['A_prev'] = input
    return cache
 
def conv2d_backward(dZ: np.ndarray, cache: Dict[str, np.ndarray], stride: int,
                    padding: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """2D Convolution Backward Implemented with NumPy
    Args:
        dZ: (np.ndarray): The derivative of the output of conv.
        cache (Dict[str, np.ndarray]): Record output 'Z', weight 'W', bias 'b'
            and input 'A_prev' of forward function.
        stride (int): Stride for convolution.
        padding (int): The count of zeros to pad on both sides.
    Outputs:
        Tuple[np.ndarray, np.ndarray, np.ndarray]: The derivative of W, b,
            A_prev.
    """
    W = cache['W']
    b = cache['b']
    A_prev = cache['A_prev']
    dW = np.zeros(W.shape)
    db = np.zeros(b.shape)
    dA_prev = np.zeros(A_prev.shape)
 
    _, _, c_i = A_prev.shape
    c_o, f, f_2, c_k = W.shape
    h_o, w_o, c_o_2 = dZ.shape
 
    assert (f == f_2)
    assert (c_i == c_k)
    assert (c_o == c_o_2)
 
    A_prev_pad = np.pad(A_prev, [(padding, padding), (padding, padding),
                                 (0, 0)])
    dA_prev_pad = np.pad(dA_prev, [(padding, padding), (padding, padding),
                                   (0, 0)])
    for i_h in range(h_o):
        for i_w in range(w_o):
            for i_c in range(c_o):
                h_lower = i_h * stride
                h_upper = i_h * stride + f
                w_lower = i_w * stride
                w_upper = i_w * stride + f
 
                input_slice = A_prev_pad[h_lower:h_upper, w_lower:w_upper, :]
                # forward
                # kernel_slice = W[i_c]
                # Z[i_h, i_w, i_c] = np.sum(input_slice * kernel_slice)
                # Z[i_h, i_w, i_c] += b[i_c]
 
                # backward
                dW[i_c] += input_slice * dZ[i_h, i_w, i_c]
                dA_prev_pad[h_lower:h_upper,
                            w_lower:w_upper, :] += W[i_c] * dZ[i_h, i_w, i_c]
                db[i_c] += dZ[i_h, i_w, i_c]
 
    if padding > 0:
        dA_prev = dA_prev_pad[padding:-padding, padding:-padding, :]
    else:
        dA_prev = dA_prev_pad
    return dW, db, dA_prev
 
@pytest.mark.parametrize('c_i, c_o', [(3, 6), (2, 2)])
@pytest.mark.parametrize('kernel_size', [3, 5])
@pytest.mark.parametrize('stride', [1, 2])
@pytest.mark.parametrize('padding', [0, 1])
def test_conv(c_i: int, c_o: int, kernel_size: int, stride: int, padding: int):
 
    # Preprocess
    input = np.random.randn(20, 20, c_i)
    weight = np.random.randn(c_o, kernel_size, kernel_size, c_i)
    bias = np.random.randn(c_o)
 
    torch_input = torch.from_numpy(np.transpose(
        input, (2, 0, 1))).unsqueeze(0).requires_grad_()
    torch_weight = torch.from_numpy(np.transpose(
        weight, (0, 3, 1, 2))).requires_grad_()
    torch_bias = torch.from_numpy(bias).requires_grad_()
 
    # forward
    torch_output_tensor = torch.conv2d(torch_input, torch_weight, torch_bias,
                                       stride, padding)
    torch_output = np.transpose(
        torch_output_tensor.detach().numpy().squeeze(0), (1, 2, 0))
 
    cache = conv2d_forward(input, weight, bias, stride, padding)
    numpy_output = cache['Z']
    assert np.allclose(torch_output, numpy_output)
 
    # backward
    torch_sum = torch.sum(torch_output_tensor)
    torch_sum.backward()
    torch_dW = np.transpose(torch_weight.grad.numpy(), (0, 2, 3, 1))
    torch_db = torch_bias.grad.numpy()
    torch_dA_prev = np.transpose(torch_input.grad.numpy().squeeze(0),
                                 (1, 2, 0))
 
    dZ = np.ones(numpy_output.shape)
    dW, db, dA_prev = conv2d_backward(dZ, cache, stride, padding)
 
    assert np.allclose(dW, torch_dW)
    assert np.allclose(db, torch_db)
    assert np.allclose(dA_prev, torch_dA_prev)

Backpropagation of the pooling layer:

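import numpy as np
from module import Layers  # the author's own base class; its code is not shown in this post


class Pooling(Layers):
    def __init__(self, name, ksize, stride, type):
        # super(Pooling) without self yields an unbound super object, so use super()
        super().__init__(name)
        self.type = type  # "max" or "aveg" (average pooling)
        self.ksize = ksize
        self.stride = stride

    def forward(self, x):
        # x has shape (batch, channels, height, width). The output size and
        # backward() both rely on non-overlapping windows, i.e. ksize == stride.
        b, c, h, w = x.shape
        out = np.zeros([b, c, h // self.stride, w // self.stride])
        # 1 at the position of each window maximum, 0 elsewhere
        self.index = np.zeros_like(x)
        for n in range(b):
            for d in range(c):
                for i in range(h // self.stride):
                    for j in range(w // self.stride):
                        _x = i * self.stride
                        _y = j * self.stride
                        window = x[n, d, _x:_x + self.ksize, _y:_y + self.ksize]
                        if self.type == "max":
                            out[n, d, i, j] = np.max(window)
                            index = np.argmax(window)
                            self.index[n, d, _x + index // self.ksize, _y + index % self.ksize] = 1
                        elif self.type == "aveg":
                            out[n, d, i, j] = np.mean(window)
        return out

    def backward(self, grad_out):
        # Upsample so that every input position receives the gradient of its window
        # (the shape matches self.index only when h and w are multiples of stride).
        grad_up = np.repeat(np.repeat(grad_out, self.stride, axis=2), self.stride, axis=3)
        if self.type == "max":
            # only the position of the maximum receives gradient
            return grad_up * self.index
        elif self.type == "aveg":
            # the gradient is shared equally inside each window
            return grad_up / (self.ksize * self.ksize)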
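To plug in concrete numbers without the `Layers` base class, here is a standalone sketch of the same idea (upsample the gradient, then mask or average). The function names `pool_forward`, `upsample` and `pool_backward` are made up for this illustration, and it assumes non-overlapping windows (kernel size equal to stride) and input sizes divisible by the window size.

```python
import numpy as np

def pool_forward(x, k, mode="max"):
    """Non-overlapping k x k pooling on an array of shape (batch, channels, h, w)."""
    b, c, h, w = x.shape
    windows = x.reshape(b, c, h // k, k, w // k, k)
    return windows.max(axis=(3, 5)) if mode == "max" else windows.mean(axis=(3, 5))

def upsample(a, k):
    """Repeat every entry k times along height and width."""
    return np.repeat(np.repeat(a, k, axis=2), k, axis=3)

def pool_backward(x, grad_out, k, mode="max"):
    grad_up = upsample(grad_out, k)
    if mode == "max":
        # Mask of window maxima; tied maxima would all receive the gradient here.
        mask = (x == upsample(pool_forward(x, k, "max"), k))
        return grad_up * mask
    return grad_up / (k * k)   # average pooling spreads the gradient evenly

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
grad_out = np.array([[[[1.0, 2.0], [3.0, 4.0]]]])

out_max = pool_forward(x, 2, "max")              # [[5, 7], [13, 15]]
dx_max = pool_backward(x, grad_out, 2, "max")    # nonzero only at the four maxima
dx_avg = pool_backward(x, grad_out, 2, "avg")    # each window filled with grad / 4
```

For this input every window maximum sits in the bottom-right corner of its window, so `dx_max` is nonzero only at positions (1,1), (1,3), (3,1) and (3,3), with values 1, 2, 3 and 4.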