d2l: Modern Convolutional Neural Networks (fully updated)

我想吃鱼了，

Last modified: 2023-03-18 08:06:25

Category: File Processing. Tags: cnn, artificial intelligence, machine learning

First published: 2023-03-14 22:06:56

Article link: https://blog.csdn.net/python_innocent/article/details/129540503

This post walks through the networks of Chapter 7: AlexNet, VGG, NiN, GoogLeNet, ResNet, and DenseNet.

1.AlexNet

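1.1 Model overview

The model overview is shown in the figure. AlexNet can be viewed as a beefed-up LeNet. Its improvements over LeNet are marked in the figure; the main ones are dropout, ReLU in place of Sigmoid, and max pooling.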

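1.2 Network implementation

The code:

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

net = nn.Sequential(
    # Use a larger 11*11 window to capture objects.
    # A stride of 4 reduces the output height and width.
    # The number of output channels is also much larger than in LeNet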
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Shrink the window; padding of 2 keeps input and output height and width equal; more output channels
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
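    # Use three consecutive convolutional layers with smaller windows.
    # Except for the last one, the number of output channels increases further.
    # No pooling layer follows the first two of these convolutions, so height and width are kept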
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    # The fully connected layers have several times more outputs than in LeNet. Dropout mitigates overfitting
    nn.Linear(6400, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    # Output layer. Fashion-MNIST is used here, so there are 10 classes rather than the 1000 in the paper
    nn.Linear(4096, 10))

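1. Use a larger 11×11 window to capture objects. A stride of 4 reduces the output height and width. The number of output channels (96) is also much larger than in LeNet.

2. Shrink the convolution window, use padding of 2 so that the input and output height and width match, and increase the number of output channels.

3. Use three consecutive convolutional layers with smaller windows. Except for the last one, the number of output channels increases further. No pooling layer follows the first two of these convolutions, so the height and width are not reduced there.

4. The fully connected layers have several times more outputs than in LeNet. Dropout layers mitigate overfitting.

5. Finally the output layer. Fashion-MNIST is used here, so there are 10 classes rather than the 1000 in the paper.

6. Why two identical 4096-wide layers at the end? The explanation I was given is that the convolutional part does not extract features well enough or deeply enough, so two large dense layers make up for it; dropping one of them makes the results worse.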

1.3 Layer-by-layer output shapes

X = torch.randn(1, 1, 224, 224)
for layer in net:
    X=layer(X)
    print(layer.__class__.__name__,'output shape:\t',X.shape)

'''
Conv2d output shape:	 torch.Size([1, 96, 54, 54])
ReLU output shape:	 torch.Size([1, 96, 54, 54])
MaxPool2d output shape:	 torch.Size([1, 96, 26, 26])
Conv2d output shape:	 torch.Size([1, 256, 26, 26])
ReLU output shape:	 torch.Size([1, 256, 26, 26])
MaxPool2d output shape:	 torch.Size([1, 256, 12, 12])
Conv2d output shape:	 torch.Size([1, 384, 12, 12])
ReLU output shape:	 torch.Size([1, 384, 12, 12])
Conv2d output shape:	 torch.Size([1, 384, 12, 12])
ReLU output shape:	 torch.Size([1, 384, 12, 12])
Conv2d output shape:	 torch.Size([1, 256, 12, 12])
ReLU output shape:	 torch.Size([1, 256, 12, 12])
MaxPool2d output shape:	 torch.Size([1, 256, 5, 5])
Flatten output shape:	 torch.Size([1, 6400])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 10])
'''
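The spatial sizes in this printout can be reproduced by hand. The sketch below is a minimal check in plain Python, assuming the usual output-size formula o = floor((i + 2p - k) / s) + 1; the helper name out_size and the layers table are illustrative names of mine, not part of d2l or PyTorch.

```python
def out_size(i, k, s=1, p=0):
    """Spatial size after a conv or pooling layer: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

# (name, kernel, stride, padding) of every size-relevant layer in the AlexNet above
layers = [
    ("conv 11x11", 11, 4, 1),
    ("maxpool", 3, 2, 0),
    ("conv 5x5", 5, 1, 2),
    ("maxpool", 3, 2, 0),
    ("conv 3x3", 3, 1, 1),
    ("conv 3x3", 3, 1, 1),
    ("conv 3x3", 3, 1, 1),
    ("maxpool", 3, 2, 0),
]

size = 224
sizes = []
for name, k, s, p in layers:
    size = out_size(size, k, s, p)
    sizes.append(size)
    print(f"{name:10s} -> {size}x{size}")

# The last conv has 256 output channels, so Flatten yields 256 * 5 * 5 features
flat_features = 256 * size * size
print("flattened features:", flat_features)
```

The final value is where the 6400 in nn.Linear(6400, 4096) comes from.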

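1.4 Training

This uses the print-only variant of the training function from the previous section (metrics are printed instead of plotted). I put everything into one big block so that it can simply be copied whenever it is needed later.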

from torchvision import transforms
import torchvision
from torch.utils import data


def load_data_fashion_mnist_nw2(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and load it into memory."""
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(
        root="../data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(
        root="../data", train=False, transform=trans, download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=2),
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=2))

def train_ch6_data(net, train_iter, test_iter, num_epochs, lr, device):
    """Train a model on a GPU (defined in Chapter 6), printing metrics instead of plotting."""
    global train_l, train_acc, metric

    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)

    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    # animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
    #                         legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, number of examples
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
              f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')

Hyperparameters and data loading:

batch_size = 128
train_iter, test_iter = load_data_fashion_mnist_nw2(batch_size, resize=224)

Training call:

lr, num_epochs = 0.01, 10
train_ch6_data(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

'''
training on cuda:0
loss 1.336, train acc 0.500, test acc 0.744
loss 0.659, train acc 0.752, test acc 0.791
loss 0.542, train acc 0.798, test acc 0.808
loss 0.476, train acc 0.824, test acc 0.841
loss 0.436, train acc 0.840, test acc 0.851
loss 0.403, train acc 0.853, test acc 0.860
loss 0.381, train acc 0.861, test acc 0.862
loss 0.361, train acc 0.867, test acc 0.870
loss 0.344, train acc 0.875, test acc 0.876
loss 0.335, train acc 0.877, test acc 0.880
1200.0 examples/sec on cuda:0
'''

2.VGG

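2.1 Model overview

Why VGG? Because AlexNet did not provide a general template to guide later researchers in designing new networks.

Its differences from AlexNet are marked in the figure:

2.2 Network implementation

First define a VGG block, which consists of n convolutional layers followed by one max-pooling layer: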

def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2,stride=2))
    return nn.Sequential(*layers)

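Next, define the number of VGG blocks and the channels of each. Halving 224 five times gives 7, which is odd and cannot be halved cleanly again, so a 224×224 input supports at most five blocks. (Recall the output-size formula o = ⌊(i + 2p - k + s)/s⌋. A 3×3 convolution with p=1 and s=1 keeps the size, and the 2×2 max pooling with s=2 halves it, so each VGG block halves the height and width.) Each pair below is (number of convolutions, output channels):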

conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

This gives the full VGG network:

def vgg(conv_arch):
    conv_blks = []
    in_channels = 1
    # Convolutional part
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels
        
    return nn.Sequential(
        *conv_blks, nn.Flatten(),
        # Fully connected part
        nn.Linear(out_channels * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 10))

net = vgg(conv_arch)

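2.2.1 Why does in_channels = out_channels appear twice?

Because the two assignments update two different local variables. Inside vgg_block, in_channels = out_channels makes the second and later convolutions of the same block take out_channels as their input. That update stays local to vgg_block, so back in vgg the variable in_channels is unchanged (still 1 after the first block) and has to be updated again before the next block is built. The two functions do not share the variable, as the figure shows: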
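To make the two updates concrete, here is a small plain-Python trace of the (in_channels, out_channels) pair that each convolution would receive. It mirrors the loop structure of vgg and vgg_block without building any layers; conv_channel_pairs is a hypothetical helper written for this illustration.

```python
def conv_channel_pairs(conv_arch, in_channels=1):
    """Return the (in_channels, out_channels) pair of every conv layer, in order."""
    pairs = []
    for num_convs, out_channels in conv_arch:
        block_in = in_channels            # value that vgg() passes into vgg_block()
        for _ in range(num_convs):
            pairs.append((block_in, out_channels))
            block_in = out_channels       # update inside vgg_block: next conv of the same block
        in_channels = out_channels        # update inside vgg(): input of the next block
    return pairs

conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
pairs = conv_channel_pairs(conv_arch)
for pair in pairs:
    print(pair)
```

Dropping the inner update would feed the wrong channel count to the second convolution of a block; dropping the outer one would feed 1 channel into every block.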

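2.3 Layer-by-layer output shapes

The pattern: each VGG block halves the spatial size, and the channel count doubles from block to block until it reaches 512. This halve-the-size, double-the-channels pattern comes up often.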

X = torch.randn(size=(1, 1, 224, 224))
for blk in net:
    X = blk(X)
    print(blk.__class__.__name__,'output shape:\t',X.shape)

'''
Sequential output shape:	 torch.Size([1, 64, 112, 112])
Sequential output shape:	 torch.Size([1, 128, 56, 56])
Sequential output shape:	 torch.Size([1, 256, 28, 28])
Sequential output shape:	 torch.Size([1, 512, 14, 14])
Sequential output shape:	 torch.Size([1, 512, 7, 7])
Flatten output shape:	 torch.Size([1, 25088])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 10])
'''

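Note how Seq (nn.Sequential) is used here (see also Section 5_1 on nesting models); it is worth spelling out once more.

Seq(*[1,2,3]) becomes Seq(1,2,3), and that is what a single block returns. After each block is appended, conv_blks is [Seq1, Seq2, ...]; it is unpacked again inside the Seq of the final return, which produces the final network. Before unpacking, conv_blks is simply a list of VGG blocks.

Inside a VGG block the layers are collected the same way: they are gathered in a list and unpacked into a Seq. The blocks are then appended to another list, which goes through the final Seq.

Summary: when a Seq contains a nested Seq_i, the layers inside Seq_i run in order, and only after all of them have finished does the outer Seq move on to its next item.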
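The execution order can be illustrated without PyTorch. The Seq class below is a toy stand-in that I wrote for this sketch; it only imitates the "call children in order" behaviour of nn.Sequential and is not the real implementation. The tag helper creates fake layers that record their own name.

```python
class Seq:
    """Toy stand-in for nn.Sequential: calls its children one after another."""

    def __init__(self, *children):
        self.children = list(children)

    def __call__(self, x):
        for child in self.children:
            x = child(x)
        return x

def tag(name):
    """Return a fake layer that appends its name to the running trace."""
    return lambda trace: trace + [name]

# Each block is built from a list that is unpacked with *
block1 = Seq(*[tag("conv1"), tag("pool1")])
block2 = Seq(*[tag("conv2a"), tag("conv2b"), tag("pool2")])

conv_blks = [block1, block2]                      # a plain list of blocks
toy_net = Seq(*conv_blks, tag("flatten"), tag("linear"))

order = toy_net([])
print(order)
```

Every layer of block1 runs before block2 starts, and the outer items run only after both nested blocks are done.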

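2.4 Training

To reduce computation, divide all the original channel counts by 4, which cuts both the parameter count and the amount of computation. This also shows that VGG can be scaled as a whole by adjusting its channel counts.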

ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)

Training call: this uses the print-only training function, so remember to include the definitions from Section 1.4 above.

lr, num_epochs, batch_size = 0.05, 10, 64
train_iter, test_iter = load_data_fashion_mnist_nw2(batch_size, resize=224)
train_ch6_data(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

3.NiN

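3.1 Model overview

The differences between VGG and NiN, and what NiN improves:

Why drop the fully connected layers at the end? A convolutional layer needs C_i × C_o × k² weights, whereas the first fully connected layer after the convolutions needs C × h × w × n weights, where C × h × w is the flattened feature map and n is that layer's number of outputs. That is a huge number of parameters: it takes a lot of memory, uses more compute bandwidth, and overfits easily. Global average pooling has no learnable parameters, so using it in place of the fully connected layers reduces parameters and computation and helps against overfitting.

Global average pooling means the pooling window has the same width and height as the input feature map, so each channel is reduced to a 1×1 output. To predict 1k classes, make the last convolution output 1k channels; pooling reduces each channel to a single (average) value, and a Flatten followed by Softmax gives the prediction.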
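A rough sense of scale, as a plain-Python sketch. The weight counts ignore biases, the numbers plugged in are the VGG shapes from Section 2 (512 channels at 7×7 feeding a 4096-wide layer), and the three helper functions are illustrative names of mine rather than library calls.

```python
def conv_weights(c_in, c_out, k):
    """Weights of a k x k convolution: C_i * C_o * k^2."""
    return c_in * c_out * k * k

def linear_weights(channels, h, w, out_features):
    """Weights of the first dense layer after flattening a C x h x w feature map."""
    return channels * h * w * out_features

def global_avg_pool(feature_maps):
    """feature_maps: list of channels, each an h x w nested list. One mean per channel."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]

vgg_first_fc = linear_weights(512, 7, 7, 4096)
vgg_conv = conv_weights(512, 512, 3)
print(f"first dense layer: {vgg_first_fc:,} weights")
print(f"one 3x3 conv 512->512: {vgg_conv:,} weights")

# Two channels of size 2x2 collapse to two numbers, with no parameters involved
pooled = global_avg_pool([[[1.0, 3.0], [5.0, 7.0]],
                          [[0.0, 0.0], [0.0, 4.0]]])
print(pooled)
```

The single dense layer holds roughly 40 times the weights of a wide 3×3 convolution, which is the cost NiN avoids.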

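3.2 Code implementation

First define the NiN block. The kernel size, output channels, and stride of the first convolution in each block are given in the overview figure.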

def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU())

Then build the NiN network. Note that global average pooling produces a 4-D tensor, so a Flatten is still needed to get shape (batch size, classes).

net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, strides=4, padding=0),
    nn.MaxPool2d(3, stride=2),
    nin_block(96, 256, kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3, stride=2),
    nin_block(256, 384, kernel_size=3, strides=1, padding=1),
    nn.MaxPool2d(3, stride=2),
    # There are 10 label classes
    nin_block(384, 10, kernel_size=3, strides=1, padding=1),
    nn.AdaptiveAvgPool2d((1, 1)),
    # Turn the 4-D output into a 2-D output of shape (batch size, 10)
    nn.Flatten())

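The first input channel count is 1 because Fashion-MNIST is grayscale; the final output is 10 because there are 10 classes.

Note the final AdaptiveAvgPool2d: the (1, 1) argument is the target output size, so the layer averages over whatever height and width it receives and returns a 1×1 map per channel. That is simply how PyTorch expresses global average pooling; see the official documentation for details.

3.2.1 Layer-by-layer output shapes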

X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)

'''
Sequential output shape:	 torch.Size([1, 96, 54, 54])
MaxPool2d output shape:	 torch.Size([1, 96, 26, 26])
Sequential output shape:	 torch.Size([1, 256, 26, 26])
MaxPool2d output shape:	 torch.Size([1, 256, 12, 12])
Sequential output shape:	 torch.Size([1, 384, 12, 12])
MaxPool2d output shape:	 torch.Size([1, 384, 5, 5])
Sequential output shape:	 torch.Size([1, 10, 5, 5])
AdaptiveAvgPool2d output shape:	 torch.Size([1, 10, 1, 1])
Flatten output shape:	 torch.Size([1, 10])
'''

3.2.2 Training call

lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = load_data_fashion_mnist_nw2(batch_size, resize=224)
train_ch6_data(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

4.GoogLeNet

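4.1 Model overview

4.1.1 The Inception block

As the figure shows, an Inception block does not change the spatial size; it only changes the number of channels. How many channels each branch gets is a hyperparameter, found by trial rather than derived from any principle. See the figure annotations for the rest:

4.1.2 Overall structure

The network has five parts:

4.2 Code implementation

4.2.1 First define the Inception block: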

class Inception(nn.Module):
    # c1--c4 are the output channel counts of the four branches
    def __init__(self, in_channels, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Branch 1: a single 1x1 convolution
        self.p1_1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        # Branch 2: 1x1 convolution followed by a 3x3 convolution
        self.p2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        # Branch 3: 1x1 convolution followed by a 5x5 convolution
        self.p3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        # Branch 4: 3x3 max pooling followed by a 1x1 convolution
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)
        
    def forward(self, x):
        p1 = F.relu(self.p1_1(x))
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        # Concatenate the branch outputs along the channel dimension
        return torch.cat((p1, p2, p3, p4), dim=1)

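c1 to c4 are the output channel counts of the four branches; read them alongside the Inception figure. Take branch 2 as an example: a 1×1 convolution followed by a 3×3 convolution, so the input channels of the 3×3 are the output channels of the 1×1 before it. That intermediate count is also a hyperparameter that has to be set in advance.

A pooling layer computes each output Y_ij as the maximum or the average of the elements of X inside the pooling window. If stride is not set it defaults to the window size, but the output size still follows the formula o = ⌊(i + 2p - k)/s⌋ + 1. With k=3, s=1, p=1 this gives o = i, so branch 4 does not change the size either.

The final cat uses dim=1 because in a 4-D tensor the channel dimension sits at index 1 (the second position).

4.2.2 Overall definition

GoogLeNet as a whole has five blocks, laid out as in the overview figure. The channel counts in each Inception block are hyperparameters found by experiment; there is no deeper reason behind them.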

b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b2 = nn.Sequential(nn.Conv2d(64, 64, kernel_size=1),
                    nn.ReLU(),
                    nn.Conv2d(64, 192, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b3 = nn.Sequential(Inception(192, 64, (96, 128), (16, 32), 32),
                    Inception(256, 128, (128, 192), (32, 96), 64),
                    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b4 = nn.Sequential(Inception(480, 192, (96, 208), (16, 48), 64),
                    Inception(512, 160, (112, 224), (24, 64), 64),
                    Inception(512, 128, (128, 256), (24, 64), 64),
                    Inception(512, 112, (144, 288), (32, 64), 64),
                    Inception(528, 256, (160, 320), (32, 128), 128),
                    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b5 = nn.Sequential(Inception(832, 256, (160, 320), (32, 128), 128),
                    Inception(832, 384, (192, 384), (48, 128), 128),
                    nn.AdaptiveAvgPool2d((1,1)),
                    nn.Flatten())

net = nn.Sequential(b1, b2, b3, b4, b5, nn.Linear(1024, 10))

4.2.3 Layer-by-layer output shapes

X = torch.rand(size=(1, 1, 96, 96))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)

'''
Sequential output shape:	 torch.Size([1, 64, 24, 24])
Sequential output shape:	 torch.Size([1, 192, 12, 12])
Sequential output shape:	 torch.Size([1, 480, 6, 6])
Sequential output shape:	 torch.Size([1, 832, 3, 3])
Sequential output shape:	 torch.Size([1, 1024])
Linear output shape:	 torch.Size([1, 10])
'''

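This shows that GoogLeNet is sensitive to channel counts; unlike VGG, its channels are not easy to rescale. It is not sensitive to the input size, however: the network ends in global average pooling, so the spatial output is always (1, 1) and only the channel count affects the final softmax. Matching the input channels of b1 to the data is all that is needed.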
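The channel sensitivity is easy to see by checking the bookkeeping: each Inception block outputs c1 + c2[1] + c3[1] + c4 channels, and that sum must equal the in_channels of the next block. The plain-Python sketch below replays the numbers from b3, b4, and b5 above; inception_out and stage_outputs are hypothetical helpers for this check only.

```python
def inception_out(c1, c2, c3, c4):
    """Channels after concatenating the four branches along dim=1."""
    return c1 + c2[1] + c3[1] + c4

def stage_outputs(stage):
    """Check that each block's in_channels matches the previous output; return outputs."""
    outs = []
    for idx, (in_ch, c1, c2, c3, c4) in enumerate(stage):
        if idx > 0 and in_ch != outs[-1]:
            raise ValueError(f"block {idx}: expected {outs[-1]} input channels, got {in_ch}")
        outs.append(inception_out(c1, c2, c3, c4))
    return outs

# (in_channels, c1, c2, c3, c4), copied from the definitions of b3, b4, b5
b3 = [(192, 64, (96, 128), (16, 32), 32),
      (256, 128, (128, 192), (32, 96), 64)]
b4 = [(480, 192, (96, 208), (16, 48), 64),
      (512, 160, (112, 224), (24, 64), 64),
      (512, 128, (128, 256), (24, 64), 64),
      (512, 112, (144, 288), (32, 64), 64),
      (528, 256, (160, 320), (32, 128), 128)]
b5 = [(832, 256, (160, 320), (32, 128), 128),
      (832, 384, (192, 384), (48, 128), 128)]

b3_out, b4_out, b5_out = stage_outputs(b3), stage_outputs(b4), stage_outputs(b5)
print(b3_out, b4_out, b5_out)
```

Changing any single branch width means the following block's in_channels has to change too, which is why the channels cannot be scaled as casually as in VGG.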

4.2.4 Training call

Don't ask why this uses the original d2l functions: I switched servers! No more suffering on a tiny local GPU!

lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu(1))

5.ResNet

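5.1 Model overview

Start with the structure of the residual block:

The parts of this model are closely tied together, so let's go straight to the code.

5.2 Code implementation

5.2.1 Implementing the residual block: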

class Residual(nn.Module): #@save
    def __init__(self, input_channels, num_channels,
                use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_channels,
                                kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels,
                                kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels, num_channels,
                                    kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)
        
    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)

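Read this alongside Figure 7.6.3. The first 3×3 convolution takes the stride and the input and output channel counts; typically, when the size is halved the channel count is doubled.

The second convolution sets no stride, so s defaults to 1; it changes neither the channels nor the size.

Summary for the first 3×3 layer: with strides=1 the size is unchanged, with strides=2 it is halved, and the channels follow the arguments passed in (the output size is rounded down).

Summary for the 1×1 layer: with strides=1 the size is unchanged, with strides=2 it is halved, and the channels follow the arguments passed in (the output size is rounded down, so 1.5 becomes 1).

To add to the summaries above: convolution output sizes are rounded down, as the figure shows:

If the 1×1 convolution is used, an extra conv3 is created. Its output must have the same channel count, height, and width as the output of the two 3×3 convolutions.

BatchNorm: after a convolution, pass the number of output channels of that convolution; after a fully connected layer, pass the number of output features of that layer. This is needed to construct gamma, beta, moving_mean, and moving_var, whose shapes are (1, n, 1, 1) in the convolutional case and (1, n) in the fully connected case.

The 1×1 path and the 3×3 path must end with the same channels and size so that they can be added.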
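That the two paths really do agree in size, including for odd inputs where rounding down matters, can be checked by brute force. This is a plain-Python sketch assuming the formula o = floor((i + 2p - k) / s) + 1; main_path and shortcut are names I made up to mirror conv1/conv2 and conv3 of the Residual class.

```python
def out_size(i, k, s, p):
    """Spatial size after a convolution: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

def main_path(i, stride):
    """conv1 (3x3, padding 1, given stride) followed by conv2 (3x3, padding 1, stride 1)."""
    i = out_size(i, k=3, s=stride, p=1)
    return out_size(i, k=3, s=1, p=1)

def shortcut(i, stride):
    """conv3 (1x1, padding 0, given stride)."""
    return out_size(i, k=1, s=stride, p=0)

# Compare both paths for every input size from 1 to 299 and for both strides
mismatches = [(i, s) for i in range(1, 300) for s in (1, 2)
              if main_path(i, s) != shortcut(i, s)]
print("mismatches:", mismatches)
print("56 ->", main_path(56, 2), "and 7 ->", main_path(7, 2))
```

Both expressions reduce to floor((i - 1) / s) + 1, so Y += X is always shape-compatible as long as the channel counts match.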

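5.2.2 Why must channels and size match? A review of broadcasting and the BN layer

This calls for a review of broadcasting, which came up while revisiting how the BN layer is computed: why must BN be given the number of output channels, or the number of outputs of a Linear layer?

To explore how the variables inside BN relate to each other, I built a small network and an input tensor of shape (2, 2, 5, 5).

Feeding it through the BN block gives the following result:

First look at gamma and beta. The num_features argument is used mainly to build the shapes of these four variables: (1, n) for fully connected layers and (1, n, 1, 1) for convolutions.

Next, inside the BN computation, X has the post-convolution size 3 = 5 - 3 + 1; the channels go from 2 to 6 and the batch size is unchanged. mean and var are computed by reducing over dimensions, and because keepdim=True the reduced dimensions become 1, so mean and var have shape (1, 6, 1, 1), as expected.

The key step is the elementwise computation of X_hat: [(2,6,3,3) - (1,6,1,1)] / sqrt((1,6,1,1) + eps). Thanks to broadcasting the result has shape (2,6,3,3). It is then used to compute Y, which by the same reasoning has the same shape as the incoming X. So this whole series of operations preserves shape: BN does not change the input or output size, it only normalizes the data and applies a scale and shift.

As seen here, tensors of these two shapes can be divided elementwise through broadcasting.

The essence of broadcasting

Remember the short version of the rules. Compare the two tensors dimension by dimension, starting from the trailing dimension. For each dimension, one of the following must hold: 1. the two sizes are equal; 2. one tensor has the dimension and the other does not; 3. both have it, but one of them has size 1.

So why can the example above be broadcast, while the one below fails even though only one dimension (channels) differs?

Looking closely, the second example does not satisfy the condition that one tensor has the dimension and the other has it with size 1 (the size there is 5).

In summary, the Y + X of a residual block requires matching channels and size!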
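The three rules are short enough to write down as code. Below is a plain-Python sketch of a shape checker that follows them; broadcast_shape is a helper written for this post, not a PyTorch function, and the failing shape pair (2, 6, 3, 3) vs (2, 5, 3, 3) is a made-up example of a channel mismatch.

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape of two shapes, or raise ValueError."""
    result = []
    # Walk from the trailing dimension; a missing leading dimension counts as size 1
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da == db or da == 1 or db == 1:
            result.append(db if da == 1 else da)
        else:
            raise ValueError(f"sizes {da} and {db} cannot be broadcast")
    return tuple(reversed(result))

# BN case: X against mean/var of shape (1, 6, 1, 1)
bn_shape = broadcast_shape((2, 6, 3, 3), (1, 6, 1, 1))
print(bn_shape)

# Hypothetical residual mismatch: only the channel dimension differs, and neither size is 1
try:
    broadcast_shape((2, 6, 3, 3), (2, 5, 3, 3))
    failed = False
except ValueError as err:
    failed = True
    print("error:", err)
```

A difference in a single dimension is enough to fail unless one of the two sizes in that dimension is exactly 1.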

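5.2.3 Implementing the residual network

The full model has five blocks. The first is the same as in GoogLeNet except that a batch normalization layer follows the convolution: a 7×7 convolution with s=2 brings the channels to 64, followed by a 3×3 max-pooling layer with s=2.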

b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
        nn.BatchNorm2d(64), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

Implementing the residual modules:

See the full model figure:

The residual module:

def resnet_block(input_channels, num_channels, num_residuals,
                first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(input_channels, num_channels,
                                use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels, num_channels))
    return blk

b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 128, 2))
b4 = nn.Sequential(*resnet_block(128, 256, 2))
b5 = nn.Sequential(*resnet_block(256, 512, 2))

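Note: not False evaluates to True; not True evaluates to False.

The first module does not use stride=2 (the max pooling in the first block has already halved the size), so both of its residual blocks take the else branch: input and output channels are equal and the size is not halved.

There are 4 modules in total, each made of several residual blocks, as set by num_residuals.

Here each module has 2 residual blocks. In the first module, when i==0, not first_block is not True, i.e. False, so the else branch runs; when i==1 the else branch runs as well. Both residual blocks of the first module therefore keep channels and size unchanged and have no 1×1 convolution, which matches the full model figure.

Each of the 3 later modules also contains two residual blocks: the first (i==0) includes the 1×1 convolution and the second does not, which also makes sense.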
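The branch logic can be traced without constructing any layers. The sketch below reuses the same if/else as resnet_block but records the arguments that each Residual would receive; resnet_block_plan is a hypothetical helper for this trace.

```python
def resnet_block_plan(input_channels, num_channels, num_residuals, first_block=False):
    """Return (in_channels, out_channels, use_1x1conv, stride) for each residual block."""
    plan = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            # First residual block of a later module: halve the size, change channels
            plan.append((input_channels, num_channels, True, 2))
        else:
            # Defaults of Residual: use_1x1conv=False, strides=1
            plan.append((num_channels, num_channels, False, 1))
    return plan

plans = {
    "b2": resnet_block_plan(64, 64, 2, first_block=True),
    "b3": resnet_block_plan(64, 128, 2),
    "b4": resnet_block_plan(128, 256, 2),
    "b5": resnet_block_plan(256, 512, 2),
}
for name, plan in plans.items():
    print(name, plan)
```

Only the first residual block of b3, b4, and b5 gets the 1×1 convolution and stride 2; every other block keeps channels and size.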

Assembling the model:

net = nn.Sequential(b1, b2, b3, b4, b5,
                    nn.AdaptiveAvgPool2d((1,1)),
                    nn.Flatten(), nn.Linear(512, 10))

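As shown, both residual blocks of the first residual module (b2) have s=1 and no 1×1 convolution; in each later module the first residual block has s=2 and a 1×1 convolution, consistent with the description above.

Check and run: the output confirms that the second block (b2) does not halve the size again.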

X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)

'''
Sequential output shape:	 torch.Size([1, 64, 56, 56])
Sequential output shape:	 torch.Size([1, 64, 56, 56])
Sequential output shape:	 torch.Size([1, 128, 28, 28])
Sequential output shape:	 torch.Size([1, 256, 14, 14])
Sequential output shape:	 torch.Size([1, 512, 7, 7])
AdaptiveAvgPool2d output shape:	 torch.Size([1, 512, 1, 1])
Flatten output shape:	 torch.Size([1, 512])
Linear output shape:	 torch.Size([1, 10])
'''

5.2.4 Training call

lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu(1))

6.DenseNet

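Mu Li's video course does not cover this network; it appears only in the book. So this section is what I reproduced, debugged, and worked out on my own.

6.1 Model overview and schematic

DenseNet consists mainly of dense blocks and transition layers. The former define how inputs and outputs are concatenated, while the latter control the number of channels so that the model does not become too complex. Talk is cheap:

6.2 Code implementation

Start with how a dense block is built. First define the convolution unit used inside a dense block, which applies BN + ReLU + a convolution (k=3, p=1). It leaves the size unchanged and only changes the channels: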

def conv_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=3, padding=1))

class DenseBlock(nn.Module):
    def __init__(self, num_convs, input_channels, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(
                num_channels * i + input_channels, num_channels))
        self.net = nn.Sequential(*layer)
        
    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate the input and output of each block along the channel dimension
            X = torch.cat((X, Y), dim=1)
        return X

Build a demo to see what the dense block does:

blk = DenseBlock(2, 3, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape

'''
torch.Size([4, 23, 8, 8])
'''

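From this we can draw the schematic:

To summarize: with c input channels, DenseBlock(a, c, d) outputs c + a·d channels (3 + 2×10 = 23 in the demo). Together with the dense-block figure, this also makes the original figure from the paper easier to understand (each dense block here uses only two convolutional layers):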
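The c + a·d rule follows directly from the concatenation in forward: every conv_block sees all channels accumulated so far and adds d more. A plain-Python sketch of that bookkeeping, with dense_block_channels as an illustrative helper name:

```python
def dense_block_channels(num_convs, input_channels, growth):
    """Return (input channels of each conv_block, output channels of the dense block)."""
    conv_inputs = [input_channels + growth * i for i in range(num_convs)]
    output_channels = input_channels + growth * num_convs
    return conv_inputs, output_channels

# Same arguments as the demo: DenseBlock(2, 3, 10)
conv_inputs, out_channels = dense_block_channels(2, 3, 10)
print("each conv_block receives:", conv_inputs)
print("dense block outputs:", out_channels)
```

The list of inputs matches the num_channels * i + input_channels expression in DenseBlock.__init__.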

Next, the transition layer:

def transition_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))

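A transition layer uses a 1×1 convolution to reduce the channels from input_channels to the num_channels passed to it;

the average pooling with s=2 then halves the size, further reducing model complexity.

Continue the demo from above: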

blk = transition_block(23, 10)
blk(Y).shape

'''
torch.Size([4, 10, 4, 4])
'''

6.3 Building DenseNet

First build the first block:

b1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

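Nothing much to say: it processes the image the same way as the first block of ResNet.

Now the main part, the dense blocks and transition layers: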

# num_channels is the current number of channels
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
blks = []
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    blks.append(DenseBlock(num_convs, num_channels, growth_rate))
    # Number of output channels of the previous dense block
    num_channels += num_convs * growth_rate
    # Add a transition layer between dense blocks to halve the number of channels
    if i != len(num_convs_in_dense_blocks) - 1:
        blks.append(transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2

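This uses 4 dense blocks with 4 convolutional layers each. The channels per convolutional layer (the growth rate) are set to 32, so each dense block adds 128 channels.

Between modules, ResNet reduces the width and height with s=2 in a residual block, while DenseNet uses a transition layer to halve the width and height and reduce the channels.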
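Following num_channels through the loop shows what the final BatchNorm2d and Linear layers receive. This plain-Python sketch repeats the arithmetic of the loop above without creating any layers; densenet_channel_trace is a hypothetical name.

```python
def densenet_channel_trace(num_channels, growth_rate, num_convs_in_dense_blocks):
    """Return the channel count after the stem, each dense block, and each transition."""
    trace = [num_channels]
    last = len(num_convs_in_dense_blocks) - 1
    for i, num_convs in enumerate(num_convs_in_dense_blocks):
        num_channels += num_convs * growth_rate   # dense block adds num_convs * growth_rate
        trace.append(num_channels)
        if i != last:
            num_channels //= 2                    # transition layer halves the channels
            trace.append(num_channels)
    return trace

trace = densenet_channel_trace(64, 32, [4, 4, 4, 4])
print(trace)
```

The last entry is the channel count after the fourth dense block, which has no transition layer after it.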

net = nn.Sequential(
    b1, *blks,
    nn.BatchNorm2d(num_channels), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(num_channels, 10))

Then train:

lr, num_epochs, batch_size = 0.1, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu(1))