Deep Convolutional Neural Networks (AlexNet): torch study notes

Article link: https://blog.csdn.net/weixin_43180762/article/details/124436985

Deep Convolutional Neural Networks (AlexNet)

Translated from: Dive into Deep Learning

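1. Learning Feature Representations

Another way to put it is that the most important part of the pipeline is the representation. Until 2012, representations were computed mechanically. In fact, engineering a new set of feature functions, improving results, and writing up the method was a prominent genre of paper. SIFT, SURF, HOG, bags of visual words, and similar feature extractors ruled the field.

Another group of researchers, including Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng, Shun-ichi Amari, and Juergen Schmidhuber, had different plans. They believed that features themselves ought to be learned. Moreover, they believed that to be reasonably complex, features ought to be composed hierarchically from multiple jointly learned layers, each with learnable parameters. For images, the lowest layers might detect edges, colors, and textures. Indeed, Krizhevsky, Sutskever, and Hinton (2012) designed a new variant of a convolutional neural network that achieved excellent performance in the ImageNet challenge.

Interestingly, in the lowest layers of the network, the model learned feature extractors that resemble some traditional filters. A figure reproduced from that paper depicts these lower-level image descriptors. Higher layers of the network may build on these representations to represent larger structures, such as eyes, noses, and blades of grass. Even higher layers may represent whole objects, such as people, airplanes, dogs, or frisbees. Ultimately, the final hidden state learns a compact representation of the image that summarizes its contents, so that data belonging to different categories can be separated easily.

Although the ultimate breakthrough for multilayer convolutional networks came in 2012, a core group of researchers had dedicated themselves to this idea, attempting for many years to learn hierarchical representations of visual data. The breakthrough in 2012 can be attributed to two key factors: data and hardware.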

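2. AlexNet

AlexNet was introduced in 2012 and is named after Alex Krizhevsky, the first author of the breakthrough ImageNet classification paper. AlexNet, an 8-layer convolutional neural network, won the ImageNet Large Scale Visual Recognition Challenge 2012 by a remarkably large margin. This network showed for the first time that features obtained by learning can surpass manually designed features, breaking the previous paradigm in computer vision. The architectures of AlexNet and LeNet are very similar, as the figure below shows. Note that we provide a slightly streamlined version of AlexNet, removing some of the design quirks that were needed in 2012 to make the model fit on two small GPUs.

The design philosophies of AlexNet and LeNet are very similar, but there are also significant differences. First, AlexNet is much deeper than the comparatively small LeNet-5. AlexNet consists of eight layers: five convolutional layers, two fully connected hidden layers, and one fully connected output layer. Second, AlexNet uses ReLU instead of the sigmoid as its activation function. The sections below go into the details.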

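3. Architecture

In AlexNet's first layer, the convolution window shape is 11 × 11. Since most images in ImageNet are more than ten times taller and wider than MNIST images, objects in ImageNet data tend to occupy more pixels. Consequently, a larger convolution window is needed to capture the object. The convolution window shape in the second layer is reduced to 5 × 5, followed by 3 × 3 in the remaining convolutional layers. In addition, after the first, second, and fifth convolutional layers, the network adds max-pooling layers with a window shape of 3 × 3 and a stride of 2. Moreover, AlexNet has ten times more convolution channels than LeNet.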
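The window shapes and strides above determine how quickly the spatial resolution shrinks. As a minimal sketch, the standard output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1, can be applied layer by layer. The helper names `out_size` and `trace_sizes` are illustrative, the padding values are taken from the model code later in this post, and a 224 × 224 input is assumed.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) for the conv and pooling layers
# of the simplified AlexNet used in this post
layers = [
    ("conv1", 11, 4, 1),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

def trace_sizes(n):
    """Return the spatial size after each layer, starting from an n x n input."""
    sizes = []
    for name, k, s, p in layers:
        n = out_size(n, k, s, p)
        sizes.append((name, n))
    return sizes

sizes = trace_sizes(224)
for name, n in sizes:
    print(f"{name}: {n} x {n}")
```

With 256 channels in the last convolutional layer, the final 5 × 5 feature map flattens to 256 × 5 × 5 = 6400 values, which is the input size of the first fully connected layer.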

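After the last convolutional layer come two fully connected layers with 4096 outputs each. These two huge fully connected layers hold most of the model's parameters, which the book describes as nearly 1 GB. Because GPU memory was limited in early GPUs, the original AlexNet used a dual data stream design, so that each of its two GPUs was responsible for storing and computing only its half of the model. Fortunately, GPU memory is comparatively abundant now, so we rarely need to split a model across GPUs these days (our version of the AlexNet model deviates from the original paper in this respect).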
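To get a feel for how much the fully connected layers weigh, the following sketch counts their parameters by hand. It assumes the layer sizes of the simplified model defined later in this post (6400 → 4096 → 4096 → 10), not the original 1000-class network, and the helper name `linear_params` is illustrative.

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features

# Fully connected layers of the simplified model: 6400 -> 4096 -> 4096 -> 10
fc_shapes = [(6400, 4096), (4096, 4096), (4096, 10)]
counts = [linear_params(i, o) for i, o in fc_shapes]
total = sum(counts)

print(counts)
print(total, "parameters,", total * 4 / 1e6, "MB as float32")
```

For this simplified version the fully connected layers come to about 43 million parameters, roughly 172 MB when stored as 32-bit floats. Gradients and optimizer state add to the memory needed during training.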

[Figure: LeNet (left) and AlexNet (right)]

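4. Activation Functions

In addition, AlexNet replaced the sigmoid activation function with the simpler ReLU activation function. On the one hand, the ReLU activation function is cheaper to compute. For example, it has none of the exponentiation found in the sigmoid activation function. On the other hand, the ReLU activation function makes model training easier under different parameter initialization methods. This is because, when the output of the sigmoid activation function is very close to 0 or 1, the gradient in those regions is almost 0, so backpropagation cannot keep updating some of the model parameters. In contrast, the gradient of the ReLU activation function in the positive interval is always 1. Therefore, if the model parameters are not initialized properly, the sigmoid function may obtain a gradient of almost 0 in the positive interval, so that the model cannot be trained effectively.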
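The saturation argument can be checked numerically. The sketch below uses plain Python rather than PyTorch; the function names are illustrative. It compares the derivative of the sigmoid, s(x)(1 - s(x)), with the derivative of ReLU at a few positive inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in [0.5, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

The sigmoid derivative peaks at 0.25 for x = 0 and falls toward 0 as the input grows, while the ReLU derivative stays at 1 for every positive input.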

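5. Capacity Control and Preprocessing

AlexNet controls the model complexity of the fully connected layers with dropout, while LeNet uses only weight decay. To augment the data further, the AlexNet training loop adds a great deal of image augmentation, such as flipping, cropping, and color changes. This makes the model more robust, and the larger effective sample size reduces overfitting. We will discuss data augmentation in more detail in Section 14.1.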
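As a rough illustration of what dropout does to a layer's activations during training, here is a minimal pure-Python sketch of inverted dropout. The function name `dropout` and the list-based interface are assumptions made for the example. PyTorch's `nn.Dropout` applies the same 1/(1 - p) scaling in training mode and passes inputs through unchanged in evaluation mode.

```python
import random

def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, scale the rest by 1/(1-p)."""
    if p >= 1.0:
        return [0.0 for _ in values]
    scale = 1.0 / (1.0 - p)
    return [v * scale if rng.random() >= p else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 8
print(dropout(activations, 0.5, rng))  # each unit is either zeroed or doubled
print(dropout(activations, 0.0, rng))  # p = 0 leaves the activations unchanged
```

Scaling the surviving units keeps the expected value of each activation the same, so no rescaling is needed at test time.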

import sys
sys.path.insert(0, '..')  # make the local d2l package importable

import d2l
import torch
import torch.nn as nn
import torch.optim as optim

class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)

net = nn.Sequential(
    # Larger 11x11 window to capture objects; stride 4 greatly reduces height and width
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Smaller window; padding 2 keeps height and width unchanged
    nn.Conv2d(96, 256, kernel_size=5, padding=2),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Three consecutive 3x3 convolutional layers
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    Flatten(),
    # Dropout on the flattened features; plain nn.Dropout is used because the input is 2D
    nn.Dropout(p=0.5),
    nn.Linear(in_features=6400, out_features=4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(in_features=4096, out_features=4096),
    nn.ReLU(),
    # Output layer: 10 classes for Fashion-MNIST instead of the 1000 in the paper
    nn.Linear(in_features=4096, out_features=10)
)
print(net)

Sequential(
  (0): Conv2d(1, 96, kernel_size=(11, 11), stride=(4, 4), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (3): Conv2d(96, 256, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (4): ReLU(inplace=True)
  (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (6): Conv2d(256, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): ReLU(inplace=True)
  (8): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (9): ReLU(inplace=True)
  (10): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (13): Flatten()
  (14): Dropout(p=0.5, inplace=False)
  (15): Linear(in_features=6400, out_features=4096, bias=True)
  (16): ReLU()
  (17): Dropout(p=0.5, inplace=False)
  (18): Linear(in_features=4096, out_features=4096, bias=True)
  (19): ReLU()
  (20): Linear(in_features=4096, out_features=10, bias=True)
)

We construct a single-channel data instance with height and width both equal to 224 to observe the output shape of each layer. The shapes match the figure above.

X = torch.randn(size=(1, 1, 224, 224))
# Flatten reshapes the last MaxPool2d output [1, 256, 5, 5] into [1, 6400],
# because 256 * 5 * 5 = 6400; this is why the first Linear layer has 6400 inputs.
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'Output shape:\t', X.shape)

Conv2d Output shape:	 torch.Size([1, 96, 54, 54])
ReLU Output shape:	 torch.Size([1, 96, 54, 54])
MaxPool2d Output shape:	 torch.Size([1, 96, 26, 26])
Conv2d Output shape:	 torch.Size([1, 256, 26, 26])
ReLU Output shape:	 torch.Size([1, 256, 26, 26])
MaxPool2d Output shape:	 torch.Size([1, 256, 12, 12])
Conv2d Output shape:	 torch.Size([1, 384, 12, 12])
ReLU Output shape:	 torch.Size([1, 384, 12, 12])
Conv2d Output shape:	 torch.Size([1, 384, 12, 12])
ReLU Output shape:	 torch.Size([1, 384, 12, 12])
Conv2d Output shape:	 torch.Size([1, 256, 12, 12])
ReLU Output shape:	 torch.Size([1, 256, 12, 12])
MaxPool2d Output shape:	 torch.Size([1, 256, 5, 5])
Flatten Output shape:	 torch.Size([1, 6400])
Dropout Output shape:	 torch.Size([1, 6400])
Linear Output shape:	 torch.Size([1, 4096])
ReLU Output shape:	 torch.Size([1, 4096])
Dropout Output shape:	 torch.Size([1, 4096])
Linear Output shape:	 torch.Size([1, 4096])
ReLU Output shape:	 torch.Size([1, 4096])
Linear Output shape:	 torch.Size([1, 10])

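6. Reading Data

Although AlexNet uses ImageNet in the paper, we use Fashion-MNIST here, since training an ImageNet model to convergence can take hours or days even on a modern GPU. One problem with applying AlexNet directly to Fashion-MNIST is that our images have a lower resolution (28 × 28 pixels) than ImageNet images. To make things work, we upsample them to 224 × 224 (generally not a smart practice, but we do it here to stay faithful to the AlexNet architecture). We perform this resizing with the Resize class, inserting it into the processing pipeline before the ToTensor class. The Compose class chains these two transformations together so they are easy to call.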

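7. Training

Now we can start training AlexNet. Compared with LeNet in the previous section, the main change here is a smaller learning rate. Training is also much slower, because the network is deeper and wider, the image resolution is higher, and the convolutions are more expensive.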

batch_size = 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)

import time  # train_ch5 below uses time.time() for per-epoch timing

def evaluate_accuracy(data_iter, net, device):
    """Compute the classification accuracy of net on data_iter."""
    # Local helper so the training function below is self-contained
    net.eval()  # Switch to evaluation mode (disables dropout)
    acc_sum, n = 0.0, 0
    with torch.no_grad():
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            acc_sum += (net(X).argmax(dim=1) == y).sum().item()
            n += y.shape[0]
    return acc_sum / n

def train_ch5(net, train_iter, test_iter, criterion, num_epochs, batch_size, device, lr=None):
    """Train and evaluate a model with CPU or GPU."""
    print('training on', device)
    net.to(device)
    optimizer = optim.SGD(net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        net.train()  # Switch to training mode
        n, start = 0, time.time()
        train_l_sum, train_acc_sum = 0.0, 0.0
        for X, y in train_iter:
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            loss = criterion(y_hat, y)
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                y = y.long()
                # criterion returns the batch mean, so weight it by the batch size
                train_l_sum += loss.item() * y.shape[0]
                train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
                n += y.shape[0]

        test_acc = evaluate_accuracy(test_iter, net, device)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc, time.time() - start))

lr, num_epochs, device = 0.01, 5, d2l.try_gpu()

def init_weights(m):
    # Xavier-initialize the weights of every linear and convolutional layer
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        torch.nn.init.xavier_uniform_(m.weight)

net.apply(init_weights)
criterion = nn.CrossEntropyLoss()
# train_ch5 builds its own SGD optimizer from lr, so no separate optimizer is created here
train_ch5(net, train_iter, test_iter, criterion, num_epochs, batch_size, device, lr)
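The `init_weights` function relies on Xavier (Glorot) uniform initialization, which draws each weight from U(-a, a) with a = gain × sqrt(6 / (fan_in + fan_out)). As a small sketch of the scale this produces, the following computes the bound for two layers of the model above. The helper name `xavier_uniform_bound` is illustrative, and a gain of 1 is assumed, which is the default of `torch.nn.init.xavier_uniform_`.

```python
import math

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound a of the U(-a, a) distribution used by Xavier (Glorot) uniform init."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

# Linear(6400, 4096): fan_in = 6400, fan_out = 4096
fc1_bound = xavier_uniform_bound(6400, 4096)

# Conv2d(1, 96, kernel_size=11): fans are channel counts times the kernel area
conv1_bound = xavier_uniform_bound(1 * 11 * 11, 96 * 11 * 11)

print(fc1_bound, conv1_bound)
```

Both bounds are small (between 0.02 and 0.03), which keeps the initial activations and gradients at a moderate scale for layers with many inputs and outputs.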