ResNet Study and Testing: When Things Went Wrong

ney18781902474

Published 2024-07-10 15:15:32


Article link: https://blog.csdn.net/ney18781902474/article/details/140170568


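After the previous two posts on VGG and AlexNet, this time I study ResNet. Let's start with the name. AlexNet is named after Alex (Krizhevsky), Hinton's student, who worked himself to the bone to build it; VGG came out of Oxford's Visual Geometry Group and takes the group's abbreviation; ResNet's name is more meaningful than either: "Res" stands for residual, as in residual block.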

The authors' motivation at the time

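At the time, the whole field was moving toward deep learning and making progress on image tasks such as recognition, detection, and segmentation. People had also found that, for a given network architecture, increasing width and depth improves performance: VGG19 does better than VGG13, and Google's Inception paper also pointed out that greater depth raises accuracy (at the cost of more computation and harder training). The common phenomenon (problem) was that, for a given architecture, deeper networks need more training. The experimental figure from the paper is as follows.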

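The figure shows that after the same $4\times10^4$ training iterations, the 56-layer network has a higher error rate on the training set than the 20-layer network, i.e., it has learned less: the deeper the network, the lower its learning efficiency. Put another way, to reach the same accuracy a deeper network needs more training iterations (more epochs) than a shallow one, so training is inefficient. (The paper calls this the degradation problem and notes that it is not caused by overfitting, since the training error itself is higher.)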

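An aside: the authors mention that reliable initialization methods and batch normalization could already address vanishing and exploding gradients as depth grows, i.e., the problem of training failing to converge. This outstanding contribution comes from Google's batch normalization paper, whose appendix describes Inception + BN; as I understand it, adding VGG-style small convolution kernels and label smoothing on top of that gives Inception v2.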

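My own experience with Batch Normalization in earlier experiments: when I was studying AlexNet I did not know batch normalization existed, and setting lr was a real headache. Too small and training would not move at all; even lr=0.01 was hard to train with. I ended up using lr=0.03 just to get training started, because with lr=0.01 the loss did not drop even after 3 or 5 epochs! I did not inspect the actual gradient values, but I believe the cause was gradients that were too small: the parameter update equals gradient × lr, and since raising lr made training converge, some layers must have had very small gradients, which makes training hard when lr is also small. In the VGG study, without batch normalization and with lr=0.025 it took 30 epochs to reach 20% accuracy on the train set; with batch normalization and lr=0.005 it took only 3 epochs to reach the same 20%. This strongly suggests that BN greatly alleviates vanishing gradients and also makes training more efficient. So even though I never studied Inception specifically, I have deep respect for BN. Note that the comparison here is of accuracy on the train set. What BN actually solved back then was that deep networks could not be trained at all under vanishing gradients; the efficiency gain is a bonus on top.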
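
The relation "update = gradient × lr" can be made concrete with a toy calculation. This is only a sketch: the gradient magnitudes below are assumed for illustration and were not measured in the AlexNet or VGG runs described here.

```python
def sgd_update(grad, lr):
    """Size of one plain-SGD step for a single parameter: lr * grad."""
    return lr * grad

# Hypothetical gradient magnitudes (assumed, not measured in this post).
healthy_grad = 1e-1
tiny_grad = 1e-5

for lr in (0.01, 0.03):
    # With a tiny gradient the step stays tiny; tripling lr only triples it.
    print(lr, sgd_update(healthy_grad, lr), sgd_update(tiny_grad, lr))
```

Raising lr scales every layer's step by the same factor, so it helps the layers with small gradients only at the price of larger steps everywhere else.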

What does "residual" mean in Residual Block

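First, what is this residual? Start with everyday life: if every day we make a little progress on top of the previous day, we will eventually make a lot of progress, or at least not regress; however small, it is still progress. The idea of a residual net is that each layer only needs to make some additional progress on top of the previous layer's features. Written formally, this "extra progress on top of the existing basis" is $H(x) = x+F(x)$, where x is the previous layer's output (yesterday's level), F(x) is the small progress to be made today, and H(x) is my level at the end of today. "Residual" literally means residue or leftovers, i.e., a tiny gain (even the smallest food scrap can be eaten): it is F(x), the small bit of progress, formally $F(x) = H(x) - x$. The figure in the paper shows this as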

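Note: in the figure above, the whole path from input x to output is $\mathrm{relu}(H(x)) = \mathrm{relu}(F(x) + x)$, i.e., a ReLU is applied on top of the level reached at the end of today.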

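The paper gives the concrete formula for the residual F(x) as

$F = W_2\,\sigma(W_1 x)$ (with $\sigma$ denoting ReLU and biases omitted), i.e., F = linear(relu(linear(x))).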
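
To make the formula concrete, here is a minimal pure-Python sketch of one residual block on a small vector. The helper names (`linear`, `residual_block`) and the weight matrices are assumptions chosen for illustration; the real block uses convolutions and batch normalization rather than plain matrices.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, v):
    """Matrix-vector product W @ v, with W given as a list of rows (no bias)."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """relu(F(x) + x) with F(x) = W2 @ relu(W1 @ x)."""
    f = linear(W2, relu(linear(W1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With all-zero weights F(x) = 0, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0]
```

The zero-weight case shows why "at least no regression" is easy for a residual block: doing nothing is the default, and the weights only need to learn the correction on top of x.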

Verifying the efficiency of residual networks

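When people mention residual networks, the first thing they think of is usually depth (152 layers even back then), but they tend to overlook efficiency. In fact efficiency is where it shines most: to reach the same train-set accuracy, a deep network with residual structure (ResNet) does not need more training than a plain network (plainnet). From another angle, if ResNet reaches higher train-set accuracy than plainnet after the same amount of training, then the residual block has learned something and brought real progress. And that progress is really not mere scraps (residue); it is at least something worth chewing on.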

Structure of the VGG34 used for comparison

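The code below is a VGG34 built to be very similar to ResNet34 for comparison; the difference is that I removed the shortcut branch from the residual structure.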

import torch
from torch import nn


def conv3x3(in_channels=None, out_channels=None, stride=None):
    # 3x3 convolution; padding=1 keeps the spatial size when stride=1
    return nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, stride=stride, padding=1)

def bn(channels=None):
    return nn.BatchNorm2d(channels)

def relu():
    return nn.ReLU(inplace=True)

class CResBlock(nn.Module):
    """Plain (shortcut-free) counterpart of the ResNet basic block."""
    def __init__(self, in_channels=None, out_channels=None, down_sample=None):
        super().__init__()
        self.down_sample = down_sample
        # The first conv halves the spatial size when this block down-samples.
        stride = 2 if down_sample else 1
        self.conv1 = conv3x3(in_channels=in_channels, out_channels=out_channels, stride=stride)
        self.bn1 = bn(channels=out_channels)
        self.relu1 = relu()
        self.conv2 = conv3x3(in_channels=out_channels, out_channels=out_channels, stride=1)
        self.bn2 = bn(channels=out_channels)
        self.relu2 = relu()
        # No shortcut layers here: the plain network never adds the input back.

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu2(out)

        return out


class CStage(nn.Module):
    def __init__(self,in_channels=None,out_channels=None,layer_num=None,down_sample=None):
        super().__init__()
        layers = []
        if down_sample:
            layers.append(CResBlock(in_channels=in_channels,out_channels=out_channels,down_sample=True))
        else:
            layers.append(CResBlock(in_channels=in_channels,out_channels=out_channels,down_sample=False))
        for i in range(1,layer_num):
            layers.append(CResBlock(in_channels=out_channels,out_channels=out_channels,down_sample=False))
        self.net = nn.Sequential(*layers)
        
    def forward(self,x):
        return self.net(x)


class CVGG34(nn.Module):
    def __init__(self):
        super().__init__()
        #layer 0
        self.conv0 = nn.Conv2d(in_channels=3,out_channels=64,kernel_size=7,stride=2,padding=3,bias=True)
        self.pool = nn.MaxPool2d(kernel_size=3,stride=2,padding=1)
        self.flat = nn.Flatten()
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(in_features=512,out_features=100)

        self.stage_0 = nn.Sequential(
            self.conv0,
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            self.pool
        )
        
        stage_layer_num = [3,4,6,3]
        self.stage1 = CStage(64,64,stage_layer_num[0],False)
        self.stage2 = CStage(64,128,stage_layer_num[1],True)
        self.stage3 = CStage(128,256,stage_layer_num[2],True)
        self.stage4 = CStage(256,512,stage_layer_num[3],True)
    
    def forward(self,x):
        x = self.stage_0(x)
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        
        x = self.avgpool(x)
        x = self.flat(x)
        x = self.fc(x)
        
        return x
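
As a sanity check on the "34" in the name and on the feature-map sizes, the arithmetic can be traced without PyTorch. This is a sketch under one assumption not stated in the post: a 224x224 input. The `conv_out` helper is a hypothetical name for the usual convolution output-size formula.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

stage_layer_num = [3, 4, 6, 3]          # blocks per stage, as in the model above
# stem conv + two 3x3 convs per block + final fc
weighted_layers = 1 + 2 * sum(stage_layer_num) + 1

size = 224                                             # assumed input resolution
size = conv_out(size, kernel=7, stride=2, padding=3)   # conv0
size = conv_out(size, kernel=3, stride=2, padding=1)   # max pool
sizes = [size]
for down_sample in (False, True, True, True):          # stage1..stage4
    if down_sample:
        size = conv_out(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

print(weighted_layers)  # 34
print(sizes)            # [56, 56, 28, 14, 7]
```

The final 7x7 map is then reduced to 1x1 by the adaptive average pool, which is why the fc layer takes 512 input features regardless of the input resolution.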

Training code

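from torch.utils.tensorboard import SummaryWriter

# my_vgg34, device, train_loader, eval0_loader and train_process are
# defined elsewhere and are not shown in this post.
writer = SummaryWriter('vgg_34')
lr = 0.001
epochs = 100
loss = nn.CrossEntropyLoss().to(device=device)
# Plain SGD (no momentum) with weight decay
optimizer = torch.optim.SGD(params=my_vgg34.parameters(), lr=lr, weight_decay=0.003)
train_process(my_vgg34, loss, optimizer, train_loader, eval0_loader, epochs, lr, writer)
writer.close()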

Visualizing the training results

Accuracy reaches 10% at epoch=5 (i.e., after 6 epochs) and 20% at epoch=10; I will come back to analyze this when comparing with resnet34.

Results of the residual network resnet34

The figure below shows how accuracy changes while resnet34 is trained for 125 epochs (lr=0.001) on mini-imagenet (60000 samples; 95% train set, 5% eval set); the run took 10 hours.
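
For reference, the split sizes and the rough cost per epoch implied by these numbers work out as follows. This is plain arithmetic on the figures quoted above; the 10 hours is treated as a round number, so the per-epoch time is only approximate.

```python
total_samples = 60000
train_size = total_samples * 95 // 100   # 95% train set
eval_size = total_samples - train_size   # remaining 5% eval set

epochs = 125
minutes_per_epoch = 10 * 60 / epochs     # 10 hours spread over 125 epochs

print(train_size, eval_size)   # 57000 3000
print(minutes_per_epoch)       # 4.8
```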

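The residual network used is the resnet34 provided officially by PyTorch. However, comparing this resnet34 training run with the vgg34 run, vgg34 reaches 20% accuracy at epoch 10 and 30% at epoch=15, while resnet34 is nowhere near as accurate at the same epochs! The results do not match expectations. It crashed: the official resnet34 performs surprisingly poorly. Fine, I will implement one myself and see how it does.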

Building my own residual network resnet34

from torch import nn


def conv3x3(in_channels=None, out_channels=None, stride=None):
    # 3x3 convolution; padding=1 keeps the spatial size when stride=1
    return nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, stride=stride, padding=1)

def bn(channels=None):
    return nn.BatchNorm2d(channels)

def relu():
    return nn.ReLU(inplace=True)

class CResBlock(nn.Module):
    """Basic residual block: relu(F(x) + shortcut(x))."""
    def __init__(self, in_channels=None, out_channels=None, down_sample=None):
        super().__init__()
        self.down_sample = down_sample
        # The first conv halves the spatial size when this block down-samples.
        stride = 2 if down_sample else 1
        self.conv1 = conv3x3(in_channels=in_channels, out_channels=out_channels, stride=stride)
        self.bn1 = bn(channels=out_channels)
        self.relu1 = relu()
        self.conv2 = conv3x3(in_channels=out_channels, out_channels=out_channels, stride=1)
        self.bn2 = bn(channels=out_channels)
        self.relu2 = relu()
        # Projection shortcut, only needed when the output shape changes.
        # It is a 3x3 stride-2 conv here, not the 1x1 stride-2 conv of the official model.
        if down_sample:
            self.down_sample_rout = conv3x3(in_channels=in_channels, out_channels=out_channels, stride=2)
            self.bn_sample = bn(channels=out_channels)

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.down_sample:
            identity = self.bn_sample(self.down_sample_rout(x))
        # Add the shortcut before the final ReLU: relu(F(x) + x)
        out = self.relu2(out + identity)

        return out
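
The choice of a 3x3 stride-2 convolution for the down-sampling shortcut, rather than a 1x1 stride-2 one, can be examined by counting which input positions each variant actually reads. The sketch below works along one spatial axis; `touched_positions` is a hypothetical helper and the size 56 is an assumed feature-map width.

```python
def touched_positions(size, kernel, stride, padding):
    """Input indices (1-D) read by at least one output of a conv layer."""
    n_out = (size + 2 * padding - kernel) // stride + 1
    touched = set()
    for o in range(n_out):
        start = o * stride - padding
        for k in range(kernel):
            i = start + k
            if 0 <= i < size:          # ignore zero-padding positions
                touched.add(i)
    return touched

size = 56
frac_1x1 = len(touched_positions(size, kernel=1, stride=2, padding=0)) / size
frac_3x3 = len(touched_positions(size, kernel=3, stride=2, padding=1)) / size
print(frac_1x1, frac_3x3)   # 0.5 1.0
```

Per axis the 1x1 stride-2 shortcut reads half of the positions, which is a quarter of a 2-D feature map, while the 3x3 stride-2 variant reads every position.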


class CStage(nn.Module):
    def __init__(self,in_channels=None,out_channels=None,layer_num=None,down_sample=None):
        super().__init__()
        layers = []
        if down_sample:
            layers.append(CResBlock(in_channels=in_channels,out_channels=out_channels,down_sample=True))
        else:
            layers.append(CResBlock(in_channels=in_channels,out_channels=out_channels,down_sample=False))
        for i in range(1,layer_num):
            layers.append(CResBlock(in_channels=out_channels,out_channels=out_channels,down_sample=False))
        self.net = nn.Sequential(*layers)
        
    def forward(self,x):
        return self.net(x)


class CResNet34(nn.Module):
    def __init__(self):
        super().__init__()
        #layer 0
        self.conv0 = nn.Conv2d(in_channels=3,out_channels=64,kernel_size=7,stride=2,padding=3,bias=True)
        self.pool = nn.MaxPool2d(kernel_size=3,stride=2,padding=1)
        self.flat = nn.Flatten()
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(in_features=512,out_features=100)

        self.stage_0 = nn.Sequential(
            self.conv0,
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            self.pool
        )
        
        stage_layer_num = [3,4,6,3]
        self.stage1 = CStage(64,64,stage_layer_num[0],False)
        self.stage2 = CStage(64,128,stage_layer_num[1],True)
        self.stage3 = CStage(128,256,stage_layer_num[2],True)
        self.stage4 = CStage(256,512,stage_layer_num[3],True)
    
    def forward(self,x):
        x = self.stage_0(x)
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        
        x = self.avgpool(x)
        x = self.flat(x)
        x = self.fc(x)
        
        return x

Training results of my own resnet34

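Clearly my own resnet34 does better than the official PyTorch resnet34 and learns more efficiently: to reach the same 40% accuracy, the official one needs about 50 epochs, while mine needs only 15. Compared with the 25 epochs vgg34 takes to reach 40% accuracy, my resnet34 is also considerably more efficient.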
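
All of these comparisons have the form "first epoch at which accuracy reaches a threshold". A small helper makes that reading reproducible from a logged accuracy curve. This is a sketch: `first_epoch_reaching` is a hypothetical name, and the curve below is made-up toy data, not the measurements from this post.

```python
def first_epoch_reaching(accuracies, threshold):
    """Return the first 0-based epoch whose accuracy is >= threshold, or None."""
    for epoch, acc in enumerate(accuracies):
        if acc >= threshold:
            return epoch
    return None

# Toy accuracy curve (made-up numbers for illustration only).
toy_curve = [0.02, 0.05, 0.11, 0.18, 0.24, 0.31]
print(first_epoch_reaching(toy_curve, 0.20))  # 4
print(first_epoch_reaching(toy_curve, 0.40))  # None
```

Epochs are counted from 0 here, matching the convention used earlier where epoch=5 means six epochs have been completed.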

What caused the crash

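The likely cause is that my resnet34 differs from the official one. Looking closely at the official structure, its conv3x3 layers often have no bias, which I think limits learning capacity to some extent: in a 2-D binary classification setting, without a bias the separating line always passes through the origin, while with a bias it can sit anywhere in the plane. (One caveat: when a convolution is immediately followed by BatchNorm, the normalization subtracts the per-channel mean, so a constant bias is cancelled and BN's own shift parameter takes over that role; the missing bias may therefore matter less than this argument suggests.) The second difference is the down-sampling: the official shortcut down-samples with conv1x1, stride=2, but I felt that would skip most of the input positions for no good reason (a 1x1 kernel with stride 2 reads only one position in every 2x2 patch), so I used conv3x3, stride=2 instead. Other than that there should be no differences. I used lr=0.001 throughout the experiments: since batch normalization is present, I kept it the same everywhere and did not raise or lower lr to compensate for depth-dependent training difficulty, so the differences in learning efficiency are not due to different lr.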
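
Whether the missing conv bias can matter when BatchNorm follows directly can be checked numerically. The sketch below normalizes one channel with batch statistics, as BN does in training mode, with the learnable scale and shift left out. The activation values and the bias are made-up numbers, and `batch_norm` is a simplified stand-in, not PyTorch's implementation.

```python
def batch_norm(values, eps=1e-5):
    """Normalize one channel over a batch: (v - mean) / sqrt(var + eps)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

pre_bn = [0.5, -1.0, 2.0, 3.5]     # made-up conv outputs for one channel
bias = 0.75                        # hypothetical conv bias

without_bias = batch_norm(pre_bn)
with_bias = batch_norm([v + bias for v in pre_bn])

max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_diff < 1e-9)  # True
```

Adding a constant shifts the mean by the same constant, so it disappears after mean subtraction. Under this simplified view, a bias in a convolution that feeds straight into BN adds no expressive power, which points toward the shortcut design or other setup differences as more plausible explanations.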