PPSIG: Contrastive Learning Applied to Supervised Learning (Classification Tasks)

PPSIG: Understanding how contrastive learning is applied to fully supervised learning (image classification), through the paper Contrastive Deep Supervision

Official code: https://github.com/ArchipLab-LinfengZhang/contrastive-deep-supervision

The paper PDFs (Contrastive Deep Supervision.pdf and Supervised Contrastive Learning.pdf) are included in the project. The dataset and a detailed description of the task can be found at https://aistudio.baidu.com/aistudio/projectdetail/4197628 ; skipping it does not affect reading this project.

1. Abstract

The success of deep learning usually comes with growth in network depth. However, the traditional training scheme only supervises the last layer of the neural network and propagates the supervision layer by layer, which leads to optimization difficulties in the intermediate layers.

Recently, deep supervision has been proposed to add auxiliary classifiers to the intermediate layers of deep neural networks. By optimizing these auxiliary classifiers with the supervised task loss, supervision can be applied to the shallow layers directly.

However, deep supervision conflicts with the well-known observation that the shallow layers learn low-level features instead of task-biased high-level semantic features.

To address this issue, this paper proposes a novel training framework named Contrastive Deep Supervision, which supervises the intermediate layers with augmentation-based contrastive learning. Experimental results on nine popular datasets with eleven models demonstrate its effects on general image classification, fine-grained image classification and object detection, in supervised learning, semi-supervised learning and knowledge distillation. The code has been released on GitHub.
[Figure 1 from the paper]

Translation of the Fig. 1 caption

An overview of the four methods. The black and blue arrows indicate the paths of forward computation and gradient back-propagation, respectively. "proj" and "fc" denote the projection heads and the fully connected classifiers. The gray dashed line indicates whether a feature is task-irrelevant or task-biased. (a) Traditional supervised learning only supervises the last layer and propagates the supervision to the earlier layers, leading to gradient vanishing. (b)(c) Deep supervision directly trains the last layer and the intermediate layers, which solves gradient vanishing but makes all layers task-biased. (d) Our method introduces contrastive learning to supervise the intermediate layers, which avoids these problems.


2. Introduction

With the growth of large-scale datasets and computing resources, deep neural networks have become the dominant models for various tasks. However, the increase in network depth also brings challenges to training. The traditional supervised training scheme applies supervision only to the last layer and then propagates the error from the last layer to the shallow layers (Fig. 1(a)), which causes optimization difficulties for the intermediate layers, such as vanishing gradients.

Recently, deep supervision (a.k.a. deeply-supervised nets) addresses this problem by optimizing the intermediate layers directly. As shown in Fig. 1(b), deep supervision attaches several auxiliary classifiers to intermediate layers at different depths. During training, these classifiers are optimized together with the original final classifier (e.g. with cross-entropy for classification tasks). Both experimental and theoretical analyses have demonstrated its effectiveness in facilitating model convergence.
However, the deep supervision strategy also has problems. In general, different layers in a CNN tend to learn features at different levels: the shallow layers usually learn low-level features such as colors and edges, while the last several layers learn high-level, task-related semantic features, such as categorical knowledge for classification tasks. Deep supervision forces the shallow layers to learn task-related knowledge, which violates the network's natural feature-extraction process. As pointed out in MSDNet, this conflict sometimes leads to accuracy degradation in the final classifier. This observation suggests that the supervised task loss is probably not the best supervision for optimizing the intermediate layers.

In this paper, we argue that contrastive learning can provide better supervision for intermediate layers than the supervised task loss. Contrastive learning is one of the most popular and effective techniques in representation learning. Typically, it takes two augmentations of the same image as a positive pair and augmentations of different images as negative pairs. During training, the network is trained to minimize the distance within a positive pair while maximizing the distance within a negative pair. As a result, the network learns invariance to various data augmentations, such as color jitter and random grayscale. Since these augmentation invariances are usually low-level, task-irrelevant and transferable to various vision tasks, we argue that they are more beneficial knowledge for the intermediate layers to learn.
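For reference, the standard SimCLR-style contrastive loss (InfoNCE / NT-Xent) over a batch of N images augmented into 2N views can be written as follows, where z_i and z_j are the L2-normalized embeddings of the two views of the same image and τ is a temperature. This formula is standard background, not reproduced from this paper's equations:

$$\ell_{i,j} = -\log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(z_i \cdot z_k / \tau)}$$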

Based on these observations, the paper proposes a new training framework named Contrastive Deep Supervision. It optimizes the intermediate layers with contrastive learning instead of the traditional supervised task loss. As shown in Fig. 1(d), several projection heads are attached to the intermediate layers of the network and trained to perform contrastive learning. The projection heads can be discarded during inference to avoid additional computation and storage. Unlike deep supervision, which trains the intermediate layers to learn task-specific knowledge, the intermediate layers in our method are trained to learn invariance to data augmentation, which makes the neural network generalize better. Besides, since contrastive learning can be performed on unlabeled data, the proposed contrastive deep supervision can also be easily extended to the semi-supervised learning paradigm.

Moreover, contrastive deep supervision can further boost another deep learning technique, knowledge distillation. Knowledge distillation (KD) is a popular model-compression method that aims to transfer knowledge from a cumbersome teacher model to a lightweight student model. Recently, many works have found that distilling the "crucial knowledge" inside the backbone features, such as attention and relations [72,49,58], leads to better performance than directly distilling all the backbone features. In this paper, we show that the data-augmentation invariances learned by the intermediate layers in contrastive deep supervision are more beneficial knowledge for distillation. By combining contrastive deep supervision with feature distillation, the distilled ResNet18 achieves 73.23% accuracy on ImageNet, 4.02% and 2.16% higher than the baseline and the second-best KD method, respectively.

3. Related Work

3.1 Deep Supervision

Deep neural networks usually contain a large number of layers, which increases the difficulty of optimization. To address this issue, deeply-supervised nets (a.k.a. deep supervision) were proposed to directly supervise the intermediate layers of deep neural networks [38]. Wang et al. show that deep supervision can alleviate the vanishing gradient problem and thus leads to significant performance improvements [62]. Usually, deep supervision attaches several auxiliary classifiers at the intermediate layers and supervises these auxiliary classifiers with the task loss (e.g. cross-entropy loss in classification). Recently, several methods have been proposed to improve deep supervision with knowledge distillation, which aims to minimize the difference between the prediction of the deepest classifier and the auxiliary classifiers in the intermediate layers [55,40]. Besides classification, abundant research has also demonstrated the effectiveness of deep supervision methods in dynamic neural networks [78], semantic segmentation [81,73,51], object detection [39], knowledge distillation [76] and so on.

3.2 Contrastive Learning

In the last several years, contrastive learning has become the most popular method in representation learning [74,63,32,27,68,18,60,5,24,61]. Oord et al. propose contrastive predictive coding, which aims to predict the low-dimensional embedding of future signals with an auto-regressive model [47]. He et al. propose MoCo, which introduces a dynamic memory bank to record the embeddings of negative samples [19,9,11]. Then, SimCLR is proposed to show the importance of large batch size and long training time in contrastive learning [7,8]. Recently, abundant research has studied the influence of negative samples further. BYOL demonstrates that contrastive learning is effective even without negative samples [16]. SimSiam gives a detailed study on the importance of batch normalization, negative samples, the memory bank, and the stop-gradient operation [10]. Besides self-supervised learning, contrastive learning has also shown its power in the traditional supervised learning paradigm. Khosla et al. show that state-of-the-art performance can be achieved on ImageNet with the basic contrastive learning of SimCLR by building the positive pairs with label supervision [34,6]. Park et al. apply contrastive learning to unpaired image-to-image translation, which breaks the limitation of cycle reconstruction [48].

3.3 Knowledge Distillation

Knowledge distillation, which aims to facilitate the training of a lightweight student model under the supervision of an over-parameterized teacher model, has become one of the most popular methods in model compression. Knowledge distillation was first proposed by Bucilua et al. [2] and then expanded by Hinton et al. [23], who introduce a temperature-characterized softmax to soften the distribution of teacher logits. Instead of distilling the knowledge in the logits, more and more techniques are proposed to distill the information in teacher features or their variants, such as attention maps [72,42], negative values [22], task-oriented information [76], relational information [49,58,43], Gram matrices [69], mutual information [1], context information [75] and so on. Besides model compression, knowledge distillation has also achieved significant success in self-supervised learning [30,46], semi-supervised learning [37,56], multi-exit neural networks [78,77,70], incremental learning [83] and model robustness [65,79].

4. Methodology

4.1 Deep Supervision

The deep supervision loss consists of two parts: the cross-entropy loss on the final layer of the normal model, and cross-entropy losses that supervise the intermediate layers. If knowledge distillation is added on top, a KL-divergence term is also included, so that the output probabilities of the auxiliary classifiers at the intermediate layers stay close to the output of the final layer.
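The original equation image is missing here; a sketch of the objective consistent with the description above, assuming f is the full network, f_i the feature at the i-th intermediate stage, c_i its auxiliary classifier, and λ_i a weighting coefficient:

$$\mathcal{L}_{DS} = \mathcal{L}_{CE}\big(f(x), y\big) + \sum_{i=1}^{K} \lambda_i\, \mathcal{L}_{CE}\big(c_i(f_i(x)), y\big)$$

In the knowledge-distillation variant, a term such as $\sum_i \mathrm{KL}\big(\sigma(z/\tau)\,\|\,\sigma(z_i/\tau)\big)$ is added, where z is the final-layer logits and z_i the i-th auxiliary logits.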

4.2 Contrastive Deep Supervision

Contrastive deep supervision simply replaces the cross-entropy losses that supervised the intermediate layers with contrastive losses. Traditional contrastive learning is used in unsupervised learning, where no label information exists: each image is treated as its own class, N images are each augmented twice, the two augmented views of the same image form a positive pair, and the other 2N-2 views are all negatives. Here, however, the loss from Supervised Contrastive Learning is used, which makes better use of the label information: features of samples from the same class are pulled closer together, and features from different classes are pushed further apart.
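Again the equation image is missing; a sketch of the overall training objective, assuming p_i denotes the i-th projection head and x̃⁽¹⁾, x̃⁽²⁾ the two augmented views. This matches the training loop in Section 5.4, which computes cross_loss plus 0.1 times the sum of the per-stage contrastive losses:

$$\mathcal{L} = \mathcal{L}_{CE}\big(f(\tilde{x}^{(1)}), y\big) + \lambda \sum_{i=1}^{K} \mathcal{L}_{SupCon}\Big(p_i\big(f_i(\tilde{x}^{(1)})\big),\, p_i\big(f_i(\tilde{x}^{(2)})\big);\, y\Big)$$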

5. Code Walkthrough

5.1 Data Processing

5.1.1 Unpacking the Data
import os
if not os.path.isdir("./data/d"):
    os.mkdir("./data/d")

!unzip -qo data/data156024/Sports10.zip -d ./data/d
5.1.2 Implementing a Few Data Augmentations
  1. RandomGrayscale re-implements PyTorch's RandomGrayscale; p is the probability that the image is converted to grayscale.
  2. RandomApply applies the wrapped augmentation func with probability p.
  3. Cutout cuts holes out of the image: n_holes holes, each of height and width length.
from paddle.vision.transforms import BaseTransform
from paddle.vision import transforms  # provides transforms.Grayscale used below
import random
class RandomGrayscale(BaseTransform):
    def __init__(self, p = 0.2, keys=None):
        super(RandomGrayscale, self).__init__(keys)
        self.p = p
        self.grayscale = transforms.Grayscale(num_output_channels=3)    
    def _apply_image(self, img):
        r = random.random()
        if r < self.p:
            img = self.grayscale(img)

        return img 

class RandomApply(BaseTransform):
    def __init__(self,func ,p , keys=None):
        super(RandomApply, self).__init__(keys)
        self.p = p
        self.func = func    
    def _apply_image(self, img):
        r = random.random()
        if r < self.p:
            img = self.func(img)

        return img 

import paddle
import numpy as np


class Cutout(BaseTransform):    
    """Randomly mask out one or more patches from an image.
    Args:
        n_holes (int): Number of patches to cut out of each image.
        length (int): The length (in pixels) of each square patch.
    """
    def __init__(self, n_holes, length,keys = None):
        super(Cutout, self).__init__(keys)
        self.n_holes = n_holes
        self.length = length

    def _apply_image(self, img):
        """
        Args:
            img (Tensor): Tensor image of size (C, H, W).
        Returns:
            Tensor: Image with n_holes of dimension length x length cut out of it.
        """
        # print(img)
        h = img.shape[1]
        w = img.shape[2]

        mask = np.ones((h, w), np.float32)

        for n in range(self.n_holes):
            y = np.random.randint(h)
            x = np.random.randint(w)

            y1 = np.clip(y - self.length // 2, 0, h)
            y2 = np.clip(y + self.length // 2, 0, h)
            x1 = np.clip(x - self.length // 2, 0, w)
            x2 = np.clip(x + self.length // 2, 0, w)

            mask[y1: y2, x1: x2] = 0.

        mask = paddle.to_tensor(mask)
        mask = mask.expand_as(img)
        img = img * mask

        return img
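A quick sanity check of Cutout on a dummy tensor (the values here are illustrative, not from the original notebook):

img = paddle.ones([3, 72, 128])              # (C, H, W) all-ones image
img_cut = Cutout(n_holes=3, length=10)(img)
print(img_cut.shape)                          # [3, 72, 128]
print(float(img_cut.mean()))                  # < 1.0: the cut patches are zeroed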
5.1.3 Implementing the Dataset and DataLoader
  1. In training mode, the same image goes through two different augmentation pipelines, producing 2 images that serve as positive samples of each other.
  2. In validation mode, the image is only converted to a tensor and normalized.
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from visualdl import LogWriter
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import cv2
from paddle.vision.transforms import Resize
from paddle.vision import transforms
# from cutout import Cutout
log_writer = LogWriter("./log/gnet")
class Dataset(paddle.io.Dataset):
    def __init__(self,is_train = True):
        train_csv = pd.read_csv(r'./data/d/Sports10/train_split.csv', dtype='a')
        val_csv = pd.read_csv(r'./data/d/Sports10/val_split.csv', dtype='a')

        self.train_data = train_csv['filename']
        self.valid_data = val_csv['filename']
        self.train_label = train_csv['class']
        self.valid_label = val_csv["class"]
        self.class2id = {
            "AmericanFootball":0,
            "Basketball":1,
            "BikeRacing":2,
            "CarRacing":3,
            "Fighting":4,
            "Hockey":5,
            "Soccer":6,
            "TableTennis":7,
            "Tennis":8,
            "Volleyball":9
        } 
        self.is_train = is_train
        self.transform = Resize(size = (72,128)) 
        if self.is_train == True:
            self.size = len(self.train_data)
        else:
            self.size = len(self.valid_data)
        self.transform1 = transforms.Compose([
                         transforms.RandomHorizontalFlip(), transforms.ToTensor(),
                          Cutout(n_holes=3, length=10),
                         transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
                         ])
        self.transform2 = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        RandomApply(transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),p=0.8), 
        RandomGrayscale(p=0.2),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
        self.transform_valid = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
    @staticmethod
    def loader(path):
        return cv2.cvtColor(cv2.imread(path, flags=cv2.IMREAD_COLOR),cv2.COLOR_BGR2RGB)
    def __getitem__(self, index):
        if self.is_train == True:
            one_img = self.loader ("./data/d/Sports10/"+self.train_data[index])
            one_label = self.class2id[self.train_label[index]]
            one_img = self.transform(one_img)
            one_img1 = self.transform1(one_img)
            one_img2 = self.transform2(one_img)
            return [one_img1,one_img2],one_label
        else:
            one_img = self.loader("./data/d/Sports10/"+self.valid_data[index])
            one_label = self.class2id[self.valid_label[index]]
            one_img = self.transform(one_img)
            one_img = self.transform_valid(one_img)     
            return one_img,one_label

    def __len__(self):
        return self.size
for img_list,label in Dataset():
    print(img_list[0].shape,img_list[1].shape)
    # a = img_list[0].transpose([1,2,0]).numpy()*255
    # print(a.shape)
    # cv2.imwrite("2.jpg",cv2.cvtColor(a,cv2.COLOR_RGB2BGR))

    break

for img,label in Dataset(False):
    print(img.shape)
    break
[3, 72, 128] [3, 72, 128]
[3, 72, 128]


BATCH_SIZE =64
train_dataset = Dataset()
data_loader = paddle.io.DataLoader(train_dataset,batch_size=BATCH_SIZE,shuffle =True,drop_last=True)
for img,label in data_loader():
    print(len(img),img[0].shape)
    break
2 [64, 3, 72, 128]
# valid_dataset = Dataset(False)
# BATCH_SIZE =64
# data_loader = paddle.io.DataLoader(valid_dataset,batch_size=BATCH_SIZE,shuffle =True,drop_last=True)
# for img,label in data_loader:
#     print(img.shape,label.shape)
#     break

5.2 The Loss Design from Supervised Contrastive Learning

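The loss-formula image is not reproduced; the L_out^sup loss from the SupCon paper, which the SupConLoss code below implements, is:

$$\mathcal{L}^{sup}_{out} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}$$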

  1. z_i is the anchor feature; z_p denotes a positive sample (any of the 2N views that shares the anchor's class, excluding the anchor itself); z_a ranges over the other 2N-1 features (everything except the anchor itself); P(i) is the set of same-class views among the 2N images. The T-shaped symbol is the temperature parameter τ, a tunable hyperparameter.

  2. Also note the exp() operation: all features must be L2-normalized first (this is already handled inside the model) to keep the numbers from getting too large.

  3. Note the identity used in the code: log(exp(a)/b) = log(exp(a)) - log(b) = a - log(b).

  4. See the SupConLoss implementation below. (It implements L_out, not L_in: the Supervised Contrastive Learning paper shows that L_out is superior to L_in, so only L_out is used in the rest of that paper.)

crossentropy_loss = nn.CrossEntropyLoss()
def CrossEntropy(outputs, targets):
    # paddle.nn.CrossEntropyLoss applies softmax internally and expects raw
    # logits, so the outputs are passed in directly; applying F.softmax first
    # would soften the distribution twice.
    return crossentropy_loss(outputs, targets)

class SupConLoss(nn.Layer):
    """Supervised Contrastive Learning: https://arxiv.org/pdf/2004.11362.pdf.
    It also supports the unsupervised contrastive loss in SimCLR"""
    def __init__(self, temperature=0.07, contrast_mode='all',
                 base_temperature=0.07):
        super(SupConLoss, self).__init__()
        self.temperature = temperature
        self.contrast_mode = contrast_mode
        self.base_temperature = base_temperature

    def forward(self, features, labels=None, mask=None):
        """Compute loss for model. If both `labels` and `mask` are None,
        it degenerates to SimCLR unsupervised loss:
        https://arxiv.org/pdf/2002.05709.pdf
        Args:
            features: hidden vector of shape [bsz, n_views, ...].
            labels: ground truth of shape [bsz].
            mask: contrastive mask of shape [bsz, bsz], mask_{i,j}=1 if sample j
                has the same class as sample i. Can be asymmetric.
        Returns:
            A loss scalar.
        """


        if len(features.shape) < 3:
            raise ValueError('`features` needs to be [bsz, n_views, ...],'
                             'at least 3 dimensions are required')
        if len(features.shape) > 3:
            features = features.reshape([features.shape[0], features.shape[1], -1])

        batch_size = features.shape[0]
        if labels is not None and mask is not None:
            raise ValueError('Cannot define both `labels` and `mask`')
        elif labels is None and mask is None:
            mask = paddle.eye(batch_size, dtype=paddle.float32)
        elif labels is not None:
            labels = labels.reshape([-1, 1])
            if labels.shape[0] != batch_size:
                raise ValueError('Num of labels does not match num of features')
            mask = (labels==labels.T).astype("float32")
        else:
            mask = mask.astype("float32")

        contrast_count = features.shape[1]  #   2
        contrast_feature = paddle.concat(paddle.unbind(features,axis=1),axis=0) #[bs,2,feature] ->[2*bs,feature]
        #   2N x 512
        if self.contrast_mode == 'one':
            anchor_feature = features[:, 0]
            anchor_count = 1
        elif self.contrast_mode == 'all':
            anchor_feature = contrast_feature   #   2N x   512
            anchor_count = contrast_count   #   2
        else:
            raise ValueError('Unknown mode: {}'.format(self.contrast_mode))

        # compute logits
        anchor_dot_contrast = paddle.divide(
            paddle.matmul(anchor_feature, contrast_feature.T),
            paddle.to_tensor(self.temperature))
        # for numerical stability
        # print (anchor_dot_contrast.shape)

        logits_max = paddle.max(anchor_dot_contrast,axis=1, keepdim=True)
        logits = anchor_dot_contrast - logits_max.detach()

        # tile mask
        mask = paddle.tile(mask, repeat_times=(anchor_count, contrast_count))
        # print(mask)
        # mask-out self-contrast cases
        # logits_mask = torch.scatter(
        #     torch.ones_like(mask),
        #     1,
        #     torch.arange(batch_size * anchor_count).view(-1, 1).to(device),
        #     0
        # )
        logits_mask = paddle.ones_like(mask)- paddle.eye(batch_size * anchor_count)
        # print(logits_mask)
        mask = mask * logits_mask

        # compute log_prob
        exp_logits = paddle.exp(logits) * logits_mask
        log_prob = logits - paddle.log(exp_logits.sum(1, keepdim=True))

        # compute mean of log-likelihood over positive
        mean_log_prob_pos = (mask * log_prob).sum(1) / mask.sum(1)
        loss = - (self.temperature / self.base_temperature) * mean_log_prob_pos
        # print(loss.mean())
        loss = loss.reshape([anchor_count, batch_size]).mean()
        return loss
import paddle.nn.functional as F
features = paddle.to_tensor([[[1,2,3],[2,2,2]],[[1,1,1],[2,2,4]],[[1,1,1],[1,0,6]]]).astype("float32")
features = F.normalize(features,axis=2)
label = paddle.to_tensor([1,2,1]).astype("float32")

loss = SupConLoss()(features,label)
print(loss)
Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       [2.38157272])
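Since SupConLoss degenerates to the unsupervised SimCLR loss when neither labels nor mask are passed (only the two views of the same image count as positives), the same features can also be scored without supervision:

loss_unsup = SupConLoss()(features)  # no labels: SimCLR-style self-supervised loss
print(loss_unsup)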

5.3 The Model

See ResNet18.py.

# The 4 auxiliary modules are the 4 projection heads

        self.auxiliary1 = nn.Sequential(
            SepConv(
                channel_in=64 * block.expansion,
                channel_out=128 * block.expansion
            ),
            SepConv(
                channel_in=128 * block.expansion,
                channel_out=256 * block.expansion
            ),
            SepConv(
                channel_in=256 * block.expansion,
                channel_out=512 * block.expansion
            ),
            nn.AvgPool2D(4, 4)
        )

        self.auxiliary2 = nn.Sequential(
            SepConv(
                channel_in=128 * block.expansion,
                channel_out=256 * block.expansion,
            ),
            SepConv(
                channel_in=256 * block.expansion,
                channel_out=512 * block.expansion,
            ),
            nn.AvgPool2D(4, 4)
        )
        self.auxiliary3 = nn.Sequential(
            SepConv(
                channel_in=256 * block.expansion,
                channel_out=512 * block.expansion,
            ),
            nn.AvgPool2D(4, 4)
        )
        self.auxiliary4 = nn.AvgPool2D(4, 4)
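ResNet18.py itself is not reproduced in this post; below is a minimal runnable sketch of how its forward pass plausibly wires these auxiliary heads, based on the snippet above and on the training loop in Section 5.4 (which calls net(x) for (out, feat_list) during training and net(x, False) for logits only during evaluation). The stage layout and head structure here are simplified stand-ins, not the paper's exact architecture:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

def stage(cin, cout):
    # Stand-in for one backbone stage; the real model uses residual blocks,
    # and its projection heads use SepConv stacks + AvgPool2D as shown above.
    return nn.Sequential(nn.Conv2D(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2D(cout), nn.ReLU())

class BackboneSketch(nn.Layer):
    def __init__(self, num_classes=10):
        super().__init__()
        self.stages = nn.LayerList([stage(3, 64), stage(64, 128),
                                    stage(128, 256), stage(256, 512)])
        # One projection head per stage (plain pooling here, for brevity)
        self.heads = nn.LayerList([nn.AdaptiveAvgPool2D(1) for _ in range(4)])
        self.pool = nn.AdaptiveAvgPool2D(1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x, train_mode=True):
        feats = []
        for s in self.stages:
            x = s(x)
            feats.append(x)
        out = self.fc(paddle.flatten(self.pool(x), 1))
        if not train_mode:
            return out                      # projection heads discarded at inference
        # L2-normalized embeddings for the SupCon loss, one per stage
        feat_list = [F.normalize(paddle.flatten(h(f), 1), axis=1)
                     for h, f in zip(self.heads, feats)]
        return out, feat_list

out, feat_list = BackboneSketch()(paddle.randn([4, 3, 72, 128]))
print(out.shape, [f.shape for f in feat_list])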
from ResNet18 import resnet18
import paddle
out,feat_list = resnet18()(paddle.randn([4,3, 360,640]))
label = paddle.to_tensor([1,2,3,4],dtype = "int64")
CrossEntropy(out,label)
Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [2.31439471])

5.4 Training

criterion = CrossEntropy
contra_criterion = SupConLoss()

BATCH_SIZE = 128
LR = 0.1
scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=LR, milestones=[30, 50, 90], gamma=0.8, verbose=True)
net = resnet18()
net.set_state_dict(paddle.load("model/84.pdparams"))
# optimizer =  paddle.optimizer.SGD(net.parameters(), lr=LR, weight_decay=5e-4, momentum=0.9)
#  paddle.optimizer.SGD(learning_rate=0.001, parameters=net.parameters(), weight_decay=5e-4, grad_clip=None, name=None)
optimizer = paddle.optimizer.Momentum(learning_rate=scheduler, momentum=0.9, parameters=net.parameters(), use_nesterov=False, weight_decay=5e-4, grad_clip=None, name=None)
train_dataset = Dataset()
data_loader = paddle.io.DataLoader(train_dataset,batch_size=BATCH_SIZE,shuffle =True,drop_last=True)
Epoch 0: MultiStepDecay set learning rate to 0.1.
def valid_accurary(classifer_net):
     with paddle.set_grad_enabled(False):
        valid_dataset = Dataset(is_train=False)
        valid_loader = paddle.io.DataLoader(valid_dataset,batch_size=BATCH_SIZE,shuffle =True,drop_last=True)
        all_acc = 0
        num = 0
        for one in valid_loader:
            img_data,cls=one
            out = classifer_net(img_data,False)
            out = nn.Softmax()(out)
            one_acc = paddle.metric.accuracy(input=out, label=cls.unsqueeze(1), k=1)
            all_acc += one_acc
            num += 1
            if num==10:
                break
        return all_acc/num

def train_accurary(classifer_net):
     with paddle.set_grad_enabled(False):
        train_dataset = Dataset(is_train=True)
        train_loader = paddle.io.DataLoader(train_dataset,batch_size=BATCH_SIZE,shuffle =True,drop_last=True)
        all_acc = 0
        num = 0
        for one in tqdm(train_loader):
            img_data,cls=one
            out = classifer_net(img_data[0],False)
            out = nn.Softmax()(out)
            one_acc = paddle.metric.accuracy(input=out, label=cls.unsqueeze(1), k=1)
            all_acc += one_acc
            num += 1
            if num==10:
                break
        return all_acc/num
v_acc = valid_accurary(net)
print("验证集的acc为",v_acc)
验证集的acc为 Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       [0.80000001])

import os
epoches =100
i = 0
supervision = True

v_acc_max = 0
for epoch in range(epoches):
    print("epoch",epoch)
    if epoch%1 == 0:
        net.eval()
        v_acc = valid_accurary(net)
        train_acc = train_accurary(net)
        net.train()
        print("epoch loss",loss.numpy()[0],v_acc.numpy()[0],train_acc.numpy()[0])
        log_writer.add_scalar(tag='train/v_acc', step=i, value=v_acc)
        log_writer.add_scalar(tag='train/train_acc', step=i, value=train_acc)
        if v_acc > v_acc_max:
            v_acc_max = v_acc
            save_param_path_model = os.path.join("model", 'Gmodel_state'+str(v_acc_max.numpy()[0])+'.pdparams')
            paddle.save(net.state_dict(), save_param_path_model)
    data_loader = paddle.io.DataLoader(train_dataset,batch_size=BATCH_SIZE,shuffle =True,drop_last=True)
    for data in tqdm(data_loader):
        inputs, labels = data
        bsz = labels.shape[0]
        #   inputs[0], inputs[1] -> the two augmented views (transform1 / transform2) of the same images
        inputs = paddle.concat([inputs[0], inputs[1]], axis=0)
        inputs, labels = inputs, labels
        # print(inputs)
        outputs, feat_list = net(inputs)
        outputs = outputs[:bsz]
        cross_loss = criterion(outputs, labels)
        c_loss = 0
        for index in range(len(feat_list)):
            features = feat_list[index]
            f1, f2 = paddle.split(features, [bsz, bsz], axis=0)
            features = paddle.concat([f1.unsqueeze(1), f2.unsqueeze(1)], axis=1)
            if supervision:
                #    use SUP as contrastive learning
                c_loss += contra_criterion(features, labels) * 1e-1
            else:
                #   use SimCLR as contrastive learning
                c_loss += contra_criterion(features) * 1e-1
        loss = cross_loss + c_loss
        optimizer.clear_grad()
        loss.backward()
        optimizer.step()

        log_writer.add_scalar(tag='train/ce_loss', step=i, value=cross_loss.numpy()[0])
        log_writer.add_scalar(tag='train/contrastive_loss', step=i, value=c_loss.numpy()[0])
        log_writer.add_scalar(tag='train/loss', step=i, value=loss.numpy()[0])


        if i%100 == 3:
            print("loss",loss.numpy()[0],v_acc_max)
            
        i+=1


    scheduler.step()
    # break

6. Other Details and Tricks

Design of Projection Heads

In contrastive deep supervision, several projection heads are added to the intermediate layers of neural networks during the training period. These projection heads map the backbone features into a normalized embedding space, where the contrastive learning loss is applied. As discussed in related works, the architecture of the projection head is crucial to model performance [8]. Usually, the projection head is a non-linear projection stacked from two fully connected layers and a ReLU function. However, in contrastive deep supervision, the input features come from the intermediate layers instead of the final layer, and thus it is more challenging to project them properly [8]. Hence, we increase the complexity of these projection heads by adding convolutional layers before the non-linear projection.

Feature normalization really matters when computing the SupCon loss. And because it is the intermediate layers being supervised, the projector should be more complex than the usual one. The design of this projector genuinely matters.
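As a concrete illustration of "convolutional layers before the non-linear projection", here is a minimal sketch of such a projection head; the structure and names are assumptions, not the paper's exact head:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class ProjHead(nn.Layer):
    # Assumed structure: conv refinement of the intermediate feature,
    # then the usual 2-layer MLP projection, then L2 normalization
    def __init__(self, cin, dim=128):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2D(cin, cin, 3, padding=1),
                                  nn.BatchNorm2D(cin), nn.ReLU(),
                                  nn.AdaptiveAvgPool2D(1))
        self.mlp = nn.Sequential(nn.Linear(cin, cin), nn.ReLU(),
                                 nn.Linear(cin, dim))

    def forward(self, f):
        z = self.mlp(paddle.flatten(self.conv(f), 1))
        return F.normalize(z, axis=1)   # map into a normalized embedding space

z = ProjHead(64)(paddle.randn([4, 64, 18, 32]))
print(z.shape)   # [4, 128]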

Contrastive Learning

The proposed contrastive deep supervision is a general training framework and does not depend on a specific contrastive learning method. In this paper, we adopt SimCLR [7] and SupCon [34] as the contrastive learning methods in most experiments. We argue that the performance of our method can be further improved by using a better contrastive learning method.

This paper is best read as a complete one-stage training pipeline that brings contrastive learning into supervised training; you can also swap in better contrastive-learning ideas and losses to squeeze out more accuracy.

Negative Samples

Previous studies show that the number of negative samples has a vital influence on the performance of contrastive learning. Accordingly, a large batch size, a momentum encoder or a memory bank is usually required [7,19,16]. In contrastive deep supervision, we do not use any of these solutions because the supervised loss (L_CE in Equation 5) is enough to prevent contrastive learning from converging to the collapsing solutions.

I am not sure about this point. At the very least, a larger batch size speeds training up, and for the same 3 epochs of training, a large batch size gives clearly higher accuracy than a small one; with a small batch size you simply need more epochs. That is only my empirical observation. I also have not found evidence for the claim that the number of negative samples no longer matters, so I still lean towards a larger batch size being more effective for contrastive learning.

7. Summary

The current model reaches 0.80 accuracy on the validation set with images of size 72*128, while the original images are mostly 360*640 or 720*1280. So for this scene-recognition task the images are shrunk considerably, very likely losing a lot of detail. But because I use contrastive learning to train the intermediate layers, a large batch size is critical: the anchor has to be compared against positives and negatives, and only a large batch provides enough comparison samples. Therefore, to keep 2N large (2*batch_size, with batch_size 128), I had to downscale the data to fit within GPU memory.

Also, with the contrastive loss, valid_acc still has room to rise even when train_acc reaches 0.99, so it does seem to add some robustness to the model.

In addition, this paper lends itself well to semi-supervised learning: labeled data can use Contrastive Deep Supervision, while unlabeled data can use the plain contrastive loss.

Finally, the paper also shows how to use its method for knowledge distillation; see the Knowledge Distillation part of Section 3.2 (Contrastive Deep Supervision) in the paper. In short, the features at the teacher's and student's intermediate layers that previously computed the contrastive loss additionally compute a KL-divergence term between teacher and student.
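One plausible reading of that distillation objective, assuming superscripts S and T for student and teacher, p_i for the projection heads, and D a distance such as KL divergence on the projected features (a sketch, not the paper's exact equation):

$$\mathcal{L} = \mathcal{L}_{CE}\big(f^{S}(x), y\big) + \lambda_1 \sum_{i} \mathcal{L}_{SupCon}\big(p_i^{S}(f_i^{S}(x))\big) + \lambda_2 \sum_{i} D\Big(p_i^{S}\big(f_i^{S}(x)\big),\, p_i^{T}\big(f_i^{T}(x)\big)\Big)$$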

This article is only a repost; original: https://aistudio.baidu.com/aistudio/projectdetail/4358899
