[Paper Reading] Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning

来日可期1314

Last modified 2023-05-15 14:10:53

First published 2023-05-14 23:02:21

Article link: https://blog.csdn.net/ssjq123/article/details/130673565

Paper download
GitHub
bib:

@INPROCEEDINGS{,
  title     = {Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning},
  author    = {Eric Arazo and Diego Ortego and Paul Albert and Noel E O'Connor and Kevin McGuinness},
  booktitle = {IJCNN},
  year      = {2020},
  pages     = {1--8}
}

1. Abstract

Semi-supervised learning, i.e. jointly learning from labeled and unlabeled samples, is an active research topic due to its key role on relaxing human supervision.

An overview of semi-supervised learning.

In the context of image classification, recent advances to learn from unlabeled samples are mainly focused on consistency regularization methods that encourage invariant predictions for different perturbations of unlabeled samples.

Mentions consistency regularization in semi-supervised classification.

We, conversely, propose to learn from unlabeled data by generating soft pseudo-labels using the network predictions.

States that this paper uses pseudo-labeling (soft pseudo-labels).

We show that a naive pseudo-labeling overfits to incorrect pseudo-labels due to the so-called confirmation bias and demonstrate that mixup augmentation and setting a minimum number of labeled samples per mini-batch are effective regularization techniques for reducing it.

The core contribution. The paper highlights confirmation bias and shows that mixup augmentation and setting a minimum number of labeled samples per mini-batch are effective regularization techniques for reducing it.

The proposed approach achieves state-of-the-art results in CIFAR-10/100, SVHN, and Mini-ImageNet despite being much simpler than other methods.

These results demonstrate that pseudo-labeling alone can outperform consistency regularization methods, while the opposite was supposed in previous work.

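This is quite surprising: a pseudo-labeling method outperforms consistency regularization methods. I have not read the full paper yet, but presumably FixMatch and FlexMatch had not appeared at that point.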

2. Algorithm description

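Symbol	Meaning
$D_l = \{(x_i, y_i)\}^{N_l}_{i=1}$	Labeled data
$D_u = \{x_i\}^{N_u}_{i=1}$	Unlabeled data
$\widetilde{D} = \{(x_i, \widetilde{y}_i)\}^{N}_{i=1}$	Training data with $N = N_l + N_u$; $\widetilde{y}_i$ is the ground-truth label for labeled samples and the corresponding pseudo-label for unlabeled samples.
$h_{\theta}$	The model with parameters $\theta$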

The standard cross-entropy loss:
$\ell^*(\theta) = -\sum_{i=1}^{N}\widetilde{y}_i^{\mathsf{T}}\log(h_{\theta}(x_i)) \tag{1}$
Note:

In particular, we store the softmax predictions $h_{\theta}(x_i)$ of the network in every mini-batch of an epoch and use them to modify the soft pseudo-label $\widetilde{y}$ for the $N_u$ unlabeled samples at the end of every epoch.

We proceed as described from the second to the last training epoch, while in the first epoch we use the softmax predictions for the unlabeled samples from a model trained in a 10 epochs warm-up phase using the labeled data subset $D_l$ .

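In this paper, soft pseudo-labels are the network's predictions for the unlabeled samples from the previous epoch. Unlike a hard pseudo-label, a soft pseudo-label is not a one-hot vector but the predicted probability vector (softmax output) for the sample.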
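
The end-of-epoch update quoted above can be sketched roughly as follows. This is a minimal NumPy sketch, not the authors' code: the function name update_soft_pseudo_labels and the (N, C) array layout are assumptions made for illustration. It overwrites only the targets of the unlabeled samples with the stored softmax predictions, and contrasts the result with hard (one-hot) pseudo-labels.

```python
import numpy as np


def update_soft_pseudo_labels(targets, epoch_probs, unlabeled_indexes):
    """Return a copy of `targets` whose unlabeled rows are replaced by predictions.

    targets:           (N, C) array; one-hot rows for labeled samples,
                       soft pseudo-labels for unlabeled samples.
    epoch_probs:       (N, C) array of softmax outputs stored during the epoch.
    unlabeled_indexes: 1-D integer array of rows without a ground-truth label.
    """
    new_targets = targets.copy()
    # Only unlabeled rows change; labeled rows keep their ground-truth label.
    new_targets[unlabeled_indexes] = epoch_probs[unlabeled_indexes]
    return new_targets


# Toy example (hypothetical values): 4 samples, 3 classes; samples 2 and 3 are unlabeled.
targets = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])
epoch_probs = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.5, 0.3],
                        [0.6, 0.3, 0.1]])
unlabeled_indexes = np.array([2, 3])

soft_targets = update_soft_pseudo_labels(targets, epoch_probs, unlabeled_indexes)
# A hard pseudo-label would instead keep only the arg-max class as a one-hot vector.
hard_targets = np.eye(3)[soft_targets.argmax(axis=1)]
```

The soft targets keep the full probability vector, so an uncertain prediction contributes a correspondingly spread-out target instead of a confident one-hot label.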

Two Regularizations:
$\ell = \ell^*+\lambda_A R_A + \lambda_H R_H \tag{2}$
where

$R_A = \sum_{c=1}^{C}p_c\log(\frac{p_c}{\overline{h}_c})$ ;
$R_H = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}h_{\theta}^c(x_i) \log(h_{\theta}^c(x_i))$ .

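$R_A$ discourages assigning all samples to a single class. Here $p_c$ is the prior probability of class $c$ and $\overline{h}_c$ is the mean softmax probability the model assigns to class $c$ across the samples (the code below averages over the mini-batch and uses a uniform prior). In other words, given a dataset with both cats and dogs, the network may take the lazy route and predict "cat" for everything; this happens easily on imbalanced datasets.

$R_H$ (entropy regularization) encourages the probability distribution of each soft pseudo-label to concentrate on a single class, avoiding the local optima the network may get stuck in due to weak guidance. This is easy to understand: for each sample, the predicted class should receive a much higher probability than the other classes.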
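
To make the two regularizers concrete, the sketch below evaluates $R_A$ and $R_H$ on three toy batches. It is an illustration under stated assumptions, not the paper's implementation: the function name regularizers, the uniform prior, the eps constant, and the toy probabilities are all hypothetical choices.

```python
import numpy as np


def regularizers(probs, prior=None):
    """Compute (R_A, R_H) for a batch of softmax outputs `probs` of shape (N, C)."""
    num_classes = probs.shape[1]
    if prior is None:
        # Assumption: uniform class prior p_c = 1 / C.
        prior = np.full(num_classes, 1.0 / num_classes)
    eps = 1e-12  # guards against log(0)
    mean_probs = probs.mean(axis=0)  # \bar{h}_c, averaged over the batch
    r_a = np.sum(prior * np.log(prior / (mean_probs + eps)))
    r_h = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return r_a, r_h


# Collapsed predictions: every sample is assigned almost entirely to class 0.
collapsed = np.array([[0.98, 0.01, 0.01],
                      [0.98, 0.01, 0.01],
                      [0.98, 0.01, 0.01]])
# Balanced and confident predictions: one sample per class.
balanced = np.array([[0.98, 0.01, 0.01],
                     [0.01, 0.98, 0.01],
                     [0.01, 0.01, 0.98]])
# Balanced but maximally uncertain predictions.
uniform = np.full((3, 3), 1.0 / 3.0)

r_a_collapsed, r_h_collapsed = regularizers(collapsed)
r_a_balanced, r_h_balanced = regularizers(balanced)
r_a_uniform, r_h_uniform = regularizers(uniform)
```

$R_A$ is large only for the collapsed batch, while $R_H$ is large only for the uniform batch, so the two terms penalize different failure modes: $R_A$ the all-one-class solution, $R_H$ the indecisive one.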

Confirmation bias:

Overfitting to incorrect pseudo-labels predicted by the network is known as confirmation bias.
It is natural to think that reducing the confidence of the network on its predictions might alleviate this problem and improve generalization.

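Note: confirmation bias is defined here as the network overfitting to incorrect pseudo-labels. Reducing the network's confidence in its predictions, and therefore the weight placed on incorrect labels, can alleviate it.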

mixup regularization:

Recently, mixup data augmentation introduced a strong regularization technique that combines data augmentation with label smoothing, which makes it potentially useful to deal with this bias.

Question:

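How exactly is mixup applied within a single batch?
Answer: Within a batch, labeled and unlabeled samples are not distinguished; only at the end of each epoch are the pseudo-labels of the unlabeled samples updated according to unlabeled_indexes. So during mixup, samples are simply paired at random within the batch.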

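import numpy as np
import torch


def mixup_data(x, y, alpha=1.0, device='cuda'):
    '''Returns mixed inputs, pairs of targets, and lambda'''
    # Sample the mixing coefficient from Beta(alpha, alpha); alpha <= 0 disables mixing.
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1

    batch_size = x.size()[0]
    # Random permutation that pairs each sample with another sample of the same batch.
    if device == 'cuda':
        index = torch.randperm(batch_size).cuda()
    else:
        index = torch.randperm(batch_size)

    # Convex combination of the inputs; both targets are returned so the loss can mix them.
    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam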

How is the label of a mixed sample determined?
Answer:

import torch
import torch.nn.functional as F


def loss_mixup_reg_ep(preds, labels, targets_a, targets_b, device, lam, args):
    prob = F.softmax(preds, dim=1)
    prob_avg = torch.mean(prob, dim=0)  # mean softmax per class over the batch
    p = torch.ones(args.num_classes).to(device) / args.num_classes  # uniform class prior

    # Cross-entropy of the mixed input against each of the two (soft) targets.
    mixup_loss_a = -torch.mean(torch.sum(targets_a * F.log_softmax(preds, dim=1), dim=1))
    mixup_loss_b = -torch.mean(torch.sum(targets_b * F.log_softmax(preds, dim=1), dim=1))
    # Note that this differs in form from the original mixup computation,
    # loss(mix(x_a, x_b), mix(y_a, y_b)). The two are equal: cross-entropy is linear
    # in the target, so the line below is simply the expanded form.
    mixup_loss = lam * mixup_loss_a + (1 - lam) * mixup_loss_b

    L_p = -torch.sum(torch.log(prob_avg) * p)  # R_A up to an additive constant
    L_e = -torch.mean(torch.sum(prob * F.log_softmax(preds, dim=1), dim=1))  # R_H

    loss = mixup_loss + args.reg1 * L_p + args.reg2 * L_e
    return prob, loss
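
The relation between the two-term loss in loss_mixup_reg_ep and the original mixup formulation follows from cross-entropy being linear in the target. The notation $\widetilde{x} = \lambda x_a + (1-\lambda) x_b$ for the mixed input and $y_a, y_b$ for the two targets is introduced here for illustration only:

$-(\lambda y_a + (1-\lambda) y_b)^{\mathsf{T}} \log(h_{\theta}(\widetilde{x})) = -\lambda\, y_a^{\mathsf{T}} \log(h_{\theta}(\widetilde{x})) - (1-\lambda)\, y_b^{\mathsf{T}} \log(h_{\theta}(\widetilde{x}))$

The left-hand side is the loss on the mixed target; the right-hand side is lam * mixup_loss_a + (1 - lam) * mixup_loss_b for a single sample, and averaging over the batch preserves the equality.

Similarly, the term L_p in the code matches $R_A$ up to a constant. Expanding the definition gives

$R_A = \sum_{c=1}^{C} p_c \log(p_c) - \sum_{c=1}^{C} p_c \log(\overline{h}_c)$

The first sum does not depend on $\theta$ (it equals $-\log(C)$ for the uniform prior used in the code), so minimizing L_p $= -\sum_{c=1}^{C} p_c \log(\overline{h}_c)$ yields the same gradients as minimizing $R_A$.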

setting a minimum number of labeled samples per mini-batch:

Oversampling the labelled examples by setting a minimum number of labeled samples per mini-batch k (as done in other works) provides a constant reinforcement with correct labels during training, reducing confirmation bias and helping to produce better pseudo-labels.

Question:

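How is a single batch composed: how many labeled samples and how many unlabeled samples?
Answer: From the code, the oversampling strategy described in the paper follows the same batch-construction approach used in Mean Teacher.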

# Code obtained from:
# https://github.com/CuriousAI/mean-teacher/blob/bd4313d5691f3ce4c30635e50fa207f49edf16fe/pytorch/mean_teacher/data.py

import itertools
import logging
import os.path

from PIL import Image
import numpy as np
from torch.utils.data.sampler import Sampler



class TwoStreamBatchSampler(Sampler):
    """Iterate two sets of indices
    An 'epoch' is one iteration through the primary indices.
    During the epoch, the secondary indices are iterated through
    as many times as needed.
    """
    def __init__(self, primary_indices, secondary_indices, batch_size, secondary_batch_size):
        self.primary_indices = primary_indices
        self.secondary_indices = secondary_indices
        self.secondary_batch_size = secondary_batch_size
        self.primary_batch_size = batch_size - secondary_batch_size

        assert len(self.primary_indices) >= self.primary_batch_size > 0
        assert len(self.secondary_indices) >= self.secondary_batch_size > 0

    def __iter__(self):
        primary_iter = iterate_once(self.primary_indices)
        secondary_iter = iterate_eternally(self.secondary_indices)
        return (
            primary_batch + secondary_batch
            for (primary_batch, secondary_batch)
            in  zip(grouper(primary_iter, self.primary_batch_size),
                    grouper(secondary_iter, self.secondary_batch_size))
        )

    def __len__(self):
        return len(self.primary_indices) // self.primary_batch_size


def iterate_once(iterable):
    return np.random.permutation(iterable)


def iterate_eternally(indices):
    def infinite_shuffles():
        while True:
            yield np.random.permutation(indices)
    return itertools.chain.from_iterable(infinite_shuffles())


def grouper(iterable, n):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3) --> ABC DEF
    args = [iter(iterable)] * n
    return zip(*args)
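
To see what batches this sampler produces, here is a dependency-free sketch of the same two-stream idea. It is not the Mean Teacher code: the function name two_stream_batches, the seed argument, and the toy index ranges are assumptions for illustration, as is the choice to treat the unlabeled indices as the primary stream and the labeled indices as the secondary one.

```python
import itertools
import random


def two_stream_batches(unlabeled_indexes, labeled_indexes, batch_size,
                       labeled_per_batch, seed=0):
    """Yield batches that each hold exactly `labeled_per_batch` labeled indices.

    One pass over the unlabeled indices defines the epoch; the labeled indices
    are reshuffled and reused as often as needed (oversampling).
    """
    rng = random.Random(seed)
    unlabeled_per_batch = batch_size - labeled_per_batch
    assert 0 < unlabeled_per_batch <= len(unlabeled_indexes)
    assert 0 < labeled_per_batch <= len(labeled_indexes)

    def shuffled(indexes):
        copy = list(indexes)
        rng.shuffle(copy)
        return copy

    def endless_labeled():
        # Keep reshuffling the small labeled set so it never runs out.
        while True:
            yield from shuffled(labeled_indexes)

    unlabeled_iter = iter(shuffled(unlabeled_indexes))
    labeled_iter = endless_labeled()
    while True:
        unlabeled_part = list(itertools.islice(unlabeled_iter, unlabeled_per_batch))
        if len(unlabeled_part) < unlabeled_per_batch:
            return  # drop an incomplete final batch, as zip-based grouping does
        labeled_part = list(itertools.islice(labeled_iter, labeled_per_batch))
        yield unlabeled_part + labeled_part


# Toy setup (hypothetical): indices 0-3 are labeled, indices 4-23 are unlabeled.
labeled = list(range(4))
unlabeled = list(range(4, 24))
batches = list(two_stream_batches(unlabeled, labeled, batch_size=6, labeled_per_batch=2))
```

With 20 unlabeled and 4 labeled indices, each labeled index appears in several batches per epoch while each unlabeled index appears once, which is the constant reinforcement with correct labels that the quoted passage describes.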