@INPROCEEDINGS{,
title = {Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning},
author = {Eric Arazo and Diego Ortego and Paul Albert and Noel E O'Connor and Kevin McGuinness},
booktitle = {IJCNN},
year = {2020},
pages = {1--8}
}
1. Abstract
Semi-supervised learning, i.e. jointly learning from labeled and unlabeled samples, is an active research topic due to its key role on relaxing human supervision.
An overview of semi-supervised learning.
In the context of image classification, recent advances to learn from unlabeled samples are mainly focused on consistency regularization methods that encourage invariant predictions for different perturbations of unlabeled samples.
This introduces consistency regularization for semi-supervised classification.
We, conversely, propose to learn from unlabeled data by generating soft pseudo-labels using the network predictions.
The paper uses pseudo-labeling, specifically soft pseudo-labels.
We show that a naive pseudo-labeling overfits to incorrect pseudo-labels due to the so-called confirmation bias and demonstrate that mixup augmentation and setting a minimum number of labeled samples per mini-batch are effective regularization techniques for reducing it.
The core contribution. The paper discusses confirmation bias and shows that mixup augmentation and setting a minimum number of labeled samples per mini-batch are effective regularization techniques for reducing it.
The proposed approach achieves state-of-the-art results in CIFAR-10/100, SVHN, and Mini-ImageNet despite being much simpler than other methods.
These results demonstrate that pseudo-labeling alone can outperform consistency regularization methods, while the opposite was supposed in previous work.
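This is quite surprising: a pseudo-labeling method outperforms consistency regularization methods. I have not read the full paper yet, but it presumably predates FixMatch and FlexMatch.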
2. Algorithm Description
| Symbol | Meaning |
|---|---|
| $D_l = \{(x_i, y_i)\}_{i=1}^{N_l}$ | Labeled data |
| $D_u = \{x_i\}_{i=1}^{N_u}$ | Unlabeled data |
| $\widetilde{D}_u = \{(x_i, \widetilde{y}_i)\}_{i=1}^{N}$ | Training data, where $\widetilde{y}_i$ is the ground-truth label for labeled samples and the corresponding pseudo-label for unlabeled samples. |
| $h_{\theta}$ | The model with parameters $\theta$ |
The standard cross-entropy loss:
$$
\ell^*(\theta) = -\sum_{i=1}^{N}\widetilde{y}_i^{\mathsf{T}}\log(h_{\theta}(x_i)) \tag{1}
$$
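To make Eq. (1) concrete, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). The function name `soft_cross_entropy`, the `eps` guard, and the toy numbers are assumptions for illustration, not taken from the paper's code.

```python
import math

def soft_cross_entropy(soft_labels, probs, eps=1e-12):
    """Eq. (1): -sum_i y~_i^T log(h_theta(x_i)), summed over all N samples."""
    total = 0.0
    for y_tilde, h in zip(soft_labels, probs):
        # Inner product between the (soft) label and the log-probabilities.
        total -= sum(y_c * math.log(h_c + eps) for y_c, h_c in zip(y_tilde, h))
    return total

# Hypothetical toy data: sample 0 is labeled (one-hot target),
# sample 1 is unlabeled and carries a soft pseudo-label.
soft_labels = [[1.0, 0.0, 0.0], [0.2, 0.7, 0.1]]
probs = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1]]
loss = soft_cross_entropy(soft_labels, probs)
print(f"loss = {loss:.4f}")
```

For a one-hot target the inner sum reduces to the usual negative log-likelihood of the true class, so labeled and pseudo-labeled samples can share one loss function.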
Note:
In particular, we store the softmax predictions $h_{\theta}(x_i)$ of the network in every mini-batch of an epoch and use them to modify the soft pseudo-label $\widetilde{y}$ for the $N_u$ unlabeled samples at the end of every epoch.
We proceed as described from the second to the last training epoch, while in the first epoch we use the softmax predictions for the unlabeled samples from a model trained in a 10 epochs warm-up phase using the labeled data subset $D_l$.
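The end-of-epoch update described in the quote can be sketched as follows. This is a simplified illustration in plain Python: the helper `update_pseudo_labels` and the toy values are hypothetical, and only the name `unlabeled_indexes` is borrowed from the notes.

```python
def update_pseudo_labels(soft_labels, epoch_predictions, unlabeled_indexes):
    """Return a new label table in which only the unlabeled rows are
    overwritten with the softmax predictions stored during the epoch."""
    updated = [list(row) for row in soft_labels]
    for i in unlabeled_indexes:
        updated[i] = list(epoch_predictions[i])
    return updated

# Hypothetical toy state: sample 0 is labeled, samples 1 and 2 are unlabeled.
soft_labels = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]
# Softmax outputs collected mini-batch by mini-batch during one epoch.
epoch_predictions = [[0.6, 0.4], [0.9, 0.1], [0.2, 0.8]]

new_labels = update_pseudo_labels(soft_labels, epoch_predictions, unlabeled_indexes=[1, 2])
print(new_labels)
```

The labeled row keeps its ground-truth target even though the network also produced a prediction for it; only the unlabeled rows change.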
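Soft pseudo-labels here are the network's predictions for the unlabeled samples from the previous epoch. Unlike hard pseudo-labels, a soft pseudo-label is not a one-hot vector but the predicted probability vector (softmax output) for the sample.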
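The difference between the two label types is easiest to see on one prediction. In this small sketch the helper `to_hard_label` is a hypothetical name, and the hard label is taken as the one-hot vector of the most probable class.

```python
def to_hard_label(soft_label):
    """Convert a probability vector into a one-hot vector on its argmax."""
    top = max(range(len(soft_label)), key=soft_label.__getitem__)
    return [1.0 if c == top else 0.0 for c in range(len(soft_label))]

soft_label = [0.2, 0.7, 0.1]            # soft pseudo-label: keeps the uncertainty
hard_label = to_hard_label(soft_label)  # hard pseudo-label: commits to one class
print(soft_label, hard_label)
```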
Two Regularizations:
$$
\ell = \ell^* + \lambda_A R_A + \lambda_H R_H \tag{2}
$$
where
- $R_A = \sum_{c=1}^{C} p_c \log\left(\frac{p_c}{\overline{h}_c}\right)$;
- $R_H = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} h_{\theta}^c(x_i)\log(h_{\theta}^c(x_i))$.
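$R_A$ discourages assigning all samples to a single class. Here $p_c$ is the prior probability of class $c$, and $\overline{h}_c$ is the mean softmax probability the model assigns to class $c$ across all samples. In other words, on a dataset that contains both cats and dogs, the network may take a shortcut and predict "cat" for everything; this happens easily on imbalanced datasets.
$R_H$ (entropy regularization) encourages the probability distribution of each soft pseudo-label to concentrate on a single class, which avoids the local optima the network could get stuck in because of weak guidance. This is easy to understand: for each sample, the probability of the predicted class should be much larger than that of the other classes.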
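A small numeric check helps show what each regularizer penalizes. The sketch below is plain Python under stated assumptions: a uniform prior over two classes, made-up softmax outputs, and the hypothetical helper names `r_a` and `r_h`.

```python
import math

def r_a(probs, prior, eps=1e-12):
    """R_A: KL divergence between the class prior and the mean prediction."""
    num_classes = len(prior)
    mean_prob = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return sum(prior[c] * math.log(prior[c] / (mean_prob[c] + eps))
               for c in range(num_classes))

def r_h(probs, eps=1e-12):
    """R_H: average entropy of the per-sample predictions."""
    return -sum(sum(p_c * math.log(p_c + eps) for p_c in p) for p in probs) / len(probs)

prior = [0.5, 0.5]                      # assumed uniform class prior
balanced = [[0.9, 0.1], [0.1, 0.9]]     # confident, and both classes are used
collapsed = [[0.9, 0.1], [0.9, 0.1]]    # everything pushed to class 0
uncertain = [[0.5, 0.5], [0.5, 0.5]]    # no commitment to any class

print(r_a(balanced, prior), r_a(collapsed, prior))
print(r_h(balanced), r_h(uncertain))
```

The collapsed predictions get a clearly larger $R_A$ than the balanced ones, while the uncertain predictions get a larger $R_H$ than the confident ones, matching the two roles described above.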
Confirmation bias:
Overfitting to incorrect pseudo-labels predicted by the network is known as confirmation bias.
It is natural to think that reducing the confidence of the network on its predictions might alleviate this problem and improve generalization.
Note:
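Here confirmation bias is defined as the network overfitting to incorrect pseudo-labels. Reducing the network's confidence in its predictions, which lowers the weight placed on incorrect labels, may alleviate it.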
mixup regularization:
Recently, mixup data augmentation introduced a strong regularization technique that combines data augmentation with label smoothing, which makes it potentially useful to deal with this bias.
Question:
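- How exactly is mixup applied within a single batch?
Answer:
Within a batch, labeled and unlabeled data are not distinguished; the pseudo-labels of the unlabeled data are only updated at the end of each epoch, using unlabeled_indexes. So during mixup, samples are paired at random and mixed.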
import numpy as np
import torch

def mixup_data(x, y, alpha=1.0, device='cuda'):
    '''Returns mixed inputs, pairs of targets, and lambda'''
    # Sample the mixing coefficient from Beta(alpha, alpha); alpha <= 0 disables mixing.
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1
    batch_size = x.size()[0]
    # Random permutation of the batch: sample i is paired with sample index[i].
    if device == 'cuda':
        index = torch.randperm(batch_size).cuda()
    else:
        index = torch.randperm(batch_size)
    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam
- How are the labels of mixup samples determined?
Answer:
import torch
import torch.nn.functional as F

def loss_mixup_reg_ep(preds, labels, targets_a, targets_b, device, lam, args):
    prob = F.softmax(preds, dim=1)
    prob_avg = torch.mean(prob, dim=0)
    # Uniform class prior p_c = 1 / C.
    p = torch.ones(args.num_classes).to(device) / args.num_classes
    mixup_loss_a = -torch.mean(torch.sum(targets_a * F.log_softmax(preds, dim=1), dim=1))
    mixup_loss_b = -torch.mean(torch.sum(targets_b * F.log_softmax(preds, dim=1), dim=1))
    # Note that this differs in form from the original mixup, which computes
    # loss(mix(x_a, x_b), mix(y_a, y_b)). Cross-entropy is linear in the target,
    # so the two are equal; the line below is the expanded form.
    mixup_loss = lam * mixup_loss_a + (1 - lam) * mixup_loss_b
    # L_p corresponds to R_A and L_e to R_H in Eq. (2).
    L_p = -torch.sum(torch.log(prob_avg) * p)
    L_e = -torch.mean(torch.sum(prob * F.log_softmax(preds, dim=1), dim=1))
    loss = mixup_loss + args.reg1 * L_p + args.reg2 * L_e
    return prob, loss
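Whether the expanded mixup loss matches the loss computed on a mixed target can be checked numerically. The sketch below uses plain Python with made-up values for `lam`, the prediction, and the two targets; it relies only on cross-entropy being linear in its target.

```python
import math

def cross_entropy(target, probs):
    """-sum_c target_c * log(probs_c) for one sample."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

lam = 0.3                    # hypothetical mixing coefficient
probs = [0.6, 0.3, 0.1]      # softmax output for the mixed input
y_a = [1.0, 0.0, 0.0]        # target of the first sample (one-hot)
y_b = [0.1, 0.8, 0.1]        # target of the paired sample (soft pseudo-label)

# Form 1: mix the targets first, then compute a single loss.
mixed_target = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
loss_mixed_target = cross_entropy(mixed_target, probs)

# Form 2: compute two losses, then mix them (the expanded form).
loss_expanded = lam * cross_entropy(y_a, probs) + (1 - lam) * cross_entropy(y_b, probs)

print(loss_mixed_target, loss_expanded)
```

Both forms agree up to floating-point error, so returning `y_a`, `y_b`, and `lam` separately is a matter of implementation convenience rather than a different objective.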
setting a minimum number of labeled samples per mini-batch:
Oversampling the labelled examples by setting a minimum number of labeled samples per mini-batch k (as done in other works) provides a constant reinforcement with correct labels during training, reducing confirmation bias and helping to produce better pseudo-labels.
Question:
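- How is a single batch composed: how many labeled and how many unlabeled samples?
Answer:
The code shows that the oversampling described in the paper follows the same batch-construction strategy as Mean Teacher.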
# Code obtained from:
# https://github.com/CuriousAI/mean-teacher/blob/bd4313d5691f3ce4c30635e50fa207f49edf16fe/pytorch/mean_teacher/data.py
import itertools
import logging
import os.path

from PIL import Image
import numpy as np
from torch.utils.data.sampler import Sampler


class TwoStreamBatchSampler(Sampler):
    """Iterate two sets of indices

    An 'epoch' is one iteration through the primary indices.
    During the epoch, the secondary indices are iterated through
    as many times as needed.
    """
    def __init__(self, primary_indices, secondary_indices, batch_size, secondary_batch_size):
        self.primary_indices = primary_indices
        self.secondary_indices = secondary_indices
        self.secondary_batch_size = secondary_batch_size
        self.primary_batch_size = batch_size - secondary_batch_size

        assert len(self.primary_indices) >= self.primary_batch_size > 0
        assert len(self.secondary_indices) >= self.secondary_batch_size > 0

    def __iter__(self):
        primary_iter = iterate_once(self.primary_indices)
        secondary_iter = iterate_eternally(self.secondary_indices)
        return (
            primary_batch + secondary_batch
            for (primary_batch, secondary_batch)
            in zip(grouper(primary_iter, self.primary_batch_size),
                   grouper(secondary_iter, self.secondary_batch_size))
        )

    def __len__(self):
        return len(self.primary_indices) // self.primary_batch_size


def iterate_once(iterable):
    return np.random.permutation(iterable)


def iterate_eternally(indices):
    def infinite_shuffles():
        while True:
            yield np.random.permutation(indices)
    return itertools.chain.from_iterable(infinite_shuffles())


def grouper(iterable, n):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3) --> ABC DEF
    args = [iter(iterable)] * n
    return zip(*args)
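The batch composition produced by this sampler can be illustrated without PyTorch. The sketch below is a simplified, standard-library re-implementation of the same idea, not the Mean Teacher code itself. The function name `two_stream_batches`, the seed, and the index ranges are hypothetical, and it assumes the unlabeled indices form the primary stream and the labeled indices the secondary one.

```python
import itertools
import random

def two_stream_batches(unlabeled_indices, labeled_indices, batch_size,
                       labeled_per_batch, seed=0):
    """Yield batches with a fixed number of labeled samples each.

    One pass over the unlabeled indices defines an epoch; the labeled
    indices are reshuffled and reused as often as needed (oversampling).
    """
    rng = random.Random(seed)
    unlabeled_per_batch = batch_size - labeled_per_batch
    primary = list(unlabeled_indices)
    rng.shuffle(primary)

    def endless_labeled():
        while True:
            pool = list(labeled_indices)
            rng.shuffle(pool)
            yield from pool

    labeled_stream = endless_labeled()
    # Incomplete trailing chunks of unlabeled samples are dropped.
    last_start = len(primary) - unlabeled_per_batch + 1
    for start in range(0, last_start, unlabeled_per_batch):
        unlabeled_part = primary[start:start + unlabeled_per_batch]
        labeled_part = list(itertools.islice(labeled_stream, labeled_per_batch))
        yield unlabeled_part + labeled_part

# Hypothetical setup: indices 0-3 are labeled, indices 10-29 are unlabeled.
batches = list(two_stream_batches(range(10, 30), range(0, 4),
                                  batch_size=8, labeled_per_batch=3))
for batch in batches:
    print(batch)
```

With only 4 labeled samples and 3 labeled slots per batch, each labeled index appears several times per epoch, which is the constant reinforcement with correct labels that the paper describes.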