[Paper Reading] Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels


来日可期1314

License: CC 4.0 BY-SA

Tags: paper reading

Article link: https://blog.csdn.net/ssjq123/article/details/134904654

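The paper proposes a deep learning method called Co-teaching for combating noisy labels. Two neural networks are trained simultaneously and teach each other: each one selects the data that probably carries clean labels, and its peer trains on that selection, which improves robustness. Unlike other work, Co-teaching exploits the memorization behavior of deep networks, which makes it especially suited to learning with noisy labels.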


Paper download
GitHub
bib:

@INPROCEEDINGS{,
	 title		= {Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels},
	 author	    = {Bo Han and Quanming  Yao and Xingrui Yu and Gang Niu and Miao Xu and Weihua Hu and Ivor Tsang and Masashi Sugiyama},
	 booktitle	= {NeurIPS},
	 year		= {2018},
	 pages      = {8535--8545}
}

1. Abstract

Deep learning with noisy labels is practically challenging, as the capacity of deep models is so high that they can totally memorize these noisy labels sooner or later during training.

This is the problem being addressed.

Nonetheless, recent studies on the memorization effects of deep neural networks show that they would first memorize training data of clean labels and then those of noisy labels.

Key observation: deep networks memorize the training data with clean labels first, and the data with noisy labels only later.

Therefore in this paper, we propose a new deep learning paradigm called “Co-teaching” for combating with noisy labels.

Here "combating with noisy labels" simply means fighting against label noise, i.e. training robustly despite it.

Namely, we train two deep neural networks simultaneously, and let them teach each other given every mini-batch: firstly, each network feeds forward all data and selects some data of possibly clean labels; secondly, two networks communicate with each other what data in this mini-batch should be used for training; finally, each network back propagates the data selected by its peer network and updates itself.

This feels very similar to Co-training.

Empirical results on noisy versions of MNIST, CIFAR-10 and CIFAR-100 demonstrate that Co-teaching is much superior to the state-of-the-art methods in the robustness of trained deep models.

These are the empirical results of the algorithm.

2. Algorithm description

[Figure: pseudocode of the Co-teaching algorithm from the paper]
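The algorithm proposed in the paper is fairly simple; once you understand the pseudocode, you understand almost all of its details. The key step is: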
$$\bar{\mathcal{D}}_f = \arg\min_{\mathcal{D}': |\mathcal{D}'| \geq R(T)|\bar{\mathcal{D}}|} \ell(f, \mathcal{D}') \tag{1}$$

Once you understand how $\bar{\mathcal{D}}_f$ is computed, you understand almost the whole algorithm.

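$\bar{\mathcal{D}}$ is the current mini-batch, $\mathcal{D}'$ is a candidate subset of it, and $f$ is the model.
The constraint $|\mathcal{D}'| \geq R(T)|\bar{\mathcal{D}}|$ under the $\arg\min$ requires the selected subset to contain at least $R(T)|\bar{\mathcal{D}}|$ instances.
$R(T)$ is a ratio that changes over training time $T$.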

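Hence $\bar{\mathcal{D}}_f$ is the subset of the mini-batch made up of the $R(T)$% instances with the smallest loss under $f$. Note that it is the minimizing set of samples, not the value of the summed loss. The paper's wording is "sample R(T)% small-loss instances". One caveat: the constraint is written with $\geq$, but since this is an $\arg\min$ and $\ell$ sums non-negative per-sample losses (cross-entropy, for example), adding instances can never lower the total, so in practice the constraint holds with equality.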
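To make the selection in Eq. (1) concrete, here is a minimal sketch in plain Python. The function name `select_small_loss`, the parameter `keep_ratio` (standing in for $R(T)$) and the toy loss values are my own illustrative assumptions, not taken from the paper's code. Truncating with `int` is also just one possible way to turn the ratio into a count.

```python
def select_small_loss(losses, keep_ratio):
    """Return the indices of the keep_ratio fraction of samples with the smallest loss.

    losses     -- per-sample losses of one network on the current mini-batch
    keep_ratio -- stands in for R(T); a value in [0, 1] (hypothetical name)
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    # Number of instances to keep (truncation is an assumption of this sketch).
    num_keep = int(keep_ratio * len(losses))
    # Sort sample indices by ascending loss and keep the first num_keep.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return order[:num_keep]


# Toy mini-batch with made-up per-sample losses.
batch_losses = [0.10, 2.30, 0.05, 1.70, 0.40, 3.10]
selected = select_small_loss(batch_losses, keep_ratio=0.5)
print(selected)
```

With half of the batch kept, the three samples with the smallest losses are selected, and the large-loss samples (the ones most likely to be mislabeled) are left out.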

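Honestly, there is no mathematical proof in this part, so it is fine to just skim it; the justification feels a bit forced. Still, it is a good lesson in how to argue a point persuasively without a proof.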
Q1: Why can sampling small-loss instances based on dynamic $R(T)$ help us find clean instances?

The paper links small loss to clean instances. Note that this link is not derived theoretically in the paper.

Intuitively, when labels are correct, small-loss instances are more likely to be the ones which are correctly labeled. Thus, if we train our classifier only using small-loss instances in each mini-batch data, it should be resistant to noisy labels.

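The core idea the paper builds on is the "memorization" effect of deep networks: when trained on a dataset with noisy labels, a deep network fits the cleanly labeled data first in the early epochs (noisy labels are harder to fit and only get memorized by brute force later in training). Early on, the network can therefore use the loss value to filter out noisy instances by selecting small-loss instances. As training goes on, the network would eventually overfit the noisy labels; this is handled by gradually decreasing R(T)%. At the start, while the network is still fitting mostly clean data, R(T)% can be large so that more instances are kept. As the network begins to fit noisy labels, R(T)% shrinks, so the clean instances are kept and the noisy ones are dropped before the network memorizes them.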
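A decreasing R(T) can be sketched as follows. The notes above only say that R(T) shrinks over time, so the concrete shape here is an assumption: a linear decay from 1 down to 1 minus an estimated noise rate over a fixed number of warm-up epochs, then constant. The names `keep_ratio`, `noise_rate` and `warmup_epochs` are hypothetical.

```python
def keep_ratio(epoch, noise_rate, warmup_epochs=10):
    """Fraction of each mini-batch to keep at a given epoch.

    Assumed schedule (not taken from the notes above): drop nothing at
    epoch 0, drop linearly more until warmup_epochs, then keep dropping
    a constant fraction equal to the estimated noise rate.
    """
    dropped = min(epoch / warmup_epochs * noise_rate, noise_rate)
    return 1.0 - dropped


# Example: 20% estimated label noise, 10 warm-up epochs.
schedule = [round(keep_ratio(t, noise_rate=0.2), 2) for t in (0, 5, 10, 50)]
print(schedule)
```

The design choice matches the argument in the text: keep everything while the network is still fitting clean data, then settle at a ratio that discards roughly as many instances as are believed to be noisy.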

Q2: Why do we need two networks and cross-update the parameters?
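The paper uses "peer review" as a supporting example. When students check their own exam papers, they struggle to find any mistakes because they are biased toward their own answers. Fortunately, they can ask classmates to review their papers, which makes it much easier to spot potential mistakes. In short, because the error of one network is not fed directly back into itself, Co-teaching can be expected to cope with heavier noise than a self-evolving network.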
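The cross-update can be illustrated with a small sketch of the exchange step alone: each network ranks the mini-batch by its own loss, and the peer trains on that selection. The function names and the loss values are hypothetical, and the actual gradient update is left out, so this is only the "who teaches whom" part under those assumptions.

```python
def small_loss_indices(losses, keep_ratio):
    """Indices of the keep_ratio fraction of samples with the smallest loss."""
    num_keep = int(keep_ratio * len(losses))
    return sorted(range(len(losses)), key=lambda i: losses[i])[:num_keep]


def co_teaching_exchange(losses_f, losses_g, keep_ratio):
    """Return (indices used to update f, indices used to update g).

    Each network selects its own small-loss samples, but the selection
    is handed to the peer network for the update.
    """
    picked_by_f = small_loss_indices(losses_f, keep_ratio)
    picked_by_g = small_loss_indices(losses_g, keep_ratio)
    # Cross-update: f learns from g's selection and g from f's.
    return picked_by_g, picked_by_f


# Made-up per-sample losses of two networks on the same mini-batch of 4 samples.
losses_f = [0.2, 1.9, 0.1, 0.3]
losses_g = [0.4, 0.2, 0.1, 2.5]
update_f_on, update_g_on = co_teaching_exchange(losses_f, losses_g, keep_ratio=0.5)
print(update_f_on, update_g_on)
```

Here network f trusts samples 2 and 0, while network g trusts samples 2 and 1. Sample 1, which f considers suspicious, still reaches f through g's selection, so the two networks can correct each other's selection bias instead of reinforcing their own.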

Relations to Co-training
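Intuitively, different classifiers can produce different decision boundaries and therefore have different learning abilities. When training on noisy labels, we likewise expect them to differ in their ability to filter out label noise.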

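Although Co-teaching is motivated by Co-training, the only similarity is that two classifiers are trained. There are fundamental differences between them. (i) Co-training needs two views (two independent sets of features), while Co-teaching needs only a single view. (ii) Co-training does not exploit the memorization of deep neural networks, while Co-teaching does. (iii) Co-training is designed for semi-supervised learning (SSL), while Co-teaching is designed for learning with noisy labels (LNL); since LNL is not a special case of SSL, Co-training cannot simply be carried over from one problem setting to the other.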

3. Summary

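This paper builds on the "memorization" effect of deep networks. The algorithm is easy to understand but lacks a theoretical proof. Co-teaching closely resembles Co-training from semi-supervised learning, but the two target different scenarios: Co-teaching is for learning from noisy labels, while Co-training is for exploiting unlabeled data.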