Semi-supervised learning paper notes: Billion-scale semi-supervised learning for image classification


'Themis'



Article link: https://blog.csdn.net/s000da/article/details/109232063


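2019, Facebook

Semi-supervised learning

Uses unlabeled data to improve the performance of existing models

The paper adopts a teacher/student learning scheme and uses billion-scale unlabeled data together with a comparatively small amount of labeled data to improve existing models on image classification.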

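In 2018, Facebook had also published work in the weakly supervised direction, "Exploring the Limits of Weakly Supervised Pretraining", which used billion-scale weakly supervised data (Instagram images carrying hashtag labels).
This method is inspired by several directions: self-training, distillation, or boosting.

The setting is a large amount of unlabeled data plus a relatively small amount of labeled data.

(billions of unlabeled images along with a relatively smaller set of task-specific labeled data)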

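Train a teacher model on the labeled dataset A.
Use the teacher to assign pseudo-labels to the unlabeled data, then select data for each class (rank the images by pseudo-label prediction score and keep the top-K images) to build a new training set B.
Train a student model on dataset B as pre-training; the student is smaller than the teacher.
Fine-tune the student model on the labeled dataset A.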
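
The per-class selection in the second step can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `select_top_k_per_class`, the toy scores, and the choice to rank every image for every class are assumptions made here for illustration. In this simplified form an image may be selected for more than one class.

```python
from typing import Dict, List, Sequence


def select_top_k_per_class(
    scores: Sequence[Sequence[float]], k: int
) -> Dict[int, List[int]]:
    """For each class, return the indices of the k highest-scoring images.

    scores[i][c] is a hypothetical teacher score of image i for class c.
    """
    if not scores:
        return {}
    num_classes = len(scores[0])
    selected: Dict[int, List[int]] = {}
    for c in range(num_classes):
        # Rank all images by the teacher's score for class c, highest first.
        ranked = sorted(
            range(len(scores)), key=lambda i: scores[i][c], reverse=True
        )
        selected[c] = ranked[:k]
    return selected


# Toy teacher outputs for 5 unlabeled images and 2 classes (made-up values).
teacher_scores = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.55, 0.45],
]

# "Dataset B": class index -> indices of the selected unlabeled images.
dataset_b = select_top_k_per_class(teacher_scores, k=2)
print(dataset_b)
```

With real data, `scores` would be the teacher's softmax outputs over the unlabeled pool, and the selected indices with their class labels would form the student's pre-training set.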

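Table 1 on the second page of the paper lists six recommendations from the authors for large-scale semi-supervised learning. It distills the results of many experiments in the paper and is well worth careful reading:

My detailed reading is as follows: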

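It avoids the long-tail distribution problem. Because the method explicitly selects from the pseudo-labeled unlabeled data, the amount of data and its class distribution can be controlled (selecting same number of images per label), which avoids imbalance between classes.
It addresses the noise problem of weak supervision. The paper mentions a "significant amount of inherent noise in the labels due to non-visual, missing and irrelevant tags which can significantly hamper the learning of models".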
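
The long-tail point can be made concrete with a small toy comparison. This is a sketch under assumptions, not an experiment from the paper: the scores are invented, and "naive pseudo-labeling" here means giving each image its argmax class. It contrasts the class counts from that naive labeling with the result of keeping the same number of top-ranked images per class.

```python
from collections import Counter

# Hypothetical teacher scores for 6 unlabeled images over 2 classes.
# Class 0 dominates, mimicking a long-tailed unlabeled pool.
scores = [
    [0.95, 0.05],
    [0.90, 0.10],
    [0.80, 0.20],
    [0.70, 0.30],
    [0.60, 0.40],
    [0.20, 0.80],
]
num_classes = 2

# Naive pseudo-labeling: every image takes its argmax class.
argmax_counts = Counter(
    max(range(num_classes), key=lambda c: row[c]) for row in scores
)

# Balanced selection: keep the same number (k) of top-ranked images per class.
k = 2
balanced = {
    c: sorted(range(len(scores)), key=lambda i: scores[i][c], reverse=True)[:k]
    for c in range(num_classes)
}

print(argmax_counts)  # class sizes under naive labeling
print(balanced)       # selected image indices per class
```

In this toy pool the naive labeling is heavily skewed toward class 0, while the per-class ranking yields equally sized classes. The cost is that the rarer class is filled with lower-confidence images, such as image 4 here.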