[Paper Notes] Unsupervised Feature Learning via Non-Parametric Instance Discrimination

Latest recommended article published on 2024-05-28 11:00:45

天在那边

Categories: Deep Learning, Machine Learning

Article link: https://blog.csdn.net/weipf8/article/details/105756285

Unsupervised Feature Learning via Non-Parametric Instance Discrimination

Link: https://arxiv.org/abs/1805.01978v1
Explanation link: https://blog.csdn.net/qq_16936725/article/details/51147767

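This paper proposes an unsupervised algorithm for learning visual features that capture the similarities and differences between individual instances. A convolutional network first maps each image to a feature representation, and a non-parametric softmax classifier then treats every image as its own class, so the learned feature has to tell instances apart.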

Method

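In a conventional convolutional classifier, instance-level classification needs one weight vector $\mathbf{w}_{j}$ per sample, and the probability that a feature $\mathbf{v}$ is recognized as the $i$-th instance is
$P(i | \mathbf{v})=\frac{\exp \left(\mathbf{w}_{i}^{T} \mathbf{v}\right)}{\sum_{j=1}^{n} \exp \left(\mathbf{w}_{j}^{T} \mathbf{v}\right)},$
where $\mathbf{v}$ is the feature produced by the convolutional network, $i$ is the predicted (instance-level) class, and $n$ is the number of samples. The weights $\mathbf{w}$ have to be learned, but each $\mathbf{w}_{j}$ acts as a class prototype, which prevents direct comparison between instances. The paper treats every sample as its own class and replaces $\mathbf{w}_{j}^{T} \mathbf{v}$ with $\mathbf{v}_{j}^{T} \mathbf{v}$, which gives discrimination at the instance level:
$P(i | \mathbf{v})=\frac{\exp \left(\mathbf{v}_{i}^{T} \mathbf{v} / \tau\right)}{\sum_{j=1}^{n} \exp \left(\mathbf{v}_{j}^{T} \mathbf{v} / \tau\right)},$
where $\tau$ is a temperature hyperparameter that controls how concentrated the class distribution is. This is a non-parametric softmax classifier: it no longer needs the per-class weight vectors, which greatly reduces the number of parameters.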
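
To make the non-parametric softmax concrete, here is a minimal pure-Python sketch. It assumes the stored features $\mathbf{v}_{j}$ are L2-normalized and kept in a plain list called `memory`; the helper names and the default `tau=0.07` are illustrative choices for this sketch, not values taken from the text above.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (assumption: features live on the unit sphere).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nonparam_softmax(v, memory, tau=0.07):
    """Return P(i | v) for every stored instance feature in `memory`."""
    logits = [dot(v_j, v) / tau for v_j in memory]
    shift = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(logit - shift) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three stored instance features and one query feature (toy values).
memory = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = l2_normalize([0.9, 0.1])
probs = nonparam_softmax(query, memory)
```

With a small $\tau$ the distribution is sharply peaked on the stored feature closest to the query; a larger $\tau$ flattens it.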

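With many samples, however, the computation becomes very expensive. To avoid evaluating the full softmax $P(i | \mathbf{v})$ over all $n$ instances, the paper uses noise-contrastive estimation (NCE) [9] to fit the parameters:
$h(i, \mathbf{v}):=P(D=1 | i, \mathbf{v})=\frac{P(i | \mathbf{v})}{P(i | \mathbf{v})+m P_{n}(i)}.$
This is the probability that sample $i$ comes from the data distribution ($D = 1$); $D = 0$ means it comes from the noise distribution $P_{n}$, whose samples are assumed to be $m$ times as frequent as data samples. The training objective is
$\begin{aligned} J_{N C E}(\boldsymbol{\theta}) &=-E_{P_{d}}[\log h(i, \mathbf{v})] \\ &-m \cdot E_{P_{n}}\left[\log \left(1-h\left(i, \mathbf{v}^{\prime}\right)\right)\right],\end{aligned}$
where $P_{d}$ is the data distribution and $\mathbf{v}^{\prime}$ is the feature of a sample drawn from the noise distribution. Minimizing this objective yields the network parameters $\boldsymbol{\theta}$.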
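
The NCE objective can be sketched for a single training sample as follows. The sketch assumes a uniform noise distribution $P_{n}(i) = 1/n$ and takes the model probabilities as plain numbers; `nce_posterior` and `nce_loss` are hypothetical helper names used only for illustration.

```python
import math

def nce_posterior(p_model, m, n):
    """h(i, v) = P(i|v) / (P(i|v) + m * P_n(i)), assuming uniform noise P_n = 1/n."""
    return p_model / (p_model + m / n)

def nce_loss(p_pos, p_negs, n):
    """Single-sample estimate of J_NCE.

    p_pos  : P(i | v) for the data sample
    p_negs : P(i | v') for each of the m noise samples
    n      : total number of instances
    """
    m = len(p_negs)
    loss = -math.log(nce_posterior(p_pos, m, n))
    # Summing over the m noise samples estimates m * E_{P_n}[log(1 - h)].
    loss -= sum(math.log(1.0 - nce_posterior(p, m, n)) for p in p_negs)
    return loss

n = 1000
confident = nce_loss(p_pos=0.5, p_negs=[1e-4] * 4, n=n)
uncertain = nce_loss(p_pos=0.001, p_negs=[1e-4] * 4, n=n)
```

The loss is lower when the model assigns a high probability to the true instance (`confident`) than when it does not (`uncertain`).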

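Even with NCE, the denominator $\sum_{j=1}^{n} \exp \left(\mathbf{v}_{j}^{T} \mathbf{v} / \tau\right)$ still appears in the forward pass, and computing it directly is just as expensive. For the feature $\mathbf{f}_{i}$ of sample $i$, the paper therefore estimates this term with a Monte Carlo approximation:
$Z \simeq Z_{i} \simeq n E_{j}\left[\exp \left(\mathbf{v}_{j}^{T} \mathbf{f}_{i} / \tau\right)\right]=\frac{n}{m} \sum_{k=1}^{m} \exp \left(\mathbf{v}_{j_{k}}^{T} \mathbf{f}_{i} / \tau\right),$
where $\{j_{k}\}$ is a random subset of $m$ indices.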
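
The Monte Carlo estimate can be checked numerically with a small sketch. Everything below is illustrative: the features are random unit vectors, indices are sampled uniformly with replacement, and `tau=0.5` is chosen only to keep the variance of this toy example low.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_unit_vector(dim, rng):
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def exact_z(f_i, memory, tau):
    # Full normalizing term: sum over all n stored features.
    return sum(math.exp(dot(v_j, f_i) / tau) for v_j in memory)

def estimate_z(f_i, memory, tau, m, rng):
    # Monte Carlo estimate: (n / m) * sum over m uniformly sampled features.
    n = len(memory)
    sampled = [memory[rng.randrange(n)] for _ in range(m)]
    return n / m * sum(math.exp(dot(v_j, f_i) / tau) for v_j in sampled)

rng = random.Random(0)
memory = [random_unit_vector(8, rng) for _ in range(2000)]
f_i = memory[0]
z_exact = exact_z(f_i, memory, tau=0.5)
z_estimate = estimate_z(f_i, memory, tau=0.5, m=500, rng=rng)
relative_error = abs(z_estimate - z_exact) / z_exact
```

The estimate touches only `m` of the `n` stored features, at the price of some sampling noise in `relative_error`.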
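Because every sample is its own class, training is very unstable and fluctuates heavily. To address this, the loss for sample $i$ gets an extra penalty on the change of its feature $\mathbf{v}_{i}$ between iterations $t-1$ and $t$, which stabilizes training:
$-\log h\left(i, \mathbf{v}_{i}^{(t-1)}\right)+\lambda\left\|\mathbf{v}_{i}^{(t)}-\mathbf{v}_{i}^{(t-1)}\right\|_{2}^{2}$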
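
As a minimal sketch of this regularized loss, the function below adds the squared-distance penalty to the NCE term for one sample. The function name and the value of `lam` are hypothetical, and `h_pos` stands for the already computed $h(i, \mathbf{v}_{i}^{(t-1)})$.

```python
import math

def proximal_loss(h_pos, v_new, v_old, lam):
    """-log h(i, v_old) + lam * ||v_new - v_old||_2^2 for a single sample."""
    penalty = sum((a - b) ** 2 for a, b in zip(v_new, v_old))
    return -math.log(h_pos) + lam * penalty

# No feature change: only the NCE term remains.
unchanged = proximal_loss(0.5, v_new=[1.0, 0.0], v_old=[1.0, 0.0], lam=10.0)
# A large jump of the feature between iterations is penalized.
jumped = proximal_loss(0.5, v_new=[0.0, 1.0], v_old=[1.0, 0.0], lam=10.0)
```

The larger $\lambda$ is, the more strongly the feature of a sample is held close to its value from the previous iteration.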

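Experiments

The paper reports four groups of experiments. The first compares the non-parametric and the parametric softmax on CIFAR-10. The results show that the proposed non-parametric model clearly outperforms the parametric softmax.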

The second compares the method with other unsupervised learning algorithms on ImageNet, including self-supervised learning [2, 47, 27, 48], adversarial learning [4], and Exemplar CNN [3]. The split-brain autoencoder [48] serves as the baseline.

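To study whether the trained network also benefits other tasks and transfer learning, the third experiment is a semi-supervised learning comparison against (1) Scratch, i.e. fully supervised training on the small labeled subsets, (2) Split-brain [48] for pre-training, and (3) Colorization [19] for pre-training. The results show that the proposed method is far better than the compared methods.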

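To further evaluate generalization, the fourth experiment transfers the model to object detection on PASCAL VOC 2007 [6]. The detectors are Fast R-CNN [7] with AlexNet and VGG16 architectures, and Faster R-CNN [32] with ResNet-50. With ResNet-50, the proposed method leads the compared methods by a large margin.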

Possible further improvements

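The modification above is fairly straightforward, and the optimization is maximum log-likelihood. I suspect this may lead to an undesirable outcome: the features $\mathbf{v}$ may tend toward a uniform distribution on the sphere, i.e. a distribution of maximal entropy. In that case, does the distance between two features still accurately reflect how different the two instances are? The concern matters because the final classification relies on the $k$-nearest-neighbor algorithm. One direction for improvement is the feature representation itself; another is to construct a more suitable distance metric.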
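
To show why the distance on the sphere matters for the final prediction, here is a small sketch of a $k$-nearest-neighbor classifier over unit-length features. It assumes neighbors are ranked by cosine similarity and that each neighbor votes with weight $\exp(s / \tau)$; this weighting, the function name, and the toy data are assumptions of the sketch and are not stated in the text above.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def knn_predict(query, memory, labels, k=3, tau=0.07):
    """Weighted k-NN vote over unit-length features (cosine similarity)."""
    sims = [(dot(query, v), y) for v, y in zip(memory, labels)]
    top = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    votes = {}
    for sim, label in top:
        # Assumption: a neighbor's vote grows exponentially with its similarity.
        votes[label] = votes.get(label, 0.0) + math.exp(sim / tau)
    return max(votes, key=votes.get)

# Toy memory of unit vectors with labels.
memory = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8]]
labels = ["cat", "cat", "dog", "dog"]
near_cat = knn_predict([0.6, 0.8], memory, labels)
near_dog = knn_predict([0.0, 1.0], memory, labels)
```

Since the prediction depends only on similarities between features, a distance that does not reflect instance differences well would directly hurt this classifier.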

References

[2] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV.
[3] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. arXiv preprint arXiv:1708.07860, 2017.
[4] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[5] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
[7] R. Girshick. Fast R-CNN. In ICCV, 2015.
[9] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
[15] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR. IEEE, 2012.
[19] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. CVPR, 2017.
[22] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[27] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV. Springer, 2016.
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[33] S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood component analysis. Adv. Neural Inf. Process. Syst. (NIPS), 17, 2004.
[35] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR.
[43] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: $l_2$ hypersphere embedding for face verification. arXiv preprint arXiv:1704.06369, 2017.
[47] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016.
[48] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. CVPR, 2017.