MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

64318@461

Category: self-supervised feature extraction. Tags: deep learning

Article link: https://blog.csdn.net/weixin_56836871/article/details/122523742

Motivation:

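Unsupervised representation learning is highly successful in natural language processing, but supervised pre-training is still dominant in computer vision. The reason may stem from differences in their respective signal spaces. Language tasks have discrete signal spaces. In computer vision, in contrast, the raw signal is in a continuous, high-dimensional space and is not structured.
[CC] Unsupervised learning has been a major success in NLP but has made little headway in CV. The authors suggest the cause may be the large difference between the two fields' signal spaces: NLP works with discrete, low-dimensional, structured signals, while CV works with continuous, high-dimensional, unstructured signals.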

Significance:

These results show that MoCo largely closes the gap between unsupervised and supervised representation learning in many computer vision tasks.
[CC] The method in this paper largely closes the gap between supervised and unsupervised learning in CV.

Background: contrastive learning as dictionary look-up
Though driven by various motivations, these methods can be thought of as building dynamic dictionaries. The "keys" (tokens) in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network. Unsupervised learning trains encoders to perform dictionary look-up: an encoded "query" should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss.
[CC] The authors see contrastive learning as essentially "build a dictionary, then look things up in it". A key, analogous to a token in NLP, is produced by an encoder (a learned neural network) from an image or a patch of an image. Suppose the encoder has already been trained, and we now do the look-up: given an encoded query (the image to be matched), the query should be closer to its positive sample and farther from the negative samples (much like triplet loss). The whole process amounts to learning an encoder that minimizes this contrastive loss.
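
To make the look-up idea concrete, here is a minimal sketch of one common contrastive loss, written in the InfoNCE style: a softmax cross-entropy over the dictionary in which the matching key is the correct class. The passage above does not give a formula, so everything specific here is an assumption for illustration: the function names (`dot`, `l2_normalize`, `contrastive_loss`), the use of the dot product as the similarity measure, the temperature value of 0.07, and the toy vectors, which stand in for encoder outputs.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(query, keys, positive_index, temperature=0.07):
    """InfoNCE-style loss: cross-entropy of selecting the positive key.

    Assumptions of this sketch: similarity is the dot product and the
    logits are divided by a temperature hyperparameter.
    """
    logits = [dot(query, k) / temperature for k in keys]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    # Negative log of the softmax probability assigned to the positive key.
    return log_denominator - logits[positive_index]

# Toy "dictionary": hand-written vectors stand in for encoder outputs.
query = l2_normalize([1.0, 0.1, 0.0])
keys = [l2_normalize(k) for k in ([0.9, 0.2, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]

# The loss is small when the positive is the key most similar to the query,
# and large when a dissimilar key is labeled as the positive.
loss_matching = contrastive_loss(query, keys, positive_index=0)
loss_mismatched = contrastive_loss(query, keys, positive_index=1)
print(loss_matching < loss_mismatched)
```

Unlike a triplet loss, which compares one positive against one negative at a time, this form contrasts the positive against every other key in the dictionary at once, which is why the dictionary framing fits it naturally.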

Contrastive learning [29], and its recent developments, can be thought of as training an encoder for a dictionary look-up task. Consider an encoded query q and a set of encoded samples {k0, k1, k2, …} that are the keys of a dictionary. Assume that there is a single key (denot
