HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning

First, a quote I came across today: "Most people's understanding of hardship is far too shallow. Being poor is just being poor; poverty is not hardship, and hardship is not the ability to endure poverty. The essence of hardship is the ability to stay focused on one thing for a long time, together with everything given up in the process: entertainment, pointless socializing, and meaningless consumption, plus the loneliness and lack of understanding endured along the way. At its core it is self-discipline, self-control, persistence, and deep thinking."

Abstract

This paper proposes HyperTransformer, a Transformer-based approach to supervised and semi-supervised few-shot learning. The idea is to use a high-capacity Transformer model to generate a small CNN model tailored to each specific task; the authors argue that this effectively decouples the complexity of the large task space from the complexity of any individual task.

In this paper we propose a new few-shot learning approach that allows us to decouple the complexity of the task space from the complexity of individual tasks. The main idea is to use the Transformer model (Vaswani et al., 2017) that given a few-shot task episode, generates an entire inference model by producing all model weights in a single pass. This allows us to encode the intricacies of the available training data inside the Transformer model, while producing specialized tiny models for a given individual task. Reducing the size of the generated model and moving the computational overhead to the Transformer-based weight generator, we can lower the cost of the inference on new images. This can reduce the overall computation cost in cases where the tasks change infrequently and hence the weight generator is only used sporadically.
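
To make this concrete, below is a minimal sketch of how such a weight generator could look. This is my own illustration, not the authors' exact architecture: for brevity it only generates the final logits layer from the support set (the paper also generates the intermediate layers), and all names here (WeightGenerator, feat_dim, etc.) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightGenerator(nn.Module):
    """Sketch: a Transformer reads the support set and emits the weights
    of a small task-specific logits layer in a single forward pass."""

    def __init__(self, n_way, feat_dim=64, d_model=128):
        super().__init__()
        # shared image embedding (stands in for the small CNN backbone)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.label_emb = nn.Embedding(n_way, d_model - feat_dim)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.to_weights = nn.Linear(d_model, feat_dim + 1)  # weight row + bias
        self.n_way = n_way

    def forward(self, support_x, support_y):
        feats = self.encoder(support_x)                          # [S, feat_dim]
        tokens = torch.cat([feats, self.label_emb(support_y)], dim=-1)
        ctx = self.transformer(tokens.unsqueeze(0)).squeeze(0)   # [S, d_model]
        # pool the support tokens of each class into one row of the logits layer
        rows = torch.stack([ctx[support_y == c].mean(0) for c in range(self.n_way)])
        wb = self.to_weights(rows)                               # [n_way, feat_dim + 1]
        return wb[:, :-1], wb[:, -1]                             # generated W and b

    def classify(self, query_x, W, b):
        # inference on queries uses only the small embedding + generated weights
        return F.linear(self.encoder(query_x), W, b)
```

With such a generator gen, an episode would be handled as W, b = gen(support_x, support_y) followed by logits = gen.classify(query_x, W, b), so new queries only pay for the small embedding plus one generated linear layer.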

Related work

The related work falls into two main families: metric-based learning and optimization-based methods.

Metric-based learning

Methods in this family rely on a metric to compute distances in a learned embedding space. Their weakness is that the capacity of the architecture itself becomes the bottleneck, because the goal is to build a single universal embedding function shared across all tasks.

One broad family of few-shot image classification methods frequently referred to as metric-based learning, relies on pretraining an embedding and then using some distance in the embedding space to label query samples based on their closeness to known labeled support samples. These methods proved effective on numerous benchmarks, however the capabilities of the solver are limited by the capacity of the architecture itself, as these methods try to build a universal embedding function.
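
As a concrete instance of this family, here is a minimal prototypical-network-style sketch (my own simplification, not from the paper): a fixed embedding function, one averaged prototype per class from the support set, and nearest-prototype labeling of queries; the single function embed carries all the knowledge.

```python
import torch

def prototype_classify(embed, support_x, support_y, query_x, n_way):
    """Nearest-prototype labeling: the fixed embedding does all the work."""
    s = embed(support_x)                              # [S, D] support embeddings
    q = embed(query_x)                                # [Q, D] query embeddings
    protos = torch.stack([s[support_y == c].mean(0)   # one prototype per class
                          for c in range(n_way)])     # [n_way, D]
    dists = torch.cdist(q, protos)                    # Euclidean distances [Q, n_way]
    return dists.argmin(dim=1)                        # predicted class per query
```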

Optimization-based methods

The other family is optimization-based methods, for example MAML (this Zhihu post is a good introduction: https://zhuanlan.zhihu.com/p/57864886). The authors argue that the problem with these methods is that all the knowledge extracted during training still has to fit into the same number of parameters as the pretrained model itself, and this constraint becomes more severe the smaller the target model is.

On the other hand, optimization-based methods such as seminal MAML algorithm (Finn et al., 2017) can fine-tune the embedding by performing additional SGD updates on all parameters φ of the model producing it. This partially addresses the constraints of metric-based methods by learning a new embedding for each new task. However, in many of these methods, all the knowledge extracted during training on different tasks and describing the solver still has to “fit” into the same number of parameters as the model itself. Such limitation becomes more severe as the target models get smaller, while the richness of the task set increases.
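
For comparison, here is a minimal MAML-style inner loop (my own sketch, simplified to first-order updates and assuming PyTorch 2.x for torch.func.functional_call): every parameter of the model is copied and fine-tuned on the support set, which is exactly why all cross-task knowledge has to fit into the same parameter count as the model itself.

```python
import torch

def inner_loop_adapt(model, loss_fn, support_x, support_y, lr=0.01, steps=5):
    """First-order MAML-style adaptation: fine-tune a copy of *all* weights."""
    names = [n for n, _ in model.named_parameters()]
    params = [p.clone().detach().requires_grad_(True) for p in model.parameters()]
    for _ in range(steps):
        logits = torch.func.functional_call(model, dict(zip(names, params)), (support_x,))
        loss = loss_fn(logits, support_y)
        grads = torch.autograd.grad(loss, params)
        # first-order update: detach so each step starts from fresh leaf tensors
        params = [(p - lr * g).detach().requires_grad_(True)
                  for p, g in zip(params, grads)]
    return dict(zip(names, params))   # task-adapted copy of every model weight
```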

Other notes

The paper also discusses why a Transformer (self-attention) is the right choice of base for the weight generator.

The choice of self-attention mechanism for the weight generator is not random. One reason behind this choice is that the output produced by generator with the basic self-attention is by design invariant to input permutations, i.e., permutations of samples in the training dataset. This also makes it suitable for processing unbalanced batches and batches with a variable number of samples (see Sec. 5.3). Now we show that the calculation performed by a self-attention model with properly chosen parameters can mimic basic few-shot learning algorithms further motivating its utility.
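
This invariance claim is easy to verify numerically. The toy check below (my own, using PyTorch's nn.MultiheadAttention with no positional encodings) shows that self-attention is permutation-equivariant over the support tokens, so any order-insensitive readout such as a mean over tokens is invariant to reshuffling the episode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(1, 7, 16)        # 7 support "tokens" for one episode
perm = torch.randperm(7)

out, _ = attn(x, x, x)                                   # self-attention
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])   # permuted inputs

# permuting the inputs permutes the outputs the same way (equivariance),
# so an order-insensitive readout such as the mean is fully invariant
print(torch.allclose(out[:, perm], out_perm, atol=1e-6))                 # True
print(torch.allclose(out.mean(dim=1), out_perm.mean(dim=1), atol=1e-6))  # True
```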

A limitation of the method is that it does not handle larger generated models particularly well.

While this additional capacity proves to be very advantageous for smaller generated models, larger CNNs can accommodate sufficiently complex representations and our approach does not provide a clear advantage compared to other methods in this case.

The method extends directly to the semi-supervised setting with unlabeled samples.

I am not convinced this is really an advantage: MAML-style methods could presumably also be extended to the unlabeled case by adding an unlabeled token.

Hmm, that said, even if such an advantage may already exist implicitly in MAML-style methods, it is still worth spelling out, since readers otherwise have no way of knowing whether this paper has it.

We additionally can extend our method to support unlabeled samples by appending a special input token that encodes unknown classes to all unlabeled examples.
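
A minimal sketch of what this could look like on the input side (my interpretation; the exact encoding in the paper may differ): reserve one extra label-embedding index as the special "unknown class" token and assign it to every unlabeled support sample before feeding the tokens to the Transformer.

```python
import torch
import torch.nn as nn

n_way, feat_dim, label_dim = 5, 64, 64
label_emb = nn.Embedding(n_way + 1, label_dim)   # extra index n_way = "unlabeled"

def make_tokens(feats, labels):
    # labels: class id in [0, n_way) for labeled samples, -1 for unlabeled ones
    ids = torch.where(labels >= 0, labels, torch.full_like(labels, n_way))
    return torch.cat([feats, label_emb(ids)], dim=-1)    # Transformer input tokens

feats = torch.randn(8, feat_dim)                          # 8 support samples
labels = torch.tensor([0, 1, 2, 3, 4, -1, -1, -1])        # last 3 are unlabeled
tokens = make_tokens(feats, labels)                       # [8, feat_dim + label_dim]
```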

In the analysis experiments, the authors compare generating all of the CNN weights with generating only part of them.

We also explore the capability of our approach to generate all weights of the CNN model, adjusting both the logits layer and all intermediate layers producing the sample embedding. We show that by generating all layers we can improve both the training and test accuracies of CNN models below a certain size. Above this model size threshold, however, generation of the logits layer alone on top of an episode-agnostic embedding appears to be sufficient for reaching peak performance (see Figure 3). This threshold is expected to depend on the variability and the complexity of the training tasks.
