Induction Networks for Few-Shot Text Classification

一只小菜狗:D

Last modified: 2022-02-14 16:47:26

First published: 2021-11-28 21:50:42

Article link: https://blog.csdn.net/init__/article/details/121585118

Contents

reference
Problem Definition
Model
LOSS FUNCTION

reference

A one-article introduction to meta-learning (with code)
Induction Networks for Few-Shot Text Classification
meta learning lesson
code for induction network

Problem Definition

Few-shot classification is a task in which a classifier must be adapted to accommodate new classes not seen in training, given only a few examples of each of these new classes. We have a large labeled training set with a set of classes $C_{train}$. However, after training, our ultimate goal is to produce classifiers on the testing set with a disjoint set of new classes $C_{test}$, for which only a small labeled support set will be available. If the support set contains K labeled examples for each of the C unique classes, the target few-shot problem is called a C-way K-shot problem. Usually, K is too small to train a supervised classification model. Therefore, we aim to perform meta-learning on the training set, and extract transferable knowledge that will allow us to deliver better few-shot learning on the support set and thus classify the test set more accurately.

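An ordinary ML classifier cannot recognize new classes, because the training set contains no data with those labels. A few-shot classification task instead learns a function that can adapt to new classes. In a C-way K-shot text classification task (the data contains C classes with K texts each), K is usually too small to train a supervised model on. The paper therefore applies meta-learning on the training set in order to improve performance on the new classes.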

The training procedure is as follows:
[Figure: training procedure]
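
As a rough illustration of how one training episode could be drawn, here is a minimal sketch. It assumes the usual episode-based setup: pick C classes at random, take K samples per class as the support set, and take some of the remaining samples of those classes as the query set. The function name `sample_episode`, the `n_query` parameter, and the toy data are hypothetical and do not come from the paper's code.

```python
import random
from collections import defaultdict

def sample_episode(dataset, c_way, k_shot, n_query, rng=None):
    """Sample one C-way K-shot episode from a list of (text, label) pairs."""
    rng = rng or random.Random()
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append(text)
    # Only classes with enough samples for both support and query are eligible.
    eligible = [c for c, xs in by_label.items() if len(xs) >= k_shot + n_query]
    classes = rng.sample(eligible, c_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_label[c], k_shot + n_query)
        support += [(x, c) for x in picked[:k_shot]]   # K labeled samples per class
        query += [(x, c) for x in picked[k_shot:]]     # held-out samples of the same classes
    return support, query

# Toy data: 5 classes with 10 texts each (purely illustrative).
toy = [(f"text {i} of class {c}", c) for c in "ABCDE" for i in range(10)]
support, query = sample_episode(toy, c_way=3, k_shot=2, n_query=4, rng=random.Random(0))
print(len(support), len(query))  # 6 12
```

In each episode the model would see the support set, and its parameters would be updated from the loss on the query set.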

Model

[Figure: model architecture]
The model consists of three modules: Encoder, Induction, and Relation.

Encoder module:
The Encoder module produces a semantic representation of each sample. A typical CNN, LSTM, or Transformer can be used for this.

The paper's approach is:
Given an input text $x=(w_1,w_2,\cdots,w_T)$, where $w_i$ is the embedding of the i-th word, we process it as follows:
step 1:
[Image: equation]
step 2:

step 3:

step 4:
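
As a sketch of what these steps compute, I assume they follow the paper's encoder as I recall it: a bidirectional LSTM produces hidden states $h_t \in \mathbb{R}^{2u}$, stacked as $H=(h_1,\cdots,h_T)$; self-attention gives weights $a = \mathrm{softmax}(W_{a2}\tanh(W_{a1}H^T))$; and the text vector is the weighted sum $e = \sum_{t=1}^{T} a_t h_t$. The symbols $W_{a1}$, $W_{a2}$, and $d_a$ are taken from the paper rather than from this post, and the random `H` below is only a stand-in for real BiLSTM outputs.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(H, W_a1, W_a2):
    """H: (T, 2u) hidden states; W_a1: (d_a, 2u); W_a2: (d_a,). Returns (a, e)."""
    scores = W_a2 @ np.tanh(W_a1 @ H.T)  # one score per time step, shape (T,)
    a = softmax(scores)                  # attention weights, sum to 1
    e = a @ H                            # weighted sum of hidden states, shape (2u,)
    return a, e

rng = np.random.default_rng(0)
T, u, d_a = 5, 4, 3                      # illustrative sizes (assumptions)
H = rng.normal(size=(T, 2 * u))          # stand-in for BiLSTM outputs
W_a1 = rng.normal(size=(d_a, 2 * u))
W_a2 = rng.normal(size=d_a)
a, e = attention_pool(H, W_a1, W_a2)
print(a.shape, e.shape)  # (5,) (8,)
```

Whatever the length T of the input, the output `e` has the fixed size 2u, which is what the later modules rely on.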

Induction module:
The Induction module induces class-level features from the semantic representations of the samples in the support set.

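The vectors obtained from the support set via eq4 are denoted $e^s$, and those obtained from the query set via eq4 are denoted $e^q$. The goal of this section is to obtain, through the Induction module, a nonlinear mapping that learns the class vector $c_i$:
[Image: equation]
Next, a nonlinear transformation is applied to $e_{ij}^s$: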

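The paper's reasoning here is:
In order to accept any-way any-shot inputs in our model, a weight-sharable transformation across all sample vectors in the support set is employed. All of the sample vectors in the support set share the same transformation weights $W_s \in \mathbb{R}^{2u \times 2u}$ and bias $b_s$, so that the model is flexible enough to handle the support set at any scale.
Sharing the same transformation weights and bias makes the model flexible enough to handle support sets of any size. A possible reason for this design is that we want the encoding to establish a mapping from sample features to invariant semantic features.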

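[Image: equation]
Here squash is a nonlinear function that keeps a vector's direction unchanged while shrinking its magnitude. For a given vector $x$, the squash function is defined as: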
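
For reference, the squash function as it is commonly defined for capsule networks (which I assume is the one intended here) is

$$\mathrm{squash}(x) = \frac{\|x\|^2}{1+\|x\|^2}\,\frac{x}{\|x\|}$$

A minimal NumPy sketch follows. The small `eps` term is my own addition to avoid dividing by zero and is not part of the formula.

```python
import numpy as np

def squash(x, eps=1e-9):
    """Shrink the norm of x into [0, 1) while keeping its direction."""
    sq_norm = np.sum(x ** 2, axis=-1, keepdims=True)
    # scale factor ||x||^2 / (1 + ||x||^2) times the unit vector x / ||x||
    return (sq_norm / (1.0 + sq_norm)) * x / np.sqrt(sq_norm + eps)

x = np.array([3.0, 4.0])          # norm 5
y = squash(x)
print(np.linalg.norm(y))          # close to 25/26
print(y / np.linalg.norm(y))      # same direction as x / 5
```

Long vectors end up with a norm close to 1 and short vectors with a norm close to 0, so the norm can be read as a kind of confidence.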

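To ensure that the class vector automatically encapsulates the sample feature vectors of its class, the dynamic routing algorithm is applied iteratively:
[Image: Algorithm 2, dynamic routing]
This uses the concept of capsule networks. Below is my understanding of dynamic routing in capsule networks. Corrections are welcome if anything is wrong: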

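An ordinary neuron outputs a scalar, whereas a capsule outputs a vector, so a capsule's output can express more information. Take recognizing human eyes as an example: with ordinary neurons, [almond eyes] and [phoenix eyes] would activate two different neurons, whereas a capsule can represent the eye itself, and [almond eyes] and [phoenix eyes] differ only in the output vector.
The computation is as follows: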

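These $c^1,c^2$ are the coupling coefficients (they play the same role as the coefficients obtained from $b$ in Algorithm 2). So how is $c$ determined?
As the figure above shows, $c$ is somewhat similar to $\alpha$ in attention: if $u^i$ is dissimilar to the final $a^r$, its weight becomes smaller and smaller.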
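
To make the routing behaviour concrete, here is a minimal NumPy sketch of dynamic routing induction for a single class, written from my reading of Algorithm 2. I assume the routing logits $b$ start at zero, the coupling coefficients are $d = \mathrm{softmax}(b)$, the class vector is the squashed weighted sum of the transformed samples, and each logit is increased by the dot product between its sample and the class vector. The function name `induce_class_vector`, the identity `W_s`, and the toy vectors are hypothetical.

```python
import numpy as np

def squash(x, eps=1e-9):
    sq_norm = np.sum(x ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * x / np.sqrt(sq_norm + eps)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def induce_class_vector(E, W_s, b_s, n_iter=3):
    """E: (K, D) support vectors of one class. Returns (class vector, coupling coefficients)."""
    E_hat = squash(E @ W_s.T + b_s)   # shared transformation applied to every sample
    b = np.zeros(E.shape[0])          # routing logits, all samples start equal
    for _ in range(n_iter):
        d = softmax(b)                # coupling coefficients for this iteration
        c = squash(d @ E_hat)         # weighted sum of samples, then squash
        b = b + E_hat @ c             # samples that agree with c gain weight
    return c, d                       # d is the set of coefficients used for the returned c

# Two similar samples and one pointing the opposite way (toy values).
E = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
c, d = induce_class_vector(E, W_s=np.eye(2), b_s=np.zeros(2))
print(d)  # the third (dissimilar) sample has the smallest coefficient
```

This matches the attention-like intuition above: a sample that disagrees with the emerging class vector gets a negative update to its logit, so its weight keeps shrinking over the iterations.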