Contents
2.1 Differences between meta-learning and machine learning
extension: How does meta-learning an initialization differ from pre-training?
3.4 Metric-based Meta-Learning Algorithm: Prototypical Network
4.1 HG-Meta: Graph Meta-learning over Heterogeneous Graphs (2022, SDM)
4.2 G-Meta: Graph learning via local subgraphs (2020, NeurIPS)
5.1 Event extraction (Zero-shot Transfer Learning for Event Extraction)
1. background
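Meta-learning has been one of the most popular learning paradigms of the past few years, and a wide variety of papers build on it. Training deep learning models is hardware-intensive, and manual hyperparameter tuning demands even more computation. Another pain point is that a model trained on large amounts of data for one task must be retrained when switching to another task, which costs considerable time and effort. Industry, with ample budgets and large GPU fleets, can absorb these costs, but students with limited funding cannot. Meta-learning can effectively reduce the computational cost of extensive hyperparameter tuning and of retraining models when switching tasks.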
2. concept of meta-learning
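Meta-learning aims to give a model the ability to learn how to learn (including how to tune its own settings), so that it can quickly learn new tasks on the basis of knowledge it has already acquired.
2.1 Differences between meta-learning and machine learning
1. In machine learning, hyperparameters are first tuned by hand, and a deep model is then trained directly on a specific task. In meta-learning, a good set of parameters is first learned from other tasks, and training on the specific task starts from there.
2. The training unit in meta-learning is a task, whereas in machine learning it is a single data sample. Machine learning optimizes the model on data, which is split into training, validation, and test sets. Meta-learning is split into a training stage (Train Tasks / Across tasks) and a testing stage (Test Task / Within task). The training stage learns from many sub-tasks in order to obtain good hyperparameters, referred to as prior knowledge; the test task then uses these learned hyperparameters to train the model parameters for one specific task. The data of each sub-task in the training stage is split into a support set and a query set; the data in the testing stage is split into a training set and a test set.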
2.3 meta-learning category
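2.3.1 Learning to preprocess the dataset
During data preprocessing, data augmentation improves model robustness. Conventional data augmentation is fairly rigid: it only applies operations such as rotation, color transformation, and scaling to images. Meta-learning can augment data automatically and with more diversity; a representative work is DADA.
Paper: DADA: Differentiable Automatic Data Augmentation
Link: https://arxiv.org/pdf/2003.03780v1.pdf
Venue: ECCV 2020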
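2.3.2 Learning the initialization parameters
The quality of the weight initialization affects a model's final classification performance. Meta-learning can learn a good weight initialization that helps the model learn on new tasks. The representative work on learning the initialization is MAML (Model-Agnostic Meta-Learning). It focuses on improving the model's overall ability to learn rather than its ability to solve one specific problem: during training it keeps switching between different tasks in order to find good initial network parameters, and the resulting model learns faster when faced with a new task.
Paper: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Link: https://arxiv.org/pdf/1703.03400.pdf
Venue: ICML 2017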
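2.3.3 Learning the network structure
Designing a neural network's structure is a real headache: how deep the network should be, how wide each layer should be, how many convolution kernels each layer should have, what size each kernel should be, whether dropout is needed, and so on. To date there is no consensus or theorem that answers these questions clearly and precisely, which is why neural architecture search (NAS) emerged. Ultimately, neural architecture search can be viewed as a subfield of meta-learning. Note that the search over network structures cannot be solved by gradient descent directly, because the problem is not differentiable; reinforcement learning or evolutionary algorithms are generally used instead.
Paper: Neural Architecture Search with Reinforcement Learning
Link: https://arxiv.org/abs/1611.01578
Venue: ICLR 2017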
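2.3.4 Learning to choose the optimizer
Choosing the optimizer is an important part of training a neural network; different optimizers strongly influence the direction of the gradient updates. Well-known optimizers include Adam, RMSprop, SGD, and NAG. Meta-learning can help choose a good optimizer before training on a specific task; a representative work is:
Paper: Learning to learn by gradient descent by gradient descent
Link: https://arxiv.org/pdf/1606.04474.pdf
Venue: NIPS 2016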
3. Meta-learning Algorithm
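Meta-learning has two stages: stage one trains on the training tasks, and stage two trains on the test task. In the algorithm flowcharts of some papers, the training tasks sit in the outer loop and the test task sits in the inner loop.
3.1 Stage one: training on the training tasks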
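Suppose the training stage is given $h$ training sub-tasks, and the dataset of each sub-task is split into a Support set and a Query set. First, starting from the shared hyperparameters $\phi$, each of the $h$ sub-tasks is trained on its own Support set, which yields task-specific model parameters, written here as $\theta_i$ for sub-task $i$. Then the Query set of each sub-task is used to test the model with parameters $\theta_i$, and the loss between the predictions and the true labels, written here as $l_i$, is computed. Next, the $h$ losses are combined into $L(\phi)$:

$$L(\phi) = \sum_{i=1}^{h} l_i$$

Finally, gradient descent is applied: the derivative $L'(\phi)$ is computed and used to update $\phi$, in order to find the best hyperparameter setting. If the derivative of $L(\phi)$ cannot be computed, reinforcement learning or evolutionary algorithms can be used instead. The stage-one training procedure is summarized in the block diagram below.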
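3.2 Stage two: training on the test task
The test task is an ordinary machine learning process: its dataset is split into a training set and a test set. The purpose of stage one is to find a good hyperparameter setting, the prior knowledge $\phi$; with this prior knowledge, the specific test task can be trained more effectively. The stage-two training procedure is summarized in the block diagram below.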
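3.3 Algorithm for meta-learning the initialization parameters
Step 1: Give the classifiers of all sub-tasks the same network structure. Randomly sample a batch of sub-tasks from the full set of sub-tasks, and assign the initial weights $\phi$ to the network of each sampled sub-task.
Step 2: Each sampled sub-task trains on its own Support set and updates its parameters, written here as $\theta_i$ for sub-task $i$. In MAML the parameters are updated for one step; in Reptile they are updated for multiple steps.
Step 3: Test the $\theta_i$ trained in the previous step on the Query set and compute the loss $l_i$ of each task.
Step 4: Combine the losses $l_i$ of the different sub-tasks into $L(\phi)$.
Step 5: Compute the derivative of the loss $L(\phi)$ with respect to $\phi$, and update the initialization parameters $\phi$.
Repeat the steps above until the stopping requirement is met.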
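The five steps can be sketched on a toy problem. The following is a minimal illustration, not the reference MAML implementation: it assumes a hypothetical family of sub-tasks (fit y = a * x with a task-specific slope a), a one-parameter model y = w * x, and made-up step sizes `alpha` and `beta`. Because the model has a single parameter, the derivative through the one-step inner update can be written out by hand.

```python
import random

def mse_grad(w, xs, ys):
    """Derivative of the mean squared error of the model y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def mse_second_derivative(xs):
    """Second derivative of the same loss with respect to w (it does not depend on w)."""
    return sum(2.0 * x * x for x in xs) / len(xs)

def sample_task(rng, n_support=5, n_query=5):
    """Hypothetical toy sub-task: regress y = a * x for a randomly drawn slope a."""
    a = rng.uniform(1.0, 3.0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_support + n_query)]
    ys = [a * x for x in xs]
    support = (xs[:n_support], ys[:n_support])
    query = (xs[n_support:], ys[n_support:])
    return support, query

def maml_train(meta_steps=500, tasks_per_step=4, alpha=0.1, beta=0.05, seed=0):
    rng = random.Random(seed)
    phi = 0.0  # the shared initialization that is being meta-learned
    for _ in range(meta_steps):
        meta_grad = 0.0
        for _ in range(tasks_per_step):
            # Step 1: sample a sub-task; its model starts from phi.
            (xs_s, ys_s), (xs_q, ys_q) = sample_task(rng)
            # Step 2: one gradient step on the Support set gives theta_i.
            theta = phi - alpha * mse_grad(phi, xs_s, ys_s)
            # Step 3: derivative of the Query loss l_i, evaluated at theta_i.
            grad_q = mse_grad(theta, xs_q, ys_q)
            # Steps 4-5: chain rule through the inner step,
            # d(theta_i)/d(phi) = 1 - alpha * (second derivative of the Support loss).
            meta_grad += grad_q * (1.0 - alpha * mse_second_derivative(xs_s))
        # Step 5: update phi (the summed gradient is averaged over the batch here).
        phi -= beta * meta_grad / tasks_per_step
    return phi

phi = maml_train()
print(phi)
```

In this toy setting the slopes are drawn symmetrically around 2, so the learned initialization should settle near 2, from which a single inner step adapts to any sampled slope. A Reptile-style variant would instead take several inner steps and move `phi` toward the adapted parameters.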
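extension: How does meta-learning an initialization differ from pre-training?
Pre-training (i.e., transfer learning) uses a training set made up of individual data samples, whereas meta-learning trains on many small tasks, each consisting of a Support set for training and a Query set for testing.
With the same network structure, pre-training trains a single set of model parameters across the different tasks, whereas meta-learning trains a different set of model parameters in each task. Comparing the gradient formulas of the two shows that pre-training is simple and direct: it looks for one initialization that performs well right now on all tasks (in practice, on most tasks). Meta-learning is more involved, but it cares about the future potential of the initialization, that is, how well the parameters perform after they have been trained on a task.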
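The contrast can be made concrete with the two objectives as they are commonly written for MAML-style methods. The notation here is an assumption, since the original formulas are not reproduced in the text: $h$ tasks, per-task loss $l_i$, inner step size $\alpha$, and a single inner gradient step on the Support set.

```latex
% Pre-training: the same parameters \phi are evaluated directly on every task.
L_{\text{pre}}(\phi) = \sum_{i=1}^{h} l_i(\phi),
\qquad
\nabla_{\phi} L_{\text{pre}}(\phi) = \sum_{i=1}^{h} \nabla_{\phi}\, l_i(\phi)

% Meta-learning (MAML-style): \phi is first adapted to task i, then evaluated.
\theta_i = \phi - \alpha \nabla_{\phi}\, l_i^{\text{support}}(\phi),
\qquad
L_{\text{meta}}(\phi) = \sum_{i=1}^{h} l_i(\theta_i)

\nabla_{\phi} L_{\text{meta}}(\phi)
  = \sum_{i=1}^{h} \bigl(I - \alpha \nabla_{\phi}^{2}\, l_i^{\text{support}}(\phi)\bigr)\,
    \nabla_{\theta}\, l_i(\theta)\big|_{\theta = \theta_i}
```

The pre-training gradient is evaluated at $\phi$ itself, while the meta-learning gradient is evaluated at the adapted parameters $\theta_i$, which is why the latter rewards initializations that become good after adaptation.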
3.4 Metric-based Meta-Learning Algorithm: Prototypical Network
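Metric-based meta-learning relies to some extent on hand-chosen metric priors, such as a measure of the distance between classes (Euclidean distance, cosine distance, and so on).
Although Prototypical Network was published later than Matching Network, its algorithm design is simpler and easier to understand.
Prototypical Network directly computes a prototype for each class: a vector representation in the latent (embedding) space, given by the mean of all samples of that class (that is, c1, c2, c3). Once a good embedding function has been trained, the class of an unseen sample x is determined by the class of its nearest prototype in the latent space, i.e., a nearest neighbor classifier.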
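A minimal sketch of the prototype computation and the nearest-prototype rule follows. It assumes the samples have already been embedded: the 2-D vectors below are made-up stand-ins for the output of a trained embedding network, and the function names are illustrative only.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def compute_prototypes(support):
    """Map each class label to the mean of its embedded support samples."""
    return {label: mean_vector(vectors) for label, vectors in support.items()}

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(x, prototypes):
    """Return the label whose prototype is nearest to the embedded sample x."""
    return min(prototypes, key=lambda label: euclidean(x, prototypes[label]))

# Hypothetical embedded support samples for three classes.
support = {
    "c1": [[0.0, 0.0], [0.2, 0.0]],
    "c2": [[1.0, 1.0], [1.0, 1.2]],
    "c3": [[-1.0, 1.0], [-1.2, 1.0]],
}
prototypes = compute_prototypes(support)
predicted = classify([0.9, 0.8], prototypes)
print(predicted)
```

The query point [0.9, 0.8] lies closest to the mean of the c2 samples, so it is assigned to c2.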
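3.4.1 Terminology
1) episode: occurring as part of a larger sequence; here it means one round of N-way K-shot sampling, where N is the number of classes and K is the number of samples drawn per class. In the pseudocode of the Prototypical Network paper, Nc denotes the number of classes and Ns+Nq is the amount of data per class.
2) support set: contains Nc classes, each with Ns samples.
3) query set: contains Nc classes, each with Nq samples.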
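The three terms map directly onto a small sampling routine. This is an illustrative sketch only: the dataset layout (a dict from class label to a list of samples) and the function name `sample_episode` are assumptions, and `n_way`, `n_support`, `n_query` play the roles of Nc, Ns, and Nq.

```python
import random

def sample_episode(dataset, n_way, n_support, n_query, rng):
    """Sample one episode: n_way classes with n_support + n_query samples each."""
    classes = rng.sample(sorted(dataset), n_way)
    support, query = {}, {}
    for label in classes:
        # Draw without replacement so support and query samples never overlap.
        picked = rng.sample(dataset[label], n_support + n_query)
        support[label] = picked[:n_support]
        query[label] = picked[n_support:]
    return support, query

# Toy dataset: 5 classes with 10 integer "samples" each.
dataset = {f"class_{c}": list(range(c * 10, c * 10 + 10)) for c in range(5)}
support, query = sample_episode(
    dataset, n_way=3, n_support=2, n_query=4, rng=random.Random(0)
)
print(sorted(support))
```

Each call yields a fresh 3-way episode with 2 support and 4 query samples per class; training repeatedly draws such episodes.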
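3.4.2 Algorithm walkthrough
(1) Randomly sample an N-way K-shot task from a large dataset; this is one episode.
(2) Split the episode into a support set and a query set.
(3) Compute the prototypes as class means over the support set.
(4) Going through the classes in order, compute the loss function over the query-set data of all classes.
(5) Keep drawing different episodes and repeat steps (1)(2)(3)(4).
Let us take a closer look at the loss function used here.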
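In the Prototypical Network paper, the loss is the negative log-probability of the true class, where the class probabilities are a softmax over negative distances between the embedded query sample and each prototype. The sketch below is one way to compute it for a single episode; it assumes squared Euclidean distance and already-embedded samples, and the helper names are illustrative.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prototypes_from_support(support):
    """Class prototype = mean of that class's embedded support samples."""
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in support.items()
    }

def log_probabilities(x, prototypes):
    """log p(y = k | x), with p proportional to exp(-squared distance to prototype k)."""
    logits = {k: -squared_distance(x, c) for k, c in prototypes.items()}
    largest = max(logits.values())  # subtract the max for numerical stability
    log_z = largest + math.log(sum(math.exp(v - largest) for v in logits.values()))
    return {k: v - log_z for k, v in logits.items()}

def episode_loss(support, query):
    """Mean negative log-probability of the true class over all query samples."""
    prototypes = prototypes_from_support(support)
    total, count = 0.0, 0
    for label, vectors in query.items():
        for x in vectors:
            total -= log_probabilities(x, prototypes)[label]
            count += 1
    return total / count

# Hypothetical embedded episode with two classes.
support = {"a": [[0.0, 0.0], [0.0, 0.2]], "b": [[2.0, 2.0], [2.0, 2.2]]}
query_near = {"a": [[0.1, 0.1]], "b": [[1.9, 2.0]]}   # queries close to their own class
query_far = {"a": [[1.9, 2.0]], "b": [[0.1, 0.1]]}    # labels swapped on purpose
loss_near = episode_loss(support, query_near)
loss_far = episode_loss(support, query_far)
print(loss_near < loss_far)
```

Minimizing this loss with respect to the embedding network pulls query samples toward their own class prototype and pushes them away from the others.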
4. meta-learning papers
4.1 HG-Meta: Graph Meta-learning over Heterogeneous Graphs (2022, SDM)
4.2 G-Meta: Graph learning via local subgraphs (2020, NeurIPS)
5. Applications
5.1 Event extraction (Zero-shot Transfer Learning for Event Extraction)
- Problem: Most previous event extraction studies have relied heavily on features derived from annotated event mentions, and thus cannot be applied to new event types without annotation effort.
- Solution: We designed a transferable neural architecture, mapping event mentions and types jointly into a shared semantic space using structural and compositional neural networks, where the type of each event mention can be determined by the closest of all candidate types.
- Scheme: By leveraging (1) available manual annotation for a small set of existing event types and (2) existing event ontologies, our framework applies to new event types without requiring additional annotation.
(1) goal of event extraction: identify event triggers and event arguments from unstructured data.
---> poor portability of traditional supervised methods and the limited coverage of available event annotations.
---> problem: handling new event types means starting from scratch without being able to re-use annotations for old event types.
reasons: these approaches modeled event extraction as a classification problem, encoding features only by measuring the similarity between rich features encoded for test event mentions and annotated event mentions.
--->We observed that both event mentions and types can be represented with structures.
event mention structure <--- constructed from trigger and candidate arguments
event type structure <--- consists of event type and predefined roles
---> Figure 2.
Figure 2: Examples of Event Mention and Type Structures from ERE.
AMR --> abstract meaning representation, to identify candidate arguments and construct event mention structures.
ERE --> entity relation event; event types can also be represented with structures from ERE.
besides the lexical semantics that relates a trigger to its type, their structures also tend to be similar.
this observation is similar to the theory that the semantics of an event structure can be generalized and mapped to event mention structures in a systematic and predictable way.
event extraction task --> by mapping each mention to its semantically closest event type in the ontology.
---> one possible implementation: Zero-Shot Learning(ZSL), which had been successfully exploited in visual object classification.
main idea of ZSL for vision tasks: represent both images and type labels in a multi-dimensional vector space separately, and then learn a regression model to map from the image semantic space to the type label semantic space based on annotated images for seen labels. This regression model can then be used to predict the unseen labels of any given image.
---> our goal is to effectively transfer the knowledge of events from seen types to unseen types, so that we can extract event mentions of any type defined in the ontology.
We design a transferable neural architecture, which jointly learns and maps the structural representations of both event mentions and types into a shared semantic space by minimizing the distance between each event mention and its corresponding type.
For event mentions of unseen types, their structures are projected into the same semantic space using the same framework and assigned the types with top-ranked similarity values.
(2) Approach
Event Extraction: triggers; arguments
Figure 3: Architecture Overview
1) a sentence S, start by identifying candidate triggers and arguments based on AMR parsing.
e.g. dispatching is the trigger of a Transport_Person event with four arguments(0, China; 1, troops; 2, Himalayas; 3, time)
we build a structure St using AMR as shown in Figure 3. e.g. dispatch-01
2) each structure is composed of a set of tuples, e.g. <dispatch-01, :ARG0, China>
we use a matrix to represent each AMR relation, composing its semantics with two concepts for each tuple, and feed all tuple representations into CNN to generate event mention structure representation Vst(namely candidate trigger).
----> pooling & concatenation. --> Vst
Shared CNN----> Convolution Layer
----> Structure Composition Layer <--St
3) Given a target event ontology, for each type y, e.g. Transport_Person, we construct a type structure Sy by incorporating its predefined roles, and use a tensor to denote the implicit relation between any types and arguments.
compose the semantics of type and argument role with the tensor for each tuple, e.g. <Transport_Person, Destination>
we generate the event type structure representation Vsy using the same CNN.
4) By minimizing the semantic distance between dispatch-01 and Transport_Person (that is, between Vst and Vsy), we jointly map the representations of event mentions and event types into a shared semantic space, where each mention is closest to its annotated type.
5) After training, the compositional functions and CNNs can be further used to project any new event mention(e.g. donate-01) into the semantic space and find its closest event type()
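Step 5) amounts to a nearest-type lookup in the shared space. The sketch below illustrates only that final lookup, under loud assumptions: the 3-D vectors are made-up stand-ins for Vst and Vsy (the real ones come from the trained composition functions and CNN), cosine similarity is assumed as the similarity measure, and `Type_A` and `Type_B` are hypothetical event type names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_event_types(mention_vec, type_vecs):
    """Return event type names ordered from most to least similar to the mention."""
    return sorted(
        type_vecs,
        key=lambda name: cosine_similarity(mention_vec, type_vecs[name]),
        reverse=True,
    )

# Hypothetical type representations (stand-ins for Vsy) in the shared space.
type_vecs = {
    "Transport_Person": [0.9, 0.1, 0.0],
    "Type_A": [0.0, 1.0, 0.1],
    "Type_B": [0.1, 0.0, 1.0],
}
# Hypothetical mention representation (stand-in for Vst of dispatch-01).
mention_vec = [0.8, 0.2, 0.1]
ranking = rank_event_types(mention_vec, type_vecs)
print(ranking[0])
```

Because the lookup only needs a vector for each candidate type, a type that had no annotated mentions during training can still be ranked, which is what makes the classification zero-shot.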
(3) Joint Event Mention and Type Label Embedding
CNNs are good at capturing sentence-level information in various NLP tasks.
--> we use it to generate structure-label representations.
For each event mention structure St=(u1, u2, ..., uh) and each event type structure Sy=(u1', u2', ..., up'), which contain h and p tuples respectively,
--> we apply a weight-sharing CNN to each input structure to jointly learn event mention and type structural representations, which will later be used to learn the ranking function for zero-shot event extraction.
--> The input layer is a sequence of tuples, where each tuple is represented by a d x 2 dimensional vector; thus each mention structure and each type structure are represented as a feature map of dimensionality d x 2h* and d x 2p* respectively.
--> Convolution Layer
--> Max-Pooling
--> Learning
(4) Joint Event Argument and Role Embedding
(5) Zero-Shot Classification