Literature Survey: Few-Shot Representation Learning for Out-Of-Vocabulary Words (ACL'19)

Preface

This is a literature survey written by Zhiping (Patricia) Xiao for the CS263: NLP course taught by Prof. Kai-Wei Chang at UCLA in the Spring 2020 quarter, submitted as homework.

Resources

Link to the paper: https://arxiv.org/abs/1907.00505

Code of the paper (official): https://github.com/acbull/HiCE

Background & Problem Formalization

The paper was published at ACL 2019 and proposes a model named HiCE. Links to the paper and its official code are attached above.

To begin with, we need to make sure everyone is on the same page by clarifying two terms:

  1. few-shot representation learning
  2. out-of-vocabulary words (abbreviated as OOV words later on)

In brief, just as its name suggests, few-shot learning means learning from only a few observations. This is something humans handle easily but machines do not. Most machine learning methods rely heavily on massive amounts of training data, which is not always practical in real life: no matter how much data you give your model, its observations are never more than a glance at the real world.

In word representation learning, this limitation shows up as "out-of-vocabulary words", a.k.a. OOV words: words that are not observed in the training set but appear in the test set or in real applications.

To formalize, the scenario is: given word embeddings pre-trained on a corpus D_T, and an OOV word observed in a new corpus D_N with only a few example sentences demonstrating its usage, learn an embedding for that word.

The problem is therefore naturally formalized into a few-shot regression problem.

As baselines, they take other well-known pre-trained word-embedding models and apply those models' own mechanisms to the few given examples of each OOV word to infer its embedding. These baselines should be considered reasonably strong, since they are the most intuitive ways of learning OOV-word embeddings.

The models are evaluated by their performance on a well-known OOV benchmark dataset called Chimera, as well as by how well the learned embeddings perform on downstream tasks: (1) named entity recognition (NER) and (2) part-of-speech (POS) tagging.

HiCE outperforms the rest in all test cases in all the evaluations.


Methodology

Figure 1 in the paper

The overall structure of the HiCE model is depicted in the figure. It is not especially complex, and in the following parts we go through some of the details.

Given the training set D_T and the test set D_N (which contains some OOV words), the methodology can be summarized as follows:

  • The general idea is to learn a neural regression function F_\theta(\cdot) on D_T, parameterized by \theta. F_\theta(\cdot) takes in (1) the few contexts of an OOV word (in D_N) and (2) the morphological features of the OOV word, and outputs an approximation of the word's embedding vector (the target it is regressed toward is called the "oracle embedding" in the paper). The "neural regression function" can simply be regarded as a deep learning model.
  • Select N words with sufficient observations, w = \{ w_1, w_2, \dots, w_N \}. \forall t \in \{1, 2, \dots, N\}, let T_{w_t} be the corresponding (oracle) embedding of w_t, and use S_t to denote all sentences in D_T that contain w_t.
  • \forall w_t \in w, we use its character sequence to represent its morphological features, denoted C_t.
  • In each training episode of F_\theta(\cdot), build a masked supporting context set S_t^K = \{s_{t,k}\}_{k=1}^K out of K randomly-selected sentences from S_t, where each s_{t,k} is the k-th selected sentence with the target word w_t masked out (see the sketch below).
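To make the episode construction concrete, here is a minimal Python sketch of how a masked supporting context set might be assembled. The function and variable names (build_episode, MASK, etc.) are my own illustrative choices, not taken from the official HiCE code.

```python
import random

MASK = "<mask>"  # placeholder token; the actual mask symbol is an implementation detail

def build_episode(sentences_with_target, target_word, k):
    """Sample K context sentences containing target_word and mask the word out.

    sentences_with_target: list of token lists, each containing target_word.
    Returns the masked supporting context set S_t^K.
    """
    picked = random.sample(sentences_with_target, k)
    return [
        [MASK if tok == target_word else tok for tok in sent]
        for sent in picked
    ]

# Example: treat "glasses" as a pseudo-OOV word with K = 2 contexts.
contexts = [
    ["she", "wears", "glasses", "to", "read"],
    ["his", "glasses", "fell", "off"],
    ["new", "glasses", "are", "expensive"],
]
print(build_episode(contexts, "glasses", k=2))
```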

The training objective is as simple as:

\hat{\theta} = \text{arg}\max_\theta \sum_{w_t} \sum_{S_t^K \sim S_t} \cos(F_\theta(S_t^K, C_t), T_{w_t})
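In code, this objective amounts to maximizing the cosine similarity between the predicted vector and the oracle embedding, i.e., minimizing a negative-cosine loss per episode. The sketch below is a hedged PyTorch-style illustration; episode_loss and the model interface are my assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def episode_loss(model, masked_contexts, char_features, oracle_embedding):
    """Negative cosine similarity between predicted and oracle embeddings.

    model           : any network playing the role of F_theta
    masked_contexts : encoding of S_t^K
    char_features   : encoding of C_t
    oracle_embedding: pre-trained embedding T_{w_t}, shape (dim,)
    """
    predicted = model(masked_contexts, char_features)              # shape (dim,)
    cos = F.cosine_similarity(predicted, oracle_embedding, dim=0)  # scalar in [-1, 1]
    return -cos  # maximizing cosine similarity == minimizing this loss

# Toy check with a dummy "model" that ignores its inputs:
dummy = lambda ctx, chars: torch.ones(300)
print(episode_loss(dummy, None, None, torch.ones(300)))  # tensor(-1.), vectors align perfectly
```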

One also needs some background knowledge of the self-attention mechanism to fully understand the context encoder and multi-context aggregator parts illustrated in the figure (Figure 1 in the original paper).


Self-Attention Encoder

For a better understanding of what a self-attention encoder is, I recommend reading the related papers mentioned at the end of this survey. Below is a brief introduction to self-attention.

To begin with, in order to understand self-attention, we first need to understand what attention is. Attention in real life, from a human perspective, means placing weighted focus on different things, or on different parts of one thing.

In machine learning, the attention mechanism did not come into being all at once. At first, people were more comfortable with "key-value" pairs (pairs of vectors) stored in memory: when a "query" (a vector with the same dimensionality as a key) arrives, compare it with the keys, find the most similar key, and return the associated value.

Later on, people learned to treat this lookup "softly": compare the query with every key and return a weighted sum of the values, where each weight is computed from the similarity between the query and the corresponding key.

That idea remains the core of attention today. To describe it mathematically, let K, Q, V be the matrices of keys, queries, and values, respectively. The attention output A is calculated as:

A = softmax\Big( \frac{Q K^T}{\sqrt{d}} \Big) V

where K \in \mathbb{R}^{n \times d}, Q \in \mathbb{R}^{q \times d}, V \in \mathbb{R}^{n \times m}, and thus A \in \mathbb{R}^{q \times m}.
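A direct transcription of this formula as a generic scaled dot-product attention function in PyTorch (not code from the paper) might look like this:

```python
import math
import torch

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (q, d) queries, K: (n, d) keys, V: (n, m) values  ->  A: (q, m)
    """
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # (q, n): similarity of each query to each key
    weights = torch.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ V                               # weighted sum of the values

Q, K, V = torch.randn(2, 8), torch.randn(5, 8), torch.randn(5, 16)
print(attention(Q, K, V).shape)  # torch.Size([2, 16])
```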

When we say self-attention, we mean that K, Q, and V all come from the same source. Normally, a linear transformation is applied to that source for each role, so that different dimensions of the input vectors can be weighted differently when the self-attention is applied.

Here x is the input matrix, and three different linear projections are applied to it to generate K, Q, and V: K = x W^K, Q = x W^Q, V = x W^V.

Furthermore, this is a multi-head attention setting, meaning that the attention is calculated multiple times: instead of a single attention result A, we have several a_{self, i}, where i indexes the heads. More precisely, for each head we have K_i = x W_i^K, Q_i = x W_i^Q, V_i = x W_i^V, and:

a_{self, i} = softmax\Big( \frac{Q_i K_i^T}{\sqrt{d}} \Big) V_i
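Putting the per-head projections and the scaled dot-product together, here is an illustrative multi-head self-attention module in PyTorch; the dimensions and head count are arbitrary, and in practice one would typically use torch.nn.MultiheadAttention instead.

```python
import torch
import torch.nn as nn

class SimpleMultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention: K, Q, V all come from the same input x."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # One linear map per role; each is split into n_heads pieces below.
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                       # x: (seq_len, d_model)
        L, _ = x.shape
        def split(t):                           # (seq_len, d_model) -> (n_heads, seq_len, d_head)
            return t.view(L, self.n_heads, self.d_head).transpose(0, 1)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5   # per-head (seq_len, seq_len)
        heads = torch.softmax(scores, dim=-1) @ V               # (n_heads, seq_len, d_head)
        return heads.transpose(0, 1).reshape(L, -1)             # concatenate the heads

x = torch.randn(5, 64)                          # a toy sequence of 5 positions
print(SimpleMultiHeadSelfAttention()(x).shape)  # torch.Size([5, 64])
```

Real implementations usually add one more linear output projection on top of the concatenated heads; it is omitted here to keep the sketch minimal.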

The feed-forward network and the normalization components are easy to understand. A less intuitive component is the positional encoding. When dealing with a sequence, a token's position is informative, and the information is incomplete without it; yet no other part of the model keeps track of positions.

The currently dominant strategy is to add a positional encoding vector to the embedding at each position; the encoding is usually defined by sin / cos functions of varying frequencies.
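For reference, a standalone sinusoidal positional encoding in the style of the Transformer paper could be written as follows (the exact variant differs across implementations):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sin/cos positional encodings."""
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)            # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                      # (d_model / 2,)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cosine
    return pe

# The encoding is added to the token embeddings:
embeddings = torch.randn(10, 64)
embeddings = embeddings + sinusoidal_positional_encoding(10, 64)
```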


Model Agnostic Meta-Learning (MAML)

The design so far sounds great, but it can still be problematic when D_T and D_N differ substantially in linguistic and semantic terms, e.g. when the same word carries different meanings in the two corpora. Directly adapting F_\theta(\cdot) from D_T to D_N by plain fine-tuning might not be robust enough.

Therefore, MAML is adopted (a simplified variant of the original MAML is used here). In brief, it is a method of "learning to fine-tune".

Concretely, it does the following:

  1. learn an updated \theta^* on D_T, where data is sufficient;
  2. optimize (i.e., take a few gradient steps) starting from \theta^* as the initial parameters, with the loss calculated on the limited corpus in D_N (see the sketch below).
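For readers unfamiliar with MAML, the sketch below shows one generic first-order MAML meta-update in PyTorch. It illustrates the general technique, not the paper's exact simplified variant; loss_fn, support_batch, and query_batch are placeholder names of my own.

```python
import copy
import torch

def fomaml_step(model, loss_fn, support_batch, query_batch,
                meta_optimizer, inner_lr=1e-2, inner_steps=1):
    """One first-order MAML meta-update (illustrative sketch)."""
    # 1. Inner loop: adapt a deep copy of the current parameters on the
    #    small support set (the "fine-tuning" that MAML learns to prepare for).
    adapted = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        loss_fn(adapted, support_batch).backward()
        inner_opt.step()

    # 2. Outer loop: evaluate the adapted parameters on the query set, and apply
    #    those gradients to the original parameters (the first-order approximation
    #    drops the second-order terms of full MAML).
    adapted.zero_grad()
    loss_fn(adapted, query_batch).backward()
    meta_optimizer.zero_grad()
    for p, p_adapted in zip(model.parameters(), adapted.parameters()):
        p.grad = p_adapted.grad.clone()
    meta_optimizer.step()
```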

Related Work

Attention Is All You Need (NeurIPS'17)

Figure: the Transformer architecture

Link to the paper: https://arxiv.org/abs/1706.03762

This paper was published at a time when almost all (NLP) sequence transduction models were based on complex networks involving RNNs and/or CNNs. It proposed the Transformer, which relies heavily on the self-attention mechanism, together with supporting components such as positional encoding and feed-forward networks.

This is a famous paper: anyone who has looked into BERT or XLNet should know it well, since both are built on the Transformer architecture proposed here. To be more precise, XLNet is not built directly on the Transformer; it relies on Transformer-XL, which can be regarded as an extension of the Transformer that is more expressive because it handles arbitrarily long sequences, unlike the Transformer, whose input length has a fixed upper bound.

Also along this Transformer line, though less famous and slightly different, are GPT (Improving Language Understanding by Generative Pre-Training) and GPT-2 (Language Models are Unsupervised Multitask Learners).

The Transformer is related to this work because one of HiCE's most important components, the self-attention encoder cell, has essentially the same structure.

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (ICML'17)

Paper on ArXiv: https://arxiv.org/abs/1703.03400

This paper is where the MAML algorithm used in HiCE comes from. It proposes a model-agnostic meta-learning algorithm, where "model-agnostic" means it can be applied to any model trained with gradient descent, rather than being limited to one model type.

The idea of meta-learning is similar to the few-shot learning setting introduced earlier. The goal is to gain the ability to solve new learning tasks from only a few samples, which is typically achieved by training the model on a variety of learning tasks so that it captures generally applicable rules. Similar ideas also underlie multi-task learning.

They propose an adaptation algorithm that differs from the straightforward pre-train-then-fine-tune framework: the parameters are explicitly trained so that a small number of gradient steps on a new task leads to good performance. The paper shows that with only a few gradient steps and a small amount of training data from a new task, the algorithm generalizes well.

Attentive Mimicking: Better Word Embeddings by Attending to Informative Contexts (NAACL'19)

Paper on ArXiv: https://arxiv.org/abs/1904.01617

This paper serves as the foundation of the next paper to be introduced (Rare Words, AAAI'20). In brief, it proposes another solution for learning rare-word embeddings.

The high-level idea of this paper is exactly the same as the design of HiCE: "given embeddings learned by a standard algorithm, a model is first trained to reproduce embeddings of frequent words from their surface form and then used to compute embeddings for rare words."

The differences are in the details. They use the Form-Context Model (FCM) and learn attention over contexts. Their model appears simpler, yet they claim to improve the embedding quality not only of rare words but also of medium-frequency words.

Rare Words: A Major Problem for Contextualized Embeddings And How to Fix it by Attentive Mimicking (AAAI'20)

Paper on ArXiv: https://arxiv.org/abs/1904.06707

This paper is closely related to HiCE in topic. Extending Attentive Mimicking, it adapts the approach to BERT. Because of the nature of BERT (mentioned above when discussing the Transformer), the model becomes much more complex than Attentive Mimicking and HiCE. Interestingly, by introducing BERT, its components also become more and more similar to those of the HiCE model.

In a way, this work shares the same high-level design as HiCE and thus a very similar framework in general, with some very similar cells, although the models' overall structures differ. Without a direct experimental comparison, it is beyond my knowledge to say which one is better.

Learning to Customize Model Structures for Few-shot Dialogue Generation Tasks (ACL'20)

Paper on DeepAi: https://deepai.org/publication/learning-to-customize-language-model-for-generation-based-dialog-systems

Figure: the CMAML model

This paper relates to HiCE mainly through its use of MAML to solve few-shot learning tasks. The model is completely different in other respects, because the task it focuses on is completely different. Since it also works on meta-learning, at a very abstract level its idea is similar to HiCE's in that both use MAML.

In their setting, however, they target "building personalized chatbots that can interact with different users with according content and language styles". The overall structure of their model, CMAML, is shown in the figure above.

Learning to Learn Words from Visual Scenes

Its own webpage for exhibition: https://expert.cs.columbia.edu/

Paper on ArXiv: https://arxiv.org/abs/1911.11237

This work cites HiCE as one of its references, and the Rare Words paper (AAAI'20) as another, among many others. I believe the work is still ongoing.

So far, they have designed a meta-learning framework that learns how to learn ("meta-learning" essentially means "learning how to learn") word representations from unconstrained visual scenes, leveraging the structure of language.

When discussing the work on rare / OOV words, they position their model as different, since OOV words serve as both an input to and an output of the model. They also claim that no extra training is required, nor any gradient updates for new words.

And many more...

In general, this paper touches many interesting fields; to fully understand every detail involved, one might want to learn about attention mechanisms, meta-learning, optimization, and so on.

Hope this literature review helps you have a better understanding of the field.

Have a nice day.
