Cross-Lingual Learning: Paper Summary

Cross-lingual learning (CLL) addresses the lack of data for low-resource languages by exploiting annotated data from other languages. This post gives an overview of various cross-lingual resources, such as multilingual distributional representations, parallel corpora, word alignments, and pre-trained multilingual language models, and discusses transfer learning techniques including label, feature, parameter, and representation transfer. Future research directions include standardizing multilingual datasets, reducing the challenges of truly low-resource languages, and addressing the curse of multilinguality.


Cross-lingual learning

Most languages do not have training data available to create state-of-the-art models and thus our ability to create intelligent systems for these languages is limited as well.

Cross-lingual learning (CLL) is one possible remedy for the lack of data in low-resource languages. In essence, it is an effort to utilize annotated data from other languages when building new NLP models. In a CLL setting, the target languages usually lack resources, while the source languages are resource-rich and can be used to improve results for the former.

Cross-lingual resources

The domain shift, i.e. the difference between source and target languages, is often quite severe. The languages might have different vocabularies, syntax or even alphabets. Various cross-lingual resources are often employed to address the gap between languages.

Here is a short overview of different resources that might be used.

Multilingual distributional representations

With multilingual word embeddings (MWE), words from multiple languages share one semantic vector space. In this space, semantically similar words are close together regardless of the language they come from.

During training, MWE usually require additional cross-lingual resources, e.g. bilingual dictionaries or parallel corpora. Multilingual sentence embeddings work on a similar principle, but they use sentences instead of words. Ideally, corresponding sentences should have similar representations.
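
One common way to obtain such a shared space is to train monolingual embeddings independently and then map one space onto the other with an orthogonal (Procrustes) mapping learned from a bilingual seed dictionary. The sketch below is a minimal illustration of that idea in NumPy; the embedding matrices and the seed dictionary are hypothetical placeholders.

```python
import numpy as np

def procrustes_mapping(X_src, Y_tgt):
    """Learn an orthogonal matrix W such that X_src @ W ~ Y_tgt.

    X_src, Y_tgt: (n_pairs, dim) embeddings of translation pairs taken
    from a bilingual seed dictionary; row i of X_src is a source word
    and row i of Y_tgt is its translation.
    """
    # Closed-form solution of the orthogonal Procrustes problem via SVD.
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Hypothetical toy data: 100 seed translation pairs, 300-dim embeddings.
rng = np.random.default_rng(0)
X_src = rng.normal(size=(100, 300))   # source-language word vectors
Y_tgt = rng.normal(size=(100, 300))   # target-language word vectors

W = procrustes_mapping(X_src, Y_tgt)
# Any source-language vector can now be mapped into the target space:
mapped = X_src @ W
```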

Evaluation of multilingual distributional representations

Evaluation of multilingual distributional representations can be done either intrinsically or extrinsically.

  • With intrinsic evaluation, authors usually measure how well semantic similarity is reflected in the vector space, i.e. how close together semantically similar words or sentences end up (see the sketch after this list).

  • On the other hand, extrinsic evaluation measures how good the representations are for downstream tasks, i.e. they are evaluated by how well they perform for CLL.
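
One common way to measure how well a shared space captures cross-lingual similarity is bilingual lexicon induction: for each source word, retrieve the nearest target word by cosine similarity and check it against a gold dictionary. The sketch below assumes hypothetical `src_vecs` / `tgt_vecs` dictionaries mapping words to NumPy vectors and a gold translation dictionary.

```python
import numpy as np

def precision_at_1(src_vecs, tgt_vecs, gold):
    """Fraction of source words whose nearest target word (by cosine
    similarity in the shared space) matches the gold translation."""
    tgt_words = list(tgt_vecs)
    T = np.stack([tgt_vecs[w] for w in tgt_words])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)

    correct = 0
    for src_word, gold_translation in gold.items():
        v = src_vecs[src_word]
        v = v / np.linalg.norm(v)
        nearest = tgt_words[int(np.argmax(T @ v))]
        correct += nearest == gold_translation
    return correct / len(gold)
```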

Parallel corpus

Parallel corpora are one of the most basic linguistic resources. In most cases, a sentence-aligned parallel corpus of two languages is used. Wikipedia is sometimes used as a comparable corpus, although due to its complex structure it can also be used as a multilingual knowledge base.

Parallel corpora are most often created for specific domains, such as politics, religion or movie subtitles. They are also used as training sets for machine translation systems and for creating multilingual distributional representations, which makes them even more important.
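
Sentence-aligned corpora are often distributed as two plain-text files in which line i of one file is the translation of line i of the other. The snippet below is a minimal sketch of reading such a pair of files; the file names are hypothetical.

```python
def read_parallel_corpus(src_path, tgt_path):
    """Read a line-aligned parallel corpus into (source, target) sentence pairs."""
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        pairs = [(s.strip(), t.strip()) for s, t in zip(f_src, f_tgt)]
    return pairs

# Hypothetical file names for an English-German corpus.
pairs = read_parallel_corpus("corpus.en", "corpus.de")
```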

Word Alignments

In some cases, sentence alignment in parallel corpora might not be enough.

For word alignments, one word from a sentence in language $\ell_A$ can be aligned with any number of words from the corresponding sentence in language $\ell_B$. In most cases, automatic tools are used to perform word alignment over existing parallel sentences. Machine translation systems can also often provide word alignment information for their generated sentences.
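
Automatic aligners (e.g. fast_align or GIZA++) commonly output alignments in the "i-j" (Pharaoh) format, one sentence pair per line, where i-j means the i-th source token is aligned to the j-th target token. The sketch below parses that format; the example sentence pair is made up.

```python
from collections import defaultdict

def parse_alignment(line):
    """Parse a Pharaoh-format alignment line such as '0-0 1-2 2-1'
    into a mapping {source_index: [target_indices]}."""
    alignment = defaultdict(list)
    for pair in line.split():
        i, j = pair.split("-")
        alignment[int(i)].append(int(j))
    return alignment

src = "the house is small".split()
tgt = "das Haus ist klein".split()
align = parse_alignment("0-0 1-1 2-2 3-3")

for i, word in enumerate(src):
    print(word, "->", [tgt[j] for j in align.get(i, [])])
```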

Machine Translation

Machine translation (MT) can be used instead of parallel corpora to generate parallel sentences. A parallel corpus generated by MT is called a pseudo-parallel corpus. Although MT has achieved great improvements in recent years by using neural encoder-decoder models, it is still far from providing perfect translations. MT models are usually trained from parallel corpora.

By using samples generated by MT systems, we inevitably inject noise into our models; the domain shift between a language $\ell_A$ and what MT systems generate as language $\ell_A$ needs to be addressed.
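
As an illustration, a pseudo-parallel corpus can be built by running monolingual sentences through an off-the-shelf MT model. The sketch below assumes the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-de English-to-German model are available; any other MT system could be substituted.

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"   # assumed available via Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

english = ["Cross-lingual learning helps low-resource languages.",
           "Parallel corpora are expensive to create."]

batch = tokenizer(english, return_tensors="pt", padding=True)
generated = model.generate(**batch)
german = tokenizer.batch_decode(generated, skip_special_tokens=True)

# Pair each source sentence with its (noisy) machine translation.
pseudo_parallel = list(zip(english, german))
```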

Universal features (out of fashion)

Universal features are inherently language independent to some extent, e.g. emojis or punctuation. These can be used as features for any language. As such, a model trained with such universal features should be easily applicable to other languages.

The process of creating language independent features for words is called delexicalization. Delexicalized text has words replaced with universal features, such as POS tags. We lose the lexical information of the words in this process, thus the name delexicalization.
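
A minimal delexicalization sketch, assuming spaCy and its en_core_web_sm model are installed: each word is replaced by its universal POS tag, so the output contains no language-specific lexical items.

```python
import spacy

# Assumes the small English pipeline has been downloaded:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def delexicalize(text):
    """Replace every token with its universal POS tag."""
    return [token.pos_ for token in nlp(text)]

print(delexicalize("The cat sat on the mat."))
# e.g. ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN', 'PUNCT']
```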

Bilingual dictionary

Bilingual dictionaries are the most available cross-lingual resource in our list. They exist for many language pairs and provide a very easy and natural way of connecting words from different languages. However, they are often incomplete and context insensitive.
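
The sketch below illustrates the simplest possible use of a bilingual dictionary, word-by-word substitution, together with the two weaknesses mentioned above: missing entries are left untranslated, and words with several senses always get the same translation. The toy dictionary is made up.

```python
# Toy English-to-Spanish dictionary (hypothetical, intentionally incomplete).
en_es = {"the": "el", "bank": "banco", "is": "está", "closed": "cerrado"}

def word_by_word(sentence, dictionary):
    """Translate word by word; keep unknown words unchanged."""
    return [dictionary.get(w, w) for w in sentence.lower().split()]

print(word_by_word("The bank is closed", en_es))   # ['el', 'banco', 'está', 'cerrado']
print(word_by_word("The river bank is muddy", en_es))
# 'bank' is still translated as 'banco' (financial sense): context insensitivity.
# 'river' and 'muddy' are missing from the dictionary: incompleteness.
```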

Pre-trained multilingual language models

Pre-trained language models are a state-of-the-art NLP technique. A large amount of text data is used to train a high-capacity language model. We can then use the parameters of this language model to initialize further training on different NLP tasks; the parameters are fine-tuned with the additional target-task data. This is a form of transfer learning, where language modeling is used as the source task. The most well-known pre-trained language model is BERT.

Multilingual language models (MLMs) are an extension of this concept. A single language model is trained with multiple languages at the same time.

  • This can be done without any cross-lingual supervision, i.e. we feed the model with text from multiple languages and do not provide any additional information about the relations between them. Such is the case of the multilingual BERT model (mBERT). Interestingly enough, even with no information about how the languages are related, the representations this model creates are partially language independent. The model is able to pick up connections between languages without being explicitly told to do so (a small probing sketch follows this list). To know more about mBERT, check my previous post about common MLMs.

  • The other case is models that directly work with some sort of cross-lingual supervision, i.e. they use data that helps them establish a connection between different languages. Such is the case of XLM, which makes use of parallel corpora and machine translation to directly teach the model about corresponding sentences. To know more about XLM, check my previous post about common MLMs.
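
The sketch below probes the partial language independence of mBERT representations mentioned above: it mean-pools the last hidden layer for an English sentence and its German translation and compares them with cosine similarity. It assumes the Hugging Face transformers library and PyTorch; the similarity value is only illustrative.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

def sentence_embedding(text):
    """Mean-pool mBERT's last hidden layer as a sentence representation."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

en = sentence_embedding("The weather is nice today.")
de = sentence_embedding("Das Wetter ist heute schön.")
print(torch.cosine_similarity(en, de, dim=0).item())
```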

Transfer learning techniques for Cross-lingual Learning

Four main categories for CLL:

  • Label transfer: Labels or annotations are transferred between corresponding $L_S$ and $L_T$ samples (a minimal annotation-projection sketch is given at the end of this section).
  • Feature transfer: Similar to label transfer, but sample features are transferred instead (transfer knowledge about the features of the sample).
  • Parameter transfer: Parameter values are transferred between parametric models. This effectively transfers the behaviour of the model.
  • Representation transfer: The expected values of hidden representations are transferred between models. The target model is taught to create the desired representations.

Note: Representation transfer is similar to feature transfer. However, instead of simply transferring the features, it teaches the $L_T$ model to produce the desired hidden representations itself.
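
As an illustration of label transfer, the sketch below projects token-level labels (e.g. NER tags) from an annotated source sentence onto its target translation through word alignments, as produced by the tools discussed earlier. The sentences, labels and alignment are toy examples.

```python
def project_labels(src_labels, alignment, tgt_len, default="O"):
    """Project token-level labels from source to target via word alignments.

    src_labels: list of labels for the source tokens
    alignment:  {source_index: [target_indices]} as parsed earlier
    tgt_len:    number of target tokens
    """
    tgt_labels = [default] * tgt_len
    for i, targets in alignment.items():
        for j in targets:
            tgt_labels[j] = src_labels[i]
    return tgt_labels

src = "Angela Merkel visited Paris".split()
tgt = "Angela Merkel besuchte Paris".split()
src_labels = ["B-PER", "I-PER", "O", "B-LOC"]
alignment = {0: [0], 1: [1], 2: [2], 3: [3]}

print(list(zip(tgt, project_labels(src_labels, alignment, len(tgt)))))
# [('Angela', 'B-PER'), ('Merkel', 'I-PER'), ('besuchte', 'O'), ('Paris', 'B-LOC')]
```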
