Cross-Lingual Learning: Paper Summary


Cross-lingual learning

Most languages do not have training data available to create state-of-the-art models and thus our ability to create intelligent systems for these languages is limited as well.

Cross-lingual learning (CLL) is one possible remedy for the lack of data in low-resource languages. In essence, it is an effort to utilize annotated data from other languages when building new NLP models. In the CLL setting, target languages usually lack resources, while source languages are resource-rich and can be used to improve results for the former.

Cross-lingual resources

The domain shift, i.e. the difference between source and target languages, is often quite severe. The languages might have different vocabularies, syntax or even alphabets. Various cross-lingual resources are often employed to address the gap between languages.

Here is a short overview of different resources that might be used.

Multilingual distributional representations

With multilingual word embeddings (MWE), words from multiple languages share one semantic vector space. In this space, semantically similar words lie close together regardless of the language they come from.

During training, MWEs usually require additional cross-lingual resources, e.g. bilingual dictionaries or parallel corpora. Multilingual sentence embeddings work on a similar principle, but with sentences instead of words: ideally, corresponding sentences should have similar representations.
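As a rough illustration, one common way to obtain such a shared space is to map a source embedding matrix onto a target one using word pairs from a bilingual dictionary. The sketch below uses the closed-form orthogonal Procrustes solution; it is only a minimal example, with random vectors standing in for real embeddings.

```python
# Minimal sketch: align two monolingual embedding spaces with a bilingual
# dictionary via the orthogonal Procrustes solution W = U V^T, where
# U, V come from the SVD of the cross-covariance matrix Y^T X.
import numpy as np

def procrustes_align(src_vecs, tgt_vecs):
    """src_vecs, tgt_vecs: (n, d) arrays for dictionary pairs
    (row i of src_vecs translates to row i of tgt_vecs)."""
    u, _, vt = np.linalg.svd(tgt_vecs.T @ src_vecs)  # cross-covariance SVD
    return u @ vt                                    # (d, d) orthogonal map

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 300))   # source-language word vectors
tgt = rng.normal(size=(1000, 300))   # target-language word vectors
W = procrustes_align(src, tgt)
mapped_src = src @ W.T               # source vectors mapped into target space
```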

Evaluation of multilingual distributional representations

Evaluation of multilingual distributional representations can be done either intrinsically or extrinsically.

  • With intrinsic evaluation, authors usually measure how well semantic similarity is reflected in the vector space, i.e. how far apart semantically similar words or sentences lie (see the sketch after this list).

  • Extrinsic evaluation, on the other hand, measures how useful the representations are for downstream tasks, i.e. they are judged by how well they perform in CLL.
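A common intrinsic benchmark is bilingual lexicon induction: for each source word in a held-out test dictionary, retrieve the nearest target word by cosine similarity and count how often it matches. The sketch below is a minimal illustration of that metric and assumes the embeddings are already mapped into a shared space.

```python
# Minimal sketch of intrinsic evaluation as bilingual lexicon induction:
# precision@1 of nearest-neighbour retrieval in the shared space.
import numpy as np

def precision_at_1(src_vecs, tgt_vecs, gold_pairs):
    """src_vecs: (n_s, d), tgt_vecs: (n_t, d), both in the shared space.
    gold_pairs: list of (src_index, tgt_index) test-dictionary entries."""
    # Normalize rows so a dot product equals cosine similarity.
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    hits = 0
    for src_i, tgt_i in gold_pairs:
        nearest = np.argmax(t @ s[src_i])   # closest target word index
        hits += int(nearest == tgt_i)
    return hits / len(gold_pairs)
```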

Parallel corpus

Parallel corpora are one of the most basic linguistic resources. In most cases, a sentence-aligned parallel corpus of two languages is used. Wikipedia is sometimes used as a comparable corpus, although due to its complex structure it can also serve as a multilingual knowledge base.

Parallel corpora are most often created for specific domains, such as politics, religion or movie subtitles. They also serve as training sets for machine translation systems and for creating multilingual distributional representations, which makes them even more important.
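In practice, a sentence-aligned corpus is often distributed as two line-aligned plain-text files, where line i of each file contains the same sentence in the two languages. The loader below is a minimal sketch; the file names are only placeholders.

```python
# Minimal sketch: read a sentence-aligned parallel corpus stored as two
# line-aligned text files into a list of (source, target) sentence pairs.
def load_parallel_corpus(path_a: str, path_b: str):
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return [(a.strip(), b.strip()) for a, b in zip(fa, fb)]

# e.g. pairs = load_parallel_corpus("corpus.en", "corpus.de")  # placeholder paths
```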

Word Alignments

In some cases, sentence alignment in parallel corpora might not be enough.

For word alignments, one word from a sentence in language $\ell_A$ can be aligned with any number of words from the corresponding sentence in language $\ell_B$. In most cases, automatic tools are used to perform word alignment over existing parallel sentences. Machine translation systems can also often provide word-alignment information for their generated sentences.
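Word alignments are typically represented as a set of (source index, target index) pairs per sentence pair. The toy example below (with an invented alignment) shows how one source word can align to several target words while another aligns to none.

```python
# Toy sketch of a word-alignment data structure: a set of index pairs.
sent_a = "I do not like it".split()     # sentence in language ℓ_A
sent_b = "Je ne l' aime pas".split()    # sentence in language ℓ_B

# "not" aligns to both "ne" and "pas"; "do" has no counterpart (null alignment).
alignment = {(0, 0), (2, 1), (2, 4), (3, 3), (4, 2)}

for i, j in sorted(alignment):
    print(f"{sent_a[i]} <-> {sent_b[j]}")
```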

Machine Translation

Machine translation (MT) can be used instead of parallel corpora to generate parallel sentences. A parallel corpus generated by MT is called a pseudo-parallel corpus. Although MT has improved greatly in recent years thanks to neural encoder-decoder models, it is still far from providing perfect translations. MT models are themselves usually trained on parallel corpora.

By using samples generated by MT systems we inevitably inject noise into our models: the domain shift between a language $\ell_A$ and what MT systems generate as language $\ell_A$ needs to be addressed.
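As a rough sketch of how a pseudo-parallel training set is built, one can machine-translate labelled source-language sentences and copy sentence-level labels over. The translate() function below is a hypothetical placeholder for any MT system.

```python
# Minimal sketch: build a pseudo-parallel training set by translating
# labelled source-language sentences and reusing their labels.
def translate(sentence: str, src_lang: str, tgt_lang: str) -> str:
    """Hypothetical placeholder for any MT system (e.g. a neural
    encoder-decoder model)."""
    raise NotImplementedError

def build_pseudo_parallel(labelled_src, src_lang="en", tgt_lang="de"):
    """labelled_src: list of (sentence, label) pairs in the source language."""
    pseudo = []
    for sentence, label in labelled_src:
        # MT output is noisy; the resulting corpus only approximates real
        # target-language text, which is the domain shift discussed above.
        pseudo.append((translate(sentence, src_lang, tgt_lang), label))
    return pseudo
```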

Universal features (out of fashion)

Universal features are inherently language-independent to some extent, e.g. emojis or punctuation. They can be used as features for any language, so a model trained on such universal features should be easily applicable to other languages.

The process of creating language-independent features for words is called delexicalization. In delexicalized text, words are replaced with universal features, such as POS tags. We lose the lexical information of the words in this process, hence the name delexicalization.
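A toy sketch of delexicalization is shown below; the hand-written POS lookup is purely illustrative, and in practice a real POS tagger would supply the tags.

```python
# Toy sketch: replace words with universal POS tags (delexicalization).
TOY_POS = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}

def delexicalize(tokens):
    # Unknown words fall back to a generic tag.
    return [TOY_POS.get(tok.lower(), "X") for tok in tokens]

print(delexicalize("The cat sat on the mat".split()))
# ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']
```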

Bilingual dictionary

Bilingual dictionaries are the most widely available cross-lingual resource in this list. They exist for many language pairs and provide a very easy and natural way of connecting words from different languages. However, they are often incomplete and context-insensitive.

Pre-trained multilingual language models

Pre-trained language models are a state-of-the-art NLP technique. A large amount of text data is used to train a high-capacity language model, whose parameters are then used to initialize training on other NLP tasks and fine-tuned with the target-task data. This is a form of transfer learning, where language modeling is the source task. The best-known pre-trained language model is BERT.

Multilingual language models (MLMs) are an extension of this concept. A single language model is trained with multiple languages at the same time.

  • This can be done without any cross-lingual supervision, i.e. we feed the model text from multiple languages and do not provide it with any explicit cross-lingual signal, such as parallel data or bilingual dictionaries (see the sketch after this list).
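As a minimal sketch of this transfer setup (the toy data and two-label task are only for illustration), one might fine-tune multilingual BERT on a few labelled English sentences with the Hugging Face transformers library and then apply the model directly to another language.

```python
# Minimal sketch: fine-tune a multilingual LM on source-language labels,
# then run zero-shot prediction on a target-language sentence.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# One fine-tuning step on toy English training examples (real training
# would of course use a full dataset and many steps).
texts, labels = ["great movie", "terrible movie"], torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()

# Zero-shot prediction on a target-language (German) sentence.
model.eval()
with torch.no_grad():
    de_batch = tokenizer(["ein großartiger Film"], return_tensors="pt")
    pred = model(**de_batch).logits.argmax(dim=-1)
```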

