We present Unicoder, a universal language encoder that is insensitive to different languages. Given an arbitrary NLP task, a model can be trained with Unicoder using training data in one language and directly applied to inputs of the same task in other languages.
What "universal" means here:
- insensitive to the input language
- after training on one language, the model can be applied directly to other languages (a minimal workflow sketch follows below)
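The sentence above describes a zero-shot cross-lingual setup: fine-tune once on labeled data in a single language, then run the same model on other languages with no further training. The PyTorch sketch below only illustrates that workflow under assumed names (`encoder`, the data loaders, hyperparameters); it is not the authors' code.

```python
import torch
import torch.nn as nn

class CrossLingualClassifier(nn.Module):
    """A shared multilingual encoder with a task-specific classification head."""
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder  # assumed: returns one pooled vector per example
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        return self.head(self.encoder(input_ids))

def train_on_one_language(model, batches, epochs=3, lr=2e-5):
    """Fine-tune on labeled data from a single language (e.g. English)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for input_ids, labels in batches:
            opt.zero_grad()
            loss_fn(model(input_ids), labels).backward()
            opt.step()

@torch.no_grad()
def evaluate(model, batches):
    """Accuracy on a test set; applied to languages never seen during fine-tuning."""
    model.eval()
    correct = total = 0
    for input_ids, labels in batches:
        preds = model(input_ids).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical usage: fine-tune on English only, then evaluate zero-shot on French.
# train_on_one_language(model, english_train_batches)
# print(evaluate(model, french_test_batches))
```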
.
Compared to similar efforts such as Multilingual BERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019), three new cross-lingual pre-training tasks are proposed, including cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model. These tasks help Unicoder learn the mappings among different languages from more perspectives.
- Building on Multilingual BERT and XLM, the paper proposes three new pre-training tasks (a sketch of one of them follows below).
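As an illustration of the cross-lingual paraphrase classification task, the sketch below builds binary training examples from a parallel corpus: aligned translation pairs are positives and randomly mismatched pairs are negatives. The corpus format and the negative-sampling scheme are assumptions for illustration, not details taken from the paper.

```python
import random

def build_paraphrase_examples(parallel_corpus, seed=0):
    """parallel_corpus: list of (source_sentence, target_sentence) translation pairs."""
    random.seed(seed)
    examples = []
    for i, (src, tgt) in enumerate(parallel_corpus):
        examples.append((src, tgt, 1))  # true translation pair -> positive
        # pair the source with the translation of a different sentence -> negative
        j = random.choice([k for k in range(len(parallel_corpus)) if k != i])
        examples.append((src, parallel_corpus[j][1], 0))
    return examples

if __name__ == "__main__":
    corpus = [
        ("the cat sleeps", "le chat dort"),
        ("the dog runs", "le chien court"),
        ("it is raining", "il pleut"),
    ]
    for src, tgt, label in build_paraphrase_examples(corpus):
        print(label, "|", src, "|", tgt)
```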
.
Experiments are performed on two tasks: cross-lingual natural language inference (XNLI) and cross-lingual question answering (XQA), where XLM is our baseline.
Two evaluation tasks: cross-lingual natural language inference and cross-lingual question answering.
Introduction
Multilingual BERT trains a BERT model based on multilingual Wikipedia,
which covers 104 languages. As its vocabulary contains tokens from all languages, Multilingual BERT can be applied to cross-lingual tasks directly.
A brief overview of Multilingual BERT (see the tokenizer sketch below):
- covers 104 languages
- uses a shared vocabulary across all languages
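To make the shared-vocabulary point concrete, the sketch below tokenizes an English and a Chinese sentence with the same publicly released multilingual tokenizer (`bert-base-multilingual-cased`, loaded through the Hugging Face `transformers` library; this tooling choice is an assumption and requires downloading the checkpoint). Both sentences map into one token id space, which is what lets a single model consume inputs from any of the 104 languages.

```python
from transformers import AutoTokenizer

# One tokenizer, one vocabulary, many languages.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

english = "Multilingual BERT shares one vocabulary across languages."
chinese = "多语言BERT在不同语言之间共享一个词表。"

for text in (english, chinese):
    ids = tokenizer(text)["input_ids"]
    # Both sentences are encoded with ids from the same shared vocabulary.
    print(tokenizer.convert_ids_to_tokens(ids))
```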