We present Unicoder, a universal language encoder that is insensitive to different languages. Given an arbitrary NLP task, a model can be trained with Unicoder using training data in one language and directly applied to inputs of the same task in other languages.
What "universal" means here:
- insensitive to the input language
- after training on one language, the model can be applied directly to other languages (a minimal workflow sketch follows below)
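The sentence above describes a zero-shot cross-lingual setup: fine-tune once on labeled data in a single language, then run the same model on other languages with no further training. The PyTorch sketch below only illustrates that workflow under assumed names (`encoder`, the data loaders, hyperparameters); it is not the authors' code.

```python
import torch
import torch.nn as nn

class CrossLingualClassifier(nn.Module):
    """A shared multilingual encoder with a task-specific classification head."""
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder  # assumed: returns one pooled vector per example
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        return self.head(self.encoder(input_ids))

def train_on_one_language(model, batches, epochs=3, lr=2e-5):
    """Fine-tune on labeled data from a single language (e.g. English)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for input_ids, labels in batches:
            opt.zero_grad()
            loss_fn(model(input_ids), labels).backward()
            opt.step()

@torch.no_grad()
def evaluate(model, batches):
    """Accuracy on a test set; applied to languages never seen during fine-tuning."""
    model.eval()
    correct = total = 0
    for input_ids, labels in batches:
        preds = model(input_ids).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical usage: fine-tune on English only, then evaluate zero-shot on French.
# train_on_one_language(model, english_train_batches)
# print(evaluate(model, french_test_batches))
```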
.
Compared to similar efforts such as Multilingual BERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019), three new cross-lingual pre-training tasks are proposed, including cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model. These tasks help Unicoder learn the mappings among different languages from more perspectives.
- Building on Multilingual BERT and XLM, the paper proposes three new pre-training tasks (a sketch of one of them follows below).
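As an illustration of the cross-lingual paraphrase classification task, the sketch below builds binary training examples from a parallel corpus: aligned translation pairs are positives and randomly mismatched pairs are negatives. The corpus format and the negative-sampling scheme are assumptions for illustration, not details taken from the paper.

```python
import random

def build_paraphrase_examples(parallel_corpus, seed=0):
    """parallel_corpus: list of (source_sentence, target_sentence) translation pairs."""
    random.seed(seed)
    examples = []
    for i, (src, tgt) in enumerate(parallel_corpus):
        examples.append((src, tgt, 1))  # true translation pair -> positive
        # pair the source with the translation of a different sentence -> negative
        j = random.choice([k for k in range(len(parallel_corpus)) if k != i])
        examples.append((src, parallel_corpus[j][1], 0))
    return examples

if __name__ == "__main__":
    corpus = [
        ("the cat sleeps", "le chat dort"),
        ("the dog runs", "le chien court"),
        ("it is raining", "il pleut"),
    ]
    for src, tgt, label in build_paraphrase_examples(corpus):
        print(label, "|", src, "|", tgt)
```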
.
Experiments are performed on two tasks: cross-lingual natural language inference (XNLI) and cross-lingual question answering (XQA), where XLM is our baseline.
Two evaluation tasks: cross-lingual natural language inference and cross-lingual question answering.
Introduction
Multilingual BERT trains a BERT model based on multilingual Wikipedia,
which covers 104 languages. As its vocabulary contains tokens from all languages, Multilingual BERT can be applied to cross-lingual tasks directly.
A brief overview of Multilingual BERT (see the tokenizer sketch below):
- covers 104 languages
- uses a shared vocabulary across all languages
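To make the shared-vocabulary point concrete, the sketch below tokenizes an English and a Chinese sentence with the same publicly released multilingual tokenizer (`bert-base-multilingual-cased`, loaded through the Hugging Face `transformers` library; this tooling choice is an assumption and requires downloading the checkpoint). Both sentences map into one token id space, which is what lets a single model consume inputs from any of the 104 languages.

```python
from transformers import AutoTokenizer

# One tokenizer, one vocabulary, many languages.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

english = "Multilingual BERT shares one vocabulary across languages."
chinese = "多语言BERT在不同语言之间共享一个词表。"

for text in (english, chinese):
    ids = tokenizer(text)["input_ids"]
    # Both sentences are encoded with ids from the same shared vocabulary.
    print(tokenizer.convert_ids_to_tokens(ids))
```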