A Detailed Introduction to Common Multilingual Models (M-BERT, LASER, MultiFiT, XLM)

This post gives a detailed introduction to four common multilingual models: M-BERT, LASER, MultiFiT, and XLM. M-BERT is trained with a shared vocabulary but has limitations in cross-lingual understanding. LASER provides universal cross-lingual sentence embeddings suitable for zero-shot cross-lingual inference. MultiFiT addresses vocabulary coverage through an efficient multilingual fine-tuning method. XLM uses a translation language modeling objective for cross-lingual learning and performs strongly.

Links to previous posts

Multilingual models are machine learning models that can understand multiple languages. In this post, I'm going to discuss four common multilingual language models: Multilingual BERT (M-BERT), Language-Agnostic SEntence Representations (LASER), Efficient multi-lingual language model fine-tuning (MultiFiT), and the Cross-lingual Language Model (XLM).

Ways of tokenization

Word-based tokenization

Word-based tokenization works well for morphologically poor languages such as English, but results in very large and sparse vocabularies for morphologically rich languages such as Polish and Turkish. Some languages, such as Chinese, do not really have the concept of a "word" at all, and therefore require heuristic segmentation approaches, which tend to be complicated, slow, and inaccurate.

Character-based tokenization

Character-based models use individual characters as tokens. While in this case the vocabulary (and thus the number of parameters) can be small, such models require modelling longer-term dependencies and can thus be harder to train and less expressive than word-based models.

Subword tokenization

Subword tokenization strikes a balance between the two approaches above by using a mixture of character, subword and word tokens, depending on how common they are.

Subword tokenization has two very desirable properties for multilingual language modelling:

  • Subwords more easily represent inflections (the change in the form of a word), including common prefixes and suffixes and are thus well-suited for morphologically rich languages.

  • Subword tokenization is a good fit for open-vocabulary problems and eliminates out-of-vocabulary tokens, as token coverage is close to 100% (see the sketch after this list).
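To make the open-vocabulary property concrete, here is a minimal sketch of WordPiece-style greedy longest-match tokenization. The vocabulary and example words are made up for illustration; a real subword vocabulary is learned from a large corpus.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword tokenization (WordPiece-style sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:                 # non-word-initial pieces get the ## prefix
                piece = "##" + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:                   # nothing matches -> the whole word is OOV
            return [unk]
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary of stems and affixes (illustrative only).
vocab = {"un", "##play", "##able", "play", "##ing", "##ed"}
print(wordpiece_tokenize("unplayable", vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("playing", vocab))     # ['play', '##ing']
```

Even though "unplayable" never appears in the toy vocabulary, it is still covered by known pieces rather than becoming an out-of-vocabulary token.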

Existing approaches for cross-lingual NLP

  • Parallel data across languages — that is, a corpus of documents with exactly the same contents, but written in different languages. This is very hard to acquire in a general setting.

  • A shared vocabulary — that is, a vocabulary that is common across multiple languages. This approach over-represents languages with a lot of data (e.g., Multi-lingual BERT, which I’ll discuss in this post).

Out-of-vocabulary (OOV) problem in mono/multi-lingual settings

It has been shown that performance on many NLP tasks drops dramatically on held-out data when a significant percentage of words do not appear in the training data, i.e., out-of-vocabulary (OOV) words. OOV problems have been addressed in previous work under monolingual settings by replacing OOV words with semantically similar in-vocabulary words, or by using character/word information or subword information such as byte pair encoding (BPE).
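As a rough illustration of how BPE builds such a subword vocabulary, the toy sketch below learns merge rules by repeatedly fusing the most frequent adjacent symbol pair. The corpus and the number of merges are arbitrary, and real implementations work over much larger corpora.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges=10):
    """Minimal BPE sketch: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    corpus = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the current corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

# Frequent sequences (e.g. the shared suffix of "lowest", "newest", "widest")
# tend to be merged into single subword symbols.
print(learn_bpe_merges(["low", "lower", "lowest", "newest", "widest"], num_merges=6))
```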

All these monolingual pre-trained models (e.g., BERT, GPT) rely on language modeling, where a common trick is to tie the weights of the softmax layer and the word embeddings. However, in a multilingual setting, due to the expensive computation of the softmax and the data imbalance across languages, the vocabulary size for each language in a multilingual model is relatively small compared to that of monolingual BERT models, especially for low-resource languages. Even for a high-resource language like Chinese, its vocabulary size of about 10k in multilingual BERT is only half that of the Chinese BERT. Just as in monolingual settings, the OOV problem also hinders the performance of a multilingual model on tasks that are sensitive to token-level or sentence-level information.
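The PyTorch sketch below shows what tying the softmax and embedding weights looks like in practice. The class and sizes are illustrative, not M-BERT's actual implementation, but they make clear why the parameter count and softmax cost grow linearly with vocabulary size, which constrains the per-language vocabulary budget.

```python
import torch
import torch.nn as nn

class TiedSoftmaxHead(nn.Module):
    """Minimal sketch of embedding/softmax weight tying (sizes are illustrative)."""

    def __init__(self, vocab_size: int = 119_547, hidden: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size, bias=False)
        # Tying: the output projection reuses the embedding matrix, so both the
        # parameter count and the softmax cost scale with vocab_size.
        self.out.weight = self.embed.weight

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) from a transformer body (not shown).
        return self.out(hidden_states)  # logits over the shared vocabulary

logits = TiedSoftmaxHead()(torch.zeros(2, 5, 768))
print(logits.shape)  # torch.Size([2, 5, 119547])
```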

M-BERT (Multi-lingual BERT)

Multilingual BERT is pre-trained in the same way as monolingual BERT, but instead of being trained only on monolingual English data with an English-derived vocabulary, it is trained on the Wikipedia pages of 104 languages with a shared WordPiece vocabulary. The vocabulary contains 119,547 WordPiece tokens, and the input is tokenized into word pieces (also known as subwords) so that each word piece is an element of the dictionary. Non-word-initial units are prefixed with ## as a continuation symbol, except for Chinese characters, which are surrounded by spaces before any tokenization takes place.
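Assuming the Hugging Face `transformers` package is available, the published `bert-base-multilingual-cased` checkpoint can be used to inspect this shared vocabulary and the ## continuation convention; this is a quick illustration rather than part of the original paper.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.vocab_size)  # 119547 shared WordPiece entries

# Non-word-initial pieces carry the '##' continuation prefix.
print(tokenizer.tokenize("tokenization"))
# Chinese text is split into individual characters before WordPiece is applied.
print(tokenizer.tokenize("多语言模型"))
```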

To account for the differences in the size of Wikipedia across languages, some languages are sub-sampled and some are super-sampled using exponential smoothing of the sampling distribution: each language's share of the data is raised to a power smaller than one, which down-weights high-resource languages and up-weights low-resource ones.
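A minimal sketch of this kind of exponentially smoothed sampling is shown below. The exponent and the corpus sizes are illustrative values chosen for the example, not the exact numbers used for M-BERT.

```python
import numpy as np

def smoothed_sampling_probs(corpus_sizes, s=0.7):
    """Exponentially smoothed language sampling: p_i proportional to (n_i / sum n_j) ** s.
    An exponent s < 1 shrinks the share of high-resource languages and grows
    the share of low-resource ones. (s = 0.7 is an illustrative value.)"""
    p = np.asarray(corpus_sizes, dtype=float)
    p = p / p.sum()
    p = p ** s
    return p / p.sum()

# Made-up relative corpus sizes for three languages.
sizes = {"en": 2500.0, "tr": 60.0, "sw": 5.0}
probs = smoothed_sampling_probs(list(sizes.values()))
print(dict(zip(sizes, probs.round(3))))  # English's share shrinks, Swahili's grows
```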

It does not use any marker denoting the input language, and does not have any explicit mechanism to encourage translation equivalent pairs to have similar representations.

Why multilingual BERT works

Definitions:

  • Word-piece overlap: texts from different languages share some common word-piece vocabulary (such as numbers and URLs, and also actual words when the languages use the same script).

  • Structural similarity: the structure of a language is defined as every property of the language that is invariant to its script (e.g., morphology, word ordering, and word frequency are all part of a language's structure).

In the paper Cross-Lingual Ability of Multilingual BERT, the authors provide a comprehensive study of the contribution of different components of M-BERT to its cross-lingual ability.

The most notable finding is that neither word-piece overlap nor multi-head attention is a significant factor, whereas structural similarity and the depth of the model are crucial to its cross-lingual ability.

Note:

  • Previous work hypothesizes that M-BERT generalizes across languages because the shared word-pieces force the remaining word-pieces to be mapped into the same shared space. The paper shows that the contribution of word-piece overlap is in fact minor, contradicting this hypothesis.
