Tokenizers: NLP's Building Block

The truth is, tokenizers are not that interesting. When I first read the BERT paper, I raced past the WordPiece tokenization section because it wasn't as exciting as the rest of the paper. But tokenization has evolved from word-level to subword tokenization, and different transformers use different tokenizers, which are quite a handful to understand.


There are already good articles that discuss and explain tokenizers; the ones I like the most are a detailed blog post by FloydHub and a short tutorial by Hugging Face.


Instead, I want to focus on application — specifically how tokenizers of different models behave out of the box, and how that affects our models’ ability to comprehend. If you start your NLP task with a pre-trained Transformer model (which usually makes more sense than training from scratch), you are stuck with the model’s pre-trained tokenizer and its vocabulary — knowing its behaviour and quirks allows you to choose the best model and debug issues more easily.


But first, some basics to get us started so this piece can be read independently. Feel free to skip this section if you already know the fundamentals of tokenizing.


Fundamentals

Subword tokenization: In an ideal world with infinite memory and compute, we would keep every word in our vocabulary and give each one its own slot. Sadly that's not the case, so we need a fixed vocabulary, usually around 30-50k slots. The need to limit vocabulary size means there will almost certainly be words that are not 'important' enough to be included, i.e. "out-of-vocabulary" or OOV words. This results in the dreaded <UNK> (unknown) token, which gets assigned to every unknown word, and as a result the model will have a hard time understanding its semantics. But with subword tokenization, we can tokenize uncommon words using more frequent subwords and get the best of both worlds: a smaller vocabulary that can still tokenize rare or misspelt words.

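To make this concrete, here is a minimal sketch using the Hugging Face transformers library (an example checkpoint, assuming the package is installed) showing WordPiece splitting a rarer word into more frequent subwords:

```python
from transformers import AutoTokenizer

# Load BERT's pre-trained WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A common word keeps its own slot in the vocabulary...
print(tokenizer.tokenize("interesting"))
# ...while a rarer or misspelt word is broken into subword pieces
# (the exact splits depend on the vocabulary, e.g. ['miss', '##pel', '##t'])
print(tokenizer.tokenize("misspelt"))
```

Either way, nothing has to fall back to <UNK>, which is the whole point of a subword vocabulary.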

Vocabulary Building: Earlier I mentioned that only important words are included in the vocabulary. So how is that 'importance' determined? We start off with the base characters, then build up the vocabulary by merging characters into subwords until we reach the maximum vocabulary size. The major tokenization methods differ in which subwords they consider first (i.e. the order in which subwords are merged) and in how the merge decision is made. The graphic below, from FloydHub's blog, illustrates the differences between three major subword tokenization methods.

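As an illustration only (a toy sketch, not how any particular library implements it), a BPE-style vocabulary builder repeatedly counts adjacent symbol pairs across the corpus and merges the most frequent pair into a new subword:

```python
from collections import Counter

# Toy corpus: each word as a tuple of symbols, mapped to its frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Rewrite every word with the chosen pair fused into one symbol
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):                  # a few merge steps
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair)
```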

[Figure: differences between three major subword tokenization methods. Source: https://blog.floydhub.com/tokenization-nlp/]

Embeddings: The input tokens are represented by an embedding layer, which maps each token to a multi-dimensional vector. By passing these embeddings through transformer blocks, the model gains a contextual understanding of the input, i.e. the embedding of a token depends on the other words in the sequence. The more layers there are, the more context-specific the representations become [1].

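A minimal sketch of that idea, assuming transformers and torch are installed (the checkpoint name is just an example): the same token gets a different representation depending on its context.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_vector(sentence, word):
    """Last-layer hidden state of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = token_vector("i deposited cash at the bank", "bank")
v2 = token_vector("we sat on the bank of the river", "bank")
# Same word, different contexts -> similar but not identical vectors
print(torch.cosine_similarity(v1, v2, dim=0).item())
```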

How Tokenizers View Emojis

To analyze any modern text, especially user-generated content like tweets or messages, our model should understand what emojis mean. Ideally, the model should be able to read emojis directly, with as little pre-processing as possible, to retain the original context of the sentence.


Armed with our initial understanding of how tokenizers work, we know the ability of the model to read emojis simply depends on whether the characters were added to the vocabulary of the model.


Loading 🤗's pre-trained tokenizers across the full gamut of pre-trained models, we see that only RoBERTa's vocabulary contains emojis: the secret sauce of its byte-level BPE allows it to tokenize any character and avoid the dreaded <UNK> token.

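A quick way to check this yourself is to run the same emoji-bearing text through a few pre-trained tokenizers and look for the unknown token (the checkpoint names below are the standard Hugging Face ones; this is a sketch of the check, not the exact code in the gist):

```python
from transformers import AutoTokenizer

text = "great earnings call 😀"
for name in ["bert-base-uncased", "albert-base-v2", "xlnet-base-cased", "roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    tokens = tok.convert_ids_to_tokens(tok.encode(text, add_special_tokens=False))
    fell_back = tok.unk_token is not None and tok.unk_token in tokens
    print(f"{name:18s} unk_fallback={fell_back} tokens={tokens}")
```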

[Figure: emoji coverage in the vocabularies of different pre-trained tokenizers. Source: https://gist.github.com/neoyipeng2018/cb6b5bc10624ae6fcb9b0e3c76cb01f0]

The consequence of this is big: even if you feed a pre-trained BERT more training data, it will never learn the difference between 😀 and 🤬, because the emojis are not in its vocabulary.


To see this in action, we look at how different models view polar sentences containing different emojis. Only RoBERTa's BPE tokenizer was able to differentiate between two sentences that differ only in their emojis, as seen in the two disparate dots, whereas BERT and ALBERT produce the same embedding projection for the polar sentences.


[Figure: embedding projections of the two polar emoji sentences under different models]
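
A rough sketch of that comparison (mean-pooling the last hidden states as a crude sentence embedding; checkpoints are examples): if the emoji collapses to the unknown token, the two polar sentences become literally the same input and their embeddings coincide.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def sentence_embedding(model_name, text):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)        # mean-pool over tokens

positive, negative = "I am 😀", "I am 🤬"
for name in ["bert-base-uncased", "roberta-base"]:
    a = sentence_embedding(name, positive)
    b = sentence_embedding(name, negative)
    sim = torch.cosine_similarity(a, b, dim=0).item()
    # A similarity of exactly 1.0 means the model cannot tell the sentences apart
    print(f"{name}: cosine similarity = {sim:.4f}")
```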

This example is contrived, as there are usually other words that help the model understand the sentence, but the point stands: if we want our models to truly discern the meaning of our emoji-laden tweets, we need to either pre-process the text and replace emojis with their corresponding meanings, or re-train the tokenizer/model from scratch if we are not using RoBERTa or GPT-2.

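For the pre-processing route, one option (an assumption on my part, not something the article prescribes) is the third-party emoji package, which replaces each emoji with a textual alias that a subword tokenizer can handle:

```python
import emoji  # third-party package: pip install emoji

text = "earnings beat estimates 😀🚀"
# Replace each emoji with a textual description; the exact aliases
# (e.g. ':grinning_face:') depend on the package version
print(emoji.demojize(text))
```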

How Tokenizers View Numbers

When applying NLP to finance, a key consideration is how our models view numbers — and that is also affected by the tokenizer/model itself.


If we visualise some sample sentences' hidden states, we see that BERT's WordPiece and RoBERTa's BPE are much less sensitive to numbers than XLNet's SentencePiece, suggesting that models with SentencePiece tokenizers are better suited for documents with lots of numbers.

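The hidden-state visualisation itself lives in the gist, but you can already get a feel for the difference by looking at how each tokenizer splits a number-heavy sentence (a small sketch with example checkpoints):

```python
from transformers import AutoTokenizer

text = "Revenue rose 12.5% to $3,450 million in Q3 2020"
for name in ["bert-base-uncased", "roberta-base", "xlnet-base-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:18s} -> {tok.tokenize(text)}")
```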

[Figures: hidden-state visualisations of sample sentences containing numbers. Source: https://gist.github.com/neoyipeng2018/f5cbbbf4d39404464122cd41ef2e4e6d]

For more experiments I did, you can check out my gist here.


Conclusion

What is the best transformer tokenizer? As with all messy things, it depends. It depends on your data, and as you do your EDA and visualise some sample text, examining the model's worst errors and checking whether any important words end up as <UNK> will help you understand what the pre-trained model does and does not grasp, and let you make a better choice.

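A small sketch of that sanity check, run over a handful of your own documents (sample texts and checkpoints below are placeholders):

```python
from transformers import AutoTokenizer

def unk_rate(model_name, texts):
    """Fraction of tokens that fall back to the unknown token."""
    tok = AutoTokenizer.from_pretrained(model_name)
    total = unks = 0
    for text in texts:
        ids = tok.encode(text, add_special_tokens=False)
        total += len(ids)
        unks += sum(1 for i in ids if i == tok.unk_token_id)
    return unks / max(total, 1)

sample = ["Q3 EBITDA beat estimates 🚀", "naïve résumé screening", "plain old text"]
for name in ["bert-base-uncased", "xlnet-base-cased", "roberta-base"]:
    print(f"{name:18s} unk rate: {unk_rate(name, sample):.2%}")
```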

Sources

[1] How Contextual are Contextual Embeddings? by Kawin Ethayarajh


Translated from: https://towardsdatascience.com/tokenizers-nlps-building-block-9ab17d3e6929
