Demystifying Text Classification: An Introduction to Word Embeddings


Natural language processing (NLP) is an old science that started in the 1950s. The Georgetown-IBM experiment in 1954 was a big step towards fully automated text translation: more than 60 Russian sentences were translated into English using simple reordering and replacement rules.


The statistical revolution in NLP started in the late 1980s. Instead of hand-crafting a set of rules, a large corpus of text was analyzed to create rules using statistical approaches. Different metrics were calculated for the given input data, and predictions were made using decision trees or regression-based calculations.


Today, complex metrics have been replaced by more holistic approaches that produce better results and are easier to maintain.


This post is about word embeddings, which is the first part of my machine learning for coders series (with more to follow!).


What are word embeddings?

Traditionally, in natural language processing (NLP), words were replaced with unique IDs to do calculations. Let’s take the following example:

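A minimal, purely illustrative sketch of such an ID mapping (the words and IDs here are made up):

# Illustrative only: every word in the vocabulary gets a unique ID.
word_to_id = {
    'this': 0,
    'is': 1,
    'an': 2,
    'example': 3,
}

sentence = 'this is an example'
ids = [word_to_id[word] for word in sentence.split()]
print(ids)  # [0, 1, 2, 3]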

This approach has the disadvantage that you need to create a huge list of words and give each element a unique ID. Instead of using unique numbers for your calculations, you can also use vectors that represent their meaning, so-called word embeddings:

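A sketch of what such an embedding table could look like; only the vector for 'example' matches the description below, the other values are made up:

# Illustrative only: the same words, now mapped to small vectors.
word_to_vector = {
    'this':    [1, 5, 2],
    'is':      [3, 1, 1],
    'an':      [2, 2, 4],
    'example': [4, 2, 6],  # 1st dimension = 4, 2nd = 2, 3rd = 6
}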

In this example, each word is represented by a vector. The length of a vector can be different. The bigger the vector is, the more context information it can store. Additionally, the calculation costs go up as vector size increases.


The element count of a vector is also called the number of vector dimensions. In the example above, the word “example” is expressed with (4, 2, 6), whereby 4 is the value of the first dimension, 2 the value of the second, and 6 the value of the third.


In more complex examples, there might be more than 100 dimensions that can encode a lot of information. Things like:

  • gender,
  • race,
  • age,
  • type of word

will be stored.

A word such as “one” expresses a quantity, just like “many”. Therefore, their vectors are closer to each other than to the vectors of words that are used very differently.


Put simply, if two vectors are similar, then the corresponding words are used similarly. For other NLP tasks this has a lot of advantages, because calculations can be made on a single vector with only a few hundred parameters instead of on a huge dictionary with hundreds of thousands of IDs.

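As a rough sketch of what “similar vectors” means in practice, a common way to compare two embeddings is cosine similarity (the vectors below are made up):

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, values near 0 mean they are unrelated.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings: 'one' and 'many' both express a quantity,
# so their vectors point in a similar direction.
one = [4.0, 2.0, 6.0]
many = [3.5, 2.5, 5.5]
house = [-2.0, 6.0, 0.5]

print(cosine_similarity(one, many))   # close to 1.0
print(cosine_similarity(one, house))  # noticeably smaller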

Additionally, unknown words that were never seen before are not a problem: you just need a good embedding for the new word, and the calculations stay the same. The same applies to other languages. This is basically the magic of word embeddings that enables things like fast learning, multi-language processing, and much more.


Creation of word embeddings

It’s very popular to extend the concept of word embeddings to other domains. For example, a movie rental platform can create movie embeddings and do calculations upon vectors instead of movie IDs.


But how do you create such embeddings?

There are various techniques out there, but all of them follow the key idea that the meaning of a word is defined by its usage.


Let’s say we have a set of sentences:


text_for_training = [
    'he is a king',
    'she is a queen',
    'he is a man',
    'she is a woman',
    'she is a daughter',
    'he is a son'
]

The sentences contain 10 unique words, and we want to create a word embedding for each word.


{
    0: 'he',
    1: 'a',
    2: 'is',
    3: 'daughter',
    4: 'man',
    5: 'woman',
    6: 'king',
    7: 'she',
    8: 'son',
    9: 'queen'
}

There are various approaches for creating embeddings from them. Let's pick one of the most used approaches, called word2vec. The concept behind this technique uses a very simple neural network to create vectors that represent the meanings of words.

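If you just want ready-made embeddings, libraries such as gensim ship a word2vec implementation. A minimal sketch, assuming gensim 4.x (where the embedding size parameter is called vector_size) and reusing the text_for_training list from above:

from gensim.models import Word2Vec  # assumes gensim 4.x

tokenized = [sentence.split() for sentence in text_for_training]

model = Word2Vec(
    sentences=tokenized,
    vector_size=5,   # size of each word embedding
    window=3,        # how many neighbouring words count as context
    min_count=1,     # keep even words that occur only once
    sg=1,            # 1 = skip-gram: predict context words from a target word
)

print(model.wv['king'])               # the 5-dimensional embedding of 'king'
print(model.wv.most_similar('king'))  # words whose vectors are closest to 'king'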

Let's start with the target word “king”. It is used within the context of the masculine pronoun “he”. Context in this example simply means being part of the same sentence. The same applies to “queen” and “she”. It also makes sense to apply the same approach to more generic words: the word “he” can be the target word and “is” the context word.


If we do this for every combination, we can actually get simple word embeddings. More holistic approaches add more complexity and calculations, but they are all based on this approach.

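A short sketch of how such (context, target) pairs could be collected from the text_for_training sentences above, treating every other word in the same sentence as context:

# Within each sentence, every word acts as context for every other word.
pairs = []
for sentence in text_for_training:
    words = sentence.split()
    for target in words:
        for context in words:
            if context != target:
                pairs.append((context, target))

print(len(pairs))  # 6 sentences * 4 words * 3 context words = 72 pairs
print(pairs[:3])   # [('is', 'he'), ('a', 'he'), ('king', 'he')]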

To use a word as an input for a neural network, we need a vector. We can encode a word's unique ID as a vector by putting a 1 at the position of the word in our dictionary and keeping every other index at 0. This is called a one-hot encoded vector:

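A minimal sketch of one-hot encoding, using the word IDs from the dictionary above:

import numpy as np

word_to_id = {
    'he': 0, 'a': 1, 'is': 2, 'daughter': 3, 'man': 4,
    'woman': 5, 'king': 6, 'she': 7, 'son': 8, 'queen': 9,
}

def one_hot(word, vocabulary_size=10):
    # A vector with a 1 at the word's index and 0 everywhere else.
    vector = np.zeros(vocabulary_size)
    vector[word_to_id[word]] = 1.0
    return vector

print(one_hot('he'))    # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot('king'))  # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]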

Between the input and the output is a single hidden layer. This layer contains as many elements as the word embedding should have. The more elements word embeddings have, the more information they can store.


You might think: then just make it very big. But we have to consider that an embedding has to be stored for every existing word, which quickly adds up to a considerable amount of data. Additionally, bigger embeddings mean a lot more calculations for neural networks that use embeddings.


In our example, we will just use 5 as an embedding vector size.


The magic of neural networks lies in what's in between the layers, called weights. They store information between layers, where each node of the previous layer is connected with each node of the next layer.


Each connection between the layers is a so-called parameter. These parameters contain the important information of the neural network. The 100 parameters (50 between the input and hidden layer, and 50 between the hidden and output layer) are initialized with random values and adjusted by training the model.


In this example, all of them are initialized with 0.1 to keep it simple. Let’s think through an example training round, also called an epoch:

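To make that concrete, here is a rough numpy sketch of a single forward pass with every weight set to 0.1, feeding the context word “he” and hoping for the target “king” (a simplified illustration of the description above, not a full word2vec implementation):

import numpy as np

word_to_id = {
    'he': 0, 'a': 1, 'is': 2, 'daughter': 3, 'man': 4,
    'woman': 5, 'king': 6, 'she': 7, 'son': 8, 'queen': 9,
}

def one_hot(word, vocabulary_size=10):
    vector = np.zeros(vocabulary_size)
    vector[word_to_id[word]] = 1.0
    return vector

def softmax(x):
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

vocabulary_size, embedding_size = 10, 5

# 50 weights between the input and hidden layer, 50 between the hidden and
# output layer, all initialized with 0.1 to keep the example simple.
W1 = np.full((vocabulary_size, embedding_size), 0.1)  # 10 x 5 = 50 parameters
W2 = np.full((embedding_size, vocabulary_size), 0.1)  # 5 x 10 = 50 parameters

context = one_hot('he')        # input: one-hot vector of the context word
hidden = context @ W1          # hidden layer: the (future) embedding of 'he'
output = softmax(hidden @ W2)  # predicted probability for each target word

print(output)  # all 0.1: with identical weights, 'king' is not singled out yet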

At the end of the network's calculation, we don't get the expected output, which would tell us that for the given context “he” the target is “king”.


This difference between the result and the expected result is called the error of the network. By finding better parameter values, we can adjust the neural network so that future context inputs produce the expected target output.


The contents of our layer connections change as we find better parameters that get us closer to the expected output vector. The error is minimized as soon as the network predicts correctly for the different target and context words. The weights between the input and hidden layer will then contain all our word embeddings.

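Putting it all together, here is a rough training sketch using plain gradient descent on a softmax output (a simplified illustration, not a full word2vec implementation). After a few hundred epochs, the rows of W1 can be used as word embeddings:

import numpy as np

text_for_training = [
    'he is a king', 'she is a queen', 'he is a man',
    'she is a woman', 'she is a daughter', 'he is a son',
]

words = sorted({w for s in text_for_training for w in s.split()})
word_to_id = {w: i for i, w in enumerate(words)}
vocab_size, embedding_size, learning_rate = len(words), 5, 0.05

# (context, target) pairs: every word in a sentence is context for every other word.
pairs = [(c, t) for s in text_for_training for t in s.split()
         for c in s.split() if c != t]

def one_hot(word):
    vector = np.zeros(vocab_size)
    vector[word_to_id[word]] = 1.0
    return vector

def softmax(x):
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(vocab_size, embedding_size))
W2 = rng.normal(scale=0.1, size=(embedding_size, vocab_size))

for epoch in range(300):
    for context, target in pairs:
        x = one_hot(context)
        hidden = x @ W1                    # forward pass through the hidden layer
        output = softmax(hidden @ W2)      # predicted probabilities over all words
        error = output - one_hot(target)   # difference to the expected output
        grad_hidden = error @ W2.T         # how the hidden layer output should change
        W2 -= learning_rate * np.outer(hidden, error)  # adjust the parameters
        W1 -= learning_rate * np.outer(x, grad_hidden)

embeddings = W1  # one 5-dimensional embedding per word
print(embeddings[word_to_id['king']])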

You can find the complete example with executable code here. You can create a copy and play with it if you press “Open in playground.”


If you are not familiar with notebooks, it’s pretty simple: it can be read from top to bottom, and you can click and edit the Python code directly.


By pressing SHIFT+Enter, you can execute code snippets. Just make sure to start at the top: click into the first snippet and press SHIFT+Enter, wait a bit, then press SHIFT+Enter on the next one, and so on.


Conclusion

In a nutshell, word embeddings are used to create neural networks in a more flexible way. They can be built using neural networks that have a certain task, such as the prediction of a target word for a given context word. The weights between the layers are parameters that are adjusted over time. Et voilà, there are your word embeddings.




I hope you enjoyed the article. If you like it and feel the need for a round of applause, follow me on Twitter.


I am a co-founder of our revolutionary journey platform called Explore The World. We are a young startup located in Dresden, Germany and will target the German market first. Reach out to me if you have feedback and questions about any topic.


Happy AI exploring :)




References

Translated from: https://www.freecodecamp.org/news/demystify-state-of-the-art-text-classification-word-embeddings/
