

NLP之Word2Vec【CBOW/Skip-Gram】:《Efficient Estimation of Word Representations in Vector Space向量空间中词表示的有效估计》翻译与解读

导读:本文提出两种模型架构,在学习单词分布表示上有显著优势。用于从大规模数据集学习高质量的单词分布表示。与传统神经网络模型相比,本文模型在计算成本和准确率上都有显著提升。作者通过测量任务来评估单词表示的质量,例如单词相似性任务。
>> 计算成本低、学习速度快:模型训练成本低,从16亿单词的数据集中学习出高质量的单词向量表示只需不到一天时间。
>> 模型准确率高:在测试集上能更好捕捉单词之间的句法和语义相似性。
>> 能从大量数据中学习到高质量的单词表示:实验结果表明,使用大量数据可以学习到高质量的单词向量表示,包含多种单词之间的复杂关系。

介绍了从包含数十亿个单词的大型数据集中学习高质量词向量的技术。与以往的模型相比,提出的模型架构在训练效率和准确性方面取得了显著的进展。通过使用分布式单词表示和简单的代数运算,可以获取具有多个相似度程度的单词向量,并且这些向量之间存在线性规律。通过设计全面的测试集,证明了该模型在句法和语义规律学习方面的高准确性。这些技术的进一步发展有望为未来的自然语言处理应用提供重要的基础。

比较了传统的feedforward模型和新的RNN模型在学习分布式单词表示时的计算复杂度和准确率,提出了降低计算复杂度的优化方法,并采用分布式训练来处理大规模数据。

本文提出2种新模型架构来学习分布式单词表示,并比较了这2种模型与传统feedforward和RNN模型,说明新模型的优势。总的来说,这2种新模型架构去除了隐藏层,计算效率更高。但依然能捕捉单词间的关系,学习出高质量的词向量

本文设计的测试集可以有效评估词向量的质量;提出的CBOW模型和Skip-gram模型能在低计算成本的情况下学习高质量的分布式单词表示。使用大规模数据并行训练可以进一步提高词向量的效果,最终领先于传统的feedforward和RNN模型。

本文学习到的词向量可以有效表达不同的语义和语法关系。虽然存在不足,但效果已经证明使用分布式单词表示有优势。进一步提高词向量质量和应用广度仍然有很多空间,可以开发更多新应用。

本文比较了不同模型学习词向量的效果。实验证明采用简单的CBOW和Skip-gram模型可以学习出高质量的高维词表示。计算效率更高。词向量已经在很多NLP任务上产生了良好结果。未来还有广阔的应用前景。

Efficient Estimation of Word Representations in Vector Space翻译与解读

地址

论文:https://arxiv.org/abs/1301.3781

时间

2013年1月16日

作者

Tomas Mikolov、Kai Chen、Greg Corrado、Jeffrey Dean(Google)

Abstract

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

本文提出了两种新颖的模型架构,用于从大规模数据集计算单词连续向量表示。这些表示的质量通过单词相似性任务进行衡量,并与基于不同类型神经网络的先前表现最好的技术进行了比较。我们观察到在更低的计算成本下准确性有很大提高,即从一个包含16亿个单词的数据集中学习高质量单词向量仅需不到一天的时间。此外,我们展示了这些向量在衡量句法和语义单词相似性的测试集上提供了最先进的性能。

1 Introduction

本文介绍了从包含数十亿个单词的大型数据集中学习高质量词向量的技术。与以往的模型相比,提出的模型架构在训练效率和准确性方面取得了显著的进展。通过使用分布式单词表示和简单的代数运算,可以获取具有多个相似度程度的单词向量,并且这些向量之间存在线性规律。通过设计全面的测试集,证明了该模型在句法和语义规律学习方面的高准确性。这些技术的进一步发展有望为未来的自然语言处理应用提供重要的基础。

>> 传统的NLP系统将单词视为原子单位,忽略了单词间的相似性。例如n元模型。

>> 但对于有限数据下的任务,如语音识别,简单的方法已经达到瓶颈。

>> 最近出现的分布式单词表示方式有效改善了这一点。如神经网络语言模型。

>> 本文目标是学习高质量的单词向量表示,以同时捕捉单词之间的句法相似性和语义相似性。

>> 通过新的模型架构,优化单词间线性关系,来提高准确率。

>> 结果表明,使用大量数据训练可以很好捕捉语法和语义之间的多样关系。

>> 训练时间和准确率与向量维度和训练数据量有关。

>> 过去也有很多研究使用连续向量表示单词,但本文提出的模型更简单高效。

Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. An example is the popular N-gram model used for statistical language modeling - today, it is possible to train N-grams on virtually all available data (trillions of words [3]).

However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.

With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data set, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].

许多当前的自然语言处理系统和技术将单词视为原子单位-没有考虑单词之间的相似性,因为它们被表示为词汇表中的索引。这种选择有几个好处-简单性、鲁棒性以及简单模型在大量数据上训练的表现优于在少量数据上训练的复杂系统。例如,统计语言建模中使用的流行的N-gram模型-如今,可以对几乎所有可用数据进行N-gram训练(万亿个单词[3])。

然而,对于许多任务来说,简单技术已经达到了极限。例如,自动语音识别中相关领域的数据量有限-性能通常受高质量转录语音数据的大小所限制(通常只有数百万个单词)。在机器翻译中,许多语言的现有语料库只包含几十亿个单词甚至更少。因此,在基本技术的简单扩展不会带来任何重大进展的情况下,我们必须关注更高级的技术。

随着机器学习技术在近年来的进展,可以在更大的数据集上训练更复杂的模型,而这些模型通常优于简单模型。其中最成功的概念可能是使用单词的分布式表示[10]。例如,基于神经网络的语言模型明显优于N-gram模型[1,27,17]。

1.1 论文的目标Goals of the Paper

The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred of millions of words, with a modest dimensionality of the word vectors between 50 - 100.

We use recently proposed techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity [20]. This has been observed earlier in the context of inflectional languages - for example, nouns can have multiple word endings, and if we search for similar words in a subspace of the original vector space, it is possible to find words that have similar endings [13, 14].

本文的主要目标是介绍可以用于从包含数十亿个单词、词汇量达数百万的大规模数据集中学习高质量单词向量的技术。据我们所知,此前提出的架构都没有在超过几亿个单词的数据上成功训练过,且所用单词向量的维度仅在50-100之间。

我们使用最近提出的技术来衡量生成的向量表示的质量,期望类似的单词不仅靠近彼此,还可以具有多个相似度程度[20]。这在屈折语言的语境中早已观察到-例如,名词可以有多个词尾,如果我们在原始向量空间的子空间中搜索相似单词,可以找到具有相似词尾的单词[13,14]。

Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen [20].

In this paper, we try to maximize accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words. We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy. Moreover, we discuss how training time and accuracy depends on the dimensionality of the word vectors and on the amount of the training data.

令人惊讶的是,发现单词表示的相似性超越了简单的句法规律。使用单词偏移技术,在单词向量上执行简单的代数运算,例如 vector("King") - vector("Man") + vector("Woman") 的结果向量最接近单词"Queen"的向量表示[20]。

在本文中,我们通过开发能够保留单词之间线性规律性的新模型架构,来最大化这些向量运算的准确率。我们设计了一个新的、全面的测试集,用于衡量句法和语义规律性,并展示了许多这样的规律性可以被高准确率地学习到。此外,我们还讨论了训练时间和准确率如何取决于单词向量的维度以及训练数据的数量。

1.2 先前工作Previous Work

Representation of words as continuous vectors has a long history [10, 26, 8]. A very popular model architecture for estimating neural network language model (NNLM) was proposed in [1], where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others.

Another interesting architecture of NNLM was presented in [13, 14], where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.

It was later shown that the word vectors can be used to significantly improve and simplify many NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different model architectures and trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word vectors were made available for future research and comparison. However, as far as we know, these architectures were significantly more computationally expensive for training than the one proposed in [13], with the exception of certain version of log-bilinear model where diagonal weight matrices are used [23].

将单词表示为连续向量的概念已经有很长的历史[10, 26, 8]。一个非常流行的模型架构用于估计神经网络语言模型NNLM)在[1]中提出,其中使用前馈神经网络、线性投影层和非线性隐藏层来共同学习单词向量表示和统计语言模型。这项工作之后受到了许多其他研究的关注。

另一个有趣的NNLM架构在[13, 14]中提出,其中单词向量首先使用具有单个隐藏层的神经网络进行学习。然后使用这些单词向量来训练NNLM。因此,即使没有构建完整的NNLM,也可以学习单词向量。在这项工作中,我们直接扩展了这个架构,只关注使用简单模型学习单词向量的第一步。

后来发现单词向量可以显著改进和简化许多NLP应用[4, 5, 29]。单词向量本身的估计是使用不同的模型架构和在各种语料库上训练的[4, 29, 23, 19, 9],并且一些生成的单词向量已经供未来的研究和比较使用。然而,据我们所知,与[13]提出的方法相比,这些架构在训练时的计算成本更高,除了某些使用对角权重矩阵的对数双线性模型的版本[23]。

2 模型架构Model Architectures

本文比较了传统的feedforward模型和新的RNN模型在学习分布式单词表示时的计算复杂度和准确率,提出了降低计算复杂度的优化方法,并采用分布式训练来处理大规模数据。

>> 模型的计算复杂度和准确率是评价标准,目标是在最小化计算复杂度的同时最大化准确率。

>> Feedforward模型包含投影层和隐藏层。计算复杂度是O(N×D + N×D×H + H×V),在用分层softmax压缩输出项之后,N×D×H通常是主要复杂度。

>> 为提高效率,使用hierarchical softmax,将词汇表表示为Huffman二叉树,使需要评估的输出单元数从约log2(V)进一步降到约log2(Unigram perplexity(V))。

>> RNN模型包含隐藏层和输出层,没有投影层。通过隐藏层的循环连接实现短期记忆。

>> RNN模型的计算复杂度是O(H×H + H×V) ,主要复杂度来自H×H。同样使用hierarchical softmax减少H×V项。

>> 为训练大规模数据,本文实现了模型在DistBelief(分布式框架)上。使用mini-batch技巧。

Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words [20, 31]; LDA moreover becomes computationally very expensive on large data sets.

Similar to [18], to compare different model architectures we define first the computational complexity of a model as the number of parameters that need to be accessed to fully train the model. Next, we will try to maximize the accuracy, while minimizing the computational complexity.

许多不同类型的模型被提出用于估计单词的连续表示,包括众所周知的潜在语义分析LSA)和潜在狄利克雷分配LDA)。在本文中,我们关注通过神经网络学习的单词分布式表示,因为先前的研究表明,与LSA相比,它们在保持单词之间线性规律性方面表现更好[20, 31];而且,LDA在大型数据集上的计算成本非常高。

类似于[18],为了比较不同的模型架构,我们首先定义模型的计算复杂性为需要访问的参数数量。接下来,我们将尽量在最小化计算复杂性的同时最大化准确性

For all the following models, the training complexity is proportional to

O = E × T × Q,

where E is number of the training epochs, T is the number of the words in the training set and Q is defined further for each model architecture. Common choice is E = 3 − 50 and T up to one billion. All models are trained using stochastic gradient descent and backpropagation [26].

对于以下所有模型,训练复杂性与以下比例成正比:

O = E × T × Q,

其中E是训练的迭代次数,T是训练集中的单词数量,而Q对于每个模型架构则有进一步的定义。常见选择是E = 3-50,T最多可达十亿。所有模型都使用随机梯度下降反向传播进行训练[26]。

2.1 前馈神经网络语言模型Feedforward Neural Net Language Model (NNLM)

The probabilistic feedforward neural network language model has been proposed in [1]. It consists of input, projection, hidden and output layers. At the input layer, N previous words are encoded using 1-of-V coding, where V is size of the vocabulary. The input layer is then projected to a projection layer P that has dimensionality N × D, using a shared projection matrix. As only N inputs are active at any given time, composition of the projection layer is a relatively cheap operation.

概率前馈神经网络语言模型最早在[1]中提出。它由输入层、投影层、隐藏层和输出层组成。在输入层,前N个单词使用1-of-V编码进行编码,其中V是词汇表的大小。然后,将输入层投影到具有维度N × D的投影层P上,使用共享投影矩阵。由于任何给定时间只有N个输入是活动的,所以投影层的组合是一个相对廉价的操作。

The NNLM architecture becomes complex for computation between the projection and the hidden layer, as values in the projection layer are dense. For a common choice of N = 10, the size of the projection layer (P ) might be 500 to 2000, while the hidden layer size H is typically 500 to 1000 units. Moreover, the hidden layer is used to compute probability distribution over all the words in the vocabulary, resulting in an output layer with dimensionality V . Thus, the computational complexity per each training example is

Q = N × D + N × D × H + H × V,

where the dominating term is H × V . However, several practical solutions were proposed for avoiding it; either using hierarchical versions of the softmax [25, 23, 18], or avoiding normalized models completely by using models that are not normalized during training [4, 9]. With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around log2(V ). Thus, most of the complexity is caused by the term N × D × H.

NNLM架构在投影层和隐藏层之间的计算中变得复杂,因为投影层中的值是密集的。对于常见的选择N = 10,投影层(P)的大小可能是500到2000,而隐藏层的大小H通常是500到1000个单元。此外,隐藏层用于计算整个词汇表上的概率分布,导致具有维度V的输出层。因此,每个训练示例的计算复杂度为

Q = N × D + N × D × H + H × V,

其中主导项是H × V。然而,为了避免这个问题,提出了几种实际解决方案;可以使用softmax的分层版本[25, 23, 18],或者在训练期间完全避免使用标准化模型[4, 9]。通过使用词汇表的二叉树表示,需要评估的输出单元数量可以减少到约log2(V)。因此,大部分的复杂性来自于项N × D × H。
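
顺着上面的复杂度公式,下面用一小段Python做个粗略的数值演算(N=10、D=500、H=500、V=10^6只是论文给出的典型量级,并非某次具体实验的设置),直观对比各项的大小,以及把输出项压缩到约H×log2(V)之后的变化,仅作示意。

```python
import math

# 假设的典型量级(取自论文给出的范围,仅作演示)
N, D, H, V = 10, 500, 500, 1_000_000

term_proj   = N * D            # 投影层:N×D            -> 5,000
term_hidden = N * D * H        # 投影层->隐藏层:N×D×H  -> 2,500,000
term_output = H * V            # 隐藏层->输出层:H×V    -> 500,000,000(未优化时占主导)
print(term_proj, term_hidden, term_output)

# 用二叉树/分层softmax后,输出项约降为 H×log2(V)
term_output_reduced = H * math.log2(V)
print(round(term_output_reduced))   # 约 9,966 -> 此时 N×D×H 成为主要开销
```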

In our models, we use hierarchical softmax where the vocabulary is represented as a Huffman binary tree. This follows previous observations that the frequency of words works well for obtaining classes in neural net language models [16]. Huffman trees assign short binary codes to frequent words, and this further reduces the number of output units that need to be evaluated: while balanced binary tree would require log2(V ) outputs to be evaluated, the Huffman tree based hierarchical softmax requires only about log2(Unigram perplexity(V )). For example when the vocabulary size is one million words, this results in about two times speedup in evaluation. While this is not crucial speedup for neural network LMs as the computational bottleneck is in the N ×D×H term, we will later propose architectures that do not have hidden layers and thus depend heavily on the efficiency of the softmax normalization.

在我们的模型中,我们使用词汇表表示为Huffman二叉树的分层softmax。这是根据先前的观察,词频在获得神经网络语言模型中的类别时效果良好[16]。Huffman树为频繁出现的单词分配了短的二进制编码,这进一步减少了需要评估的输出单元数量:当平衡的二叉树需要评估log2(V)个输出时,基于Huffman树的分层softmax只需要大约log2(Unigram perplexity(V))个输出。例如,当词汇表大小为一百万个单词时,这将导致评估速度提高约两倍。虽然这对于神经网络语言模型来说不是关键的加速,因为计算瓶颈在于N × D × H项,但我们将在后文中提出一些没有隐藏层的架构,因此严重依赖softmax归一化的效率。
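
为了更直观地理解"Huffman树给高频词分配短编码、所需输出评估次数约为log2(Unigram perplexity(V))"这一点,下面给出一个用heapq构建Huffman树并查看各词编码长度的最小示意;其中的词频完全是假设的玩具数据,与论文实验无关。

```python
import heapq

def huffman_code_lengths(freqs):
    """给定{词: 频次},构建Huffman树,返回每个词的二进制编码长度(即树中深度)。"""
    # 堆元素:(频次, 序号, {词: 当前深度});序号保证频次相同时比较不会落到字典上
    heap = [(f, i, {w: 0}) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# 假设的Zipf式词频:高频词得到更短的编码
freqs = {"the": 1000, "of": 700, "king": 50, "queen": 40, "xylophone": 1}
print(huffman_code_lengths(freqs))
# 例如 the 的编码长度为1,低频词约为4;需要评估的输出单元数随之减少
```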

2.2 循环神经网络语言模型Recurrent Neural Net Language Model (RNNLM)

Recurrent neural network based language model has been proposed to overcome certain limitations of the feedforward NNLM, such as the need to specify the context length (the order of the model N), and because theoretically RNNs can efficiently represent more complex patterns than the shallow neural networks [15, 2]. The RNN model does not have a projection layer; only input, hidden and output layer. What is special for this type of model is the recurrent matrix that connects hidden layer to itself, using time-delayed connections. This allows the recurrent model to form some kind of short term memory, as information from the past can be represented by the hidden layer state that gets updated based on the current input and the state of the hidden layer in the previous time step.

基于循环神经网络的语言模型被提出来克服前馈NNLM的某些限制,比如需要指定上下文长度(模型的阶数N),并且理论上RNN可以有效地表示比浅层神经网络更复杂的模式[15, 2]。RNN模型没有投影层,只有输入层、隐藏层和输出层。这种模型的特殊之处在于将隐藏层与自身连接的递归矩阵,使用时间延迟连接。这使得递归模型可以形成一种短期记忆,因为来自过去的信息可以由隐藏层状态表示,该状态基于当前输入和前一时间步的隐藏层状态进行更新。

The complexity per training example of the RNN model is

Q = H × H + H × V,

where the word representations D have the same dimensionality as the hidden layer H. Again, the term H × V can be efficiently reduced to H × log2(V ) by using hierarchical softmax. Most of the complexity then comes from H × H.

RNN模型每个训练示例的复杂度为

Q = H × H + H × V,

其中单词表示D与隐藏层H具有相同的维度。同样,通过使用分层softmax,项H × V可以有效地减少为H × log2(V)。大部分复杂性来自于H × H项。

2.3 神经网络的并行训练Parallel Training of Neural Networks

To train models on huge data sets, we have implemented several models on top of a large-scale distributed framework called DistBelief [6], including the feedforward NNLM and the new models proposed in this paper. The framework allows us to run multiple replicas of the same model in parallel, and each replica synchronizes its gradient updates through a centralized server that keeps all the parameters. For this parallel training, we use mini-batch asynchronous gradient descent with an adaptive learning rate procedure called Adagrad [7]. Under this framework, it is common to use one hundred or more model replicas, each using many CPU cores at different machines in a data center.

为了在大型数据集上训练模型,我们在一个称为DistBelief的大规模分布式框架上实现了几个模型,包括前馈NNLM和本文提出的新模型。该框架允许我们并行运行多个相同模型的副本,每个副本通过一个集中的服务器同步其梯度更新。对于这种并行训练,我们使用自适应学习率过程Adagrad进行小批量异步梯度下降[7]。在这个框架下,通常使用一百个或更多模型副本,在数据中心的不同机器上使用多个CPU核心。
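
论文在DistBelief上采用小批量异步梯度下降并配合Adagrad自适应学习率。下面只给出单机、单参数矩阵版本的Adagrad更新规则示意(并非DistBelief或论文代码,矩阵和梯度都是随机占位数据),核心是为每个参数累积历史梯度平方、据此缩放学习率。

```python
import numpy as np

def adagrad_update(params, grads, accum, lr=0.025, eps=1e-8):
    """Adagrad一步更新;accum是与params同形状的梯度平方累积量。"""
    accum += grads ** 2
    params -= lr * grads / (np.sqrt(accum) + eps)
    return params, accum

# 玩具示例:一个2x3的参数矩阵(假设数据)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))
G2 = np.zeros_like(W)                        # 梯度平方累积
for _ in range(100):
    grad = rng.normal(size=W.shape)          # 假设的梯度;分布式场景下由各副本异步提交
    W, G2 = adagrad_update(W, grad, G2)
```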

3 新的对数线性模型New Log-linear Models

本文提出2种新模型架构来学习分布式单词表示,并比较了这2种模型与传统feedforward和RNN模型,说明新模型的优势。总的来说,这2种新模型架构去除了隐藏层,计算效率更高。但依然能捕捉单词间的关系,学习出高质量的词向量。

>> CBOW模型:移除隐藏层,直接使用共享的投影层,对上下文中各单词的向量取平均。CBOW模型的复杂度为O(N×D + D×log2(V)),它使用前后文的上下文信息来预测当前(中间)词。

>> Skip-gram模型:采用一个单词作为输入,预测其上下文单词。Skip-gram模型的复杂度为O(C×(D + D×log2(V))),它使用一个词来预测其前后至多C个距离内的词。

>> 这2种模型去除了复杂的隐藏层,计算成本更低。

>> 作者发现,增加上下文范围可以提高词向量质量,但也增加了计算成本。

In this section, we propose two new model architectures for learning distributed representations of words that try to minimize computational complexity. The main observation from the previous section was that most of the complexity is caused by the non-linear hidden layer in the model. While this is what makes neural networks so attractive, we decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently.

在本节中,我们提出了两种新的模型架构,用于学习词的分布表示,并尽量减少计算复杂度。前一节的主要观察结果是,大部分复杂性是由模型中的非线性隐藏层引起的。虽然这正是神经网络如此吸引人的地方,但我们决定探索更简单的模型,这些模型可能无法像神经网络那样精确地表示数据,但可能可以更高效地在更多数据上进行训练。

The new architectures directly follow those proposed in our earlier work [13, 14], where it was found that neural network language model can be successfully trained in two steps: first, continuous word vectors are learned using simple model, and then the N-gram NNLM is trained on top of these distributed representations of words. While there has been later substantial amount of work that focuses on learning word vectors, we consider the approach proposed in [13] to be the simplest one. Note that related models have been proposed also much earlier [26, 8].

这些新的架构直接遵循我们早期工作中提出的架构[13, 14],在那些工作中发现神经网络语言模型可以通过两个步骤成功训练:首先,使用简单模型学习连续的词向量,然后在这些分布式的词表示之上训练N-gram NNLM。虽然后来有大量的研究专注于学习词向量,但我们认为[13]中提出的方法是最简单的方法。注意,也有早期的相关模型提出[26, 8]。

3.1连续词袋模型Continuous Bag-of-Words Model

The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged). We call this architecture a bag-of-words model as the order of words in the history does not influence the projection. Furthermore, we also use words from the future; we have obtained the best performance on the task introduced in the next section by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word. Training complexity is then

Q = N × D + D × log2(V ).

第一个提出的架构类似于前馈NNLM,其中移除了非线性隐藏层,并且投影层在所有单词上是共享的(不仅仅是投影矩阵);因此,所有单词都被投影到相同的位置(它们的向量被平均)。我们将这种架构称为词袋模型,因为单词在历史中的顺序不会影响投影。此外,我们还使用来自未来的单词;我们通过在输入中使用四个未来单词和四个历史单词来构建一个具有最佳性能的对数线性分类器,训练准则是正确分类当前(中间)单词。训练复杂度为

Q = N × D + D × log2(V)。

We denote this model further as CBOW, as unlike standard bag-of-words model, it uses continuous distributed representation of the context. The model architecture is shown at Figure 1. Note that the weight matrix between the input and the projection layer is shared for all word positions in the same way as in the NNLM.

我们进一步将这个模型称为CBOW,因为与标准的词袋模型不同,它使用连续的分布式上下文表示。模型架构如图1所示。请注意,输入层和投影层之间的权重矩阵与NNLM中的方式相同,对于相同位置的所有单词位置是共享的。
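
下面用numpy给出CBOW单步训练的一个最小示意:把上下文词向量取平均作为投影层输出,再经输出矩阵计算softmax并用SGD更新。为简洁起见,这里用普通softmax代替论文中的分层softmax,词表大小、维度与样本都是假设的玩具设置,仅用于说明数据流向。

```python
import numpy as np

V, D = 10, 8                                  # 假设:词表大小10,向量维度8
rng = np.random.default_rng(1)
W_in  = rng.normal(scale=0.1, size=(V, D))    # 输入(投影)向量,即学到的词向量
W_out = np.zeros((V, D))                      # 输出向量

def cbow_step(context_ids, center_id, lr=0.025):
    """用上下文词预测中心词的一次SGD更新(普通softmax版本)。"""
    h = W_in[context_ids].mean(axis=0)        # 投影层:上下文向量取平均
    scores = W_out @ h
    p = np.exp(scores - scores.max()); p /= p.sum()
    g = p.copy(); g[center_id] -= 1.0         # 交叉熵损失对scores的梯度
    W_out[:] -= lr * np.outer(g, h)
    W_in[context_ids] -= lr * (W_out.T @ g) / len(context_ids)
    return -np.log(p[center_id])              # 返回当前样本的损失

# 玩具样本:上下文词id列表 -> 中心词id
loss = cbow_step(context_ids=[1, 2, 4, 5], center_id=3)
```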

3.2 连续Skip-gram模型Continuous Skip-gram Model

The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence. More precisely, we use each current word as an input to a log-linear classifier with continuous projection layer, and predict words within a certain range before and after the current word. We found that increasing the range improves quality of the resulting word vectors, but it also increases the computational complexity. Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples.

第二种架构与CBOW相似,但不是基于上下文来预测当前单词,而是基于同一句子中的另一个单词来最大化单词的分类。更准确地说,我们将每个当前单词作为输入传递给具有连续投影层的对数线性分类器,并预测当前单词之前和之后一定范围内的单词。我们发现增加范围可以提高生成的词向量的质量,但也会增加计算复杂度。由于较远的单词通常与当前单词的相关性较低,我们在训练示例中对这些单词的采样权重较低

The training complexity of this architecture is proportional to

Q = C × (D + D × log2(V )),

where C is the maximum distance of the words. Thus, if we choose C = 5, for each training word we will select randomly a number R in range < 1; C >, and then use R words from history and R words from the future of the current word as correct labels. This will require us to do R × 2 word classifications, with the current word as input, and each of the R + R words as output. In the following experiments, we use C = 10.

这种架构的训练复杂度正比于

Q = C × (D + D × log2(V)),

其中C是单词的最大距离。因此,如果我们选择C = 5,对于每个训练单词,我们将随机选择一个范围在<1; C>内的数R,然后使用来自当前单词历史和未来的R个单词作为正确标签。这将要求我们对R×2个单词进行分类,其中当前单词作为输入,每个R+R个单词作为输出。在后续的实验中,我们使用C = 10。
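
论文中"随机取R∈[1, C]、用前后各R个词作为标签"的样本生成方式,可以用几行Python示意如下(语料与C的取值均为演示用的假设数据);这种动态窗口等效于对距离更远的词降低采样权重。

```python
import random

def skipgram_pairs(tokens, C=5, seed=0):
    """生成(中心词, 上下文词)训练对;R在[1, C]内随机,使较远的词被采样得更少。"""
    random.seed(seed)
    pairs = []
    for i, center in enumerate(tokens):
        R = random.randint(1, C)              # 动态窗口
        for j in range(max(0, i - R), min(len(tokens), i + R + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

toy = "the quick brown fox jumps over the lazy dog".split()   # 假设的玩具语料
for p in skipgram_pairs(toy, C=2)[:6]:
    print(p)
```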

4 结果Results

本文设计的测试集可以有效评估词向量的质量;提出的CBOW模型和Skip-gram模型能在低计算成本的情况下学习高质量的分布式单词表示。使用大规模数据并行训练可以进一步提高词向量的效果,最终领先于传统的feedforward和RNN模型。

>> 本文设计了一个测试集来评估不同模型学习的词向量的质量,包括5类语义问题和9类语法问题。

>> 作者发现适当增加词向量维度和训练数据量可以大幅提高准确率,但增益越来越小。

>> 作者比较不同模型架构和参数,发现Skip-gram模型在语义任务上效果最好。

>> 使用分布式框架并行训练词向量,能加速训练速度。Skip-gram模型(1000维)只需2.5天。

>> Skip-gram模型在Microsoft句子完成任务上取得48.0%的准确率,并且可以与RNN联合起来达到58.9%。

>> 实验结果表明,使用高维和大规模数据训练的词向量能更好捕捉单词间复杂的语法和语义关系。

To compare the quality of different versions of word vectors, previous papers typically use a table showing example words and their most similar words, and understand them intuitively. Although it is easy to show that word France is similar to Italy and perhaps some other countries, it is much more challenging when subjecting those vectors in a more complex similarity task, as follows. We follow previous observation that there can be many different types of similarities between words, for example, word big is similar to bigger in the same sense that small is similar to smaller. Example of another type of relationship can be word pairs big - biggest and small - smallest [20]. We further denote two pairs of words with the same relationship as a question, as we can ask: ”What is the word that is similar to small in the same sense as biggest is similar to big?”

Somewhat surprisingly, these questions can be answered by performing simple algebraic operations with the vector representation of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute vector X = vector(”biggest”) − vector(”big”) + vector(”small”). Then, we search in the vector space for the word closest to X measured by cosine distance, and use it as the answer to the question (we discard the input question words during this search). When the word vectors are well trained, it is possible to find the correct answer (word smallest) using this method.

为了比较不同版本词向量的质量,以前的论文通常用表格展示示例单词及其最相似的单词,并依靠直觉去理解。虽然很容易展示出单词"France"与"Italy"以及其他一些国家相似,但把这些向量放到下面这种更复杂的相似性任务中就更具挑战性了。我们沿用先前的观察:单词之间可能存在许多不同类型的相似性,例如,单词"big"与"bigger"相似,正如"small"与"smaller"相似;另一种类型关系的例子是单词对"big"-"biggest"和"small"-"smallest"[20]。我们进一步把具有相同关系的两对单词记作一个"问题",因为我们可以提问:"哪个单词与"small"的关系,就像"biggest"与"big"的关系一样?"

令人惊讶的是,通过对单词的向量表示执行简单的代数运算,可以回答这些问题。为了找到与"small"在与"biggest"与"big"的关系相同的意义上类似的单词,我们可以简单地计算向量X = vector("biggest") - vector("big") + vector("small")。然后,我们在向量空间中搜索最接近X的单词,以余弦距离来衡量,并将其作为问题的答案(在此搜索过程中,丢弃输入的问题单词)。当词向量训练良好时,可以使用此方法找到正确答案("smallest")。
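
把这里描述的"词偏移"方法写成代码大致如下(numpy示意):计算X = vector("biggest") - vector("big") + vector("small"),再按余弦相似度检索最近的词,并像论文那样排除问题中的输入词。示例中的词向量是随机占位数据,真实效果依赖训练好的向量。

```python
import numpy as np

def analogy(wv, a, b, c, exclude_inputs=True):
    """返回与 vector(b) - vector(a) + vector(c) 余弦相似度最高的词,如 biggest-big+small≈smallest。"""
    x = wv[b] - wv[a] + wv[c]
    x /= np.linalg.norm(x)
    best, best_sim = None, -1.0
    for w, v in wv.items():
        if exclude_inputs and w in (a, b, c):
            continue                          # 检索时丢弃问题中的输入词
        sim = v @ x / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best, best_sim

# 占位词向量(真实使用时应替换为训练好的向量)
rng = np.random.default_rng(0)
vocab = ["big", "bigger", "biggest", "small", "smaller", "smallest", "king", "queen"]
wv = {w: rng.normal(size=50) for w in vocab}
print(analogy(wv, "big", "biggest", "small"))   # 在训练好的向量上应返回 "smallest"
```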

Finally, we found that when we train high dimensional word vectors on a large amount of data, the resulting vectors can be used to answer very subtle semantic relationships between words, such as a city and the country it belongs to, e.g. France is to Paris as Germany is to Berlin. Word vectors with such semantic relationships could be used to improve many existing NLP applications, such as machine translation, information retrieval and question answering systems, and may enable other future applications yet to be invented.

最后,我们发现当我们在大量数据上训练高维词向量时,得到的向量可以用于回答单词之间非常微妙的语义关系,例如城市与所属国家之间的关系,例如法国与巴黎之间的关系,类似德国与柏林之间的关系。具有这种语义关系的词向量可以用于改进许多现有的自然语言处理应用,例如机器翻译、信息检索和问答系统,并可能为其他尚未发明的未来应用铺平道路。

 

4.1 任务描述Task Description

To measure quality of the word vectors, we define a comprehensive test set that contains five types of semantic questions, and nine types of syntactic questions. Two examples from each category are shown in Table 1. Overall, there are 8869 semantic and 10675 syntactic questions. The questions in each category were created in two steps: first, a list of similar word pairs was created manually. Then, a large list of questions is formed by connecting two word pairs. For example, we made a list of 68 large American cities and the states they belong to, and formed about 2.5K questions by picking two word pairs at random. We have included in our test set only single token words, thus multi-word entities are not present (such as New York).

We evaluate the overall accuracy for all question types, and for each question type separately (semantic, syntactic). Question is assumed to be correctly answered only if the closest word to the vector computed using the above method is exactly the same as the correct word in the question; synonyms are thus counted as mistakes. This also means that reaching 100% accuracy is likely to be impossible, as the current models do not have any input information about word morphology. However, we believe that usefulness of the word vectors for certain applications should be positively correlated with this accuracy metric. Further progress can be achieved by incorporating information about structure of words, especially for the syntactic questions.

为了衡量词向量的质量,我们定义了一个综合的测试集,其中包含五种类型的语义问题和九种类型的句法问题。表1中显示了每个类别的两个示例。总共有8869个语义问题和10675个句法问题。

每个类别的问题分两步创建:首先,手动创建一个相似词对列表;然后,通过将两个词对两两连接,构造出大量问题。例如,我们制作了一个包含68个美国大城市及其所属州的列表,并通过随机选取两个词对构造出约2.5K个问题。我们的测试集中只包含单个词元(single token)的单词,因此不包含多词实体(如New York)。

我们评估所有问题类型的总体准确率,以及每个问题类型的单独准确率(语义、句法)。只有当使用上述方法计算的向量的最接近单词与问题中的正确单词完全相同时,问题才被认为回答正确;同义词被视为错误。这也意味着达到100%的准确率可能是不可能的,因为当前的模型没有关于词形态的任何输入信息。然而,我们相信词向量对某些应用的有用性应与此准确度指标呈正相关。通过结合有关单词结构的信息,特别是对于句法问题,可以实现进一步的进展

4.2 准确率的最大化Maximization of Accuracy

We have used a Google News corpus for training the word vectors. This corpus contains about 6B tokens. We have restricted the vocabulary size to 1 million most frequent words. Clearly, we are facing time constrained optimization problem, as it can be expected that both using more data and higher dimensional word vectors will improve the accuracy. To estimate the best choice of model architecture for obtaining as good as possible results quickly, we have first evaluated models trained on subsets of the training data, with vocabulary restricted to the most frequent 30k words. The results using the CBOW architecture with different choice of word vector dimensionality and increasing amount of the training data are shown in Table 2.

It can be seen that after some point, adding more dimensions or adding more training data provides diminishing improvements. So, we have to increase both vector dimensionality and the amount of the training data together. While this observation might seem trivial, it must be noted that it is currently popular to train word vectors on relatively large amounts of data, but with insufficient size (such as 50 - 100). Given Equation 4, increasing amount of training data twice results in about the same increase of computational complexity as increasing vector size twice.

我们使用了Google News语料库来训练词向量。该语料库包含约60亿个词元(token)。我们将词汇量限制为最常见的100万个单词。很明显,我们面临的是一个受时间约束的优化问题,因为可以预期,使用更多的数据和更高维的词向量都会提高准确率。为了估计哪种模型架构能够尽快取得尽可能好的结果,我们首先评估了在训练数据子集上训练的模型,并将词汇表限制为最常见的3万个单词。表2给出了在不同词向量维度和不断增加的训练数据量下,使用CBOW架构得到的结果。

可以看出,超过某个点之后,再增加维度或再增加训练数据带来的改进都会递减。因此,我们必须同时增加向量维度和训练数据量。虽然这个观察看似平凡,但必须指出,目前流行的做法往往是在相对大量的数据上训练词向量,但向量维度不足(例如只有50-100)。根据公式4,将训练数据量增加一倍所带来的计算复杂度增长,与将向量维度增加一倍大致相同。

For the experiments reported in Tables 2 and 4, we used three training epochs with stochastic gradient descent and backpropagation. We chose starting learning rate 0.025 and decreased it linearly, so that it approaches zero at the end of the last training epoch.

对于表2和表4中报告的实验,我们使用了三个训练周期,采用随机梯度下降和反向传播。我们选择了起始学习率为0.025,并线性降低它,以使其在最后一个训练周期结束时接近零。
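
论文采用起始学习率0.025并线性衰减到训练结束时趋近于0。一个常见的实现方式是按已处理词数计算当前学习率,示意如下(min_lr下限是工程上的常见做法,并非论文明确给出的细节)。

```python
def linear_lr(words_processed, total_words, start_lr=0.025, min_lr=1e-4):
    """学习率从start_lr线性衰减到接近0(min_lr作为数值下限)。"""
    lr = start_lr * (1.0 - words_processed / total_words)
    return max(lr, min_lr)

total = 3 * 1_000_000_000          # 例如:3个epoch × 10亿词(假设规模)
for done in (0, total // 2, total - 1):
    print(done, round(linear_lr(done, total), 6))
```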

4.3 模型架构的比较Comparison of Model Architectures

First we compare different model architectures for deriving the word vectors using the same training data and using the same dimensionality of 640 of the word vectors. In the further experiments, we use full set of questions in the new Semantic-Syntactic Word Relationship test set, i.e. unrestricted to the 30k vocabulary. We also include results on a test set introduced in [20] that focuses on syntactic similarity between words.

The training data consists of several LDC corpora and is described in detail in [18] (320M words, 82K vocabulary). We used these data to provide a comparison to a previously trained recurrent neural network language model that took about 8 weeks to train on a single CPU. We trained a feed-forward NNLM with the same number of 640 hidden units using the DistBelief parallel training [6], using a history of 8 previous words (thus, the NNLM has more parameters than the RNNLM, as the projection layer has size 640 × 8).

首先,我们比较了使用相同训练数据和相同640维词向量维度的不同模型架构来得出词向量。在进一步的实验中,我们在新的语义-句法词关系测试集中使用了全部问题,即不限于30k个词汇表。我们还包括了在[20]中介绍的针对单词句法相似性的测试集的结果。

训练数据包括几个LDC语料库,详细描述在[18]中(320M个词,82K个词汇)。我们使用这些数据与先前训练的递归神经网络语言模型进行比较,该模型在单个CPU上训练了大约8周。我们使用相同数量的640个隐藏单元训练了一个前馈NNLM,使用DistBelief并行训练[6],使用8个先前单词的历史记录(因此,NNLM的参数比RNNLM多,因为投影层的大小为640×8)。

In Table 3, it can be seen that the word vectors from the RNN (as used in [20]) perform well mostly on the syntactic questions. The NNLM vectors perform significantly better than the RNN - this is not surprising, as the word vectors in the RNNLM are directly connected to a non-linear hidden layer. The CBOW architecture works better than the NNLM on the syntactic tasks, and about the same on the semantic one. Finally, the Skip-gram architecture works slightly worse on the syntactic task than the CBOW model (but still better than the NNLM), and much better on the semantic part of the test than all the other models.

Next, we evaluated our models trained using one CPU only and compared the results against publicly available word vectors. The comparison is given in Table 4. The CBOW model was trained on subset of the Google News data in about a day, while training time for the Skip-gram model was about three days.

在表3中,可以看出RNN中的词向量(如[20]中使用的)在大多数句法问题上表现良好。NNLM的词向量表现比RNN好得多-这并不令人意外,因为RNNLM中的词向量与非线性隐藏层直接连接。CBOW架构在句法任务上的表现优于NNLM,并在语义任务上与其大致相同。最后,Skip-gram模型在句法任务上的表现略逊于CBOW模型(但仍优于NNLM),而在语义测试的部分上比其他模型表现更好。

接下来,我们使用仅一个CPU训练了我们的模型,并将结果与公开可用的词向量进行了比较。比较结果如表4所示。CBOW模型在约一天的时间内使用Google News数据的子集进行训练,而Skip-gram模型的训练时间约为三天。

For experiments reported further, we used just one training epoch (again, we decrease the learning rate linearly so that it approaches zero at the end of training). Training a model on twice as much data using one epoch gives comparable or better results than iterating over the same data for three epochs, as is shown in Table 5, and provides additional small speedup.

对于进一步报告的实验,我们仅使用一个训练周期(同样,我们线性减小学习率,使其在训练结束时接近零)。使用两倍于原来的数据进行一个周期的训练,得到的结果与对相同数据进行三个周期的迭代相比,准确率相当或更好,如表5所示,并提供了额外的小幅加速。

4.4 模型的大规模并行训练Large Scale Parallel Training of Models

As mentioned earlier, we have implemented various models in a distributed framework called DistBelief. Below we report the results of several models trained on the Google News 6B data set, with mini-batch asynchronous gradient descent and the adaptive learning rate procedure called Adagrad [7]. We used 50 to 100 model replicas during the training. The number of CPU cores is an estimate since the data center machines are shared with other production tasks, and the usage can fluctuate quite a bit. Note that due to the overhead of the distributed framework, the CPU usage of the CBOW model and the Skip-gram model are much closer to each other than their single-machine implementations. The results are reported in Table 6.

如前所述,我们在一个名为DistBelief的分布式框架中实现了各种模型。下面我们报告在Google News 6B数据集上使用小批量异步梯度下降和自适应学习率过程AdaGrad [7]训练的几个模型的结果。我们在训练过程中使用了50到100个模型副本。CPU核心数是一个估计值,因为数据中心的机器与其他生产任务共享,并且使用情况可能会有很大波动。需要注意的是,由于分布式框架的开销,CBOW模型和Skip-gram模型的CPU使用情况相比单机实现更加接近。结果如表6所示。

4.5 微软研究句子完成挑战Microsoft Research Sentence Completion Challenge

The Microsoft Sentence Completion Challenge has been recently introduced as a task for advancing language modeling and other NLP techniques [32]. This task consists of 1040 sentences, where one word is missing in each sentence and the goal is to select word that is the most coherent with the rest of the sentence, given a list of five reasonable choices. Performance of several techniques has been already reported on this set, including N-gram models, LSA-based model [32], log-bilinear model [24] and a combination of recurrent neural networks that currently holds the state of the art performance of 55.4% accuracy on this benchmark [19].

We have explored the performance of Skip-gram architecture on this task. First, we train the 640-dimensional model on 50M words provided in [32]. Then, we compute score of each sentence in the test set by using the unknown word at the input, and predict all surrounding words in a sentence. The final sentence score is then the sum of these individual predictions. Using the sentence scores, we choose the most likely sentence.

最近,微软句子完成挑战被引入为推进语言建模和其他自然语言处理技术的任务。该任务包括1040个句子,每个句子中缺少一个单词,目标是从给定的五个合理选择中选择与其余部分最相符的单词。已经在该数据集上报告了几种技术的性能,包括N-gram模型、基于LSA的模型[32]、对数双线性模型[24]以及目前在该基准测试中保持最先进性能的递归神经网络的组合模型,准确率为55.4% [19]。

我们对Skip-gram架构在此任务上的性能进行了探索。首先,我们使用[32]提供的5000万个单词来训练640维模型。然后,我们通过在输入中使用未知词,并预测句子中的所有周围单词,来计算测试集中每个句子的得分。最终句子得分是这些单个预测的总和。使用句子得分,我们选择最有可能的句子。
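
论文没有给出这一打分流程的代码细节,下面按文字描述给出一个示意:把候选词放到缺失位置作为Skip-gram的输入,累加它对句中其余各词的预测得分,选总分最高的候选。这里用输入/输出向量点积的log-softmax近似"预测得分",向量与词表均为占位数据,仅用于说明流程。

```python
import numpy as np

def sentence_score(candidate, context_words, W_in, W_out, vocab_index):
    """以候选词为输入,累加其对句中其余每个词的 log P(w | candidate)。"""
    h = W_in[vocab_index[candidate]]
    scores = W_out @ h
    log_p = scores - scores.max() - np.log(np.exp(scores - scores.max()).sum())
    return sum(log_p[vocab_index[w]] for w in context_words if w in vocab_index)

# 占位模型与词表(实际应为在5000万词上训练的640维Skip-gram)
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "piano"]
vocab_index = {w: i for i, w in enumerate(vocab)}
W_in = rng.normal(size=(len(vocab), 16)); W_out = rng.normal(size=(len(vocab), 16))

context = ["the", "sat", "on", "the", "mat"]        # 缺词句子中除候选位置外的词
candidates = ["cat", "dog", "piano"]                # 五选一里给出的候选(此处仅列三个示意)
best = max(candidates, key=lambda c: sentence_score(c, context, W_in, W_out, vocab_index))
print(best)
```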

A short summary of some previous results together with the new results is presented in Table 7. While the Skip-gram model itself does not perform on this task better than LSA similarity, the scores from this model are complementary to scores obtained with RNNLMs, and a weighted combination leads to a new state of the art result 58.9% accuracy (59.2% on the development part of the set and 58.7% on the test part of the set).

表7中呈现了一些先前结果的简要总结以及新结果。虽然Skip-gram模型本身在此任务上的表现不如LSA相似性,但该模型的得分与RNNLM得到的得分互补,加权组合导致了一个新的最先进结果,准确率为58.9%(在数据集开发部分为59.2%,测试部分为58.7%)。

5 通过学习的关系示例Examples of the Learned Relationships

总的来说,本文学习到的词向量可以有效表达不同的语义和语法关系。虽然存在不足,但效果已经证明使用分布式单词表示有优势。进一步提高词向量质量和应用广度仍然有很多空间,可以开发更多新应用。

>> 作者通过减去两个词向量然后加上第三个词向量,来查找对应的词。

>> 虽然词向量可以表达一些复杂关系,但效果仍有不足。

>> 使用更大规模数据和更高维词向量可以进一步提高准确率。

>> 作者认为,词向量可以用于开发新应用,比如信息检索和问题回答系统。

>> 另一种提高效果的方法是提供多个示例来建立词向量关系。

>> 通过计算多个词的平均向量,然后查找最不同的词,可以解决选择问题。

Table 8 shows words that follow various relationships. We follow the approach described above: the relationship is defined by subtracting two word vectors, and the result is added to another word. Thus for example, Paris - France + Italy = Rome. As it can be seen, accuracy is quite good, although there is clearly a lot of room for further improvements (note that using our accuracy metric that assumes exact match, the results in Table 8 would score only about 60%). We believe that word vectors trained on even larger data sets with larger dimensionality will perform significantly better, and will enable the development of new innovative applications. Another way to improve accuracy is to provide more than one example of the relationship. By using ten examples instead of one to form the relationship vector (we average the individual vectors together), we have observed improvement of accuracy of our best models by about 10% absolutely on the semantic-syntactic test.

表8展示了符合各种关系的单词。我们采用上文描述的方法:用两个词向量相减来定义关系,再把结果与第三个单词的向量相加。例如,Paris - France + Italy = Rome。可以看到,准确率相当不错,尽管显然还有很大的改进空间(请注意,按照我们要求完全匹配的准确度指标,表8中的结果只能得到约60%的分数)。我们相信,在更大的数据集上以更高的维度训练的词向量会表现得更好,并将推动新的创新应用的开发。提高准确率的另一种方法是提供不止一个关系示例:用十个示例而不是一个来构造关系向量(把各个差向量取平均),我们观察到最佳模型在语义-句法测试上的准确率绝对提升了约10%。

It is also possible to apply the vector operations to solve different tasks. For example, we have observed good accuracy for selecting out-of-the-list words, by computing average vector for a list of words, and finding the most distant word vector. This is a popular type of problems in certain human intelligence tests. Clearly, there is still a lot of discoveries to be made using these techniques.

还可以将向量运算应用于解决不同的任务。例如,通过计算一组单词的平均向量,并找到最远的单词向量,可以观察到在选择列表之外的单词方面具有良好的准确性。这是某些人类智力测试中常见的问题类型。显然,使用这些技术还有很多发现可以进行。
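
"从列表中挑出最不相关的词"可以按描述直接实现:对列表中各词向量取平均,再找与平均向量余弦相似度最低的那个词。下面是一个numpy示意,词向量为随机占位数据,真实效果同样依赖训练好的向量。

```python
import numpy as np

def odd_one_out(words, wv):
    """返回与列表平均向量余弦相似度最低的词。"""
    vecs = np.stack([wv[w] / np.linalg.norm(wv[w]) for w in words])
    mean = vecs.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = vecs @ mean
    return words[int(np.argmin(sims))]

rng = np.random.default_rng(0)
wv = {w: rng.normal(size=50) for w in ["apple", "banana", "cherry", "tractor"]}  # 占位向量
print(odd_one_out(["apple", "banana", "cherry", "tractor"], wv))  # 训练好的向量上应返回 "tractor"
```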

6 结论Conclusion

总的来说,本文比较了不同模型学习词向量的效果。实验证明采用简单的CBOW和Skip-gram模型可以学习出高质量的高维词表示。计算效率更高。词向量已经在很多NLP任务上产生了良好结果。未来还有广阔的应用前景。

>> 本文比较了不同模型学习的词向量在语法和语义任务上的效果。

>> 实验发现采用简单的CBOW和Skip-gram模型可以学习出高质量的高维词向量。

>> 计算效率更高,可以处理规模更大的数据集。使用分布式框架可以处理万亿规模的数据集。

>> 词向量在任务如SemEval2012上优于传统技术,在情感分析和句子匹配任务上也有良好效果。

>> 作者认为词向量可以应用到知识库的自动扩展、已有事实的正确性校验、机器翻译等任务。

>> 作者提供的测试集可以帮助研究社区进一步提高分布式单词表示的效果。

>> 高质量的词向量将成为未来NLP应用的重要组成部分。

In this paper we studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks. We observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high dimensional word vectors from a much larger data set. Using the DistBelief distributed framework, it should be possible to train the CBOW and Skip-gram models even on corpora with one trillion words, for basically unlimited size of the vocabulary. That is several orders of magnitude larger than the best previously published results for similar models.

本文研究了各种模型派生的词向量在一系列句法语义语言任务中的质量。我们观察到,与流行的神经网络模型(前馈和递归)相比,使用非常简单的模型架构可以训练出高质量的词向量。由于计算复杂性大大降低,可以从更大的数据集中计算出非常准确的高维词向量。使用DistBelief分布式框架,即使在具有万亿字词的语料库上,也可以训练CBOW和Skip-gram模型,词汇量基本上是无限的。这比之前发布的类似模型的最佳结果大几个数量级。

An interesting task where the word vectors have recently been shown to significantly outperform the previous state of the art is the SemEval-2012 Task 2 [11]. The publicly available RNN vectors were used together with other techniques to achieve over 50% increase in Spearman’s rank correlation over the previous best result [31]. The neural network based word vectors were previously applied to many other NLP tasks, for example sentiment analysis [12] and paraphrase detection [28]. It can be expected that these applications can benefit from the model architectures described in this paper.

Our ongoing work shows that the word vectors can be successfully applied to automatic extension of facts in Knowledge Bases, and also for verification of correctness of existing facts. Results from machine translation experiments also look very promising. In the future, it would be also interesting to compare our techniques to Latent Relational Analysis [30] and others. We believe that our comprehensive test set will help the research community to improve the existing techniques for estimating the word vectors. We also expect that high quality word vectors will become an important building block for future NLP applications.

词向量最近在一个有趣的任务中显示出明显优于之前最先进技术的能力,即SemEval-2012任务2 [11]。公开可用的RNN向量与其他技术结合使用,使Spearman等级相关性相对于之前的最佳结果提高了50%以上[31]。基于神经网络的词向量之前已应用于许多其他自然语言处理任务,例如情感分析[12]和释义检测[28]。可以预期这些应用将受益于本文中描述的模型架构。

我们正在进行的工作表明,词向量可以成功应用于知识库中事实的自动扩展,并验证现有事实的正确性。机器翻译实验的结果也非常有前景。将我们的技术与潜在关系分析[30]和其他技术进行比较也是未来的有趣方向。我们相信,我们的全面测试集将帮助研究界改进对词向量的估计的现有技术。我们还预期高质量的词向量将成为未来自然语言处理应用的重要构建块

7 后续工作Follow-Up Work

在论文初稿完成之后,作者继续推进这项工作:发布了开源的训练代码和大规模的命名实体向量,训练效率和词向量质量都进一步提升。

>> 作者发布了一个单机多线程的C++代码,用于计算词向量。

>> 训练速度比本文前文报告的要高很多,在典型超参数下每小时可处理数十亿个词。

>> 发布了140多万个命名实体向量,是在1000多亿个词上训练得到的。

After the initial version of this paper was written, we published single-machine multi-threaded C++ code for computing the word vectors, using both the continuous bag-of-words and skip-gram architectures. The training speed is significantly higher than reported earlier in this paper, i.e. it is in the order of billions of words per hour for typical hyperparameter choices. We also published more than 1.4 million vectors that represent named entities, trained on more than 100 billion words. Some of our follow-up work will be published in an upcoming NIPS 2013 paper [21].

在撰写本文初稿之后,我们发布了使用连续词袋(CBOW)和Skip-gram两种架构计算词向量的单机多线程C++代码。其训练速度比本文前文报告的要高得多,对于典型的超参数选择,可达每小时数十亿个词的量级。我们还发布了140多万个表示命名实体的向量,它们是在超过1000亿个词上训练得到的。我们后续工作的一部分将发表在即将出版的NIPS 2013论文中[21]。
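
如今要复现这类训练,除了作者发布的word2vec C工具,更常用的是其Python重实现gensim(第三方库,并非论文作者代码)。下面给出gensim 4.x的一个用法示意,参数仅作演示:sg=1对应Skip-gram、sg=0对应CBOW,hs=1启用分层softmax;语料是几句玩具句子,实际使用时应替换为大规模分词文本。

```python
from gensim.models import Word2Vec

# 玩具语料:实际应替换为大规模分词后的文本
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "and", "woman", "walk"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # 词向量维度
    window=5,          # 上下文窗口C
    sg=1,              # 1=Skip-gram, 0=CBOW
    hs=1, negative=0,  # 使用分层softmax(与论文设置一致)
    min_count=1,
    workers=4,
)

# 词偏移(类比)查询:king - man + woman ≈ ?(玩具语料上结果无实际意义)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```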
