Distributed Representations of Sentences and Documents (Doc2Vec Digest)

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, “powerful,” “strong” and “Paris” are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.

Introduction

In our model, the vector representation is trained to be useful for predicting words in a paragraph. More precisely, we concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context. Both word vectors and paragraph vectors are trained by the stochastic gradient descent and backpropagation (Rumelhart et al., 1986). While paragraph vectors are unique among paragraphs, the word vectors are shared. At prediction time, the paragraph vectors are inferred by fixing the word vectors and training the new paragraph vector until convergence.
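
As a concrete illustration of this train-then-infer workflow, here is a minimal sketch using the gensim library's Doc2Vec implementation (an independent reimplementation of Paragraph Vector, not code from the paper); the corpus, tags, and hyperparameters are purely illustrative. Setting dm=1 selects the PV-DM style of model described here, dm_concat=1 concatenates the paragraph vector with the context word vectors, and infer_vector performs the prediction-time step that holds the word vectors fixed while fitting a vector for an unseen paragraph.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each paragraph gets a unique tag, which indexes its paragraph vector.
corpus = [
    TaggedDocument(words=["the", "movie", "was", "powerful", "and", "moving"], tags=["doc_0"]),
    TaggedDocument(words=["a", "weak", "plot", "and", "boring", "dialogue"], tags=["doc_1"]),
]

model = Doc2Vec(
    corpus,
    dm=1,          # PV-DM: predict a word from the paragraph vector plus context word vectors
    dm_concat=1,   # concatenate (rather than average) the paragraph and word vectors
    vector_size=50,
    window=3,
    min_count=1,
    epochs=40,
)

trained_vec = model.dv["doc_0"]                          # paragraph vector learned during training
new_vec = model.infer_vector(["a", "powerful", "film"])  # inference: word vectors stay fixed
```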

Our technique is inspired by the recent work in learning vector representations of words using neural networks (Bengio et al., 2006; Collobert & Weston, 2008; Mnih & Hinton, 2008; Turian et al., 2010; Mikolov et al., 2013a;c). In their formulation, each word is represented by a vector which is concatenated or averaged with other word vectors in a context, and the resulting vector is used to predict other words in the context. For example, the neural network language model proposed in (Bengio et al., 2006) uses the concatenation of several previous word vectors to form the input of a neural network, and tries to predict the next word. The outcome is that after the model is trained, the word vectors are mapped into a vector space such that semantically similar words have similar vector representations (e.g., “strong” is close to “powerful”).
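
To make the mechanism concrete, the sketch below implements a toy forward pass in the spirit of such a neural language model (random weights and hypothetical sizes, not the actual model of Bengio et al.): the vectors of the n previous words are concatenated, passed through a hidden layer, and a softmax scores every vocabulary word as the candidate next word.

```python
import numpy as np

# Hypothetical sizes: vocabulary V, embedding dim d, context length n, hidden units h.
V, d, n, h = 1000, 50, 3, 128
rng = np.random.default_rng(0)

W_emb = rng.normal(size=(V, d))    # one word vector per vocabulary word
H = rng.normal(size=(n * d, h))    # hidden-layer weights
U = rng.normal(size=(h, V))        # output (softmax) weights
b = np.zeros(V)

context = [12, 7, 431]                            # indices of the n previous words
x = np.concatenate([W_emb[i] for i in context])   # concatenated word vectors form the input
hidden = np.tanh(x @ H)
scores = hidden @ U + b
probs = np.exp(scores - scores.max())
probs /= probs.sum()                              # distribution over the next word
```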

Following these successful techniques, researchers have tried to extend the models to go beyond word level to achieve phrase-level or sentence-level representations (Mitchell & Lapata, 2010; Zanzotto et al., 2010; Yessenalina & Cardie, 2011; Grefenstette et al., 2013; Mikolov et al., 2013c). For instance, a simple approach is using a weighted average of all the words in the document. A more sophisticated approach is combining the word vectors in an order given by a parse tree of a sentence, using matrix-vector operations (Socher et al., 2011b). Both approaches have weaknesses. The first approach, weighted averaging of word vectors, loses the word order in the same way as the standard bag-of-words models do. The second approach, using a parse tree to combine word vectors, has been shown to work for only sentences because it relies on parsing.
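
For reference, the first, order-insensitive baseline fits in a few lines; average_doc_vector is a hypothetical helper, and the per-word weights could come from, e.g., tf-idf.

```python
import numpy as np

def average_doc_vector(tokens, word_vecs, weights=None):
    """Represent a document as a (weighted) average of its word vectors.
    Like bag-of-words, this representation discards word order."""
    kept = [t for t in tokens if t in word_vecs]
    vecs = np.array([word_vecs[t] for t in kept])
    if weights is None:
        return vecs.mean(axis=0)
    w = np.array([weights.get(t, 1.0) for t in kept])
    return (vecs * w[:, None]).sum(axis=0) / w.sum()

# Toy usage with 3-dimensional word vectors.
word_vecs = {"strong": np.array([1.0, 0.0, 0.0]), "film": np.array([0.0, 1.0, 0.0])}
print(average_doc_vector(["strong", "film"], word_vecs, weights={"strong": 2.0}))
```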

Paragraph Vector is capable of constructing representations of input sequences of variable length. Unlike some of the previous approaches, it is general and applicable to texts of any length: sentences, paragraphs, and documents. It does not require task-specific tuning of the word weighting function nor does it rely on the parse trees. Further in the paper, we will present experiments on several benchmark datasets that demonstrate the advantages of Paragraph Vector. For example, on the sentiment analysis task, we achieve new state-of-the-art results, better than complex methods, yielding a relative improvement of more than 16% in terms of error rate. On a text classification task, our method convincingly beats bag-of-words models, giving a relative improvement of about 30%.

Algorithms

Paragraph Vector: A distributed memory model

Our approach for learning paragraph vectors is inspired by the methods for learning the word vectors. The inspiration is that the word vectors are asked to contribute to a prediction task about the next word in the sentence. So despite the fact that the word vectors are initialized randomly, they can eventually capture semantics as an indirect result of the prediction task. We will use this idea in our paragraph vectors in a similar manner. The paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph.

In our Paragraph Vector framework (see Figure 2), every paragraph is mapped to a unique vector, represented by a column in matrix D and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. In the experiments, we use concatenation as the method to combine the vectors.

Figure 2. A framework for learning paragraph vector.

More formally, the only change in this model compared to the word vector framework is in equation 1, where h is constructed from W and D.
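
For reference, equation 1 gives the unnormalized log-probability of each candidate word, which the softmax then normalizes; in PV-DM the function h is built from columns of both W and D. The last line below is a paraphrase of that statement rather than an equation copied from the paper.

```latex
% Word-vector framework: h is a concatenation or average of word vectors from W (equation 1).
y = b + U\,h(w_{t-k}, \ldots, w_{t+k}; W)

% Softmax over the unnormalized log-probabilities y_i:
p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}

% PV-DM (paraphrase): h is constructed from both W and the paragraph matrix D.
y = b + U\,h(d, w_{t-k}, \ldots, w_{t+k}; W, D)
```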

The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM).

The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. The word vector matrix W, however, is shared across paragraphs. I.e., the vector for “powerful” is the same for all paragraphs.

The paragraph vectors and word vectors are trained using stochastic gradient descent and the gradient is obtained via backpropagation. At every step of stochastic gradient descent, one can sample a fixed-length context from a random paragraph, compute the error gradient from the network in Figure 2 and use the gradient to update the parameters in our model.
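
A minimal sketch of one such update, assuming the concatenation variant and a plain softmax for clarity (the paper uses hierarchical softmax for speed); all sizes, names, and the sampled indices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, p, q, k, lr = 100, 500, 50, 50, 3, 0.025    # paragraphs, vocab, dims, context words, step size
D = rng.normal(scale=0.1, size=(N, p))            # paragraph vectors (one row per paragraph)
W = rng.normal(scale=0.1, size=(V, q))            # word vectors (shared across paragraphs)
U = rng.normal(scale=0.1, size=(V, p + k * q))    # softmax weights
b = np.zeros(V)

def sgd_step(doc_id, context_ids, target_id):
    # Forward: h is the concatenation of the paragraph vector and the context word vectors.
    h = np.concatenate([D[doc_id]] + [W[i] for i in context_ids])
    scores = U @ h + b
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Backward: gradient of the cross-entropy loss -log probs[target_id].
    g_scores = probs.copy()
    g_scores[target_id] -= 1.0
    g_h = U.T @ g_scores
    U[...] -= lr * np.outer(g_scores, h)
    b[...] -= lr * g_scores
    D[doc_id] -= lr * g_h[:p]                      # update only this paragraph's vector
    for j, i in enumerate(context_ids):
        W[i] -= lr * g_h[p + j * q : p + (j + 1) * q]

sgd_step(doc_id=0, context_ids=[3, 17, 42], target_id=7)   # one sampled (context, next word) pair
```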

At prediction time, one needs to perform an inference step to compute the paragraph vector for a new paragraph. This is also obtained by gradient descent. In this step, the parameters for the rest of the model, the word vectors W and the softmax weights, are fixed.

Suppose that there are N paragraphs in the corpus, M words in the vocabulary, and we want to learn paragraph vectors such that each paragraph is mapped to p dimensions and each word is mapped to q dimensions, then the model has the total of N × p + M × q parameters (excluding the softmax parameters). Even though the number of parameters can be large when N is large, the updates during training are typically sparse and thus efficient.
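
As a worked example with hypothetical sizes (25,000 paragraphs, a 100,000-word vocabulary, and 400-dimensional paragraph and word vectors):

```python
N, M, p, q = 25_000, 100_000, 400, 400   # hypothetical corpus and vector sizes
total = N * p + M * q                    # N x p + M x q, excluding the softmax parameters
print(total)                             # 50,000,000 parameters
```

Each training step touches only one row of D and a handful of rows of W, which is why the updates stay sparse despite the large parameter count.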

After being trained, the paragraph vectors can be used as features for the paragraph (e.g., in lieu of or in addition to bag-of-words). We can feed these features directly to conventional machine learning techniques such as logistic regression, support vector machines or K-means.
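
A toy end-to-end sketch of this use, again with the gensim and scikit-learn libraries (hypothetical data and labels; neither library is part of the paper):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy labeled corpus; real data would replace these few hand-written examples.
corpus = [TaggedDocument(words=["a", "powerful", "and", "moving", "film"], tags=[0]),
          TaggedDocument(words=["weak", "plot", "and", "boring", "dialogue"], tags=[1])]
labels = [1, 0]   # e.g., positive / negative sentiment

model = Doc2Vec(corpus, dm=1, vector_size=20, window=2, min_count=1, epochs=40)
X = [model.dv[i] for i in range(len(corpus))]          # trained paragraph vectors as features
clf = LogisticRegression().fit(X, labels)

new_vec = model.infer_vector(["a", "strong", "film"])  # unseen paragraph -> inferred vector
print(clf.predict([new_vec]))
```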

In summary, the algorithm itself has two key stages: 1) training to get word vectors W, softmax weights U, b and paragraph vectors D on already seen paragraphs; and 2) “the inference stage” to get paragraph vectors D for new paragraphs (never seen before) by adding more columns in D and gradient descending on D while holding W, U, b fixed. We use D to make a prediction about some particular labels using a standard classifier, e.g., logistic regression.

Advantages

Paragraph vectors also address some of the key weaknesses of bag-of-words models. First, they inherit an important property of the word vectors: the semantics of the words. In this space, “powerful” is closer to “strong” than to “Paris.” The second advantage of the paragraph vectors is that they take into consideration the word order, at least in a small context, in the same way that an n-gram model with a large n would do. This is important, because the n-gram model preserves a lot of information of the paragraph, including the word order. That said, our model is perhaps better than a bag-of-n-grams model because a bag-of-n-grams model would create a very high-dimensional representation that tends to generalize poorly.
