[Translation complete] Deep Structured Semantic Models (DSSM) paper translation

Latest recommended article published on 2022-04-03 13:08:38

訢詡

Category: Deep Learning (NLP). Tags: DSSM, natural language processing, NLP, semantic similarity, recommendation algorithms

Article link: https://blog.csdn.net/Andrwin/article/details/116449297

Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

ABSTRACT

Latent semantic models, such as LSA, intend to map a query to its relevant documents at the semantic level where keyword-based matching often fails. In this study we strive to develop a series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them. The proposed deep structured semantic models are discriminatively trained by maximizing the conditional likelihood of the clicked documents given a query using the clickthrough data. To make our models applicable to large-scale Web search applications, we also use a technique called word hashing, which is shown to effectively scale up our semantic models to handle large vocabularies, which are common in such tasks. The new models are evaluated on a Web document ranking task using a real-world data set. Results show that our best model significantly outperforms other latent semantic models, which were considered state-of-the-art in performance prior to the work presented in this paper.
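To make the abstract's three ideas concrete (word hashing, relevance as a similarity in a shared space, and the conditional likelihood of the clicked document), here is a minimal sketch. It rests on several assumptions that this section does not state: letter trigrams with `#` word-boundary marks, cosine similarity computed directly on the hashed count vectors (the learned deep projection is left out entirely), and a hypothetical smoothing factor `gamma`. All function names are illustrative, not the authors' code.

```python
import math
from collections import Counter


def letter_ngrams(word, n=3):
    """Split one word into letter n-grams after adding '#' boundary marks."""
    marked = f"#{word.lower()}#"
    if len(marked) <= n:
        return [marked]
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def hash_text(text, n=3):
    """Bag of letter n-grams for a whole query or document title."""
    counts = Counter()
    for word in text.split():
        counts.update(letter_ngrams(word, n))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def click_posterior(query, docs, gamma=10.0):
    """Softmax over scaled cosine scores: P(doc | query) for each candidate."""
    q = hash_text(query)
    scores = [gamma * cosine(q, hash_text(d)) for d in docs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(query, docs, clicked_index, gamma=10.0):
    """Loss for one query: -log P(clicked doc | query)."""
    return -math.log(click_posterior(query, docs, gamma)[clicked_index])
```

In a real model, a neural network would sit between `hash_text` and `cosine`, and training would adjust its weights to reduce `negative_log_likelihood` over the clickthrough data. The sketch only shows how the pieces connect.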

Keywords:
Deep Learning, Semantic Model, Clickthrough Data, Web Search

1. INTRODUCTION

Modern search engines retrieve Web documents mainly by matching keywords in documents with those in search queries. However, lexical matching can be inaccurate because a concept is often expressed using different vocabularies and language styles in documents and queries.

Latent semantic models such as latent semantic analysis (LSA) are able to map a query to its relevant documents at the semantic level where lexical matching often fails (e.g., [6][15][2][8][21]). These latent semantic models address the language discrepancy between Web documents and search queries by grouping different terms that occur in a similar context into the same semantic cluster. Thus, a query and a document, represented as two vectors in the lower-dimensional semantic space, can still have a high similarity score even if they do not share any term. Extending from LSA, probabilistic topic models such as probabilistic LSA (PLSA) and Latent Dirichlet Allocation (LDA) have also been proposed for semantic matching [15][2]. However, these models are often trained in an unsupervised manner using an objective function that is only loosely coupled with the evaluation metric for the retrieval task. Thus the performance of these models on Web search tasks is not as good as originally expected.

Recently, two lines of research have been conducted to extend the aforementioned latent semantic models, which will be briefly reviewed below.

First, clickthrough data, which consists of a list of queries and their clicked documents, is exploited for semantic modeling so as to bridge the language discrepancy between search queries and Web documents [9][10]. For example, Gao et al. [10] propose the use of Bi-Lingual Topic Models (BLTMs) and linear Discriminative Projection Models (DPMs) for query-document matching at the semantic level. These models are trained on clickthrough data using objectives tailored to the document ranking task. More specifically, BLTM is a generative model that requires that a query and its clicked documents not only share the same distribution over topics but also contain similar fractions of words assigned to each topic. In contrast, the DPM is learned using the S2Net algorithm [26] that follows the pairwise learning-to-rank paradigm outlined in [3]. After projecting term vectors of queries and documents into concept vectors in a low-dimensional semantic space, the concept vectors of the query and its clicked documents have a smaller distance than those of the query and its unclicked documents. Gao et al. [10] report that both BLTM and DPM significantly outperform the unsupervised latent semantic models, including LSA and PLSA, in the document ranking task. However, the training of BLTM, though using clickthrough data, is to maximize a log-likelihood criterion which is sub-optimal for the evaluation metric for document ranking. On the other hand, the training of DPM involves large-scale matrix multiplications. The sizes of these matrices often grow quickly with the vocabulary size, which could be of an order of millions in Web search tasks. In order to make the training time tolerable, the vocabulary was pruned aggressively. Although a small vocabulary makes the models trainable, it leads to suboptimal performance.
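The pairwise idea behind a linear projection model can be illustrated with a small sketch: project the term vectors into a concept space, then penalize cases where the clicked document does not score above the unclicked one. The logistic loss on the cosine gap, the tiny 2 x 4 projection matrix, and all names below are assumptions chosen for illustration. This is not the S2Net algorithm as published.

```python
import math


def project(matrix, term_vector):
    """Linear projection: concept = matrix @ term_vector (matrix is k x V)."""
    return [sum(w * x for w, x in zip(row, term_vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_loss(matrix, query, clicked, unclicked):
    """Logistic loss on the score gap; small when the clicked doc scores higher."""
    q = project(matrix, query)
    gap = cosine(q, project(matrix, clicked)) - cosine(q, project(matrix, unclicked))
    return math.log1p(math.exp(-gap))


# Hypothetical 4-term vocabulary: terms 0 and 1 map to concept 0,
# terms 2 and 3 map to concept 1.
A = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
query = [1, 0, 0, 0]      # contains only term 0
clicked = [0, 1, 0, 0]    # contains only term 1, same concept as the query
unclicked = [0, 0, 1, 0]  # contains only term 2, a different concept

loss_good = pairwise_loss(A, query, clicked, unclicked)
loss_bad = pairwise_loss(A, query, unclicked, clicked)
```

Note that `query` and `clicked` share no term, yet their projected vectors coincide, which is the property latent semantic models rely on. The matrix `A` has one column per vocabulary term, which is why the cost of such models grows with vocabulary size.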

In the second line of research, Salakhutdinov and Hinton extended the semantic modeling using deep auto-encoders [22]. They demonstrated that hierarchical semantic structure embedded in the query and the document can be extracted via deep learning. Superior performance to the conventional LSA is reported [22]. However, the deep learning approach they used still adopts an unsupervised learning method where the model parameters are optimized for the reconstruction of the documents rather than for differentiating the relevant documents from the irrelevant ones for a given query. As a result, the deep learning models do not significantly outperform the baseline retrieval models based on keyword matching. Moreover, the semantic hashing model also faces the scalability challenge regarding large-scale matrix multiplication. We will show in this paper that the capability of learning semantic models with large vocabularies is crucial to obtain good results in real-world Web search tasks.

In this study, extending from both research lines discussed above, we propose a series of Deep Structured Semantic Models (DSSM) for Web search. More specifically, our best model uses a deep neural network (DNN) to rank a set of documents for a given query as follows. First, a non-linear projection is performed to map the query and the documents to a common semantic space. Then, the relevance of each document given the query is calculated as the cosine similarity between their vectors in that semantic space. The neural network models are discriminatively trained using the clickthrough data such that the conditional likelihood of the clicked document given the query is maximized. Different from the previous latent semantic models that are learned in an unsupervised fashion, our models are optimized directly for Web document ranking, and thus give superior performance, as we will show shortly. Furthermore, to deal with large vocabularies, we propose the so-called word hashing method, through which the high-dimensional term vectors of queries or documents are projected to low-dimensional letter based n-gram vectors with little information loss. In our experiments, we show that, by adding this extra layer of representation in semantic models, word hashing enables us to learn discriminatively the semantic models with large vocabularies, which are essential for Web search. We evaluated the proposed DSSMs on a Web document ranking task using a real-world data set. The results show that our best model outperforms all the competing methods with a significant