Improving web-query processing through semantic knowledge and user feedback (1)

Latest recommended article published on 2021-06-18 09:32:50

数据出境研究所

Category: Data Mining. Tags: semantic, processing, user, query, language, search

Article link: https://blog.csdn.net/charcle/article/details/2427968

I read a paper on search engines and translated it below. My ability is limited, so the translation may contain mistakes.

Improving web-query processing through semantic knowledge and user feedback

Abstract

Although search engines are very useful for obtaining information from the World Wide Web, users still have problems obtaining the most relevant information when processing their web queries. Prior research has attempted to use different types of knowledge to improve web-query processing with various levels of success. This research presents a methodology for processing web queries that employs semantic knowledge about different application domains from ResearchCyc, as well as linguistic knowledge from WordNet. An analysis of different queries from different application domains using the semantic and linguistic knowledge illustrates how more relevant results can be obtained.

Keywords: Web-query processing; Query expansion; ResearchCyc; Cyc; Semantic knowledge; Knowledge repositories

1. Introduction

The continued explosion of available information on the World Wide Web has led to the need for processing queries intelligently to address more of the user’s intended requirements than previously possible [1]. Doing so requires some notion of the context within which the query is being posed and the semantics of the query itself. In our context, intelligent means that the queries should be interpreted and extended in order to contextualize and disambiguate them.

Several knowledge repositories have been created to support agents (humans or programs) to increase the intelligence of their tasks. Examples include WordNet [2], Cyc [3], and ConceptNet [4]. Although all of these are useful for their intended purposes, they are limited as a general repository in several ways. Linguistic repositories, such as WordNet, do not capture the semantic relationships or integrity constraints between concepts. Semantic repositories such as Cyc do not represent linguistic relationships of the concepts (e.g. whether two concepts are synonyms). Some of the existing repositories are domain dependent and only represent information about certain aspects of the domains, not the complete domain. Research on query extension has used knowledge repositories to develop tools that assist the user in processing queries that capture the user’s intent [5], [6], [7], [8], [9], [10] and [11]. Most query extension approaches use only linguistic knowledge [12]. However, linguistic repositories lack semantic knowledge, so query expansion cannot deal with several issues: (1) knowledge related to the domain of the query, (2) common sense inferences, or (3) the semantic relationships in which the concepts of the query can participate. Grootjen and Weide [6] focus on creating a small lattice of concepts to support query expansion. In contrast, our approach focuses on grouping and using existing knowledge in large knowledge bases for query expansion in an efficient manner.

In this research, we consider semantic repositories to be repositories that represent semantic information about a domain. They are independent of syntax, word forms, and languages, but tend to be domain and culture dependent. Semantic repositories need linguistic knowledge to identify relevant concepts from the repository that represent a given term used in the query. Therefore, the integration of linguistic and semantic information into one repository could be useful to increase the contexts where knowledge in these repositories can be used successfully. Table 1 shows examples of improvement in search results using semantic and linguistic knowledge. The queries shown in Table 1 have been executed manually. The “% Rel” metric is the percentage of the number of documents that are relevant to the query in the first 10 documents returned by the search engine.

Table 1.

Improvement in search results using semantic and linguistic knowledge sources

Query	Domain	Results (original query)	% Rel (original query)	Knowledge source	Results (expanded query)	% Rel (expanded query)
Pets	Animals	53,600,000	70	ResearchCyc	24,600,000	100
				ResearchCyc	10,400,000	95
	Buying animals in Atlanta			ResearchCyc + WordNet	260,000	95
Nike Georgia	Buying Sport Stuff	2,180,000	0	ResearchCyc	1,550,000	10
Flute Bohemian Drink	Drink	57,900	25	ResearchCyc	153,000	82.5
Bonderdorfers Atlanta	Music	73	50	WordNet	49	90
Which universities offer online degrees?	Education		50	WordNet		90
Find cookie stores	Restaurant	2,900,000	20	ConceptNet + textual sources	950,000	90
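
The "% Rel" metric used in Table 1 can be sketched in a few lines of Python. This is only an illustration: the function name `percent_relevant` and the relevance judgments are hypothetical, and dividing by the number of results actually returned (when fewer than 10 come back) is an assumption the article does not spell out.

```python
def percent_relevant(relevance_flags, k=10):
    """Percentage of relevant documents among the first k results.

    relevance_flags: list of booleans in ranked order (True = relevant).
    Assumption: if fewer than k results exist, divide by the number returned.
    """
    top = relevance_flags[:k]
    if not top:
        return 0.0
    return 100.0 * sum(1 for flag in top if flag) / len(top)


# Hypothetical manual judgments for the first 10 results of one query.
judged = [True, True, False, True, True, True, False, True, False, True]
print(percent_relevant(judged))  # 7 of 10 relevant -> 70.0
```

Because the judgments are made by hand on only the first 10 results, the metric measures how useful the first results page is, not overall recall.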

The Cyc ontology is a repository developed to capture and represent common sense. ResearchCyc (http://research.cyc.com) is a huge semantic repository. It should be possible to use techniques from Cyc [13], [14] and [15] to extend ResearchCyc with linguistic information from the WordNet lexicon and factual information from the World Wide Web.

The objective of this research is to demonstrate that the use of semantic and linguistic knowledge together improves the query refinement process. To do so, we study the problems associated with the web-query process and show how ResearchCyc, in combination with WordNet, helps improve query results for web searches.

This research makes several contributions. First, it demonstrates that semantic and linguistic knowledge together improve query expansion. Second, the research identifies and formalizes web-query problems and presents a query classification scheme that explains why, in some cases, the query expansion may not be done successfully, even if the repository used to support such a task is complete. Such information is used to identify the structure and knowledge that an ontology should have to increase the chances of improving different kinds of queries.

2. Web queries

The purpose of a web query is to search for information that best reflects the user’s needed information. In this research, semantics is defined as the meaning, or essential message, of terms. To carry out useful research for dealing with semantics, symbols must be manipulated in ways that are meaningful and useful [16].

When a web query is processed, the expected result is denoted E_R. This information generally belongs to several domains, the intended domains D_I. Therefore, the expected result is contained in the knowledge defined for the intersection of all the intended domains: E_R ⊆ K(D_I₁) ∩ K(D_I₂) ∩ … ∩ K(D_Iₙ), where K(D) represents a function that returns the knowledge defined in the domain D, as illustrated in Fig. 1.

Fig. 1. Constraining the search domains for web queries.
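
The containment of the expected result in the intersection of the intended domains can be illustrated with plain Python sets. This is a toy sketch: the domain names, the page identifiers, and the helper `knowledge_intersection` are all hypothetical, and K(D) is modeled simply as a set of page identifiers.

```python
# Hypothetical K(D): each intended domain maps to the set of pages it covers.
K = {
    "sport_stuff": {"page1", "page2", "page3", "page4"},
    "commercial_info": {"page1", "page2", "page4", "page5"},
    "geography_georgia_usa": {"page1", "page2", "page6"},
}


def knowledge_intersection(domains, knowledge):
    """Return K(D_1) ∩ K(D_2) ∩ ... for the given domain names."""
    sets = [knowledge[d] for d in domains]
    return set.intersection(*sets) if sets else set()


intended = ["sport_stuff", "commercial_info", "geography_georgia_usa"]
candidates = knowledge_intersection(intended, K)

# Hypothetical expected result E_R: it must lie inside the intersection.
expected_result = {"page1"}
print(sorted(candidates))
print(expected_result <= candidates)  # subset test: E_R ⊆ intersection
```

Each additional intended domain can only shrink the candidate set, which is why knowing the intended domains constrains the search.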

To perform a search, the user creates an initial query (Q_I) by selecting some terms w₁, …, w_k (called query terms, Q_w) to describe what he or she is searching for. The problem arises from the ambiguity of the languages humans use. The user considers a query, Q_w, within a given context (i.e., the context of the intended domains). Since words have several senses in several domains, query search techniques are not able to determine which of the senses of a given query term is the one in which the user is interested. Given this ambiguity, the result of the query tends to contain results that deal with a number of domains D_O₁, …, D_Om greater than the number of intended domains (m > n, where n is the number of intended domains). The resultant domains are called obtained domains, with each depending on a subset of the query terms, D_O(W), where W ⊆ {w₁, …, w_k}.

Suppose a user lives in Georgia, USA, and wants to buy Nike-brand sports shoes. E_R is “Places in Georgia (USA) where I can find a pair of Nike shoes”. E_R is composed of domain information that deals with sport stuff (sport shoes), commercial information (the Nike brand sells sport stuff), and geographical information (which commercial organizations in Georgia sell Nike products). These three domains are the intended domains. Suppose the user defines a query that contains two words, Q_w = “Nike Georgia”. Then, some of the D_O will be D_O₁(Nike) = Commercial Brand, D_O₂(Nike) = Greek Mythology (Nike is the goddess of Victory in Greek Mythology), D_O₃(Georgia) = Central-European Country, D_O₄(Georgia) = State in the United States of America, and D_O₅(Georgia) = Football Team Georgia Bulldogs. The query results will be the web pages that deal with the presented D_O.
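
The "Nike Georgia" example can be made concrete with a small Python sketch. The sense inventory below is copied from the example above; the functions `obtained_domains` and `readings` are hypothetical helpers, and enumerating every combination of senses is an added illustration rather than a step described in the article.

```python
from itertools import product

# Sense inventory taken from the example; a real system would look these up
# in a repository such as ResearchCyc or WordNet.
SENSES = {
    "Nike": ["Commercial Brand", "Greek Mythology"],
    "Georgia": [
        "Central-European Country",
        "State in the United States of America",
        "Football Team Georgia Bulldogs",
    ],
}


def obtained_domains(query, senses):
    """List every (term, domain) pair a plain keyword search may match."""
    return [(term, domain)
            for term in query.split()
            for domain in senses.get(term, [])]


def readings(query, senses):
    """Enumerate every joint interpretation of the query terms."""
    terms = query.split()
    options = [senses.get(term, ["unknown"]) for term in terms]
    return [dict(zip(terms, combo)) for combo in product(*options)]


domains = obtained_domains("Nike Georgia", SENSES)
print(len(domains))                            # m = 5 obtained domains
print(len(readings("Nike Georgia", SENSES)))   # 2 * 3 = 6 joint readings
```

Only one of the six joint readings (Commercial Brand, State in the United States of America) matches what the user meant, which shows why the query needs to be disambiguated before it is sent to the search engine.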

2.1. Web-query problems

The ambiguity of the language used in the query, the possible partial knowledge of the user, and the difficulty in determining what the user really wants, lead to the following problems that affect the processing of web queries.

– Identification of a good initial query: There is no systematic method or set of guidelines to support the user in identifying the best terms for a query. A good selection of Q_I is important. Terms that are too general may produce too many irrelevant obtained domains (D_O) and results. Terms that are too specific may miss some of the results that match E_R because those documents use plainer language.

– Resolving language ambiguity: Documents that deal with the same domain can use different terms to describe the same concepts. For a given concept (sport stuff), some documents may use that exact term (sport stuff), others may use synonyms (sport material), and others may use terms that express the concept more generally (playing sport artifact) or more specifically (trainer).

– Identifying the relevant results: It may be difficult to detect whether a given result of the query is valid. A result is valid if it belongs to the expected result. The problem is that the expected result is in the mind of the user. A result is also valid if it belongs to the intersection of the intended domains. Unfortunately, we do not know what those domains are, and, due to word ambiguity problems, we cannot conclude that the obtained domains are the expected ones. Hence, it is not possible to identify which results are relevant for the user and which are irrelevant.
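
The language-ambiguity problem listed above is what query expansion tries to address. The following is a minimal Python sketch, assuming a tiny hand-written lexicon in place of WordNet or ResearchCyc; the names `LEXICON`, `expand_concept`, and `to_or_query` are hypothetical, and the quoted `OR` syntax is an assumption about the target search engine.

```python
# Toy lexicon built from the "sport stuff" example; a real system would query
# WordNet (synonyms, hypernyms, hyponyms) or a semantic repository instead.
LEXICON = {
    "sport stuff": {
        "synonyms": ["sport material"],
        "broader": ["playing sport artifact"],
        "narrower": ["trainer"],
    },
}


def expand_concept(concept, lexicon,
                   relations=("synonyms", "broader", "narrower")):
    """Return the concept plus its related terms, without duplicates."""
    entry = lexicon.get(concept, {})
    terms = [concept]
    for relation in relations:
        for term in entry.get(relation, []):
            if term not in terms:
                terms.append(term)
    return terms


def to_or_query(terms):
    """Join terms as quoted alternatives (assumed search-engine syntax)."""
    return " OR ".join(f'"{term}"' for term in terms)


expanded = expand_concept("sport stuff", LEXICON)
print(to_or_query(expanded))
```

Restricting `relations` to `("synonyms",)` gives a narrower expansion. Adding broader terms trades precision for recall, which relates to the first problem in the list: terms that are too general bring in irrelevant domains.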

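Translation: Improving web-query processing through semantic knowledge and user feedback

Abstract:

Although search engines are very useful for obtaining information from the Internet, users still have many problems getting the most relevant information when they run web queries. Earlier research has tried to use different types of knowledge to improve web-query processing, with varying degrees of success. This paper proposes a method for web-query processing that uses semantic knowledge about different application domains from ResearchCyc (translator's note: Cyc is an artificial intelligence project that attempts to build a comprehensive ontology and database of everyday common-sense knowledge, with the goal of giving AI human-like reasoning ability), as well as linguistic knowledge from WordNet (translator's note: WordNet is an English lexicon grounded in cognitive linguistics, designed jointly by psychologists, linguists, and computer engineers at Princeton University. Rather than merely listing words alphabetically, it organizes them by meaning into a "network of words". It is a broad-coverage lexical-semantic network of English: nouns, verbs, adjectives, and adverbs are each organized into networks of synonym sets, every synonym set represents one basic semantic concept, and the sets are linked by various relations). An analysis of different queries from different domains using this semantic and linguistic knowledge shows that more relevant search results can be obtained.

Keywords: Web-query processing; Query expansion; ResearchCyc; Cyc; Semantic knowledge; Knowledge repositories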

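1. Introduction

As the information available on the Internet keeps growing, queries need to be processed intelligently so that the user's intended requirements are met better than before. This requires some notion of the context in which the query is posed, as well as the semantics of the query itself. In the context of this paper, intelligent means that queries can be interpreted and extended so that they can be placed in context and disambiguated.

Several knowledge repositories have been built to help agents (humans or programs) carry out their tasks more intelligently, for example WordNet, Cyc, and ConceptNet. Although all of them are useful for their intended purposes, they are limited as general-purpose repositories in several ways. Linguistic repositories such as WordNet cannot capture the semantic relationships or integrity constraints between concepts. Semantic repositories such as Cyc cannot capture linguistic relationships between concepts (for example, whether two concepts are synonyms). Some existing repositories are domain dependent and represent information about only certain aspects of a domain, not the whole domain. Research on query expansion has used knowledge repositories to develop tools that help the user process queries that capture the user's intent. Most query expansion approaches use only linguistic knowledge. However, linguistic repositories lack semantic knowledge, so query expansion cannot handle the following issues: (1) knowledge related to the domain of the query, (2) common-sense inference, and (3) the semantic relationships in which the concepts of the query can participate. The approach of Grootjen and Weide is to create a small concept lattice to support query expansion. In contrast, our approach groups and uses the knowledge already present in large knowledge bases to support query expansion in an efficient manner.

In this paper, we regard semantic repositories as repositories that represent semantic information about a domain. They are independent of syntax, word forms, and languages, but depend on domain and culture. Semantic repositories need linguistic knowledge to identify the concepts in the repository that represent a given term used in a query. Therefore, integrating linguistic and semantic information into one repository is useful for broadening the contexts in which the knowledge in these repositories can be used successfully. Table 1 lists examples of improved search results obtained with semantic and linguistic knowledge. The queries in the table were executed manually. The "% Rel" metric is the percentage of documents relevant to the query among the first 10 documents returned by the search engine.

The Cyc ontology (translator's note: a large dictionary and knowledge base) is a repository for capturing and representing common sense. ResearchCyc is a huge semantic repository. It should be possible to use techniques from Cyc to extend ResearchCyc with linguistic information from the WordNet lexicon and factual information from the World Wide Web.

The objective of this research is to demonstrate that using semantic and linguistic knowledge together improves the query refinement process. To that end, we study the problems associated with the web-query process and show how ResearchCyc and WordNet help improve query results.

This research makes several contributions. First, it shows that semantic and linguistic knowledge together improve query expansion. Second, it identifies and formalizes web-query problems and proposes a query classification scheme that explains why, in some cases, query expansion cannot be carried out successfully even when the repository used to support the task is complete. This information is used to identify the structure and knowledge an ontology should have in order to increase the chances of improving different kinds of queries.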

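2. Web queries

The purpose of a web query is to find the information that best reflects what the user needs. In this research, semantics is defined as the meaning, or essential message, of terms. To do useful research on semantics, symbols must be manipulated in meaningful and useful ways.

When a web query is carried out, the expected result is E_R. In general, E_R belongs to several domains, the intended domains D_I. Therefore, the expected result is contained in the intersection of the intended domains, E_R ⊆ K(D_I₁) ∩ K(D_I₂) ∩ … ∩ K(D_Iₙ), where K(D) denotes a function that returns the knowledge in domain D.

To run a query, the user creates an initial query (Q_I), choosing query terms to describe what he or she is looking for. The problem arises from the ambiguity of human language. The user thinks of the query within a given context. Because a word has different meanings in different domains, query search techniques cannot determine which sense of a given query term the user is interested in. This ambiguity causes the query results to cover more domains than the intended ones. These resulting domains are called obtained domains, and each of them depends on a subset W of the query terms, where W ⊆ {w₁, …, w_k}.

Suppose a user lives in the US state of Georgia and wants to buy a pair of Nike sports shoes. The expected result E_R is "places in Georgia where Nike shoes can be bought". E_R comprises sports, commercial, and geographical information. These three domains are the intended domains. Suppose the user defines a query containing the two query terms "Nike Georgia". Then the obtained domains include D_O(Nike) = Commercial Brand, D_O(Nike) = Greek Mythology, D_O(Georgia) = a Central-European country, D_O(Georgia) = a state in the USA, and D_O(Georgia) = a football team. The query results are the web pages that deal with these obtained domains.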

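2.1. Web-query problems

The ambiguity of the language used in a query, the user's partial knowledge, and the difficulty of determining what the user really wants lead to the following problems that affect web queries.

– Identifying a good initial query: There is no systematic method or guideline to help the user choose the best query terms. A good choice of query terms is important. Terms that are too general produce too many irrelevant domains and results. Terms that are too specific miss some results that match E_R, because those documents use plainer language.

– Resolving language ambiguity: Documents that deal with the same domain can use different terms to describe the same concept. For a given concept (sport stuff), some documents use that term (sport stuff), some use synonyms (sport material), and others use terms for the same concept that are more general or more specific.

– Identifying the relevant results: It is hard to judge whether a given query result is valid. A result is valid if it belongs to E_R. The problem is that E_R exists only in the user's mind. A result is also valid if it belongs to the intersection of the intended domains. Unfortunately, we do not know what those domains are, and, because of word ambiguity, we cannot conclude that the obtained domains are the expected ones. Hence, it is not possible to tell which results are relevant to the user and which are not.

(To be continued)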