I read an article about search engines and decided to translate it. My skills are limited...
Improving web-query processing through semantic knowledge and user feedback
Keywords: Web-query processing; Query expansion; ResearchCyc; Cyc; Semantic knowledge; Knowledge repositories
1. Introduction
The continued explosion of available information on the World Wide Web has led to the need for processing queries intelligently to address more of the user’s intended requirements than previously possible [1]. Doing so requires some notion of the context within which the query is being posed and of the semantics of the query itself. In our context, intelligent means that queries should be interpreted and extended in order to contextualize and disambiguate them.
Several knowledge repositories have been created to help agents (humans or programs) perform their tasks more intelligently. Examples include WordNet [2], Cyc [3], and ConceptNet [4]. Although all of these are useful for their intended purposes, they are limited as general repositories in several ways. Linguistic repositories, such as WordNet, do not capture the semantic relationships or integrity constraints between concepts. Semantic repositories, such as Cyc, do not represent linguistic relationships between concepts (e.g., whether two concepts are synonyms). Some of the existing repositories are domain dependent and represent only certain aspects of their domains, not the complete domain. Research on query extension has used knowledge repositories to develop tools that assist the user in processing queries that capture the user’s intent [5], [6], [7], [8], [9], [10] and [11]. Most query extension approaches use only linguistic knowledge [12]. However, linguistic repositories lack semantic knowledge, so query expansion based on them cannot deal with several issues: (1) knowledge related to the domain of the query, (2) common sense inferences, or (3) the semantic relationships in which the concepts of the query can participate. Grootjen and Weide [6] focus on creating a small lattice of concepts to support query expansion. In contrast, our approach focuses on grouping and using existing knowledge in large knowledge bases for query expansion in an efficient manner.
In this research, we consider semantic repositories to be repositories that represent semantic information about a domain. They are independent of syntax, word forms, and languages, but tend to be domain and culture dependent. Semantic repositories need linguistic knowledge to identify the concepts in the repository that represent a given term used in the query. Therefore, integrating linguistic and semantic information into one repository could increase the number of contexts in which the knowledge in these repositories can be used successfully. Table 1 shows examples of improvement in search results using semantic and linguistic knowledge. The queries shown in Table 1 have been executed manually. The “% Rel” metric is the percentage of the first 10 documents returned by the search engine that are relevant to the query.
Table 1. Improvement in search results using semantic and linguistic knowledge sources
Query | Domain | Results (initial) | % Rel (initial) | Source | Results (with source) | % Rel (with source) |
Pets | Animals | 53,600,000 | 70 | ResearchCyc | 24,600,000 | 100 |
 | | | | ResearchCyc | 10,400,000 | 95 |
Buying animals in Atlanta | | | | ResearchCyc + WordNet | 260,000 | 95 |
Nike Georgia | Buying Sport Stuff | 2,180,000 | 0 | ResearchCyc | 1,550,000 | 10 |
Flute Bohemian Drink | Drink | 57,900 | 25 | ResearchCyc | 153,000 | 82.5 |
Bonderdorfers Atlanta | Music | 73 | 50 | WordNet | 49 | 90 |
Which universities offer online degrees? | Education | | 50 | WordNet | | 90 |
Find cookie stores | Restaurant | 2,900,000 | 20 | ConceptNet + textual sources | 950,000 | 90 |
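As a rough illustration of how the “% Rel” column in Table 1 could be computed, the sketch below takes a list of relevance judgments for the returned documents and reports the percentage that are relevant among the first 10. The function name `percent_relevant`, the binary judgments, and the sample data are assumptions made for this example; the text does not say how relevance was judged, and values such as 82.5 in the table suggest the published figures may have been averaged or graded in some way not described here.

```python
def percent_relevant(relevance_flags, k=10):
    """Percentage of relevant documents among the first k results.

    relevance_flags: list of booleans, one per returned document, in rank order.
    Assumption: if fewer than k documents are returned, the percentage is
    taken over the documents that are actually available.
    """
    top = relevance_flags[:k]
    if not top:
        return 0.0
    relevant = sum(1 for flag in top if flag)
    return 100.0 * relevant / len(top)


# Hypothetical judgments for the first 10 results of a single query.
judgments = [True, True, False, True, True, True, False, True, False, True]
print(percent_relevant(judgments))  # 70.0
```

Dividing by the number of documents actually inspected, rather than always by 10, is a design choice of this sketch for queries that return very few results.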
The Cyc ontology is a repository developed to capture and represent common sense. ResearchCyc (http://research.cyc.com) is a huge semantic repository. It should be possible to use techniques from Cyc [13], [14] and [15] to extend ResearchCyc with linguistic information from the WordNet lexicon and factual information from the World Wide Web.
The objective of this research is to demonstrate that the use of semantic and linguistic knowledge together improves the query refinement process. To do so, we study the problems associated with the web-query process and show how ResearchCyc, in combination with WordNet, helps improve query results for web searches.
This research makes several contributions. First, it demonstrates that semantic and linguistic knowledge together improve query expansion. Second, the research identifies and formalizes web-query problems and presents a query classification scheme that explains why, in some cases, the query expansion may not be done successfully, even if the repository used to support such a task is complete. Such information is used to identify the structure and knowledge that an ontology should have to increase the chances of improving different kinds of queries.
The purpose of a web query is to find the information that best matches what the user needs. In this research, semantics is defined as the meaning, or essential message, of terms. To carry out useful research for dealing with semantics, symbols must be manipulated in ways that are meaningful and useful [16].
When a web query is processed, the expected result is denoted ER. This information generally belongs to several domains, the intended domains DI. Therefore, the expected result is contained in the knowledge defined for the intersection of all the intended domains, ER ⊆ K(DI1) ∩ K(DI2) ∩ … ∩ K(DIn), where K(D) is a function that returns the knowledge defined in the domain D, as illustrated in Fig. 1.
Fig. 1. Constraining the search domains for web queries.
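The containment of the expected result ER in the intersection of the knowledge of the intended domains can be illustrated with ordinary sets. This is only a sketch under stated assumptions: the domain names `DI1` to `DI3`, the opaque knowledge items `k1` to `k6`, and the helper names `intended_knowledge` and `is_contained` are invented for the example and do not come from the article.

```python
# Hypothetical toy data: each intended domain maps to the knowledge items
# (opaque identifiers here) that are defined in that domain.
KNOWLEDGE = {
    "DI1": {"k1", "k2", "k3", "k5"},
    "DI2": {"k2", "k3", "k4", "k5"},
    "DI3": {"k3", "k5", "k6"},
}


def K(domain):
    """Return the knowledge defined in a domain (the K(D) of the text)."""
    return KNOWLEDGE[domain]


def intended_knowledge(domains):
    """Intersection of K(D) over all given domains; empty if none are given."""
    sets = [K(d) for d in domains]
    return set.intersection(*sets) if sets else set()


def is_contained(expected_result, domains):
    """Check that the expected result lies inside the intersection."""
    return set(expected_result) <= intended_knowledge(domains)


domains = ["DI1", "DI2", "DI3"]
print(sorted(intended_knowledge(domains)))  # ['k3', 'k5']
print(is_contained({"k3"}, domains))        # True
print(is_contained({"k1"}, domains))        # False
```

Adding a further intended domain can only shrink the intersection, which is the sense in which the intended domains constrain the search.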
To perform a search, the user creates an initial query (QI) by selecting some terms w1, …, wk (called query terms, Qw) to describe what he or she is searching for. The problem arises from the ambiguity of human language. The user considers a query, Qw, within a given context (i.e., the context of the intended domains). Since words have several senses in several domains, query search techniques cannot determine which sense of a given query term the user is interested in. Given this ambiguity, the result of the query tends to cover a number of domains DO1, …, DOm greater than the number of intended domains (m > n). The resultant domains are called obtained domains, with each depending on a subset of the query terms, DO(W), where W ⊆ {w1, …, wk}.
Suppose a user lives in Georgia, USA, and wants to buy Nike-brand sports shoes. ER is “Places in Georgia (USA) where I can find a pair of Nike shoes”. ER is composed of domain information that deals with sport stuff (sport shoes), commercial information (the Nike brand sells sport stuff), and geographical information (which commercial organizations in Georgia sell Nike products). These three domains are the intended domains. Suppose the user defines a query that contains two words, Qw = “Nike Georgia”. Then, some of the DO will be DO1 (Nike) = Commercial Brand, DO2 (Nike) = Greek Mythology (Nike is the goddess of Victory in Greek Mythology), DO3 (Georgia) = Central-European Country, DO4 (Georgia) = State in the United States of America, and DO5 (Georgia) = Football Team Georgia Bulldogs. The query results will be the web pages that deal with any of these DO.
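To make the effect of this ambiguity concrete, the sketch below enumerates every way of assigning one obtained domain to each term of “Nike Georgia”. It is a simplification: it treats each obtained domain listed above as a sense of a single query term, and the `SENSES` table and the `interpretations` function are names assumed for the example rather than part of the article's method.

```python
from itertools import product

# Obtained domains from the example, grouped by the query term they depend on.
SENSES = {
    "Nike": ["Commercial Brand", "Greek Mythology"],
    "Georgia": [
        "Central-European Country",
        "State in the United States of America",
        "Football Team Georgia Bulldogs",
    ],
}


def interpretations(query_terms, senses=SENSES):
    """Return every combination that picks one sense for each query term."""
    options = [senses[term] for term in query_terms]
    return [dict(zip(query_terms, combo)) for combo in product(*options)]


readings = interpretations(["Nike", "Georgia"])
print(len(readings))  # 6

# The single reading the user in the example has in mind.
intended = {
    "Nike": "Commercial Brand",
    "Georgia": "State in the United States of America",
}
print(readings.count(intended))  # 1
```

With only two terms there are already six readings and just one of them matches the user's intent; each extra ambiguous term multiplies the count by its number of senses.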
The ambiguity of the language used in the query, the possible partial knowledge of the user, and the difficulty in determining what the user really wants, lead to the following problems that affect the processing of web queries.
– Identification of a good initial query: There are no systematic methods or guidelines that help the user identify the best terms for a query. A good selection of QI is important. Terms that are too general may produce too many irrelevant DO and, therefore, too many irrelevant results. Terms that are too specific may miss some of the results that match ER, because the relevant documents use plainer language.
– Resolving language ambiguity: Documents that deal with the same domain can use different terms to describe the same concepts. For a given concept (sport stuff), some documents may use that exact term (sport stuff), others may use synonyms (sport material), and others may use terms that refer to the same concept more generally (playing sport artifact) or more specifically (trainer).
– Identifying the relevant results: It may be difficult to detect whether a given result of the query is valid. A result is valid if it belongs to the expected result. The problem is that the expected result is in the mind of the user. A result is also valid if it belongs to the intersection of the intended domains. Unfortunately, we do not know what those domains are, and, due to word ambiguity problems, we cannot conclude that the obtained domains are the expected ones. Hence, it is not possible to identify which results are relevant for the user and which are irrelevant.
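For the language-ambiguity problem in the list above, one simple way to cover alternative wordings is to expand a query term with its synonyms, more general terms, and more specific terms. The sketch below uses a tiny hand-written lexicon built from the “sport stuff” example; the `LEXICON` structure, the function names, and the quoted `OR` query syntax are assumptions for illustration, not the interface of WordNet, ResearchCyc, or any particular search engine.

```python
# Hypothetical lexicon entry built from the example in the text.
LEXICON = {
    "sport stuff": {
        "synonyms": ["sport material"],
        "hypernyms": ["playing sport artifact"],  # more general terms
        "hyponyms": ["trainer"],                  # more specific terms
    },
}


def expand_term(term, lexicon=LEXICON,
                relations=("synonyms", "hypernyms", "hyponyms")):
    """Return the term followed by its related terms, without duplicates."""
    entry = lexicon.get(term, {})
    variants = [term]
    for relation in relations:
        for variant in entry.get(relation, []):
            if variant not in variants:
                variants.append(variant)
    return variants


def to_boolean_query(term, **kwargs):
    """Join all variants of a term into a quoted OR expression."""
    return " OR ".join(f'"{v}"' for v in expand_term(term, **kwargs))


print(to_boolean_query("sport stuff"))
# "sport stuff" OR "sport material" OR "playing sport artifact" OR "trainer"
```

Passing `relations=("synonyms",)` restricts the expansion to synonyms only, which limits the risk of drifting toward unintended domains when more general terms are added.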