Semantic Search: The Myth and Reality

 
Written by Alex Iskold / May 29, 2008 2:15 PM / 15 Comments

For a few years now people have been talking about semantic search. Any technology that stands a chance of dethroning Google is of great interest to all of us, particularly one that takes advantage of long-awaited and much-hyped semantic technologies. But no matter how much progress has been made, most of us are still underwhelmed by the results. In head-to-head comparisons with Google, the results have not come out much different. What are we doing wrong?

For example, when asked "What is the capital of France?" both approaches come back with the correct answer - Paris. Also, a lot of queries that we are used to typing into Google in abbreviated form come back with similar results if we type them using natural language. Clearly something is off. We all know that semantic technologies are powerful, but how and why? In this post we will show that the problem is that we are asking the wrong questions.

The mistake is that semantic search engines present us with a Google-like search box and allow us to enter free-form queries. So we type the things that we are used to asking - primitive queries. It never occurs to us to type in "What actor starred in both Pulp Fiction and Saturday Night Fever?" or "What two US Senators received donations from a foreign entity?" We type simple questions, but this is not where the power of semantic search lies. Let's look at the spectrum of semantic technologies, from Google to SearchMonkey to Powerset and Freebase, to understand what is going on.

What Problem Are We Trying to Solve?

The first confusion in the space comes from the fact that semantic search is being positioned as the answer to all possible problems - from modern search, currently dominated by Google, to problems that are computationally impossible. The situation is made more difficult by the fact that right now there is only a thin range of problems where semantic search can clearly do better. This range is complex queries involving inferencing and reasoning over a complex data set.

As shown in the diagram above, basic queries are easily handled by Google. Sadly, natural language processing gives little advantage when it comes to this category of problems. Google correctly answers the question about Leonardo Da Vinci's birthday, leaving no opportunity to improve the search by understanding the nouns and the verbs that the user typed in.

Before looking at the problems that are perfect for semantic search, let's look at the hardest problems. These are computationally challenging problems that really have nothing to do with understanding semantics. The misconception has been perpetuated since the early days of the Semantic Web that somehow, because we will annotate the web, we will be able to solve these super complex problems. This is simply not true. There are fundamental limits to what we can compute, and a class of problems that have an exponential number of possible solutions is not going to be magically solved because we represent data as RDF.

The good news is that there is a set of problems that are great for semantic search. These are the problems we have been solving so wonderfully with relational databases. Way too often we forget that semantic technologies are here to help us represent relational data spread over the entire web - so it should be no surprise to us that it is relational queries that semantic search engines would excel at.
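To make the idea of a relational query concrete, here is a minimal sketch of the Pulp Fiction / Saturday Night Fever question from earlier, written as a self-join over a toy table. The table name `roles`, the helper `actors_in_both`, and the handful of rows are assumptions made up for illustration; no semantic search engine is claimed to store or query its data this way.

```python
import sqlite3

# Toy data: (actor, film) pairs. Illustrative only, not pulled from any real service.
ROLES = [
    ("John Travolta", "Pulp Fiction"),
    ("John Travolta", "Saturday Night Fever"),
    ("Uma Thurman", "Pulp Fiction"),
    ("Samuel L. Jackson", "Pulp Fiction"),
    ("Karen Lynn Gorney", "Saturday Night Fever"),
]


def actors_in_both(film_a, film_b):
    """Return the actors who appear in both films, using a self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE roles (actor TEXT, film TEXT)")
    conn.executemany("INSERT INTO roles VALUES (?, ?)", ROLES)
    rows = conn.execute(
        """
        SELECT a.actor
        FROM roles AS a
        JOIN roles AS b ON a.actor = b.actor
        WHERE a.film = ? AND b.film = ?
        ORDER BY a.actor
        """,
        (film_a, film_b),
    ).fetchall()
    conn.close()
    return [actor for (actor,) in rows]


print(actors_in_both("Pulp Fiction", "Saturday Night Fever"))
```

A keyword engine has to hope that some page happens to mention both films and the actor together; the join states the relationship directly, which is the kind of question the article argues semantic search should be judged on.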

The Spectrum of Semantic Search Players

But semantic search is not just about the questions that we are asking. Because the web is just a bunch of unstructured HTML pages, semantic search is also about the underlying data. At its most structured extreme we find Freebase - the semantic database of everything. Freebase is accessible via free text search, but more importantly via MQL (Metaweb Query Language). MQL is essentially JSON with wildcards. Using it you can construct any query against Freebase and the result will be the same query with answers filled in.
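The "same query with answers filled in" idea can be sketched in a few lines. This is not MQL or the Freebase API: the record fields, the `fill` helper, and the convention that `None` or an empty list marks a wildcard are all assumptions chosen for this toy, loosely modeled on the description above.

```python
import copy

# Tiny in-memory "database"; the field names are invented for this sketch.
FILMS = [
    {"type": "film", "name": "Pulp Fiction", "year": 1994,
     "starring": ["John Travolta", "Uma Thurman", "Samuel L. Jackson"]},
    {"type": "film", "name": "Saturday Night Fever", "year": 1977,
     "starring": ["John Travolta", "Karen Lynn Gorney"]},
]


def is_wildcard(value):
    """None or an empty list means: fill this in for me."""
    return value is None or value == []


def fill(query, records):
    """Return the first matching record, shaped exactly like the query.

    Non-wildcard values are constraints that must match exactly;
    wildcard values are replaced by the record's data.
    """
    for record in records:
        if all(record.get(key) == value
               for key, value in query.items()
               if not is_wildcard(value)):
            answer = dict(query)
            for key, value in query.items():
                if is_wildcard(value):
                    answer[key] = copy.deepcopy(record.get(key))
            return answer
    return None  # nothing satisfied the constraints


query = {"type": "film", "name": "Pulp Fiction", "year": None, "starring": []}
print(fill(query, FILMS))
```

The answer has the same keys as the question, which is what makes this style of query easy to build programmatically and hard to misread.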

Powerset, in a way, is just a relational database. It operates against certain, structured information. On the other end of the spectrum is Google, which is all about statistical frequencies and very little semantics. The recently launched SearchMonkey from Yahoo! is an interesting twist. It does not add anything to the result set, but instead uses semantic annotations to present a richer, more interactive and useful user interface.

Companies like Hakia and Powerset are probably working the hardest. These companies are trying to simultaneously build Freebase-like structures on the fly and then do natural language queries on top of them. The difference is that Hakia is using (likely similar) technology to query over the entire web, while Powerset has (probably shrewdly) chosen to restrict the search to Wikipedia.

Are Hakia, Powerset and Freebase All That Different?

This analysis brings up a question - which of these technologies are different and which are essentially the same? Let's get the easy one out of the way first. Yahoo!'s SearchMonkey is no different from Google or any other search, as far as the core search technology is concerned. The difference is simply in the presentation layer. SearchMonkey is smart about creating a better user experience by letting publishers present the search results to the users in the best possible way.

But when it comes to Hakia, Powerset and Freebase the situation is much more complicated. On the surface all these products are different - Hakia lets you search the whole web, Powerset is restricted to Wikipedia (and Freebase!) and Freebase itself has two search interfaces - the search box and query language. Here is the problem - the natural language interface has nothing to do with the underlying data representation.

The fact is that all of these semantic search technologies allow people to type in arbitrarily complex questions and then interpret these queries and execute them against their databases. Fundamentally, Hakia, Powerset, and Freebase are databases. Fundamentally, all of them have some kind of Natural Language Processing that translates the question into a canonical query over the database.
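As a rough illustration of "translating the question into a canonical query", the sketch below handles exactly one question shape with a hand-written pattern. The pattern, the `to_canonical` helper, and the output keys `find` and `starred_in_all` are hypothetical; real systems such as the ones discussed here would rely on far more sophisticated language processing, and nothing below reflects their actual internals.

```python
import re

# One hand-written question shape; anything else is reported as unsupported.
BOTH_FILMS = re.compile(
    r"^(?:what|which) actor starred in both (?P<a>.+?) and (?P<b>.+?)\??$",
    re.IGNORECASE,
)


def to_canonical(question):
    """Translate one question shape into a structured query dict, else None."""
    match = BOTH_FILMS.match(question.strip())
    if match is None:
        return None
    return {
        "find": "actor",
        "starred_in_all": [match.group("a"), match.group("b")],
    }


print(to_canonical("What actor starred in both Pulp Fiction and Saturday Night Fever?"))
print(to_canonical("What is the capital of France?"))
```

The point of the sketch is the separation of concerns: the front end turns free text into an unambiguous structure, and the database behind it only ever sees that structure.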

To gain insight into all of this, think about Freebase and its query language MQL. Unlike natural language, which allows all sorts of constructs, MQL is unambiguous. This JSON-like language allows users to construct precise statements against Freebase. The fact that Powerset allows natural language queries does not mean that inside Powerset there is no database. On the contrary, there is surely a database similar to the one beneath the Freebase search box. What is really different about Freebase and Powerset is the data gathering approach and the user experience.

Back to the Future: It's All About UI

Probably the most striking revelation about the semantic search space is the user interface. First, to go off on a tangent, Powerset got it right by realizing that semantics needs to be surfaced in the UI. After a user searches Powerset, a contextual gadget, aware of the semantics of the results, helps the user complete the search experience.

Yet the biggest mistake that I think Powerset is making is also in the UI. The search box that everyone is familiar with via traditional web search engines needs to go. Having a simplistic search interface hurts Powerset and Hakia, and to a lesser extent Freebase, which is not positioning itself as generic search.

Think about the recent launch of Powerset. The company released a vastly better way to interact with one of the most important sources of information on the web - Wikipedia. But what did the critics say? "Let's see if this is a Google killer." And the answer to that is "no."

But what if Powerset restricted what can be searched? What if instead of a search box there was another interface, or what if they told users not to look up things that they can find easily on Google? Why is it that new companies are expected to improve on the algorithm that has ruled the web for over a decade? Instead, the expectation should really be to solve the problems that cannot be solved by Google today.

Conclusion

Semantic search is an upcoming technology that has set expectations way too high. We have all been misled into thinking that these technologies are here to dethrone Google by delivering better search results. That is not true. What is true, however, is that semantic search is going to be big and it is going to help us answer questions that we simply cannot answer today - complex, inferencing queries asked over the entire web as if it were a database.

In order for these semantic search technologies to make a dent in the market, they need to clean up their messaging and, most importantly, their user interface. Presenting a search box is both misleading and detrimental, as people associate it with the simplistic questions that Google solves without any problems. To really showcase semantic search, these companies need to come up with innovative UIs that will help users understand the power that is being put at their fingertips.

As always, please tell us what you think. What should semantic search companies do to gain their place in the marketplace?

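For several years now, people have been talking about semantic search. Any technology that might rival Google draws a great deal of attention, especially the long-awaited semantic technologies. Yet whatever progress has been made, we remain disappointed by the results: in side-by-side comparisons with Google, the two turn out to differ very little.

For example, when asked "What is the capital of France?" both kinds of search return the correct answer, Paris. Likewise, queries typed into Google return much the same results whether we phrase them in natural language or in abbreviated search terms. We all know that semantic search technology is powerful, but where does that power lie? In this article we will see that the problem is the way we pose our questions.

Natural language search engines give us a search box just like Google's, and when we type a question into it we fall back, without noticing, on the most primitive kind of query, such as "What is the capital of France?" We rarely ask "Which actor starred in both Pulp Fiction and Saturday Night Fever?" or "Which two US Senators received donations from a foreign entity?" The questions we type are too simple to show what semantic search can do. Below we compare Google, SearchMonkey, Powerset, and Freebase in terms of their semantic technology.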

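The Problem We Are Trying to Solve

The first confusion comes from the fact that semantic search has been positioned as the answer to every problem, from modern search as represented by Google to problems that computers simply cannot solve. Worse, at present semantic search does better only within a narrow range: queries that involve reasoning over complex data.

As the diagram above shows, Google handles basic queries with ease. Unfortunately, natural language offers almost no advantage here. Google answers the question about Da Vinci's birthday correctly, which leaves no room to improve the search by understanding the nouns and verbs the user typed.

Before turning to the problems that semantic search solves perfectly, let us look at the hardest ones. These are computational challenges that have nothing to do with understanding semantics. A long-standing misconception about the Semantic Web holds that, because we can annotate the web, we will be able to solve these extremely complex problems. That is not true. Computation has fundamental limits, and problems with an exponential number of possible solutions will not become solvable just because we represent the data as RDF.

The good news is that some problems suit semantic search very well: the ones we have already been solving so well with relational databases. We often forget that semantic technologies exist to help us represent relational data across the entire web, so it is no surprise that relational queries are where semantic search excels.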

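Today's Semantic Search Players

But semantic search is not only about the questions we ask. Because the web is in fact a pile of unstructured HTML pages, semantic search is also about the data behind those pages. The most extreme example is Freebase. Freebase can be accessed through text search, but more importantly through MQL (Metaweb Query Language). With MQL you can query Freebase for anything.

Powerset is, in a sense, just a relational database that works on specific, structured information. Google, by contrast, is entirely a matter of statistical frequency, with almost no semantics involved. Yahoo!'s recently released SearchMonkey is an interesting combination of the two: it adds nothing to the result set, but uses semantic annotations to present a richer, more interactive, and more useful user interface.

Hakia and Powerset are the companies working hardest on these technologies. They are trying to build a Freebase-like structure and then run natural language queries against it. The difference is that Hakia targets the entire web, while Powerset targets only Wikipedia.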

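How Different Are Hakia, Powerset, and Freebase?

This raises a question: which of the technologies above are different, and which are essentially the same? Let us start with the easy one. In terms of core search technology, Yahoo!'s SearchMonkey is no different from Google or any other search engine; the difference lies in the presentation layer. SearchMonkey creates a better user experience by presenting search results to users in the best possible way.

For Hakia, Powerset, and Freebase, however, the situation is far more complicated. On the surface these technologies are all different: Hakia lets you search the whole web, Powerset is limited to Wikipedia (and Freebase), and Freebase itself has two interfaces, a search box and a query language. This is where the problem lies: the natural language interface has nothing to do with the underlying data representation.

In fact, all of these semantic search technologies let users type in complex questions, then analyze those questions and run them against a database. At bottom, Hakia, Powerset, and Freebase are databases, and each has some form of natural language processing that translates the user's question into a database query.

To see inside these technologies, consider Freebase and its query language, MQL. Unlike natural language, which allows all sorts of constructs, MQL is unambiguous; this JSON-like language lets users construct precise query statements. That Powerset accepts natural language queries does not mean there is no database inside Powerset. On the contrary, it surely has something similar to the database behind the Freebase search box. What really differs is how they gather data and the user experience they offer.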

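The Future of Search: The User Interface Is Everything

Perhaps the biggest revolution in semantic search is the user interface. First, Powerset rightly recognized that semantics needs to surface in the user interface. When a user searches Powerset, a context-aware tool that understands the semantics of the results offers useful information to help the user complete the search.

But I think Powerset's biggest mistake is also in the user interface. The search box, identical to that of traditional search, should go. Offering a simplistic search interface hurts Powerset, Hakia, and Freebase.

Consider Powerset: it introduced an altogether better way to interact with Wikipedia, one of the best resources on the web. But what did the critics say? "Is Powerset a Google killer?" And the answer was no.

But what if Powerset narrowed its search scope? What if it replaced the search box with a different interface, or told users to look on Powerset for the things that cannot easily be found on Google? Why should these new companies improve on a technology that has been around for ten years, rather than offer solutions to the problems Google cannot solve?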

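Conclusion

Semantic search is a technology that has raised people's expectations too high. We have all been led to believe that these technologies are replacements for Google that will deliver better search results. They are not. They exist to solve the problems that Google and other traditional search engines cannot solve today: complex queries that involve inference and treat the entire web as a database.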

Original source: http://www.readwriteweb.com/archives/semantic_search_the_myth_and_reality.php
Chinese translation source: COMSHARP CMS official website

