Doug Cutting Interview -- On Search Engine Development

Doug Cutting Interview

 

Doug Cutting is the primary developer of the Lucene and Nutch open source search projects. He has worked in the search technology field for nearly two decades, including five years at Xerox PARC, three years at Apple, and four years at Excite.

 

What do you do for a living, and how are you involved in search engine development?

I work from home on Lucene & Nutch, two open source search projects. I get paid by various contracts related to these projects. I have an ongoing relationship with Yahoo! Labs that funds me to work part-time on Nutch. I take other short-term contract work related to both projects.

Could you briefly tell us about Nutch, and where you are trying to take it?

First, let me say what Lucene is, to provide context. Lucene is a software library for full-text search. It's not an application, but rather technology that can be incorporated into applications. It's an Apache project and is widely used. A small subset of the folks using Lucene are listed at wiki.apache.org/jakarta-lucene/PoweredBy.
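
Lucene's API has changed considerably since this interview, so the following is only a rough, hypothetical sketch of the embed-it-in-your-application pattern against a recent Lucene release (the field name and document text are invented for the example): index a single document in memory, then run a query against it.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();   // in-memory index, fine for a demo
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index one document with a single full-text field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("body", "Nutch builds on Lucene to implement web search", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Search the index and print the stored field of each hit.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("body", analyzer).parse("lucene");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("body"));
                }
            }
        }
    }

Everything here runs inside the host application's own process, which is the sense in which Lucene is a library rather than an application.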

Nutch builds on Lucene to implement web search. Nutch is an application: you can download it and run it. It adds a crawler and other web-specific stuff to Lucene. Nutch aims to scale from simple intranet searching to search of the entire web, like Google and Yahoo!. To rival these guys you need a lot of tricks. We've demoed it on over 100M pages, and it's designed to scale to over 1B pages. But it also works well on a single machine, searching just a few servers.

From your perspective, what are the core principles of search engine architecture? What are the main things to consider and the big modules search engine software can be broken up into?

Let's see, the major bits are:

  • fetching – downloading lists of pages that have been referenced.

  • database – keeping track of what pages you've fetched, when you fetched them, what they've linked to, etc.

  • link analysis – analyzing the database to assign a priori scores to pages (e.g., PageRank & WebRank) and to prioritize fetching (a minimal PageRank sketch appears below). The value of this is somewhat overrated. Indexing anchor text is probably more important (that's what makes, e.g., Google Bombing so effective).

  • indexing – combines content from the fetcher, incoming links from the database, and link analysis scores into a data structure that's quickly searchable.

  • searching – ranks pages against a query using an index.

To scale to billions of pages, all of these must be distributable, i.e., each must be able to run in parallel on multiple machines.
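
The link-analysis item above names PageRank as one example of an a priori score. The following is a minimal, self-contained power-iteration sketch over an invented four-page link graph; it illustrates the general idea only and is not Nutch's link-analysis module.

    import java.util.Arrays;

    // Plain power-iteration PageRank over a tiny, invented link graph; illustrative only.
    public class PageRankSketch {
        public static void main(String[] args) {
            // outLinks[i] lists the pages that page i links to (every page here has at least one out-link).
            int[][] outLinks = { {1, 2}, {2}, {0}, {0, 2} };
            int n = outLinks.length;
            double damping = 0.85;

            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);                  // start from a uniform distribution

            for (int iter = 0; iter < 50; iter++) {      // a fixed iteration count keeps the sketch simple
                double[] next = new double[n];
                Arrays.fill(next, (1.0 - damping) / n);  // "teleportation" term
                for (int page = 0; page < n; page++) {
                    double share = damping * rank[page] / outLinks[page].length;
                    for (int target : outLinks[page]) {
                        next[target] += share;           // distribute rank along out-links
                    }
                }
                rank = next;
            }
            System.out.println(Arrays.toString(rank));   // the a priori scores for the four pages
        }
    }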

You are saying people can download Nutch to run on their machines. Is there a possibility for small-time webmasters who don't have full control over their Apache servers to make use of Nutch?

Unfortunately, most of them probably won't. Nutch requires a Java servlet container, which some ISPs support, but most do not.

Can I combine Lucene and the Google Web API, or Lucene and some other application I wrote?

A couple of folks have contributed Google-like APIs for Nutch, but none has yet made it into the system. We should have something like this soon, however.

What do you think is the biggest hurdle to overcome when implementing a search engine – is it the hardware and storage barrier, or the ranking algorithms? Also, how much space do you need to ensure the search engine makes some sense – say, for an engine restricted to searching a million RSS feeds?

Nutch requires a total of around 10KB per web page. RSS feeds tend to point to small pages, so you'd probably do better than that. Nutch doesn't yet have specific support for RSS.
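
For a rough sense of scale, taking that ~10KB-per-page figure at face value: an index over one million feed pages would come to roughly 1,000,000 × 10KB ≈ 10GB of disk, and likely less given how small feed entries tend to be.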

Is it easy to get funded by Yahoo! Labs? Who can apply, and what do you need to give back in return?

I was invited, I didn't apply. I have no idea what the process is.

Did Google Inc. ever show interest in Nutch?

I've talked with folks there, including Larry Page. They'd like to help, but they can't see a way to do that without also helping their competitors.

In Nutch, do you implement your own PageRank or WebRank system? What considerations go into ranking?

Yes, Nutch has a link analysis module. Use of it is optional. For intranet search we find that it's really not needed.

I guess you've heard this before, but doesn't an open-source search engine open itself up to blackhat search engine optimization?

Potentially.

Let's say it takes spammers six weeks to reverse engineer a closed-source search engine's latest spam-detection algorithm. With an open source engine, this can be done much faster. But in either case, the spammers will eventually figure out how it works; the only difference is how quickly. So the best anti-spam techniques, open or closed source, are those that continue to work even when their mechanism is known.

Also, if you, e.g., remove detected spammers from the index for six months, then there's not much they can do, once detected, to change their sites to elude detection. And if your spam detectors are based on statistical analyses of good and bad example sites, then you can, overnight, notice new patterns and remove the spammers before they have a chance to respond.

So open source can make it a little harder to stop spam, but it doesn't make it impossible. And closed-source search engines have not been able to use secrecy to solve this problem. I think the closed-source advantage here is not as great as folks imagine it to be.
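
To make the "statistical analyses of good and bad example sites" idea a bit more concrete, here is a toy, hypothetical sketch (the feature names and training data are invented, and this is not code from Nutch): a naive-Bayes-style scorer trained on hand-labeled example sites, which could be retrained overnight as new spam patterns get labeled.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy spam scorer trained on labeled example sites; features and data are invented.
    public class SpamScorerSketch {
        private final Map<String, double[]> counts = new HashMap<>(); // feature -> {spam count, good count}
        private double spamSites = 0, goodSites = 0;

        // Record one hand-labeled example site, described by the features it exhibits.
        void train(List<String> features, boolean isSpam) {
            if (isSpam) spamSites++; else goodSites++;
            for (String f : features) {
                counts.computeIfAbsent(f, k -> new double[2])[isSpam ? 0 : 1]++;
            }
        }

        // Naive-Bayes-style log-odds with add-one smoothing; positive means "looks like spam".
        double spamScore(List<String> features) {
            double score = Math.log((spamSites + 1) / (goodSites + 1));
            for (String f : features) {
                double[] c = counts.getOrDefault(f, new double[2]);
                score += Math.log((c[0] + 1) / (spamSites + 2)) - Math.log((c[1] + 1) / (goodSites + 2));
            }
            return score;
        }

        public static void main(String[] args) {
            SpamScorerSketch scorer = new SpamScorerSketch();
            scorer.train(List.of("word-stuffing", "link-farm-member"), true);    // known-bad example
            scorer.train(List.of("hand-reviewed-directory-listing"), false);     // known-good example
            System.out.println(scorer.spamScore(List.of("word-stuffing")) > 0);  // true: flagged
        }
    }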

How does Nutch relate to distributed Web crawler Grub, and what do you think of it?

As far as I can tell, Grub is a project that lets folks donate their hardware and bandwidth to LookSmart's crawling effort. Only the client is open source, not the server, so folks can neither deploy their own version of Grub, nor can they access the data that Grub gathers.

What about distributed crawling more generally? When a search engine gets big, crawl-related expenses are dwarfed by search-related expenses. So a distributed crawler doesn't significantly improve costs, rather it makes more complicated something that is already relatively inexpensive. That's not a good tradeoff.

Widely distributed search is interesting, but I'm not sure it can yet be done while keeping things as fast as they need to be. A faster search engine is a better search engine. When folks can quickly revise queries then they more frequently find what they're looking for before they get impatient. But building a widely distributed search system that can search billions of pages in a fraction of a second is difficult, since network latencies are high. Most of the half-second or so that Google takes to perform a search is network latency within a single datacenter. If you were to spread that same system over a bunch of PCs in people's houses, even connected by DSL and cable modems, the latencies are much higher and searches would probably take several seconds or longer. And hence it wouldn't be as good a search engine.

You are emphasizing the importance of speed in a search engine. I'm often puzzled by how fast Google returns a result. Do you have an idea how they do it, and what is your experience with Nutch?

I believe Google does roughly what Nutch does: they broadcast queries to a number of nodes, each of which returns the top results over a set of pages. With a couple of million pages per node, disk accesses can be avoided for most queries, and each node can process tens to hundreds of queries per second. If you want to search billions of pages, then you have to broadcast each query to thousands of nodes. That's a lot of network traffic.

Some of this is described in www.computer.org/micro/mi2003/m2022.pdf.
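
The broadcast-and-merge pattern described above can be sketched roughly as follows; the SearchNode interface and Hit type are invented for the example and are not Nutch or Google APIs. Each node answers over its own few million pages, and the front end fans the query out in parallel and merges the per-node top-k lists.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical broadcast-and-merge front end; SearchNode and Hit are invented for the example.
    public class DistributedSearchSketch {
        record Hit(String url, double score) {}

        interface SearchNode {                      // one node indexes a few million pages
            List<Hit> topHits(String query, int k);
        }

        static List<Hit> search(List<SearchNode> nodes, String query, int k) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, nodes.size()));
            try {
                // Broadcast the query to every node in parallel.
                List<Future<List<Hit>>> futures = new ArrayList<>();
                for (SearchNode node : nodes) {
                    futures.add(pool.submit(() -> node.topHits(query, k)));
                }
                // Merge the per-node top-k lists and keep the global top-k.
                List<Hit> merged = new ArrayList<>();
                for (Future<List<Hit>> f : futures) {
                    merged.addAll(f.get());
                }
                merged.sort(Comparator.comparingDouble(Hit::score).reversed());
                return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
            } finally {
                pool.shutdown();
            }
        }
    }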

When you mention spam, do you have any spam-fighting algorithms in Nutch? How can one differentiate between spam patterns like link farms and sites which just happen to be very popular?

We haven't yet had time to start working on this, but it's obviously an important area. Before we get to link farms we need to do the simple stuff: look for word stuffing, white-on-white text, etc.
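
As one hedged illustration of the "simple stuff", here is a toy word-stuffing heuristic (the 100-token minimum and 20% threshold are arbitrary, not values from Nutch): flag a page when a single term accounts for an implausibly large share of its text. Catching white-on-white text would additionally require looking at the markup and comparing text color against background color.

    import java.util.HashMap;
    import java.util.Map;

    // Crude word-stuffing heuristic; the 100-token minimum and 20% threshold are arbitrary.
    public class WordStuffingSketch {
        static boolean looksStuffed(String pageText) {
            String[] tokens = pageText.toLowerCase().split("\\W+");
            if (tokens.length < 100) {
                return false;                        // too little text to judge
            }
            Map<String, Integer> counts = new HashMap<>();
            int max = 0;
            for (String token : tokens) {
                if (token.isEmpty()) continue;
                max = Math.max(max, counts.merge(token, 1, Integer::sum));
            }
            // Flag the page if a single term accounts for more than 20% of all tokens.
            return (double) max / tokens.length > 0.20;
        }
    }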

I think the key to search quality in general (of which spam detection is a sub-problem) is to have a trusted supply of hand-evaluated search results. With this, one can train a ranking algorithm to generate better results. (Spammy results are just one kind of bad result.) Commercial search engines get trusted evaluations by hiring folks. It remains to be seen how Nutch will do this. We obviously cannot just accept all donated evaluations, or else spammers will spam the evaluations. So we need a means of establishing the trustworthiness of volunteer evaluators. I think a peer-review system, perhaps something like Slashdot's karma system, could work here.

Where do you see search engines heading in the near and far future, and what do you think are the biggest hurdles to overcome from a developer's perspective?

Sorry, I'm not very imaginative here. My prediction is that web search in the coming decade is going to look more or less like web search of today. It's a safe bet. Web search evolved quickly for the first few years. It started in 1994 with WebCrawler, using standard information retrieval methods. The development of more web-specific methods took a few years, culminating in Google's 1998 launch. Since then, the introduction of new methods has slowed dramatically. The low-hanging fruit has been harvested. Innovation is easy when an area is young, and becomes more difficult as the field matures. Web search grew up in the 1990s, is now a cash cow, and will soon be a commodity.

As far as development challenges, I think operational reliability is a big one. We're working on developing something like GFS, the Google Filesystem. Stuff like this is essential to large-scale web search: you cannot let a failure of any single component cause a major hiccough; you must be able to easily scale by throwing more hardware into the pool, without massive reconfiguration; and you can't require an army of operators – things should largely fix themselves.

Translator's note (Dedian): As the creator of the Lucene and Nutch Apache open source projects (along with related subprojects such as Lucy, Lucene4C, and Hadoop), Doug Cutting has long been followed closely by search engine developers. After four years of working for Yahoo! as a contractor, he formally joined Yahoo! this year as an employee.



 