
Design and Implementation of a High-Performance Distributed Web Crawler

Vladislav Shkapenyuk    Torsten Suel

CIS Department

Polytechnic University

Brooklyn, NY 11201

vshkap@research.att.com, suel@poly.edu

  Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost.
  In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.

1  Introduction  

The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present. Due to this explosion in size, web search engines are becoming increasingly important as the primary means of locating relevant information. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and storing downloaded pages in a large database that is later indexed for efficient execution of user queries. Many researchers have looked at web search technology over the last few years, including crawling strategies, storage, indexing, ranking techniques, and a significant amount of work on the structural analysis of the web and web graph; see [1, 7] for overviews of some recent work and [26, 2] for basic techniques.
   Thus, highly efficient crawling systems are needed in order to download the hundreds of millions of web pages indexed by the major search engines. In fact, search engines compete against each other primarily based on the size and currency of their underlying database, in addition to the quality and response time of their ranking function. Even the largest search engines, such as Google or AltaVista, currently cover only limited parts of the web, and much of their data is several months out of date. (We note, however, that crawling speed is not the only obstacle to increased search engine size, and that the scaling of query throughput and response time to larger collections is also a major issue.)
    A crawler for a large search engine has to address two issues. First, it has to have a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs to have a highly optimized system architecture that can download a large number of pages per second while being robust against crashes, manageable, and considerate of resources and web servers. There has been some recent academic interest in the first issue, including work on strategies for crawling important pages first [12, 21], crawling pages on a particular topic or of a particular type [9, 8, 23, 13], recrawling (refreshing) pages in order to optimize the overall “freshness” of a collection of pages [11, 10], or scheduling of crawling activity over time [25].
   In contrast, there has been less work on the second issue. Clearly, all the major search engines have highly optimized crawling systems, although details of these systems are usually proprietary.  The only system described in detail in the literature appears to be the Mercator system of Heydon and Najork at DEC/Compaq [16], which is used by AltaVista. (Some details are also known about the first version of the Google crawler [5] and the system used by the Internet Archive [6].) While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.
    Most of the recent work on crawling strategies does not address these performance issues at all, but instead attempts to minimize the number of pages that need to be downloaded, or maximize the benefit obtained per downloaded page. (An exception is the work in [8] that considers the system performance of a focused crawler built on top of a general-purpose database system, although the throughput of that system is still significantly below that of a high-performance bulk crawler.) For applications that have only very limited bandwidth, this is acceptable. However, in the case of a larger search engine, we need to combine good crawling strategy and optimized system design.
   In this paper, we describe the design and implementation of such an optimized system on a network of workstations. The choice of crawling strategy is largely orthogonal to our work. We describe the system using the example of a simple breadth-first crawl, although the system can be adapted to other strategies. We are primarily interested in the I/O and network efficiency aspects of such a system, and in scalability issues in terms of crawling speed and number of participating nodes. We are currently using the crawler to acquire large data sets for work on other aspects of web search technology such as indexing, query processing, and link analysis. We note that high-performance crawlers are currently not widely used by academic researchers, and hence few groups have run experiments on a scale similar to that of the major commercial search engines (one exception being the WebBase project [17] and related work at Stanford). There are many interesting questions in this realm of massive data sets that deserve more attention by academic researchers.
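   To make the notion of a simple breadth-first crawl concrete, the following is a minimal single-process sketch in Python. It is not the system described in this paper: the function name breadth_first_crawl, the fetch_links callback, and the toy in-memory link graph TOY_WEB are assumptions introduced purely for illustration, and no network access, politeness control, or crash recovery is modeled.

```python
from collections import deque
from typing import Callable, Iterable, List


def breadth_first_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    max_pages: int,
) -> List[str]:
    """Return URLs in the order a breadth-first crawl would download them.

    fetch_links is a hypothetical stand-in for "download the page and
    extract its hyperlinks"; here it is supplied by the caller.
    """
    frontier = deque()  # FIFO queue of URLs waiting to be downloaded
    seen = set()        # URLs already queued, so each is visited at most once
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()  # oldest URL first gives breadth-first order
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Toy link graph standing in for the web (hypothetical page names).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

crawl_order = breadth_first_crawl(["a"], lambda u: TOY_WEB.get(u, []), max_pages=10)
print(crawl_order)  # ['a', 'b', 'c', 'd']
```

   Replacing the FIFO queue with a priority queue ordered by some importance estimate would turn this sketch into one of the other crawling strategies mentioned above, which is one way to see why the choice of strategy can be treated separately from the system design.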











